[16:21:44] I'll remove the notice from cluebot's talk page
[16:21:56] GEOFBOT: thank you :)
[16:25:40] YuviPanda: also we have instance newsletter-test too
[16:26:59] tonythomas: yeah, I am cleaning up on everything in both instances
[16:27:30] tonythomas: also I found you and your student (I think?) had private keys on bastion
[16:27:40] please never do that. use proxycommand instead. details in Help:Access page on wikitech
[16:28:11] Okey. I had that
[16:28:14] :o
[16:28:57] tonythomas: yeah, and if you have used that private key elsewhere please go and revoke that
[16:29:01] you should treat that as compromised
[16:29:12] private keys should never leave a machine under your control (laptop / desktop)
[16:29:37] tonythomas: verpremotemx.eqiad.wmflabs seems dead?
[16:29:49] Yeah. Will do that fast.
[16:29:57] YuviPanda: is it possible to recover some data from old FS?
[16:30:13] YuviPanda: it is :(
[16:30:30] tonythomas: self hosted puppetmasters are kind of a PITA sadly
[16:30:41] petan: we are running an fsck on that at the moment, we'll know whenever it completes
[16:30:48] ok
[16:30:49] probably going to be a while - it was 40T
[16:31:25] YuviPanda: I will kill that once I get hold of a PC. (on my phone now)
[16:31:31] tonythomas: cool
[16:31:40] tonythomas: your other instances and newsletter instances should be back up now
[16:31:55] I think I still have some very old and slow 20MB HDD drive at home which might take even longer to fsck than this 40TB one :P
[16:31:58] YuviPanda: yup. I can ssh
[16:33:38] !log editor-engagement rebooting all instances slowly
[16:33:40] Logged the message, Master
[16:33:52] legoktm: ^ (you're the closest to someone involved with editor-engagement online now)
[16:35:18] YuviPanda: ok, thanks
[16:43:32] Labs, Multimedia: Disable NFS on multimedia project - https://phabricator.wikimedia.org/T103126#1382919 (yuvipanda) NEW a:yuvipanda
[16:45:08] versatel
[16:47:37] Labs, Collaboration-Team: Investigate and remove NFS from editor-engagement project - https://phabricator.wikimedia.org/T102663#1382941 (yuvipanda) I've done this now, and instances should be back up post outage. Should get rid of /data/project too.
[16:48:43] !log multimedia rebooting all instances in multimedia project disabling NFS
[16:48:47] Logged the message, Master
[16:49:06] YuviPanda: Working 24x7?
[16:49:14] some variant thereof
[16:49:46] To get homedir fully restart, what do I have to do? 1. Is already done 2. Will be done soonish 3. Bribe someone?
[16:49:51] *restored
[16:51:07] YuviPanda: ?
[16:51:12] multichill: on tools?
[16:51:21] multichill: we're running an fsck on the entire volume and can see if it can be rescued.
[16:51:35] so it's option 4. we are checking to see how possible that is
[16:51:57] YuviPanda: I'm quite sure I only wrote a couple of sql queries, would like to have those back. Don't care about the results, that's just rerunning them
[16:52:17] multichill: filesystem errors, 1 file, 100, doesn't matter :) I'll keep you informed
[16:52:22] ok
[16:53:31] Good luck with "puinruimen" ;-)
[16:54:12] multichill: what does that mean?
[16:56:34] Cleaning up the mess
[16:57:48] ah
[16:59:05] !log mediawiki-core-team disabling NFS, rebooting everything
[16:59:07] Logged the message, Master
[16:59:12] YuviPanda: deployment-prep seems mostly good except I can't seem to login to deployment-cache-text02 but it seems to be up (only one I've run across so far)
[16:59:34] thcipriani: probably because it hasn't had puppet run in forever. looking
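
A minimal sketch of the ProxyCommand setup recommended at 16:27 (the approach documented on the Help:Access page on wikitech). The shell username and bastion hostname below are illustrative assumptions, not taken from this log; the point is that the private key stays on your own laptop/desktop and the bastion only relays the connection:

    # On your own machine; the private key never leaves it.
    cat >> ~/.ssh/config <<'EOF'
    Host *.eqiad.wmflabs
        User         yourshellname
        ProxyCommand ssh -W %h:%p yourshellname@bastion.wmflabs.org
    EOF

    # Then connect straight to an instance; the bastion just forwards TCP
    # and never sees the key:
    ssh newsletter-test.eqiad.wmflabs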
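
The fsck mentioned at 16:30 was run by the admins on the storage server itself; purely as an illustration of that kind of long-running check on a large ext4 volume (the device path and the use of screen are assumptions here, not what was actually run):

    # Keep the check alive across dropped SSH sessions, force a full pass,
    # auto-answer repairs, and print progress. The device path is a placeholder.
    screen -S fsck
    e2fsck -f -y -C 0 /dev/labstore/placeholder-volume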
[16:59:40] thanks
[17:00:50] thcipriani: I can get in (root key)
[17:01:16] YuviPanda: hmm, tried with my key and the keyholder/mwdeploy key
[17:01:31] thcipriani: yup, root key is different
[17:01:32] thcipriani: https://dpaste.de/hArH/raw
[17:01:40] thcipriani: need to fix that error for that instance to be available fully again
[17:01:53] thcipriani: I can add your key there if you'd like?
[17:02:01] thcipriani: (just point me to your key)
[17:03:14] YuviPanda: https://gist.github.com/thcipriani/2eb5cfea530466ac7adf that'd be awesome thanks.
[17:04:00] thcipriani: try sshing in as root@
[17:04:33] YuviPanda: worked, thanks :)
[17:05:47] I was just working on the bug for that puppet error (which is why I needed access to this box), thanks again.
[17:05:54] heh
[17:08:39] when i do crontab -e it seems that all the tasks are commented out and I think i didn't comment them out before (at least most of them). did it change during the outage?
[17:09:25] eranroz: I think they commented out all the cron jobs
[17:09:47] So things wouldn't run amok when tools was brought back to its feet
[17:10:11] ok, so is it ok that I reenable it? ;)
[17:10:14] eranroz: yes it is :)
[17:10:20] thanks
[17:12:16] !log search disabling NFS and rescuing search project instances
[17:12:19] Logged the message, Master
[17:13:57] and what about the bigbrother files? I see it was changed too, but with no comments. are there backups for it?
[17:15:12] eranroz: should just modify it I guess. we'll try to bring it back but not sure
[17:15:18] eranroz: but not needed for webservices anymore
[17:16:22] YuviPanda: 10x
[17:26:32] Labs: Investigate if NFS is needed on the language project - https://phabricator.wikimedia.org/T103130#1383042 (yuvipanda) NEW a:yuvipanda
[17:27:46] * andrewbogott anxiously awaits the return of his homedir in ‘testlabs’
[17:28:47] Labs: Investigate if NFS is needed on the language project - https://phabricator.wikimedia.org/T103130#1383060 (yuvipanda) ccing people who had home directories on the project. Anything you would like rescued?
[17:29:13] andrewbogott: ah, a reboot should fix that. I'm doing reboots project by project with 30s intervals
[17:29:27] YuviPanda: I mean the /contents/ of my homedir
[17:29:33] a new empty one is less useful :)
[17:29:43] andrewbogott: ah, you can see them already - root@labstore1002.eqiad.wmnet, /mnt/testlabs
[17:29:48] andrewbogott: to see what went missing
[17:31:23] YuviPanda: is nfs never returning to non-tools projects? I thought that was still in the works...
[17:31:30] andrewbogott: nope, it is returning
[17:31:55] which I anxiously await :)
[17:34:36] andrewbogott: I'm doing testlabs now
[17:34:43] thank you!
[17:34:55] I didn’t mean to complain :p
[17:39:12] andrewbogott: doing testlabs now
[17:39:26] and rebooting things, I see :)
[17:45:02] andrewbogott: all done
[17:47:30] I think I chose the wrong week to start with my project on labs
[17:53:45] Polsaker: yeah :(
[17:59:38] andrewbogott: for these shared scripts - I'd rather we put them in /data/project and disable homedirs so ssh is better, but later :)
[18:00:43] Labs: Investigate if NFS is needed on the language project - https://phabricator.wikimedia.org/T103130#1383251 (yuvipanda) Disabled for now.
[18:00:54] YuviPanda: if you find yourself with downtime whilst waiting for NFS, try creating an instance on horizon. And if you have concerns, file ‘em under https://phabricator.wikimedia.org/T87279
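
For the commented-out cron jobs discussed at 17:08–17:10, re-enabling them just means removing the comment prefix that was added during the outage. A rough sketch — the exact prefix used is an assumption, so inspect the crontab first and adjust the pattern before scripting anything:

    # Look at what the jobs were prefixed with, and keep a backup.
    crontab -l
    crontab -l > ~/crontab.backup

    # If every disabled job carries the same known prefix, strip it and
    # reload; otherwise just uncomment the lines by hand in crontab -e.
    sed 's/^#DISABLED: //' ~/crontab.backup | crontab -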
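
What happened at 17:01–17:04 — adding a user's public key so he could get in as root while puppet and the normal key lookup were broken — would look roughly like this from the admin side. This is a hedged sketch, not the exact commands used; the key filename is a placeholder for the public key from the gist linked above:

    # On the broken instance, logged in with the labs root key:
    cat thcipriani.pub >> /root/.ssh/authorized_keys
    chmod 600 /root/.ssh/authorized_keys

    # The user can then get in directly as root until normal auth is fixed:
    ssh root@deployment-cache-text02.eqiad.wmflabs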
[18:00:57] Labs, Multimedia: Disable NFS on multimedia project - https://phabricator.wikimedia.org/T103126#1383252 (yuvipanda) Open>Resolved Disabled
[18:00:58] Labs, Labs-Sprint-102: Audit projects' use of NFS, and remove it where not necessary - https://phabricator.wikimedia.org/T102240#1383254 (yuvipanda)
[18:03:57] andrewbogott: w00t
[18:04:00] andrewbogott: probably not this week
[18:04:06] ok
[18:04:21] I wasn’t clear on if you were mostly typing or mostly waiting during this NFS rampage
[18:04:24] andrewbogott: can you help with some of the slog?
[18:04:36] yes, what can I do?
[18:04:38] andrewbogott: both! I am disabling NFS on projects where people said it is ok to disable it, and bringing them back up
[18:05:10] andrewbogott: https://wikitech.wikimedia.org/wiki/Recover_instance_from_NFS
[18:06:08] andrewbogott: i'm just going through projects. can you do that for the 'marathon' project?
[18:06:36] yep
[18:06:53] Labs, Tool-Labs: Install Scipy for Python 3 on Tool Labs - https://phabricator.wikimedia.org/T103136#1383280 (Nettrom) NEW
[18:08:46] Labs: Disable NFS on the orgcharts project - https://phabricator.wikimedia.org/T103137#1383291 (yuvipanda) NEW a:yuvipanda
[18:11:02] Labs: Disable NFS on the orgcharts project - https://phabricator.wikimedia.org/T103137#1383304 (Dzahn) It would be AWESOME if this tool could be updated and used again. I'm afraid though the problem is how to get changes synced with HR.
[18:11:31] YuviPanda: disable NFS for marathon, yes?
[18:11:33] andrewbogott: yes
[18:22:05] Labs, Release-Engineering, operations, wikitech.wikimedia.org: silver / scap - Could not get latest version: 403 Forbidden - https://phabricator.wikimedia.org/T103138#1383336 (Dzahn)
[18:25:35] Labs: Kill NFS in scrumbugz project - https://phabricator.wikimedia.org/T102704#1383346 (yuvipanda) Killing NFS now and rebooting it
[18:36:12] Labs: Investigate NFS alternatives to the wikistream project - https://phabricator.wikimedia.org/T103148#1383442 (yuvipanda) NEW a:yuvipanda
[18:38:19] Labs, Labs-Other-Projects: investigate/clean up 'servermon' project - https://phabricator.wikimedia.org/T103149#1383459 (Andrew) NEW a:Andrew
[18:42:28] Strange, I just added ejegg as a project-admin for "integration", but he cannot ssh into the nodes and I can. How does PAM stuff propagate, or is there a step we're overlooking?
[18:42:55] On the Nova Instance list, all I see is 'bastion'. Should I see integration there too?
[18:43:24] The filtering is really weird. Try the "manage projects" page
[18:43:46] type "integration" (autocomplete) into the filters and click the button
[18:43:49] try logging out and back in
[18:43:57] Logging out of what?
[18:44:04] wikitech
[18:45:44] oh hey, on the integration project page I'm not listed under admins or members
[18:46:00] awight: maybe you need to log out and back in and see if the add 'took'?
[18:46:02] oh, derp
[18:46:33] nvm, awight added Eeggleston, and the rest of my labs access is under ejegg
[18:46:59] haha
[18:47:02] ok
[18:47:15] darn staff accounts...
[18:47:23] ejegg: ok, I've added ejegg
[18:47:23] thanks!
[18:47:23] u should get that fixed, though
[18:47:29] yeah, i guess so
[18:47:44] ejegg: eugh, staff accounts on wikitech?
[18:47:45] boo
[18:48:10] heh, i agree
[18:48:42] woohoo, that did it. thanks awight + YuviPanda !
[18:49:17] YuviPanda: What timezone u in these days? London??
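
For the "how does PAM stuff propagate" question at 18:42, a quick sanity check is to ask an instance whether it can see the new member yet, since access is driven by LDAP group membership rather than anything stored on the node. A sketch, assuming the usual project-<name> group naming and that nscd is caching lookups on the instance:

    # From a bastion or any instance in the project:
    id ejegg
    getent group project-integration

    # If wikitech shows the member but the instance doesn't, stale name
    # service caches are a common culprit:
    sudo nscd -i passwd
    sudo nscd -i group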
[18:50:22] * awight glances nervously at the clock. Aren't hippie friends awake and doing something fun at this hour? :p
[18:50:56] awight: :D
[18:51:03] awight: I'm actually at a hippie festival of sorts
[18:51:11] surprise... ;)
[18:51:11] awight: https://www.barncamp.org.uk
[18:51:20] awight: but the day has been spent recovering labs from NFS
[18:51:33] awww that seems to be a recurrent thing
[18:51:58] Augh Barncamp looks totally enviable
[18:52:05] U teaching?
[18:52:05] YuviPanda, is there any update on the state of the restore? Is it possible for me to re-enable NFS for a single instance on a project to check if a specific file has been restored yet?
[18:52:26] hehe, or too horizontal for there to be teachers
[18:52:44] awight: heh :D Am fixing labs :(
[18:52:46] stwalkerster: hey!
[18:52:52] stwalkerster: which instance / project?
[18:54:07] um, account-creation-assistance, I'm just looking for a couple of files from /data/project/config - accounts-mwoauth is a development instance and doesn't need to be up
[18:54:39] I was thinking if nfs /data/project was reenabled for that instance only, I'd be able to reclaim those files once they'd been restored?
[18:54:43] stwalkerster: alright, let me bring them up
[18:54:47] (or is my thinking completely flawed?)
[18:55:12] stwalkerster: the instances are rebooting now, should be up in a few minutes
[18:55:45] Labs: Investigate account-creation-assistance project's use of NFS and look for replacements - https://phabricator.wikimedia.org/T103156#1383548 (yuvipanda) NEW a:yuvipanda
[18:55:48] stwalkerster: https://phabricator.wikimedia.org/T103156 as well :)
[18:56:15] ty :)
[19:10:29] ragesoss: hey?
[19:10:34] ragesoss: the globaleducation project - do you guys need NFS?
[19:59:31] bd808: any idea why i can't log into deployment-cache-text02 in deployment-prep project?
[19:59:34] i can log into other hosts
[19:59:45] in that project
[20:22:45] ottomata: I can't either. My guess would be puppet broken there and bad config because of it
[20:23:45] bwooo, ok
[20:25:24] Hprmedina: >_>
[20:26:42] ottomata: bd808 yeah, puppet is broken there https://phabricator.wikimedia.org/T102570
[20:27:55] aye ok, but why would that keep me from logging in? i guess because of broken puppet + nfs problems?
[20:28:01] ssh key isn't present or something?
[20:28:09] (dunno how labs ssh auth works)
[20:30:08] ottomata: guessing it has something to do with /public/keys not being mounted (it's where the ssh config checks first for keys)
[20:31:00] ottomata: try it now
[20:31:33] IN!
[20:31:33] thank you
[20:31:36] yup
[20:40:47] thanks thcipriani, there was an eventlogging problem there, varnishncsa was not running. started it, looking ok now
[20:40:59] probably it died for some reason (NFS stuff too?) and puppet never ran so it never started it
[20:41:26] ottomata: cool, thanks! Yeah, that makes sense.
[20:58:41] Labs: Kill NFS in scrumbugz project - https://phabricator.wikimedia.org/T102704#1383941 (Christopher) You killed all of the instances on the project. There was a lot of stuff there. I am not happy about this at all.
[21:13:27] Labs: Kill NFS in scrumbugz project - https://phabricator.wikimedia.org/T102704#1383979 (Christopher) oh, nvm. they are still there, (wikitech page cache delusion)
[21:32:01] Labs, Labs-Infrastructure: new security rule not applied - https://phabricator.wikimedia.org/T42526#1384043 (hashar) Open>Resolved a:hashar I have added a new security rule to the default group in wikitech. It has been applied on existing instances. Probably fixed by the recent OpenStack upgrade.
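
The 20:30 explanation — that sshd checks /public/keys first — corresponds to an AuthorizedKeysFile list along these lines in the instance's sshd_config. This is a hedged reconstruction of the idea, not a copy of the real puppet-managed config:

    # /etc/ssh/sshd_config (sketch): look in the NFS-exported key directory
    # first, then fall back to the user's own homedir.
    #   AuthorizedKeysFile /public/keys/%u/.ssh/authorized_keys .ssh/authorized_keys

    # With /public/keys unmounted, that first lookup fails, which is
    # consistent with logins breaking until the mount came back:
    mount | grep /public/keys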
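
The varnishncsa fix at 20:40 is the usual "service died and puppet wasn't around to restart it" pattern; on a trusty-era instance the check-and-start would be roughly as follows (the job name is assumed to match the package name):

    service varnishncsa status
    sudo service varnishncsa start
    pgrep -l varnishncsa   # confirm it actually stayed up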
[21:32:04] one less bug!
[21:34:07] YuviPanda: Is it possible something removed .bigbrotherrc from tools.krinklebot's home directory?
[21:34:14] It used to contain a jsub command
[21:34:22] or service.manifest
[21:34:28] I don't know where it was, it was somewhere
[21:37:35] Yeah, in bigbrother I suppose
[21:38:22] Or... no it was in crontab
[21:38:53] Right.
[21:38:56] https://lists.wikimedia.org/pipermail/labs-announce/2015-June/000041.html
[21:42:24] (CR) Sitic: [C: 2 V: 2] "IE performance issues still exist. But after days of debugging it seems it's a problem with the rendering engine of IE, not a JS issue." [labs/tools/crosswatch] - https://gerrit.wikimedia.org/r/218006 (https://phabricator.wikimedia.org/T100341) (owner: Sitic)
[21:42:41] (PS1) Sitic: JS performance optimization [labs/tools/crosswatch] - https://gerrit.wikimedia.org/r/219477 (https://phabricator.wikimedia.org/T102569)
[21:42:48] Labs, Beta-Cluster, Wikimedia-Logstash, Patch-For-Review: Logstash on beta yields 500 due to NFS outage (can't open /data/project/logstash/.htpasswd) - https://phabricator.wikimedia.org/T102962#1384071 (bd808) Open>Resolved
[21:42:50] Labs, Beta-Cluster: Things broken by betacluster suddenly being moved off NFS - https://phabricator.wikimedia.org/T102953#1384072 (bd808)
[21:43:05] (CR) Sitic: [C: 2 V: 2] JS performance optimization [labs/tools/crosswatch] - https://gerrit.wikimedia.org/r/219477 (https://phabricator.wikimedia.org/T102569) (owner: Sitic)
[21:58:41] YuviPanda: can you please either fix or delete quarry-runner-test
[21:58:42] ?
[21:59:52] also marathon-bastion-01
[22:00:18] these are instances that are still trying to connect to virt1000 salt master
[22:09:53] andrewbogott: oh, hmm. let me look at quarry-runner-test now
[22:15:33] I can't see instance proxies (https://wikitech.wikimedia.org/wiki/Special:NovaProxy) for any of my projects
[22:15:43] Tried logging out and logging back in
[22:16:04] ah
[22:16:18] let me fix that
[22:16:34] bd808: try now
[22:16:41] !log project-proxy restarted dynamicproxy-api
[22:16:43] Logged the message, Master
[22:16:47] \o/ thank YuviPanda
[22:18:32] mutante: any idea what’s wrong with hiera on sensu-01?
[22:20:27] andrewbogott: i had to look up what sensu was :p .. am i listed as a project admin or something?
[22:20:36] you created the instance
[22:22:35] i think i did it for somebody else.. doing support
[22:22:39] or forgot it
[22:22:56] if it doesn't have any other admins, we can delete it
[22:23:15] where do you see a hiera error though
[22:23:43] puppet fails
[22:24:21] Labs: Create WikiSpy project - https://phabricator.wikimedia.org/T96512#1384225 (Platonides) Requesting a second instance for development should be no problem. It can probably use a smaller disk, too. Regarding the public IP request, it looks like web proxied access to your instance would be enough. See http...
[22:27:19] looking
[22:27:59] Could not find data item labs_recursor in any Hiera data file
[22:28:12] eh, yea, i don't know. does _only_ sensu have that?
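
For the instances at 21:58 that were still pointing at the old virt1000 salt master, the usual remedy is to repoint the minion and restart it; a sketch with a placeholder for the current master's hostname (the real value comes from puppet / the labs docs, not from this log):

    # On the affected instance:
    sudo sed -i 's/^master:.*/master: salt-master.placeholder/' /etc/salt/minion
    sudo service salt-minion restart

    # An admin may also need to accept the minion's key on the new master
    # (salt-key -a <instance-name>).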
[22:28:19] that sounds so DNS change related
[22:29:10] maybe because it was created a while ago
[22:30:08] this is what i see in hiera:
[22:30:21] codfw.yaml:labs_recursor: "labs-recursor0.wikimedia.org"
[22:30:21] eqiad.yaml:labs_recursor: "labs-recursor0.wikimedia.org"
[22:30:22] labs.yaml:labs_recursor: "labs-recursor0.wikimedia.org"
[22:30:40] so it should get it based on $realm
[22:31:23] but for some reason the realm lookup fails?
[22:31:35] "codfw.yaml:labs_recursor: "labs-recursor0.wikimedia.org"
[22:31:35] eqiad.yaml:labs_recursor: "labs-recursor0.wikimedia.org"
[22:31:35] labs.yaml:labs_recursor: "labs-recursor0.wikimedia.org"
[22:31:42] arr, sorry, bad paste
[22:33:50] the error is on line 55, that is $nameservers = [ ipresolve(hiera('labs_recursor'),4) ], in the $realm = labs section
[22:34:59] i am not concerned about this specific instance because it is not used
[22:35:21] but i would wonder if that really happens only here
[22:36:49] (PS1) Sitic: Fix filtering [labs/tools/crosswatch] - https://gerrit.wikimedia.org/r/219486
[22:37:07] (CR) Sitic: [C: 2 V: 2] Fix filtering [labs/tools/crosswatch] - https://gerrit.wikimedia.org/r/219486 (owner: Sitic)
[23:18:55] Labs, Labs-Infrastructure, Wikimedia-Apache-configuration, operations, wikitech.wikimedia.org: wikitech-static sync broken - https://phabricator.wikimedia.org/T101803#1384345 (Dzahn) checked access logs on silver. yes, wikitech-static tries getting the files: ``` 16:06 wikitech-stat...
[23:21:53] Labs, Labs-Infrastructure, Wikimedia-Apache-configuration, operations, wikitech.wikimedia.org: wikitech-static sync broken - https://phabricator.wikimedia.org/T101803#1384349 (Dzahn) on the wikitech-static side: in /srv/imports ``` 0 -rw-r--r-- 1 root root 0 Jun 13 16:43 labswiki-20150...
[23:30:47] Hi everyone, some contributors are asking on frwiki why xtools-articleinfo is down; does someone know the reason? maybe a side effect of the change to full https on wikipedia?
[23:35:09] check the topic
[23:42:10] Labs, Labs-Infrastructure, Wikimedia-Apache-configuration, operations, wikitech.wikimedia.org: wikitech-static sync broken - https://phabricator.wikimedia.org/T101803#1384404 (Dzahn) the import script was running several times: ``` root@wikitech-static:/srv/imports# ps aux | grep import-wikit...
[23:42:17] Polsaker: Oh, I missed it, thanks
[23:46:21] Hi, is it just me or is curl under php not working (don't know if it's related to the NFS failure)?
[23:47:26] (in tool labs)
[23:47:59] jem_wikipremia2, https migration?
[23:50:07] Ah :)
[23:50:15] MaxSem: Thanks, that was it
[23:50:46] Since when has https been mandatory?
[23:51:14] ~1-2 weeks, depending on wiki
[23:51:21] Good
[23:51:44] I mean, I've fixed the spellchecker for eswiki, but it is used in 4 other projects
[23:56:38] Ok, I'll try to reach APPER, the original developer, to coordinate
[23:56:44] Thanks again
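
For the labs_recursor failure discussed at 22:27–22:34, the lookup can be reproduced from the command line with the hiera CLI to see which hierarchy level (if any) answers. The hiera.yaml path and fact names below are assumptions about how the self-hosted puppetmaster on that instance is laid out:

    # Ask hiera directly, supplying the facts the hierarchy keys on:
    hiera -c /etc/puppet/hiera.yaml labs_recursor ::realm=labs ::site=eqiad

    # If that returns nil, the configured hierarchy never reaches
    # labs.yaml/eqiad.yaml (a stale hiera.yaml on a self-hosted
    # puppetmaster is a common cause); debug mode shows every file tried:
    hiera -d -c /etc/puppet/hiera.yaml labs_recursor ::realm=labs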
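
The xtools and PHP-curl breakage at 23:30–23:47 is the HTTPS migration mentioned in the replies: plain-HTTP API calls now get redirected, so tools need to switch their stored base URLs to https (or at least follow redirects). A quick check from a shell, with es.wikipedia.org chosen only as an example:

    # The old HTTP endpoint answers with a redirect to HTTPS:
    curl -sI 'http://es.wikipedia.org/w/api.php' | head -n 3

    # Point the tool at HTTPS directly; -L would also follow the redirect,
    # but updating the base URL avoids the extra round trip:
    curl -s 'https://es.wikipedia.org/w/api.php?action=query&meta=siteinfo&format=json' | head -c 200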