[00:16:00] are there any known gluster issues currently? [00:16:28] the instance parsoid-spof in the visualeditor project does not seem to come up with /data/project as expected [00:19:13] Coren: ^^ [00:19:58] gwicke: No known issues, but it's gluster so it doesn't mean it's not broken. [00:20:29] yeah ;( [00:22:37] anybody I can ping to look into it? [00:22:46] Me. [00:22:48] :-) [00:23:02] ok, done ;) [00:25:06] I've beaten gluster into compliance. It should be up. [00:25:50] yes, I'm seeing /data/project now. thanks! [00:27:22] ... did you just reboot the box? [00:27:49] yes, was too lazy to manually restart everything that depends on the fs [00:31:27] a few minutes later some services are still blocked on glusterfs.. [00:32:44] gluster is constantly using ~20% cpu [00:38:42] Coren, something about gluster is still not normal [00:38:56] it has always been slow, but not *that* slow [00:39:16] still waiting for a few js files to load.. [00:40:22] gwicke: The only normal thing about gluster is that it sucks tremendously. [00:40:52] gwicke: That said, being able to restart a volume (as I just have) is the extent of my skill at fixing it. [00:41:13] k [00:41:46] the last bytes of js seems to just have arrived, so the service is back up now [00:42:27] thanks for your help! [00:50:50] Coren: JFI: The last Puppet run was at Tue Feb 18 15:03:53 UTC 2014 (583 minutes ago) login / exec nodes [00:52:58] ... and a dozen scripts running on tools-login ... [02:46:45] petan, is the bots-sql2 instance still in active use? [02:52:07] hm… is anyone about who knows about the editor-engagement project? legoktm for example? [03:12:08] andrewbogott: hey you there? [03:12:16] I am! What's up? [03:12:31] I wanted to start helping out with wikimedia projects [03:12:41] someone mentioned you were a good person to talk to about this [03:14:22] andrewbogott: i have this link https://bugzilla.wikimedia.org/buglist.cgi?bug_status=NEW&bug_status=ASSIGNED&bug_status=PATCH_TO_REVIEW&bug_status=REOPENED&component=General&component=Infrastructure&component=tools&component=wikitech-interface&list_id=278391&product=Wikimedia%20Labs&query_format=advanced&resolution=---&resolution=LATER&resolution=DUPLICATE [03:14:34] !log editor-engagement moved ee-prototype to virt12 due to space issues on its old host [03:14:57] pancakes9: just a second, have to fix the logbot :/ [03:15:08] andrewbogott: sure, take your time, will be here [03:20:54] !log editor-engagement moved ee-prototype to virt12 due to space issues on its old host [03:21:22] labs-morebots, what is your deal? step up your game! [03:23:05] I am a logbot running on tools-exec-08. [03:23:05] Messages are logged to wikitech.wikimedia.org/wiki/Server_Admin_Log. [03:23:05] To log a message, type !log . [03:26:59] pancakes9: OK, so 'wikimedia projects' is a very broad scope! Labs is a platform for volunteers to use when working on wikimedia projects, but that doesn't mean that you necessarily would want to work on labs bugs in particular... [03:27:03] does that distinction make sense? [03:27:31] andrewbogott: yes [03:28:32] andrewbogott: where would you recommend I start? [03:28:54] I'm happy to provide resources to move you towards your goal, but am not necessarily so good at providing you with a goal :) I'm looking at a mailing list now, trying to see what has been recommended to other volunteers. 
[03:29:21] Here's a good starting point: http://lists.wikimedia.org/pipermail/wikitech-l/2014-January/073895.html [03:29:38] That is to work on the mediawiki software itself, which is largely in php. [03:30:26] andrewbogott: do you guys need any help building a central page to help new people get onboarded? [03:31:46] Yes, probably :) For instance, the mediawiki page has that 'how to become…' link right on the front: https://www.mediawiki.org/wiki/MediaWiki [03:31:59] We could use a similar thing on https://wikitech.wikimedia.org/wiki/Main_Page [03:32:12] It's something I ponder now and then, but I don't have a great idea about how it would look. [03:32:28] What we really need is a way for volunteers with existing projects to advertise for help [03:33:01] andrewbogott: about to go to sleep but, FYI, new instances in eqiad get the NFS treatment now. [03:33:08] andrewbogott: also, i wanted to help out on puppet-related projects [03:33:19] Coren: Great! does that mean we're done? [03:33:33] andrewbogott: what is eqiad? [03:33:59] pancakes9: Ah, in that case you are in the right place :) user matanya is doing lots of puppet cleanup work, and I'm sure he has tasks he could suggest. [03:34:30] I think he's sleeping and/or working right now, though. [03:34:44] pancakes9: eqiad is the name of our current primary datacenter. [03:35:12] We're in the process of cleaning up/shutting down our old one (pmtpa) and are going to set up a new one (name tbd) [03:35:47] andrewbogott: I think. I mean, as far as I can tell instances are fully functional but we need to switch the class in LDAP for the older ones. Also, what happens about names between the two sites? Can you gave a 'foo-instance' in eqiad and a 'foo-instance' in pmtpa? [03:36:02] Coren: fyi, virt12 still has some wiggle room, so I'm moving an instance or two from virt6 to there. [03:36:21] ... I didn't even know we had a virt12. :-) [03:36:41] Coren: duplicate names should be supported. Indeed, hopefully we can migrate instances by creating duplicates in eqiad and the clobbering their virt file with a copy from tampa. [03:36:56] I haven't tried that yet, but it should be a good trick to keep respective nova DBs happy [03:37:17] andrewbogott: That's great because I'll really want to set up a parallel tools. [03:38:13] andrewbogott: We can't copy between the two fileservers yet (no hole in the firewall) but I think we're golden otherwise. For labs in general anyways. I still need the new tools-db for tools but that's on my plate tomorrow. [03:38:22] pancakes9: The thing I would be working on (if I weren't fussing with datacenter stuff) is converting our existing puppet code to modules and linting it and such. Some of that is very dicey and affects production machines, but some of it is straightforward. That's the kind of work that matanya has mostly been doing. [03:38:31] andrewbogott: You need to tell me when instances created in eqiad will stick though. [03:38:41] Coren: ok, so we need network changes before we can migrate shared storage? [03:38:51] I will ask mark about that when he's awake. [03:39:14] andrewbogott: labstore4.pmtpa (10.0.0.44) <-> labstore1001.eqiad (10.64.37.6) [03:39:19] ok. [03:39:54] andrewbogott: what is your role? [03:40:05] Coren: before we turn this over to volunteers I'm really hoping to have wikitech manage both DCs. Ryan is going to do some work in the next day or so to enable that. [03:40:48] pancakes9: Coren and I are the staff for Labs. 
Coren mostly works on toollabs (which is a subproject of labs) and I work on the general infrastructure/user interface/etc. for labs. [03:41:09] Coren: But as for which instances will stick… well, let's do a bit more testing and a couple of practice migrations. [03:42:08] what are nova DBs [03:42:41] Coren: Then it might be nice to get a clean re-install of the compute nodes, just for good measure... [03:43:03] pancakes9: We're talking about arcane openstack/labs design stuff. Nova is the service that manages virtualization in labs. [03:43:22] andrewbogott: If you think that's desirable, it absolutely is the right time for it. :-) [03:43:50] Well, only because there are some bits and bobs of old VMs on there that I don't know how to clean up properly [03:44:01] Ah, reinstalling the compute nodes won't help with that, though... [03:44:05] so nevermind re: reinstall [03:44:31] At some point the nova dbs will get out of sync, I guess declaring that yesterday was the last sync point is as good as anything. [03:44:53] The only issue here really is that security group changes will diverge between the dcs [03:46:00] So, Coren, the answer is that we probably won't need to delete any eqiad instances. But I don't want to promise that we won't, quite yet. [03:47:21] But we can't sincerely migrate until we do, really. [03:48:03] Right -- but… needs more testing! [03:48:57] pancakes9: do you know where to find the WMF puppet repo? Have you created a Labs account yet? [03:52:11] andrewbogott: i have created an account [03:52:27] andrewbogott: a bugzilla account at least [03:52:36] oh, that's different. [03:52:47] You'll want an account here: http://wikitech.wikimedia.org/ [03:54:10] andrewbogott: done [03:54:45] Bed beckons. [03:54:48] * Coren waves. [03:54:52] 'night [03:55:10] pancakes9: I'm too distracted to suggest anything specific, but if you want to poke around in our puppet repo, it is here: https://gerrit.wikimedia.org/r/#/admin/projects/operations/puppet [03:56:12] andrewbogott: cool, i'll poke around and come back with questions [03:56:31] andrewbogott: what time zone are most people in this room in? [03:56:36] i'm assuming EST? [03:56:39] lots of 'em [03:56:54] right now I'm in GMT+8, which is in SE asia. [03:57:05] Most volunteers are in western Europe or the US. [03:57:13] So that narrows it down to a 16-hour span :) [03:57:26] It's pretty quiet here during my daytime (which, it's noon now for me) [04:10:00] dammit [04:34:10] andrewbogott: what does Yuvi Panda do? [04:34:43] pancakes9: the full rundown is here: https://wikimediafoundation.org/wiki/Staff_and_contractors [04:35:02] Which doesn't tell you what anyone does. [04:35:20] ;-) [04:36:58] Yuvi works on native apps for mobile and lots and lots of other stuff [08:25:01] !log moved bots-sql to virt12 to free up space on virt6 [08:25:01] moved is not a valid project. [08:25:08] !log bots [08:25:17] !log bots moved bots-sql to virt12 to free up space on virt6 [10:11:56] !jira is To request bugs to be moved from JIRA to Bugzilla, please file a bug using http://tinyurl.com/jirabugzilla [10:11:57] Key was added [11:00:10] valhallasw: hey :] I got some Jenkins slaves in labs that let us run tox and pip :-] [11:00:26] gotta write some documentation [12:02:04] hedonil: cool [12:02:09] er, hashar [12:23:43] http://ganglia.wmflabs.org/ down? [12:25:45] Krinkle: A bit slower than what I would hope for, but works for me? 
[12:26:20] It was unavailable completely a few minutes ago (no http response at all) [13:36:13] And ganglia web server seems down again.. [13:37:22] Puppet is failing for cvn instances (possibly other projects as well) according to Nova web dashboard [13:37:23] Not sure what to do about it.. [13:37:32] andrewbogott: CorenAFK: [13:38:02] Krinkle: Willl look into it shortly. [13:38:08] https://wikitech.wikimedia.org/wiki/Special:NovaInstance?projects[]=cvn [13:39:06] Krinkle: I just now caused a brief, partial DNS outage. Are things coming back now? [13:39:48] andrewbogott: Yep, ganglia is back up [13:39:57] cool. Sorry... [13:40:01] and it seems the cvn is back up as well [13:40:31] I only barely understand this issue… part of it is that dns can't come up on virt0 or virt1000 without hand intervention... [13:40:31] andrewbogott: Do yo know how the web proxy service (the one that is "native" to the nova / OSM interface) is configured? e.g. how much delay should I expect for itt o take effect? [13:40:41] I rebooted virt100, which was the source of the original issue. [13:41:04] But I don't quite understand why having two dns servers means that when one is down dns fails. Seems like the opposite of the redundancy we would want [13:41:15] indeed [13:41:22] 'twin engine problem' [13:41:30] The dynamic proxy should work more-or-less immediately. [13:41:37] Well, modulo dns prop [13:41:48] OK, maybe it was not working because of the dns issue? [13:41:56] probably! [13:42:02] Contrary to ganglia, cvn.wmflabs.org was responding but with gateway timeout [13:42:14] One more thing, https://wikitech.wikimedia.org/wiki/Special:NovaInstance?projects[]=cvn shows me the image ID is deprecated. [13:42:34] One says "(deprecated)" the other says "(deprecated 2014-02-18)" [13:42:40] I only created that instance fresh yesterday [13:42:43] already deprecated? [13:43:02] Well… each time we change the base isntance we need to set the old one aside. [13:43:07] So 'deprecated' just means there's a new one. [13:43:08] and puppet is apparently "failing" though it hasn't affected me in any way yet (glusterfs and ssh work fine) [13:43:13] Instances that rely on it are fine. [13:43:24] does ubuntu get security updates? [13:43:47] I don't actually know how instances are configured by default. I think they update but I'm not positive. [13:44:06] what kind of changes do you make to the base, and to what extend can/should/are they applied to existing ones? I would hope that that is mostly possible or being done, otherwise instances would be all very different in the long term, difficult to maintain also. [13:44:15] I fixed a labs-wide puppet but earlier today… but instances should be happy by now. I'll log in to see what's going on. [13:44:33] OK, thanks :) [13:44:40] The changes generally relate to how the instances act when they first come up. Once puppet has applied most instances should be uniform regardless of their base instance. [13:52:05] andrewbogott: /Should/ be. A thing to watch out for in the base images is things that are installed there but not mentionned in puppet. [13:53:03] CorenAFK: sure. [13:54:54] Krinkle: your instances look fine. I don't know why puppet was marked as 'failed' -- everything is green now [13:55:14] andrewbogott: What is your current status? I see the eqiad bastions were destroyed and one recreated (but broken), and manage-nfs-volumes on storage1001 doesn't seem to be able to query the projects from LDAP anymore. You're in the middle of something? 
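A side note on the "dynamic proxy ... modulo dns prop" exchange above: whether a *.wmflabs.org proxy name has propagated can be checked directly with ordinary DNS tools. A minimal sketch, reusing the cvn.wmflabs.org name from this log; nothing here is specific to the proxy itself:

    dig +short cvn.wmflabs.org A    # addresses the public resolvers return right now
    host cvn.wmflabs.org            # same check via the local resolver; compare from inside and outside labs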
[13:55:54] CorenAFK: In theory I'm just reconciling eqiad glance with pmtpa glance. [13:55:58] It's not going that great though. [13:56:03] The ldap thing… I can offer no explanation :( [13:56:26] File "/usr/local/sbin/manage-nfs-volumes-daemon", line 276, in get_hosts [13:56:26] host_ip = host[1]["aRecord"][0] [13:56:26] KeyError: 'aRecord' [13:56:39] I know that worked yesterday. [13:56:50] * CorenAFK looks for a be more verbose toggle in that script. [13:56:56] Maybe because there are 0 hosts? [13:58:38] Ah, projects with no hosts? Hm. [13:58:40] * CorenAFK reads code. [13:59:07] Glance is killing me. It works great for most images, just not the ones that I actually care about [13:59:19] Does python have a perl Data::Dumper()-like thing or PHP print_r equivalent? [13:59:41] "print repr(o)" gets you somewhere. [14:00:54] is there a wiki page for a checklist of things to do wrt the server move? [14:01:12] scfc_de: Indeed it does. It's kinda jsonish, but it's better than nothing. [14:01:37] scfc_de: Thanks. [14:02:19] andrewbogott: Aha. There are hosts entries in LDAP with no aRecord entry. [14:02:20] chippy: I think we're not at a stage where user action is needed. So I would wait for the official "Go!" :-). [14:02:33] scfc_de, okay thanks. [14:02:52] ('dc=i-00000001.eqiad.wmflabs,ou=hosts,dc=wikimedia,dc=org', {'puppetvar': ['realm=labs', 'instanceproject=bastion', 'instancename=bastion-eqiad']}) [14:03:35] andrewbogott: Which may be (part of) the reason why it's listed as 'instance state ERROR' in the UI. [14:03:52] CorenAFK: it's the other way round I think [14:04:01] instances don't come up because glance won't serve the image [14:04:07] hence, ERROR! [14:04:23] And, and since it's not up it never gets an IP. [14:04:34] right [14:04:42] manage-nfs-volumes needs to be able to cope with that. [14:04:45] * CorenAFK goes and updates it. [14:05:33] OK, I feel pretty strongly that if a server is going to return 500 there should be some kind of error message in the log file. Not just a happy 'Successfully retrieved image' [14:06:54] ESPECIALLY if the problem is that the image file can't be read [14:06:58] * andrewbogott yells at glance [14:07:32] Thanks, did a full check and things are running smoothly. It seems the new instance I created is a lot more stable than the old one I created ~ 10 months ago. I'll migrate stuff over and delete the old one. [14:07:34] coren, want me to leave that broken instance in place as a test case? [14:08:17] andrewbogott: I still see one last trace of the instance I deleted yesterday, in ganglia it is marked as 'down'. No worries, but just letting you know. I'll file a bug if you think its a bug. [14:08:35] http://ganglia.wmflabs.org/latest/?c=cvn vs https://wikitech.wikimedia.org/wiki/Nova_Resource:Cvn [14:08:41] Krinkle: I don't have much if anything to do with labs ganglia. I'm sure that whoever does would consider it a bug :) [14:09:48] andrewbogott: The fix is trivial, I'll be able to test in it a few minutes at most. [14:10:05] ok, let me know when I can rebuild. [14:11:40] andrewbogott: Is this sane? https://gerrit.wikimedia.org/r/#/c/114146/ [14:12:14] help. My labs instance fastcci-worker2 is unreachable. ssh kicks me out. Probably the home dir is not there (console in nova shows a login prompt) [14:12:29] dschwen: Yeay gluster [14:12:34] andrewbogott, you fixed it last time [14:12:38] hi coren [14:12:50] dschwen: andrewbogott is in the middle of something. Lemme go kick your gluster. 
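For context on the KeyError above: it is raised for entries under ou=hosts that carry no aRecord attribute, like the dn/puppetvar pair quoted at [14:02:52]. A rough way to list candidates from a shell, assuming an anonymous bind is allowed and leaving out the server and auth options the daemon itself uses:

    # entries that print a dn but no aRecord line are the ones that trip up manage-nfs-volumes
    ldapsearch -x -LLL -b 'ou=hosts,dc=wikimedia,dc=org' '(objectClass=*)' aRecord puppetvar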
[14:12:59] I'm on travel and it apparentky has been down for 48h [14:13:01] What's the project name? [14:13:03] k thx [14:13:04] Coren: yeah, looks good [14:13:06] fastcci [14:13:20] worker2 is the instance [14:13:31] worker1 instance is fine [14:13:48] Coren, I think we have one more volume than gluster can handle… every time I kick one, another one falls off the edge :( [14:14:33] andrewbogott: How "fun". The good news is that gluster is on its way out. [14:14:37] dschwen: Try now? [14:14:42] Coren: yep [14:15:09] hm [14:15:14] login hangs [14:15:27] at least I'm not kicked imediately [14:15:28] dschwen: The gluster mount takes some time. Often over 30s [14:15:40] ok, waiting [14:15:43] Coren: ok, glance is working again, so I think eqiad is back in business. [14:16:03] Lemme +2 my bugfix, test it, and you can redo that bastion. [14:16:08] And, I'll try to resist the urge to purge existing instances in the future :) Sorry if I messed up your workflow [14:17:16] andrewbogott: No worries, this allowed me to find and squish a bug. :-) [14:17:27] Coren: did you miss breakfast? [14:17:51] andrewbogott: Nah-- breakfast is ongoing. :-) [14:18:10] dschwen: Success? [14:18:22] hmm, my login is still hanging [14:18:28] SyntaxError: invalid syntax [14:18:28] I won't blame you if you step away and re-start your workday in an hour... [14:18:42] after the second "If you are having access problems..." [14:18:42] if "aRecord" in host[1] [14:18:42] ^ [14:18:47] ok, clearly I'm of no use doing python code review... [14:18:56] missing a : [14:19:03] Oh duh. Even *I* knew that! [14:20:40] should I try rebooting the instance? [14:21:06] dschwen: Lemme go take a look at it first. [14:22:27] shouldn'thave chosen shared home dirs... [14:23:09] dschwen: No, shared home dirs is good. It's gluster that is evil. AFAICT, it's stuck hard; probably because there is a broken gluster mounting process that's not recovering. A reboot will probably fix you. [14:23:34] Coren, sometimes you have to kill-9 those gluster processes. [14:23:40] And/or restart autofs on the instance [14:23:53] andrewbogott: I know, but not being able to log in as root on the instance makes that difficult. :-) [14:24:01] True! [14:24:19] andrewbogott: Bugfix alpplied and working. Feel free to nuke the broken bastion. [14:24:28] ok, thanks. [14:25:04] Failed to reboot instance fastcci-worker2. [14:25:09] :-( [14:26:57] It lies. It rebooted succesfully. [14:27:06] 14:26:47 up 0 min, 1 user, load average: 1.31, 0.43, 0.15 [14:27:30] With working /home to boot [14:27:33] ha [14:27:35] yes! [14:28:24] dschwen: Gluster is on its way out. Users will get an opportunity to spit on its grave when we decomission it. :-) [14:28:42] haha, sign me up [14:30:06] With the new system, when there's a failure all projects will fail at once! So we'll notice right away [14:30:21] looks like my /data/project stuff is not there though [14:30:21] … as I understand it :/ [14:31:24] what happens when you cd /data/project? [14:31:34] nothng, hangs [14:31:48] hey, that's something! [14:32:01] :-D [14:32:15] (I know I hate that answer, too) [14:32:25] -bash: cd: /data/project: No such file or directory [14:32:33] got this now [14:32:44] yeah, I just killed the volume. [14:32:54] Cd out of the dir, then do 'sudo service autofs restart' [14:33:21] ok, done, trying to cd [14:33:27] works [14:33:59] Sorry this is so fragile. The future will be better, or at least different! [14:34:16] and I'm back in business [14:34:23] ok [14:36:12] thanks guys! 
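The recovery that worked for fastcci-worker2 above amounts to roughly the following on the affected instance; the lazy unmount and kill -9 steps are only needed when the gluster client has wedged, so treat this as the general idea rather than a fixed recipe:

    cd /                                          # step out of the stale mount first
    sudo umount -l /data/project                  # lazy-unmount the hung gluster mount, if still listed
    pgrep -f glusterfs | xargs -r sudo kill -9    # the "kill -9 those gluster processes" step
    sudo service autofs restart                   # as suggested at [14:32:54]
    cd /data/project                              # autofs remounts on first access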
[14:38:33] andrewbogott: It'd certainly be noticable, but I'm expecting failures to be rare now that there is no XFS in the way. :-) [14:42:37] andrewbogott: I note your bastion images are still fuxx0red? [14:43:15] Coren: Hard to say. I can't log in, that's for sure. [14:43:22] Hoping that it's just a slow puppet run... [14:43:27] although that seems less and less likely [14:43:34] Ah, true, that might be it. [14:43:47] Does this image still install autofs because that might interfere. [14:44:06] yes, this is the same image that tampa uses. [14:44:15] Hm. [14:44:20] That worked yesterday. [14:44:24] yeah [14:44:33] They respond to ping, so that's something [14:46:22] well, lemme try to build a new image. Although that might just get us two problems [14:48:27] actually… Coren, what would removing autofs from an instance entail? Just not installing the package? [14:48:56] andrewbogott: Since the new class doesn't install it, yes. [14:49:05] hm... [14:50:50] but is the issue really that the package is present, or that there's an entry someplace telling it to try to mount things? [14:51:03] (I ask because the latter would be easier to remove) [14:52:08] e.g. [14:52:11] andrewbogott: ... in *theory* there shouldn't be anything put in auto.master by the new setup, but having the package installed means the daemon is running by default which might be an issue. [14:52:13] /etc/autofs_ldap_auth.conf /etc/autofs_ldap_auth.conf [14:52:26] andrewbogott: Also, I'm pretty sure the current image contains stuff in auto.master [14:52:27] ^ those two files are currently present [14:58:51] andrewbogott: In the image? [14:59:12] yeah, I'm building an image without them. I also removed the line that explicitly starts autofs [15:02:30] "Interesting"; I see 10.68.16.2 10.68.16.3 10.68.16.4 10.68.16.5 as IPs for bastion-home in eqiad, but there are only two instances. Leftover crud in LDAP? [15:03:42] Also: do we want a Trusty image? I think I'd like to set up the new tools with them if we have one. [15:04:16] We want one but I'll need to build it. [15:04:36] .2 and .3 are real bastions, I don't know why .4 and .5 are still in there... [15:04:51] our code links ldap host entries sometimes. I've worked on that off and on but haven't found a good solution [15:05:16] well, wait, .4 and .5 are in testlabs [15:05:59] Ah, huh. And if the project name is the same, it'll leak into manage-nfs-volumes. Is this an issue? [15:06:30] Can random passerbys get privileges on testlabs? [15:07:12] hm, maybe. [15:09:48] If so, that's an issue because it could be misused to elevate on the real labs. The best solution would be for testlabs to use a different base DN, or to prefix project names. [15:10:30] Alternately, never give bits on testlabs to people who don't have that same bit on the real one. [15:11:22] wait, what's special about testlabs? Seems like this problem could allow random cross-project access in general [15:12:29] Well no; normally only people with admin on project X can create instances in that project. But if testlabs is looser about privileges, someone could create an instance /there/ and gain access to the filesystem. [15:13:13] You'll only be put in the exports if the instance is in the same project. [15:14:18] The issue is "being able to create an instance in testlabs you wouldn't have been allowed to create in the real labs" [15:16:41] Oh… I don't think testlabs does anything that regular labs can't do [15:16:47] it's just another project, isn't it? [15:24:09] It should be. It's important to be aware though. 
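On the "is it the package or the auto.master entries" question above, both are easy to inspect on a given instance; a minimal check, assuming nothing beyond stock Ubuntu tooling:

    dpkg -l autofs                  # is the package installed at all?
    grep -v '^#' /etc/auto.master   # any live map entries left behind by the image?
    sudo service autofs status      # is the daemon running by default?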
[15:31:32] hashar: Do you have time to help me figure out how to get started testing scap in beta today? [15:38:11] yeahhh [15:38:20] * hashar gives donut and coffee to bd808 [15:38:46] bd808: we might want to spam #wikimedia-qa [15:39:01] * bd808 will join there [15:48:11] hashar: @tox: cool, nice to hear! [15:49:22] andrewbogott: Coren: looks like something got broken on deployment-prep project. I can't log on instance (examples: deployment-bastion.pmtpa.wmflabs ) [15:50:11] I can log on deployment-prep-master.pmtpa.wmflabs , I think it is running an old fork of operations/puppet.git though [15:54:17] hashar, any better now? [15:54:27] trying [15:54:37] nop :( [15:54:54] wondering if it is related to NFS [15:55:10] dunno [15:58:52] ah [15:58:58] managed to log on deployment-sql.pmtpa.wmflabs [15:59:09] andrewbogott: I'm having the same problems as hashar. I can ssh directly to bastion[123].pmtpa.wmflabs but proxying through to any internal labs instance is failing. [15:59:10] in my home dir the files belong to UID and GID 4294967294 :D [15:59:17] labnfs.pmtpa.wmnet:/deployment-prep/home on /home type nfs4 (rw,port=0,nfsvers=4,hard,rsize=8192,wsize=8192,sec=sys,clientaddr=10.4.0.53,sloppy,addr=10.0.0.45) [15:59:39] same problem on /data/project which is NFS as well [15:59:48] labnfs.pmtpa.wmnet:/deployment-prep/project on /data/project type nfs4 (rw,port=0,nfsvers=4,hard,rsize=8192,wsize=8192,sec=sys,clientaddr=10.4.0.53,sloppy,addr=10.0.0.45) [15:59:55] This is definitely a Coren issue, I can't really help [15:59:57] deployment-sql.pmtpa.wmflabs works for me too [16:00:02] andrewbogott: thanks :-] [16:00:12] Ah, hm. Lemme check. [16:00:21] Coren: NFS broke somehow. /home /data/project yields UID and GID 4294967294 on deployment-prep. [16:01:12] bd808: so yeah beta access is "restricted" from time to time :-] I would say once per month [16:01:38] Keeping the riff-raff out :) [16:01:40] bd808: it uses to be two or three times per week, often for a days or two [16:01:43] used [16:01:50] GlusterFS was evil :] [16:02:03] I'be been doing a few changes in preparation to the move to eqiad; none of them should have affected pmtpa but apparently something slipped in. (Or it's a coincidence but I don't believe in those). [16:02:15] anyway nowadays it is surprisingly stable. Most issues with beta are usually in mw code and sometime in varnish code. [16:02:28] Coren: I can attempt rebooting / running puppet [16:02:44] hashar: Gimme a few minutes to try to see what's up first. [16:02:51] okkk [16:03:25] Coren: If it helps debug scholarship-alpha.pmtpa.wmflabs seems to be effected as well [16:04:36] deployment-sql.pmtpa.wmflabs lets me in but shows weird uid/gid ownership. scholarship-alpha.pmtpa.wmflabs and deployment-bastion.pmtpa.wmflabs say that my key is denied. [16:08:16] Ima reboot scholarship-alpha for testing. [16:08:30] * bd808 approves [16:10:20] scholarship-alpha was made all better with a simple reboot (it had not been rebooted for 120-odd days, and there were a few NFS outages in that interval) [16:10:26] deployment-bastion next. [16:10:53] autofs really doesn't cope well with change; it's generally the first things to go. [16:11:48] NFS is fine on deployment-bastion, but again autofs got confused and is no longer able to mount the SSH keys. rebooting. [16:12:03] * Coren tries something else first. [16:12:35] !log wikimania-support ssh to scholarship-alpha restored after Coren rebooted instance [16:12:54] * bd808 looks around for logmsgbot [16:13:09] Nope. autofs won't even restart. 
[16:13:30] In other good news, eqiad instances no longer rely on autofs either. [16:14:20] Yep. Rebooting deployment-bastion has autofs working again and being able to mount keys. [16:15:46] Coren, I'm still working on a new image, but you might also want to look at https://gerrit.wikimedia.org/r/#/c/114127/ with a suspicious eye [16:15:52] (after you resolve current issue) [16:16:53] andrewbogott: Wouldn't it be better to remove the explicit includsion from LDAP in the first place? [16:17:26] hashar: I think deployment-sql will need to suffer the same fate. Anything I need to do before I can safely reboot it? [16:17:29] maybe? If you have a way of doing that that won't take a day of work, please do so! [16:17:56] andrewbogott: I'm thinking a ldapvi with a global search-and-replace would work. [16:18:09] andrewbogott: I don't know if I've got credentials for it though. [16:18:13] I've never used ldapvi, but, sure, that seems reasonable [16:18:32] novaproxy probably has [16:19:34] * Coren tries while hashar gives the all-clear for a reboot. [16:20:28] Wait, novaproxy from where? :-) [16:21:20] Coren: rebooting an instance :] [16:21:24] deployment-apache32.pmtpa.wmflabs [16:21:33] hashar: I think deployment-sql will need to suffer the same fate. Anything I need to do before I can safely reboot it? [16:21:35] ^^ [16:21:42] na just reboot [16:21:46] it is the main database server [16:21:53] hopefully mysql can suffer a reboot :-] [16:22:31] !log deployment-prep rebooting apache32 and apache33 breaking beta :-] [16:23:12] Coren: apache32 reachable again :] [16:23:29] !log deployment-prep rebooting -bastion [16:23:36] So yeah. Autofs breakage precipitated by gluster breakage. [16:23:39] :-( [16:23:52] bluster must dieeee [16:24:14] We got a gun to its head now, just waiting for the right time to pull the trigger. [16:24:41] !log deployment-prep -bastion : /etc/init.d/udp2log stop && /etc/init.d/udp2log-mw start (known bug) [16:25:00] Does anybody know how to restart the log bot? It seems to be awol [16:25:01] Coren: so all fixed up by rebooting instances. Thank you :-] [16:25:28] bd808: https://wikitech.wikimedia.org/wiki/Morebots#Example:_restart_the_ops_channel_morebot [16:25:44] that is all I know [16:28:03] So we need somebody in the morebots tool group... [16:28:15] labs-morebots, are you here? [16:28:15] I am a logbot running on tools-exec-08. [16:28:15] Messages are logged to wikitech.wikimedia.org/wiki/Server_Admin_Log. [16:28:15] To log a message, type !log . [16:28:37] bd808: I restarted it earlier but it's having trouble accessing the wiki. I don't know why and can't really debug it right now [16:28:44] * bd808 nods [16:31:33] andrewbogott: ldapvi seems to be working. Do I save? I did a g/ldap::role::client::labs/s//role::labs::instance/ [16:31:53] 495 hosts updated that way. [16:32:39] Coren, conference call? [16:32:47] andrewbogott: Sure. [17:07:44] andrewbogott: So primary action item for andrewbogott is unbreak things. [17:07:55] yeah :( [17:33:16] andrewbogott: So the change in LDAP is done and your change reverted; but will wikitech add the old class or the new class if people create more instances? [17:33:41] in tampa it's still adding the old class. I'll fix that shortly (but you should nag me in 15) [17:42:10] Coren, ok, both DCs are including 'role::labs::instance' automatically now. [17:42:16] Did you verify that puppet still works in tampa? [17:42:23] I did. It does. [17:42:34] Unless someone sneakily created a new instance between my edit and your fix. 
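For the record, the ldapvi step above boils down to something like the following; connection and bind options are omitted here (ldapvi can take them on the command line or from its config), and the diff it shows before committing is the safety net for a 495-entry change:

    ldapvi -b 'ou=hosts,dc=wikimedia,dc=org'
    # inside the editor, the substitution quoted at [16:31:33]:
    #   :g/ldap::role::client::labs/s//role::labs::instance/
    # write and quit; ldapvi then displays the changes and asks before applying them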
:-) [17:46:22] * Coren actually checks that. [17:47:33] Nope. All good. [17:47:54] So… what does it mean that I can ping all these new instances but they don't have root keys? [17:48:03] Unlikely to be a networking problem, right? [17:48:19] Seems like probably puppet is failing to run… [17:48:30] would you expect us to have access in the event of a puppet failure? [17:48:37] I'd say. I should think it means puppet breaks at some point. Is there anything visible in the puppetmaster log? [17:49:00] Well, we have no access, not even a syslog [17:49:01] andrewbogott: That's hard to arrange without baking in keys in the image; though we /could/ stuff the new server key there. [17:49:14] the thing is, I have nova configured to insert my key on startup. [17:49:20] No, I mean in the log of the puppet /master/. Perhaps the attempt is visible there. [17:49:27] ah, I see. [17:49:29] * andrewbogott looks [17:49:48] Ohwait! Certificates, perhaps? Is the old certificate axed when an instance is deleted? [17:50:58] could be certs... [17:51:58] Coren: how would I tell? puppetca -l shows some pending certs but not for any of the instances in question... [17:53:11] Well, they wouldn't be /pending/ certificates. [17:53:13] Try this: [17:53:24] puppet cert --clean {node certname} [17:53:55] That'll remove an existing signed cert. Rebooting the instance then should make a new signature request thing. [17:57:18] Coren, sorry, I don't know what to put for {node certname} [17:57:31] is that one thing or two? [17:58:34] It's the name on the certificate, the node normally. Go by what puppetca -l shows you as the right model; I haven't played with our puppetmaster yet so I don't know if you should expect i-0000xxx or nodename.foo.bar in there [17:59:03] interesting, it was gracious about a few, but for the bastions it said 'err: Could not call revoke: Could not find a serial number for i-00000003.eqiad.wmflabs' [17:59:19] Anyway… what I really don't understand is why my injected key isn't working. That doesn't have anything to do with puppet [18:02:35] Coren, please join us in wikimedia-office? [18:02:52] Yep [18:03:25] * Silke_WMDE was about to ask the same [18:03:27] :) [19:14:05] !log tools tools-login: Disabled crontab and pkill -HUP -u fatemi127 [19:14:19] morebot? [19:14:37] labs-morebots: ? [19:14:37] I am a logbot running on tools-exec-08. [19:14:37] Messages are logged to wikitech.wikimedia.org/wiki/Server_Admin_Log. [19:14:37] To log a message, type !log . [19:16:49] "If you want to run a bot, please be advised that bots cannot be run on tools-login; you need to run them as a job on the grid. See https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/Help#Submitting.2C_managing_and_scheduling_jobs_on_the_grid for how to do that." -- *argl* [19:20:24] * valhallasw hands scfc_de a cup of coffee [19:21:02] http://cdn.memegenerator.net/instances/500x/46224666.jpg [19:24:36] valhallasw: I'm looking for an ASCII version of that for a "disable-crontab USER" script :-). Something that really says, "I mean bizness, bro!" :-). [19:27:35] scfc_de: :D [19:27:54] scfc_de: maybe the login banner used on the exec hosts? [19:29:11] or something here: http://www.retrojunkie.com/asciiart/logos/stopsign.htm [19:31:14] a) Cool, thanks! :-) b) What odd Google ads on top :-). [19:31:22] valhallasw: nothing works, we tried banned before [19:31:32] * banner [19:40:26] Hm.. web proxy having issues again? 
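A possible shape for the "disable-crontab USER" helper wished for at [19:24:36]; the script name, paths and wording are invented here, and keeping a backup before crontab -r is the part that matters:

    #!/bin/bash
    # disable-crontab: back up and remove a user's crontab, then try to tell them why
    user="$1"
    backup="/root/crontab.${user}.$(date +%Y%m%d)"
    crontab -u "$user" -l > "$backup" 2>/dev/null    # keep a copy so it can be restored
    crontab -u "$user" -r                            # drop the active crontab
    msg="Bots must not run on tools-login; please submit them as jobs on the grid instead."
    echo "$msg" | write "$user" 2>/dev/null || true  # best effort; only reaches users who are logged in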
[19:40:43] Getting 502 Bad Gateway on http://cvn.wmflabs.org/api.php again [19:41:42] I just disabled all gadgets that use cvn 2 days ago, re-enabled them today after dns issue was resolved. Now getting js errors and bug reports again for it being down. [19:43:49] Coren: andrewbogott_afk [19:44:05] Sorry to keep pinging, not sure what else to do. [19:44:33] I'm getting {"error":"missing-query"} myself. [19:44:49] So I'm definitely talking to it. [19:46:14] scfc_de: petan: you could use write command which is more impressive ;) [19:46:34] what u mean [19:47:18] Krinkle: I don't see 502s at all. Do you still have them? [19:47:41] Yes [19:48:06] That {"error"} is the expected response if accessing api.php without a query [19:48:13] petan: did you receive the message ? [19:48:27] no [19:48:29] but I'm getting nginx bad gateway instead, which is from the web proxy (cvn itself runs apache) [19:48:32] hedonil: The problem is that those users are usually not online :-). [19:48:52] doing curl localhost/api.php or curl cvn.wmflabs.org/api.php from within a labs instance works though. [19:49:48] traceroute: Warning: cvn.wmflabs.org has multiple addresses; using 208.80.153.190 [19:49:49] traceroute to cvn.wmflabs.org (208.80.153.190), 64 hops max, 52 byte packets [19:50:00] 13 10ge5-1.csw5-pmtpa.wikimedia.org (84.40.25.102) 143.250 ms 136.008 ms 137.995 ms [19:50:05] 14 208.80.153.190 (208.80.153.190) 131.576 ms 134.635 ms 137.360 ms [19:50:26] that's from my own machine [19:50:48] curl 'http://cvn.wmflabs.org/api.php' [19:50:48] 502 Bad Gateway [19:50:49]
nginx/1.5.0
[19:52:48] scfc_de: yes, but some times they are - and I think it's a bit better than just the 'hidden' mails, which are easy to miss [19:54:50] petan: Hmm. the message should have appeared at your console session pts/117 on tools-login... [19:55:20] hedonil: do you have an idea how many terminals on how many servers and laptops I have open? [19:55:23] more than 200 atm [19:55:29] petan: first banner, then ban :p [19:56:19] I don't even know I have such a terminal on pts/117 open, and I don't even know on which PC it is [19:56:29] hedonil: For example the user just now wasn't online and I don't remember others being; I'll add a on-wiki message to the script. In the end it's their "duty" to read the rules (or at least use some common sense :-)). [19:56:33] petan: hehe [19:57:27] "rm -Rf /" -- ooops, wrong window :-). [19:59:13] :O [19:59:32] Is there a problem with Labs? [19:59:46] specifically, nginx [19:59:56] Krinkle: Hmmm. [20:00:00] cvn.wmflabs.org has address 208.80.153.190 [20:00:01] cvn.wmflabs.org has address 208.80.153.214 [20:00:16] It looks as though two instances have the same public name. [20:00:35] Clearly only one of those is correct; I'm hitting the right one and you're hitting the wrong one. [20:00:46] huh: What do you mean by "nginx"? [20:01:06] scfc_de: I mean https://cvn.wmflabs.org/api.php is giving 502 errors [20:01:39] "502 Bad Gateway (nginx/1.5.0)" [20:01:45] huh: See the (ongoing) conversation between Coren and Krinkle :-). [20:01:47] petan: you must own extraordinary incredible multitasking skills ;-) [20:01:54] !logs [20:01:54] raw text: http://bots.wmflabs.org/~wm-bot/logs/%23wikimedia-labs/ cute html: http://tools.wmflabs.org/wm-bot/logs/index.php?display=%23wikimedia-labs [20:01:58] no, but my laptop does [20:06:38] Coren: cvn.wmflabs.org used to be served by a manually associated IP (208.80.153.131 via https://wikitech.wikimedia.org/wiki/Special:NovaAddress), with instance cvn-apache2 (now deleted). I created instance cvn-apache4, set up web proxy for cvn -> cvn-apache4, and then disassociated the IP, and deleted the cvn-apache2 instance. [20:06:51] So the IP it uses now is one handled internally by web proxy. [20:07:08] Yeah, but it has both it seems. [20:07:20] Which is going to cause you no end trouble. [20:07:23] No, 208.80.153.131 is neither of the two you named. [20:07:31] Ah, good point. [20:07:40] That IP is no longer used, shouldn't be an issue. [20:07:55] using web proxy I don't need to assign a public IP manually, right? [20:08:21] Lemme try to figure out where that comes from. [20:08:40] I'm guessing 208.80.153.190 is the bad one. [20:08:56] !log tools tools-login: Disabled crontab and pkill -HUP -u fatemi127 [20:09:16] scfc_de: does wm-bot trust you? [20:09:38] you're not logged in with a wikimedia/* mask [20:10:03] cloak is the right term, I think [20:10:46] valhallasw: !log is handled by morebots, which I HUPped just now. Let's see. [20:10:58] Krinkle: Yeah, I see it used by the same server that has {blue,pink,green}lake.wmflabs.org [20:11:00] !log tools tools-login: Disabled crontab and pkill -HUP -u fatemi127 [20:11:03] oh, right. Still, you're nog logged in to nickserv, so that might be an issue. [20:11:17] (not sure how morebots handles authentication) [20:11:43] (No, again: "AttributeError: 'module' object has no attribute 'enable_twitter'". *argl*) [20:12:28] Krinkle: I know of no obvious way to figure out which instance this matches, though. [20:12:30] valhallasw: I am *not* logged in to NickServ?! 
I'm pretty sure I protected this nick, it just has no "cloak". [20:12:56] Aha, it looks like wikisteam-web in wikisteam [20:13:12] Coren: Does webproxy use one public IP per subdomain <-> instance/port, or does it do multiple and use the incoming subdomain to know where to forward to? [20:13:14] scfc_de: DOH. Yes, you're right, you're logged in according to nickserv. [20:13:45] It uses the name; but there in addition to the proxy there is something else that snagged the same name. [20:14:06] scfc_de: based on the adminbot source, check the config.py file for a line enable_twitter = ... [20:14:15] Coren: cvn.wmflabs.org should map to cvn-apache4.pmtpa.wmflabs:80, which, though I can't see it, I guess is done by 208.80.153.214. [20:14:32] (but the code there doesn't actually use that parameter as far as I can see) [20:14:52] no, I'm blind. adminlog.py. [20:15:18] Krinkle: I don't think I'm clear. Because the proxy maps your domain name, it takes the name for itself. There is /also/ an instance that takes the name for itself; this one explicitly. [20:15:38] Krinkle: It's the latter that's causing an issue. :-) [20:15:58] valhallasw: confs/labs-logbot.py doesn't set that parameter; perhaps the default is 'yes'? I'll add an explicit no and restart. [20:16:27] scfc_de: I think there just is no default value, causing it to choke [20:16:33] Coren: OK. The only thing I can imagine this is is the old cvn-apache2, which used to claim cvn.wmflabs.org when it had its own public IP. But I've disassociated that IP and that instance has been deleted (plus, it was a different instance) [20:17:14] Krinkle: Edsu also took that name for their instance wikisteam-web in the wikisteam project. [20:17:27] ? [20:17:38] scfc_de: note that the logs *do* make it to the wiki [20:17:47] ( https://wikitech.wikimedia.org/wiki/Labs_Server_Admin_Log ) [20:19:08] valhallasw: Ah! Well, eh, ... [20:19:20] Let's try again, anyway. [20:19:25] !log tools tools-login: Disabled crontab and pkill -HUP -u fatemi127 [20:19:27] Logged the message, Master [20:19:33] Krinkle: That's arguably a bug in the openstack extension and/or Yuvi's proxy: that you can have the same name picked in both. But that's currently the case: there are two addresses because someone /else/ is also trying to use cvn.wmflabs.org [20:19:40] Okay, so let's clean up the SAL then. [20:20:21] Coren: Did wikistream also claim cvn as web proxy, or by other means? [20:20:36] Explicitly, as a public IP for one of their instances. [20:21:15] Do you have a log of that the action of assigning that in wikistream? Or is it possible that maybe it fell into that project due to a bug? [20:21:37] !log tools morebots: Set "enable_twitter=False" in confs/labs-logbot.py and restarted labs-morebots [20:21:38] Logged the message, Master [20:21:39] I find it hard to believe they would claim cvn.wmflabs.org, cvn predates wikistream, that would never work. [20:22:02] it's too bad morebots cannot be used for the projects within tools :-( [20:22:02] I don't think that's logged, but I don't know that Yuvi's proxy actually checks for conflicts (which it should) [20:22:11] * Coren removes the cvn username. [20:22:15] s/user/host/ [20:23:04] So cvn and wikistream both had a public IP assigned, the cvn one was the one that actually got it (reliably) for the past months, and today I deleted it and set up an instance proxy instead, and then the wikistream one started getting into the dns as well. [20:23:06] Interesting. 
[20:23:39] So though both the web proxy and the nova public IP assignment are subject to conflicts, the public IP assignment at least doesn't let both go to DNS (it only advertises one, by whatever arbitrary means) [20:23:43] There's definitely a bug /somewhere/, because the actual instance doesn't have most of those names. [20:24:08] I *hope* that DNS is derived from LDAP, because I know I can clean /that/ [20:25:11] Thanks :) [20:25:55] although it shouldn't be too hard to adapt... [20:27:49] Krinkle: I don't know how fast/often that DNS sync occurs. Let's wait a few minutes and see [20:29:54] Krinkle: Nope. Not working. I can't seem to find the source of this. Debugging. [20:30:30] It's workign for me now. Maybe it gave me the other IP now. [20:32:12] Yeah, DNS is still returning both A records. [20:33:07] 208.80.153.190 seems to be a misconfigured nginx. [20:37:28] !log local-gerrit-patch-uploader Does this work? A reading of the source suggests it should. [20:37:29] Logged the message, Master [20:37:41] Krinkle: I'd hunt it down, but atm I have my hands full. You can work around the issue by changing /your/ hostname for the moment. [20:37:42] * valhallasw checks where that ended up [20:38:53] not surpisingly, https://wikitech.wikimedia.org/wiki/Nova_Resource:Local-gerrit-patch-uploader/SAL [20:39:13] !log gerrit-patch-uploader Does this work? A reading of the source suggests it shouldn't. [20:39:14] gerrit-patch-uploader is not a valid project. [20:42:47] Coren: I can access it through instance-proxy.wmflabs.org, but using a different subdomain doesn't make much sense since this hostname is actively used as an API by various gadgets and tools, and has been for several months now. [20:43:14] Ah, bleh. [20:44:30] Ah, I think I managed to forcibly remove it. [20:47:13] Looks like that did the trick. [23:49:39] hi there, have there been database changes about two days ago? a query that used to work now causes the error: ‘UPDATE command denied to user 'p50380g50752'@'10.4.0.220' for table 'user'‘
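For the UPDATE error in the final question, a first diagnostic on the reporting side would be to confirm what privileges those credentials still carry. A hedged sketch, assuming the usual per-tool ~/replica.my.cnf credentials file and with the database host left as a placeholder:

    # if UPDATE is missing from the output, the error above is the expected behaviour
    mysql --defaults-file="$HOME/replica.my.cnf" -h <db-host> -e 'SHOW GRANTS;'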