[00:16:00] are there any known gluster issues currently? [00:16:28] the instance parsoid-spof in the visualeditor project does not seem to come up with /data/project as expected [00:19:13] Coren: ^^ [00:19:58] gwicke: No known issues, but it's gluster so it doesn't mean it's not broken. [00:20:29] yeah ;( [00:22:37] anybody I can ping to look into it? [00:22:46] Me. [00:22:48] :-) [00:23:02] ok, done ;) [00:25:06] I've beaten gluster into compliance. It should be up. [00:25:50] yes, I'm seeing /data/project now. thanks! [00:27:22] ... did you just reboot the box? [00:27:49] yes, was too lazy to manually restart everything that depends on the fs [00:31:27] a few minutes later some services are still blocked on glusterfs.. [00:32:44] gluster is constantly using ~20% cpu [00:38:42] Coren, something about gluster is still not normal [00:38:56] it has always been slow, but not *that* slow [00:39:16] still waiting for a few js files to load.. [00:40:22] gwicke: The only normal thing about gluster is that it sucks tremendously. [00:40:52] gwicke: That said, being able to restart a volume (as I just have) is the extent of my skill at fixing it. [00:41:13] k [00:41:46] the last bytes of js seems to just have arrived, so the service is back up now [00:42:27] thanks for your help! [00:50:50] Coren: JFI: The last Puppet run was at Tue Feb 18 15:03:53 UTC 2014 (583 minutes ago) login / exec nodes [00:52:58] ... and a dozen scripts running on tools-login ... [02:46:45] petan, is the bots-sql2 instance still in active use? [02:52:07] hm… is anyone about who knows about the editor-engagement project? legoktm for example? [03:12:08] andrewbogott: hey you there? [03:12:16] I am! What's up? [03:12:31] I wanted to start helping out with wikimedia projects [03:12:41] someone mentioned you were a good person to talk to about this [03:14:22] andrewbogott: i have this link https://bugzilla.wikimedia.org/buglist.cgi?bug_status=NEW&bug_status=ASSIGNED&bug_status=PATCH_TO_REVIEW&bug_status=REOPENED&component=General&component=Infrastructure&component=tools&component=wikitech-interface&list_id=278391&product=Wikimedia%20Labs&query_format=advanced&resolution=---&resolution=LATER&resolution=DUPLICATE [03:14:34] !log editor-engagement moved ee-prototype to virt12 due to space issues on its old host [03:14:57] pancakes9: just a second, have to fix the logbot :/ [03:15:08] andrewbogott: sure, take your time, will be here [03:20:54] !log editor-engagement moved ee-prototype to virt12 due to space issues on its old host [03:21:22] labs-morebots, what is your deal? step up your game! [03:23:05] I am a logbot running on tools-exec-08. [03:23:05] Messages are logged to wikitech.wikimedia.org/wiki/Server_Admin_Log. [03:23:05] To log a message, type !log . [03:26:59] pancakes9: OK, so 'wikimedia projects' is a very broad scope! Labs is a platform for volunteers to use when working on wikimedia projects, but that doesn't mean that you necessarily would want to work on labs bugs in particular... [03:27:03] does that distinction make sense? [03:27:31] andrewbogott: yes [03:28:32] andrewbogott: where would you recommend I start? [03:28:54] I'm happy to provide resources to move you towards your goal, but am not necessarily so good at providing you with a goal :) I'm looking at a mailing list now, trying to see what has been recommended to other volunteers. 
[03:29:21] Here's a good starting point: http://lists.wikimedia.org/pipermail/wikitech-l/2014-January/073895.html [03:29:38] That is to work on the mediawiki software itself, which is largely in php. [03:30:26] andrewbogott: do you guys need any help building a central page to help new people get onboarded? [03:31:46] Yes, probably :) For instance, the mediawiki page has that 'how to become…' link right on the front: https://www.mediawiki.org/wiki/MediaWiki [03:31:59] We could use a similar thing on https://wikitech.wikimedia.org/wiki/Main_Page [03:32:12] It's something I ponder now and then, but I don't have a great idea about how it would look. [03:32:28] What we really need is a way for volunteers with existing projects to advertise for help [03:33:01] andrewbogott: about to go to sleep but, FYI, new instances in eqiad get the NFS treatment now. [03:33:08] andrewbogott: also, i wanted to help out on puppet-related projects [03:33:19] Coren: Great! does that mean we're done? [03:33:33] andrewbogott: what is eqiad? [03:33:59] pancakes9: Ah, in that case you are in the right place :) user matanya is doing lots of puppet cleanup work, and I'm sure he has tasks he could suggest. [03:34:30] I think he's sleeping and/or working right now, though. [03:34:44] pancakes9: eqiad is the name of our current primary datacenter. [03:35:12] We're in the process of cleaning up/shutting down our old one (pmtpa) and are going to set up a new one (name tbd) [03:35:47] andrewbogott: I think. I mean, as far as I can tell instances are fully functional but we need to switch the class in LDAP for the older ones. Also, what happens about names between the two sites? Can you gave a 'foo-instance' in eqiad and a 'foo-instance' in pmtpa? [03:36:02] Coren: fyi, virt12 still has some wiggle room, so I'm moving an instance or two from virt6 to there. [03:36:21] ... I didn't even know we had a virt12. :-) [03:36:41] Coren: duplicate names should be supported. Indeed, hopefully we can migrate instances by creating duplicates in eqiad and the clobbering their virt file with a copy from tampa. [03:36:56] I haven't tried that yet, but it should be a good trick to keep respective nova DBs happy [03:37:17] andrewbogott: That's great because I'll really want to set up a parallel tools. [03:38:13] andrewbogott: We can't copy between the two fileservers yet (no hole in the firewall) but I think we're golden otherwise. For labs in general anyways. I still need the new tools-db for tools but that's on my plate tomorrow. [03:38:22] pancakes9: The thing I would be working on (if I weren't fussing with datacenter stuff) is converting our existing puppet code to modules and linting it and such. Some of that is very dicey and affects production machines, but some of it is straightforward. That's the kind of work that matanya has mostly been doing. [03:38:31] andrewbogott: You need to tell me when instances created in eqiad will stick though. [03:38:41] Coren: ok, so we need network changes before we can migrate shared storage? [03:38:51] I will ask mark about that when he's awake. [03:39:14] andrewbogott: labstore4.pmtpa (10.0.0.44) <-> labstore1001.eqiad (10.64.37.6) [03:39:19] ok. [03:39:54] andrewbogott: what is your role? [03:40:05] Coren: before we turn this over to volunteers I'm really hoping to have wikitech manage both DCs. Ryan is going to do some work in the next day or so to enable that. [03:40:48] pancakes9: Coren and I are the staff for Labs. 
Coren mostly works on toollabs (which is a subproject of labs) and I work on the general infrastructure/user interface/etc. for labs. [03:41:09] Coren: But as for which instances will stick… well, let's do a bit more testing and a couple of practice migrations. [03:42:08] what are nova DBs [03:42:41] Coren: Then it might be nice to get a clean re-install of the compute nodes, just for good measure... [03:43:03] pancakes9: We're talking about arcane openstack/labs design stuff. Nova is the service that manages virtualization in labs. [03:43:22] andrewbogott: If you think that's desirable, it absolutely is the right time for it. :-) [03:43:50] Well, only because there are some bits and bobs of old VMs on there that I don't know how to clean up properly [03:44:01] Ah, reinstalling the compute nodes won't help with that, though... [03:44:05] so nevermind re: reinstall [03:44:31] At some point the nova dbs will get out of sync, I guess declaring that yesterday was the last sync point is as good as anything. [03:44:53] The only issue here really is that security group changes will diverge between the dcs [03:46:00] So, Coren, the answer is that we probably won't need to delete any eqiad instances. But I don't want to promise that we won't, quite yet. [03:47:21] But we can't sincerely migrate until we do, really. [03:48:03] Right -- but… needs more testing! [03:48:57] pancakes9: do you know where to find the WMF puppet repo? Have you created a Labs account yet? [03:52:11] andrewbogott: i have created an account [03:52:27] andrewbogott: a bugzilla account at least [03:52:36] oh, that's different. [03:52:47] You'll want an account here: http://wikitech.wikimedia.org/ [03:54:10] andrewbogott: done [03:54:45] Bed beckons. [03:54:48] * Coren waves. [03:54:52] 'night [03:55:10] pancakes9: I'm too distracted to suggest anything specific, but if you want to poke around in our puppet repo, it is here: https://gerrit.wikimedia.org/r/#/admin/projects/operations/puppet [03:56:12] andrewbogott: cool, i'll poke around and come back with questions [03:56:31] andrewbogott: what time zone are most people in this room in? [03:56:36] i'm assuming EST? [03:56:39] lots of 'em [03:56:54] right now I'm in GMT+8, which is in SE asia. [03:57:05] Most volunteers are in western Europe or the US. [03:57:13] So that narrows it down to a 16-hour span :) [03:57:26] It's pretty quiet here during my daytime (which, it's noon now for me) [04:10:00] dammit [04:34:10] andrewbogott: what does Yuvi Panda do? [04:34:43] pancakes9: the full rundown is here: https://wikimediafoundation.org/wiki/Staff_and_contractors [04:35:02] Which doesn't tell you what anyone does. [04:35:20] ;-) [04:36:58] Yuvi works on native apps for mobile and lots and lots of other stuff [08:25:01] !log moved bots-sql to virt12 to free up space on virt6 [08:25:01] moved is not a valid project. [08:25:08] !log bots [08:25:17] !log bots moved bots-sql to virt12 to free up space on virt6 [10:11:56] !jira is To request bugs to be moved from JIRA to Bugzilla, please file a bug using http://tinyurl.com/jirabugzilla [10:11:57] Key was added [11:00:10] valhallasw: hey :] I got some Jenkins slaves in labs that let us run tox and pip :-] [11:00:26] gotta write some documentation [12:02:04] hedonil: cool [12:02:09] er, hashar [12:23:43] http://ganglia.wmflabs.org/ down? [12:25:45] Krinkle: A bit slower than what I would hope for, but works for me? 
[12:26:20] It was unavailable completely a few minutes ago (no http response at all) [13:36:13] And ganglia web server seems down again.. [13:37:22] Puppet is failing for cvn instances (possibly other projects as well) according to Nova web dashboard [13:37:23] Not sure what to do about it.. [13:37:32] andrewbogott: CorenAFK: [13:38:02] Krinkle: Willl look into it shortly. [13:38:08] https://wikitech.wikimedia.org/wiki/Special:NovaInstance?projects[]=cvn [13:39:06] Krinkle: I just now caused a brief, partial DNS outage. Are things coming back now? [13:39:48] andrewbogott: Yep, ganglia is back up [13:39:57] cool. Sorry... [13:40:01] and it seems the cvn is back up as well [13:40:31] I only barely understand this issue… part of it is that dns can't come up on virt0 or virt1000 without hand intervention... [13:40:31] andrewbogott: Do yo know how the web proxy service (the one that is "native" to the nova / OSM interface) is configured? e.g. how much delay should I expect for itt o take effect? [13:40:41] I rebooted virt100, which was the source of the original issue. [13:41:04] But I don't quite understand why having two dns servers means that when one is down dns fails. Seems like the opposite of the redundancy we would want [13:41:15] indeed [13:41:22] 'twin engine problem' [13:41:30] The dynamic proxy should work more-or-less immediately. [13:41:37] Well, modulo dns prop [13:41:48] OK, maybe it was not working because of the dns issue? [13:41:56] probably! [13:42:02] Contrary to ganglia, cvn.wmflabs.org was responding but with gateway timeout [13:42:14] One more thing, https://wikitech.wikimedia.org/wiki/Special:NovaInstance?projects[]=cvn shows me the image ID is deprecated. [13:42:34] One says "(deprecated)" the other says "(deprecated 2014-02-18)" [13:42:40] I only created that instance fresh yesterday [13:42:43] already deprecated? [13:43:02] Well… each time we change the base isntance we need to set the old one aside. [13:43:07] So 'deprecated' just means there's a new one. [13:43:08] and puppet is apparently "failing" though it hasn't affected me in any way yet (glusterfs and ssh work fine) [13:43:13] Instances that rely on it are fine. [13:43:24] does ubuntu get security updates? [13:43:47] I don't actually know how instances are configured by default. I think they update but I'm not positive. [13:44:06] what kind of changes do you make to the base, and to what extend can/should/are they applied to existing ones? I would hope that that is mostly possible or being done, otherwise instances would be all very different in the long term, difficult to maintain also. [13:44:15] I fixed a labs-wide puppet but earlier today… but instances should be happy by now. I'll log in to see what's going on. [13:44:33] OK, thanks :) [13:44:40] The changes generally relate to how the instances act when they first come up. Once puppet has applied most instances should be uniform regardless of their base instance. [13:52:05] andrewbogott: /Should/ be. A thing to watch out for in the base images is things that are installed there but not mentionned in puppet. [13:53:03] CorenAFK: sure. [13:54:54] Krinkle: your instances look fine. I don't know why puppet was marked as 'failed' -- everything is green now [13:55:14] andrewbogott: What is your current status? I see the eqiad bastions were destroyed and one recreated (but broken), and manage-nfs-volumes on storage1001 doesn't seem to be able to query the projects from LDAP anymore. You're in the middle of something? 
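A side note on the "dynamic proxy ... modulo dns prop" exchange above: whether a *.wmflabs.org proxy name has propagated can be checked directly with ordinary DNS tools. A minimal sketch, reusing the cvn.wmflabs.org name from this log; nothing here is specific to the proxy itself:

    dig +short cvn.wmflabs.org A    # addresses the public resolvers return right now
    host cvn.wmflabs.org            # same check via the local resolver; compare from inside and outside labs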
[13:55:54] CorenAFK: In theory I'm just reconciling eqiad glance with pmtpa glance. [13:55:58] It's not going that great though. [13:56:03] The ldap thing… I can offer no explanation :( [13:56:26] File "/usr/local/sbin/manage-nfs-volumes-daemon", line 276, in get_hosts [13:56:26] host_ip = host[1]["aRecord"][0] [13:56:26] KeyError: 'aRecord' [13:56:39] I know that worked yesterday. [13:56:50] * CorenAFK looks for a be more verbose toggle in that script. [13:56:56] Maybe because there are 0 hosts? [13:58:38] Ah, projects with no hosts? Hm. [13:58:40] * CorenAFK reads code. [13:59:07] Glance is killing me. It works great for most images, just not the ones that I actually care about [13:59:19] Does python have a perl Data::Dumper()-like thing or PHP print_r equivalent? [13:59:41] "print repr(o)" gets you somewhere. [14:00:54] is there a wiki page for a checklist of things to do wrt the server move? [14:01:12] scfc_de: Indeed it does. It's kinda jsonish, but it's better than nothing. [14:01:37] scfc_de: Thanks. [14:02:19] andrewbogott: Aha. There are hosts entries in LDAP with no aRecord entry. [14:02:20] chippy: I think we're not at a stage where user action is needed. So I would wait for the official "Go!" :-). [14:02:33] scfc_de, okay thanks. [14:02:52] ('dc=i-00000001.eqiad.wmflabs,ou=hosts,dc=wikimedia,dc=org', {'puppetvar': ['realm=labs', 'instanceproject=bastion', 'instancename=bastion-eqiad']}) [14:03:35] andrewbogott: Which may be (part of) the reason why it's listed as 'instance state ERROR' in the UI. [14:03:52] CorenAFK: it's the other way round I think [14:04:01] instances don't come up because glance won't serve the image [14:04:07] hence, ERROR! [14:04:23] And, and since it's not up it never gets an IP. [14:04:34] right [14:04:42] manage-nfs-volumes needs to be able to cope with that. [14:04:45] * CorenAFK goes and updates it. [14:05:33] OK, I feel pretty strongly that if a server is going to return 500 there should be some kind of error message in the log file. Not just a happy 'Successfully retrieved image' [14:06:54] ESPECIALLY if the problem is that the image file can't be read [14:06:58] * andrewbogott yells at glance [14:07:32] Thanks, did a full check and things are running smoothly. It seems the new instance I created is a lot more stable than the old one I created ~ 10 months ago. I'll migrate stuff over and delete the old one. [14:07:34] coren, want me to leave that broken instance in place as a test case? [14:08:17] andrewbogott: I still see one last trace of the instance I deleted yesterday, in ganglia it is marked as 'down'. No worries, but just letting you know. I'll file a bug if you think its a bug. [14:08:35] http://ganglia.wmflabs.org/latest/?c=cvn vs https://wikitech.wikimedia.org/wiki/Nova_Resource:Cvn [14:08:41] Krinkle: I don't have much if anything to do with labs ganglia. I'm sure that whoever does would consider it a bug :) [14:09:48] andrewbogott: The fix is trivial, I'll be able to test in it a few minutes at most. [14:10:05] ok, let me know when I can rebuild. [14:11:40] andrewbogott: Is this sane? https://gerrit.wikimedia.org/r/#/c/114146/ [14:12:14] help. My labs instance fastcci-worker2 is unreachable. ssh kicks me out. Probably the home dir is not there (console in nova shows a login prompt) [14:12:29] dschwen: Yeay gluster [14:12:34] andrewbogott, you fixed it last time [14:12:38] hi coren [14:12:50] dschwen: andrewbogott is in the middle of something. Lemme go kick your gluster. 
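For context on the KeyError above: it is raised for entries under ou=hosts that carry no aRecord attribute, like the dn/puppetvar pair quoted at [14:02:52]. A rough way to list candidates from a shell, assuming an anonymous bind is allowed and leaving out the server and auth options the daemon itself uses:

    # entries that print a dn but no aRecord line are the ones that trip up manage-nfs-volumes
    ldapsearch -x -LLL -b 'ou=hosts,dc=wikimedia,dc=org' '(objectClass=*)' aRecord puppetvar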
[14:12:59] I'm on travel and it apparentky has been down for 48h [14:13:01] What's the project name? [14:13:03] k thx [14:13:04] Coren: yeah, looks good [14:13:06] fastcci [14:13:20] worker2 is the instance [14:13:31] worker1 instance is fine [14:13:48] Coren, I think we have one more volume than gluster can handle… every time I kick one, another one falls off the edge :( [14:14:33] andrewbogott: How "fun". The good news is that gluster is on its way out. [14:14:37] dschwen: Try now? [14:14:42] Coren: yep [14:15:09] hm [14:15:14] login hangs [14:15:27] at least I'm not kicked imediately [14:15:28] dschwen: The gluster mount takes some time. Often over 30s [14:15:40] ok, waiting [14:15:43] Coren: ok, glance is working again, so I think eqiad is back in business. [14:16:03] Lemme +2 my bugfix, test it, and you can redo that bastion. [14:16:08] And, I'll try to resist the urge to purge existing instances in the future :) Sorry if I messed up your workflow [14:17:16] andrewbogott: No worries, this allowed me to find and squish a bug. :-) [14:17:27] Coren: did you miss breakfast? [14:17:51] andrewbogott: Nah-- breakfast is ongoing. :-) [14:18:10] dschwen: Success? [14:18:22] hmm, my login is still hanging [14:18:28] SyntaxError: invalid syntax [14:18:28] I won't blame you if you step away and re-start your workday in an hour... [14:18:42] after the second "If you are having access problems..." [14:18:42] if "aRecord" in host[1] [14:18:42] ^ [14:18:47] ok, clearly I'm of no use doing python code review... [14:18:56] missing a : [14:19:03] Oh duh. Even *I* knew that! [14:20:40] should I try rebooting the instance? [14:21:06] dschwen: Lemme go take a look at it first. [14:22:27] shouldn'thave chosen shared home dirs... [14:23:09] dschwen: No, shared home dirs is good. It's gluster that is evil. AFAICT, it's stuck hard; probably because there is a broken gluster mounting process that's not recovering. A reboot will probably fix you. [14:23:34] Coren, sometimes you have to kill-9 those gluster processes. [14:23:40] And/or restart autofs on the instance [14:23:53] andrewbogott: I know, but not being able to log in as root on the instance makes that difficult. :-) [14:24:01] True! [14:24:19] andrewbogott: Bugfix alpplied and working. Feel free to nuke the broken bastion. [14:24:28] ok, thanks. [14:25:04] Failed to reboot instance fastcci-worker2. [14:25:09] :-( [14:26:57] It lies. It rebooted succesfully. [14:27:06] 14:26:47 up 0 min, 1 user, load average: 1.31, 0.43, 0.15 [14:27:30] With working /home to boot [14:27:33] ha [14:27:35] yes! [14:28:24] dschwen: Gluster is on its way out. Users will get an opportunity to spit on its grave when we decomission it. :-) [14:28:42] haha, sign me up [14:30:06] With the new system, when there's a failure all projects will fail at once! So we'll notice right away [14:30:21] looks like my /data/project stuff is not there though [14:30:21] … as I understand it :/ [14:31:24] what happens when you cd /data/project? [14:31:34] nothng, hangs [14:31:48] hey, that's something! [14:32:01] :-D [14:32:15] (I know I hate that answer, too) [14:32:25] -bash: cd: /data/project: No such file or directory [14:32:33] got this now [14:32:44] yeah, I just killed the volume. [14:32:54] Cd out of the dir, then do 'sudo service autofs restart' [14:33:21] ok, done, trying to cd [14:33:27] works [14:33:59] Sorry this is so fragile. The future will be better, or at least different! [14:34:16] and I'm back in business [14:34:23] ok [14:36:12] thanks guys! 
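The recovery that worked for fastcci-worker2 above amounts to roughly the following on the affected instance; the lazy unmount and kill -9 steps are only needed when the gluster client has wedged, so treat this as the general idea rather than a fixed recipe:

    cd /                                          # step out of the stale mount first
    sudo umount -l /data/project                  # lazy-unmount the hung gluster mount, if still listed
    pgrep -f glusterfs | xargs -r sudo kill -9    # the "kill -9 those gluster processes" step
    sudo service autofs restart                   # as suggested at [14:32:54]
    cd /data/project                              # autofs remounts on first access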
[14:38:33] andrewbogott: It'd certainly be noticable, but I'm expecting failures to be rare now that there is no XFS in the way. :-) [14:42:37] andrewbogott: I note your bastion images are still fuxx0red? [14:43:15] Coren: Hard to say. I can't log in, that's for sure. [14:43:22] Hoping that it's just a slow puppet run... [14:43:27] although that seems less and less likely [14:43:34] Ah, true, that might be it. [14:43:47] Does this image still install autofs because that might interfere. [14:44:06] yes, this is the same image that tampa uses. [14:44:15] Hm. [14:44:20] That worked yesterday. [14:44:24] yeah [14:44:33] They respond to ping, so that's something [14:46:22] well, lemme try to build a new image. Although that might just get us two problems [14:48:27] actually… Coren, what would removing autofs from an instance entail? Just not installing the package? [14:48:56] andrewbogott: Since the new class doesn't install it, yes. [14:49:05] hm... [14:50:50] but is the issue really that the package is present, or that there's an entry someplace telling it to try to mount things? [14:51:03] (I ask because the latter would be easier to remove) [14:52:08] e.g. [14:52:11] andrewbogott: ... in *theory* there shouldn't be anything put in auto.master by the new setup, but having the package installed means the daemon is running by default which might be an issue. [14:52:13] /etc/autofs_ldap_auth.conf /etc/autofs_ldap_auth.conf [14:52:26] andrewbogott: Also, I'm pretty sure the current image contains stuff in auto.master [14:52:27] ^ those two files are currently present [14:58:51] andrewbogott: In the image? [14:59:12] yeah, I'm building an image without them. I also removed the line that explicitly starts autofs [15:02:30] "Interesting"; I see 10.68.16.2 10.68.16.3 10.68.16.4 10.68.16.5 as IPs for bastion-home in eqiad, but there are only two instances. Leftover crud in LDAP? [15:03:42] Also: do we want a Trusty image? I think I'd like to set up the new tools with them if we have one. [15:04:16] We want one but I'll need to build it. [15:04:36] .2 and .3 are real bastions, I don't know why .4 and .5 are still in there... [15:04:51] our code links ldap host entries sometimes. I've worked on that off and on but haven't found a good solution [15:05:16] well, wait, .4 and .5 are in testlabs [15:05:59] Ah, huh. And if the project name is the same, it'll leak into manage-nfs-volumes. Is this an issue? [15:06:30] Can random passerbys get privileges on testlabs? [15:07:12] hm, maybe. [15:09:48] If so, that's an issue because it could be misused to elevate on the real labs. The best solution would be for testlabs to use a different base DN, or to prefix project names. [15:10:30] Alternately, never give bits on testlabs to people who don't have that same bit on the real one. [15:11:22] wait, what's special about testlabs? Seems like this problem could allow random cross-project access in general [15:12:29] Well no; normally only people with admin on project X can create instances in that project. But if testlabs is looser about privileges, someone could create an instance /there/ and gain access to the filesystem. [15:13:13] You'll only be put in the exports if the instance is in the same project. [15:14:18] The issue is "being able to create an instance in testlabs you wouldn't have been allowed to create in the real labs" [15:16:41] Oh… I don't think testlabs does anything that regular labs can't do [15:16:47] it's just another project, isn't it? [15:24:09] It should be. It's important to be aware though. 
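On the "is it the package or the auto.master entries" question above, both are easy to inspect on a given instance; a minimal check, assuming nothing beyond stock Ubuntu tooling:

    dpkg -l autofs                  # is the package installed at all?
    grep -v '^#' /etc/auto.master   # any live map entries left behind by the image?
    sudo service autofs status      # is the daemon running by default?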
[15:31:32] hashar: Do you have time to help me figure out how to get started testing scap in beta today? [15:38:11] yeahhh [15:38:20] * hashar gives donut and coffee to bd808 [15:38:46] bd808: we might want to spam #wikimedia-qa [15:39:01] * bd808 will join there [15:48:11] hashar: @tox: cool, nice to hear! [15:49:22] andrewbogott: Coren: looks like something got broken on deployment-prep project. I can't log on instance (examples: deployment-bastion.pmtpa.wmflabs ) [15:50:11] I can log on deployment-prep-master.pmtpa.wmflabs , I think it is running an old fork of operations/puppet.git though [15:54:17] hashar, any better now? [15:54:27] trying [15:54:37] nop :( [15:54:54] wondering if it is related to NFS [15:55:10] dunno [15:58:52] ah [15:58:58] managed to log on deployment-sql.pmtpa.wmflabs [15:59:09] andrewbogott: I'm having the same problems as hashar. I can ssh directly to bastion[123].pmtpa.wmflabs but proxying through to any internal labs instance is failing. [15:59:10] in my home dir the files belong to UID and GID 4294967294 :D [15:59:17] labnfs.pmtpa.wmnet:/deployment-prep/home on /home type nfs4 (rw,port=0,nfsvers=4,hard,rsize=8192,wsize=8192,sec=sys,clientaddr=10.4.0.53,sloppy,addr=10.0.0.45) [15:59:39] same problem on /data/project which is NFS as well [15:59:48] labnfs.pmtpa.wmnet:/deployment-prep/project on /data/project type nfs4 (rw,port=0,nfsvers=4,hard,rsize=8192,wsize=8192,sec=sys,clientaddr=10.4.0.53,sloppy,addr=10.0.0.45) [15:59:55] This is definitely a Coren issue, I can't really help [15:59:57] deployment-sql.pmtpa.wmflabs works for me too [16:00:02] andrewbogott: thanks :-] [16:00:12] Ah, hm. Lemme check. [16:00:21] Coren: NFS broke somehow. /home /data/project yields UID and GID 4294967294 on deployment-prep. [16:01:12] bd808: so yeah beta access is "restricted" from time to time :-] I would say once per month [16:01:38] Keeping the riff-raff out :) [16:01:40] bd808: it uses to be two or three times per week, often for a days or two [16:01:43] used [16:01:50] GlusterFS was evil :] [16:02:03] I'be been doing a few changes in preparation to the move to eqiad; none of them should have affected pmtpa but apparently something slipped in. (Or it's a coincidence but I don't believe in those). [16:02:15] anyway nowadays it is surprisingly stable. Most issues with beta are usually in mw code and sometime in varnish code. [16:02:28] Coren: I can attempt rebooting / running puppet [16:02:44] hashar: Gimme a few minutes to try to see what's up first. [16:02:51] okkk [16:03:25] Coren: If it helps debug scholarship-alpha.pmtpa.wmflabs seems to be effected as well [16:04:36] deployment-sql.pmtpa.wmflabs lets me in but shows weird uid/gid ownership. scholarship-alpha.pmtpa.wmflabs and deployment-bastion.pmtpa.wmflabs say that my key is denied. [16:08:16] Ima reboot scholarship-alpha for testing. [16:08:30] * bd808 approves [16:10:20] scholarship-alpha was made all better with a simple reboot (it had not been rebooted for 120-odd days, and there were a few NFS outages in that interval) [16:10:26] deployment-bastion next. [16:10:53] autofs really doesn't cope well with change; it's generally the first things to go. [16:11:48] NFS is fine on deployment-bastion, but again autofs got confused and is no longer able to mount the SSH keys. rebooting. [16:12:03] * Coren tries something else first. [16:12:35] !log wikimania-support ssh to scholarship-alpha restored after Coren rebooted instance [16:12:54] * bd808 looks around for logmsgbot [16:13:09] Nope. autofs won't even restart. 
[16:13:30] In other good news, eqiad instances no longer rely on autofs either. [16:14:20] Yep. Rebooting deployment-bastion has autofs working again and being able to mount keys. [16:15:46] Coren, I'm still working on a new image, but you might also want to look at https://gerrit.wikimedia.org/r/#/c/114127/ with a suspicious eye [16:15:52] (after you resolve current issue) [16:16:53] andrewbogott: Wouldn't it be better to remove the explicit includsion from LDAP in the first place? [16:17:26] hashar: I think deployment-sql will need to suffer the same fate. Anything I need to do before I can safely reboot it? [16:17:29] maybe? If you have a way of doing that that won't take a day of work, please do so! [16:17:56] andrewbogott: I'm thinking a ldapvi with a global search-and-replace would work. [16:18:09] andrewbogott: I don't know if I've got credentials for it though. [16:18:13] I've never used ldapvi, but, sure, that seems reasonable [16:18:32] novaproxy probably has [16:19:34] * Coren tries while hashar gives the all-clear for a reboot. [16:20:28] Wait, novaproxy from where? :-) [16:21:20] Coren: rebooting an instance :] [16:21:24] deployment-apache32.pmtpa.wmflabs [16:21:33] hashar: I think deployment-sql will need to suffer the same fate. Anything I need to do before I can safely reboot it? [16:21:35] ^^ [16:21:42] na just reboot [16:21:46] it is the main database server [16:21:53] hopefully mysql can suffer a reboot :-] [16:22:31] !log deployment-prep rebooting apache32 and apache33 breaking beta :-] [16:23:12] Coren: apache32 reachable again :] [16:23:29] !log deployment-prep rebooting -bastion [16:23:36] So yeah. Autofs breakage precipitated by gluster breakage. [16:23:39] :-( [16:23:52] bluster must dieeee [16:24:14] We got a gun to its head now, just waiting for the right time to pull the trigger. [16:24:41] !log deployment-prep -bastion : /etc/init.d/udp2log stop && /etc/init.d/udp2log-mw start (known bug) [16:25:00] Does anybody know how to restart the log bot? It seems to be awol [16:25:01] Coren: so all fixed up by rebooting instances. Thank you :-] [16:25:28] bd808: https://wikitech.wikimedia.org/wiki/Morebots#Example:_restart_the_ops_channel_morebot [16:25:44] that is all I know [16:28:03] So we need somebody in the morebots tool group... [16:28:15] labs-morebots, are you here? [16:28:15] I am a logbot running on tools-exec-08. [16:28:15] Messages are logged to wikitech.wikimedia.org/wiki/Server_Admin_Log. [16:28:15] To log a message, type !log . [16:28:37] bd808: I restarted it earlier but it's having trouble accessing the wiki. I don't know why and can't really debug it right now [16:28:44] * bd808 nods [16:31:33] andrewbogott: ldapvi seems to be working. Do I save? I did a g/ldap::role::client::labs/s//role::labs::instance/ [16:31:53] 495 hosts updated that way. [16:32:39] Coren, conference call? [16:32:47] andrewbogott: Sure. [17:07:44] andrewbogott: So primary action item for andrewbogott is unbreak things. [17:07:55] yeah :( [17:33:16] andrewbogott: So the change in LDAP is done and your change reverted; but will wikitech add the old class or the new class if people create more instances? [17:33:41] in tampa it's still adding the old class. I'll fix that shortly (but you should nag me in 15) [17:42:10] Coren, ok, both DCs are including 'role::labs::instance' automatically now. [17:42:16] Did you verify that puppet still works in tampa? [17:42:23] I did. It does. [17:42:34] Unless someone sneakily created a new instance between my edit and your fix. 
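For the record, the ldapvi step above boils down to something like the following; connection and bind options are omitted here (ldapvi can take them on the command line or from its config), and the diff it shows before committing is the safety net for a 495-entry change:

    ldapvi -b 'ou=hosts,dc=wikimedia,dc=org'
    # inside the editor, the substitution quoted at [16:31:33]:
    #   :g/ldap::role::client::labs/s//role::labs::instance/
    # write and quit; ldapvi then displays the changes and asks before applying them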
:-) [17:46:22] * Coren actually checks that. [17:47:33] Nope. All good. [17:47:54] So… what does it mean that I can ping all these new instances but they don't have root keys? [17:48:03] Unlikely to be a networking problem, right? [17:48:19] Seems like probably puppet is failing to run… [17:48:30] would you expect us to have access in the event of a puppet failure? [17:48:37] I'd say. I should think it means puppet breaks at some point. Is there anything visible in the puppetmaster log? [17:49:00] Well, we have no access, not even a syslog [17:49:01] andrewbogott: That's hard to arrange without baking in keys in the image; though we /could/ stuff the new server key there. [17:49:14] the thing is, I have nova configured to insert my key on startup. [17:49:20] No, I mean in the log of the puppet /master/. Perhaps the attempt is visible there. [17:49:27] ah, I see. [17:49:29] * andrewbogott looks [17:49:48] Ohwait! Certificates, perhaps? Is the old certificate axed when an instance is deleted? [17:50:58] could be certs... [17:51:58] Coren: how would I tell? puppetca -l shows some pending certs but not for any of the instances in question... [17:53:11] Well, they wouldn't be /pending/ certificates. [17:53:13] Try this: [17:53:24] puppet cert --clean {node certname} [17:53:55] That'll remove an existing signed cert. Rebooting the instance then should make a new signature request thing. [17:57:18] Coren, sorry, I don't know what to put for {node certname} [17:57:31] is that one thing or two? [17:58:34] It's the name on the certificate, the node normally. Go by what puppetca -l shows you as the right model; I haven't played with our puppetmaster yet so I don't know if you should expect i-0000xxx or nodename.foo.bar in there [17:59:03] interesting, it was gracious about a few, but for the bastions it said 'err: Could not call revoke: Could not find a serial number for i-00000003.eqiad.wmflabs' [17:59:19] Anyway… what I really don't understand is why my injected key isn't working. That doesn't have anything to do with puppet [18:02:35] Coren, please join us in wikimedia-office? [18:02:52] Yep [18:03:25] * Silke_WMDE was about to ask the same [18:03:27] :) [19:14:05] !log tools tools-login: Disabled crontab and pkill -HUP -u fatemi127 [19:14:19] morebot? [19:14:37] labs-morebots: ? [19:14:37] I am a logbot running on tools-exec-08. [19:14:37] Messages are logged to wikitech.wikimedia.org/wiki/Server_Admin_Log. [19:14:37] To log a message, type !log . [19:16:49] "If you want to run a bot, please be advised that bots cannot be run on tools-login; you need to run them as a job on the grid. See https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/Help#Submitting.2C_managing_and_scheduling_jobs_on_the_grid for how to do that." -- *argl* [19:20:24] * valhallasw hands scfc_de a cup of coffee [19:21:02] http://cdn.memegenerator.net/instances/500x/46224666.jpg [19:24:36] valhallasw: I'm looking for an ASCII version of that for a "disable-crontab USER" script :-). Something that really says, "I mean bizness, bro!" :-). [19:27:35] scfc_de: :D [19:27:54] scfc_de: maybe the login banner used on the exec hosts? [19:29:11] or something here: http://www.retrojunkie.com/asciiart/logos/stopsign.htm [19:31:14] a) Cool, thanks! :-) b) What odd Google ads on top :-). [19:31:22] valhallasw: nothing works, we tried banned before [19:31:32] * banner [19:40:26] Hm.. web proxy having issues again? 
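A possible shape for the "disable-crontab USER" helper wished for at [19:24:36]; the script name, paths and wording are invented here, and keeping a backup before crontab -r is the part that matters:

    #!/bin/bash
    # disable-crontab: back up and remove a user's crontab, then try to tell them why
    user="$1"
    backup="/root/crontab.${user}.$(date +%Y%m%d)"
    crontab -u "$user" -l > "$backup" 2>/dev/null    # keep a copy so it can be restored
    crontab -u "$user" -r                            # drop the active crontab
    msg="Bots must not run on tools-login; please submit them as jobs on the grid instead."
    echo "$msg" | write "$user" 2>/dev/null || true  # best effort; only reaches users who are logged in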
[19:40:43] Getting 502 Bad Gateway on http://cvn.wmflabs.org/api.php again [19:41:42] I just disabled all gadgets that use cvn 2 days ago, re-enabled them today after dns issue was resolved. Now getting js errors and bug reports again for it being down. [19:43:49] Coren: andrewbogott_afk [19:44:05] Sorry to keep pinging, not sure what else to do. [19:44:33] I'm getting {"error":"missing-query"} myself. [19:44:49] So I'm definitely talking to it. [19:46:14] scfc_de: petan: you could use write command which is more impressive ;) [19:46:34] what u mean [19:47:18] Krinkle: I don't see 502s at all. Do you still have them? [19:47:41] Yes [19:48:06] That {"error"} is the expected response if accessing api.php without a query [19:48:13] petan: did you receive the message ? [19:48:27] no [19:48:29] but I'm getting nginx bad gateway instead, which is from the web proxy (cvn itself runs apache) [19:48:32] hedonil: The problem is that those users are usually not online :-). [19:48:52] doing curl localhost/api.php or curl cvn.wmflabs.org/api.php from within a labs instance works though. [19:49:48] traceroute: Warning: cvn.wmflabs.org has multiple addresses; using 208.80.153.190 [19:49:49] traceroute to cvn.wmflabs.org (208.80.153.190), 64 hops max, 52 byte packets [19:50:00] 13 10ge5-1.csw5-pmtpa.wikimedia.org (84.40.25.102) 143.250 ms 136.008 ms 137.995 ms [19:50:05] 14 208.80.153.190 (208.80.153.190) 131.576 ms 134.635 ms 137.360 ms [19:50:26] that's from my own machine [19:50:48] curl 'http://cvn.wmflabs.org/api.php' [19:50:48] 502 Bad Gateway [19:50:49]
nginx/1.5.0
[19:52:48] scfc_de: yes, but some times they are - and I think it's a bit better than just the 'hidden' mails, which are easy to miss [19:54:50] petan: Hmm. the message should have appeared at your console session pts/117 on tools-login... [19:55:20] hedonil: do you have an idea how many terminals on how many servers and laptops I have open? [19:55:23] more than 200 atm [19:55:29] petan: first banner, then ban :p [19:56:19] I don't even know I have such a terminal on pts/117 open, and I don't even know on which PC it is [19:56:29] hedonil: For example the user just now wasn't online and I don't remember others being; I'll add a on-wiki message to the script. In the end it's their "duty" to read the rules (or at least use some common sense :-)). [19:56:33] petan: hehe [19:57:27] "rm -Rf /" -- ooops, wrong window :-). [19:59:13] :O [19:59:32] Is there a problem with Labs? [19:59:46] specifically, nginx [19:59:56] Krinkle: Hmmm. [20:00:00] cvn.wmflabs.org has address 208.80.153.190 [20:00:01] cvn.wmflabs.org has address 208.80.153.214 [20:00:16] It looks as though two instances have the same public name. [20:00:35] Clearly only one of those is correct; I'm hitting the right one and you're hitting the wrong one. [20:00:46] huh: What do you mean by "nginx"? [20:01:06] scfc_de: I mean https://cvn.wmflabs.org/api.php is giving 502 errors [20:01:39] "502 Bad Gateway (nginx/1.5.0)" [20:01:45] huh: See the (ongoing) conversation between Coren and Krinkle :-). [20:01:47] petan: you must own extraordinary incredible multitasking skills ;-) [20:01:54] !logs [20:01:54] raw text: http://bots.wmflabs.org/~wm-bot/logs/%23wikimedia-labs/ cute html: http://tools.wmflabs.org/wm-bot/logs/index.php?display=%23wikimedia-labs [20:01:58] no, but my laptop does [20:06:38] Coren: cvn.wmflabs.org used to be served by a manually associated IP (208.80.153.131 via https://wikitech.wikimedia.org/wiki/Special:NovaAddress), with instance cvn-apache2 (now deleted). I created instance cvn-apache4, set up web proxy for cvn -> cvn-apache4, and then disassociated the IP, and deleted the cvn-apache2 instance. [20:06:51] So the IP it uses now is one handled internally by web proxy. [20:07:08] Yeah, but it has both it seems. [20:07:20] Which is going to cause you no end trouble. [20:07:23] No, 208.80.153.131 is neither of the two you named. [20:07:31] Ah, good point. [20:07:40] That IP is no longer used, shouldn't be an issue. [20:07:55] using web proxy I don't need to assign a public IP manually, right? [20:08:21] Lemme try to figure out where that comes from. [20:08:40] I'm guessing 208.80.153.190 is the bad one. [20:08:56] !log tools tools-login: Disabled crontab and pkill -HUP -u fatemi127 [20:09:16] scfc_de: does wm-bot trust you? [20:09:38] you're not logged in with a wikimedia/* mask [20:10:03] cloak is the right term, I think [20:10:46] valhallasw: !log is handled by morebots, which I HUPped just now. Let's see. [20:10:58] Krinkle: Yeah, I see it used by the same server that has {blue,pink,green}lake.wmflabs.org [20:11:00] !log tools tools-login: Disabled crontab and pkill -HUP -u fatemi127 [20:11:03] oh, right. Still, you're nog logged in to nickserv, so that might be an issue. [20:11:17] (not sure how morebots handles authentication) [20:11:43] (No, again: "AttributeError: 'module' object has no attribute 'enable_twitter'". *argl*) [20:12:28] Krinkle: I know of no obvious way to figure out which instance this matches, though. [20:12:30] valhallasw: I am *not* logged in to NickServ?! 
I'm pretty sure I protected this nick, it just has no "cloak". [20:12:56] Aha, it looks like wikisteam-web in wikisteam [20:13:12] Coren: Does webproxy use one public IP per subdomain <-> instance/port, or does it do multiple and use the incoming subdomain to know where to forward to? [20:13:14] scfc_de: DOH. Yes, you're right, you're logged in according to nickserv. [20:13:45] It uses the name; but there in addition to the proxy there is something else that snagged the same name. [20:14:06] scfc_de: based on the adminbot source, check the config.py file for a line enable_twitter = ... [20:14:15] Coren: cvn.wmflabs.org should map to cvn-apache4.pmtpa.wmflabs:80, which, though I can't see it, I guess is done by 208.80.153.214. [20:14:32] (but the code there doesn't actually use that parameter as far as I can see) [20:14:52] no, I'm blind. adminlog.py. [20:15:18] Krinkle: I don't think I'm clear. Because the proxy maps your domain name, it takes the name for itself. There is /also/ an instance that takes the name for itself; this one explicitly. [20:15:38] Krinkle: It's the latter that's causing an issue. :-) [20:15:58] valhallasw: confs/labs-logbot.py doesn't set that parameter; perhaps the default is 'yes'? I'll add an explicit no and restart. [20:16:27] scfc_de: I think there just is no default value, causing it to choke [20:16:33] Coren: OK. The only thing I can imagine this is is the old cvn-apache2, which used to claim cvn.wmflabs.org when it had its own public IP. But I've disassociated that IP and that instance has been deleted (plus, it was a different instance) [20:17:14] Krinkle: Edsu also took that name for their instance wikisteam-web in the wikisteam project. [20:17:27] ? [20:17:38] scfc_de: note that the logs *do* make it to the wiki [20:17:47] ( https://wikitech.wikimedia.org/wiki/Labs_Server_Admin_Log ) [20:19:08] valhallasw: Ah! Well, eh, ... [20:19:20] Let's try again, anyway. [20:19:25] !log tools tools-login: Disabled crontab and pkill -HUP -u fatemi127 [20:19:27] Logged the message, Master [20:19:33] Krinkle: That's arguably a bug in the openstack extension and/or Yuvi's proxy: that you can have the same name picked in both. But that's currently the case: there are two addresses because someone /else/ is also trying to use cvn.wmflabs.org [20:19:40] Okay, so let's clean up the SAL then. [20:20:21] Coren: Did wikistream also claim cvn as web proxy, or by other means? [20:20:36] Explicitly, as a public IP for one of their instances. [20:21:15] Do you have a log of that the action of assigning that in wikistream? Or is it possible that maybe it fell into that project due to a bug? [20:21:37] !log tools morebots: Set "enable_twitter=False" in confs/labs-logbot.py and restarted labs-morebots [20:21:38] Logged the message, Master [20:21:39] I find it hard to believe they would claim cvn.wmflabs.org, cvn predates wikistream, that would never work. [20:22:02] it's too bad morebots cannot be used for the projects within tools :-( [20:22:02] I don't think that's logged, but I don't know that Yuvi's proxy actually checks for conflicts (which it should) [20:22:11] * Coren removes the cvn username. [20:22:15] s/user/host/ [20:23:04] So cvn and wikistream both had a public IP assigned, the cvn one was the one that actually got it (reliably) for the past months, and today I deleted it and set up an instance proxy instead, and then the wikistream one started getting into the dns as well. [20:23:06] Interesting. 
[20:23:39] So though both the web proxy and the nova public IP assignment are subject to conflicts, the public IP assignment at least doesn't let both go to DNS (it only advertises one, by whatever arbitrary means) [20:23:43] There's definitely a bug /somewhere/, because the actual instance doesn't have most of those names. [20:24:08] I *hope* that DNS is derived from LDAP, because I know I can clean /that/ [20:25:11] Thanks :) [20:25:55] although it shouldn't be too hard to adapt... [20:27:49] Krinkle: I don't know how fast/often that DNS sync occurs. Let's wait a few minutes and see [20:29:54] Krinkle: Nope. Not working. I can't seem to find the source of this. Debugging. [20:30:30] It's workign for me now. Maybe it gave me the other IP now. [20:32:12] Yeah, DNS is still returning both A records. [20:33:07] 208.80.153.190 seems to be a misconfigured nginx. [20:37:28] !log local-gerrit-patch-uploader Does this work? A reading of the source suggests it should. [20:37:29] Logged the message, Master [20:37:41] Krinkle: I'd hunt it down, but atm I have my hands full. You can work around the issue by changing /your/ hostname for the moment. [20:37:42] * valhallasw checks where that ended up [20:38:53] not surpisingly, https://wikitech.wikimedia.org/wiki/Nova_Resource:Local-gerrit-patch-uploader/SAL [20:39:13] !log gerrit-patch-uploader Does this work? A reading of the source suggests it shouldn't. [20:39:14] gerrit-patch-uploader is not a valid project. [20:42:47] Coren: I can access it through instance-proxy.wmflabs.org, but using a different subdomain doesn't make much sense since this hostname is actively used as an API by various gadgets and tools, and has been for several months now. [20:43:14] Ah, bleh. [20:44:30] Ah, I think I managed to forcibly remove it. [20:47:13] Looks like that did the trick. [23:49:39] hi there, have there been database changes about two days ago? a query that used to work now causes the error: ‘UPDATE command denied to user 'p50380g50752'@'10.4.0.220' for table 'user'‘
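For the UPDATE error in the final question, a first diagnostic on the reporting side would be to confirm what privileges those credentials still carry. A hedged sketch, assuming the usual per-tool ~/replica.my.cnf credentials file and with the database host left as a placeholder:

    # if UPDATE is missing from the output, the error above is the expected behaviour
    mysql --defaults-file="$HOME/replica.my.cnf" -h <db-host> -e 'SHOW GRANTS;'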