[01:59:35] [bz] (NEW - created by: Peter Bena, priority: Normal - enhancement) [Bug 35947] Enable ipv6 on labs - https://bugzilla.wikimedia.org/show_bug.cgi?id=35947
[06:26:07] hi Ryan_Lane
[06:26:23] legoktm@bots-bnr2:~$ cd /public/datasets/public/enwiki
[06:26:23] -bash: cd: /public/datasets/public/enwiki: Invalid argument
[06:26:33] been having that issue for the past 24 hours
[09:03:54] legoktm: he's UTC-4 atm, fyi
[09:04:02] * jeremyb_ heads to sleep
[09:04:04] oh
[09:04:10] i just saw that he joined the channel
[13:28:51] !logs
[13:28:52] logs http://bots.wmflabs.org/~wm-bot/logs/%23wikimedia-labs
[14:32:35] Coren, is storage still misbehaving? If so can you give me a specific example?
[14:32:46] um… petan, same question?
[14:33:52] andrewbogott: hi
[14:33:54] yeah
[14:33:56] let me check
[14:34:13] legoktm@bots-bnr2:~$ cd /public/datasets/public/enwiki
[14:34:13] -bash: cd: /public/datasets/public/enwiki: Invalid argument
[14:34:53] legoktm, any idea how long that has been the case?
[14:35:05] lemme check my logs
[14:35:08] 24h+
[14:35:22] d'you know a maximum time?
[14:36:03] i first noticed it at
[14:36:13] 1AM CDT yesterday
[14:36:14] When did you last see it working properly?
[14:36:17] so in utc thats
[14:36:23] oh
[14:36:31] probably 36 hours ago?
[14:37:23] ok, that's useful, thanks.
[14:38:17] legoktm: better?
[14:38:42] yay!
[14:38:54] let me run my dump scan script now :)
[14:38:55] thanks
[14:39:31] can you make sure that you can write as well as read?
[14:40:01] er, should i be able to write in /public?
[14:40:11] legoktm@bots-bnr1:/public$ mkdir test
[14:40:11] mkdir: cannot create directory `test': Permission denied
[14:40:16] reading works fine
[14:40:19] That's probably fine then.
[14:40:29] Let me know if it further misbehaves.
[14:40:38] will do
[14:41:47] * andrewbogott feels like a surgeon whose only tools are a hacksaw and a mallet
[14:56:34] andrewbogott: what did you do to fix it?
[14:56:50] ah
[14:56:54] Restarted glusterfs-server on labstore1 and labstore2 (pretty sure the former worked and the latter did nothing)
[14:57:46] yeah
[14:57:49] that's NFS
[14:58:20] the autofs entry only points to labstore1
[16:24:55] MaxSem: Hi
[16:28:28] hey apmon
[16:40:18] MaxSem: Do you know what the next steps are for the osm setup?
[16:41:43] apmon, we need Varnish config - I can poke at it
[16:44:28] also, I need to port Munin plugins to Ganglia
[16:44:34] Do you know if we can test the rados implementation in labs?
[16:45:20] Also it would be good to test the fail-over and load balancing features of mod_tile / renderd
[16:46:29] mhm
[16:46:43] should be doable:)
[16:48:52] hey paravoid, what did your OSM discussion with Mark conclude with?
[16:52:26] Do you know what the state of maps-tiles{1,2} are? Can those be used for fail-over testing?
[16:53:05] don't know, I didn't setup anything on them
[16:54:42] sigh, was busy with other stuff, need to resume with OSM
[16:55:03] looks like they might be fairly blank still then
[16:55:46] I can set them up then if no one objects
[16:56:24] uh, I still can't log into them
[16:58:09] Ryan_Lane, I have problems with logging into maps instances - my home directory is absent
[17:08:37] A question! Could a local chapter of Wikimedia Foundation run its website on Wikimedia Labs?
[17:15:09] technically, possible. practically, ugh
[17:15:49] WMF can host chapters' wikis in its main infrastructure = did you know that?
[17:16:50] ryuch: Labs isn't really set up for high-performance production uses like that. I think you'd be very unhappy with the results if you used labs.
[17:23:30] <^demon> I doubt most chapter wikis are exactly high-performance, but yeah, they should go on the cluster, not labs.
[21:55:26] Ryan_Lane: Something up with labs?
[21:55:28] http://ganglia.wmflabs.org/latest/?r=day&c=cvn&h=cvn-app1
[21:55:33] Instance died a few hours ago
[21:55:37] can't ssh into it either
[21:55:44] Creating directory '/home/krinkle'.
[21:55:44] Unable to create and initialize directory '/home/krinkle'.
[21:55:52] (why would it have to create my home directory?)
[21:55:53] let me look at its log
[21:55:59] Connection to cvn-app1.pmtpa.wmflabs closed.
[21:55:59] Killed by signal 1.
[21:56:25] that project's gluster volume wasn't acting properly
[21:56:28] so I force started it
[21:56:35] when
[21:56:41] a couple hours
[21:57:13] actually, no
[21:57:19] it wasn't in the list of ones force started
[21:57:36] one sec
[21:57:54] I can ssh in as root, so it isn't dead.
[21:58:03] dmesg
[21:58:05] err
[21:58:24] http://ganglia.wmflabs.org/latest/?r=hour&cs=&ce=&c=cvn&h=cvn-app1&tab=m&vn=&metric_group=
[21:58:24] it looks dead
[21:58:34] that just means ganglia stopped reporting
[21:58:44] [564456.497018] gmond[29365]: segfault at 0 ip 00007ff451f42879 sp 00007fff08341dc8 error 6 in libc-2.15.so[7ff451eb1000+1b5000]
[21:59:05] [560192.339748] Core dump to |/usr/share/apport/apport 25312 6 0 pipe failed
[21:59:12] and it doesn't automatically restart?
[21:59:17] no
[21:59:20] k
[21:59:22] services generally don't do that
[21:59:36] not by themselves, but I figured at least puppet would
[21:59:46] I mean, I didn't install ganglia on it (obviously)
[22:00:09] yeah, puppet should, assuming it's running proper;y
[22:00:12] *properly
[22:00:21] Apr 1 01:28:43 cvn-app1 kernel: [560192.339748] Core dump to |/usr/share/apport/apport 25312 6 0 pipe failed
[22:00:28] it was last run 4 minutes ago according to the motd I get
[22:00:40] (shortly before it disconnects, I never get a prompt)
[22:00:51] * Ryan_Lane nods
[22:01:04] something is throwing shitloads of core dumos
[22:01:04] One of the bots oom'ed last week
[22:01:05] *dumps
[22:01:20] well, it oomed today, but it was building up over the last few days (check ganglia)
[22:01:32] I'll need to fix that but right now I'd like to get it up and running again first
[22:01:44] * Ryan_Lane nods
[22:01:46] one sec
[22:01:48] k
[22:02:49] I restarted autofs and now /home is working
[22:02:56] the fuse kernel module was having issues
[22:03:53] I got a prompt now
[22:04:00] /data ls: cannot access project: Transport endpoint is not connected
[22:04:06] I just restarted ganglia monitor
[22:05:06] related to fuse issues
[22:05:15] /data/project needs to be unmounted
[22:05:23] something is holding it open
[22:05:54] probably the bot
[22:06:07] Yeah, one of them
[22:06:10] Killing now
[22:06:16] will give us a 4 minute window
[22:06:33] go
[22:06:47] done.
[22:07:00] I wonder what caused the kernel instability
[22:07:07] need to figure out what's dropping the core dumps
[22:08:11] I wouldn't be surprised if it was the gluster client itself
[22:08:28] thankfully we're sprinting this week and next to get rid of it
[22:11:31] !log cvn cvn-app1 went OOM and kernel failed because of it (/home and /data unavailable). sillalive has auto-started the bots and we're back up.
[22:11:33] Logged the message, Master
[22:12:00] It's possible that the OOM killed the gluster client for /hom
[22:12:07] *home
[22:12:12] it should have been logged in kern log, though
[22:12:20] I didn't see an OOM
[22:31:03] Ryan_Lane, I have problems with logging into maps instances - my home directory is absent
[22:57:01] MaxSem: Which instance?
[22:57:10] Coren, any on maps
[22:57:26] *maps project
[22:57:47] Can you give me one by name? My permissions are odd; I can go check but not see the actual project. :-)
[22:58:28] maxsem@bastion1:~$ ssh maps-tiles1
[22:58:28] Creating directory '/home/maxsem'.
[22:58:28] Unable to create and initialize directory '/home/maxsem'.
[23:01:08] Huh. Something must have prevented puppet from running on those instances because I can't get to 'em either. Sorry.
[23:01:38] MaxSem: But that's 99% sure to be a gluster issue.
[23:02:56] Coren, they were this way from the beginning
[23:03:59] Coren: I can log into those instances
[23:04:43] and I can run puppet from within those instances and it doesn't seem to help
[23:05:51] it's not an instance-level problem, I can't log into any of maps instances since I was added to that project
[23:09:44] apmon, is /home on maps read only by chance?
[23:19:01] MaxSem, try ssh'ing to maps-tiles1? I don't think it'll work but I want to watch the logs.
[23:19:19] andrewbogott, done
[23:19:34] ok, I see a problem...
[23:20:50] andrewbogott: No, I can write to my homedir
[23:21:02] apmon: ok thanks
[23:51:08] MaxSem: OK, try once more?
[23:51:27] andrewbogott, wheee - -thanks
[23:52:07] MaxSem: Cool.
[23:52:11] aude_, ping?
[23:52:39] probably too late for her
[23:53:02] ok. If you see her… there are a few files in her homedir on maps that are corrupt due to split-brain. I think nothing important.
[23:53:20] She can find me on IRC to clean up the files if she's not able to delete them herself, or if anything needs to be rescued.
[23:54:13] ok, thanks