[00:05:33] i must be doing something wrong with puppetmaster::self ... i keep getting Invalid resource type systemuser .. Invalid resource type system_role
[00:06:01] basically this http://makandracards.com/makandra/9581-how-to-deal-with-puppet-parser-ast-resource-failed-with-error-argumenterror-invalid-resource-type-at
[00:19:42] Coren: tools.wmflabs.org itself is rather slow. I was wondering if it was my proxy :)
[00:28:11] yuvipanda_: No; the old-style setup with apaches is groaning under the strain of rewriterules on NFS combined with high volumes of some tools.
[00:28:25] yuvipanda_: All the more reason to get rid of them.
[00:28:30] heh
[00:30:16] Coren: i just realized, that means that every request is hitting NFS
[00:30:56] yuvipanda_: I don't think Apache is stupid enough to read the file on every request, but still.
[00:33:11] Coren: I'm testing a puppet change for proxying back to just apaches today. I'll figure out how to proxy to URL based things tomorrow.
[00:33:16] testing that patch now
[09:57:45] Coren: poke when around, I've a patch that should let us use the proxy for tools :D
[10:39:09] Coren: oh, and portgrabber is in perl
[10:39:13] * yuvipanda grumbles
[14:13:39] !tunnel
[14:13:39] ssh -f user@bastion.wmflabs.org -L <local_port>:<server>:<remote_port> -N. Example for sftp: "ssh chewbacca@bastion.wmflabs.org -L 6000:bots-1:22 -N" will open bots-1:22 as localhost:6000
[14:48:56] * yuvipanda gently pokes Coren
[14:49:13] Hm?
[14:51:07] Coren: https://gerrit.wikimedia.org/r/#/c/97690/
[14:51:29] Coren: with that, the only thing needed to make this replace webproxy is a portgrabber update :)
[14:51:33] Coren: can you merge?
[14:51:48] I can. Should I? :-)
[14:51:56] yes :D
[14:52:15] there are two more patches, but let's leave them as is for now - need andrewbogott to check if they'll affect wikitech before merging
[14:52:51] but those patches don't affect tools
[14:56:54] Coren: thanks!
[14:57:23] Coren: I realized that portgrabber and granter are in perl :( think you can help add the redis stuff to those?
[14:57:39] or should I attempt to rewrite those in python? :D
[14:57:44] yuvipanda: Yeah, not an issue.
[14:58:03] yuvipanda: Do you have docs on exactly how it should talk to your stuff?
[14:58:23] Coren: no but it is simple enough. Want me to write 'em up or just tell you?
[14:59:01] It's best documented for posterity
[14:59:09] Coren: cool, gimme a sec
[14:59:12] it's fairly trivial
[15:06:15] is it correct that commonswiki_p at dewiki.labsdb is empty?
[15:06:38] or am I looking at the wrong place?
[15:09:35] Coren: https://wikitech.wikimedia.org/wiki/User:Yuvipanda/Dynamicproxy_URL_routing?venotify=created
[15:09:43] a bit rough but should explain most of it?
[15:10:43] Seems clear enough.
[15:10:52] Coren: think you'll have time for it today?
[15:11:01] Probably.
[15:11:09] this makes me super excited, since I can run go / nodejs stuff off the grid with it :)
[15:15:39] Coren, :DDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDD
[15:17:03] All hail Shub-Niggurath of the many mouths?
[15:17:27] Coren, you know why I'm happy right?
[15:17:50] you've got lots of D?
[15:17:58] *Ds
[15:18:30] yuvipanda, that and something else. What I now have starts with a D as well.
[15:18:44] heh
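A minimal sketch of the !tunnel recipe quoted at 14:13 above; the user name, target host, and local port are simply the ones from the chewbacca example:

    # General form: ssh -f <user>@bastion.wmflabs.org -L <local_port>:<server>:<remote_port> -N
    # Expose bots-1's SSH port on localhost:6000 via the bastion:
    ssh -f chewbacca@bastion.wmflabs.org -L 6000:bots-1:22 -N
    # Then point sftp (or anything else) at the forwarded port:
    sftp -P 6000 chewbacca@localhost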
[15:25:54] For Tool Labs: Is there an option for jsub to not fill the err file with "there is a job named ... already active" messages when using the "once" option? "quiet" doesn't seem to work for this case (only for successful submissions)
[15:26:16] apper, echoing
[15:26:33] echoing apper. I have the same question too.
[15:39:46] apper, Cyberpower678: if /usr/local/bin/job -q "$NAME"; then /usr/local/bin/jsub -n "$NAME" etc; fi ?
[15:39:57] apper: Hmm, no, though it could be added. I'm forced to wonder why that'd be an issue though -- I certainly would want to know if my jobs overrun their start interval.
[15:41:35] (I.e.: starting something once an hour, you'd only get that message when the run actually takes longer than an hour and means you probably need to look into it)
[15:43:16] Coren: yes, mostly, but not every time. e.g. I have a script which runs every 30 minutes and normally only takes a minute or so - then I want to know if it runs longer. But at the moment the script runs for some days, because it is an initial run... so I don't want a message every 30 minutes...
[15:44:46] and there are some scripts for which I don't care how long they take... sometimes they have to do a lot, sometimes not - so it's okay for them to run an hour, but mostly they are running 1 minute... and they are started every 30 minutes...
[15:45:59] so it's useful for scripts where the runtime varies a lot
[15:46:18] Hm. Well, I still think that this probably means you want to reconsider how you start those things, but *shrug*. The easiest way is what anomie suggested; otherwise you can put in a bugzilla for the feature request and I'll tweak the code to have a way to make even that condition be silent.
[15:49:43] Coren: I will open a bug, thanks.
[17:27:52] hey andrewbogott
[17:27:58] I seem to have fucked up proxy-abogott-8
[17:27:58] Coren: you around?
[17:28:01] can't login
[17:28:08] Connection closed by UNKNOWN
[17:28:50] yuvipanda: I'll have a look
[17:28:52] Betacommand: usually helps if you say what you need...
[17:29:33] jeremyb: I would rather not air my dirty laundry in public :P
[17:30:09] Betacommand: i wasn't aware we had a laundromat, but ok then. we'll assume it's not urgent
[17:30:13] :)
[17:30:49] yuvipanda: are other instances in that project still working for you?
[17:31:02] andrewbogott: yeah, I've been using proxy-abogott in the meantime
[17:31:15] -8 says "fatal: Access denied for user andrew by PAM account configuration [preauth]"
[17:31:19] andrewbogott: I made it include a toollabs role in site.pp, so maybe that broke it?
[17:31:42] dunno… do you mind if I reset the puppet repo to origin?
[17:31:47] go ahea
[17:31:48] d
[17:32:01] jeremyb: it's an idiom http://www.idiomreference.com/define/air-your-dirty-laundry
[17:32:11] Betacommand: i'm aware :)
[17:32:19] toollabs uses host based auth iirc
[17:32:26] maybe that makes a diff
[17:32:29] * jeremyb runs away
[17:33:29] yuvipanda: that fixed it, whatever it was.
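Going back to the jsub "once" exchange above (15:25-15:49), a sketch of the guard anomie suggests; the job name and script path are illustrative, and it assumes /usr/local/bin/job behaves as that one-liner implies (quiet lookup by job name, exit status saying whether such a job is active), so check the exit-status convention before relying on it:

    #!/bin/bash
    # Submit only when no job of this name is already on the grid, so the
    # "there is a job named ... already active" error never lands in the .err file.
    NAME=mybot                                  # illustrative job name
    if ! /usr/local/bin/job -q "$NAME"; then    # assumes exit 0 means "job is running"
        /usr/local/bin/jsub -once -quiet -N "$NAME" "$HOME/bin/mybot.sh"
    fi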
[17:34:43] who would i ping about problems with labs instances? for some reason our instances (ee-flow and ee-flow-big) are reporting /data/project as 'No such file or directory' while it's shown on less. Not entirely sure what's up though because /data isn't reported via mount as anything special, typically i would expect this kind of error on a network drive
[17:35:09] s/less/ls/
[17:35:16] ebernhardson: that's a common/periodic problem. I can take a look.
[17:35:18] What project?
[17:35:34] andrewbogott: editor engadgement, the machines are ee-flow and ee-flow-big
[17:35:46] ebernhardson: I don't think that can actually be the project name...
[17:36:04] andrewbogott: thanks!
[17:36:05] hmm, all i know is they are on office as Nova Resource:Editor-engadgement
[17:36:16] hehe, Editor endangerment :D
[17:36:18] and i access them as ee-flow.pmtpa.wmflabs and ee-flow-big.pmtpa.wmflabs,
[17:36:26] :P
[17:36:55] so no d in engagement :P
[17:37:49] andrewbogott: yes, nova lists the project as 'editor-engagement'
[17:38:12] spelling matters (and dashes) when I'm grepping :)
[17:39:05] hehe
[17:40:53] sorry :P)
[17:40:55] ebernhardson: any better now?
[17:41:09] * yuvipanda goes to endanger some editors
[17:41:27] I thought it might be some pun about gadgets for engagement
[17:41:59] andrewbogott: looks to be better on ee-flow-big, but ee-flow still has the issue
[17:42:51] also i'm curious how you magic'd in a network drive that's not seen via /proc/mounts :)
[17:44:54] andrewbogott: can you look at https://gerrit.wikimedia.org/r/#/c/97712/ and https://gerrit.wikimedia.org/r/#/c/97713/?
[17:44:59] andrewbogott: separate the api config into its own file.
[17:45:15] ebernhardson: now?
[17:45:20] andrewbogott: it's on proxy-abogott, but I don't see a server listening on the port specified in the api thing
[17:45:25] ebernhardson: it uses autofs… I don't know quite how it's configured.
[17:45:25] even though the files and such are in place
[17:45:39] andrewbogott: works now, thanks!
[17:45:55] ebernhardson: no problem. I have the feeling there are going to be lots of those today :(
[17:46:05] andrewbogott: :(
[17:47:20] Betacommand: I am now. Was out for lunch. What's up?
[17:48:03] yuvipanda: does it help to # service dynamicproxy-api restart ?
[17:48:23] ebernhardson: Automounts; the filesystem doesn't actually get (tried to be) mounted until an actual attempt to access it.
[17:48:33] andrewbogott: uwsgi is running.
[17:48:53] andrewbogott: I just don't see a port 5668 or whatever specified in api.conf open when I nmap
[17:50:43] Coren: see PM
[17:50:57] yuvipanda: did it work before you made the config change?
[17:51:15] It looks reasonable to me, although I suspect there's a more graceful way to add the config
[17:51:43] andrewbogott: how so?
[17:51:49] andrewbogott: you mean rather than file and link?
[17:52:03] Yeah, I think we have a ready-made class that handles that. Looking...
[17:52:11] that'll be nice
[17:52:37] nginx::site
[17:53:41] andrewbogott: looking
[17:54:06] andrewbogott: do ya'll have etckeeper installed? puppet should do a commit before and after each run. so you could see what changed when you reset to origin
[17:54:16] * jeremyb reruns away
[17:54:43] jeremyb: we don't, usually I just watch the output during a run. Probably a good idea :)
[17:55:29] andrewbogott: you can always install, commit, break, commit, revert, commit :)
[17:55:46] boring :P
[17:55:50] even though you reset to origin you still have git reflog
[17:56:12] yuvipanda: there is definitely a service running on port 5668; telnet tells me so
[17:56:17] oh
[17:56:22] nmap is messed up?
[17:56:32] I… have never used nmap
[17:57:28] ah :D
[17:57:32] well, okay then :D
[17:57:43] andrewbogott: can you merge those two now? I can change it to nginx::site in a later patch
[17:59:23] yuvipanda: I note that one of them is -1'd...
[17:59:32] andrewbogott: yeah, by me
[17:59:40] andrewbogott: because I thought the service wasn't running
[17:59:48] the comment even mentions that I should talk to you before merging :D
[17:59:51] And also I'm running to lunch in 5… I'll merge after lunch when I can keep an eye on the roll-out.
[18:01:02] andrewbogott_afk: sure!
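On the port-5668 confusion above: a bare nmap only scans its default list of common ports, which would explain why telnet saw the API while nmap did not. Rough ways to check (instance name illustrative):

    # On the instance itself: is anything bound to the port?
    ss -tln | grep 5668          # or: netstat -tln | grep 5668
    # From elsewhere, name the port explicitly:
    nmap -p 5668 proxy-abogott.pmtpa.wmflabs
    # or simply:
    nc -zv proxy-abogott.pmtpa.wmflabs 5668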
[18:32:43] andrewbogott_afk: fyi, i got into virt1000 just fine. thx for your help :)
[19:27:35] OK stupid question for the day, after suspending a vim edit how do I re-open the session?
[19:28:04] Betacommand: suspend as in ctrl-Z?
[19:28:14] valhallasw: yep
[19:28:16] fg
[19:28:17] or %1
[19:29:36] valhallasw: vim %1
[19:30:38] valhallasw: like that?
[19:31:01] Just `%1`: you want to run the existing job, whereas calling `vim ...` with any argument will likely start a new vim.
[19:31:26] `jobs` lists the jobs in case you have multiple suspensions or background jobs, and it's not actually %1
[19:33:27] thanks all
[19:33:34] I don't use vim that often
[19:52:57] Hello. Can anybody tell me why there is no dewiki_p.watchlist table at the labs db?
[19:53:19] Is it intentionally missing?
[19:53:55] Coren: ^
[19:54:35] krd: It is. Watchlists are considered private information. The API will offer counts of watchers, however, for most pages.
[19:55:07] I had a tool "pages watched most" for dewiki. So this is no longer possible?
[19:55:09] krd: The general rule Labs uses for the database replicas is: the information must be visible to registered users with no special privileges.
[19:56:01] Coren: the toolserver had a sanitized view of the watchlist table
[19:56:47] krd: Not as it currently is, no. It's not all that hard to get an okay from Legal for specific views into tables we don't make available though; and it's reasonable to make a case for this.
[19:57:19] Coren: Can you handle this for me?
[19:57:21] krd: Especially since the watchlist now /does/ offer the number of watches for pages with (iirc) more than 30.
[19:58:03] I only need those >30.
[19:58:18] krd: It's best if the request comes from the actual user who has a use case for the data. If you open a bugzilla, I'll make sure legal is in on it and poke them. This way, once they give their okay, implementation becomes easy.
[19:58:49] I'd like to avoid the overhead of learning how to open such requests.
[19:59:22] But if it's necessary, please at least give me a link where I have to start.
[20:00:35] krd https://bugzilla.wikimedia.org
[20:01:25] Thx.
[20:01:33] Next question: The commonswiki_p database at dewiki.labsdb is empty. Intended?
[20:03:16] krd: nope
[20:03:19] Hm. There shouldn't be a commonswiki_p on that shard at all; something must have gone sideways with the federation. Lemme check.
[20:05:12] Oh, duh! The federated commons and wikidata tables never made it to prod! Why are you the first one to notice? :-)
[20:05:37] *sigh*
[20:05:45] andrewbogott: thanks for merging :)
[20:06:26] Coren: https://bugzilla.wikimedia.org/show_bug.cgi?id=57617 for the watchlist table issue.
[20:06:29] krd: Ah, it looks like it's just on that shard. That'll be a simple fix.
[20:06:42] * Coren goes to do that now.
[20:17:09] * Damianz waits for the implosion
[20:17:50] Damianz: no icinga here, so...
[20:18:20] yuvipanda: Who needs icinga when you get email spammed as soon as stuff breaks
[20:18:27] hehe
[20:18:40] Or a load of semi grumpy users...
[20:19:10] Damianz: only 'semi'?
[20:20:18] There's 'new' and 'grumpy' seemingly
[20:26:14] ...
[20:26:21] I think I just fucked up. :-)
[20:26:54] I did. Very hard.
[20:27:47] * Coren facedesks.
[20:27:54] there there.
[20:27:55] Sorry all, I just broke dewiki replication.
[20:28:08] * Coren fixes before he postmortems on the list.
[20:34:01] On the positive side, the commonswiki_f_p is now there on the shard.
[20:34:03] :-/
[20:39:33] Damianz: You jinxed it! :-)
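For the suspended-vim exchange at 19:27-19:33 above, the relevant bash job control in one place (file name illustrative):

    vim notes.txt    # press Ctrl-Z inside vim to suspend it
    jobs             # list stopped/background jobs, e.g. "[1]+  Stopped  vim notes.txt"
    fg               # resume the most recent job in the foreground, or
    fg %1            # resume job 1 explicitly; a bare "%1" is shorthand for this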
[20:40:10] * Damianz throws some salt at Coren
[20:40:49] Talking of salt... it's freaking freezing
[20:40:53] * Damianz goes to find some heat
[20:41:48] * yuvipanda throws some puppets at Damianz
[20:42:34] Too much puppet this week already, just got a new netapp to make work
[20:44:00] hi!
[20:45:26] I can’t access dewiki_p.logging on Tool Labs as pb.
[20:45:32] ERROR 1356 (HY000): View 'dewiki_p.logging' references invalid table(s) or column(s) or function(s) or definer/invoker of view lack rights to use them
[20:45:39] Same with dewiki_p.ipblocks.
[20:45:44] ireas: hey!
[20:45:52] ireas: there's an outage in progress, just for dewiki
[20:46:03] hi yuvipanda!
[20:46:04] is being fixed, should be back to normal in an hour or so
[20:46:05] ah, okay
[20:46:19] ireas: there was an email to labs-l just now, I suspect that'll have updates
[20:46:25] hmm, at the moment, I can’t access tools.wmflabs.org anymore
[20:46:25] lol
[20:47:40] ireas: are you trying to ssh?
[20:47:46] ireas: then it is tools-login.wmflabs.org
[20:47:53] yuvipanda: SSH works, HTTP not
[20:48:12] HTTP does indeed seem borked
[20:48:19] * Damianz nudges Coren
[20:49:29] https://scontent-b-lhr.xx.fbcdn.net/hphotos-frc3/p480x480/1456788_10202535146319612_1390784460_n.jpg < Think I know what I'm having for dinner mmmmmm
[20:50:00] Damianz: pfft, I had ice cream for dinner
[20:50:18] I see the slowness.
[20:50:21] * Coren checks the webservers.
[20:51:19] I had ice cream for lunch
[20:51:38] Damianz: I had coke for lunch
[20:52:04] Something is hammering on the servers /hard/. Looks like a tool is being spidered.
[20:52:25] Not even me anymore, yay for direct sql
[20:54:45] Blocking the spider seems to have fixed performance.
[20:56:09] yes, works for me now, thanks
[20:57:23] Coren: lovely email :P
[20:57:28] made me chuckle
[20:58:23] addshore: Well, running a bunch of "create or replace table X" from a script without double checking which database you're actually connected to certainly qualifies - at the least - as 'boneheaded'
[21:00:06] [=
[21:07:38] Coren, could you run a SQL query for me?
[21:08:01] Platonides: What?
[21:11:06] "SELECT up_value, COUNT(*) FROM user_properties WHERE up_property='language' AND up_user IN ( $(echo $(cut -f 1 bug49749-participants.txt) | sed 's/ /, /g') ) GROUP BY up_value;" at commonswiki_p
[21:11:28] andrewbogott: i can haz a public IP for data4all?
[21:11:34] file at http://toolserver.org/~platonides/wlm2013/bug49749-participants.txt
[21:11:35] non-web port
[21:11:57] Platonides: Sure. give me a minute.
[21:12:12] s/commonswiki_p/commonswiki/
[21:12:25] yuvipanda: ok, done
[21:12:25] thanks
[21:12:30] andrewbogott: ty!
[21:18:27] Platonides: Where do you want the results?
[21:18:50] Coren: so, how do I get access to labsdb from a different project?
[21:18:55] Coren: copy /etc/hosts and then?
[21:19:03] Coren, any temp place can do
[21:19:25] Coren: Also need to get /etc/iptables.conf and source it from iptables-restore.
[21:19:36] sweet
[21:20:17] Platonides: http://pastebin.com/VHMpSMMd
[21:21:04] use ferm for the iptables?
[21:21:09] thabj you!
[21:21:16] *thank you!
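A rough sketch of the labsdb-from-another-project steps discussed at 21:18-21:19 above, run as root on the new instance; the file names for the copies are illustrative, and how you get them over from a Tool Labs host is left open:

    # 1. Append the *.labsdb host aliases from a Tool Labs host's /etc/hosts:
    grep labsdb hosts-from-tools >> /etc/hosts
    # 2. Load the NAT rules Tool Labs keeps in /etc/iptables.conf:
    iptables-restore < iptables.conf-from-tools
    # 3. Credentials: a replica.my.cnf, as discussed further on.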
[21:24:18] Coren: don't I need a username and password to connect to labsdb?
[21:24:29] Coren: should I just reuse one of my tools' access credentials?
[21:24:56] yuvipanda: They should appear in your home iff you use NFS
[21:24:59] sounds like i shouldn't run mariadb on localhost anymore
[21:25:12] Coren: this is a different project
[21:25:15] Coren: not tools
[21:25:15] so I found out the hard way that a labs instance doesn't let one add authorized_keys :D
[21:25:22] yuvipanda: I am aware.
[21:25:42] Coren: oh!
[21:25:49] I switched to nfs, let me force a puppet run
[21:28:54] andrewbogott: Coren I just switched instance data4all-chocolate to NFS, and restarted, now can't ssh :(
[21:29:01] debug1: Exit status 254
[21:29:01] Killed by signal 1.
[21:29:04] after it logs in
[21:29:50] andrewbogott: Coren can either of you login with the root key and see what's up?
[21:30:01] Looking at it now.
[21:30:20] yuvipanda: by default new projects don't have shared homes or shared project storage.
[21:30:23] I'm guessing you want both?
[21:30:42] andrewbogott: yeah? I enabled the nfs client role, that should give me it right
[21:30:42] ?
[21:30:54] nope, there's a 'configure project' setting.
[21:31:01] oooh
[21:31:03] andrewbogott: didn't realize
[21:31:07] enabling nfs when you don't have shared storage… I don't know what that does… breaks, apparently.
[21:31:17] andrewbogott: where do I find this
[21:31:35] andrewbogott: Not necessarily; /home should still work.
[21:31:46] andrewbogott: But the box literally just rebooted seconds ago.
[21:31:57] yeah because I rebooted it.
[21:32:01] And it doesn't look like it got a clean puppet run before reboot either.
[21:32:07] So, many troubling things here :)
[21:32:16] uff
[21:32:31] notice: Skipping run of Puppet configuration client; administratively disabled; use 'puppet Puppet configuration client --enable' to re-enable.
[21:32:41] yuvipanda: ^^
[21:33:08] Coren: I didn't do it!
[21:33:38] curious, I am in the middle of a puppet run right now
[21:34:20] andrewbogott: Wait, what? On data4all-chocolate?
[21:34:33] yeah
[21:34:53] ... I don't see you logged in, nor do I see puppet running.
[21:35:08] oops, broken pipe
[21:35:10] * Coren is suddenly suspicious.
[21:35:11] Coren: andrewbogott if it is any help, I restored the iptables file from toollabs
[21:35:27] inet 10.4.1.87/21 brd 10.4.7.255 scope global eth0
[21:35:32] now I see what you see
[21:35:43] * Damianz fixes andrewbogott's pipe
[21:36:08] * Coren reenables puppet and runs it.
[21:36:28] Coren: was that rule too toollabs specific and locked me out?
[21:36:36] I should learn to read iptables rules
[21:36:50] yuvipanda: No. It should work from anywhere.
[21:37:06] You can't log in because autofs isn't allowing /home to be mounted.
[21:37:30] hmm
[21:38:18] The autofs config looks okay.
[21:38:41] * yuvipanda 's connections are still killed
[21:38:58] yuvipanda: That's not going to magically start working, dude.
[21:39:27] Coren: but.. butt.. the phase of the moon changed! :P
[21:40:51] Oh FFS! xfs data corruption!
[21:41:03] ow!
[21:41:07] I've had it with this shitty POC of a filesystem.
[21:41:09] ouch!
[21:41:38] yuvipanda: I think you should just trash that instance and make a new one :)
[21:41:44] I blame yuvipanda for changing the phase of the moon
[21:42:00] andrewbogott: ok!
[21:42:09] andrewbogott: I'll leave the current one as is if Coren wants to investigate
[21:42:16] Ah, maybe not.
[21:42:43] That looks like the "known issue" with readahead and XFS on recent kernels.
[21:42:59] Still, POS. I'm doing the eqiad server with ext4
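Regarding the "administratively disabled" notice pasted at 21:32 above, which Coren clears by re-enabling puppet and running it: the standard agent commands for that are roughly these.

    puppet agent --enable    # clear the administrative-disable lock
    puppet agent --test      # run the agent once in the foreground, verbosely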
[21:43:18] Coren: was it just randomly triggered?
[21:43:34] andrewbogott: should you enable shared storage as such specifically before I do something again? :)
[21:43:37] yuvipanda: Yeah, coincidence of timing.
[21:43:45] heh ok
[21:43:54] yuvipanda: I did already.
[21:44:10] Coren: also, just to confirm - I'll get a replica.my.cnf in my home when I move to NFS, right?
[21:44:17] yuvipanda: Yes
[21:44:37] !log data4all data4all-chocolate is trashed because PHASE OF THE MOON TRIGGERED BUG IN XFS
[21:44:39] Logged the message, Master
[21:44:48] !log data4all Creating data4all-pista
[21:44:49] Logged the message, Master
[21:45:51] Hmm. Wait before you scrap.
[21:46:02] I think the problem is with the NFS server.
[21:46:17] Something odd is going on.
[21:46:17] Coren: yeah I haven't killed it
[21:46:20] it's still there
[21:46:23] just created a new one
[21:47:51] * Coren needs to reboot the NFS server. FFS. It's been running perfectly for 42 days and now there's a kernel bug.
[21:48:02] Coren: is it affecting tools too?
[21:48:34] yuvipanda: Not yet; it looks like it's only going to prevent new projects from switching over -- but a bug in the kernel mount table is a disaster waiting to happen.
[21:48:44] :(
[21:48:58] Not a big deal; it just will stall NFS for a few minutes.
[21:49:08] Coren: is there a way I can get a username / password for labsdb without using NFS in data4all-pista?
[21:49:40] yuvipanda: That's the canonical way. It'll be back in a few moments anyways.
[21:49:48] Coren: hmm ok
[21:49:53] i'll just wait for the initial run to finish
[21:50:44] It'd be even faster if that stupid server didn't take soooooo long to boot.
[21:50:59] Coren: heh
[21:54:23] getting 503s from beta labs suddenly http://en.wikipedia.beta.wmflabs.org/wiki/Main_Page
[21:56:14] chrismcmahon: Emergency reboot of the NFS server. Should be back up in a couple minutes.
[21:56:27] I see it's *finally* past POST now.
[21:56:33] thanks Coren, good to know
[22:00:10] grrrit-wm nooooo
[22:00:18] yuvipanda: You restarting it?
[22:00:25] marktraceur: labs NFS server restart
[22:00:30] Ah, cool
[22:01:08] Box is up. Services should follow shortly.
[22:03:10] ... holy-not-sleeping-tonight, batman.
[22:03:22] that's not a good sign
[22:03:35] No, no it is not.
[22:04:09] is that NFS?
[22:04:26] Coren: what's up with labstore4
[22:04:27] ?
[22:04:45] Ryan_Lane: XFS corruption, after all. Hard enough that trying to mount it kills the kernel.
[22:04:57] * Coren throws XFS out the window.
[22:05:00] using the older kernel?
[22:05:05] Ryan_Lane: Yep.
[22:05:15] that makes no sense
[22:06:09] we use XFS on all the db nodes and haven't run into this
[22:06:19] I really don't care what sense we think it makes. I'm doing an xfs_repair, mounting RO, and moving everything to ext4.
[22:06:25] heh
[22:06:33] Ryan_Lane: For all we know, it's an evil interaction between XFS and NFS.
[22:06:42] well, did the box crash hard?
[22:06:50] Ryan_Lane: Kernel panics.
[22:07:00] FastLizard4: Triggers the watchdog even.
[22:07:04] the kernel panics were due to corruption?
[22:07:13] Huh what?
[22:07:18] FastLizard4: misfire ;)
[22:07:26] Oh, 'k :P
[22:07:55] Ryan_Lane: XFS panics, dozens of them, followed by a 'trying to recover from recursive fault', followed by the watchding desperately trying to wake every CPU in turn.
[22:08:04] watchdog.
[22:08:07] heh
[22:08:24] * yuvipanda laughs at watchding
[22:08:24] I'm so glad we're managing our own NFS server
[22:08:59] Hard enough that I had to look at the console because userspace dies.
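On the replica.my.cnf question above (21:44 and 21:49): once the credentials file is in the home directory (or, for a personal account, reusing the tools credentials as Coren suggests later), the connection looks roughly like this, with the host alias coming from the copied /etc/hosts:

    mysql --defaults-file="$HOME/replica.my.cnf" -h dewiki.labsdb dewiki_p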
[22:10:01] I am *so* glad I only used 10g of the LVS for that filesystem.
[22:10:04] Did your mamas not tell you distributed file systems are hard
[22:10:06] * Damianz chokes
[22:10:06] Paranoia pays.
[22:11:52] Ryan_Lane: Also, no thin provisioning nor snapshots so it's not that either.
[22:11:59] yep
[22:13:37] xfs_repair proceeds without bringing the kernel down with it -- good sign.
[22:14:03] I'm seeing some lost metadata, but that's probably just the kernel having been brought down hard in the first place.
[22:18:15] * a930913 panics and reads the scrollback.
[22:19:09] I think I found the culprit.
[22:19:20] * a930913 panics and runs.
[22:20:58] * Damianz looks at his growing files and hides behind yuvi
[22:21:16] * yuvipanda switches everyone to NFS and blames it on glusterfs
[22:24:29] * Coren mumbles something about netapp.
[22:24:30] Coren: what's the culprit?
[22:24:41] Betacommand: Looks like quotas.
[22:24:43] If anyone asks, someone just tripped on the network cable.
[22:24:48] is it safe to bring our stuff back up?
[22:24:55] Coren: ??
[22:25:08] Betacommand: IT IS NOT SAFE TO GO ALONE!
[22:25:10] Betacommand: No shared filesystem; I'm working on trying to fix things now.
[22:25:15] HERE, TAKE THIS!
[22:25:24] * yuvipanda gives Betacommand a stack of CD-RWs
[22:26:14] * a930913 hands yuvipanda a backup tape and a needle.
[22:26:21] Coren: does that mean /home is down?
[22:26:29] Betacommand: And /data/project
[22:26:41] * yuvipanda drops a stone tablet on a930913
[22:26:49] Coren: what's the ETA on restoration?
[22:27:00] If all goes well, ~10m
[22:27:36] sounds good :)
[22:27:56] Lol at how many people panic and come on when something happens.
[22:28:04] thanks, Coren
[22:28:28] a930913: I wanted to know if I'm the only one with problems ;)
[22:28:37] that's why I came here ;)
[22:28:48] apper: Exactly ;)
[22:28:54] Alright, the good news is: no matter what, I can mount the filesystem readonly correctly, so there is a way out.
[22:30:59] At least Coren is sort of alive; trying to find anyone for toolserver when it explodes is near impossible, and then they don't care so much when you find them and you have to listen to another round of 'not that supported'
[22:31:23] :p
[22:31:25] Things went better than I feared.
[22:31:40] I should expect NFS mounts to wake up starting now.
[22:32:29] Ryan_Lane: 99% probability the culprit was quota.
[22:32:43] well, that makes some sense
[22:32:49] (Which, incidentally, is also something that isn't used in prod)
[22:32:57] exactly :)
[22:33:26] I wait for the day that all the prod database servers take a shit over the filesystem
[22:34:42] Damianz: please don't say that
[22:36:16] Yeah, I got the filesystem back in a good state.
[22:36:22] Betacommand: It would be a bad day... month, but I like to be pessimistic
[22:36:26] * Coren gingerly restarts NFS
[22:37:46] * Betacommand grumbles about having to restart everything
[22:38:38] Betacommand: Probably not.
[22:38:45] Hard mounts.
[22:39:30] Coren: all my stuff died
[22:39:49] thus all of it needs restarted
[22:39:50] Betacommand: It's probably just wedged waiting on the filesystem, which is now gradually coming back.
[22:40:05] Coren: you forget who you are talking to
[22:40:31] has there ever been an outage that hasn't killed all my stuff?
[22:41:33] Coren: I had 1 bot still up
[22:42:12] * Damianz looks at gerrit-wm
[22:42:35] The exec nodes are coming back up.
[22:42:50] are there estimates of when dewiki will come back?
[22:43:25] apper: Should be in a few hours.
[22:43:49] Coren: thanks
[22:44:14] Coren: andrewbogott btw, exact same result for -pista as for -chocolate
[22:44:16] yuvipanda: Can you restart grrrit or something? It seems ill.
[22:44:27] Coren: ahw, yes
[22:44:38] Hmm, is there a way to tell the grid to restart tasks in the continuous queue no matter what the exit code? The crontab of qstat + grep + jsub seems insane
[22:45:22] Damianz: Yeah, you can force a restart of any job with qmod, but it's normal that many bots will have survived this unharmed.
[22:45:44] yuvipanda: OK, so… you switch the instance to nfs, and then run puppet, and then reboot?
[22:45:53] andrewbogott: yah
[22:46:14] Well mine died because the forking failed, but the main job exited cleanly so it didn't get restarted :( So I have cronjobs to check if things are running and resubmit them if not to work around it.
[22:46:42] Damianz: Oh, sorry, I think I misunderstood you originally.
[22:47:17] Damianz: There is no specific way to do this, but if you wrap your job in a while true; do $something; done that'll have the same effect.
[22:47:31] It's a bit dangerous though.
[22:48:10] Hmmm, good point - they're wrapped in scripts anyway to set up the path stuff correctly... that way there's no way to say 'hey don't keep doing this if it fails x times constantly' though. I like that in supervisord.
[22:48:25] Though I could do that I suppose... more work than I wanted
[22:48:28] * Damianz notes to look into later
[22:50:35] As far as I can tell, things should be back to normal now.
[22:50:48] yuvipanda: I can offer no explanation… maybe Coren can look once he catches his breath
[22:51:17] andrewbogott: I'm going to try turning it off and on
[22:51:33] That always works.
[22:52:57] andrewbogott: i restarted it, but i dunno if it actually restarted?
[22:53:01] seems to still be up
[22:53:05] that or it came back up in 5s
[22:53:57] yuvipanda: It looks like it's working on -chocolate. /home is there anyways
[22:54:02] yuvipanda: Try to log in?
[22:54:07] * yuvipanda tries
[22:54:17] Coren: woo works!
[22:54:19] thanks :D
[22:54:34] Coren: no labsdb access tho?
[22:54:38] yuvipanda: Your home being there should mean you get your replica.my.cnf within a few minutes.
[22:54:39] does something need kicking?
[22:54:43] aha :D
[22:54:44] nice
[22:55:22] trying to use VisualEditor on beta labs just now: Error loading data from server: parsoidserver-http-bad-status: 404. Would you like to retry?
[22:55:52] yuvipanda: I suspect that everything was working /except/ the reboot which maybe never happened.
[22:55:56] Still not sure how I feel about that - db should be more of a labs service and easily usable outside tools imo
[22:55:56] chocolate has been up 1:25
[22:56:02] andrewbogott: hmm
[22:56:06] does that correspond with when you created it or when you rebooted it?
[22:56:07] andrewbogott: i hit reboot at least a few times
[22:56:15] andrewbogott: sounds more like creation
[22:56:22] yuvipanda: Oh, wait, you meant for YOU and not a service group?
[22:56:25] not sure what config got overwritten though
[22:56:29] Coren yes!
[22:56:41] yuvipanda: That's not done automatically at all; but you can use the same creds you get on tools.
[22:56:51] Coren: okay, so I'll just copy paste something
[22:56:56] yuvipanda: I'm going to try to reboot chocolate from the gui to see if it fails for me
[22:57:02] ok
[22:57:21] hm, nope!
[22:57:22] yuvipanda: global users aren't done automatically otherwise anyone could steal credentials by just creating a project and making homes. :-)
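Back on the continuous-queue restart question at 22:44-22:48 above, a sketch of the while-true wrapper Coren describes, with a crude version of the "stop if it keeps failing" guard Damianz likes in supervisord; the command, threshold, and timings are illustrative, and the wrapper would be submitted with something like jsub -continuous:

    #!/bin/bash
    # Restart the bot whenever it exits, regardless of exit code, but give up
    # if it keeps dying within a minute so a broken deploy doesn't loop forever.
    MAX_FAST_FAILS=5                    # illustrative threshold
    fails=0
    while true; do
        start=$(date +%s)
        "$HOME/bin/mybot.sh"            # illustrative command; exit code deliberately ignored
        runtime=$(( $(date +%s) - start ))
        if (( runtime < 60 )); then
            fails=$(( fails + 1 ))
            if (( fails >= MAX_FAST_FAILS )); then
                echo "giving up after $fails quick failures" >&2
                exit 1
            fi
        else
            fails=0
        fi
        sleep 5
    done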
[22:57:28] curious
[22:57:30] ah right
[22:57:54] Oh, the gui even says 'failed'; it's just not very big.
[22:58:27] oh
[22:58:29] it does?
[23:00:17] andrewbogott: did you manage to restart it?
[23:02:26] yeah, both should be fine now
[23:02:46] It restarted fine from the shell; this is maybe a bug in wikitech
[23:05:55] andrewbogott: prolly
[23:09:46] Coren: Thanks for fixing the problems promptly btw <3
[23:10:45] That's what they pay me the big^H^H^Hmedi^W adequate bucks for.
[23:11:09] First food, then postmortem
[23:20:43] Is privacy-policy.xml in all Labs wikis?
[23:20:51] It seems to be an export.
[23:20:55] Is there any reason to keep it around?
[23:21:23] It looks like it's already been imported into the wikis.
[23:45:24] andrewbogott: do I need to restart the system or something after I modified a security group?
[23:45:39] andrewbogott: I didn't add a new security group or anything, just modified the rules in the group itself
[23:47:02] well, let me restart it
[23:55:36] Coren|AFK: FYI, a bunch of AnomieBOT's tasks failed with errors. But since the logs are written to NFS and NFS was b0rken, I have no indication why. My guess is "Couldn't write to logs", though, or maybe "MySQL died in a weird way" if the labsdb was affected by the NFS outage too.
[23:57:36] Coren|AFK: Hmm, that's weird. Job 1369613 came through the outage fine, but then when I sent it the "restart" signal (by writing 'restart' to a particular key in redis), it quit. Same for job 1369618. Nothing written to the log files for either one.
[23:58:01] And job 1369621 too.