[00:00:15] And ^C on tools-dev doesn't abort whatever is hanging; so reading blocking on device?
[00:00:29] YuviPanda: The bash thing on tools-*login*?
[00:00:34] aye
[00:00:43] and then my mosh died
[00:01:36] Should I reboot tools-login, then? Might not solve the tools-dev issues, but could it hurt?
[00:02:06] Coren, petan: You're both offline?
[00:02:41] yeah
[00:02:56] scfc_de: http://tools.wmflabs.org/?status doesn't seem to work either
[00:03:02] so maybe the entire cluster's out?
[00:03:10] i've no way of checking SGE
[00:03:32] AnomieBOT is still making edits, as of a few minutes ago anyway
[00:03:55] hmm
[00:04:26] !log tools Rebooted tools-login apparently out of memory and not responding to ssh
[00:05:29] Hah, the bot seems to be down as well. But doesn't it live on Bots?
[00:05:29] Ah, that was what that ding was. "Power button pressed."
[00:05:49] wolfgang42: You had an open session?
[00:06:06] scfc_de: Open, but not responding.
[00:06:19] I had mariadb open at the time, and I couldn't exit.
[00:08:25] hmm
[00:08:26] morebots has left
[00:09:11] Now "ssh tools-login.wmflabs.org" works, but hangs at the same step (after "No mail."). Fuck.
[00:11:19] do we have some sort of a 'panic' button?
[00:12:06] I have, right next to my Ctrl key.
[00:12:07] * wolfgang42 pushes the panic button.
[00:12:15] Nothing happens.
[00:15:40] scfc_de: Yours was hanging after "No mail"? Mine is now, but before it wasn't getting past the access problems notice. So there's progress, I guess.
[00:16:45] wolfgang42: On tools-dev, yes. Now tools-login is in the same (still unusable) status as tools-dev.
[00:17:03] well, I have You have new mail.
[00:17:03] :P
[00:17:11] but that's because i've setup forwarding.
[00:17:45] From Puppet, I think NFS is on labstore1. If I look at ganglia.wikimedia.org, that seems to be "okay". I have no idea what's the problem here.
[00:19:41] Does someone see anything obvious in https://wikitech.wikimedia.org/w/index.php?title=Special:NovaInstance&action=consoleoutput&project=tools&instanceid=51f3a377-90b5-4f46-8590-5624c50c5dbf&region=pmtpa ?
[00:20:11] "task jsub:3012 blocked for more than 120 seconds"?
[00:20:12] you've to be projectadmin to see that, scfc_de
[00:20:18] maybe SGE is dead?
[00:24:24] http://pastebin.com/2vCBtMES
[00:25:21] tools-master (the grid-master) is in the same state. Also I don't think something in /etc/profile.d (or wherever in the login startup chain) contacts the grid master, so it shouldn't hang.
[00:25:59] scfc_de: "nslcd[2266]: [3c9346] error writing to client: Broken pipe" might be bad news, that seems to be related to LDAP.
[00:26:23] scfc_de: /scripts/local-premount/rescuevol: line 72: add_mountroot_fail_hook: not found
[00:26:30] mabbe
[00:26:30] ?
[00:26:36] toolsbeta-login.pmtpa.wmflabs hangs as well, so it doesn't seem to be limited to Tools.
[00:27:32] autofs might also be fucked? initctl: Unknown job: S20autofs
[00:28:17] Bots however works. NFS as the common denominator? Then we would be stuck until Coren, Ryan_Lane or andrewbogott come online.
[00:29:20] true
[00:29:29] the analytics project (limn0 instance) also works
[00:29:31] so probably NFS
[00:30:37] I'll write a mail to Coren and Ryan. Until they fix that I fear we must work on other projects :-).
[00:30:45] whaaat nooooooo!
[00:30:46] :P
[00:30:55] i was working on VisualEditor stuff and when I come back labs goes down :(
[00:33:43] When did the troubles start? 23:30Z?
[00:34:09] 04:55IST,
[00:34:15] so -0530
[00:34:20] that'll be...
[00:34:23] 23:25Z?
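A note on the "blocked for more than 120 seconds" message at [00:20:11]: that is the Linux hung-task watchdog flagging a process stuck in uninterruptible sleep, which is exactly what a dead NFS mount produces. A minimal sketch of how one might look for these from a root shell — standard dmesg/sysctl usage, nothing Labs-specific assumed:

    # Kernel hung-task warnings: tasks stuck in uninterruptible sleep
    dmesg | grep 'blocked for more than 120 seconds'
    # The 120-second threshold comes from this sysctl:
    sysctl kernel.hung_task_timeout_secs
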
[00:35:18] scfc_de: ^
[00:35:46] YuviPanda: Thanks.
[00:36:08] that was when wolfgang first reported it
[00:36:17] i checked and had issues at that time, but could type things into my mosh connection
[00:36:21] (got the bash: fork thing)
[00:43:07] Why isn't -login responding?
[00:43:28] Cyberpower678: Scroll back.
[00:46:07] Where am I scrolling back to?
[00:47:50] Why is labs down?
[00:47:52] TLDR
[00:48:02] I'm very sleepy.
[00:49:35] Apparently NFS failure.
[00:49:53] TOOOOOOOOOLSEEEEEEEERVEEEERRRR
[00:52:00] * Cyberpower678 trouts Coren
[00:54:02] Helllo! Is tools.wmflabs.org working?
[00:54:13] JohnMarkOckerblo, no.
[00:54:22] scfc_de, bots seem to be down too. https://en.wikipedia.org/wiki/User:Cyberbot_I
[00:55:06] Cyberpower678: The Bots project or bots running on Tools? If the latter, that's to be expected.
[00:55:15] (and is there a standard place to look for server status reports?)
[00:55:44] Tools because it has replication.
[00:56:03] And other various toolserver environments.
[00:56:44] JohnMarkOckerblo: I posted a short note on the mailing list a few minutes ago, but we should change this channel's topic as well.
[00:56:47] * anomie notes that AnomieBOT is still happily editing away
[00:57:10] * Cyberpower678 smacks anomie for bragging. ;p
[00:57:30] I'm requesting Bots access.
[00:57:42] Well, at least I know it's not my script going haywire. (hopefully.)
[00:58:03] I'm going to set up a backup there, should it stall on Tools in the future.
[00:58:14] Cyberpower678: I'd guess anything that gets launched from a cronjob is not being started, but continuous processes will keep running, until maybe they happen to run into whatever is breaking stuff
[00:58:33] anomie: Do you do file system stuff?
[00:58:37] anomie, well I use cronjobs.
[00:59:03] That does explain a lot, since the script that is running is loaded in the active memory already.
[00:59:04] scfc_de: It writes to stdout/stderr with each edit
[00:59:56] anomie, https://en.wikipedia.org/wiki/User:Cyberbot_I This is where my status updater comes in handy. ;p
[01:00:11] scfc_de: And it will periodically check for a certain file's existence, for "restart" commands. And possibly it will update its cookie file, too.
[01:00:59] hm. If there's a space issue, as prior logs suggest, my cgi script does write a log into my directory, but I think that's on the big shared partition w lots of space.
[01:01:14] ('course, I can't actually get a shell now to verify. :-)
[01:01:57] JohnMarkOckerblo: Definitely not your problem.
[01:02:33] anomie: Don't know if all NFS is stuck, or there's some buffering?
[01:02:58] How goes the tools troubleshooting?
[01:04:06] scfc_de: I don't know NFS *that* well, but IIRC it's pretty good about dying as soon as something goes wrong. I really suspect the login issue is whatever is making nslcd error out (not that I know much about that either).
[01:07:31] anomie: But then login shouldn't work at all, or?
[01:08:07] wolfgang42: Without Coren or Ryan, there's nothing we (or I) can do.
[01:08:30] any estimates on uptime? (I do have another server with the same script, and could potentially point my templates to it, but probably not a good idea unless tools is going to be down for extended period.)
[01:08:46] * Cyberpower678 smacks petan in hopes that he'll respond.
[01:08:53] scfc_de: Any idea why it's rejecting my ssh connection entirely now? (Connection closed by 208.80.153.224)
[01:09:16] wolfgang42, verifying your claim...
[01:10:09] wolfgang42: Don't know, maybe out of memory again?
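The "NFS as the common denominator" hypothesis from [00:28:17] is the kind of thing that can be checked per instance by listing which mounts are NFS at all. A minimal sketch using only standard util-linux/coreutils; the labstore1 export in the comment is illustrative, inferred from the Puppet remark at [00:17:45]:

    # List active NFS mounts on this instance
    mount -t nfs,nfs4
    #  e.g. labstore1:/... on /data/project type nfs (rw,...)   (illustrative)
    # Probe a mount with a timeout, so the probe itself cannot hang the shell
    timeout 5 stat -t /data/project || echo "/data/project not responding"
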
[01:10:19] Unable to establish SSH connection now.
[01:10:52] What's labs' monitoring page?
[01:11:06] With all of those graphs, and doohickies.
[01:11:26] scfc_de, ^
[01:11:29] JohnMarkOckerblo: Worst case probably morning US east coast (10 hours?).
[01:11:46] Cyberpower678: ganglia.wmflabs.org
[01:12:07] * anomie notes that deployment-bastion.pmtpa.wmflabs seems to be having the same issue...
[01:12:07] JohnMarkOckerblo, I agree with scfc_de. I estimate 9-11 hours before uptime.
[01:12:35] anomie: I think hashar converted beta to NFS as well.
[01:13:03] Grid load seems to be increasing.
[01:13:26] Labs may totally crash at some point.
[01:14:15] Interestingly, http://ganglia.wmflabs.org/latest/?p=2&c=tools&h=tools-login lists tools-login as being up.
[01:14:55] Cyberpower678: Only tools-login is running amok. We'll need to look at the crontabs some time.
[01:14:55] This seems to be a repeat of what happened before.
[01:15:21] hm. Not good, but at least it's not during the semester. Might switch templates to point to Penn if things are still down by noon east coast time.
[01:15:22] Though, on closer inspection, I've never seen a load avg. anywhere close to the listed 855.36
[01:16:00] And if that's the case, I may be in part responsible for stalling -login to the point of unusability.
[01:16:20] I've not yet had the chance to adjust my crontab.
[01:17:30] Cyberpower678: Yes and no. "* * * * *" isn't nice, but you (and everyone else) probably couldn't log into tools-login even otherwise, cf. tools-dev.
[01:18:09] scfc_de, I've been meaning to convert those to continuous scripts and submit a jsub continuous command
[01:22:38] [05:05:29 PM] Hah, the bot seems to be down as well. But doesn't it live on Bots? <-- Nope, it lives in the "morebots" project on labs.
[01:23:18] legoktm: Does that use NFS as well?
[01:23:32] Er sorry, I meant "morebots" on tools.
[01:24:01] scfc_de, wolfgang42: looks like it's marked as down now.
[01:26:25] Well, I'm not going to reboot it as it wouldn't change anything. I've mailed Coren and Ryan, and we'll have to wait for their return. Good night everybody!
[01:27:29] night scfc_de, thanks for trying :)
[02:32:04] Soo....stuff down?
[02:33:16] Yup :/
[02:33:21] NFS is down
[02:33:46] that's fine, just use TS copy
[03:35:03] I'm receiving mail from cron saying: /bin/sh: execle: Cannot allocate memory
[03:41:36] Fine time to go out to dinner.
[03:41:47] * Coren notes, with some suspicion, that we are Sunday. Again.
[03:42:06] * Coren will be back in 15
[03:49:24] good, it wasn't me at least. I run Perl and it's a bit ram crazy, but I didn't start yet.
[04:17:32] Yeah, there's a plausible known issue with a kernel bug, but I don't like the timing one bit; it occurred three times now, with 14 days between the failures. A server doesn't "randomly" fail only on Sunday evenings every other week.
[05:39:30] Coren: How long does it take from creating a tool until I can "become" it?
[05:39:44] its listed on https://wikitech.wikimedia.org/wiki/Special:NovaProject, but "become: no such tool 'mwp'"
[05:39:51] legoktm: A few minutes, normally, but you do have to log off and back on to get the group membership.
[05:40:02] Ok
[05:40:07] Also, it seems I can't remove a tool that I own?
[05:40:33] I don't need "mwp-testing" anymore if you can get rid of it
[05:46:50] Coren: Log on and off wikitech? Or tools-login?
[05:47:16] tools-login
[05:47:37] not working then :(
[05:47:45] * Coren checks.
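On the crontab thread at [01:17]-[01:18]: "* * * * *" respawns a job every minute, and the fix being discussed is submitting one continuous grid job instead. A rough sketch of that conversion, assuming the restart semantics described later in the log ([12:01:29]: jstart is equivalent to jsub -once -cont); the tool path, job name, and memory limit are hypothetical:

    # Before: cron starts a fresh copy every minute
    #   * * * * * jsub php /data/project/mytool/update.php      (hypothetical)
    # After: submit once; the grid keeps the job alive, and the script
    # loops and sleeps internally instead of relying on cron
    jstart -N updater -mem 256m php /data/project/mytool/update.php
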
[05:48:46] Not super urgent, I'll just run the script under another tool for now
[05:49:42] I see no reason why it shouldn't work. I see the group, and you belonging to it.
[05:50:29] Can I switch to your user account to test?
[05:50:31] legoktm@tools-login:~$ become mwp
[05:50:31] become: no such tool 'mwp'
[05:50:34] sure, go for it
[05:52:08] I see why.
[05:53:03] The toolwatcher broke when NFS went ill.
[05:53:07] Should be up and running.
[05:53:57] woot, thanks
[05:54:04] * Coren goes to sleep now.
[05:54:10] gnite
[06:12:27] hmm .. too late for Coren .. :-/
[06:13:34] Someone else who can install perl modules using cpan? Some I can install locally, others don't work.
[06:13:41] petan?
[07:19:09] lo
[08:36:26] !log deployment-prep Switching the text cache traffic from deployment-squid to deployment-cache-text1 by reassociating the public IP 208.80.153.219
[08:36:30] Logged the message, Master
[08:43:01] !log deployment-prep Shutdowning deployment-squid , service migrated to deployment-cache-text01 (varnish).
[08:43:04] Logged the message, Master
[09:10:05] @notify AzaToth
[09:10:06] I'll let you know when I see AzaToth around here
[09:19:08] zz_YuviPanda: ping
[09:27:36] YuviPanda what happened yesterday?
[09:27:42] ah
[09:27:44] i don't know
[09:27:47] I wasn't here
[09:27:52] tools-login ran out of memory
[09:27:54] so did tools-dev
[09:27:54] -login was rebooted
[09:27:59] and was rebooted
[09:28:02] but was still unresponsive
[09:28:04] interesting
[09:28:10] kernel logs probably mentioned NFS being fucked up
[09:28:17] and then I went to sleep
[09:28:17] well, I fixed this permanently I guess
[09:28:20] ok
[09:29:13] memory was, I think, a side effect
[09:30:33] well, terminatord will handle it :o
[09:30:36] Beetstra hi
[09:31:13] hi Petan
[09:31:29] petan: so yesterday's problem was apparently NFS fucking up
[09:31:35] nothing to do with memory
[09:31:35] yes I see
[09:31:39] (at least that's what I get from reading backlog)
[09:31:51] I am definitely not sure... I wasn't here
[09:32:13] I will check logs, however nfs should autoremount once fixed
[09:32:23] also, it's possible to repair this without reboot
[09:32:35] Can you help me later this afternoon (I have to go now for an hour or so) - basically, perl modules are difficult to install apparently, some don't, others half, some do. Coren installed POE for me yesterday because it did not want to install at all
[09:32:35] * petan hates reboots
[09:32:49] I don't even remember last time I had to reboot some server to fix any issue
[09:33:14] we have some boxes here with uptime over 1100 days
[09:33:28] is there some problem with the db server?
[09:33:33] fale which one?
[09:33:46] petan: itwiki.labsdb
[09:33:52] Beetstra ok, can you make a list? :)
[09:34:07] fale: you are correct
[09:34:25] petan: I preferred to be wrong :D
[09:34:39] fale: s2 is having problems
[09:34:46] oh
[09:34:54] in fact it seems to me that all of them are having troubles :>
[09:35:19] petan: big problems or small problems?
[09:35:28] YuviPanda how come every time I go afk for a weekend, servers are dying and systems going down :D
[09:35:31] I always miss the fun
[09:35:40] yeah, Coren|Sleep was wondering the same thing too :)
[09:35:57] petan: It seems that 12 hrs ago the s2 already did not respond correctly
[09:36:12] fale: I will ping people who can fix this...
[09:36:19] petan: thanks :)
[09:38:33] who are sleeping :(
[09:39:48] petan: :D
[09:39:59] fale: it's not servers! it's -login being fucked :)
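For the CPAN trouble at [06:13:34] and [09:32:35]: a common pattern for installing Perl modules without root is local::lib plus cpanminus, which builds into a per-user directory. A sketch under the assumption that cpanm is available (otherwise bootstrap it via `cpan App::cpanminus` first); POE is the module Beetstra mentions:

    # One-time setup: per-user library in ~/perl5
    cpanm --local-lib ~/perl5 local::lib
    eval "$(perl -I ~/perl5/lib/perl5 -Mlocal::lib)"   # exports PERL5LIB etc.
    # Then modules install without touching the system perl:
    cpanm POE
    # Persist the environment for future logins:
    echo 'eval "$(perl -I ~/perl5/lib/perl5 -Mlocal::lib)"' >> ~/.profile
    # Note: XS modules that link against C libraries can still fail without
    # the system -dev packages, which would explain the "some don't" above.
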
[09:40:04] fale: on -dev it works
[09:40:09] as well as the other boxes
[09:40:14] petan: cool
[09:41:20] petan: can you install python-zmq and python-matplotlib on tools-dev?
[09:41:29] i had it on login, but I think I should be doing this on dev instead
[09:41:43] sure
[09:41:55] everything that is on -login should be on -dev
[09:42:03] * petan slaps the person who installed this on -login
[09:42:08] hehe :P
[09:42:41] !log tools petrb: installing python -zmg -matplotlib @ dev
[09:42:44] Logged the message, Master
[09:43:10] how does -dev work?
[09:43:32] fale: it's like login, but it has more packages and less people on it, so it's generally better for everything :D but ssshhh
[09:43:51] or people will start using it and it will be just as fucked
[09:43:52] petan: :D does it have a webserver?
[09:44:15] it's part of tools infrastructure, so it's using tools webservers
[09:44:25] petan: fucked? -login: load average: 0.76, 1.15, 1.19
[09:44:45] because it's just after reboot
[09:45:09] YuviPanda btw, if problem was in NFS how come ONLY -login was affected?
[09:45:16] -dev was too
[09:45:23] but -dev wasn't rebooted
[09:45:27] !sal
[09:45:27] https://wikitech.wikimedia.org/wiki/Labs_Server_Admin_Log
[09:45:27] petan: uhm, so it's not possible to use a version of a sw as "stable" on -login and test the new one on -dev?
[09:45:31] yeah because we rebooted login and nothing happened :P
[09:45:36] fale: no
[09:45:40] so Tim decided no point in rebooting -dev
[09:45:45] fale: but you can do that on betatools
[09:45:55] petan: how does betatools work?
[09:46:02] fale: it's a beta version of tools
[09:46:11] it's exactly the same but separate
[09:46:16] and running some new / experimental things
[09:46:21] cool
[09:46:23] or configurations
[09:46:33] do I need some special permissions?
[09:46:39] http://tools-beta.wmflabs.org/
[09:46:50] not really, just tell me your username I will give you access
[09:47:07] petan: thanks :) my username is fale :D
[09:47:08] there is toolsbeta-login server, but it has no public IP
[09:47:10] petan: okay, that was installed. Thanks :)
[09:47:16] YuviPanda yw
[09:47:35] fale: you need to login there using some other box, like -login
[09:47:36] petan: I have to use bastion?
[09:47:40] or bastion
[09:47:43] I see
[09:48:39] petan: there are a lot of servers with no public ip. Does wmf suffer from ip shortage?
[09:49:09] I don't know... I just know that public IP's are given only to those who really need them
[09:49:24] yeah but I think toolsbeta-login should get one
[09:49:27] toolsbeta has 1 public ip for its webserver and that is all I needed for it
[09:49:55] YuviPanda we could give it some, but why... :P you can always use bastion
[09:50:14] well, removes the need for forwarding
[09:50:20] it's very rarely used...
[09:50:26] that is true
[09:50:27] yeah
[09:50:44] btw fale you have access there now
[09:50:54] petan: thanks :)
[09:50:58] heh, 'if you can not understand how to use bastion, you can't probably do much with toolsbeta' :D
[09:50:59] yw :)
[09:51:37] I have to go now, brb 1h
[09:52:58] petan: tools-db has the ram in critical
[09:55:46] fale: really?
[09:56:04] ah -db
[09:56:06] that's ok...
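Reaching an instance without a public IP, like toolsbeta-login above ([09:47:08]-[09:47:40]), is normally done by hopping through a bastion. A minimal ~/.ssh/config sketch; the username is a placeholder, and the bastion address is assumed from the discussion:

    # ~/.ssh/config
    Host bastion
        HostName bastion.wmflabs.org           # assumed bastion address
        User youruser                          # placeholder
    Host toolsbeta-login
        HostName toolsbeta-login.pmtpa.wmflabs
        User youruser
        # Tunnel through the bastion; newer OpenSSH spells this ProxyJump
        ProxyCommand ssh -W %h:%p bastion

After that, a plain `ssh toolsbeta-login` resolves the internal name via the hop.
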
[09:56:08] gtg now
[09:56:15] :o
[10:03:50] * fale notices that bastion wants to be rebooted
[10:05:12] they all want to be rebooted
[10:05:15] such kinky servers
[10:05:52] YuviPanda: :D
[10:06:15] hehe become: no such tool 'addbot'
[10:06:43] * Adam_WMDE thinks NFS is broken on tools
[10:07:01] just before this I got ls: cannot open directory .: Stale NFS file handle on /data/project/addbot
[10:07:39] YuviPanda: the real question is: why does it complain to me that I'm not root?
[10:07:52] it probably doesn't know
[10:07:56] Adam_WMDE: it is broken on NS
[10:07:57] err
[10:07:57] Adam_WMDE: yep, -login is having a lot of problems
[10:07:58] Tools
[10:08:03] fale: :<
[10:08:04] sortof came back, but not fully
[10:08:20] we might have to wait until Coren comes back up
[10:08:54] YuviPanda: how come a server cannot know which users are root on that particular server?
[10:09:12] i guess nobody bothered to implement it?
[10:09:18] heh, not just tools-login, I just tried to ssh from bastion2 to tools-dev and the connection closed straight away :/
[10:09:23] and you don't have to have root to be able to reboot servers.
[10:09:37] Adam_WMDE: really? I'm on tools-dev now and it works...
[10:09:40] fale: have you just added yourself to the sudoers list?
[10:09:46] YuviPanda: how did you get to tools-dev?
[10:09:55] ssh tools-dev.wmflabs.org
[10:10:02] oh, going directly in ;p
[10:10:05] Adam_WMDE: I don't think I'm in the sudoers
[10:10:13] Adam_WMDE: yeah :D
[10:10:15] fale, what project?
[10:10:42] Adam_WMDE: I think I'm not understanding the question
[10:10:52] hmm YuviPanda seems to close the connection straight away for me :/
[10:11:12] i just closed it, sshing again
[10:11:15] fale: which project on labs? tools? bots? something else?
[10:11:19] Adam_WMDE: ah, you're right now. closes for me too now
[10:11:22] :<
[10:11:37] well, nothing's gonna happen on tools today for me then.
[10:11:38] Adam_WMDE: I was talking about bastion ;)
[10:12:04] mhhm, be back a bit later! :)
[10:12:08] YuviPanda: now -login closes connections to me too
[10:12:21] yeah, we're all out again :)
[10:12:32] YuviPanda: cool :D
[10:12:43] and no roots here
[10:13:24] mhhm
[10:14:00] * Adam_WMDE pings them all for good luck Coren|Sleep petan andrewbogott_afk :>
[10:14:09] and of course Ryan_Lane ;p
[10:16:25] heh, also seems the webspace is down because of this
[10:17:20] PROBLEM - Disk space on labstore3 is CRITICAL: DISK CRITICAL - /srv is not accessible: Input/output error
[10:18:51] Hum .. guessing I chose a bad day to take off work to migrate stuff from toolserver to tool labs.
[10:19:03] yup :)
[10:19:29] Oh well. Season 3 Game of Thrones marathon it is then ;)
[10:20:00] hehe, lots of 'outages' in that too :P
[10:23:42] !log bastion Disk space on labstore3 is CRITICAL: DISK CRITICAL - /srv is not accessible: Input/output error
[10:23:54] ah of course bot is dead :-D
[10:30:29] hashar: :D
[10:30:32] Well, on the plus side, at least my bot bounced back well from the last crash :(
[10:32:37] I think there is an inverse rule between the roots around and the problems: more sysadmins available, less problems occur; less sysadmins available, more problems occur
[10:33:19] YuviPanda: I can see why Coren|Sleep was planning to remove nfs :D
[10:33:28] :D
[10:33:34] idk what he was going to replace it with, however
[10:34:00] YuviPanda: I hope something more reliable
[10:34:09] I am not sure what that is, though
[10:34:15] needs to be accessible across multiple systems
[10:34:16] and fastish
[10:35:35] YuviPanda: multiple system in the sense of multiple machines or multiple SO?
[10:35:39] *OS
[10:35:43] machines
[10:53:53] Adam_WMDE hi
[10:55:15] petan: hi :)
[10:55:48] petan: PM
[10:56:09] fale: so what is the problem?
[10:56:28] petan: -login and -dev refuse ssh connections
[10:57:00] petan: the logging bot is down, as well as the webserver
[10:57:04] fale: petan apergos is looking at it in -operations
[10:57:19] logging bot?
[10:57:22] what do you mean
[10:57:51] petan: the one responding to !log here
[10:58:25] ah, that one
[11:04:36] yeah my IRC bot just went down on tools
[11:05:21] !log test
[11:05:28] !ping
[11:05:28] pong
[11:05:34] at least 1 bot survived :P
[11:05:51] but just thanks to heavy caching
[11:05:53] what happened, hardware fail?
[11:06:09] gry: not sure yet, but some md arrays were degraded
[11:06:16] according to apergos...
[11:06:34] nfs server & gluster are having troubles, but instances are running, just mounted storage is not
[11:06:57] that means /home and /data are not accessible :(
[11:08:14] welcome to the life of a sysadmin for 35434535th time. good luck. :)
[11:18:12] Can not connect to ssh://tools-login.wmflabs.org , always stops after sending public key.
[11:18:40] What's wrong?
[11:20:40] zhuyifei1999 problems with nfs :/
[11:21:46] What?
[11:21:57] Hardware problems, zhuyifei1999, they're looking into it.
[11:22:29] Ok, is there any other
[11:22:30] P
[11:22:42] Ok, thanks.
[11:23:30] Sorry for mistyping enter.
[11:25:38] Is the crontab running right now?
[11:26:12] yes
[11:26:52] everything is running, but as the shared storage is down, users can't login and they usually can't even work as the shell hangs on IO wait
[11:28:13] Another strange thing is that when I connect to https://tools.wmflabs.org/, I got Error 7 (net::ERR_TIMED_OUT): The operation timed out.
[11:28:42] that is absolutely normal: you connected to proxy, proxy requested the page from apache, which hangs on IO as well
[11:28:49] so it times out and proxy sends you the error page
[11:31:01] Thanks.
[11:31:47] petan, ?
[11:32:02] yes?
[11:32:31] if I said !doc is !doc and then used !doc. What would wm-bot do?
[11:32:48] !doc is doc
[11:32:48] Key was added
[11:32:50] err
[11:32:53] !doc del
[11:32:53] Successfully removed doc
[11:32:56] !doc is !doc
[11:32:56] Key was added
[11:32:57] !doc
[11:32:57] !doc
[11:33:01] this ^
[11:33:18] So it does not recurse into infinity.
[11:33:27] it's smarter than that
[11:33:34] petan, :p
[11:34:05] petan: what if I invite another bot here and tell it !doc is !doc too?
[11:34:29] liangent, then we'll have some fun. :D
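On petan's point at [09:32:23] that an NFS hang is repairable without a reboot: the usual recovery for a wedged mount (the "Stale NFS file handle" at [10:07:01] is one symptom) is a lazy unmount followed by a remount once the server answers again. A sketch with root, standard util-linux only; the mount point is the one from the log:

    # Detach the dead mount even though processes still hold it open
    umount -l /data/project
    # Reattach once the NFS server responds again (uses the /etc/fstab
    # entry; if autofs manages it, the next access remounts it instead)
    mount /data/project
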
[11:34:42] liangent I don't know, I implemented some anti-abuse mechanism in the past but I already forgot how it works and I maybe even removed it as it caused troubles
[11:34:58] I think if you invite another bot and do that, I will kick it :)
[11:35:00] Cyberpower678: both would be killed by freenode for flooding
[11:35:14] valhallasw nope
[11:35:17] Cyberpower678: then after reconnection it should be OK
[11:35:19] valhallasw only the second bot
[11:35:24] wm-bot can never flood
[11:35:26] good enough
[11:35:41] it's sending max 1 message per second
[11:35:45] oh
[11:35:49] there is a delivery queue in it
[11:35:49] then it could go on forever
[11:36:00] or I could just type
[11:36:02] Message queue was reloaded!
[11:36:09] which deletes the content of the queue
[11:37:16] petan: Is there a reason why you ignore me?
[11:37:21] a global queue for all channels, or one queue per channel?
[11:37:29] Steinsplitter what
[11:37:39] see PM
[11:37:51] Steinsplitter which one? my bouncer died tonight
[11:37:57] I have no PM from you
[11:38:03] ah, okay :D
[11:38:12] petan, when will labs go back up?
[11:38:19] now
[11:38:24] Cyberpower678: they are back
[11:38:38] * Coren tries to figure out what happened.
[11:38:56] petan: there was an open ticket about spam on the BetaCluster?
[11:38:59] tools-login may take a moment to recuperate.
[11:38:59] they're back!?
[11:39:00] Coren: read -operations backlog
[11:39:17] Coren: I hope you are not rebooting -login again...
[11:39:32] petan: It shouldn't be necessary.
[11:39:45] petan: i cannot find it. But i think we can close the ticket now. The problem seems to be resolved.
[11:39:53] It's absolutely UNnecessary
[11:40:06] toolsbeta-login was affected by this as well and is back up without reboot
[11:40:25] !log tools petrb: tools is back o/
[11:40:27] Logged the message, Master
[11:40:29] toolsbeta-login also doesn't have quite as many wedged cron jobs thrashing. :-)
[11:40:37] yes I know :>
[11:40:40] So it'll recuperate faster.
[11:40:46] but -dev already works!
[11:40:58] petan, I can't access -login.
[11:42:07] Cyberpower678: It'll take some time for -login to recuperate its backlog.
[11:42:43] But I want access now Coren! :p
[11:42:59] * Coren points at -dev
[11:43:16] so scheduled jobs should return?
[11:43:35] rschen7754: They should probably already be on their way back.
[11:44:04] Coren, tools and toolsbeta seem to be having some issues.
[11:44:25] any chance someone in here is on labstore3 and rebooted it a few minutes ago?
[11:44:35] apergos: Me.
[11:44:49] ah. could you next time peek into wikimedia-operations and
[11:45:02] let us know? since I did that very thing about 10 mins earlier
[11:45:17] apergos: You have!?
[11:45:17] although I did not see mount points recover
[11:45:17] yes, I did
[11:45:19] maybe 15 mins earlier
[11:45:29] I only saw the /time/nnn ones come back
[11:45:53] Yes, sorry. I was focused on labs and didn't check -operations.
[11:45:55] are you in the other channel by any chance?
[11:53:37] Looks like -login will need a kick upside the pants.
[11:53:58] rschen7754: In fact, as far as I can tell, all continuous jobs have been restarted already.
[11:54:20] Coren: then that means that my tool failed to restart again *sigh*
[11:54:26] Coren: it's actually coming back
[11:54:34] Coren: I just got some response on one of my ssh's
[11:54:45] like the banner appeared after 10 minutes :>
[11:54:56] give it a few more minutes...
[11:55:03] petan: I know, but from what I can tell it'll take an hour to muddle through the cron backlog.
[11:55:10] :(
[11:55:14] petan: -login is meant to be easily rebootable.
[11:55:17] poor box
[11:55:32] since *nothing* should be running on it. Right everyone? *glare*
[11:55:33] ok... it's just I hate rebooting unix servers :D it's nasty
[11:55:37] Coren, does ganglia update on its own, or do we constantly have to keep hitting refresh?
[11:55:46] beta is struggling along, still giving the good ol' 503 server errors
[11:56:15] rschen7754: How do you start it, normally?
[11:56:20] Coren, ?
[11:56:38] Coren: jstart -mem 512m python /data/project/hat-collector/irc-bots/snitch.py
[11:56:44] Cyberpower678 yes it updates itself
[11:56:48] cool
[11:57:01] petan, how often?
[11:57:04] Cyberpower678 I am here as well, you don't need to ping Coren with every question :3
[11:57:20] let me check it should say that in a browser
[11:57:23] I think every few secs
[11:58:26] it's some javascript :( no idea
[11:58:44] it seems to reload the data only, not whole page
[11:59:07] rschen7754: Odd, because that should work. Note https://tools.wmflabs.org/?status
[11:59:16] petan, I see that. That's why I asked. :p
[11:59:27] Coren: yeah, it just doesn't reboot, and never has for me :(
[11:59:44] rschen7754: Are there situations where it can exit with a return value of 0?
[12:00:04] Looks like the grid load dropped dramatically.
[12:00:07] Coren: that's what i'm guessing, when it errors out for whatever reason it exits with 0
[12:00:33] i haven't looked at the code but that's my guess
[12:00:59] rschen7754, have you tried jsub continuous?
[12:01:05] maybe that's it...
[12:01:13] That's the only thing I can see offhand. The system seems to be back to normal now, so we can check into it.
[12:01:29] well jstart does the same thing as jsub -once -cont
[12:02:18] * Cyberpower678 logins
[12:02:22] anyway, bedtime
[12:02:31] YAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAY!
[12:02:41] Labs is up.
[12:03:09] Now I can install an extension on to Peachy Wiki to test something out. :p
[12:03:16] * Beetstra su-s and reboots .. ;-)
[12:05:07] Coren you should have gone btrfs <3
[12:05:10] Jsub working?
[12:05:27] * petan should have gone to pub instead of office
[12:05:53] !log tools petrb: starting toolwatcher
[12:05:55] Logged the message, Master
[12:06:39] Coren now we need to fix that route to host issue
[12:06:45] Coren try to ping s1
[12:06:54] petan: Busy atm
[12:07:13] :(
[12:09:48] Doing a repair on the filesystem; please stand by
[12:10:51] filesystems will be hung for a bit. Normal, but they will return soon.
[12:11:42] petan: i have closed this bug.... https://bugzilla.wikimedia.org/show_bug.cgi?id=39446
[12:15:10] ok
[12:18:10] Why is ssh up but down again?
[12:19:14] Coren, petan: Grid load is going up again.
[12:22:29] [08:09:48] Doing a repair on the filesystem; please stand by
[12:22:29] [08:10:51] filesystems will be hung for a bit. Normal, but they will return soon.
[12:22:29] [
[12:25:09] Nikerabbit: ^^^ :-)
[12:25:37] Load now back to 7.1
[12:28:19] Nikerabbit: beta is back up
[12:28:44] !log deployment-prep restarted both apaches. Beta has been down for a couple hours due to a NFS issue on labstore3.
[12:28:46] Logged the message, Master
[12:31:55] !log
[12:33:09] !log
[12:36:51] !doc
[12:36:51] !doc
[12:37:07] !doc del
[12:37:07] You are not authorized to perform this, sorry
[12:40:18] !doc del
[12:40:18] Successfully removed doc
[12:41:56] !doc
[12:41:56] https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/Help
[13:27:26] what's the link to see current running jobs?
[13:30:18] you mean this? https://gdash.wikimedia.org/dashboards/jobq/
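Coren's question at [11:59:44] is the crux of rschen7754's restart problem: if a continuous grid job is only resubmitted when it exits nonzero, a bot that hits an error path but exits 0 stays dead. One defensive pattern, sketched as a hypothetical wrapper around the jstart command quoted at [11:56:38] (assuming restart-on-nonzero semantics):

    #!/bin/bash
    # run-snitch.sh (hypothetical): make every exit look like a failure,
    # so the grid's restart-on-nonzero logic always resubmits the bot.
    python /data/project/hat-collector/irc-bots/snitch.py
    rc=$?
    [ "$rc" -eq 0 ] && rc=1    # a continuous bot that "finishes" has died
    exit "$rc"

Submitted as something like `jstart -mem 512m bash run-snitch.sh`.
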
[13:30:21] or something else?
[13:30:35] apergos I think he means this:
[13:30:36] !status
[13:30:40] er
[13:30:46] http://tools.wmflabs.org/?status
[13:30:50] ah :-D
[13:30:51] !status is http://tools.wmflabs.org/?status
[13:30:51] Key was added
[13:30:58] I'm in the wrong channel for prod
[13:41:12] petan: thanks
[13:46:20] !screenfix
[13:46:20] script /dev/null
[13:49:39] at least beta is running, if slowly
[13:49:50] :)
[13:54:54] !log deployment-prep attempting to enable HTTPS on the varnish text cache by applying role::protoproxy::ssl::beta
[13:54:57] Logged the message, Master
[14:02:49] !seen magnus
[14:02:50] you probably wanted to use @seen
[14:02:57] @seen magnus
[14:02:57] Steinsplitter: I have never seen magnus
[14:03:03] @seen magnusm
[14:03:03] Steinsplitter: I have never seen magnusm
[14:03:09] @seen manske
[14:03:09] Steinsplitter: I have never seen manske
[14:03:12] grrrrrrrrr
[14:13:23] @seen grrrrrrrrr
[14:13:24] zhuyifei1999: I have never seen grrrrrrrrr
[14:13:49] @seen me
[14:13:50] zhuyifei1999: Last time I saw me they were changing the nickname to , but is no longer in channel #wikipedia-zh at 5/15/2013 2:51:56 PM (46.23:21:53.7577130 ago)
[14:14:14] @seenrx magnus
[14:14:14] petan: Last time I saw magnuse_ they were quitting the network with reason: no reason was given at 9/22/2012 6:09:38 AM (282.08:04:35.6182190 ago) (multiple results were found: magnuse, magnus__)
[14:14:21] Steinsplitter ^
[14:14:49] !log deployment-prep rebooting deployment-cache-mobile01
[14:14:53] Logged the message, Master
[14:14:56] 2012. thx :D
[14:15:04] ?
[14:15:17] 282 days ago :P
[14:15:22] but that is for magnuse
[14:15:28] @seen magnus__
[14:15:28] petan: Last time I saw magnus__ they were quitting the network with reason: Quit: Page closed N/A at 6/28/2013 3:22:11 PM (2.22:53:17.4627430 ago)
[14:16:02] @seen zhuyifei1999
[14:16:02] zhuyifei1999: are you really looking for yourself?
[14:17:20] @seen you
[14:17:20] zhuyifei1999: I have never seen you
[14:22:35] !log tools petrb: installing following packages on grid: libdata-dumper-simple-perl libhtml-html5-entities-perl libirc-utils-perl libtask-weaken-perl libobject-pluggable-perl libpoe-component-syndicator-perl libpoe-filter-ircd-perl libsocket-getaddrinfo-perl libpoe-component-irc-perl libxml-simple-perl
[14:22:37] Logged the message, Master
[14:25:40] petan: Do you get the error mails from /data/project/admin/scripts/rotate_logs.sh as well?
[14:25:55] yes, that is correct
[14:26:00] there was an NFS outage
[14:26:04] this morning, again
[14:26:48] No, this morning there wasn't any, but at least Friday and Saturday it complained.
[14:26:56] aha...
[14:27:05] there are many emails... let me find it
[14:27:08] what is in the subject?
[14:27:22] I got it
[14:27:33] take: not found?
[14:27:43] that e-mail is pretty old..
[14:29:17] yes I see...
[14:29:31] that sucks at some point, but I can't think of a simple solution, there are 2 problems...
[14:29:42] 1 is fixable
[14:30:11] there is a bug in that log parser and there is also a problem with file permissions... maybe g+s could work
[14:31:00] something creepy is going on with nfs server
[14:32:21] petan: Eh, the (last) mail is from Saturday 23:50Z. Sure we're talking about the same mail?
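The "maybe g+s could work" at [14:30:11] refers to the setgid bit on directories: files created inside inherit the directory's group, which sidesteps the usual shared-log permission problem where each writer's files end up owned by a group the rotation script can't touch. A generic sketch; the group and path are illustrative, not the actual admin-tool layout:

    chgrp tools.admin /data/project/admin/logs    # illustrative group/path
    chmod g+ws /data/project/admin/logs
    ls -ld /data/project/admin/logs
    # drwxrwsr-x ...  — the 's' in the group slot marks setgid;
    # new files in here now get the directory's group automatically
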
[14:33:17] yes probably, there is 1 exception and some unix errors
[14:33:31] the exception is a rather minor issue (easily fixable) but the other problem is worse
[14:34:43] !log deployment-prep applying role::protoproxy::ssl::beta on deployment-cache-mobile01 (attended to replace varnish-t3 for mobile caching)
[14:34:45] Logged the message, Master
[14:35:18] Why's wikitech logo not transparent? (https://wikitech.wikimedia.org/wiki/Talk:Main_Page#File:Wikimedia_labs_logo.svg)
[14:35:31] petan: Well, if it is easily fixable, perfect :-). For the hard stuff, there's always Apache 2.4 on the horizon :-).
[14:35:41] Steinsplitter: I think Magnus' nick is Magnus_Manske.
[14:35:48] scfc_de it's not yet even on beta tools...
[14:35:51] !log deployment-prep updated puppet repository on deployment-varnish-t3 and running puppet there
[14:35:54] Logged the message, Master
[14:35:55] that is going to take a lot of time for it to happen :/
[14:36:16] also I don't know if apache 2.4 can really help with this
[14:36:24] petan: You need a higher mountain, then you can look further.
[14:36:29] AFAIK it was only able to generate error logs in a better way
[14:36:42] we are talking about global access logs here
[14:38:50] !log deployment-prep rebooting deployment-varnish-t3
[14:38:53] Logged the message, Master
[14:40:01] !log deployment-prep binding mobile IP address 208.80.153.143 to deployment-cache-mobile01
[14:40:04] Logged the message, Master
[14:40:39] !log deployment-prep shutdowning deployment-varnish-t3 (replaced by deployment-cache-mobile01
[14:40:49] Logged the message, Master
[14:40:55] http://httpd.apache.org/docs/2.2/logs.html#piped looks interesting.
[14:42:13] !log deployment-prep Migration to the new mobile instance was tracked by {{bug|49469}}
[14:42:16] Logged the message, Master
[14:52:25] Can others access the webserver (http://tools.wmflabs.org/)?
[14:53:26] Coren: why does ganglia tell me that the nfs server is down?
[14:53:36] labstore3 is down according to it :/
[14:53:38] petan: Because it just died.
[14:53:44] D:
[14:54:08] looks like a crash again
[14:54:30] Those were very short 14 days.
[14:54:34] * Coren stares at the hardware. 'Work, dammit'
[14:54:43] scfc_de: Different issue, I think.
[14:56:16] ah, down again.
[14:56:19] * YuviPanda goes to do other stuff
[15:06:13] Cyberpower678 that's because mediawiki change too often and maintains no backward compatiblity :P blame poor mediawiki devs, after that you can go and blame me [15:06:18] Coren: What FS do you use? [15:06:36] petan, I thought it monitors RC Feed. [15:06:37] scfc_de: XFS atm. I'm making a copy to ext4 and will switch to it. [15:06:41] ok, bailing from here, back to the regular too many channels [15:06:51] So that we can figure out if the bug is xfs-dependent, at least. [15:06:57] So if it's monitoring the RC feed, why would developers cause it to fail? [15:07:08] Cyberpower678 hmm... let me check something [15:08:56] WOW [15:09:01] something really nasty is going on [15:09:13] wm-bot's log has almost several gb [15:09:26] like thousands of exception per minute o.O [15:09:34] petan, somehow I'm not surprised. [15:09:36] :p [15:11:05] something had to be changed in format of rc feed [15:11:14] there are too many invalid symbols now [15:11:46] ah noez [15:11:58] this is a different plugin, it is crying because of rss reader [15:12:11] This module is not currently loaded in core [15:12:36] @system-rm Feed [15:12:36] Unloaded module Feed [15:12:46] Coren: Different topic: Is the only reason to upgrade to Apache 2.4 the ability to split error logs? [15:13:05] scfc_de: No, also better per-user virtual hosts. [15:14:22] Coren: Okay. Because you seem to be using some piped logsplitter already, and looking at the 2.2 doc, it seems to be possible to pipe /error/ logs to another program as well. [15:18:21] Cyberpower678 was it today? [15:19:08] I must say that error log for today doesn't really say much [15:21:58] petan: Coren unrelated note, everything from labs-l seems to go into gmail's spam, with message: [15:21:59] Be careful with this message. Many people marked similar messages as spam. Learn more [15:22:02] just afyi [15:22:15] petan, several days. [15:27:14] Cyberpower678 aha... do you know when it was? [15:27:25] Cyberpower678 otherwise it's quite hard for me to check what was wrong [15:27:56] YuviPanda LOL [15:28:06] YuviPanda what the ...... [15:28:12] Let's see... [15:28:23] * petan stabs google in the eye [15:29:12] petan, check the logs for "@RC+ meta_wiki Requests_for_comment/X!'s_Edit_Counter" [15:29:23] there are no such logs [15:29:31] unless you have enabled public logging [15:29:37] Doesn't wm-bot log this channel? [15:29:44] this one? yes [15:30:01] !searchlog [15:30:08] I used that command here accidentally when I meant to use it elsewhere. [15:30:17] !searchlog @RC+ meta_wiki Requests_for_comment/X!'s_Edit_Counter [15:30:27] LOL [15:30:46] Results: Found 1 in 6.077189207077 seconds [15:30:47] http://bots.wmflabs.org/~petrb/logs/#wikimedia-labs/20130701.txt: [15:30:48] On line 691: [15:29:12] petan, check the logs for "@RC+ meta_wiki Requests_for_comment/X!'s_Edit_Counter" [15:30:51] XD [15:30:59] !searchlog is http://bots.wmflabs.org/~wm-bot/searchlog [15:31:00] Key was added [15:33:07] nothing in logs from that day, except for some unrelated stuff [15:33:24] I forgot to tell you I found which day you posted this [15:35:10] o/ addshore so here you are :D [15:35:21] :O ya [15:35:34] why? :) [15:36:01] is beer in Berlin cheaper than on hackaton? :P [15:36:16] I hope so otherwise that octoberfest is going to suck lol [15:36:37] like amsterdam was 10 times more expensive than here :D [15:36:45] petan, I swear I used that command here. [15:36:54] Cyberpower678 yes I know, I found it [15:37:00] Logging was initiated yesterday. 
[15:37:05] yes I know I know [15:37:11] http://bots.wmflabs.org/~petrb/logs/#wikimedia-labs/20130630.txt [15:37:14] here it is [15:37:24] but still, there is no error in system logs [15:37:28] from that day [15:37:48] but TBH whole this RC feed module is crap [15:37:52] it needs a lot of love [15:38:15] whole wm-bot is kind of crap... [15:39:28] but there are other things I need to do before I start fixing its code [15:41:15] PROBLEM Current Load is now: CRITICAL on tools-webserver-01.pmtpa.wmflabs 10.4.0.128 output: CRITICAL - load average: 48.39, 29.83, 17.16 [15:41:18] o.O [15:41:36] Cyberpower678: BTW, good time to fix your crontab :-). [15:41:51] Coren: did you say "little" performance loss [15:41:54] scfc_de, no time. [15:41:57] Is "local-jimmy" around here? [15:42:04] petan: tools-login was over 550 load average during the crash yesterday :p [15:42:19] I also didn't count on the filsystem making the server crash. [15:42:31] wolfgang42: symptom, not cause. [15:42:35] wolfgang42 that number is lying often, I've seen machines with hundreds of thousands of load and they were actually just fine :P [15:42:50] scfc_de, working on other things right now. [15:42:51] Still. [15:43:02] I will be free on July 4th. [15:43:09] * YuviPanda makes 'over 9000' joke [15:43:20] nice load up there petan ;p [15:43:24] Ima leave NFS down while I make the copy and switch now. [15:43:26] OVER NINETHOUSAAAAAAAAAAAAAAAND [15:43:28] Coren why is it crashing? is it kernel? [15:43:38] petan: kernel oops in xfs [15:43:54] wtf old version you have? [15:44:17] don't tell me this xfs bug is in latest stable o.o [15:44:27] * kernel [15:44:48] petan, so when is the feed module getting fixed? [15:45:43] Cyberpower678 as soon as I find out what the problem is (unlikely to happen but if I did, it would be eventually soon) or when I completely rewrite it (more possible, but needs a lot of time) [15:46:02] or, when you fix it! [15:46:20] it's open source - is it broken? fix it! [15:46:29] topic diff <3 [15:46:43] Coren: Do you have an ETA when the switch will be done? Minutes, hours? [15:47:17] days? [15:47:18] years? [15:47:20] :0 [15:47:22] scfc_de: The copy is the long operation. I expect, with no NFS access... 30-45 min? Or so. [15:47:49] Labs has stopped responding again. [15:47:56] Cyberpower678 o'rly? [15:47:57] Coren: That's much shorter than I expected. [15:48:14] Cyberpower678 maybe it's related to nfs server being down ;) [15:48:30] AGAIN? This is the third time. [15:48:38] Within 24 hours. [15:48:45] Cyberpower678: rtft [15:48:47] it's all part of one big time, I don't think it fully recovered from the first time [15:48:59] (read the fucking topic) [15:49:10] :P [15:49:54] My topic has "NULL" written in it. [15:49:58] so, accoring YuviPanda there was actually one big outage, since labs started. that's a good statistic :) [15:50:20] Cyberpower678 upgrade your irc client! :P [15:50:40] preferably to some made by cool programmers, like petan [15:51:35] petan, nevermind. Topic is now visible and lag is 15 seconds. :O [15:53:30] * Cyberpower678 's internet is crapping out. [15:53:32] :/ [15:54:29] petan, how am I supposed to fix wm-bot? [15:54:46] Cyberpower678 checkout source code, find a bug, fix it, submit a patch [15:54:54] This bot is running http://meta.wikimedia.org/wiki/WM-Bot version wikimedia bot v. 1.10.8.15 source code licensed under GPL and located at https://github.com/benapetr/wikimedia-bot [15:57:21] * Cyberpower678 has no fucking clue on how to program in C#. 
[15:57:25] petan, ^
[15:57:35] learn
[15:57:43] it's almost like c++
[15:57:49] just more rich stdlib :P
[15:57:51] petan, I'm learning C++ next semester.
[15:58:13] I was learning C++ several years before we started programming in school... that's a very lame excuse
[15:58:51] sooner you start learning, faster you master it
[15:59:00] petan, no. I learned other languages.
[15:59:09] but c++ is like english in programming
[15:59:12] or, rather c
[15:59:14] not c++
[15:59:21] the other languages are not so important
[15:59:32] \ignore petan
[15:59:35] err
[15:59:37] :O
[15:59:37] oops
[16:00:05] python is like mongolian
[16:00:06] btw
[16:00:07] :P
[16:00:13] * Cyberpower678 hates python
[16:00:18] !pythonguy
[16:00:18] this guy master python more than you: http://lh5.ggpht.com/-gjDgXLXmWTQ/TsuuOwSKWHI/AAAAAAAAk4w/XJOKxaGti-c/boy%252520python%252520bff%252520snake%2525206_thumb.jpg
[16:00:37] LOLZ
[16:00:43] *lol*
[16:00:46] !doc
[16:00:47] https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/Help
[16:00:54] !docs
[16:00:54] https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/Help
[16:01:01] !doc del
[16:01:01] Unable to find the specified key in db
[16:01:07] !del doc
[16:01:08] if you want to delete a key, this is wrong way
[16:01:11] meh
[16:01:13] hold on
[16:01:15] :/
[16:01:18] there is no !doc key
[16:01:26] it's autocompletion which shows it
[16:01:30] !do
[16:01:30] There are multiple keys, refine your input: docs, domain,
[16:01:33] you see
[16:01:44] !doc is What's up doc?
[16:01:44] Key was added
[16:01:48] !doc
[16:01:48] What's up doc?
[16:04:36] Cyberpower678: Why do you hate python?
[16:05:07] wolfgang42, it hogs unnecessary resources.
[16:05:32] And the language doesn't look pleasant.
[16:05:34] Use pypy then, or, if you must, shedskin.
[16:06:48] Personally I prefer Python because I only have to write pseudocode, and I don't have to have redundant brackets
[16:07:25] wolfgang42, it makes you a lazy programmer
[16:09:03] Cyberpower678: This is a bad thing?
[16:09:15] wolfgang42, hell yea.
[16:09:32] python makes you forget decent programming.
[16:09:59] I prefer not to have to waste effort. Why spend time fiddling with the language when I could be thinking about what I want the program to *do*?
[16:11:12] python makes you forget decent programming.
[16:11:19] do you prefer PHP instead?
[16:11:37] YuviPanda: :D
[16:12:40] And I agree with petan that C is the best language for a good grasp of programming. C doesn't allow for any laziness.
[16:13:01] YuviPanda: PHP makes you remember decent programming, as you continually lament being unable to decently program ;)
[16:13:04] Which is why I intend to learn it next semester.
[16:13:49] anomie, I don't get it.
[16:14:09] Cyberpower678: Random PHP bashing
[16:14:11] I need to learn C at some point. I've done some stuff with it, but I've never actually sat down and *learned* it.
[16:14:29] I'm learning it next semester. :)
[16:15:09] I may have to revise my estimate a bit to 1h or so; the rsync takes longer than I expect. Lots and lots of small files. :-)
[16:15:53] good morning, I guess you are aware that gluster seems to have issues again?
[16:16:08] gwicke, read the topic.
[16:17:22] ah, so the switch to NFS already happened?
[16:17:50] /data/project in parsoid-spof is read-only
[16:17:59] normally that was a Gluster issue
[16:20:55] hmm, http://en.wikipedia.beta.wmflabs.org/ down?
[16:21:09] andre__, topic
[16:21:26] ah. reading. thanks :)
[16:30:34] any updates?
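The XFS-to-ext4 migration is an rsync copy dominated by many small files ([16:15:09]), which makes it seek-bound rather than bandwidth-bound. A generic sketch of the kind of invocation such a move usually uses, preserving hard links, ACLs and extended attributes; the paths are placeholders, not the real labstore layout:

    # Copy one filesystem tree to another while preserving everything:
    # -H hard links, -A ACLs, -X xattrs, --numeric-ids for a server disk.
    rsync -aHAX --numeric-ids --delete /srv/project-xfs/ /srv/project-ext4/
    # A second pass afterwards (to catch changes made during the first)
    # is cheap compared to the initial copy.
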
[16:31:03] AzaToth: -operations seems to have more chatter
[16:42:59] gwicke_away: Not for that project. There may well be an independent and unrelated gluster issue. :-)
[17:07:37] Load now at 1.9k
[17:10:16] petan, wm-bot failed again.
[17:10:28] It should have reported an update just now.
[17:11:20] Let's just say that there is little point in looking at load on servers that are, currently, without a live filesystem. :-)
[17:11:32] heh
[17:11:32] 310/376 of the way.
[17:19:40] Coren: had you tried switching the scheduler?
[17:19:52] you're currently switching from xfs -> ext4?
[17:20:19] was there anything that suggested a filesystem bug? it looked like an NFS kernel bug to me
[17:25:29] Ryan_Lane: Different problem. We got corruption on the XFS filesystem that quickly degraded since last night's reboot.
[17:25:36] And yes, last night I switched schedulers.
[17:25:52] Load now at 2.2k
[17:26:01] Cyberpower678: You can stop the updates now.
[17:26:12] Coren, why? :p
[17:26:15] 322/376 progress
[17:27:28] Cyberpower678: Load average 2.2k?!
[17:27:46] sudo shutdown -h NOW!
[17:27:48] The number is going up.
[17:28:01] But
[17:28:02] (376 - 322) * (17:26 - 17:11) / (322 - 310) = 67.5 minutes to go?
[17:28:03] oh
[17:28:08] Load 2200?!
[17:28:18] 2200.00?!
[17:28:20] scfc_de: G not minutes. :-)
[17:28:33] THAT'S OVER 9000!!!! No... wait, it's not.
[17:28:39] How
[17:28:41] Just how
[17:28:47] Did someone forkbomb it?
[17:29:22] FastLizard4: Kinda. Some people are overly generous with cron jobs; combine that with no filesystem = pile up of cron. :-)
[17:29:34] Coren: Aaaaaagh
[17:29:38] Yeah, sudo shutdown -h now
[17:29:39] :P
[17:30:03] no, "$PJES2" then "Z EOD" and in the end "QUIESCE"
[17:30:10] BTW, if someone does that, will "Reboot instance" bring the instance back to life?
[17:30:29] scfc_de: What, a `shutdown` command?
[17:30:31] I imagine so
[17:30:35] scfc_de: Yes.
[17:30:48] Not that it'd be useful /now/ mind you.
[17:31:09] Can you pull the virtual plug?
[17:31:53] FastLizard4: Why?
[17:32:03] Coren: For other projects it might be interesting if you do development "nine to five" and you don't need all instances running 24/7.
[17:32:09] Coren: Well, if you can't shutdown the system
[17:32:42] scfc_de: The overhead cost of an idle instance is negligible.
[17:33:03] And a downed instance gets no updates.
[17:33:18] I guess
[17:33:26] Still, a load average of 2200.00 is absurd :P
[17:33:28] Coren: True.
[17:33:50] Literally the highest load average I've ever seen is 800
[17:34:48] If I had a machine pushing 2200, I'd kill it if for no other reason than to clear the process table
[17:35:21] What instance is this, anyway?
[17:35:36] FastLizard4: Due to the cron jobs, it would fill up immediately after reboot. There's no harm.
[17:36:23] True
[17:38:02] FastLizard4, no the highest you've seen is 2.5k. Which the load is at now.
[17:38:27] Cyberpower678: No, it's not the highest I've seen until I see it with my own eyes in htop :P
[17:38:44] then go look.
[17:38:54] But no one's told me what instance it is! :P
[17:39:02] And I'm at work so I don't have my SSH key son hand :P
[17:39:07] *SSH keys on hand
[17:39:37] -login
[17:39:53] tools
[17:39:55] bots
[17:40:39] FastLizard4: Coren: For other projects it might be interesting if you do
[17:40:45] And a downed instance gets no updates. [17:33]
[17:40:49] I guess
[17:40:52] Still, a load average of 2200.00 is absurd :P
[17:40:55] Coren: True.
[17:40:58] Literally the highest load average I've ever seen is 800
[17:41:01] scfc_de: o_O?
[17:41:03] *** Jasper_Deng_away (~chatzilla@wikimedia/Jasper-Deng) is now known as
[17:41:03] Jasper_Deng [17:34]
[17:41:06] If I had a machine pushing 2200, I'd kill it if for no other
[17:41:09] reason than to clear the process table
[17:41:12] What instance is this, anyway? [17:35]
[17:41:13] scfc_de, WTF?
[17:41:22] FastLizard4: Due to the cron jobs, it would fill up immediately
[17:41:22] after reboot. There's no harm.
[17:41:22] O_O
[17:41:22] True [17:36]
[17:41:25] FastLizard4, no the highest you've seen is 2.5k. Which the
[17:41:28] load is at now. [17:38]
[17:41:31] Cyberpower678: No, it's not the highest I've seen until I see it
[17:41:34] with my own eyes in htop :P
[17:41:38] then go look.
[17:41:41] But no one's told me what instance it is! :P
[17:41:43] And I'm at work so I don't have my SSH key son hand :P [17:39]
[17:41:44] Someone get a channel op and add +q to scfc_de
[17:41:46] *SSH keys on hand
[17:41:50] -login
[17:41:52] tools
[17:41:55] bots
[17:41:57] Nah, he's almost done :P
[17:41:58] ERC> FastLizard4: http://ganglia.wmflabs.org/latest/?r=hour&cs=&ce=&m=load_one&s=by+name&c=tools&h=tools-login&host_regex=&max_graphs=0&tab=m&vn=&sh=1&z=small&hc=4
[17:42:04] Well, eh, sorry.
[17:42:13] it's okay, I've done that OO :P
[17:42:22] *too, even
[17:43:37] I guess from your delayed responses the paste didn't appear all at once?
[17:44:08] scfc_de: No, your client is obviously smart enough to not flood itself out. :-)
[17:44:39] scfc_de,
[17:44:40] it
[17:44:42] was
[17:44:44] coming
[17:44:46] in
[17:44:48] like
[17:44:50] this
[17:45:54] Need to look up if I can enable some safeguard à la "don't send it if it's more than x lines".
[17:46:05] *cough* irssi *cough(
[17:46:09] s/(/*/
[17:46:09] :P
[17:47:29] FastLizard4: Isn't written in Lisp, I believe :-).
[17:47:42] It's written in Perl :P
[17:48:03] Wait, is it?
[17:48:04] No.
[17:48:11] Irssi is C but supports Perl addons
[17:49:17] ERC has "flood control", but apparently it just delays the messages, and doesn't ask: "Are you serious?"
[18:04:08] (The only difference about the weeklies is that they /aren't/ deleted 3 days later) [18:04:20] Cyberpower678: That shouldn't have been a cause. [18:04:33] Cyberpower678: yeah, you aren't causing filesystem issue [18:04:36] *issues [18:04:42] Cyberpower678: You did go quite overboard with the cron every minute though. :-) [18:04:59] Better to sleep in your script than have cron start every minute. :-) [18:05:01] Coren, I plan to fix those when my time frees up. [18:05:33] I plan to make everything continuous. [18:05:36] 362/376; the pace picked up a little. [18:05:44] Coren, LIKE [18:06:08] Cyberpower678: For starters, you just need to replace "* * * * *" with " * * * *". [18:06:30] scfc_de, umm...no. [18:06:38] That doesn't make it continuous. [18:06:54] I'm going to convert a lot of the scripts to be continuous. [18:07:55] Ryan_Lane: I'm going to disable timetravel for the time being, keeping things on the simple side for now. I'll experiment with stability in eqiad before restarting it. [18:08:30] * Ryan_Lane nods [18:08:49] Coren: are things back up? [18:09:09] Betacommand: Soon. The copy is almost complete, and it's just a few minutes after that. [18:10:41] * Cyberpower678 's RAM is overloading. [18:11:01] 11.21 GB of 15.9 GB used [18:11:17] Cyberpower678: You shouldn't really need to convert anything. If it's a script and it should be restarted once it exits, "jstart