[00:00:15] And ^C on tools-dev doesn't abort whatever is hanging; so reading blocking on device?
[00:00:29] YuviPanda: The bash thing on tools-*login*?
[00:00:34] aye
[00:00:43] and then my mosh died
[00:01:36] Should I reboot tools-login, then? Might not solve the tools-dev issues, but could it hurt?
[00:02:06] Coren, petan: You're both offline?
[00:02:41] yeah
[00:02:56] scfc_de: http://tools.wmflabs.org/?status doesn't seem to work either
[00:03:02] so maybe the entire cluster's out?
[00:03:10] i've no way of checking SGE
[00:03:32] AnomieBOT is still making edits, as of a few minutes ago anyway
[00:03:55] hmm
[00:04:26] !log tools Rebooted tools-login apparently out of memory and not responding to ssh
[00:05:29] Hah, the bot seems to be down as well. But doesn't it live on Bots?
[00:05:29] Ah, that was what that ding was. "Power button pressed."
[00:05:49] wolfgang42: You had an open session?
[00:06:06] scfc_de: Open, but not responding.
[00:06:19] I had mariadb open at the time, and I couldn't exit.
[00:08:25] hmm
[00:08:26] morebots has left
[00:09:11] Now "ssh tools-login.wmflabs.org" works, but hangs at the same step (after "No mail."). Fuck.
[00:11:19] do we have some sort of a 'panic' button?
[00:12:06] I have, right next to my Ctrl key.
[00:12:07] * wolfgang42 pushes the panic button.
[00:12:15] Nothing happens.
[00:15:40] scfc_de: Yours was hanging after "No mail"? Mine is now, but before it wasn't getting past the access problems notice. So there's progress, I guess.
[00:16:45] wolfgang42: On tools-dev, yes. Now tools-login is in the same (still unusable) status as tools-dev.
[00:17:03] well, I have You have new mail.
[00:17:03] :P
[00:17:11] but that's because i've setup forwarding.
[00:17:45] From Puppet, I think NFS is on labstore1. If I look at ganglia.wikimedia.org, that seems to be "okay". I have no idea what's the problem here.
[00:19:41] Does someone see anything obvious in https://wikitech.wikimedia.org/w/index.php?title=Special:NovaInstance&action=consoleoutput&project=tools&instanceid=51f3a377-90b5-4f46-8590-5624c50c5dbf&region=pmtpa ?
[00:20:11] "task jsub:3012 blocked for more than 120 seconds"?
[00:20:12] you've to be projectadmin to see that, scfc_de
[00:20:18] maybe SGE is dead?
[00:24:24] http://pastebin.com/2vCBtMES
[00:25:21] tools-master (the grid-master) is in the same state. Also I don't think something in /etc/profile.d (or wherever in the login startup chain) contacts the grid master, so it shouldn't hang.
[00:25:59] scfc_de: "nslcd[2266]: [3c9346] error writing to client: Broken pipe" might be bad news, that seems to be related to LDAP.
[00:26:23] scfc_de: /scripts/local-premount/rescuevol: line 72: add_mountroot_fail_hook: not found
[00:26:30] mabbe
[00:26:30] ?
[00:26:36] toolsbeta-login.pmtpa.wmflabs hangs as well, so it doesn't seem to be limited to Tools.
[00:27:32] autofs might also be fucked? initctl: Unknown job: S20autofs
[00:28:17] Bots however works. NFS as the common denominator? Then we would be stuck until Coren, Ryan_Lane or andrewbogott come online.
[00:29:20] true
[00:29:29] the analytics project (limn0 instance) also works
[00:29:31] so probably NFS
[00:30:37] I'll write a mail to Coren and Ryan. Until they fix that I fear we must work on other projects :-).
[00:30:45] whaaat nooooooo!
[00:30:46] :P
[00:30:55] i was working on VisualEditor stuff and when I come back labs goes down :(
[00:33:43] When did the troubles start? 23:30Z?
[00:34:09] 04:55IST,
[00:34:15] so -0530
[00:34:20] that'll be...
[00:34:23] 23:25Z?
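A note on the "blocked for more than 120 seconds" message at [00:20:11]: that is the Linux hung-task watchdog flagging a process stuck in uninterruptible sleep, which is exactly what a dead NFS mount produces. A minimal sketch of how one might look for these from a root shell — standard dmesg/sysctl usage, nothing Labs-specific assumed:

    # Kernel hung-task warnings: tasks stuck in uninterruptible sleep
    dmesg | grep 'blocked for more than 120 seconds'
    # The 120-second threshold comes from this sysctl:
    sysctl kernel.hung_task_timeout_secs
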
[00:35:18] scfc_de: ^
[00:35:46] YuviPanda: Thanks.
[00:36:08] that was when wolfgang first reported it
[00:36:17] i checked and had issues at that time, but could type things into my mosh connection
[00:36:21] (got the bash: fork thing)
[00:43:07] Why isn't -login responding?
[00:43:28] Cyberpower678: Scroll back.
[00:46:07] Where am I scrolling back to?
[00:47:50] Why is labs down?
[00:47:52] TLDR
[00:48:02] I'm very sleepy.
[00:49:35] Apparently NFS failure.
[00:49:53] TOOOOOOOOOLSEEEEEEEERVEEEERRRR
[00:52:00] * Cyberpower678 trouts Coren
[00:54:02] Helllo! Is tools.wmflabs.org working?
[00:54:13] JohnMarkOckerblo, no.
[00:54:22] scfc_de, bots seem to be down too. https://en.wikipedia.org/wiki/User:Cyberbot_I
[00:55:06] Cyberpower678: The Bots project or bots running on Tools? If the latter, that's to be expected.
[00:55:15] (and is there a standard place to look for server status reports?)
[00:55:44] Tools because it has replication.
[00:56:03] And other various toolserver environments.
[00:56:44] JohnMarkOckerblo: I posted a short note on the mailing list a few minutes ago, but we should change this channel's topic as well.
[00:56:47] * anomie notes that AnomieBOT is still happily editing away
[00:57:10] * Cyberpower678 smacks anomie for bragging. ;p
[00:57:30] I'm requesting Bots access.
[00:57:42] Well, at least I know it's not my script going haywire. (hopefully.)
[00:58:03] I'm going to set up a backup there, should it stall on Tools in the future.
[00:58:14] Cyberpower678: I'd guess anything that gets launched from a cronjob is not being started, but continuous processes will keep running, until maybe they happen to run into whatever is breaking stuff
[00:58:33] anomie: Do you do file system stuff?
[00:58:37] anomie, well I use cronjobs.
[00:59:03] That does explain a lot, since the script that is running is loaded in the active memory already.
[00:59:04] scfc_de: It writes to stdout/stderr with each edit
[00:59:56] anomie, https://en.wikipedia.org/wiki/User:Cyberbot_I This is where my status updater comes in handy. ;p
[01:00:11] scfc_de: And it will periodically check for a certain file's existence, for "restart" commands. And possibly it will update its cookie file, too.
[01:00:59] hm. If there's a space issue, as prior logs suggest, my cgi script does write a log into my directory, but I think that's on the big shared partition w lots of space.
[01:01:14] ('course, I can't actually get a shell now to verify. :-)
[01:01:57] JohnMarkOckerblo: Definitely not your problem.
[01:02:33] anomie: Don't know if all NFS is stuck, or there's some buffering?
[01:02:58] How goes the tools troubleshooting?
[01:04:06] scfc_de: I don't know NFS *that* well, but IIRC it's pretty good about dying as soon as something goes wrong. I really suspect the login issue is whatever is making nslcd error out (not that I know much about that either).
[01:07:31] anomie: But then login shouldn't work at all, or?
[01:08:07] wolfgang42: Without Coren or Ryan, there's nothing we (or I) can do.
[01:08:30] any estimates on uptime? (I do have another server with the same script, and could potentially point my templates to it, but probably not a good idea unless tools is going to be down for extended period.)
[01:08:46] * Cyberpower678 smacks petan in hopes that he'll respond.
[01:08:53] scfc_de: Any idea why it's rejecting my ssh connection entirely now? (Connection closed by 208.80.153.224)
[01:09:16] wolfgang42, verifying your claim...
[01:10:09] wolfgang42: Don't know, maybe out of memory again?
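The "NFS as the common denominator" hypothesis from [00:28:17] is the kind of thing that can be checked per instance by listing which mounts are NFS at all. A minimal sketch using only standard util-linux/coreutils; the labstore1 export in the comment is illustrative, inferred from the Puppet remark at [00:17:45]:

    # List active NFS mounts on this instance
    mount -t nfs,nfs4
    #  e.g. labstore1:/... on /data/project type nfs (rw,...)   (illustrative)
    # Probe a mount with a timeout, so the probe itself cannot hang the shell
    timeout 5 stat -t /data/project || echo "/data/project not responding"
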
[01:10:19] Unable to establish SSH connection now.
[01:10:52] What's labs' monitoring page?
[01:11:06] With all of those graphs, and doohickies.
[01:11:26] scfc_de, ^
[01:11:29] JohnMarkOckerblo: Worst case probably morning US east coast (10 hours?).
[01:11:46] Cyberpower678: ganglia.wmflabs.org
[01:12:07] * anomie notes that deployment-bastion.pmtpa.wmflabs seems to be having the same issue...
[01:12:07] JohnMarkOckerblo, I agree with scfc_de. I estimate 9-11 hours before uptime.
[01:12:35] anomie: I think hashar converted beta to NFS as well.
[01:13:03] Grid load seems to be increasing.
[01:13:26] Labs may totally crash at some point.
[01:14:15] Interestingly, http://ganglia.wmflabs.org/latest/?p=2&c=tools&h=tools-login lists tools-login as being up.
[01:14:55] Cyberpower678: Only tools-login is running amok. We'll need to look at the crontabs some time.
[01:14:55] This seems to be a repeat of what happened before.
[01:15:21] hm. Not good, but at least it's not during the semester. Might switch templates to point to Penn if things are still down by noon east coast time.
[01:15:22] Though, on closer inspection, I've never seen a load avg. anywhere close to the listed 855.36
[01:16:00] And if that's the case, I may be in part responsible for stalling -login to the point of unusability.
[01:16:20] I've not yet had the chance to adjust my crontab.
[01:17:30] Cyberpower678: Yes and no. "* * * * *" isn't nice, but you (and everyone else) probably couldn't log into tools-login even otherwise, cf. tools-dev.
[01:18:09] scfc_de, I've been meaning to convert those to continuous scripts and submit a jsub continuous command
[01:22:38] [05:05:29 PM] Hah, the bot seems to be down as well. But doesn't it live on Bots? <-- Nope, it lives in the "morebots" project on labs.
[01:23:18] legoktm: Does that use NFS as well?
[01:23:32] Er sorry, I meant "morebots" on tools.
[01:24:01] scfc_de, wolfgang42: looks like it's marked as down now.
[01:26:25] Well, I'm not going to reboot it as it wouldn't change anything. I've mailed Coren and Ryan, and we'll have to wait for their return. Good night everybody!
[01:27:29] night scfc_de, thanks for trying :)
[02:32:04] Soo....stuff down?
[02:33:16] Yup :/
[02:33:21] NFS is down
[02:33:46] that's fine, just use TS copy
[03:35:03] I'm receiving mail from cron saying: /bin/sh: execle: Cannot allocate memory
[03:41:36] Fine time to go out to dinner.
[03:41:47] * Coren notes, with some suspicion, that we are Sunday. Again.
[03:42:06] * Coren will be back in 15
[03:49:24] good, it wasn't me at least. I run Perl and it's a bit ram crazy, but I didn't start yet.
[04:17:32] Yeah, there's a plausible known issue with a kernel bug, but I don't like the timing one bit; it occurred three times now, with 14 days between the failures. A server doesn't "randomly" fail only on Sunday evenings every other week.
[05:39:30] Coren: How long does it take from creating a tool until I can "become" it?
[05:39:44] its listed on https://wikitech.wikimedia.org/wiki/Special:NovaProject, but "become: no such tool 'mwp'"
[05:39:51] legoktm: A few minutes, normally, but you do have to log off and back on to get the group membership.
[05:40:02] Ok
[05:40:07] Also, it seems I can't remove a tool that I own?
[05:40:33] I don't need "mwp-testing" anymore if you can get rid of it
[05:46:50] Coren: Log on and off wikitech? Or tools-login?
[05:47:16] tools-login
[05:47:37] not working then :(
[05:47:45] * Coren checks.
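On the crontab thread at [01:17]-[01:18]: "* * * * *" respawns a job every minute, and the fix being discussed is submitting one continuous grid job instead. A rough sketch of that conversion, assuming the restart semantics described later in the log ([12:01:29]: jstart is equivalent to jsub -once -cont); the tool path, job name, and memory limit are hypothetical:

    # Before: cron starts a fresh copy every minute
    #   * * * * * jsub php /data/project/mytool/update.php      (hypothetical)
    # After: submit once; the grid keeps the job alive, and the script
    # loops and sleeps internally instead of relying on cron
    jstart -N updater -mem 256m php /data/project/mytool/update.php
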
[05:48:46] Not super urgent, I'll just run the script under another tool for now
[05:49:42] I see no reason why it shouldn't work. I see the group, and you belonging to it.
[05:50:29] Can I switch to your user account to test?
[05:50:31] legoktm@tools-login:~$ become mwp
[05:50:31] become: no such tool 'mwp'
[05:50:34] sure, go for it
[05:52:08] I see why.
[05:53:03] The toolwatcher broke when NFS went ill.
[05:53:07] Should be up and running.
[05:53:57] woot, thanks
[05:54:04] * Coren goes to sleep now.
[05:54:10] gnite
[06:12:27] hmm .. too late for Coren .. :-/
[06:13:34] Someone else who can install perl modules using cpan? Some I can install locally, others don't work.
[06:13:41] petan?
[07:19:09] lo
[08:36:26] !log deployment-prep Switching the text cache traffic from deployment-squid to deployment-cache-text1 by reassociating the public IP 208.80.153.219
[08:36:30] Logged the message, Master
[08:43:01] !log deployment-prep Shutdowning deployment-squid , service migrated to deployment-cache-text01 (varnish).
[08:43:04] Logged the message, Master
[09:10:05] @notify AzaToth
[09:10:06] I'll let you know when I see AzaToth around here
[09:19:08] zz_YuviPanda: ping
[09:27:36] YuviPanda what happened yesterday?
[09:27:42] ah
[09:27:44] i don't know
[09:27:47] I wasn't here
[09:27:52] tools-login ran out of memory
[09:27:54] so did tools-dev
[09:27:54] -login was rebooted
[09:27:59] and was rebooted
[09:28:02] but was still unresponsive
[09:28:04] interesting
[09:28:10] kernel logs probably mentioned NFS being fucked up
[09:28:17] and then I went to sleep
[09:28:17] well, I fixed this permanently I guess
[09:28:20] ok
[09:29:13] memory was, I think, a side effect
[09:30:33] well, terminatord will handle it :o
[09:30:36] Beetstra hi
[09:31:13] hi Petan
[09:31:29] petan: so yesterday's problem was apparently NFS fucking up
[09:31:35] nothing to do with memory
[09:31:35] yes I see
[09:31:39] (at least that's what I get from reading backlog)
[09:31:51] I am definitely not sure... I wasn't here
[09:32:13] I will check logs, however nfs should autoremount once fixed
[09:32:23] also, it's possible to repair this without reboot
[09:32:35] Can you help me later this afternoon (I have to go now for an hour or so) - basically, perl modules are difficult to install apparently, some don't, others half, some do. Coren installed POE for me yesterday because it did not want to install at all
[09:32:35] * petan hates reboots
[09:32:49] I don't even remember last time I had to reboot some server to fix any issue
[09:33:14] we have some boxes here with uptime over 1100 days
[09:33:28] is there some problem with the db server?
[09:33:33] fale which one?
[09:33:46] petan: itwiki.labsdb
[09:33:52] Beetstra ok, can you make a list? :)
[09:34:07] fale: you are correct
[09:34:25] petan: I preferred to be wrong :D
[09:34:39] fale: s2 is having problems
[09:34:46] oh
[09:34:54] in fact it seems to me that all of them are having troubles :>
[09:35:19] petan: big problems or small problems?
[09:35:28] YuviPanda how come every time I go afk for a weekend, servers are dying and systems going down :D
[09:35:31] I always miss the fun
[09:35:40] yeah, Coren|Sleep was wondering the same thing too :)
[09:35:57] petan: It seems that 12 hrs ago the s2 already did not respond correctly
[09:36:12] fale: I will ping people who can fix this...
[09:36:19] petan: thanks :)
[09:38:33] who are sleeping :(
[09:39:48] petan: :D
[09:39:59] fale: it's not servers! it's -login being fucked :)
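For the CPAN trouble at [06:13:34] and [09:32:35]: a common pattern for installing Perl modules without root is local::lib plus cpanminus, which builds into a per-user directory. A sketch under the assumption that cpanm is available (otherwise bootstrap it via `cpan App::cpanminus` first); POE is the module Beetstra mentions:

    # One-time setup: per-user library in ~/perl5
    cpanm --local-lib ~/perl5 local::lib
    eval "$(perl -I ~/perl5/lib/perl5 -Mlocal::lib)"   # exports PERL5LIB etc.
    # Then modules install without touching the system perl:
    cpanm POE
    # Persist the environment for future logins:
    echo 'eval "$(perl -I ~/perl5/lib/perl5 -Mlocal::lib)"' >> ~/.profile
    # Note: XS modules that link against C libraries can still fail without
    # the system -dev packages, which would explain the "some don't" above.
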
[09:40:04] fale: on -dev it works
[09:40:09] as well as the other boxes
[09:40:14] petan: cool
[09:41:20] petan: can you install python-zmq and python-matplotlib on tools-dev?
[09:41:29] i had it on login, but I think I should be doing this on dev instead
[09:41:43] sure
[09:41:55] everything that is on -login should be on -dev
[09:42:03] * petan slaps the person who installed this on -login
[09:42:08] hehe :P
[09:42:41] !log tools petrb: installing python -zmg -matplotlib @ dev
[09:42:44] Logged the message, Master
[09:43:10] how does -dev work?
[09:43:32] fale: it's like login, but it has more packages and less people on it, so it's generally better for everything :D but ssshhh
[09:43:51] or people will start using it and it will be just as fucked
[09:43:52] petan: :D does it have a webserver?
[09:44:15] it's part of tools infrastructure, so it's using tools webservers
[09:44:25] petan: fucked? -login: load average: 0.76, 1.15, 1.19
[09:44:45] because it's just after reboot
[09:45:09] YuviPanda btw, if problem was in NFS how come ONLY -login was affected?
[09:45:16] -dev was too
[09:45:23] but -dev wasn't rebooted
[09:45:27] !sal
[09:45:27] https://wikitech.wikimedia.org/wiki/Labs_Server_Admin_Log
[09:45:27] petan: uhm, so it's not possible to use a version of a sw as "stable" on -login and test the new one on -dev?
[09:45:31] yeah because we rebooted login and nothing happened :P
[09:45:36] fale: no
[09:45:40] so Tim decided no point in rebooting -dev
[09:45:45] fale: but you can do that on betatools
[09:45:55] petan: how does betatools work?
[09:46:02] fale: it's a beta version of tools
[09:46:11] it's exactly the same but separate
[09:46:16] and running some new / experimental things
[09:46:21] cool
[09:46:23] or configurations
[09:46:33] do I need some special permissions?
[09:46:39] http://tools-beta.wmflabs.org/
[09:46:50] not really, just tell me your username I will give you access
[09:47:07] petan: thanks :) my username is fale :D
[09:47:08] there is toolsbeta-login server, but it has no public IP
[09:47:10] petan: okay, that was installed. Thanks :)
[09:47:16] YuviPanda yw
[09:47:35] fale: you need to login there using some other box, like -login
[09:47:36] petan: I have to use bastion?
[09:47:40] or bastion
[09:47:43] I see
[09:48:39] petan: there are a lot of servers with no public ip. Does wmf suffer from ip shortage?
[09:49:09] I don't know... I just know that public IP's are given only to those who really need them
[09:49:24] yeah but I think toolsbeta-login should get one
[09:49:27] toolsbeta has 1 public ip for its webserver and that is all I needed for it
[09:49:55] YuviPanda we could give it some, but why... :P you can always use bastion
[09:50:14] well, removes the need for forwarding
[09:50:20] it's very rarely used...
[09:50:26] that is true
[09:50:27] yeah
[09:50:44] btw fale you have access there now
[09:50:54] petan: thanks :)
[09:50:58] heh, 'if you can not understand how to use bastion, you can't probably do much with toolsbeta' :D
[09:50:59] yw :)
[09:51:37] I have to go now, brb 1h
[09:52:58] petan: tools-db has the ram in critical
[09:55:46] fale: really?
[09:56:04] ah -db
[09:56:06] that's ok...
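Reaching an instance without a public IP, like toolsbeta-login above ([09:47:08]-[09:47:40]), is normally done by hopping through a bastion. A minimal ~/.ssh/config sketch; the username is a placeholder, and the bastion address is assumed from the discussion:

    # ~/.ssh/config
    Host bastion
        HostName bastion.wmflabs.org           # assumed bastion address
        User youruser                          # placeholder
    Host toolsbeta-login
        HostName toolsbeta-login.pmtpa.wmflabs
        User youruser
        # Tunnel through the bastion; newer OpenSSH spells this ProxyJump
        ProxyCommand ssh -W %h:%p bastion

After that, a plain `ssh toolsbeta-login` resolves the internal name via the hop.
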
[09:56:08] gtg now
[09:56:15] :o
[10:03:50] * fale notices that bastion wants to be rebooted
[10:05:12] they all want to be rebooted
[10:05:15] such kinky servers
[10:05:52] YuviPanda: :D
[10:06:15] hehe become: no such tool 'addbot'
[10:06:43] * Adam_WMDE thinks NFS is broken on tools
[10:07:01] just before this I got ls: cannot open directory .: Stale NFS file handle on /data/project/addbot
[10:07:39] YuviPanda: the real question is: why does it complain to me that I'm not root?
[10:07:52] it probably doesn't know
[10:07:56] Adam_WMDE: it is broken on NS
[10:07:57] err
[10:07:57] Adam_WMDE: yep, -login is having a lot of problems
[10:07:58] Tools
[10:08:03] fale: :<
[10:08:04] sortof came back, but not fully
[10:08:20] we might have to wait until Coren comes back up
[10:08:54] YuviPanda: how come a server cannot know which users are root on that particular server?
[10:09:12] i guess nobody bothered to implement it?
[10:09:18] heh, not just tools-login, I just tried to ssh from bastion2 to tools-dev and the connection closed straight away :/
[10:09:23] and you don't have to have root to be able to reboot servers.
[10:09:37] Adam_WMDE: really? I'm on tools-dev now and it works...
[10:09:40] fale: have you just added yourself to the sudoers list?
[10:09:46] YuviPanda: how did you get to tools-dev?
[10:09:55] ssh tools-dev.wmflabs.org
[10:10:02] oh, going directly in ;p
[10:10:05] Adam_WMDE: I don't think I'm in the sudoers
[10:10:13] Adam_WMDE: yeah :D
[10:10:15] fale, what project?
[10:10:42] Adam_WMDE: I think I'm not understanding the question
[10:10:52] hmm YuviPanda seems to close the connection straight away for me :/
[10:11:12] i just closed it, sshing again
[10:11:15] fale: which project on labs? tools? bots? something else?
[10:11:19] Adam_WMDE: ah, you're right now. closes for me too now
[10:11:22] :<
[10:11:37] well, nothing's gonna happen on tools today for me then.
[10:11:38] Adam_WMDE: I was talking about bastion ;)
[10:12:04] mhhm, be back a bit later! :)
[10:12:08] YuviPanda: now -login closes connections to me too
[10:12:21] yeah, we're all out again :)
[10:12:32] YuviPanda: cool :D
[10:12:43] and no roots here
[10:13:24] mhhm
[10:14:00] * Adam_WMDE pings them all for good luck Coren|Sleep petan andrewbogott_afk :>
[10:14:09] and of course Ryan_Lane ;p
[10:16:25] heh, also seems the webspace is down because of this
[10:17:20] PROBLEM - Disk space on labstore3 is CRITICAL: DISK CRITICAL - /srv is not accessible: Input/output error
[10:18:51] Hum .. guessing I chose a bad day to take off work to migrate stuff from toolserver to tool labs.
[10:19:03] yup :)
[10:19:29] Oh well. Season 3 Game of Thrones marathon it is then ;)
[10:20:00] hehe, lots of 'outages' in that too :P
[10:23:42] !log bastion Disk space on labstore3 is CRITICAL: DISK CRITICAL - /srv is not accessible: Input/output error
[10:23:54] ah of course bot is dead :-D
[10:30:29] hashar: :D
[10:30:32] Well, on the plus side, at least my bot bounced back well from the last crash :(
[10:32:37] I think there is an inverse rule between the roots around and the problems: more sysadmins available, less problems occur; less sysadmins available, more problems occur
[10:33:19] YuviPanda: I can see why Coren|Sleep was planning to remove nfs :D
[10:33:28] :D
[10:33:34] idk what he was going to replace it with, however
[10:34:00] YuviPanda: I hope something more reliable
[10:34:09] I am not sure what that is, though
[10:34:15] needs to be accessible across multiple systems
[10:34:16] and fastish
[10:35:35] YuviPanda: multiple system in the sense of multiple machines or multiple SO?
[10:35:39] *OS
[10:35:43] machines
[10:53:53] Adam_WMDE hi
[10:55:15] petan: hi :)
[10:55:48] petan: PM
[10:56:09] fale: so what is the problem?
[10:56:28] petan: -login and -dev refuse ssh connections
[10:57:00] petan: the logging bot is down, as well as the webserver
[10:57:04] fale: petan apergos is looking at it in -operations
[10:57:19] logging bot?
[10:57:22] what do you mean
[10:57:51] petan: the one responding to !log here
[10:58:25] ah, that one
[11:04:36] yeah my IRC bot just went down on tools
[11:05:21] !log test
[11:05:28] !ping
[11:05:28] pong
[11:05:34] at least 1 bot survived :P
[11:05:51] but just thanks to heavy caching
[11:05:53] what happened, hardware fail?
[11:06:09] gry: not sure yet, but some md arrays were degraded
[11:06:16] according to apergos...
[11:06:34] nfs server & gluster are having troubles, but instances are running, just mounted storage is not
[11:06:57] that means /home and /data are not accessible :(
[11:08:14] welcome to the life of a sysadmin for 35434535th time. good luck. :)
[11:18:12] Can not connect to ssh://tools-login.wmflabs.org , always stops after sending public key.
[11:18:40] What's wrong?
[11:20:40] zhuyifei1999 problems with nfs :/
[11:21:46] What?
[11:21:57] Hardware problems, zhuyifei1999, they're looking into it.
[11:22:29] Ok, is there any other
[11:22:30] P
[11:22:42] Ok, thanks.
[11:23:30] Sorry for mistyping enter.
[11:25:38] Is the crontab running right now?
[11:26:12] yes
[11:26:52] everything is running, but as the shared storage is down, users can't login and they usually can't even work as the shell hangs on IO wait
[11:28:13] Another strange thing is that when I connect to https://tools.wmflabs.org/, I got Error 7 (net::ERR_TIMED_OUT): The operation timed out.
[11:28:42] that is absolutely normal: you connected to proxy, proxy requested the page from apache, which hangs on IO as well
[11:28:49] so it times out and proxy sends you the error page
[11:31:01] Thanks.
[11:31:47] petan, ?
[11:32:02] yes?
[11:32:31] if I said !doc is !doc and then used !doc. What would wm-bot do?
[11:32:48] !doc is doc
[11:32:48] Key was added
[11:32:50] err
[11:32:53] !doc del
[11:32:53] Successfully removed doc
[11:32:56] !doc is !doc
[11:32:56] Key was added
[11:32:57] !doc
[11:32:57] !doc
[11:33:01] this ^
[11:33:18] So it does not recurse into infinity.
[11:33:27] it's smarter than that
[11:33:34] petan, :p
[11:34:05] petan: what if I invite another bot here and tell it !doc is !doc too?
[11:34:29] liangent, then we'll have some fun. :D
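On petan's point at [09:32:23] that an NFS hang is repairable without a reboot: the usual recovery for a wedged mount (the "Stale NFS file handle" at [10:07:01] is one symptom) is a lazy unmount followed by a remount once the server answers again. A sketch with root, standard util-linux only; the mount point is the one from the log:

    # Detach the dead mount even though processes still hold it open
    umount -l /data/project
    # Reattach once the NFS server responds again (uses the /etc/fstab
    # entry; if autofs manages it, the next access remounts it instead)
    mount /data/project
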
[11:34:42] liangent I don't know, I implemented some anti-abuse mechanism in the past but I already forgot how it works and I maybe even removed it as it caused troubles
[11:34:58] I think if you invite another bot and do that, I will kick it :)
[11:35:00] Cyberpower678: both would be killed by freenode for flooding
[11:35:14] valhallasw nope
[11:35:17] Cyberpower678: then after reconnection it should be OK
[11:35:19] valhallasw only the second bot
[11:35:24] wm-bot can never flood
[11:35:26] good enough
[11:35:41] it's sending max 1 message per second
[11:35:45] oh
[11:35:49] there is a delivery queue in it
[11:35:49] then it could go on forever
[11:36:00] or I could just type
[11:36:02] Message queue was reloaded!
[11:36:09] which deletes the content of the queue
[11:37:16] petan: Is there a reason why you ignore me?
[11:37:21] a global queue for all channels, or one queue per channel?
[11:37:29] Steinsplitter what
[11:37:39] see PM
[11:37:51] Steinsplitter which one? my bouncer died tonight
[11:37:57] I have no PM from you
[11:38:03] ah, okay :D
[11:38:12] petan, when will labs go back up?
[11:38:19] now
[11:38:24] Cyberpower678: they are back
[11:38:38] * Coren tries to figure out what happened.
[11:38:56] petan: there was an open ticket about spam on the BetaCluster?
[11:38:59] tools-login may take a moment to recuperate.
[11:38:59] they're back!?
[11:39:00] Coren: read -operations backlog
[11:39:17] Coren: I hope you are not rebooting -login again...
[11:39:32] petan: It shouldn't be necessary.
[11:39:45] petan: i cannot find it. But i think we can close the ticket now. The problem seems to be resolved.
[11:39:53] It's absolutely UNnecessary
[11:40:06] toolsbeta-login was affected by this as well and is back up without reboot
[11:40:25] !log tools petrb: tools is back o/
[11:40:27] Logged the message, Master
[11:40:29] toolsbeta-login also doesn't have quite as many wedged cron jobs thrashing. :-)
[11:40:37] yes I know :>
[11:40:40] So it'll recuperate faster.
[11:40:46] but -dev already works!
[11:40:58] petan, I can't access -login.
[11:42:07] Cyberpower678: It'll take some time for -login to recuperate its backlog.
[11:42:43] But I want access now Coren! :p
[11:42:59] * Coren points at -dev
[11:43:16] so scheduled jobs should return?
[11:43:35] rschen7754: They should probably already be on their way back.
[11:44:04] Coren, tools and toolsbeta seem to be having some issues.
[11:44:25] any chance someone in here is on labstore3 and rebooted it a few minutes ago?
[11:44:35] apergos: Me.
[11:44:49] ah. could you next time peek into wikimedia-operations and
[11:45:02] let us know? since I did that very thing about 10 mins earlier
[11:45:17] apergos: You have!?
[11:45:17] although I did not see mount points recover
[11:45:17] yes, I did
[11:45:19] maybe 15 mins earlier
[11:45:29] I only saw the /time/nnn ones come back
[11:45:53] Yes, sorry. I was focused on labs and didn't check -operations.
[11:45:55] are you in the other channel by any chance?
[11:53:37] Looks like -login will need a kick upside the pants.
[11:53:58] rschen7754: In fact, as far as I can tell, all continuous jobs have been restarted already.
[11:54:20] Coren: then that means that my tool failed to restart again *sigh*
[11:54:26] Coren: it's actually coming back
[11:54:34] Coren: I just got some response on one of my ssh's
[11:54:45] like the banner appeared after 10 minutes :>
[11:54:56] give it a few more minutes...
[11:55:03] petan: I know, but from what I can tell it'll take an hour to muddle through the cron backlog.
[11:55:10] :(
[11:55:14] petan: -login is meant to be easily rebootable.
[11:55:17] poor box
[11:55:32] since *nothing* should be running on it. Right everyone? *glare*
[11:55:33] ok... it's just I hate rebooting unix servers :D it's nasty
[11:55:37] Coren, does ganglia update on its own, or do we constantly have to keep hitting refresh?
[11:55:46] beta is struggling along, still giving the good ol' 503 server errors
[11:56:15] rschen7754: How do you start it, normally?
[11:56:20] Coren, ?
[11:56:38] Coren: jstart -mem 512m python /data/project/hat-collector/irc-bots/snitch.py
[11:56:44] Cyberpower678 yes it updates itself
[11:56:48] cool
[11:57:01] petan, how often?
[11:57:04] Cyberpower678 I am here as well, you don't need to ping Coren with every question :3
[11:57:20] let me check it should say that in a browser
[11:57:23] I think every few secs
[11:58:26] it's some javascript :( no idea
[11:58:44] it seems to reload the data only, not whole page
[11:59:07] rschen7754: Odd, because that should work. Note https://tools.wmflabs.org/?status
[11:59:16] petan, I see that. That's why I asked. :p
[11:59:27] Coren: yeah, it just doesn't reboot, and never has for me :(
[11:59:44] rschen7754: Are there situations where it can exit with a return value of 0?
[12:00:04] Looks like the grid load dropped dramatically.
[12:00:07] Coren: that's what i'm guessing, when it errors out for whatever reason it exits with 0
[12:00:33] i haven't looked at the code but that's my guess
[12:00:59] rschen7754, have you tried jsub continuous?
[12:01:05] maybe that's it...
[12:01:13] That's the only thing I can see offhand. The system seems to be back to normal now, so we can check into it.
[12:01:29] well jstart does the same thing as jsub -once -cont
[12:02:18] * Cyberpower678 logins
[12:02:22] anyway, bedtime
[12:02:31] YAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAY!
[12:02:41] Labs is up.
[12:03:09] Now I can install an extension on to Peachy Wiki to test something out. :p
[12:03:16] * Beetstra su-s and reboots .. ;-)
[12:05:07] Coren you should have gone btrfs <3
[12:05:10] Jsub working?
[12:05:27] * petan should have gone to pub instead of office
[12:05:53] !log tools petrb: starting toolwatcher
[12:05:55] Logged the message, Master
[12:06:39] Coren now we need to fix that route to host issue
[12:06:45] Coren try to ping s1
[12:06:54] petan: Busy atm
[12:07:13] :(
[12:09:48] Doing a repair on the filesystem; please stand by
[12:10:51] filesystems will be hung for a bit. Normal, but they will return soon.
[12:11:42] petan: i have closed this bug.... https://bugzilla.wikimedia.org/show_bug.cgi?id=39446
[12:15:10] ok
[12:18:10] Why is ssh up but down again?
[12:19:14] Coren, petan: Grid load is going up again.
[12:22:29] [08:09:48] Doing a repair on the filesystem; please stand by
[12:22:29] [08:10:51] filesystems will be hung for a bit. Normal, but they will return soon.
[12:22:29] [
[12:25:09] Nikerabbit: ^^^ :-)
[12:25:37] Load now back to 7.1
[12:28:19] Nikerabbit: beta is back up
[12:28:44] !log deployment-prep restarted both apaches. Beta has been down for a couple hours due to a NFS issue on labstore3.
[12:28:46] Logged the message, Master
[12:31:55] !log
[12:33:09] !log
[12:36:51] !doc
[12:36:51] !doc
[12:37:07] !doc del
[12:37:07] You are not authorized to perform this, sorry
[12:40:18] !doc del
[12:40:18] Successfully removed doc
[12:41:56] !doc
[12:41:56] https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/Help
[13:27:26] what's the link to see current running jobs?
[13:30:18] you mean this? https://gdash.wikimedia.org/dashboards/jobq/
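Coren's question at [11:59:44] is the crux of rschen7754's restart problem: if a continuous grid job is only resubmitted when it exits nonzero, a bot that hits an error path but exits 0 stays dead. One defensive pattern, sketched as a hypothetical wrapper around the jstart command quoted at [11:56:38] (assuming restart-on-nonzero semantics):

    #!/bin/bash
    # run-snitch.sh (hypothetical): make every exit look like a failure,
    # so the grid's restart-on-nonzero logic always resubmits the bot.
    python /data/project/hat-collector/irc-bots/snitch.py
    rc=$?
    [ "$rc" -eq 0 ] && rc=1    # a continuous bot that "finishes" has died
    exit "$rc"

Submitted as something like `jstart -mem 512m bash run-snitch.sh`.
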
[13:30:21] or something else?
[13:30:35] apergos I think he means this:
[13:30:36] !status
[13:30:40] er
[13:30:46] http://tools.wmflabs.org/?status
[13:30:50] ah :-D
[13:30:51] !status is http://tools.wmflabs.org/?status
[13:30:51] Key was added
[13:30:58] I'm in the wrong channel for prod
[13:41:12] petan: thanks
[13:46:20] !screenfix
[13:46:20] script /dev/null
[13:49:39] at least beta is running, if slowly
[13:49:50] :)
[13:54:54] !log deployment-prep attempting to enable HTTPS on the varnish text cache by applying role::protoproxy::ssl::beta
[13:54:57] Logged the message, Master
[14:02:49] !seen magnus
[14:02:50] you probably wanted to use @seen
[14:02:57] @seen magnus
[14:02:57] Steinsplitter: I have never seen magnus
[14:03:03] @seen magnusm
[14:03:03] Steinsplitter: I have never seen magnusm
[14:03:09] @seen manske
[14:03:09] Steinsplitter: I have never seen manske
[14:03:12] grrrrrrrrr
[14:13:23] @seen grrrrrrrrr
[14:13:24] zhuyifei1999: I have never seen grrrrrrrrr
[14:13:49] @seen me
[14:13:50] zhuyifei1999: Last time I saw me they were changing the nickname to , but is no longer in channel #wikipedia-zh at 5/15/2013 2:51:56 PM (46.23:21:53.7577130 ago)
[14:14:14] @seenrx magnus
[14:14:14] petan: Last time I saw magnuse_ they were quitting the network with reason: no reason was given at 9/22/2012 6:09:38 AM (282.08:04:35.6182190 ago) (multiple results were found: magnuse, magnus__)
[14:14:21] Steinsplitter ^
[14:14:49] !log deployment-prep rebooting deployment-cache-mobile01
[14:14:53] Logged the message, Master
[14:14:56] 2012. thx :D
[14:15:04] ?
[14:15:17] 282 days ago :P
[14:15:22] but that is for magnuse
[14:15:28] @seen magnus__
[14:15:28] petan: Last time I saw magnus__ they were quitting the network with reason: Quit: Page closed N/A at 6/28/2013 3:22:11 PM (2.22:53:17.4627430 ago)
[14:16:02] @seen zhuyifei1999
[14:16:02] zhuyifei1999: are you really looking for yourself?
[14:17:20] @seen you
[14:17:20] zhuyifei1999: I have never seen you
[14:22:35] !log tools petrb: installing following packages on grid: libdata-dumper-simple-perl libhtml-html5-entities-perl libirc-utils-perl libtask-weaken-perl libobject-pluggable-perl libpoe-component-syndicator-perl libpoe-filter-ircd-perl libsocket-getaddrinfo-perl libpoe-component-irc-perl libxml-simple-perl
[14:22:37] Logged the message, Master
[14:25:40] petan: Do you get the error mails from /data/project/admin/scripts/rotate_logs.sh as well?
[14:25:55] yes, that is correct
[14:26:00] there was an NFS outage
[14:26:04] this morning, again
[14:26:48] No, this morning there wasn't any, but at least Friday and Saturday it complained.
[14:26:56] aha...
[14:27:05] there are many emails... let me find it
[14:27:08] what is in the subject?
[14:27:22] I got it
[14:27:33] take: not found?
[14:27:43] that e-mail is pretty old..
[14:29:17] yes I see...
[14:29:31] that sucks at some point, but I can't think of a simple solution, there are 2 problems...
[14:29:42] 1 is fixable
[14:30:11] there is a bug in that log parser and there is also a problem with file permissions... maybe g+s could work
[14:31:00] something creepy is going on with nfs server
[14:32:21] petan: Eh, the (last) mail is from Saturday 23:50Z. Sure we're talking about the same mail?
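The "maybe g+s could work" at [14:30:11] refers to the setgid bit on directories: files created inside inherit the directory's group, which sidesteps the usual shared-log permission problem where each writer's files end up owned by a group the rotation script can't touch. A generic sketch; the group and path are illustrative, not the actual admin-tool layout:

    chgrp tools.admin /data/project/admin/logs    # illustrative group/path
    chmod g+ws /data/project/admin/logs
    ls -ld /data/project/admin/logs
    # drwxrwsr-x ...  — the 's' in the group slot marks setgid;
    # new files in here now get the directory's group automatically
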
[14:33:17] yes probably, there is 1 exception and some unix errors
[14:33:31] the exception is a rather minor issue (easily fixable) but the other problem is worse
[14:34:43] !log deployment-prep applying role::protoproxy::ssl::beta on deployment-cache-mobile01 (attended to replace varnish-t3 for mobile caching)
[14:34:45] Logged the message, Master
[14:35:18] Why's wikitech logo not transparent? (https://wikitech.wikimedia.org/wiki/Talk:Main_Page#File:Wikimedia_labs_logo.svg)
[14:35:31] petan: Well, if it is easily fixable, perfect :-). For the hard stuff, there's always Apache 2.4 on the horizon :-).
[14:35:41] Steinsplitter: I think Magnus' nick is Magnus_Manske.
[14:35:48] scfc_de it's not yet even on beta tools...
[14:35:51] !log deployment-prep updated puppet repository on deployment-varnish-t3 and running puppet there
[14:35:54] Logged the message, Master
[14:35:55] that is going to take a lot of time for it to happen :/
[14:36:16] also I don't know if apache 2.4 can really help with this
[14:36:24] petan: You need a higher mountain, then you can look further.
[14:36:29] AFAIK it was only able to generate error logs in a better way
[14:36:42] we are talking about global access logs here
[14:38:50] !log deployment-prep rebooting deployment-varnish-t3
[14:38:53] Logged the message, Master
[14:40:01] !log deployment-prep binding mobile IP address 208.80.153.143 to deployment-cache-mobile01
[14:40:04] Logged the message, Master
[14:40:39] !log deployment-prep shutdowning deployment-varnish-t3 (replaced by deployment-cache-mobile01
[14:40:49] Logged the message, Master
[14:40:55] http://httpd.apache.org/docs/2.2/logs.html#piped looks interesting.
[14:42:13] !log deployment-prep Migration to the new mobile instance was tracked by {{bug|49469}}
[14:42:16] Logged the message, Master
[14:52:25] Can others access the webserver (http://tools.wmflabs.org/)?
[14:53:26] Coren: why does ganglia tell me that the nfs server is down?
[14:53:36] labstore3 is down according to it :/
[14:53:38] petan: Because it just died.
[14:53:44] D:
[14:54:08] looks like a crash again
[14:54:30] Those were very short 14 days.
[14:54:34] * Coren stares at the hardware. 'Work, dammit'
[14:54:43] scfc_de: Different issue, I think.
[14:56:16] ah, down again.
[14:56:19] * YuviPanda goes to do other stuff
[15:06:13] Cyberpower678 that's because mediawiki change too often and maintains no backward compatiblity :P blame poor mediawiki devs, after that you can go and blame me [15:06:18] Coren: What FS do you use? [15:06:36] petan, I thought it monitors RC Feed. [15:06:37] scfc_de: XFS atm. I'm making a copy to ext4 and will switch to it. [15:06:41] ok, bailing from here, back to the regular too many channels [15:06:51] So that we can figure out if the bug is xfs-dependent, at least. [15:06:57] So if it's monitoring the RC feed, why would developers cause it to fail? [15:07:08] Cyberpower678 hmm... let me check something [15:08:56] WOW [15:09:01] something really nasty is going on [15:09:13] wm-bot's log has almost several gb [15:09:26] like thousands of exception per minute o.O [15:09:34] petan, somehow I'm not surprised. [15:09:36] :p [15:11:05] something had to be changed in format of rc feed [15:11:14] there are too many invalid symbols now [15:11:46] ah noez [15:11:58] this is a different plugin, it is crying because of rss reader [15:12:11] This module is not currently loaded in core [15:12:36] @system-rm Feed [15:12:36] Unloaded module Feed [15:12:46] Coren: Different topic: Is the only reason to upgrade to Apache 2.4 the ability to split error logs? [15:13:05] scfc_de: No, also better per-user virtual hosts. [15:14:22] Coren: Okay. Because you seem to be using some piped logsplitter already, and looking at the 2.2 doc, it seems to be possible to pipe /error/ logs to another program as well. [15:18:21] Cyberpower678 was it today? [15:19:08] I must say that error log for today doesn't really say much [15:21:58] petan: Coren unrelated note, everything from labs-l seems to go into gmail's spam, with message: [15:21:59] Be careful with this message. Many people marked similar messages as spam. Learn more [15:22:02] just afyi [15:22:15] petan, several days. [15:27:14] Cyberpower678 aha... do you know when it was? [15:27:25] Cyberpower678 otherwise it's quite hard for me to check what was wrong [15:27:56] YuviPanda LOL [15:28:06] YuviPanda what the ...... [15:28:12] Let's see... [15:28:23] * petan stabs google in the eye [15:29:12] petan, check the logs for "@RC+ meta_wiki Requests_for_comment/X!'s_Edit_Counter" [15:29:23] there are no such logs [15:29:31] unless you have enabled public logging [15:29:37] Doesn't wm-bot log this channel? [15:29:44] this one? yes [15:30:01] !searchlog [15:30:08] I used that command here accidentally when I meant to use it elsewhere. [15:30:17] !searchlog @RC+ meta_wiki Requests_for_comment/X!'s_Edit_Counter [15:30:27] LOL [15:30:46] Results: Found 1 in 6.077189207077 seconds [15:30:47] http://bots.wmflabs.org/~petrb/logs/#wikimedia-labs/20130701.txt: [15:30:48] On line 691: [15:29:12] petan, check the logs for "@RC+ meta_wiki Requests_for_comment/X!'s_Edit_Counter" [15:30:51] XD [15:30:59] !searchlog is http://bots.wmflabs.org/~wm-bot/searchlog [15:31:00] Key was added [15:33:07] nothing in logs from that day, except for some unrelated stuff [15:33:24] I forgot to tell you I found which day you posted this [15:35:10] o/ addshore so here you are :D [15:35:21] :O ya [15:35:34] why? :) [15:36:01] is beer in Berlin cheaper than on hackaton? :P [15:36:16] I hope so otherwise that octoberfest is going to suck lol [15:36:37] like amsterdam was 10 times more expensive than here :D [15:36:45] petan, I swear I used that command here. [15:36:54] Cyberpower678 yes I know, I found it [15:37:00] Logging was initiated yesterday. 
[15:37:05] yes I know I know [15:37:11] http://bots.wmflabs.org/~petrb/logs/#wikimedia-labs/20130630.txt [15:37:14] here it is [15:37:24] but still, there is no error in system logs [15:37:28] from that day [15:37:48] but TBH whole this RC feed module is crap [15:37:52] it needs a lot of love [15:38:15] whole wm-bot is kind of crap... [15:39:28] but there are other things I need to do before I start fixing its code [15:41:15] PROBLEM Current Load is now: CRITICAL on tools-webserver-01.pmtpa.wmflabs 10.4.0.128 output: CRITICAL - load average: 48.39, 29.83, 17.16 [15:41:18] o.O [15:41:36] Cyberpower678: BTW, good time to fix your crontab :-). [15:41:51] Coren: did you say "little" performance loss [15:41:54] scfc_de, no time. [15:41:57] Is "local-jimmy" around here? [15:42:04] petan: tools-login was over 550 load average during the crash yesterday :p [15:42:19] I also didn't count on the filsystem making the server crash. [15:42:31] wolfgang42: symptom, not cause. [15:42:35] wolfgang42 that number is lying often, I've seen machines with hundreds of thousands of load and they were actually just fine :P [15:42:50] scfc_de, working on other things right now. [15:42:51] Still. [15:43:02] I will be free on July 4th. [15:43:09] * YuviPanda makes 'over 9000' joke [15:43:20] nice load up there petan ;p [15:43:24] Ima leave NFS down while I make the copy and switch now. [15:43:26] OVER NINETHOUSAAAAAAAAAAAAAAAND [15:43:28] Coren why is it crashing? is it kernel? [15:43:38] petan: kernel oops in xfs [15:43:54] wtf old version you have? [15:44:17] don't tell me this xfs bug is in latest stable o.o [15:44:27] * kernel [15:44:48] petan, so when is the feed module getting fixed? [15:45:43] Cyberpower678 as soon as I find out what the problem is (unlikely to happen but if I did, it would be eventually soon) or when I completely rewrite it (more possible, but needs a lot of time) [15:46:02] or, when you fix it! [15:46:20] it's open source - is it broken? fix it! [15:46:29] topic diff <3 [15:46:43] Coren: Do you have an ETA when the switch will be done? Minutes, hours? [15:47:17] days? [15:47:18] years? [15:47:20] :0 [15:47:22] scfc_de: The copy is the long operation. I expect, with no NFS access... 30-45 min? Or so. [15:47:49] Labs has stopped responding again. [15:47:56] Cyberpower678 o'rly? [15:47:57] Coren: That's much shorter than I expected. [15:48:14] Cyberpower678 maybe it's related to nfs server being down ;) [15:48:30] AGAIN? This is the third time. [15:48:38] Within 24 hours. [15:48:45] Cyberpower678: rtft [15:48:47] it's all part of one big time, I don't think it fully recovered from the first time [15:48:59] (read the fucking topic) [15:49:10] :P [15:49:54] My topic has "NULL" written in it. [15:49:58] so, accoring YuviPanda there was actually one big outage, since labs started. that's a good statistic :) [15:50:20] Cyberpower678 upgrade your irc client! :P [15:50:40] preferably to some made by cool programmers, like petan [15:51:35] petan, nevermind. Topic is now visible and lag is 15 seconds. :O [15:53:30] * Cyberpower678 's internet is crapping out. [15:53:32] :/ [15:54:29] petan, how am I supposed to fix wm-bot? [15:54:46] Cyberpower678 checkout source code, find a bug, fix it, submit a patch [15:54:54] This bot is running http://meta.wikimedia.org/wiki/WM-Bot version wikimedia bot v. 1.10.8.15 source code licensed under GPL and located at https://github.com/benapetr/wikimedia-bot [15:57:21] * Cyberpower678 has no fucking clue on how to program in C#. 
[15:57:25] petan, ^
[15:57:35] learn
[15:57:43] it's almost like c++
[15:57:49] just more rich stdlib :P
[15:57:51] petan, I'm learning C++ next semester.
[15:58:13] I was learning C++ several years before we started programming in school... that's a very lame excuse
[15:58:51] sooner you start learning, faster you master it
[15:59:00] petan, no. I learned other languages.
[15:59:09] but c++ is like english in programming
[15:59:12] or, rather c
[15:59:14] not c++
[15:59:21] the other languages are not so important
[15:59:32] \ignore petan
[15:59:35] err
[15:59:37] :O
[15:59:37] oops
[16:00:05] python is like mongolian
[16:00:06] btw
[16:00:07] :P
[16:00:13] * Cyberpower678 hates python
[16:00:18] !pythonguy
[16:00:18] this guy master python more than you: http://lh5.ggpht.com/-gjDgXLXmWTQ/TsuuOwSKWHI/AAAAAAAAk4w/XJOKxaGti-c/boy%252520python%252520bff%252520snake%2525206_thumb.jpg
[16:00:37] LOLZ
[16:00:43] *lol*
[16:00:46] !doc
[16:00:47] https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/Help
[16:00:54] !docs
[16:00:54] https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/Help
[16:01:01] !doc del
[16:01:01] Unable to find the specified key in db
[16:01:07] !del doc
[16:01:08] if you want to delete a key, this is wrong way
[16:01:11] meh
[16:01:13] hold on
[16:01:15] :/
[16:01:18] there is no !doc key
[16:01:26] it's autocompletion which shows it
[16:01:30] !do
[16:01:30] There are multiple keys, refine your input: docs, domain,
[16:01:33] you see
[16:01:44] !doc is What's up doc?
[16:01:44] Key was added
[16:01:48] !doc
[16:01:48] What's up doc?
[16:04:36] Cyberpower678: Why do you hate python?
[16:05:07] wolfgang42, it hogs unnecessary resources.
[16:05:32] And the language doesn't look pleasant.
[16:05:34] Use pypy then, or, if you must, shedskin.
[16:06:48] Personally I prefer Python because I only have to write pseudocode, and I don't have to have redundant brackets
[16:07:25] wolfgang42, it makes you a lazy programmer
[16:09:03] Cyberpower678: This is a bad thing?
[16:09:15] wolfgang42, hell yea.
[16:09:32] python makes you forget decent programming.
[16:09:59] I prefer not to have to waste effort. Why spend time fiddling with the language when I could be thinking about what I want the program to *do*?
[16:11:12] python makes you forget decent programming.
[16:11:19] do you prefer PHP instead?
[16:11:37] YuviPanda: :D
[16:12:40] And I agree with petan that C is the best language for a good grasp of programming. C doesn't allow for any laziness.
[16:13:01] YuviPanda: PHP makes you remember decent programming, as you continually lament being unable to decently program ;)
[16:13:04] Which is why I intend to learn it next semester.
[16:13:49] anomie, I don't get it.
[16:14:09] Cyberpower678: Random PHP bashing
[16:14:11] I need to learn C at some point. I've done some stuff with it, but I've never actually sat down and *learned* it.
[16:14:29] I'm learning it next semester. :)
[16:15:09] I may have to revise my estimate a bit to 1h or so; the rsync takes longer than I expect. Lots and lots of small files. :-)
[16:15:53] good morning, I guess you are aware that gluster seems to have issues again?
[16:16:08] gwicke, read the topic.
[16:17:22] ah, so the switch to NFS already happened?
[16:17:50] /data/project in parsoid-spof is read-only
[16:17:59] normally that was a Gluster issue
[16:20:55] hmm, http://en.wikipedia.beta.wmflabs.org/ down?
[16:21:09] andre__, topic
[16:21:26] ah. reading. thanks :)
[16:30:34] any updates?
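The XFS-to-ext4 migration is an rsync copy dominated by many small files ([16:15:09]), which makes it seek-bound rather than bandwidth-bound. A generic sketch of the kind of invocation such a move usually uses, preserving hard links, ACLs and extended attributes; the paths are placeholders, not the real labstore layout:

    # Copy one filesystem tree to another while preserving everything:
    # -H hard links, -A ACLs, -X xattrs, --numeric-ids for a server disk.
    rsync -aHAX --numeric-ids --delete /srv/project-xfs/ /srv/project-ext4/
    # A second pass afterwards (to catch changes made during the first)
    # is cheap compared to the initial copy.
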
[16:31:03] AzaToth: -operations seems to have more chatter
[16:42:59] gwicke_away: Not for that project. There may well be an independent and unrelated gluster issue. :-)
[17:07:37] Load now at 1.9k
[17:10:16] petan, wm-bot failed again.
[17:10:28] It should have reported an update just now.
[17:11:20] Let's just say that there is little point in looking at load on servers that are, currently, without a live filesystem. :-)
[17:11:32] heh
[17:11:32] 310/376 of the way.
[17:19:40] Coren: had you tried switching the scheduler?
[17:19:52] you're currently switching from xfs -> ext4?
[17:20:19] was there anything that suggested a filesystem bug? it looked like an NFS kernel bug to me
[17:25:29] Ryan_Lane: Different problem. We got corruption on the XFS filesystem that quickly degraded since last night's reboot.
[17:25:36] And yes, last night I switched schedulers.
[17:25:52] Load now at 2.2k
[17:26:01] Cyberpower678: You can stop the updates now.
[17:26:12] Coren, why? :p
[17:26:15] 322/376 progress
[17:27:28] Cyberpower678: Load average 2.2k?!
[17:27:46] sudo shutdown -h NOW!
[17:27:48] The number is going up.
[17:28:01] But
[17:28:02] (376 - 322) * (17:26 - 17:11) / (322 - 310) = 67.5 minutes to go?
[17:28:03] oh
[17:28:08] Load 2200?!
[17:28:18] 2200.00?!
[17:28:20] scfc_de: G not minutes. :-)
[17:28:33] THAT'S OVER 9000!!!! No... wait, it's not.
[17:28:39] How
[17:28:41] Just how
[17:28:47] Did someone forkbomb it?
[17:29:22] FastLizard4: Kinda. Some people are overly generous with cron jobs; combine that with no filesystem = pile up of cron. :-)
[17:29:34] Coren: Aaaaaagh
[17:29:38] Yeah, sudo shutdown -h now
[17:29:39] :P
[17:30:03] no, "$PJES2" then "Z EOD" and in the end "QUIESCE"
[17:30:10] BTW, if someone does that, will "Reboot instance" bring the instance back to life?
[17:30:29] scfc_de: What, a `shutdown` command?
[17:30:31] I imagine so
[17:30:35] scfc_de: Yes.
[17:30:48] Not that it'd be useful /now/ mind you.
[17:31:09] Can you pull the virtual plug?
[17:31:53] FastLizard4: Why?
[17:32:03] Coren: For other projects it might be interesting if you do development "nine to five" and you don't need all instances running 24/7.
[17:32:09] Coren: Well, if you can't shutdown the system
[17:32:42] scfc_de: The overhead cost of an idle instance is negligible.
[17:33:03] And a downed instance gets no updates.
[17:33:18] I guess
[17:33:26] Still, a load average of 2200.00 is absurd :P
[17:33:28] Coren: True.
[17:33:50] Literally the highest load average I've ever seen is 800
[17:34:48] If I had a machine pushing 2200, I'd kill it if for no other reason than to clear the process table
[17:35:21] What instance is this, anyway?
[17:35:36] FastLizard4: Due to the cron jobs, it would fill up immediately after reboot. There's no harm.
[17:36:23] True
[17:38:02] FastLizard4, no the highest you've seen is 2.5k. Which the load is at now.
[17:38:27] Cyberpower678: No, it's not the highest I've seen until I see it with my own eyes in htop :P
[17:38:44] then go look.
[17:38:54] But no one's told me what instance it is! :P
[17:39:02] And I'm at work so I don't have my SSH key son hand :P
[17:39:07] *SSH keys on hand
[17:39:37] -login
[17:39:53] tools
[17:39:55] bots
[17:40:39] FastLizard4: Coren: For other projects it might be interesting if you do
[17:40:45] And a downed instance gets no updates. [17:33]
[17:40:49] I guess
[17:40:52] Still, a load average of 2200.00 is absurd :P
[17:40:55] Coren: True.
[17:40:58] Literally the highest load average I've ever seen is 800
[17:41:01] scfc_de: o_O?
[17:41:03] *** Jasper_Deng_away (~chatzilla@wikimedia/Jasper-Deng) is now known as
[17:41:03] Jasper_Deng [17:34]
[17:41:06] If I had a machine pushing 2200, I'd kill it if for no other
[17:41:09] reason than to clear the process table
[17:41:12] What instance is this, anyway? [17:35]
[17:41:13] scfc_de, WTF?
[17:41:22] FastLizard4: Due to the cron jobs, it would fill up immediately
[17:41:22] after reboot. There's no harm.
[17:41:22] O_O
[17:41:22] True [17:36]
[17:41:25] FastLizard4, no the highest you've seen is 2.5k. Which the
[17:41:28] load is at now. [17:38]
[17:41:31] Cyberpower678: No, it's not the highest I've seen until I see it
[17:41:34] with my own eyes in htop :P
[17:41:38] then go look.
[17:41:41] But no one's told me what instance it is! :P
[17:41:43] And I'm at work so I don't have my SSH key son hand :P [17:39]
[17:41:44] Someone get a channel op and add +q to scfc_de
[17:41:46] *SSH keys on hand
[17:41:50] -login
[17:41:52] tools
[17:41:55] bots
[17:41:57] Nah, he's almost done :P
[17:41:58] ERC> FastLizard4: http://ganglia.wmflabs.org/latest/?r=hour&cs=&ce=&m=load_one&s=by+name&c=tools&h=tools-login&host_regex=&max_graphs=0&tab=m&vn=&sh=1&z=small&hc=4
[17:42:04] Well, eh, sorry.
[17:42:13] it's okay, I've done that OO :P
[17:42:22] *too, even
[17:43:37] I guess from your delayed responses the paste didn't appear all at once?
[17:44:08] scfc_de: No, your client is obviously smart enough to not flood itself out. :-)
[17:44:39] scfc_de,
[17:44:40] it
[17:44:42] was
[17:44:44] coming
[17:44:46] in
[17:44:48] like
[17:44:50] this
[17:45:54] Need to look up if I can enable some safeguard à la "don't send it if it's more than x lines".
[17:46:05] *cough* irssi *cough(
[17:46:09] s/(/*/
[17:46:09] :P
[17:47:29] FastLizard4: Isn't written in Lisp, I believe :-).
[17:47:42] It's written in Perl :P
[17:48:03] Wait, is it?
[17:48:04] No.
[17:48:11] Irssi is C but supports Perl addons
[17:49:17] ERC has "flood control", but apparently it just delays the messages, and doesn't ask: "Are you serious?"
[18:04:08] (The only difference about the weeklies is that they /aren't/ deleted 3 days later) [18:04:20] Cyberpower678: That shouldn't have been a cause. [18:04:33] Cyberpower678: yeah, you aren't causing filesystem issue [18:04:36] *issues [18:04:42] Cyberpower678: You did go quite overboard with the cron every minute though. :-) [18:04:59] Better to sleep in your script than have cron start every minute. :-) [18:05:01] Coren, I plan to fix those when my time frees up. [18:05:33] I plan to make everything continuous. [18:05:36] 362/376; the pace picked up a little. [18:05:44] Coren, LIKE [18:06:08] Cyberpower678: For starters, you just need to replace "* * * * *" with " * * * *". [18:06:30] scfc_de, umm...no. [18:06:38] That doesn't make it continuous. [18:06:54] I'm going to convert a lot of the scripts to be continuous. [18:07:55] Ryan_Lane: I'm going to disable timetravel for the time being, keeping things on the simple side for now. I'll experiment with stability in eqiad before restarting it. [18:08:30] * Ryan_Lane nods [18:08:49] Coren: are things back up? [18:09:09] Betacommand: Soon. The copy is almost complete, and it's just a few minutes after that. [18:10:41] * Cyberpower678 's RAM is overloading. [18:11:01] 11.21 GB of 15.9 GB used [18:11:17] Cyberpower678: You shouldn't really need to convert anything. If it's a script and it should be restarted once it exits, "jstart