[00:00:53] Load avg. on willow is CRITICAL: CRITICAL - load average: 43.33, 25.33, 20.88 [00:08:54] /sql on ptolemy is OK: DISK OK - free space: /sql 128193 MB (21% inode=99%): [00:10:44] SMF on turnera is CRITICAL: ERROR - offline: svc:/system/cluster/scsymon-srv:default [00:12:05] SMF on damiana is CRITICAL: ERROR - offline: svc:/system/cluster/scsymon-srv:default [00:15:44] Can someone guide me on how to create a crontab file? [00:20:38] ceradon: what do you wanna do? [00:21:22] create a crontab to run my bot, can you help, Alchimista? [00:21:24] SMF on willow is CRITICAL: ERROR - maintenance: svc:/network/puppetmasterd:default [00:21:54] ceradon: i may try, if it's not too complicated :P but what do you want to do specificly? [00:23:41] Alchimista: can you join ##ceradonbot please? [00:30:43] Sun Grid Engine execd on willow is UNKNOWN: CHECK_NRPE: Error receiving data from daemon. [00:31:44] Sun Grid Engine execd on willow is WARNING: NRPE: Unable to read output [00:32:24] SMF on willow is UNKNOWN: CHECK_NRPE: Received 0 bytes from daemon. Check the remote server logs for error messages. [00:32:34] /tmp on willow is CRITICAL: Connection refused by host [00:33:25] SMF on willow is CRITICAL: ERROR - maintenance: svc:/network/puppetmasterd:default [00:33:34] /tmp on willow is OK: DISK OK - free space: /tmp 379 MB (74% inode=99%): [00:37:24] SMF on willow is OK: OK - all services online [00:40:44] Sun Grid Engine execd on willow is UNKNOWN: CHECK_NRPE: Received 0 bytes from daemon. Check the remote server logs for error messages. [00:45:05] /sql on rosemary is CRITICAL: DISK CRITICAL - free space: /sql 38185 MB (3% inode=99%): [00:48:25] Environment on willow is UNKNOWN: CHECK_NRPE: Received 0 bytes from daemon. Check the remote server logs for error messages. [00:49:14] Environment on willow is CRITICAL: Connection refused by host [00:50:34] /tmp on willow is CRITICAL: Connection refused by host [00:50:34] / on willow is CRITICAL: Connection refused by host [00:50:44] Cluster on willow is CRITICAL: Connection refused by host [01:00:54] Load avg. on willow is CRITICAL: Connection refused by host [01:01:44] Sun Grid Engine execd on willow is CRITICAL: Connection refused by host [01:08:24] SMF on willow is CRITICAL: Connection refused by host [01:10:44] SMF on turnera is CRITICAL: ERROR - offline: svc:/system/cluster/scsymon-srv:default [01:12:04] SMF on damiana is CRITICAL: ERROR - offline: svc:/system/cluster/scsymon-srv:default [01:16:54] /sql on ptolemy is WARNING: DISK WARNING - free space: /sql 127021 MB (20% inode=99%): [01:18:44] hmm, phe@willow:~$ free [01:18:44] -bash: fork: Not enough space [01:20:41] better now [01:20:44] mzmcbride@willow:~$ free [01:20:44] -bash: free: command not found [01:21:29] "df -h" shows nothing crazy. [01:49:14] Environment on willow is CRITICAL: Connection refused by host [01:50:33] /tmp on willow is CRITICAL: Connection refused by host [01:50:34] / on willow is CRITICAL: Connection refused by host [01:50:44] Cluster on willow is CRITICAL: Connection refused by host [02:00:54] Load avg. on willow is CRITICAL: Connection refused by host [02:01:45] Sun Grid Engine execd on willow is CRITICAL: Connection refused by host [02:09:24] SMF on willow is CRITICAL: Connection refused by host [02:10:45] SMF on turnera is CRITICAL: ERROR - offline: svc:/system/cluster/scsymon-srv:default [02:12:05] SMF on damiana is CRITICAL: ERROR - offline: svc:/system/cluster/scsymon-srv:default [02:16:41] Is there a memory issue on willow? 
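The "-bash: fork: Not enough space" messages above are, on a Solaris host like willow, usually a sign that virtual memory (swap reservation) is exhausted rather than disk, which is why "df -h" shows nothing crazy. A minimal sketch of what one might check, assuming the standard Solaris swap and vmstat utilities are available on the host:

    # Swap reservation: fork() starts failing with "Not enough space" when "available" hits zero
    /usr/sbin/swap -s
    # Per-device swap allocation
    /usr/sbin/swap -l
    # Watch free memory and the page-scan rate for a while
    vmstat 5 3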
[02:16:54] /sql on ptolemy is WARNING: DISK WARNING - free space: /sql 123689 MB (20% inode=99%): [02:40:53] Anyone know why I am getting: [02:40:54] -bash: fork: Not enough space [02:44:34] RAID on adenia is CRITICAL: CHECK_NRPE: Socket timeout after 30 seconds. [02:49:14] Environment on willow is CRITICAL: Connection refused by host [02:49:34] RAID on adenia is OK: OK - TOTAL: 2: FAILED: 0: DEGRADED: 0 [02:50:34] /tmp on willow is CRITICAL: Connection refused by host [02:50:34] / on willow is CRITICAL: Connection refused by host [02:51:45] Cluster on willow is CRITICAL: Connection refused by host [02:54:24] SSH on z-dat-s7-a is CRITICAL: CRITICAL - Socket timeout after 10 seconds [02:54:25] SSH on z-dat-s6-a is CRITICAL: CRITICAL - Socket timeout after 10 seconds [02:54:34] RAID on hyacinth is CRITICAL: CHECK_NRPE: Socket timeout after 30 seconds. [02:55:14] RAID on hyacinth is OK: OK - TOTAL: 2: FAILED: 0: DEGRADED: 0 [02:55:15] SSH on z-dat-s7-a is OK: SSH OK - OpenSSH_5.8p2-hpn13v11 (protocol 2.0) [02:55:15] SSH on z-dat-s6-a is OK: SSH OK - OpenSSH_5.8p2-hpn13v11 (protocol 2.0) [03:00:54] Load avg. on willow is CRITICAL: Connection refused by host [03:01:44] Sun Grid Engine execd on willow is CRITICAL: Connection refused by host [03:04:35] RAID on adenia is CRITICAL: CHECK_NRPE: Socket timeout after 30 seconds. [03:09:25] SMF on willow is CRITICAL: Connection refused by host [03:10:44] SMF on turnera is CRITICAL: ERROR - offline: svc:/system/cluster/scsymon-srv:default [03:12:05] SMF on damiana is CRITICAL: ERROR - offline: svc:/system/cluster/scsymon-srv:default [03:17:53] /sql on ptolemy is WARNING: DISK WARNING - free space: /sql 117910 MB (19% inode=99%): [03:28:45] SSH on willow is CRITICAL: CRITICAL - Socket timeout after 10 seconds [03:29:34] SSH on willow is OK: SSH OK - OpenSSH_5.8p2-hpn13v11 (protocol 2.0) [03:33:09] I'm having serious issues on Willow. Does anyone know if it's having problems? [03:46:25] SSH on z-dat-s6-a is CRITICAL: CRITICAL - Socket timeout after 10 seconds [03:46:25] SSH on z-dat-s7-a is CRITICAL: CRITICAL - Socket timeout after 10 seconds [03:46:46] SSH on z-dat-s4-a is CRITICAL: CRITICAL - Socket timeout after 10 seconds [03:46:54] Load avg. on z-dat-s4-a is CRITICAL: CHECK_NRPE: Socket timeout after 30 seconds. [03:46:54] Load avg. on hyacinth is CRITICAL: CHECK_NRPE: Socket timeout after 30 seconds. [03:47:05] Load avg. on z-dat-s3-a is CRITICAL: CHECK_NRPE: Socket timeout after 30 seconds. [03:47:14] Load avg. on z-dat-s6-a is CRITICAL: CHECK_NRPE: Socket timeout after 30 seconds. [03:47:14] SMTP on hyacinth is CRITICAL: CRITICAL - Socket timeout after 10 seconds [03:47:14] SMTP on z-dat-s6-a is CRITICAL: CRITICAL - Socket timeout after 10 seconds [03:47:14] SMTP on z-dat-s7-a is CRITICAL: CRITICAL - Socket timeout after 10 seconds [03:47:15] SSH on z-dat-s3-a is CRITICAL: CRITICAL - Socket timeout after 10 seconds [03:47:15] SSH on hyacinth is CRITICAL: CRITICAL - Socket timeout after 10 seconds [03:47:15] Load avg. on z-dat-s7-a is CRITICAL: CHECK_NRPE: Socket timeout after 30 seconds. [03:47:25] Load avg. on hyacinth is OK: OK - load average: 0.43, 2.60, 3.37 [03:47:34] SMF on z-dat-s3-a is CRITICAL: CHECK_NRPE: Socket timeout after 30 seconds. [03:47:34] SMF on hyacinth is CRITICAL: CHECK_NRPE: Socket timeout after 30 seconds. [03:47:34] SMF on z-dat-s4-a is CRITICAL: CHECK_NRPE: Socket timeout after 30 seconds. [03:47:34] RAID on hyacinth is CRITICAL: CHECK_NRPE: Socket timeout after 30 seconds. [03:47:45] Load avg. 
on z-dat-s6-a is OK: OK - load average: 0.36, 2.50, 3.32 [03:47:46] Environment on hyacinth is CRITICAL: CHECK_NRPE: Socket timeout after 30 seconds. [03:47:51] uh, oh... I've never seen that many timeouts before. [03:48:06] MySQL on z-dat-s6-a is CRITICAL: (Service Check Timed Out) [03:48:06] MySQL slave on z-dat-s6-a is CRITICAL: (Service Check Timed Out) [03:48:06] Load avg. on z-dat-s7-a is OK: OK - load average: 0.26, 2.33, 3.25 [03:48:06] SMTP on hyacinth is OK: SMTP OK - 0.004 sec. response time [03:48:06] SMTP on z-dat-s6-a is OK: SMTP OK - 0.005 sec. response time [03:48:07] SMTP on z-dat-s7-a is OK: SMTP OK - 0.060 sec. response time [03:48:07] SMF on z-dat-s3-a is OK: OK - all services online [03:48:07] SMF on z-dat-s4-a is OK: OK - all services online [03:48:07] SMF on hyacinth is OK: OK - all services online [03:48:08] SSH on z-dat-s3-a is OK: SSH OK - OpenSSH_5.8p2-hpn13v11 (protocol 2.0) [03:48:08] SSH on hyacinth is OK: SSH OK - OpenSSH_5.8p2-hpn13v11 (protocol 2.0) [03:48:14] RAID on hyacinth is OK: OK - TOTAL: 2: FAILED: 0: DEGRADED: 0 [03:48:14] MySQL on z-dat-s6-a is OK: Uptime: 147566 Threads: 14 Questions: 21369877 Slow queries: 10696 Opens: 183048 Flush tables: 2 Open tables: 644 Queries per second avg: 144.815 [03:48:15] MySQL slave on z-dat-s6-a is OK: Uptime: 147566 Threads: 14 Questions: 21369878 Slow queries: 10696 Opens: 183048 Flush tables: 2 Open tables: 644 Queries per second avg: 144.815 Slave IO: Yes Slave SQL: Yes Seconds Behind Master: 196 [03:48:15] SSH on z-dat-s6-a is OK: SSH OK - OpenSSH_5.8p2-hpn13v11 (protocol 2.0) [03:48:15] SSH on z-dat-s7-a is OK: SSH OK - OpenSSH_5.8p2-hpn13v11 (protocol 2.0) [03:48:24] Environment on hyacinth is OK: ok: temperature ok fan ok voltage ok chassis ok [03:48:24] Load avg. on z-dat-s4-a is OK: OK - load average: 0.98, 2.34, 3.23 [03:48:34] Load avg. on z-dat-s3-a is OK: OK - load average: 1.10, 2.34, 3.22 [03:48:34] SSH on z-dat-s4-a is OK: SSH OK - OpenSSH_5.8p2-hpn13v11 (protocol 2.0) [03:50:14] Environment on willow is CRITICAL: Connection refused by host [03:50:34] /tmp on willow is CRITICAL: Connection refused by host [03:50:34] / on willow is CRITICAL: Connection refused by host [03:52:44] Cluster on willow is CRITICAL: Connection refused by host [03:53:36] TParis, I am too. [03:54:01] TParis: Me too... :( [03:54:33] oh good, it's not just me then [03:54:42] Running dir returns -bash: fork: Not enough space [03:54:59] I normally use nightshade, but since that was destroyed a few days ago, I temporarily switched to willow. Now, I have nowhere to go to. :P [03:55:38] Are any of the Toolserver ops people on? I don't see DaB. or nosy [03:55:38] * Seahorse is trying to start CVN Bots but they get killed as soon as they connect [03:56:06] It is possible that everyone who was on nightshade, has now switched to willow and it just can't cope [03:56:15] or, we can just blame all the damn interwiki bots [03:56:20] So, in other words... we can't log in... [03:56:28] I can login [03:56:33] but ever other command doesnt work [03:56:36] I'm logged in atm [03:56:38] yeah, i'm logged in [03:56:42] but correct, commands are failing [03:56:43] Me too... [03:56:58] CD and dir work for me now [03:57:06] does anyone even know what actually happened to nightshade or why it isn't fixed? [03:57:23] it ended with all of my bots crashing and nobody telling me why. [03:57:27] Seahorse: It got wiped, as far as I can tell. 
[03:57:33] interestingly, "ls" worked for me, but typing "ls -l" is freezing it [03:57:35] dir is broken on my root [03:57:45] LOL, I can do ps all [03:57:47] ls: memory exhausted [03:57:49] ls -a is giving me about 90% failure [03:58:05] Is cron working for y'all? [03:58:16] perfect time to use my vps [03:58:41] I will be majorly annoyed if it's just someone's tool stuck in an infinite loop [03:58:54] Magog_the_Ogre: Me too... [03:59:11] are you sure nightshade got wiped btw? [03:59:19] apparently there is a way to check how much memory every user's processes are using. [03:59:21] because I'm going to set up my cron jobs again, and I don't want my tool running twice [03:59:27] or get the top offenders [03:59:41] seahorse: wouldn't that require root access? [03:59:54] no, someone did this before [04:00:02] I just forget what it was called [04:00:03] Magog_the_Ogre: From what I can tell, they're going to restore it. [04:00:21] does anyone know what was wrong with it? [04:00:46] It stopped responding [04:00:54] Load avg. on willow is CRITICAL: Connection refused by host [04:01:08] tsnag: Thank you, captain obvious [04:01:15] lol [04:01:33] :) [04:01:37] oh tsnag, if only you could tell the ops, and not us who already know. [04:01:44] Sun Grid Engine execd on willow is CRITICAL: Connection refused by host [04:02:20] All the ops are probably asleep... [04:03:44] someone should probably mail the mailing list [04:03:45] Well I guess that means it's bed time for me too then [04:03:48] they don't have them spread out in different time zones? [04:04:13] Seahorse, as I understand it, there are effectively only 2 active ops atm [04:04:36] I've got the mailing list... [04:04:47] that is rather sad [04:05:50] Seahorse: Agreed :( [04:08:04] what did tsnag say? lol [04:08:08] I have him blacklisted [04:08:11] toolserver-l@wikimedia.org , right? [04:08:51] NVM found it. [04:09:00] Magog_the_Ogre: Load avg. on willow is CRITICAL: Connection refused by host [04:09:25] SMF on willow is CRITICAL: Connection refused by host [04:09:44] Magog_the_Ogre: SMF on willow is CRITICAL: Connection refused by host [04:09:56] what's smf? [04:10:45] SMF on turnera is CRITICAL: ERROR - offline: svc:/system/cluster/scsymon-srv:default [04:10:58] Not really sure... [04:11:19] right now i wish I knew more about how to use PS [04:11:23] I'm going to google it [04:11:24] Magog_the_Ogre, Service management facility [04:11:51] gotta love google http://publib.boulder.ibm.com/infocenter/pseries/v5r3/index.jsp?topic=/com.ibm.aix.prftungd/doc/prftungd/mem_usage_determine_ps.htm [04:12:04] SMF on damiana is CRITICAL: ERROR - offline: svc:/system/cluster/scsymon-srv:default [04:12:38] unfortunately, the page doesn't actually tell you how to tod it [04:12:56] ps -u username is what I do [04:15:20] 29414 pts/55 3627:34 java [04:15:52] use top [04:15:58] what? [04:16:03] it's better at seeing over memory usage [04:16:15] memory/cpu [04:17:09] Mail sent. [04:17:45] stanleku looks like he's eating up a lot [04:17:48] is that normal? [04:17:54] at least a lot of CPU [04:17:55] /sql on ptolemy is WARNING: DISK WARNING - free space: /sql 113928 MB (18% inode=99%): [04:18:04] Matthew Bowker * [Toolserver-l] Willow - Intermittent problems [04:18:19] reba: Thanks :) [04:26:35] how much memory does willow have? [04:26:48] I don't get what's with all these users plugging up 500M of memory [04:41:39] is it possible to do SFTP on any of the servers? or is it only willow/nightshade? 
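Several people above are looking for a way to see how much memory each user's processes are using. A rough sketch with standard Solaris tools, assuming prstat and a POSIX ps are available on willow (the username below is only a placeholder):

    # One-shot per-user summary of process count, total size and RSS
    prstat -t 5 1
    # Largest processes for a single user, sorted by virtual size
    ps -u some_user -o pid,vsz,rss,time,args | sort -k2 -n -r | head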
[04:43:13] mzmcbride@willow:~$ w [04:43:13] 4:42am up 14 day(s), 16:05, 20 users, load average: 51.87, 51.28, 50.28 [04:43:14] Only Willow/Nightshade. I'm pretty sure. Don't quote me on that, though (cuz I think only Willow and Nightshade have the key system set up) [04:43:27] /home is shared. [04:43:43] You can definitely SSH; I assume SFTP also works. [04:43:49] But you're not allowed to run scripts on the other hosts. [04:44:20] So much python in top. [04:44:51] Doesn't seem like a rogue script or anything. Just too much load. [04:50:14] Environment on willow is CRITICAL: Connection refused by host [04:50:35] /tmp on willow is CRITICAL: Connection refused by host [04:50:35] / on willow is CRITICAL: Connection refused by host [04:52:45] Cluster on willow is CRITICAL: Connection refused by host [04:54:06] SMTP on willow is CRITICAL: Connection refused [04:54:34] Meh, willow's having problems again... [05:00:53] Load avg. on willow is CRITICAL: Connection refused by host [05:01:45] Sun Grid Engine execd on willow is CRITICAL: Connection refused by host [05:08:01] you sure? [05:08:08] nemobis is looking pretty bad atm [05:08:38] magog@willow:~/public_html$ w [05:08:38] 5:07am up 14 day(s), 16:30, 20 users, load average: 49.09, 50.22, 50.16 [05:08:38] User tty login@ idle JCPU PCPU what [05:08:38] nemobis pts/3 23Feb12 9days 5392:39 5392:39 python dumpgenerator.py --index= [05:08:38] danny_b pts/34 18Feb12 5:45 11:46 11:46 irssi [05:08:39] danny_b pts/57 Thu11pm 5:45 16 16 screen -DR irssi [05:08:41] magog pts/63 8:14pm 1 w [05:08:43] danny_b pts/94 Fri 8am 5:48 2 2 bash -rcfile .bashrc [05:08:47] jem pts/106 24Feb12 9:13 2 2 -bash [05:08:49] jem pts/112 24Feb12 7:43 45 4 -bash [05:08:51] dungodun pts/47 Mon 7am 3days 2 2 screen -r [05:08:53] schutz pts/160 Thu 4pm 2days -bash [05:08:55] alchimis pts/150 11:42pm 1:18 2 2 top -U alchimista [05:08:57] nemobis pts/170 11:40pm 5:03 47 47 -bash [05:08:59] stwalker pts/90 Sun 9pm 41 3 2 -bash [05:09:03] nettrom pts/174 Thu 6pm 2:27 1 -bash [05:09:05] dcoetzee pts/172 3:03am 1:22 3 -bash [05:09:09] mzmcbrid pts/187 1:20am 22 -bash [05:09:10] matthewr pts/189 3:53am 13 -bash [05:09:12] chris pts/29 3:46am 44 2 -bash [05:09:14] chzz pts/190 Fri 6am 21:11 26 2 -bash [05:09:18] seahorse pts/196 3:23am 1:19 1 -bash [05:09:20] whym pts/195 11:39am 13:31 tmux attach [05:09:22] uw_trans pts/198 12:56am 2:37 1 -bash [05:09:25] SMF on willow is CRITICAL: Connection refused by host [05:09:44] Whoa [05:10:44] SMF on turnera is CRITICAL: ERROR - offline: svc:/system/cluster/scsymon-srv:default [05:10:51] might we finally have a culprit, I wonder? [05:11:35] We may... [05:12:05] SMF on damiana is CRITICAL: ERROR - offline: svc:/system/cluster/scsymon-srv:default [05:17:27] so his login has been idle for 9 days [05:17:38] and it's been stuck in dumpgenerator.py and infinite looping the entire time [05:17:54] /sql on ptolemy is WARNING: DISK WARNING - free space: /sql 106554 MB (17% inode=99%): [05:18:35] Bah, that's it then. [05:18:45] well [05:19:42] we just need someone to say sudo kill 13912 [05:20:09] unfortunately, I no speaka the sudo language [05:21:53] also, I'm assuming it's that process; I suppose it's possible he just has a memory intensive process that isn't eating up the RAM [05:21:56] can't me skeptical though [05:21:59] anyway [05:22:03] it's getting late [05:22:27] I hope that email will have the ops look into it. 
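Before asking an op to kill the PID spotted in the "w" output above, it is worth confirming what that process is actually doing and how much it has consumed; 13912 below is simply the number guessed at in the discussion, not a verified culprit. A sketch using standard tools:

    # What is this PID, who owns it, and how much CPU/memory has it used?
    ps -o pid,user,time,pcpu,vsz,args -p 13912
    # Live CPU and memory for just that PID (Solaris)
    prstat -p 13912 5 1
    # Only the owner or an op can then stop it; plain TERM first, -9 only as a last resort
    kill 13912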
[05:50:14] Environment on willow is CRITICAL: Connection refused by host [05:50:34] /tmp on willow is CRITICAL: Connection refused by host [05:51:33] / on willow is CRITICAL: Connection refused by host [05:53:44] Cluster on willow is CRITICAL: Connection refused by host [05:54:05] SMTP on willow is CRITICAL: Connection refused [06:00:54] Load avg. on willow is CRITICAL: Connection refused by host [06:02:45] Sun Grid Engine execd on willow is CRITICAL: Connection refused by host [06:07:29] oh men [06:10:24] SMF on willow is CRITICAL: Connection refused by host [06:11:45] SMF on turnera is CRITICAL: ERROR - offline: svc:/system/cluster/scsymon-srv:default [06:12:06] SMF on damiana is CRITICAL: ERROR - offline: svc:/system/cluster/scsymon-srv:default [06:17:54] /sql on ptolemy is WARNING: DISK WARNING - free space: /sql 99413 MB (16% inode=99%): [06:50:14] Environment on willow is CRITICAL: Connection refused by host [06:50:34] /tmp on willow is CRITICAL: Connection refused by host [06:51:34] / on willow is CRITICAL: Connection refused by host [06:53:45] Cluster on willow is CRITICAL: Connection refused by host [06:54:05] SMTP on willow is CRITICAL: Connection refused [06:54:45] SSH on willow is CRITICAL: CRITICAL - Socket timeout after 10 seconds [06:55:34] SSH on willow is OK: SSH OK - OpenSSH_5.8p2-hpn13v11 (protocol 2.0) [07:00:54] Load avg. on willow is CRITICAL: Connection refused by host [07:02:44] Sun Grid Engine execd on willow is CRITICAL: Connection refused by host [07:09:34] RAID on adenia is CRITICAL: CHECK_NRPE: Socket timeout after 30 seconds. [07:11:24] SMF on willow is CRITICAL: Connection refused by host [07:11:44] SMF on turnera is CRITICAL: ERROR - offline: svc:/system/cluster/scsymon-srv:default [07:13:05] SMF on damiana is CRITICAL: ERROR - offline: svc:/system/cluster/scsymon-srv:default [07:18:54] /sql on ptolemy is WARNING: DISK WARNING - free space: /sql 96255 MB (15% inode=99%): [07:23:04] Sun Grid Engine execd on ortelius is WARNING: short@ortelius exceedes load threshold: alarm hl:np_load_short=1.191406/1.00, alarm hl:np_load_long=0.724610/1.50, alarm hl:mem_free=21291.000000M/300M, alarm hl:available=1/0: all.q@ortelius exceedes load threshold: alarm hl:np_load_short=1.191406/1.10, alarm hl:np_load_long=0.724610/1.50, alarm hl:mem_free=21291.000000M/300M, alarm hl:available=1/0 [07:24:05] Sun Grid Engine execd on ortelius is OK: short@ortelius OK: all.q@ortelius OK [07:31:45] SSH on willow is CRITICAL: CRITICAL - Socket timeout after 10 seconds [07:32:34] SSH on willow is OK: SSH OK - OpenSSH_5.8p2-hpn13v11 (protocol 2.0) [07:34:14] RAID on adenia is OK: OK - TOTAL: 2: FAILED: 0: DEGRADED: 0 [07:37:53] /sql on ptolemy is OK: DISK OK - free space: /sql 154222 MB (25% inode=99%): [07:50:14] Environment on willow is CRITICAL: Connection refused by host [07:50:34] /tmp on willow is CRITICAL: Connection refused by host [07:51:34] / on willow is CRITICAL: Connection refused by host [07:54:44] Cluster on willow is CRITICAL: Connection refused by host [07:55:05] SMTP on willow is CRITICAL: Connection refused [08:00:54] Load avg. 
on willow is CRITICAL: Connection refused by host [08:03:44] Sun Grid Engine execd on willow is CRITICAL: Connection refused by host [08:11:24] SMF on willow is CRITICAL: Connection refused by host [08:12:43] SMF on turnera is CRITICAL: ERROR - offline: svc:/system/cluster/scsymon-srv:default [08:13:06] SMF on damiana is CRITICAL: ERROR - offline: svc:/system/cluster/scsymon-srv:default [08:37:05] NTP on willow is CRITICAL: CRITICAL - Socket timeout after 10 seconds [08:38:54] NTP on willow is OK: NTP OK: Offset 0.000717 secs [08:49:05] NTP on willow is CRITICAL: CRITICAL - Socket timeout after 10 seconds [08:50:14] Environment on willow is CRITICAL: Connection refused by host [08:50:34] /tmp on willow is CRITICAL: Connection refused by host [08:51:34] / on willow is CRITICAL: Connection refused by host [08:54:44] Cluster on willow is CRITICAL: Connection refused by host [08:55:06] SMTP on willow is CRITICAL: Connection refused [09:00:54] Load avg. on willow is CRITICAL: Connection refused by host [09:03:45] Sun Grid Engine execd on willow is CRITICAL: Connection refused by host [09:11:24] SMF on willow is CRITICAL: Connection refused by host [09:12:44] SMF on turnera is CRITICAL: ERROR - offline: svc:/system/cluster/scsymon-srv:default [09:14:05] SMF on damiana is CRITICAL: ERROR - offline: svc:/system/cluster/scsymon-srv:default [09:17:53] 3(created) [MNT-1209] Save some space in /sql on ptolemy by re-clustering tables; Maintenance: ptolemy; Minor Minor work <10https://jira.toolserver.org/browse/MNT-1209> (Kai Krueger) [09:17:55] 3(commented) [MNT-1209] Save some space in /sql on ptolemy by re-clustering tables <10https://jira.toolserver.org/browse/MNT-1209> (Kai Krueger) [09:50:14] Environment on willow is CRITICAL: Connection refused by host [09:50:34] /tmp on willow is CRITICAL: Connection refused by host [09:51:34] / on willow is CRITICAL: Connection refused by host [09:55:44] Cluster on willow is CRITICAL: Connection refused by host [09:56:05] SMTP on willow is CRITICAL: Connection refused [10:00:54] Load avg. on willow is CRITICAL: Connection refused by host [10:04:43] Sun Grid Engine execd on willow is CRITICAL: Connection refused by host [10:07:05] NTP on willow is CRITICAL: CRITICAL - Socket timeout after 10 seconds [10:08:54] NTP on willow is OK: NTP OK: Offset -0.000959 secs [10:11:25] SMF on willow is CRITICAL: Connection refused by host [10:11:53] 3(created) [TS-1321] No updates for plwiki_p since 2012-02-28; Toolserver; Bug <10https://jira.toolserver.org/browse/TS-1321> (Maciej Jaros) [10:12:07] Dr. Trigon * Re: [Toolserver-l] Willow - Intermittent problems [10:12:45] SMF on turnera is CRITICAL: ERROR - offline: svc:/system/cluster/scsymon-srv:default [10:14:04] SMF on damiana is CRITICAL: ERROR - offline: svc:/system/cluster/scsymon-srv:default [10:15:05] SMTP on willow is OK: SMTP OK - 0.163 sec. response time [10:15:11] @replag [10:15:11] chicocvenancio: s1-rr-a-c: 2m 46s [-0.01 s/s]; s2-user: 4d 3h 52m 6s [+0.07 s/s]; s3-rr-a: 29s [+0.00 s/s]; s3-user: 29s [+0.00 s/s] [10:15:49] 3(commented) [TS-1309] daphne s2/s5 corrupted and offline <10https://jira.toolserver.org/browse/TS-1309> (Maciej Jaros) [10:16:34] /tmp on willow is OK: DISK OK - free space: /tmp 372 MB (72% inode=99%): [10:16:34] / on willow is WARNING: DISK WARNING - free space: / 22943 MB (20% inode=99%): [10:16:44] Cluster on willow is OK: CLUSTER OK ! 
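The "@replag" bot output above and the "s4 replag" Nagios checks later in the log both report lag in seconds; the checks do it by calling a ts_rc_age() helper on the replica. A sketch of asking a replica directly, assuming that helper is reachable from your session; the host alias below is hypothetical and would need to be replaced with the replica you actually use:

    # Seconds since the newest recentchanges row reached this replica
    # ("sql-s2-user" is a made-up alias; substitute the real replica host)
    mysql -h sql-s2-user -e 'SELECT ts_rc_age();'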
[10:16:44] Sun Grid Engine execd on willow is WARNING: NRPE: Unable to read output [10:17:14] Environment on willow is OK: ok: temperature ok fan ok voltage ok chassis ok [10:18:34] / on willow is OK: DISK OK - free space: / 24468 MB (22% inode=99%): [10:25:52] 3(commented) [TS-1321] No updates for plwiki_p since 2012-02-28 <10https://jira.toolserver.org/browse/TS-1321> (Maciej Jaros) [10:27:53] 3(assigned) [TS-1317] over-quota warn-mail missing / does not work <10https://jira.toolserver.org/browse/TS-1317> (drtrigon) [10:27:59] 3(created) [MNT-1210] Willow run havoc; Maintenance; Emergency work <10https://jira.toolserver.org/browse/MNT-1210> (Marlen Caemmerer) [10:29:58] 3(commented) [ACCAPP-458] Run bots in ptwiki <10https://jira.toolserver.org/browse/ACCAPP-458> (Francisco Carvalho Venancio) [10:44:20] hey [10:44:40] someone with mediawiki database knowledge online? [10:44:57] 3(commented) [TS-1321] No updates for plwiki_p since 2012-02-28 <10https://jira.toolserver.org/browse/TS-1321> (Marlen Caemmerer) [10:45:09] is it possible they added a field called job_timestamp quite recently? [10:45:31] yes, from 1.19 iirc [10:47:57] ALTER TABLE /*_*/job ADD COLUMN job_timestamp varbinary(14) NULL default NULL; [10:47:57] CREATE INDEX /*i*/job_timestamp ON /*_*/job(job_timestamp); [10:48:52] 3(commented) [TS-1321] No updates for plwiki_p since 2012-02-28 <10https://jira.toolserver.org/browse/TS-1321> (Piotr Gackowski) [10:49:02] thx [10:49:11] do you know when they changed it? [10:52:50] nosy, added around start of january, I guess it was applied to the db before switching to mw 1.19 [10:52:58] https://bugzilla.wikimedia.org/show_bug.cgi?id=27724 [10:53:08] phe: thx [10:54:00] is the fork ressource exhausted solved ? [10:58:59] what do you mean? [11:00:13] it was reported on the mail list and I got it tonight, things ala [11:00:17] $ls [11:00:19] -bash: fork: Not enough space [11:00:54] Load avg. on willow is CRITICAL: CRITICAL - load average: 33.62, 30.46, 29.88 [11:02:10] where? [11:02:12] willow? [11:02:17] thats solved [11:02:18] yeps [11:02:21] i think [11:12:24] SMF on willow is CRITICAL: ERROR - maintenance: svc:/network/puppetmasterd:default [11:13:44] SMF on turnera is CRITICAL: ERROR - offline: svc:/system/cluster/scsymon-srv:default [11:14:14] SMF on damiana is CRITICAL: ERROR - offline: svc:/system/cluster/scsymon-srv:default [11:17:44] Sun Grid Engine execd on willow is WARNING: NRPE: Unable to read output [11:25:54] Load avg. on willow is WARNING: WARNING - load average: 12.11, 13.41, 19.99 [11:27:54] 3(commented) [TS-1321] No updates for plwiki_p since 2012-02-28 <10https://jira.toolserver.org/browse/TS-1321> (Marlen Caemmerer) [11:32:57] 3(commented) [TS-1321] No updates for plwiki_p since 2012-02-28 <10https://jira.toolserver.org/browse/TS-1321> (Marlen Caemmerer) [11:39:54] Load avg. 
on willow is OK: OK - load average: 11.86, 11.57, 14.97 [11:46:23] SSH on z-dat-s7-a is CRITICAL: CRITICAL - Socket timeout after 10 seconds [11:47:14] SSH on z-dat-s7-a is OK: SSH OK - OpenSSH_5.8p2-hpn13v11 (protocol 2.0) [11:57:56] 3(commented) [TS-1321] No updates for plwiki_p since 2012-02-28 <10https://jira.toolserver.org/browse/TS-1321> (Maciej Jaros) [12:12:34] SMF on willow is CRITICAL: ERROR - maintenance: svc:/network/puppetmasterd:default [12:13:44] SMF on turnera is CRITICAL: ERROR - offline: svc:/system/cluster/scsymon-srv:default [12:14:14] SMF on damiana is CRITICAL: ERROR - offline: svc:/system/cluster/scsymon-srv:default [12:17:44] Sun Grid Engine execd on willow is WARNING: NRPE: Unable to read output [12:30:04] /sql on thyme is WARNING: DISK WARNING - free space: /sql 191874 MB (19% inode=99%): [12:40:52] 3(created) [TS-1322] Add siebrand and valhallasw to the nlwikibots mmp; Toolserver; Task <10https://jira.toolserver.org/browse/TS-1322> (Merlijn van Deen) [12:42:53] 3(updated) [TS-1322] Add siebrand and valhallasw to the nlwikibots mmp <10https://jira.toolserver.org/browse/TS-1322> (Merlijn van Deen) [12:44:52] 3(commented) [TS-1322] Add siebrand and valhallasw to the nlwikibots mmp <10https://jira.toolserver.org/browse/TS-1322> (Andre Koopal) [12:45:11] cool [12:45:14] /sql on rosemary is CRITICAL: DISK CRITICAL - free space: /sql 38022 MB (3% inode=99%): [12:46:55] 3(commented) [TS-1322] Add siebrand and valhallasw to the nlwikibots mmp <10https://jira.toolserver.org/browse/TS-1322> (Multichill) [12:51:01] [[Special:Log/newusers]] create 10 * Accepto * (New user account) [13:05:45] [[Job scheduling]] 10https://wiki.toolserver.org/w/index.php?diff=6737&oldid=6722&rcid=8900 * Merlissimo * (-1211) (update to new config) [13:10:09] [[Job scheduling]] M 10https://wiki.toolserver.org/w/index.php?diff=6738&oldid=6737&rcid=8901 * Merlissimo * (+92) () [13:12:44] SMF on willow is CRITICAL: ERROR - maintenance: svc:/network/puppetmasterd:default [13:13:02] [[Job scheduling]] M 10https://wiki.toolserver.org/w/index.php?diff=6739&oldid=6738&rcid=8902 * Merlissimo * (+0) (/* optional resources */ ) [13:13:44] SMF on turnera is CRITICAL: ERROR - offline: svc:/system/cluster/scsymon-srv:default [13:14:14] SMF on damiana is CRITICAL: ERROR - offline: svc:/system/cluster/scsymon-srv:default [13:15:04] Platonides * Re: [Toolserver-l] Willow - Intermittent problems [13:17:44] Sun Grid Engine execd on willow is WARNING: NRPE: Unable to read output [13:19:03] Platonides * Re: [Toolserver-l] MMP interwiki-bot [13:23:53] 3(assigned) [TS-1321] No updates for plwiki_p since 2012-02-28 <10https://jira.toolserver.org/browse/TS-1321> (Marlen Caemmerer) [13:32:49] [[Job scheduling]] 10https://wiki.toolserver.org/w/index.php?diff=6740&oldid=6739&rcid=8903 * Merlissimo * (+485) () [13:47:04] /sql on thyme is OK: DISK OK - free space: /sql 217475 MB (22% inode=99%): [13:55:04] Dr. Trigon * Re: [Toolserver-l] Willow - Intermittent problems [13:56:25] [[Job scheduling]] 10https://wiki.toolserver.org/w/index.php?diff=6741&oldid=6740&rcid=8904 * Merlissimo * (+300) (/* arguments to qsub/qcronsub */ +) [14:02:55] Load avg. on willow is WARNING: WARNING - load average: 15.50, 14.49, 13.88 [14:03:55] Load avg. 
on willow is OK: OK - load average: 14.35, 14.41, 13.89 [14:08:24] /sql on z-dat-s4-a is WARNING: DISK WARNING - free space: /sql 26488 MB (6% inode=99%): [14:09:24] /sql on z-dat-s4-a is CRITICAL: DISK CRITICAL - free space: /sql 20231 MB (4% inode=99%): [14:12:44] SMF on willow is CRITICAL: ERROR - maintenance: svc:/network/puppetmasterd:default [14:13:24] /sql on z-dat-s4-a is WARNING: DISK WARNING - free space: /sql 25130 MB (6% inode=99%): [14:13:54] SMF on turnera is CRITICAL: ERROR - offline: svc:/system/cluster/scsymon-srv:default [14:14:24] SMF on damiana is CRITICAL: ERROR - offline: svc:/system/cluster/scsymon-srv:default [14:15:09] Any admin around? [14:15:21] Fairly urgent issue :) [14:15:35] :/ [14:17:55] Sun Grid Engine execd on willow is WARNING: NRPE: Unable to read output [14:22:56] 3(commented) [MNT-1207] Import commonswiki on daphne <10https://jira.toolserver.org/browse/MNT-1207> (DaB.) [14:28:32] Snowolf I'm not an admin, but maybe I can help. Whats the issue? [14:28:54] Sebaso_WMDE: I'll query you [14:29:08] ok [14:29:59] [[Job scheduling]] 10https://wiki.toolserver.org/w/index.php?diff=6742&oldid=6741&rcid=8905 * Merlissimo * (+2335) (/* Examples */ ) [14:33:56] 3(resolved) [MNT-1207] Import commonswiki on daphne <10https://jira.toolserver.org/browse/MNT-1207> (DaB.) [14:38:00] [[Job scheduling]] 10https://wiki.toolserver.org/w/index.php?diff=6743&oldid=6742&rcid=8906 * Merlissimo * (+569) (/* optional resources */ ) [14:40:52] 3(created) [MNT-1211] Corrected ceradon's cron on willow; Maintenance; Emergency work <10https://jira.toolserver.org/browse/MNT-1211> (DaB.) [14:40:55] 3(updated) [MNT-1211] Corrected ceradon's cron on willow <10https://jira.toolserver.org/browse/MNT-1211> (DaB.) [14:41:29] Afternoon DaBPunkt :) [14:42:02] *wave* bye [14:43:09] Thanks DaBPunkt :) [14:44:56] 3(updated) [MNT-1212] Changed permission of /home/soxred93/Peachy/Configs/ to 600 <10https://jira.toolserver.org/browse/MNT-1212> (DaB.) [14:44:59] 3(created) [MNT-1212] Changed permission of /home/soxred93/Peachy/Configs/ to 600; Maintenance; Emergency work <10https://jira.toolserver.org/browse/MNT-1212> (DaB.) [14:48:33] It's my wikipedia-free-saturday… [14:48:49] @replag [14:48:50] Akoopal: s1-rr-a-c: 4m 34s [-]; s2-user: 3d 15h 38m 45s [-]; s4-rr-a: error [14:49:03] DaBPunkt: then what are you doing here :-) [14:49:19] DaBPunkt same here ;) [14:49:33] Akoopal: fixing stuff for Sebaso_WMDE ;/ [14:49:52] so the wikipedia-free was a fail :-) [14:50:02] DaBPunkt sorry ;( [14:50:17] DaBPunkt, Just run away now ;p [14:50:31] @replag [14:50:31] DaBPunkt: s1-rr-a-c: 5m 50s [-]; s2-user: 3d 15h 33m 39s [-]; s3-rr-a: 1m 25s [-]; s3-user: 1m 25s [-] [14:51:08] Sebaso_WMDE: s2 was moved to cassia some weeks ago. nosy fixed the replication-problem on it tonight as far as I ave read [14:51:41] @replag [14:51:41] DaBPunkt: s1-rr-a-c: 6m 22s [+0.46 s/s]; s2-user: 3d 15h 29m 40s [-3.42 s/s] [14:52:23] @replag [14:52:24] DaBPunkt: s1-rr-a-c: 6m 41s [+0.45 s/s]; s2-user: 3d 15h 27m 38s [-2.87 s/s]; s3-rr-a: 50s [-0.31 s/s]; s3-user: 50s [-0.31 s/s] [14:53:01] cool! s2 is comming back to earth [14:53:38] chicocvenancio: yes, but slow [14:54:51] I can almost see all that happened in February [14:58:14] DaBPunkt yep, I know. 
I just hope there could be an s2 with a little bit less replag on daphne somehow :) [14:59:23] Sebaso_WMDE: if there were an s2 on daphne, its replag would be even greater [14:59:33] but there is none [14:59:34] /sql on z-dat-s4-a is WARNING: DISK WARNING - free space: /sql 36576 MB (8% inode=99%): [15:00:35] The strange thing is that s2 DID replicate in the last days (otherwise we would have discovered the problem earlier) [15:03:34] /sql on z-dat-s4-a is CRITICAL: DISK CRITICAL - free space: /sql 22322 MB (5% inode=99%): [15:03:58] 14GB in 3 Minutes??? [15:04:50] DaBPunkt http://toolserver.org/~bryan/stats/replag/?cluster=s2 says it didn't - (but trainwreck says it did. strange) [15:05:04] DaBPunkt: just curious, what was the replication problem? [15:05:34] /sql on z-dat-s4-a is OK: DISK OK - free space: /sql 134057 MB (33% inode=99%): [15:06:15] akoopal https://jira.toolserver.org/browse/TS-1321 [15:07:35] it's BTW a bug in trainwreck, because we do not replicate the job-table normally… [15:08:15] ok [15:09:06] Sebaso_WMDE: I checked the replag here in the channel over the last days from time to time: and it was always going down (slowly, but going) [15:09:29] DaBPunkt: it had been going up [15:09:51] but less than 1s/s, so there still was replication, just not quick enough [15:09:57] see the graphs also [15:10:11] [Saturday 03 March 2012] [00:02:35] @replag [15:10:12] [Saturday 03 March 2012] [00:02:35] Akoopal: s2-user: 4d 3h 2m 38s [-1.03 s/s]; s3-rr-a: 18s [-0.00 s/s]; s3-user: 18s [-0.00 s/s]; s6-rr-a: 14s [+0.00 s/s]; s6-user: 14s [+0.00 s/s] [15:10:36] should I post more examples? [15:10:39] yeah, then it was good [15:11:03] DaBPunkt: check http://bit.ly/a7F74v [15:11:42] Akoopal: and? [15:12:11] the graphs show that the replag was increasing between 28.feb and 2.mar. [15:12:22] you see it first going down, when s2 was started, then after a while going up, and probably when nosy fixed this going down again [15:12:33] But that's exactly the timespan I was checking here in the channel [15:12:43] SMF on willow is CRITICAL: ERROR - maintenance: svc:/network/puppetmasterd:default [15:12:50] ahhh [15:13:10] nosy fixed the problem tonight, that's the 3rd of March [15:13:34] (there was a little bit of an increase tonight too, yes) [15:13:50] but I am wondering with you why it first worked, and then seemed to have been broken [15:13:54] SMF on turnera is CRITICAL: ERROR - offline: svc:/system/cluster/scsymon-srv:default [15:14:24] SMF on damiana is CRITICAL: ERROR - offline: svc:/system/cluster/scsymon-srv:default [15:17:55] Sun Grid Engine execd on willow is WARNING: NRPE: Unable to read output [15:19:30] no nosy [15:19:43] did replication get fixed? she was asking about it in #-tech [15:20:07] jeremyb: yes [15:20:12] k [15:20:51] btw, who will be in berlin in nearly a month? will there be another toolserver funding/gov session ? [15:21:07] i don't even know who was in the session last year [15:21:24] I thought the berlin-meeting is in june? [15:23:10] wmcon is end of march, hackathon is in june [15:23:12] DaBPunkt: there are 2 [15:24:05] Sebaso_WMDE: I already mailed Johanis: The hackathon is at the same time as the admin-meeting of dewp. Can one of them please move to another date? [15:35:03] re [15:35:39] I will leave now for an hour, cu [15:35:51] DaBPunkt will talk to johannes. 
cu [15:44:24] Sun Grid Engine execd on ortelius is WARNING: short@ortelius exceedes load threshold: alarm hl:np_load_short=1.291016/1.00, alarm hl:np_load_long=0.803711/1.50, alarm hl:mem_free=20776.000000M/300M, alarm hl:available=1/0: all.q@ortelius exceedes load threshold: alarm hl:np_load_short=1.291016/1.10, alarm hl:np_load_long=0.803711/1.50, alarm hl:mem_free=20776.000000M/300M, alarm hl:available=1/0 [15:45:25] Sun Grid Engine execd on ortelius is OK: short@ortelius OK: all.q@ortelius OK [16:09:55] [[Special:Log/newusers]] create 10 * Heubergen * (New user account) [16:12:44] SMF on willow is CRITICAL: ERROR - maintenance: svc:/network/puppetmasterd:default [16:13:54] SMF on turnera is CRITICAL: ERROR - offline: svc:/system/cluster/scsymon-srv:default [16:14:24] SMF on damiana is CRITICAL: ERROR - offline: svc:/system/cluster/scsymon-srv:default [16:18:05] Sun Grid Engine execd on willow is WARNING: NRPE: Unable to read output [16:24:52] 3(created) [UTRS-58] Clear out beta test data in preparation for launch; UTRS; Planned work - no user impact <10https://jira.toolserver.org/browse/UTRS-58> (Hersfold) [16:44:23] Aww man, this is annoying, last night my access to the toolserver was fine and now I can't log in, and the really weird thing is I didn't change anything. [16:46:35] /sql on z-dat-s4-a is WARNING: DISK WARNING - free space: /sql 37052 MB (9% inode=99%): [16:50:34] /sql on z-dat-s4-a is OK: DISK OK - free space: /sql 53006 MB (13% inode=99%): [16:54:34] /sql on z-dat-s4-a is WARNING: DISK WARNING - free space: /sql 37313 MB (9% inode=99%): [17:03:05] Load avg. on willow is WARNING: WARNING - load average: 15.49, 14.89, 12.81 [17:03:35] Sun Grid Engine execd on ortelius is WARNING: short@ortelius exceedes load threshold: alarm hl:np_load_short=1.259766/1.00, alarm hl:np_load_long=0.783203/1.50, alarm hl:mem_free=21565.000000M/300M, alarm hl:available=1/0: all.q@ortelius exceedes load threshold: alarm hl:np_load_short=1.259766/1.10, alarm hl:np_load_long=0.783203/1.50, alarm hl:mem_free=21565.000000M/300M, alarm hl:available=1/0 [17:04:05] Load avg. on willow is OK: OK - load average: 14.16, 14.64, 12.86 [17:05:45] Sun Grid Engine execd on ortelius is OK: short@ortelius OK: all.q@ortelius OK [17:12:49] SMF on willow is CRITICAL: ERROR - maintenance: svc:/network/puppetmasterd:default [17:14:09] SMF on turnera is CRITICAL: ERROR - offline: svc:/system/cluster/scsymon-srv:default [17:14:38] SMF on damiana is CRITICAL: ERROR - offline: svc:/system/cluster/scsymon-srv:default [17:18:08] Sun Grid Engine execd on willow is WARNING: NRPE: Unable to read output [17:19:51] ceradon: did you subscribe to toolserver-l yet? [17:20:19] willow is overloaded because nightshade is down [17:23:39] Sun Grid Engine execd on ortelius is WARNING: short@ortelius exceedes load threshold: alarm hl:np_load_short=1.011719/1.00, alarm hl:np_load_long=0.857422/1.50, alarm hl:mem_free=21606.000000M/300M, alarm hl:available=1/0 [17:40:23] ceradon: the trouble last night was caused by yourself. You started 40 instances of your php script and consumed all cpu resources on willow. 
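The "40 instances" problem described above is the classic failure mode of a frequent cron entry that keeps launching a script whose previous run never finished. A minimal sketch of a wrapper that refuses to start while an earlier run is still active; the paths and script name are illustrative only, and mkdir is used as the lock because it is atomic and needs no GNU flock on Solaris:

    #!/bin/sh
    # Hypothetical wrapper to keep cron from piling up concurrent runs.
    LOCKDIR="$HOME/.mybot.lock"
    if mkdir "$LOCKDIR" 2>/dev/null; then
        trap 'rmdir "$LOCKDIR"' 0               # drop the lock on exit
        php "$HOME/mybot/run.php"               # the real work goes here
    else
        # a stale lock after a crash has to be removed by hand
        echo "previous run still active, skipping" >&2
    fi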
[17:41:28] Sun Grid Engine execd on wolfsbane is WARNING: short@wolfsbane exceedes load threshold: alarm hl:np_load_short=0.184570/1.00, alarm hl:np_load_long=0.178223/1.50, alarm hl:mem_free=207.000000M/300M, alarm hl:available=1/0: all.q@wolfsbane exceedes load threshold: alarm hl:np_load_short=0.184570/1.10, alarm hl:np_load_long=0.178223/1.50, alarm hl:mem_free=207.000000M/300M, alarm hl:available=1/0 [17:42:42] Merlissimo: Of course this wasn't my intention, My script must've gone awry. [17:43:11] why didn't you use grid engine? [17:43:28] Sun Grid Engine execd on wolfsbane is OK: short@wolfsbane OK: all.q@wolfsbane OK [17:47:27] Sun Grid Engine execd on wolfsbane is WARNING: short@wolfsbane exceedes load threshold: alarm hl:np_load_short=0.167969/1.00, alarm hl:np_load_long=0.171387/1.50, alarm hl:mem_free=166.000000M/300M, alarm hl:available=1/0: all.q@wolfsbane exceedes load threshold: alarm hl:np_load_short=0.167969/1.10, alarm hl:np_load_long=0.171387/1.50, alarm hl:mem_free=166.000000M/300M, alarm hl:available=1/0 [18:10:56] 3(created) [TS-1323] My account, ceradon; Toolserver: Accounts; Task <10https://jira.toolserver.org/browse/TS-1323> (Symon Klyde) [18:12:58] SMF on willow is CRITICAL: ERROR - maintenance: svc:/network/puppetmasterd:default [18:14:09] SMF on turnera is CRITICAL: ERROR - offline: svc:/system/cluster/scsymon-srv:default [18:14:48] SMF on damiana is CRITICAL: ERROR - offline: svc:/system/cluster/scsymon-srv:default [18:18:09] Sun Grid Engine execd on willow is WARNING: NRPE: Unable to read output [18:22:38] /sql on z-dat-s4-a is WARNING: DISK WARNING - free space: /sql 37595 MB (9% inode=99%): [18:25:10] Willow seems to be working better this morning. [18:28:39] /sql on z-dat-s4-a is CRITICAL: DISK CRITICAL - free space: /sql 24393 MB (5% inode=99%): [18:29:38] /sql on z-dat-s4-a is OK: DISK OK - free space: /sql 49904 MB (12% inode=99%): [19:01:05] [[Special:Log/newusers]] create 10 * MasterAdministratorBot * (New user account) [19:04:53] 3(resolved) [TS-1323] My account, ceradon <10https://jira.toolserver.org/browse/TS-1323> (Symon Klyde) [19:10:42] [[Job scheduling]] 10https://wiki.toolserver.org/w/index.php?diff=6744&oldid=6743&rcid=8909 * Merlissimo * (+585) (/* Examples */ +) [19:12:58] SMF on willow is CRITICAL: ERROR - maintenance: svc:/network/puppetmasterd:default [19:14:17] SMF on turnera is CRITICAL: ERROR - offline: svc:/system/cluster/scsymon-srv:default [19:14:48] SMF on damiana is CRITICAL: ERROR - offline: svc:/system/cluster/scsymon-srv:default [19:16:53] 3(commented) [LUXO-25] Global block link for IPs at ~luxo/contributions/contributions.php <10https://jira.toolserver.org/browse/LUXO-25> (Luxo) [19:18:18] Sun Grid Engine execd on willow is WARNING: NRPE: Unable to read output [19:22:09] s4 replag on rosemary is WARNING: QUERY WARNING: SELECT ts_rc_age() returned 1919.000000 [19:24:53] [[Wikirita the free Encyclopedia]] !NM 10https://wiki.toolserver.org/w/index.php?oldid=6745&rcid=8910 * MasterAdministratorBot * (+460) (ikiritia is now will open now.) [19:26:07] bad grammer bot [19:28:26] [[Special:Log/delete]] delete 10 * MZMcBride * (deleted "[[02Wikirita the free Encyclopedia10]]": nonsense) [19:28:53] I guess I can leave the account unblocked for now. 
[19:31:52] 3(closed) [CVN-4] Requesting access to "cvn" project <10https://jira.toolserver.org/browse/CVN-4> (Seahorseruler ) [19:48:12] can someone tell me how to get a cron job to work on the toolserver [19:48:19] and yes I've read all the documentation [19:48:52] crontab -e [19:48:58] I did that [19:49:24] which server does it need to be set up on? [19:49:38] willow or clematis? [19:49:47] kaldari: first login on host submit (ssh submit) [19:49:54] submit=clematis [19:49:57] ok, I'm there [19:50:18] "cronie -e" opens an editor [19:50:34] OK... [19:50:43] nano or vi depends on your shell setting [19:50:51] http://en.wikipedia.org/wiki/Cron#Predefined_scheduling_definitions [19:50:54] how often do you want to start this job? [19:51:34] the documentation breaks down at this point, am I supposed to use cronsub, qcronsub, or just input the job directly? [19:51:54] Once a day [19:52:16] at which time? [19:52:34] 1 am [19:52:43] so something like: [19:52:44] 0 1 * * * $HOME/public_html/hotarticles/runbot.php [19:53:15] but the documentation is contradictory on how I'm supposed to set this up properly [19:53:26] it says I shouldn't just run the command directly [19:53:41] but isn't clear on how it should be set up [19:53:42] just write this line in this editor [19:54:18] then add e.g. "15 1 * * * qcronsub -b y -wd $HOME/public_html/hotarticles $HOME/public_html/hotarticles/runbot.php" [19:54:37] yay [19:54:52] That's 1 a.m., UTC. [19:55:03] Luxo: you must not start jobs on clematis without sge [19:55:46] I didn't know any jobs or crontabs were going on a non-login host. [19:55:57] i suggested 1:15 because many other jobs are submitted on the hour, but that's not so important [19:56:03] Usually all cron shit is done on the login hosts (willow and nightshade). [19:56:42] Joan: Yeah, I tried to set it up from login and it did nothing :( [19:57:03] Standard cron should be fine on the login hosts. [19:57:10] I use it. The job scheduling shit confuses me. ;-) [19:57:25] Joan: but if one login server is down the job is not executed [19:57:32] Yes, it would be nice if someone could update the documentation [19:57:39] Merlissimo: True. Luckily I chose willow. :D [19:57:48] And I save my crontab in a text file. [19:57:54] So I can sync to another server if one is down. [19:58:13] mzmcbride@willow:~$ head ~/misc/crontab.txt [19:58:13] # INSTRUCTIONS [19:58:13] # MIN HOUR DAYMONTH MONTH DAYWEEK [19:58:13] # ARTICLES [19:58:13] 0 19 6 * * PYTHONPATH=$HOME/scripts python $HOME/scripts/database-reports/deletedfilesinarticles.py > /dev/null [19:58:40] willow's cron is a bit annoying about setting PYTHONPATH. You can't just declare variables at the top of the file. [19:58:42] if you use submit you don't have to do anything if this host is down, because another host will run this cron instead automatically [19:58:47] cronie is better about this. [19:59:00] Merlissimo: Assuming the system is working properly. :-) [19:59:20] I think cron is more reliable. [19:59:47] Joan: and if willow is busy you start another job that's making willow even more busy [20:00:16] I understand the virtue of load balancing. I just don't trust the current architecture. :-) [20:00:36] Joan: that's why i have changed it during the last month [20:00:37] River set it up and she's gone. Is anyone actively maintaining it? How long before it's dropped for some other system? [20:00:52] cron is forever. [20:01:00] kaldari: how long will your job run? [20:01:12] about 1 or 2 minutes [20:01:26] and how much memory usage do you expect? 
[20:01:35] very little [20:02:12] did willow get fixed? [20:02:23] then use "0 1 * * * qcronsub -b y -wd $HOME/public_html/hotarticles -l h_rt=20:0 -l virtual_free=50M $HOME/public_html/hotarticles/runbot.php" that limits your job to 20 minutes and 50M of memory [20:02:26] Joan: Personally I would prefer something a bit more modern like Jenkins [20:02:37] Magog_the_Ogre: [20:02:38] mzmcbride@willow:~$ uptime [20:02:38] 20:02pm up 15 days 7:25, 49 users, load average: 12.82, 11.61, 10.75 [20:02:46] yay [20:02:49] but cron would be nice as well, at least everyone knows how to use it :) [20:02:52] was I correct that the problem was simply sudo kill 13912 [20:03:01] Heh, everyone assumes it'll be Vixie's cron. [20:03:11] that a certain user (*cough*) had been idle for 9 days but had a python script running non-stop the whole time? [20:03:19] Dumps are large. [20:03:25] They require a lot of time to process. [20:03:34] Actual time and computing time. [20:04:20] however, this maybe is not the right time to be processing dumps ;-) [20:04:22] Joan: there is much cpu time available on toolserver in general. only willow is very busy [20:04:22] Why don't you guys just kill all processes after 5 minutes, and if someone needs more resources, they offload the job to an external server. [20:04:48] There's some kind of resource management. Or there should be, when it's working. [20:04:53] kaldari: because that would severely limit the usefulness of the toolserver [20:04:55] slayrd watches memory usage. [20:05:11] valhallasw: There's never a right time to process them. They're so big and cumbersome. :-9 [20:05:14] :-( [20:05:14] right [20:05:15] The toolserver is generally useless from my point of view [20:05:24] maybe certain processes that go over a certain amount of processing time [20:05:30] kaldari: There are wmflabs now too. [20:05:31] Joan: true, but having only a single job server is most certainly /not/ the right time ;-) [20:05:37] although it should be OK if it's nice'd [20:05:42] The Toolserver really is only needed if you need access to the replicated DBs. [20:05:53] Everything else can be done elsewhere. [20:06:00] Everything I've ever set up on the toolserver has been broken within a couple months due to something on the servers changing [20:06:01] any user processes that go over 1000 CPU minutes would be a high end mark, but a reasonable one [20:06:20] but I digress [20:06:22] valhallasw: sge currently uses four servers for processing submitted jobs [20:06:48] Merlissimo: three at the moment, and only one for long-running jobs [20:07:13] valhallasw: four, and two for long-running jobs, but one with very limited memory [20:07:35] in that case I missed a change [20:07:38] Sun Grid Engine execd on ortelius is WARNING: short@ortelius exceedes load threshold: alarm hl:np_load_short=1.021484/1.00, alarm hl:np_load_long=0.838867/1.50, alarm hl:mem_free=21432.000000M/300M, alarm hl:available=1/0 [20:08:14] valhallasw: clematis was added for jobs using very low memory [20:08:38] Sun Grid Engine execd on ortelius is OK: short@ortelius OK: all.q@ortelius OK [20:08:45] Merlissimo: Thanks for the help, I've gotta run! 
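Two practical habits come out of the exchange above: keep the crontab itself in a file that can be reinstalled on whichever host is up (mzmcbride's ~/misc/crontab.txt), and, once entries submit through qcronsub, check on the jobs with the SGE tools rather than ps. A short sketch; the job name passed to qacct is a placeholder and should match whatever name your job actually gets:

    # Snapshot the live crontab into a file, and reinstall it after switching hosts
    crontab -l > ~/misc/crontab.txt
    crontab ~/misc/crontab.txt
    # Once the cron entry submits via qcronsub, SGE tracks the job:
    qstat -u "$USER"        # is it queued or running?
    qacct -j runbot         # afterwards: runtime, maxvmem, exit status ("runbot" is a placeholder name)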
[20:11:39] /sql on z-dat-s4-a is WARNING: DISK WARNING - free space: /sql 36845 MB (9% inode=99%): [20:12:58] SMF on willow is CRITICAL: ERROR - maintenance: svc:/network/puppetmasterd:default [20:14:18] SMF on turnera is CRITICAL: ERROR - offline: svc:/system/cluster/scsymon-srv:default [20:14:48] SMF on damiana is CRITICAL: ERROR - offline: svc:/system/cluster/scsymon-srv:default [20:16:39] /sql on z-dat-s4-a is OK: DISK OK - free space: /sql 50382 MB (12% inode=99%): [20:18:19] Sun Grid Engine execd on willow is WARNING: NRPE: Unable to read output [20:21:39] /sql on z-dat-s4-a is WARNING: DISK WARNING - free space: /sql 36253 MB (8% inode=99%): [20:21:57] hey guys, anyone know of a toolserver user / tool who might be accessing stats.grok.se rapidly right now? [20:22:16] Hey henrik. :-) [20:22:19] s4 replag on rosemary is WARNING: QUERY WARNING: SELECT ts_rc_age() returned 3585.000000 [20:22:38] hey :) [20:23:19] Are you getting slammed? Script should have a useful user-agent, otherwise block and wait for an e-mail? [20:23:38] That's what Wikimedia does. [20:23:49] yeah, my loadavg is >10 at the moment. [20:24:03] No user-agent :( [20:24:05] Single source, though? [20:24:28] And I'd hate to block the toolserver IPs, so I thought I'd check in here first. [20:24:54] DaBPunkt: ^ [20:25:08] You should only block it when there's no UA provided, if possible. [20:25:19] s4 replag on rosemary is CRITICAL: QUERY CRITICAL: SELECT ts_rc_age() returned 3602.000000 [20:25:46] not sending a UA is a valid block-reason, yes [20:26:36] It's coming from 91.198.174.[195, 202, 203, 221] [20:27:40] henrik: from where does it come at the moment? [20:27:54] DaBPunkt: all of the above IPs [20:28:12] at the same time? [20:28:13] Also .211 [20:28:15] yes [20:28:35] strange. what's your ip? [20:28:56] (of the server, of course) [20:29:22] stats.grok.se is at 46.253.202.68 [20:31:08] henrik: I will look [20:31:15] thanks [20:31:20] I guess I should really implement some sort of rate limiter too :) [20:37:50] On one of the JIRA tickets, it said Debian was installed on nightshade after it was wiped. Just wondering, is Debian going to be used on nightshade permanently now instead of solaris? (I am hoping so.) [20:38:08] * Seahorse personally dislikes solaris [20:40:30] Seahorse: yes [20:40:47] wonderful. Thank you. [20:51:12] DaBPunkt: finding any likely suspect? [20:51:43] henrik: not yet. The connections are opened and closed very fast, so I'm having problems finding the PID for the connection [20:53:56] I guess they aren't using keep-alive [20:56:10] henrik: I'd almost suggest to add a sleep(10); to the output ;-) [20:57:06] hehe [20:57:08] done :) [20:57:14] wait a moment [20:58:14] eh, hold on [20:58:18] henrik: did it stop now? [21:00:23] still getting requests [21:00:43] ok, 1 more moment [21:00:44] I think [21:00:53] harder to see them now :) [21:02:26] ok, I see some more [21:03:04] 91.198.174.211 - - [03/Mar/2012:22:01:35 +0100] "GET /json/en/201202/Operation_Calendar HTTP/1.0" 200 736 "-" "-" [21:03:09] was one [21:03:21] ok, I killed them all now AFAIS [21:04:20] thank you [21:04:22] Please tell the user to contact me; maybe we can work something better out. [21:04:51] I guess he just needs to fix his script… [21:05:05] henrik: how should he contact you? [21:05:33] either here on IRC or en:User talk:Henrik [21:06:01] henrik: ok, thank you very much for your patience [21:06:08] but maybe it was just a runaway script [21:06:26] no worries. Thanks for your help too [21:06:52] so what is the verdict on our cron jobs? 
it seems like every time I ask I get a different ansewr. [21:08:09] ok, then I can finaly go to dinner [21:09:36] DaBPunkt: enjoy :-) [21:11:44] lol it's like 10PM in Germany [21:11:53] or Denmark or Netherlands [21:11:57] wherever you're from >_> [21:12:58] SMF on willow is CRITICAL: ERROR - maintenance: svc:/network/puppetmasterd:default [21:14:18] s4 replag on rosemary is WARNING: QUERY WARNING: SELECT ts_rc_age() returned 3479.000000 [21:14:28] SMF on turnera is CRITICAL: ERROR - offline: svc:/system/cluster/scsymon-srv:default [21:14:48] SMF on damiana is CRITICAL: ERROR - offline: svc:/system/cluster/scsymon-srv:default [21:17:55] 3(created) [MAGNUS-304] UNKNOWN TAG : 'BR'!; Magnus' tools: CommonsHelper; Bug <10https://jira.toolserver.org/browse/MAGNUS-304> (Patrick Albrecht) [21:18:28] Sun Grid Engine execd on willow is WARNING: NRPE: Unable to read output [21:44:18] s4 replag on rosemary is OK: QUERY OK: SELECT ts_rc_age() returned 1691.000000 [21:48:03] Merlissimo * [Toolserver-l] Grid Engine config change [21:59:38] RAID on adenia is CRITICAL: CHECK_NRPE: Socket timeout after 30 seconds. [22:02:19] Load avg. on willow is WARNING: WARNING - load average: 15.50, 14.49, 13.17 [22:03:18] Load avg. on willow is OK: OK - load average: 13.86, 14.17, 13.14 [22:08:07] [[Job scheduling]] M 10https://wiki.toolserver.org/w/index.php?diff=6746&oldid=6744&rcid=8912 * Dab * (+1) (/* optional resources */ ) [22:12:18] Load avg. on willow is WARNING: WARNING - load average: 17.16, 15.67, 14.13 [22:12:26] Merlissimo: maxvmem is the value from qacct for virtual_free? [22:12:58] SMF on willow is CRITICAL: ERROR - maintenance: svc:/network/puppetmasterd:default [22:13:57] that is the maximum memory that the job used during runtime. so this values should be less than virtual_free [22:14:30] ok [22:14:38] SMF on turnera is CRITICAL: ERROR - offline: svc:/system/cluster/scsymon-srv:default [22:14:49] SMF on damiana is CRITICAL: ERROR - offline: svc:/system/cluster/scsymon-srv:default [22:18:38] Sun Grid Engine execd on willow is WARNING: NRPE: Unable to read output [22:23:19] Load avg. on willow is OK: OK - load average: 14.32, 14.99, 14.64 [22:23:49] Sun Grid Engine execd on ortelius is WARNING: short@ortelius exceedes load threshold: alarm hl:np_load_short=2.750000/1.00, alarm hl:np_load_long=1.221680/1.50, alarm hl:mem_free=21090.000000M/300M, alarm hl:available=1/0: all.q@ortelius exceedes load threshold: alarm hl:np_load_short=2.750000/1.10, alarm hl:np_load_long=1.221680/1.50, alarm hl:mem_free=21090.000000M/300M, alarm hl:available=1/0 [22:24:18] RAID on adenia is OK: OK - TOTAL: 2: FAILED: 0: DEGRADED: 0 [22:25:49] Sun Grid Engine execd on ortelius is OK: short@ortelius OK: all.q@ortelius OK [22:32:19] Load avg. on willow is WARNING: WARNING - load average: 16.11, 15.60, 14.83 [22:34:19] Load avg. on willow is OK: OK - load average: 13.31, 14.67, 14.57 [22:42:49] Sun Grid Engine execd on ortelius is WARNING: short@ortelius exceedes load threshold: alarm hl:np_load_short=3.238281/1.00, alarm hl:np_load_long=1.203125/1.50, alarm hl:mem_free=21280.000000M/300M, alarm hl:available=1/0: all.q@ortelius exceedes load threshold: alarm hl:np_load_short=3.238281/1.10, alarm hl:np_load_long=1.203125/1.50, alarm hl:mem_free=21280.000000M/300M, alarm hl:available=1/0 [22:47:03] K. 
Peachey * Re: [Toolserver-l] MMP interwiki-bot [22:47:52] nighty [23:04:22] [[Job scheduling]] M 10https://wiki.toolserver.org/w/index.php?diff=6747&oldid=6746&rcid=8913 * Dab * (+4) (/* arguments to qsub/qcronsub */ ) [23:08:28] Hi! Just found out my cron job (updating an rss feed) hasn't been running since Feb 17. Should I be using SGE instead or can I get it to work? [23:10:38] skagedal: you used nightshade before? [23:11:14] yes, iirc. I logged in to willow now and edited the crontab. that will work? [23:12:16] yes, but if you use sge for your job and use cronie on host submit it will also run if willow is down [23:13:08] SMF on willow is CRITICAL: ERROR - maintenance: svc:/network/puppetmasterd:default [23:13:34] Merlissimo: all right. seems like a good idea. is there a simple guide on how to do this? was a bit confused with the docs. just need to run a shell script once every day. [23:14:10] can you post the cron line you used before? [23:14:29] 5 * * * * /home/skagedal/fafafa/do_fafafa.sh [23:14:53] wait, that runs every hour right? [23:15:07] SMF on damiana is CRITICAL: ERROR - offline: svc:/system/cluster/scsymon-srv:default [23:15:18] yes [23:15:36] SMF on turnera is CRITICAL: ERROR - offline: svc:/system/cluster/scsymon-srv:default [23:15:47] yes, i remember now, i run it every hour to retry if it didn't work last time. [23:16:11] how long does this job run? [23:16:37] probably like 5 sec :) [23:16:41] max. [23:16:49] and how much memory? [23:17:56] the easiest way is to add all sge options at the top of your sh file. just add [23:18:03] Dr. Trigon * Re: [Toolserver-l] Grid Engine config change [23:18:33] don't know. how can i check? not much, it's a python tool that downloads a file or a couple and outputs an xml file. [23:18:39] #$ -N FaFaFa [23:18:40] #$ -l h_rt=5:00 [23:18:40] #$ -l virtual_free=50M [23:19:28] then add "5 * * * * qcronsub /home/skagedal/fafafa/do_fafafa.sh" to cronie -e on host submit [23:19:37] Sun Grid Engine execd on willow is WARNING: NRPE: Unable to read output [23:19:51] shouldn't be more than 50M [23:20:16] Load avg. on willow is WARNING: WARNING - load average: 17.77, 15.39, 15.39 [23:20:52] h_rt=5:00 means a maximum runtime of 5 minutes [23:21:25] Merlissimo: all right, thanks! so, just add these right after the #! line in do_fafafa.sh? [23:21:35] yes [23:22:19] (and remove the regular crontab from willow again, i assume) [23:22:44] sure, if you don't want your script to run twice [23:23:16] Load avg. on willow is OK: OK - load average: 13.35, 14.18, 14.90 [23:23:44] what was your problem with the documentation? [23:24:20] [[Job scheduling]] ! 10https://wiki.toolserver.org/w/index.php?diff=6748&oldid=6747&rcid=8914 * DrTrigon * (+1534) (/* Managing jobs / Advanced features */ added again from history and enhanced) [23:26:19] [[Job scheduling]] !M 10https://wiki.toolserver.org/w/index.php?diff=6749&oldid=6748&rcid=8915 * DrTrigon * (-20) (/* Managing jobs */ ) [23:27:16] Load avg. on willow is WARNING: WARNING - load average: 15.32, 14.80, 14.97 [23:27:34] Merlissimo: i guess I didn't get the part about cronie and submit.toolserver.org... i see it now... [23:28:16] Load avg. on willow is OK: OK - load average: 14.86, 14.69, 14.92 [23:28:53] [[Job scheduling]] ! 
10https://wiki.toolserver.org/w/index.php?diff=6750&oldid=6749&rcid=8916 * DrTrigon * (+98) (/* Managing jobs */ description of some states) [23:30:10] [[Job scheduling]] M 10https://wiki.toolserver.org/w/index.php?diff=6751&oldid=6750&rcid=8917 * Dab * (+2) (/* optional resources */ ) [23:32:40] [[Interwiki bot MMP planning]] !N 10https://wiki.toolserver.org/w/index.php?oldid=6752&rcid=8918 * MF-Warburg * (+2852) (Created page with "For planning the Interwiki bot Multi-Manager Project, according to [http://lists.wikimedia.org/pipermail/toolserver-l/2012-March/004755.html this suggestion]. Please list the in...") [23:34:03] MF-Warburg * Re: [Toolserver-l] MMP interwiki-bot [23:37:11] [[Interwiki bot MMP planning]] ! 10https://wiki.toolserver.org/w/index.php?diff=6753&oldid=6752&rcid=8919 * MF-Warburg * (+81) () [23:41:27] s4 replag on rosemary is WARNING: QUERY WARNING: SELECT ts_rc_age() returned 1894.000000 [23:48:28] s4 replag on rosemary is OK: QUERY OK: SELECT ts_rc_age() returned 1724.000000 [23:50:27] Load avg. on willow is WARNING: WARNING - load average: 20.03, 16.00, 15.76 [23:59:27] Load avg. on willow is OK: OK - load average: 13.43, 14.16, 14.93
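Putting Merlissimo's last pieces of advice together, the hourly feed job above ends up as SGE directives embedded at the top of the script plus a single cronie entry on host submit. A minimal sketch, assuming qcronsub honours embedded "#$" options the same way qsub does (which is how the instructions above read), with the script body left as a placeholder:

    #!/bin/sh
    #$ -N FaFaFa               # job name, as suggested above
    #$ -l h_rt=5:00            # hard runtime limit of 5 minutes
    #$ -l virtual_free=50M     # memory request of 50M
    # ... the existing feed-update commands from do_fafafa.sh go here ...

    # and in "cronie -e" on host submit (after removing the entry from willow's crontab):
    # 5 * * * * qcronsub /home/skagedal/fafafa/do_fafafa.sh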