[00:00:53] Load avg. on willow is CRITICAL: CRITICAL - load average: 43.33, 25.33, 20.88 [00:08:54] /sql on ptolemy is OK: DISK OK - free space: /sql 128193 MB (21% inode=99%): [00:10:44] SMF on turnera is CRITICAL: ERROR - offline: svc:/system/cluster/scsymon-srv:default [00:12:05] SMF on damiana is CRITICAL: ERROR - offline: svc:/system/cluster/scsymon-srv:default [00:15:44] Can someone guide me on how to create a crontab file? [00:20:38] ceradon: what do you wanna do? [00:21:22] create a crontab to run my bot, can you help, Alchimista? [00:21:24] SMF on willow is CRITICAL: ERROR - maintenance: svc:/network/puppetmasterd:default [00:21:54] ceradon: i may try, if it's not too complicated :P but what do you want to do specificly? [00:23:41] Alchimista: can you join ##ceradonbot please? [00:30:43] Sun Grid Engine execd on willow is UNKNOWN: CHECK_NRPE: Error receiving data from daemon. [00:31:44] Sun Grid Engine execd on willow is WARNING: NRPE: Unable to read output [00:32:24] SMF on willow is UNKNOWN: CHECK_NRPE: Received 0 bytes from daemon. Check the remote server logs for error messages. [00:32:34] /tmp on willow is CRITICAL: Connection refused by host [00:33:25] SMF on willow is CRITICAL: ERROR - maintenance: svc:/network/puppetmasterd:default [00:33:34] /tmp on willow is OK: DISK OK - free space: /tmp 379 MB (74% inode=99%): [00:37:24] SMF on willow is OK: OK - all services online [00:40:44] Sun Grid Engine execd on willow is UNKNOWN: CHECK_NRPE: Received 0 bytes from daemon. Check the remote server logs for error messages. [00:45:05] /sql on rosemary is CRITICAL: DISK CRITICAL - free space: /sql 38185 MB (3% inode=99%): [00:48:25] Environment on willow is UNKNOWN: CHECK_NRPE: Received 0 bytes from daemon. Check the remote server logs for error messages. [00:49:14] Environment on willow is CRITICAL: Connection refused by host [00:50:34] /tmp on willow is CRITICAL: Connection refused by host [00:50:34] / on willow is CRITICAL: Connection refused by host [00:50:44] Cluster on willow is CRITICAL: Connection refused by host [01:00:54] Load avg. on willow is CRITICAL: Connection refused by host [01:01:44] Sun Grid Engine execd on willow is CRITICAL: Connection refused by host [01:08:24] SMF on willow is CRITICAL: Connection refused by host [01:10:44] SMF on turnera is CRITICAL: ERROR - offline: svc:/system/cluster/scsymon-srv:default [01:12:04] SMF on damiana is CRITICAL: ERROR - offline: svc:/system/cluster/scsymon-srv:default [01:16:54] /sql on ptolemy is WARNING: DISK WARNING - free space: /sql 127021 MB (20% inode=99%): [01:18:44] hmm, phe@willow:~$ free [01:18:44] -bash: fork: Not enough space [01:20:41] better now [01:20:44] mzmcbride@willow:~$ free [01:20:44] -bash: free: command not found [01:21:29] "df -h" shows nothing crazy. [01:49:14] Environment on willow is CRITICAL: Connection refused by host [01:50:33] /tmp on willow is CRITICAL: Connection refused by host [01:50:34] / on willow is CRITICAL: Connection refused by host [01:50:44] Cluster on willow is CRITICAL: Connection refused by host [02:00:54] Load avg. on willow is CRITICAL: Connection refused by host [02:01:45] Sun Grid Engine execd on willow is CRITICAL: Connection refused by host [02:09:24] SMF on willow is CRITICAL: Connection refused by host [02:10:45] SMF on turnera is CRITICAL: ERROR - offline: svc:/system/cluster/scsymon-srv:default [02:12:05] SMF on damiana is CRITICAL: ERROR - offline: svc:/system/cluster/scsymon-srv:default [02:16:41] Is there a memory issue on willow? 
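The "-bash: fork: Not enough space" messages above are, on a Solaris host like willow, usually a sign that virtual memory (swap reservation) is exhausted rather than disk, which is why "df -h" shows nothing crazy. A minimal sketch of what one might check, assuming the standard Solaris swap and vmstat utilities are available on the host:

    # Swap reservation: fork() starts failing with "Not enough space" when "available" hits zero
    /usr/sbin/swap -s
    # Per-device swap allocation
    /usr/sbin/swap -l
    # Watch free memory and the page-scan rate for a while
    vmstat 5 3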
[02:16:54] /sql on ptolemy is WARNING: DISK WARNING - free space: /sql 123689 MB (20% inode=99%): [02:40:53] Anyone know why I am getting: [02:40:54] -bash: fork: Not enough space [02:44:34] RAID on adenia is CRITICAL: CHECK_NRPE: Socket timeout after 30 seconds. [02:49:14] Environment on willow is CRITICAL: Connection refused by host [02:49:34] RAID on adenia is OK: OK - TOTAL: 2: FAILED: 0: DEGRADED: 0 [02:50:34] /tmp on willow is CRITICAL: Connection refused by host [02:50:34] / on willow is CRITICAL: Connection refused by host [02:51:45] Cluster on willow is CRITICAL: Connection refused by host [02:54:24] SSH on z-dat-s7-a is CRITICAL: CRITICAL - Socket timeout after 10 seconds [02:54:25] SSH on z-dat-s6-a is CRITICAL: CRITICAL - Socket timeout after 10 seconds [02:54:34] RAID on hyacinth is CRITICAL: CHECK_NRPE: Socket timeout after 30 seconds. [02:55:14] RAID on hyacinth is OK: OK - TOTAL: 2: FAILED: 0: DEGRADED: 0 [02:55:15] SSH on z-dat-s7-a is OK: SSH OK - OpenSSH_5.8p2-hpn13v11 (protocol 2.0) [02:55:15] SSH on z-dat-s6-a is OK: SSH OK - OpenSSH_5.8p2-hpn13v11 (protocol 2.0) [03:00:54] Load avg. on willow is CRITICAL: Connection refused by host [03:01:44] Sun Grid Engine execd on willow is CRITICAL: Connection refused by host [03:04:35] RAID on adenia is CRITICAL: CHECK_NRPE: Socket timeout after 30 seconds. [03:09:25] SMF on willow is CRITICAL: Connection refused by host [03:10:44] SMF on turnera is CRITICAL: ERROR - offline: svc:/system/cluster/scsymon-srv:default [03:12:05] SMF on damiana is CRITICAL: ERROR - offline: svc:/system/cluster/scsymon-srv:default [03:17:53] /sql on ptolemy is WARNING: DISK WARNING - free space: /sql 117910 MB (19% inode=99%): [03:28:45] SSH on willow is CRITICAL: CRITICAL - Socket timeout after 10 seconds [03:29:34] SSH on willow is OK: SSH OK - OpenSSH_5.8p2-hpn13v11 (protocol 2.0) [03:33:09] I'm having serious issues on Willow. Does anyone know if it's having problems? [03:46:25] SSH on z-dat-s6-a is CRITICAL: CRITICAL - Socket timeout after 10 seconds [03:46:25] SSH on z-dat-s7-a is CRITICAL: CRITICAL - Socket timeout after 10 seconds [03:46:46] SSH on z-dat-s4-a is CRITICAL: CRITICAL - Socket timeout after 10 seconds [03:46:54] Load avg. on z-dat-s4-a is CRITICAL: CHECK_NRPE: Socket timeout after 30 seconds. [03:46:54] Load avg. on hyacinth is CRITICAL: CHECK_NRPE: Socket timeout after 30 seconds. [03:47:05] Load avg. on z-dat-s3-a is CRITICAL: CHECK_NRPE: Socket timeout after 30 seconds. [03:47:14] Load avg. on z-dat-s6-a is CRITICAL: CHECK_NRPE: Socket timeout after 30 seconds. [03:47:14] SMTP on hyacinth is CRITICAL: CRITICAL - Socket timeout after 10 seconds [03:47:14] SMTP on z-dat-s6-a is CRITICAL: CRITICAL - Socket timeout after 10 seconds [03:47:14] SMTP on z-dat-s7-a is CRITICAL: CRITICAL - Socket timeout after 10 seconds [03:47:15] SSH on z-dat-s3-a is CRITICAL: CRITICAL - Socket timeout after 10 seconds [03:47:15] SSH on hyacinth is CRITICAL: CRITICAL - Socket timeout after 10 seconds [03:47:15] Load avg. on z-dat-s7-a is CRITICAL: CHECK_NRPE: Socket timeout after 30 seconds. [03:47:25] Load avg. on hyacinth is OK: OK - load average: 0.43, 2.60, 3.37 [03:47:34] SMF on z-dat-s3-a is CRITICAL: CHECK_NRPE: Socket timeout after 30 seconds. [03:47:34] SMF on hyacinth is CRITICAL: CHECK_NRPE: Socket timeout after 30 seconds. [03:47:34] SMF on z-dat-s4-a is CRITICAL: CHECK_NRPE: Socket timeout after 30 seconds. [03:47:34] RAID on hyacinth is CRITICAL: CHECK_NRPE: Socket timeout after 30 seconds. [03:47:45] Load avg. 
on z-dat-s6-a is OK: OK - load average: 0.36, 2.50, 3.32 [03:47:46] Environment on hyacinth is CRITICAL: CHECK_NRPE: Socket timeout after 30 seconds. [03:47:51] uh, oh... I've never seen that many timeouts before. [03:48:06] MySQL on z-dat-s6-a is CRITICAL: (Service Check Timed Out) [03:48:06] MySQL slave on z-dat-s6-a is CRITICAL: (Service Check Timed Out) [03:48:06] Load avg. on z-dat-s7-a is OK: OK - load average: 0.26, 2.33, 3.25 [03:48:06] SMTP on hyacinth is OK: SMTP OK - 0.004 sec. response time [03:48:06] SMTP on z-dat-s6-a is OK: SMTP OK - 0.005 sec. response time [03:48:07] SMTP on z-dat-s7-a is OK: SMTP OK - 0.060 sec. response time [03:48:07] SMF on z-dat-s3-a is OK: OK - all services online [03:48:07] SMF on z-dat-s4-a is OK: OK - all services online [03:48:07] SMF on hyacinth is OK: OK - all services online [03:48:08] SSH on z-dat-s3-a is OK: SSH OK - OpenSSH_5.8p2-hpn13v11 (protocol 2.0) [03:48:08] SSH on hyacinth is OK: SSH OK - OpenSSH_5.8p2-hpn13v11 (protocol 2.0) [03:48:14] RAID on hyacinth is OK: OK - TOTAL: 2: FAILED: 0: DEGRADED: 0 [03:48:14] MySQL on z-dat-s6-a is OK: Uptime: 147566 Threads: 14 Questions: 21369877 Slow queries: 10696 Opens: 183048 Flush tables: 2 Open tables: 644 Queries per second avg: 144.815 [03:48:15] MySQL slave on z-dat-s6-a is OK: Uptime: 147566 Threads: 14 Questions: 21369878 Slow queries: 10696 Opens: 183048 Flush tables: 2 Open tables: 644 Queries per second avg: 144.815 Slave IO: Yes Slave SQL: Yes Seconds Behind Master: 196 [03:48:15] SSH on z-dat-s6-a is OK: SSH OK - OpenSSH_5.8p2-hpn13v11 (protocol 2.0) [03:48:15] SSH on z-dat-s7-a is OK: SSH OK - OpenSSH_5.8p2-hpn13v11 (protocol 2.0) [03:48:24] Environment on hyacinth is OK: ok: temperature ok fan ok voltage ok chassis ok [03:48:24] Load avg. on z-dat-s4-a is OK: OK - load average: 0.98, 2.34, 3.23 [03:48:34] Load avg. on z-dat-s3-a is OK: OK - load average: 1.10, 2.34, 3.22 [03:48:34] SSH on z-dat-s4-a is OK: SSH OK - OpenSSH_5.8p2-hpn13v11 (protocol 2.0) [03:50:14] Environment on willow is CRITICAL: Connection refused by host [03:50:34] /tmp on willow is CRITICAL: Connection refused by host [03:50:34] / on willow is CRITICAL: Connection refused by host [03:52:44] Cluster on willow is CRITICAL: Connection refused by host [03:53:36] TParis, I am too. [03:54:01] TParis: Me too... :( [03:54:33] oh good, it's not just me then [03:54:42] Running dir returns -bash: fork: Not enough space [03:54:59] I normally use nightshade, but since that was destroyed a few days ago, I temporarily switched to willow. Now, I have nowhere to go to. :P [03:55:38] Are any of the Toolserver ops people on? I don't see DaB. or nosy [03:55:38] * Seahorse is trying to start CVN Bots but they get killed as soon as they connect [03:56:06] It is possible that everyone who was on nightshade, has now switched to willow and it just can't cope [03:56:15] or, we can just blame all the damn interwiki bots [03:56:20] So, in other words... we can't log in... [03:56:28] I can login [03:56:33] but ever other command doesnt work [03:56:36] I'm logged in atm [03:56:38] yeah, i'm logged in [03:56:42] but correct, commands are failing [03:56:43] Me too... [03:56:58] CD and dir work for me now [03:57:06] does anyone even know what actually happened to nightshade or why it isn't fixed? [03:57:23] it ended with all of my bots crashing and nobody telling me why. [03:57:27] Seahorse: It got wiped, as far as I can tell. 
[03:57:33] interestingly, "ls" worked for me, but typing "ls -l" is freezing it [03:57:35] dir is broken on my root [03:57:45] LOL, I can do ps all [03:57:47] ls: memory exhausted [03:57:49] ls -a is giving me about 90% failure [03:58:05] Is cron working for y'all? [03:58:16] perfect time to use my vps [03:58:41] I will be majorly annoyed if it's just someone's tool stuck in an infinite loop [03:58:54] Magog_the_Ogre: Me too... [03:59:11] are you sure nightshade got wiped btw? [03:59:19] apparently there is a way to check how much memory every user's processes are using. [03:59:21] because I'm going to set up my cron jobs again, and I don't want my tool running twice [03:59:27] or get the top offenders [03:59:41] seahorse: wouldn't that require root access? [03:59:54] no, someone did this before [04:00:02] I just forget what it was called [04:00:03] Magog_the_Ogre: From what I can tell, they're going to restore it. [04:00:21] does anyone know what was wrong with it? [04:00:46] It stopped responding [04:00:54] Load avg. on willow is CRITICAL: Connection refused by host [04:01:08] tsnag: Thank you, captain obvious [04:01:15] lol [04:01:33] :) [04:01:37] oh tsnag, if only you could tell the ops, and not us who already know. [04:01:44] Sun Grid Engine execd on willow is CRITICAL: Connection refused by host [04:02:20] All the ops are probably asleep... [04:03:44] someone should probably mail the mailing list [04:03:45] Well I guess that means it's bed time for me too then [04:03:48] they don't have them spread out in different time zones? [04:04:13] Seahorse, as I understand it, there are effectively only 2 active ops atm [04:04:36] I've got the mailing list... [04:04:47] that is rather sad [04:05:50] Seahorse: Agreed :( [04:08:04] what did tsnag say? lol [04:08:08] I have him blacklisted [04:08:11] toolserver-l@wikimedia.org , right? [04:08:51] NVM found it. [04:09:00] Magog_the_Ogre: Load avg. on willow is CRITICAL: Connection refused by host [04:09:25] SMF on willow is CRITICAL: Connection refused by host [04:09:44] Magog_the_Ogre: SMF on willow is CRITICAL: Connection refused by host [04:09:56] what's smf? [04:10:45] SMF on turnera is CRITICAL: ERROR - offline: svc:/system/cluster/scsymon-srv:default [04:10:58] Not really sure... [04:11:19] right now i wish I knew more about how to use PS [04:11:23] I'm going to google it [04:11:24] Magog_the_Ogre, Service management facility [04:11:51] gotta love google http://publib.boulder.ibm.com/infocenter/pseries/v5r3/index.jsp?topic=/com.ibm.aix.prftungd/doc/prftungd/mem_usage_determine_ps.htm [04:12:04] SMF on damiana is CRITICAL: ERROR - offline: svc:/system/cluster/scsymon-srv:default [04:12:38] unfortunately, the page doesn't actually tell you how to tod it [04:12:56] ps -u username is what I do [04:15:20] 29414 pts/55 3627:34 java [04:15:52] use top [04:15:58] what? [04:16:03] it's better at seeing over memory usage [04:16:15] memory/cpu [04:17:09] Mail sent. [04:17:45] stanleku looks like he's eating up a lot [04:17:48] is that normal? [04:17:54] at least a lot of CPU [04:17:55] /sql on ptolemy is WARNING: DISK WARNING - free space: /sql 113928 MB (18% inode=99%): [04:18:04] Matthew Bowker * [Toolserver-l] Willow - Intermittent problems [04:18:19] reba: Thanks :) [04:26:35] how much memory does willow have? [04:26:48] I don't get what's with all these users plugging up 500M of memory [04:41:39] is it possible to do SFTP on any of the servers? or is it only willow/nightshade? 
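Several people above are looking for a way to see how much memory each user's processes are using. A rough sketch with standard Solaris tools, assuming prstat and a POSIX ps are available on willow (the username below is only a placeholder):

    # One-shot per-user summary of process count, total size and RSS
    prstat -t 5 1
    # Largest processes for a single user, sorted by virtual size
    ps -u some_user -o pid,vsz,rss,time,args | sort -k2 -n -r | head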
[04:43:13] mzmcbride@willow:~$ w [04:43:13] 4:42am up 14 day(s), 16:05, 20 users, load average: 51.87, 51.28, 50.28 [04:43:14] Only Willow/Nightshade. I'm pretty sure. Don't quote me on that, though (cuz I think only Willow and Nightshade have the key system set up) [04:43:27] /home is shared. [04:43:43] You can definitely SSH; I assume SFTP also works. [04:43:49] But you're not allowed to run scripts on the other hosts. [04:44:20] So much python in top. [04:44:51] Doesn't seem like a rogue script or anything. Just too much load. [04:50:14] Environment on willow is CRITICAL: Connection refused by host [04:50:35] /tmp on willow is CRITICAL: Connection refused by host [04:50:35] / on willow is CRITICAL: Connection refused by host [04:52:45] Cluster on willow is CRITICAL: Connection refused by host [04:54:06] SMTP on willow is CRITICAL: Connection refused [04:54:34] Meh, willow's having problems again... [05:00:53] Load avg. on willow is CRITICAL: Connection refused by host [05:01:45] Sun Grid Engine execd on willow is CRITICAL: Connection refused by host [05:08:01] you sure? [05:08:08] nemobis is looking pretty bad atm [05:08:38] magog@willow:~/public_html$ w [05:08:38] 5:07am up 14 day(s), 16:30, 20 users, load average: 49.09, 50.22, 50.16 [05:08:38] User tty login@ idle JCPU PCPU what [05:08:38] nemobis pts/3 23Feb12 9days 5392:39 5392:39 python dumpgenerator.py --index= [05:08:38] danny_b pts/34 18Feb12 5:45 11:46 11:46 irssi [05:08:39] danny_b pts/57 Thu11pm 5:45 16 16 screen -DR irssi [05:08:41] magog pts/63 8:14pm 1 w [05:08:43] danny_b pts/94 Fri 8am 5:48 2 2 bash -rcfile .bashrc [05:08:47] jem pts/106 24Feb12 9:13 2 2 -bash [05:08:49] jem pts/112 24Feb12 7:43 45 4 -bash [05:08:51] dungodun pts/47 Mon 7am 3days 2 2 screen -r [05:08:53] schutz pts/160 Thu 4pm 2days -bash [05:08:55] alchimis pts/150 11:42pm 1:18 2 2 top -U alchimista [05:08:57] nemobis pts/170 11:40pm 5:03 47 47 -bash [05:08:59] stwalker pts/90 Sun 9pm 41 3 2 -bash [05:09:03] nettrom pts/174 Thu 6pm 2:27 1 -bash [05:09:05] dcoetzee pts/172 3:03am 1:22 3 -bash [05:09:09] mzmcbrid pts/187 1:20am 22 -bash [05:09:10] matthewr pts/189 3:53am 13 -bash [05:09:12] chris pts/29 3:46am 44 2 -bash [05:09:14] chzz pts/190 Fri 6am 21:11 26 2 -bash [05:09:18] seahorse pts/196 3:23am 1:19 1 -bash [05:09:20] whym pts/195 11:39am 13:31 tmux attach [05:09:22] uw_trans pts/198 12:56am 2:37 1 -bash [05:09:25] SMF on willow is CRITICAL: Connection refused by host [05:09:44] Whoa [05:10:44] SMF on turnera is CRITICAL: ERROR - offline: svc:/system/cluster/scsymon-srv:default [05:10:51] might we finally have a culprit, I wonder? [05:11:35] We may... [05:12:05] SMF on damiana is CRITICAL: ERROR - offline: svc:/system/cluster/scsymon-srv:default [05:17:27] so his login has been idle for 9 days [05:17:38] and it's been stuck in dumpgenerator.py and infinite looping the entire time [05:17:54] /sql on ptolemy is WARNING: DISK WARNING - free space: /sql 106554 MB (17% inode=99%): [05:18:35] Bah, that's it then. [05:18:45] well [05:19:42] we just need someone to say sudo kill 13912 [05:20:09] unfortunately, I no speaka the sudo language [05:21:53] also, I'm assuming it's that process; I suppose it's possible he just has a memory intensive process that isn't eating up the RAM [05:21:56] can't me skeptical though [05:21:59] anyway [05:22:03] it's getting late [05:22:27] I hope that email will have the ops look into it. 
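Before asking an op to kill the PID spotted in the "w" output above, it is worth confirming what that process is actually doing and how much it has consumed; 13912 below is simply the number guessed at in the discussion, not a verified culprit. A sketch using standard tools:

    # What is this PID, who owns it, and how much CPU/memory has it used?
    ps -o pid,user,time,pcpu,vsz,args -p 13912
    # Live CPU and memory for just that PID (Solaris)
    prstat -p 13912 5 1
    # Only the owner or an op can then stop it; plain TERM first, -9 only as a last resort
    kill 13912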
[05:50:14] Environment on willow is CRITICAL: Connection refused by host [05:50:34] /tmp on willow is CRITICAL: Connection refused by host [05:51:33] / on willow is CRITICAL: Connection refused by host [05:53:44] Cluster on willow is CRITICAL: Connection refused by host [05:54:05] SMTP on willow is CRITICAL: Connection refused [06:00:54] Load avg. on willow is CRITICAL: Connection refused by host [06:02:45] Sun Grid Engine execd on willow is CRITICAL: Connection refused by host [06:07:29] oh men [06:10:24] SMF on willow is CRITICAL: Connection refused by host [06:11:45] SMF on turnera is CRITICAL: ERROR - offline: svc:/system/cluster/scsymon-srv:default [06:12:06] SMF on damiana is CRITICAL: ERROR - offline: svc:/system/cluster/scsymon-srv:default [06:17:54] /sql on ptolemy is WARNING: DISK WARNING - free space: /sql 99413 MB (16% inode=99%): [06:50:14] Environment on willow is CRITICAL: Connection refused by host [06:50:34] /tmp on willow is CRITICAL: Connection refused by host [06:51:34] / on willow is CRITICAL: Connection refused by host [06:53:45] Cluster on willow is CRITICAL: Connection refused by host [06:54:05] SMTP on willow is CRITICAL: Connection refused [06:54:45] SSH on willow is CRITICAL: CRITICAL - Socket timeout after 10 seconds [06:55:34] SSH on willow is OK: SSH OK - OpenSSH_5.8p2-hpn13v11 (protocol 2.0) [07:00:54] Load avg. on willow is CRITICAL: Connection refused by host [07:02:44] Sun Grid Engine execd on willow is CRITICAL: Connection refused by host [07:09:34] RAID on adenia is CRITICAL: CHECK_NRPE: Socket timeout after 30 seconds. [07:11:24] SMF on willow is CRITICAL: Connection refused by host [07:11:44] SMF on turnera is CRITICAL: ERROR - offline: svc:/system/cluster/scsymon-srv:default [07:13:05] SMF on damiana is CRITICAL: ERROR - offline: svc:/system/cluster/scsymon-srv:default [07:18:54] /sql on ptolemy is WARNING: DISK WARNING - free space: /sql 96255 MB (15% inode=99%): [07:23:04] Sun Grid Engine execd on ortelius is WARNING: short@ortelius exceedes load threshold: alarm hl:np_load_short=1.191406/1.00, alarm hl:np_load_long=0.724610/1.50, alarm hl:mem_free=21291.000000M/300M, alarm hl:available=1/0: all.q@ortelius exceedes load threshold: alarm hl:np_load_short=1.191406/1.10, alarm hl:np_load_long=0.724610/1.50, alarm hl:mem_free=21291.000000M/300M, alarm hl:available=1/0 [07:24:05] Sun Grid Engine execd on ortelius is OK: short@ortelius OK: all.q@ortelius OK [07:31:45] SSH on willow is CRITICAL: CRITICAL - Socket timeout after 10 seconds [07:32:34] SSH on willow is OK: SSH OK - OpenSSH_5.8p2-hpn13v11 (protocol 2.0) [07:34:14] RAID on adenia is OK: OK - TOTAL: 2: FAILED: 0: DEGRADED: 0 [07:37:53] /sql on ptolemy is OK: DISK OK - free space: /sql 154222 MB (25% inode=99%): [07:50:14] Environment on willow is CRITICAL: Connection refused by host [07:50:34] /tmp on willow is CRITICAL: Connection refused by host [07:51:34] / on willow is CRITICAL: Connection refused by host [07:54:44] Cluster on willow is CRITICAL: Connection refused by host [07:55:05] SMTP on willow is CRITICAL: Connection refused [08:00:54] Load avg. 
on willow is CRITICAL: Connection refused by host [08:03:44] Sun Grid Engine execd on willow is CRITICAL: Connection refused by host [08:11:24] SMF on willow is CRITICAL: Connection refused by host [08:12:43] SMF on turnera is CRITICAL: ERROR - offline: svc:/system/cluster/scsymon-srv:default [08:13:06] SMF on damiana is CRITICAL: ERROR - offline: svc:/system/cluster/scsymon-srv:default [08:37:05] NTP on willow is CRITICAL: CRITICAL - Socket timeout after 10 seconds [08:38:54] NTP on willow is OK: NTP OK: Offset 0.000717 secs [08:49:05] NTP on willow is CRITICAL: CRITICAL - Socket timeout after 10 seconds [08:50:14] Environment on willow is CRITICAL: Connection refused by host [08:50:34] /tmp on willow is CRITICAL: Connection refused by host [08:51:34] / on willow is CRITICAL: Connection refused by host [08:54:44] Cluster on willow is CRITICAL: Connection refused by host [08:55:06] SMTP on willow is CRITICAL: Connection refused [09:00:54] Load avg. on willow is CRITICAL: Connection refused by host [09:03:45] Sun Grid Engine execd on willow is CRITICAL: Connection refused by host [09:11:24] SMF on willow is CRITICAL: Connection refused by host [09:12:44] SMF on turnera is CRITICAL: ERROR - offline: svc:/system/cluster/scsymon-srv:default [09:14:05] SMF on damiana is CRITICAL: ERROR - offline: svc:/system/cluster/scsymon-srv:default [09:17:53] 3(created) [MNT-1209] Save some space in /sql on ptolemy by re-clustering tables; Maintenance: ptolemy; Minor Minor work <10https://jira.toolserver.org/browse/MNT-1209> (Kai Krueger) [09:17:55] 3(commented) [MNT-1209] Save some space in /sql on ptolemy by re-clustering tables <10https://jira.toolserver.org/browse/MNT-1209> (Kai Krueger) [09:50:14] Environment on willow is CRITICAL: Connection refused by host [09:50:34] /tmp on willow is CRITICAL: Connection refused by host [09:51:34] / on willow is CRITICAL: Connection refused by host [09:55:44] Cluster on willow is CRITICAL: Connection refused by host [09:56:05] SMTP on willow is CRITICAL: Connection refused [10:00:54] Load avg. on willow is CRITICAL: Connection refused by host [10:04:43] Sun Grid Engine execd on willow is CRITICAL: Connection refused by host [10:07:05] NTP on willow is CRITICAL: CRITICAL - Socket timeout after 10 seconds [10:08:54] NTP on willow is OK: NTP OK: Offset -0.000959 secs [10:11:25] SMF on willow is CRITICAL: Connection refused by host [10:11:53] 3(created) [TS-1321] No updates for plwiki_p since 2012-02-28; Toolserver; Bug <10https://jira.toolserver.org/browse/TS-1321> (Maciej Jaros) [10:12:07] Dr. Trigon * Re: [Toolserver-l] Willow - Intermittent problems [10:12:45] SMF on turnera is CRITICAL: ERROR - offline: svc:/system/cluster/scsymon-srv:default [10:14:04] SMF on damiana is CRITICAL: ERROR - offline: svc:/system/cluster/scsymon-srv:default [10:15:05] SMTP on willow is OK: SMTP OK - 0.163 sec. response time [10:15:11] @replag [10:15:11] chicocvenancio: s1-rr-a-c: 2m 46s [-0.01 s/s]; s2-user: 4d 3h 52m 6s [+0.07 s/s]; s3-rr-a: 29s [+0.00 s/s]; s3-user: 29s [+0.00 s/s] [10:15:49] 3(commented) [TS-1309] daphne s2/s5 corrupted and offline <10https://jira.toolserver.org/browse/TS-1309> (Maciej Jaros) [10:16:34] /tmp on willow is OK: DISK OK - free space: /tmp 372 MB (72% inode=99%): [10:16:34] / on willow is WARNING: DISK WARNING - free space: / 22943 MB (20% inode=99%): [10:16:44] Cluster on willow is OK: CLUSTER OK ! 
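The "@replag" bot output above and the "s4 replag" Nagios checks later in the log both report lag in seconds; the checks do it by calling a ts_rc_age() helper on the replica. A sketch of asking a replica directly, assuming that helper is reachable from your session; the host alias below is hypothetical and would need to be replaced with the replica you actually use:

    # Seconds since the newest recentchanges row reached this replica
    # ("sql-s2-user" is a made-up alias; substitute the real replica host)
    mysql -h sql-s2-user -e 'SELECT ts_rc_age();'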
[10:16:44] Sun Grid Engine execd on willow is WARNING: NRPE: Unable to read output [10:17:14] Environment on willow is OK: ok: temperature ok fan ok voltage ok chassis ok [10:18:34] / on willow is OK: DISK OK - free space: / 24468 MB (22% inode=99%): [10:25:52] 3(commented) [TS-1321] No updates for plwiki_p since 2012-02-28 <10https://jira.toolserver.org/browse/TS-1321> (Maciej Jaros) [10:27:53] 3(assigned) [TS-1317] over-quota warn-mail missing / does not work <10https://jira.toolserver.org/browse/TS-1317> (drtrigon) [10:27:59] 3(created) [MNT-1210] Willow run havoc; Maintenance; Emergency work <10https://jira.toolserver.org/browse/MNT-1210> (Marlen Caemmerer) [10:29:58] 3(commented) [ACCAPP-458] Run bots in ptwiki <10https://jira.toolserver.org/browse/ACCAPP-458> (Francisco Carvalho Venancio) [10:44:20] hey [10:44:40] someone with mediawiki database knowledge online? [10:44:57] 3(commented) [TS-1321] No updates for plwiki_p since 2012-02-28 <10https://jira.toolserver.org/browse/TS-1321> (Marlen Caemmerer) [10:45:09] is it possible they added a field called job_timestamp quite recently? [10:45:31] yes, from 1.19 iirc [10:47:57] ALTER TABLE /*_*/job ADD COLUMN job_timestamp varbinary(14) NULL default NULL; [10:47:57] CREATE INDEX /*i*/job_timestamp ON /*_*/job(job_timestamp); [10:48:52] 3(commented) [TS-1321] No updates for plwiki_p since 2012-02-28 <10https://jira.toolserver.org/browse/TS-1321> (Piotr Gackowski) [10:49:02] thx [10:49:11] do you know when they changed it? [10:52:50] nosy, added around start of january, I guess it was applied to the db before switching to mw 1.19 [10:52:58] https://bugzilla.wikimedia.org/show_bug.cgi?id=27724 [10:53:08] phe: thx [10:54:00] is the fork ressource exhausted solved ? [10:58:59] what do you mean? [11:00:13] it was reported on the mail list and I got it tonight, things ala [11:00:17] $ls [11:00:19] -bash: fork: Not enough space [11:00:54] Load avg. on willow is CRITICAL: CRITICAL - load average: 33.62, 30.46, 29.88 [11:02:10] where? [11:02:12] willow? [11:02:17] thats solved [11:02:18] yeps [11:02:21] i think [11:12:24] SMF on willow is CRITICAL: ERROR - maintenance: svc:/network/puppetmasterd:default [11:13:44] SMF on turnera is CRITICAL: ERROR - offline: svc:/system/cluster/scsymon-srv:default [11:14:14] SMF on damiana is CRITICAL: ERROR - offline: svc:/system/cluster/scsymon-srv:default [11:17:44] Sun Grid Engine execd on willow is WARNING: NRPE: Unable to read output [11:25:54] Load avg. on willow is WARNING: WARNING - load average: 12.11, 13.41, 19.99 [11:27:54] 3(commented) [TS-1321] No updates for plwiki_p since 2012-02-28 <10https://jira.toolserver.org/browse/TS-1321> (Marlen Caemmerer) [11:32:57] 3(commented) [TS-1321] No updates for plwiki_p since 2012-02-28 <10https://jira.toolserver.org/browse/TS-1321> (Marlen Caemmerer) [11:39:54] Load avg. 
on willow is OK: OK - load average: 11.86, 11.57, 14.97 [11:46:23] SSH on z-dat-s7-a is CRITICAL: CRITICAL - Socket timeout after 10 seconds [11:47:14] SSH on z-dat-s7-a is OK: SSH OK - OpenSSH_5.8p2-hpn13v11 (protocol 2.0) [11:57:56] 3(commented) [TS-1321] No updates for plwiki_p since 2012-02-28 <10https://jira.toolserver.org/browse/TS-1321> (Maciej Jaros) [12:12:34] SMF on willow is CRITICAL: ERROR - maintenance: svc:/network/puppetmasterd:default [12:13:44] SMF on turnera is CRITICAL: ERROR - offline: svc:/system/cluster/scsymon-srv:default [12:14:14] SMF on damiana is CRITICAL: ERROR - offline: svc:/system/cluster/scsymon-srv:default [12:17:44] Sun Grid Engine execd on willow is WARNING: NRPE: Unable to read output [12:30:04] /sql on thyme is WARNING: DISK WARNING - free space: /sql 191874 MB (19% inode=99%): [12:40:52] 3(created) [TS-1322] Add siebrand and valhallasw to the nlwikibots mmp; Toolserver; Task <10https://jira.toolserver.org/browse/TS-1322> (Merlijn van Deen) [12:42:53] 3(updated) [TS-1322] Add siebrand and valhallasw to the nlwikibots mmp <10https://jira.toolserver.org/browse/TS-1322> (Merlijn van Deen) [12:44:52] 3(commented) [TS-1322] Add siebrand and valhallasw to the nlwikibots mmp <10https://jira.toolserver.org/browse/TS-1322> (Andre Koopal) [12:45:11] cool [12:45:14] /sql on rosemary is CRITICAL: DISK CRITICAL - free space: /sql 38022 MB (3% inode=99%): [12:46:55] 3(commented) [TS-1322] Add siebrand and valhallasw to the nlwikibots mmp <10https://jira.toolserver.org/browse/TS-1322> (Multichill) [12:51:01] [[Special:Log/newusers]] create 10 * Accepto * (New user account) [13:05:45] [[Job scheduling]] 10https://wiki.toolserver.org/w/index.php?diff=6737&oldid=6722&rcid=8900 * Merlissimo * (-1211) (update to new config) [13:10:09] [[Job scheduling]] M 10https://wiki.toolserver.org/w/index.php?diff=6738&oldid=6737&rcid=8901 * Merlissimo * (+92) () [13:12:44] SMF on willow is CRITICAL: ERROR - maintenance: svc:/network/puppetmasterd:default [13:13:02] [[Job scheduling]] M 10https://wiki.toolserver.org/w/index.php?diff=6739&oldid=6738&rcid=8902 * Merlissimo * (+0) (/* optional resources */ ) [13:13:44] SMF on turnera is CRITICAL: ERROR - offline: svc:/system/cluster/scsymon-srv:default [13:14:14] SMF on damiana is CRITICAL: ERROR - offline: svc:/system/cluster/scsymon-srv:default [13:15:04] Platonides * Re: [Toolserver-l] Willow - Intermittent problems [13:17:44] Sun Grid Engine execd on willow is WARNING: NRPE: Unable to read output [13:19:03] Platonides * Re: [Toolserver-l] MMP interwiki-bot [13:23:53] 3(assigned) [TS-1321] No updates for plwiki_p since 2012-02-28 <10https://jira.toolserver.org/browse/TS-1321> (Marlen Caemmerer) [13:32:49] [[Job scheduling]] 10https://wiki.toolserver.org/w/index.php?diff=6740&oldid=6739&rcid=8903 * Merlissimo * (+485) () [13:47:04] /sql on thyme is OK: DISK OK - free space: /sql 217475 MB (22% inode=99%): [13:55:04] Dr. Trigon * Re: [Toolserver-l] Willow - Intermittent problems [13:56:25] [[Job scheduling]] 10https://wiki.toolserver.org/w/index.php?diff=6741&oldid=6740&rcid=8904 * Merlissimo * (+300) (/* arguments to qsub/qcronsub */ +) [14:02:55] Load avg. on willow is WARNING: WARNING - load average: 15.50, 14.49, 13.88 [14:03:55] Load avg. 
on willow is OK: OK - load average: 14.35, 14.41, 13.89 [14:08:24] /sql on z-dat-s4-a is WARNING: DISK WARNING - free space: /sql 26488 MB (6% inode=99%): [14:09:24] /sql on z-dat-s4-a is CRITICAL: DISK CRITICAL - free space: /sql 20231 MB (4% inode=99%): [14:12:44] SMF on willow is CRITICAL: ERROR - maintenance: svc:/network/puppetmasterd:default [14:13:24] /sql on z-dat-s4-a is WARNING: DISK WARNING - free space: /sql 25130 MB (6% inode=99%): [14:13:54] SMF on turnera is CRITICAL: ERROR - offline: svc:/system/cluster/scsymon-srv:default [14:14:24] SMF on damiana is CRITICAL: ERROR - offline: svc:/system/cluster/scsymon-srv:default [14:15:09] Any admin around? [14:15:21] Fairly urgent issue :) [14:15:35] :/ [14:17:55] Sun Grid Engine execd on willow is WARNING: NRPE: Unable to read output [14:22:56] 3(commented) [MNT-1207] Import commonswiki on daphne <10https://jira.toolserver.org/browse/MNT-1207> (DaB.) [14:28:32] Snowolf I'm not an admin, but maybe I can help. Whats the issue? [14:28:54] Sebaso_WMDE: I'll query you [14:29:08] ok [14:29:59] [[Job scheduling]] 10https://wiki.toolserver.org/w/index.php?diff=6742&oldid=6741&rcid=8905 * Merlissimo * (+2335) (/* Examples */ ) [14:33:56] 3(resolved) [MNT-1207] Import commonswiki on daphne <10https://jira.toolserver.org/browse/MNT-1207> (DaB.) [14:38:00] [[Job scheduling]] 10https://wiki.toolserver.org/w/index.php?diff=6743&oldid=6742&rcid=8906 * Merlissimo * (+569) (/* optional resources */ ) [14:40:52] 3(created) [MNT-1211] Corrected ceradon's cron on willow; Maintenance; Emergency work <10https://jira.toolserver.org/browse/MNT-1211> (DaB.) [14:40:55] 3(updated) [MNT-1211] Corrected ceradon's cron on willow <10https://jira.toolserver.org/browse/MNT-1211> (DaB.) [14:41:29] Afternoon DaBPunkt :) [14:42:02] *wave* bye [14:43:09] Thanks DaBPunkt :) [14:44:56] 3(updated) [MNT-1212] Changed permission of /home/soxred93/Peachy/Configs/ to 600 <10https://jira.toolserver.org/browse/MNT-1212> (DaB.) [14:44:59] 3(created) [MNT-1212] Changed permission of /home/soxred93/Peachy/Configs/ to 600; Maintenance; Emergency work <10https://jira.toolserver.org/browse/MNT-1212> (DaB.) [14:48:33] It's my wikipedia-free-saturday… [14:48:49] @replag [14:48:50] Akoopal: s1-rr-a-c: 4m 34s [-]; s2-user: 3d 15h 38m 45s [-]; s4-rr-a: error [14:49:03] DaBPunkt: then what are you doing here :-) [14:49:19] DaBPunkt same here ;) [14:49:33] Akoopal: fixing stuff for Sebaso_WMDE ;/ [14:49:52] so the wikipedia-free was a fail :-) [14:50:02] DaBPunkt sorry ;( [14:50:17] DaBPunkt, Just run away now ;p [14:50:31] @replag [14:50:31] DaBPunkt: s1-rr-a-c: 5m 50s [-]; s2-user: 3d 15h 33m 39s [-]; s3-rr-a: 1m 25s [-]; s3-user: 1m 25s [-] [14:51:08] Sebaso_WMDE: s2 was moved to cassia some weeks ago. nosy fixed the replication-problem on it tonight as far as I ave read [14:51:41] @replag [14:51:41] DaBPunkt: s1-rr-a-c: 6m 22s [+0.46 s/s]; s2-user: 3d 15h 29m 40s [-3.42 s/s] [14:52:23] @replag [14:52:24] DaBPunkt: s1-rr-a-c: 6m 41s [+0.45 s/s]; s2-user: 3d 15h 27m 38s [-2.87 s/s]; s3-rr-a: 50s [-0.31 s/s]; s3-user: 50s [-0.31 s/s] [14:53:01] cool! s2 is comming back to earth [14:53:38] chicocvenancio: yes, but slow [14:54:51] I can almost see all that happened in February [14:58:14] DaBPunkt yep, I know. 
I just hope there could be an s2 with a little bit less replag on daphne somehow :) [14:59:23] Sebaso_WMDE: if there were an s2 on daphne, its replag would be even greater [14:59:33] but there is none [14:59:34] /sql on z-dat-s4-a is WARNING: DISK WARNING - free space: /sql 36576 MB (8% inode=99%): [15:00:35] The strange thing is that s2 DID replicate in the last days (otherwise we would have discovered the problem earlier) [15:03:34] /sql on z-dat-s4-a is CRITICAL: DISK CRITICAL - free space: /sql 22322 MB (5% inode=99%): [15:03:58] 14GB in 3 Minutes??? [15:04:50] DaBPunkt http://toolserver.org/~bryan/stats/replag/?cluster=s2 says it didn't - (but trainwreck says it did. strange) [15:05:04] DaBPunkt: just curious, what was the replication problem? [15:05:34] /sql on z-dat-s4-a is OK: DISK OK - free space: /sql 134057 MB (33% inode=99%): [15:06:15] akoopal https://jira.toolserver.org/browse/TS-1321 [15:07:35] it's BTW a bug in trainwreck, because we do not replicate the job-table normally… [15:08:15] ok [15:09:06] Sebaso_WMDE: I checked the replag here in the channel over the last days from time to time: and it was always going down (slowly, but going) [15:09:29] DaBPunkt: it had been going up [15:09:51] but less than 1s/s, so there still was replication, just not quick enough [15:09:57] see the graphs also [15:10:11] [Saturday 03 March 2012] [00:02:35] @replag [15:10:12] [Saturday 03 March 2012] [00:02:35] Akoopal: s2-user: 4d 3h 2m 38s [-1.03 s/s]; s3-rr-a: 18s [-0.00 s/s]; s3-user: 18s [-0.00 s/s]; s6-rr-a: 14s [+0.00 s/s]; s6-user: 14s [+0.00 s/s] [15:10:36] should I post more examples? [15:10:39] yeah, then it was good [15:11:03] DaBPunkt: check http://bit.ly/a7F74v [15:11:42] Akoopal: and? [15:12:11] the graphs show that the replag was increasing between 28.feb and 2.mar. [15:12:22] you see it first going down, when s2 was started, then after a while going up, and probably when nosy fixed this going down again [15:12:33] But that's exactly the timespan I was checking here in the channel [15:12:43] SMF on willow is CRITICAL: ERROR - maintenance: svc:/network/puppetmasterd:default [15:12:50] ahhh [15:13:10] nosy fixed the problem tonight, that's the 3rd of March [15:13:34] (there was a little bit of an increase tonight too, yes) [15:13:50] but I am wondering with you why it first worked, and then seemed to have been broken [15:13:54] SMF on turnera is CRITICAL: ERROR - offline: svc:/system/cluster/scsymon-srv:default [15:14:24] SMF on damiana is CRITICAL: ERROR - offline: svc:/system/cluster/scsymon-srv:default [15:17:55] Sun Grid Engine execd on willow is WARNING: NRPE: Unable to read output [15:19:30] no nosy [15:19:43] did replication get fixed? she was asking about it in #-tech [15:20:07] jeremyb: yes [15:20:12] k [15:20:51] btw, who will be in berlin in nearly a month? will there be another toolserver funding/gov session ? [15:21:07] i don't even know who was in the session last year [15:21:24] I thought the berlin-meeting is in june? [15:23:10] wmcon is end of march, hackathon is in june [15:23:12] DaBPunkt: there are 2 [15:24:05] Sebaso_WMDE: I already mailed Johanis: The hackathon is at the same time as the admin-meeting of dewp. Can one of them please move to another date? [15:35:03] re [15:35:39] I will leave now for an hour, cu [15:35:51] DaBPunkt will talk to johannes. 
cu [15:44:24] Sun Grid Engine execd on ortelius is WARNING: short@ortelius exceedes load threshold: alarm hl:np_load_short=1.291016/1.00, alarm hl:np_load_long=0.803711/1.50, alarm hl:mem_free=20776.000000M/300M, alarm hl:available=1/0: all.q@ortelius exceedes load threshold: alarm hl:np_load_short=1.291016/1.10, alarm hl:np_load_long=0.803711/1.50, alarm hl:mem_free=20776.000000M/300M, alarm hl:available=1/0 [15:45:25] Sun Grid Engine execd on ortelius is OK: short@ortelius OK: all.q@ortelius OK [16:09:55] [[Special:Log/newusers]] create 10 * Heubergen * (New user account) [16:12:44] SMF on willow is CRITICAL: ERROR - maintenance: svc:/network/puppetmasterd:default [16:13:54] SMF on turnera is CRITICAL: ERROR - offline: svc:/system/cluster/scsymon-srv:default [16:14:24] SMF on damiana is CRITICAL: ERROR - offline: svc:/system/cluster/scsymon-srv:default [16:18:05] Sun Grid Engine execd on willow is WARNING: NRPE: Unable to read output [16:24:52] 3(created) [UTRS-58] Clear out beta test data in preparation for launch; UTRS; Planned work - no user impact <10https://jira.toolserver.org/browse/UTRS-58> (Hersfold) [16:44:23] Aww man, this is annoying, last night my access to the toolserver was fine and now I can't log in, and the really weird thing is I didn't change anything. [16:46:35] /sql on z-dat-s4-a is WARNING: DISK WARNING - free space: /sql 37052 MB (9% inode=99%): [16:50:34] /sql on z-dat-s4-a is OK: DISK OK - free space: /sql 53006 MB (13% inode=99%): [16:54:34] /sql on z-dat-s4-a is WARNING: DISK WARNING - free space: /sql 37313 MB (9% inode=99%): [17:03:05] Load avg. on willow is WARNING: WARNING - load average: 15.49, 14.89, 12.81 [17:03:35] Sun Grid Engine execd on ortelius is WARNING: short@ortelius exceedes load threshold: alarm hl:np_load_short=1.259766/1.00, alarm hl:np_load_long=0.783203/1.50, alarm hl:mem_free=21565.000000M/300M, alarm hl:available=1/0: all.q@ortelius exceedes load threshold: alarm hl:np_load_short=1.259766/1.10, alarm hl:np_load_long=0.783203/1.50, alarm hl:mem_free=21565.000000M/300M, alarm hl:available=1/0 [17:04:05] Load avg. on willow is OK: OK - load average: 14.16, 14.64, 12.86 [17:05:45] Sun Grid Engine execd on ortelius is OK: short@ortelius OK: all.q@ortelius OK [17:12:49] SMF on willow is CRITICAL: ERROR - maintenance: svc:/network/puppetmasterd:default [17:14:09] SMF on turnera is CRITICAL: ERROR - offline: svc:/system/cluster/scsymon-srv:default [17:14:38] SMF on damiana is CRITICAL: ERROR - offline: svc:/system/cluster/scsymon-srv:default [17:18:08] Sun Grid Engine execd on willow is WARNING: NRPE: Unable to read output [17:19:51] ceradon: did you subscribe to toolserver-l yet? [17:20:19] willow is overloaded because nightshade is down [17:23:39] Sun Grid Engine execd on ortelius is WARNING: short@ortelius exceedes load threshold: alarm hl:np_load_short=1.011719/1.00, alarm hl:np_load_long=0.857422/1.50, alarm hl:mem_free=21606.000000M/300M, alarm hl:available=1/0 [17:40:23] ceradon: the trouble last night was caused by yourself. You started 40 instances of your php script and consumed all cpu resources on willow. 
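The "40 instances" problem described above is the classic failure mode of a frequent cron entry that keeps launching a script whose previous run never finished. A minimal sketch of a wrapper that refuses to start while an earlier run is still active; the paths and script name are illustrative only, and mkdir is used as the lock because it is atomic and needs no GNU flock on Solaris:

    #!/bin/sh
    # Hypothetical wrapper to keep cron from piling up concurrent runs.
    LOCKDIR="$HOME/.mybot.lock"
    if mkdir "$LOCKDIR" 2>/dev/null; then
        trap 'rmdir "$LOCKDIR"' 0               # drop the lock on exit
        php "$HOME/mybot/run.php"               # the real work goes here
    else
        # a stale lock after a crash has to be removed by hand
        echo "previous run still active, skipping" >&2
    fi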
[17:41:28] Sun Grid Engine execd on wolfsbane is WARNING: short@wolfsbane exceedes load threshold: alarm hl:np_load_short=0.184570/1.00, alarm hl:np_load_long=0.178223/1.50, alarm hl:mem_free=207.000000M/300M, alarm hl:available=1/0: all.q@wolfsbane exceedes load threshold: alarm hl:np_load_short=0.184570/1.10, alarm hl:np_load_long=0.178223/1.50, alarm hl:mem_free=207.000000M/300M, alarm hl:available=1/0 [17:42:42] Merlissimo: Of course this wasn't my intention, My script must've gone awry. [17:43:11] why didn't you use grid engine? [17:43:28] Sun Grid Engine execd on wolfsbane is OK: short@wolfsbane OK: all.q@wolfsbane OK [17:47:27] Sun Grid Engine execd on wolfsbane is WARNING: short@wolfsbane exceedes load threshold: alarm hl:np_load_short=0.167969/1.00, alarm hl:np_load_long=0.171387/1.50, alarm hl:mem_free=166.000000M/300M, alarm hl:available=1/0: all.q@wolfsbane exceedes load threshold: alarm hl:np_load_short=0.167969/1.10, alarm hl:np_load_long=0.171387/1.50, alarm hl:mem_free=166.000000M/300M, alarm hl:available=1/0 [18:10:56] 3(created) [TS-1323] My account, ceradon; Toolserver: Accounts; Task <10https://jira.toolserver.org/browse/TS-1323> (Symon Klyde) [18:12:58] SMF on willow is CRITICAL: ERROR - maintenance: svc:/network/puppetmasterd:default [18:14:09] SMF on turnera is CRITICAL: ERROR - offline: svc:/system/cluster/scsymon-srv:default [18:14:48] SMF on damiana is CRITICAL: ERROR - offline: svc:/system/cluster/scsymon-srv:default [18:18:09] Sun Grid Engine execd on willow is WARNING: NRPE: Unable to read output [18:22:38] /sql on z-dat-s4-a is WARNING: DISK WARNING - free space: /sql 37595 MB (9% inode=99%): [18:25:10] Willow seems to be working better this morning. [18:28:39] /sql on z-dat-s4-a is CRITICAL: DISK CRITICAL - free space: /sql 24393 MB (5% inode=99%): [18:29:38] /sql on z-dat-s4-a is OK: DISK OK - free space: /sql 49904 MB (12% inode=99%): [19:01:05] [[Special:Log/newusers]] create 10 * MasterAdministratorBot * (New user account) [19:04:53] 3(resolved) [TS-1323] My account, ceradon <10https://jira.toolserver.org/browse/TS-1323> (Symon Klyde) [19:10:42] [[Job scheduling]] 10https://wiki.toolserver.org/w/index.php?diff=6744&oldid=6743&rcid=8909 * Merlissimo * (+585) (/* Examples */ +) [19:12:58] SMF on willow is CRITICAL: ERROR - maintenance: svc:/network/puppetmasterd:default [19:14:17] SMF on turnera is CRITICAL: ERROR - offline: svc:/system/cluster/scsymon-srv:default [19:14:48] SMF on damiana is CRITICAL: ERROR - offline: svc:/system/cluster/scsymon-srv:default [19:16:53] 3(commented) [LUXO-25] Global block link for IPs at ~luxo/contributions/contributions.php <10https://jira.toolserver.org/browse/LUXO-25> (Luxo) [19:18:18] Sun Grid Engine execd on willow is WARNING: NRPE: Unable to read output [19:22:09] s4 replag on rosemary is WARNING: QUERY WARNING: SELECT ts_rc_age() returned 1919.000000 [19:24:53] [[Wikirita the free Encyclopedia]] !NM 10https://wiki.toolserver.org/w/index.php?oldid=6745&rcid=8910 * MasterAdministratorBot * (+460) (ikiritia is now will open now.) [19:26:07] bad grammer bot [19:28:26] [[Special:Log/delete]] delete 10 * MZMcBride * (deleted "[[02Wikirita the free Encyclopedia10]]": nonsense) [19:28:53] I guess I can leave the account unblocked for now. 
[19:31:52] 3(closed) [CVN-4] Requesting access to "cvn" project <10https://jira.toolserver.org/browse/CVN-4> (Seahorseruler ) [19:48:12] can someone tell me how to get a cron job to work on the toolserver [19:48:19] and yes I've read all the documentation [19:48:52] crontab -e [19:48:58] I did that [19:49:24] which server does it need to be set up on? [19:49:38] willow or clematis? [19:49:47] kaldari: first login on host submit (ssh submit) [19:49:54] submit=clematis [19:49:57] ok, I'm there [19:50:18] "cronie -e" opens an editor [19:50:34] OK... [19:50:43] nano or vi depends on your shell setting [19:50:51] http://en.wikipedia.org/wiki/Cron#Predefined_scheduling_definitions [19:50:54] how often do you want to start this job? [19:51:34] the documentation breaks down at this point, am I supposed to use cronsub, qcronsub, or just input the job directly? [19:51:54] Once a day [19:52:16] at which time? [19:52:34] 1 am [19:52:43] so something like: [19:52:44] 0 1 * * * $HOME/public_html/hotarticles/runbot.php [19:53:15] but the documentation is contradictory on how I'm supposed to set this up properly [19:53:26] it says I shouldn't just run the command directly [19:53:41] but isn't clear on how it should be set up [19:53:42] just write this line in this editor [19:54:18] then add e.g. "15 1 * * * qcronsub -b y -wd $HOME/public_html/hotarticles $HOME/public_html/hotarticles/runbot.php" [19:54:37] yay [19:54:52] That's 1 a.m., UTC. [19:55:03] Luxo: you must not start jobs on clematis without sge [19:55:46] I didn't know any jobs or crontabs were going on a non-login host. [19:55:57] i suggested 1:15 because many other jobs are submitted on the hour, but that's not so important [19:56:03] Usually all cron shit is done on the login hosts (willow and nightshade). [19:56:42] Joan: Yeah, I tried to set it up from login and it did nothing :( [19:57:03] Standard cron should be fine on the login hosts. [19:57:10] I use it. The job scheduling shit confuses me. ;-) [19:57:25] Joan: but if one login server is down the job is not executed [19:57:32] Yes, it would be nice if someone could update the documentation [19:57:39] Merlissimo: True. Luckily I chose willow. :D [19:57:48] And I save my crontab in a text file. [19:57:54] So I can sync to another server if one is down. [19:58:13] mzmcbride@willow:~$ head ~/misc/crontab.txt [19:58:13] # INSTRUCTIONS [19:58:13] # MIN HOUR DAYMONTH MONTH DAYWEEK [19:58:13] # ARTICLES [19:58:13] 0 19 6 * * PYTHONPATH=$HOME/scripts python $HOME/scripts/database-reports/deletedfilesinarticles.py > /dev/null [19:58:40] willow's cron is a bit annoying about setting PYTHONPATH. You can't just declare variables at the top of the file. [19:58:42] if you use submit you don't have to do anything if this host is down, because another host will run this cron instead automatically [19:58:47] cronie is better about this. [19:59:00] Merlissimo: Assuming the system is working properly. :-) [19:59:20] I think cron is more reliable. [19:59:47] Joan: and if willow is busy you start another job that's making willow even more busy [20:00:16] I understand the virtue of load balancing. I just don't trust the current architecture. :-) [20:00:36] Joan: that's why i have changed it during the last month [20:00:37] River set it up and she's gone. Is anyone actively maintaining it? How long before it's dropped for some other system? [20:00:52] cron is forever. [20:01:00] kaldari: how long will your job run? [20:01:12] about 1 or 2 minutes [20:01:26] and how much memory usage do you expect? 
[20:01:35] very little [20:02:12] did willow get fixed? [20:02:23] then use "0 1 * * * qcronsub -b y -wd $HOME/public_html/hotarticles -l h_rt=20:0 -l virtual_free=50M $HOME/public_html/hotarticles/runbot.php" that limits your job to 20 minutes and 50M of memory [20:02:26] Joan: Personally I would prefer something a bit more modern like Jenkins [20:02:37] Magog_the_Ogre: [20:02:38] mzmcbride@willow:~$ uptime [20:02:38] 20:02pm up 15 days 7:25, 49 users, load average: 12.82, 11.61, 10.75 [20:02:46] yay [20:02:49] but cron would be nice as well, at least everyone knows how to use it :) [20:02:52] was I correct that the problem was simply sudo kill 13912 [20:03:01] Heh, everyone assumes it'll be Vixie's cron. [20:03:11] that a certain user (*cough*) had been idle for 9 days but had a python script running non-stop the whole time? [20:03:19] Dumps are large. [20:03:25] They require a lot of time to process. [20:03:34] Actual time and computing time. [20:04:20] however, this maybe is not the right time to be processing dumps ;-) [20:04:22] Joan: there is much cpu time available on toolserver in general. only willow is very busy [20:04:22] Why don't you guys just kill all processes after 5 minutes, and if someone needs more resources, they offload the job to an external server. [20:04:48] There's some kind of resource management. Or there should be, when it's working. [20:04:53] kaldari: because that would severely limit the usefulness of the toolserver [20:04:55] slayrd watches memory usage. [20:05:11] valhallasw: There's never a right time to process them. They're so big and cumbersome. :-9 [20:05:14] :-( [20:05:14] right [20:05:15] The toolserver is generally useless from my point of view [20:05:24] maybe certain processes that go over a certain amount of processing time [20:05:30] kaldari: There are wmflabs now too. [20:05:31] Joan: true, but having only a single job server is most certainly /not/ the right time ;-) [20:05:37] although it should be OK if it's nice'd [20:05:42] The Toolserver really is only needed if you need access to the replicated DBs. [20:05:53] Everything else can be done elsewhere. [20:06:00] Everything I've ever set up on the toolserver has been broken within a couple months due to something on the servers changing [20:06:01] any user processes that go over 1000 CPU minutes would be a high end mark, but a reasonable one [20:06:20] but I digress [20:06:22] valhallasw: sge currently uses four servers for processing submitted jobs [20:06:48] Merlissimo: three at the moment, and only one for long-running jobs [20:07:13] valhallasw: four, and two for long-running jobs, but one with very limited memory [20:07:35] in that case I missed a change [20:07:38] Sun Grid Engine execd on ortelius is WARNING: short@ortelius exceedes load threshold: alarm hl:np_load_short=1.021484/1.00, alarm hl:np_load_long=0.838867/1.50, alarm hl:mem_free=21432.000000M/300M, alarm hl:available=1/0 [20:08:14] valhallasw: clematis was added for jobs using very low memory [20:08:38] Sun Grid Engine execd on ortelius is OK: short@ortelius OK: all.q@ortelius OK [20:08:45] Merlissimo: Thanks for the help, I've gotta run! 
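Two practical habits come out of the exchange above: keep the crontab itself in a file that can be reinstalled on whichever host is up (mzmcbride's ~/misc/crontab.txt), and, once entries submit through qcronsub, check on the jobs with the SGE tools rather than ps. A short sketch; the job name passed to qacct is a placeholder and should match whatever name your job actually gets:

    # Snapshot the live crontab into a file, and reinstall it after switching hosts
    crontab -l > ~/misc/crontab.txt
    crontab ~/misc/crontab.txt
    # Once the cron entry submits via qcronsub, SGE tracks the job:
    qstat -u "$USER"        # is it queued or running?
    qacct -j runbot         # afterwards: runtime, maxvmem, exit status ("runbot" is a placeholder name)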
[20:11:39] /sql on z-dat-s4-a is WARNING: DISK WARNING - free space: /sql 36845 MB (9% inode=99%): [20:12:58] SMF on willow is CRITICAL: ERROR - maintenance: svc:/network/puppetmasterd:default [20:14:18] SMF on turnera is CRITICAL: ERROR - offline: svc:/system/cluster/scsymon-srv:default [20:14:48] SMF on damiana is CRITICAL: ERROR - offline: svc:/system/cluster/scsymon-srv:default [20:16:39] /sql on z-dat-s4-a is OK: DISK OK - free space: /sql 50382 MB (12% inode=99%): [20:18:19] Sun Grid Engine execd on willow is WARNING: NRPE: Unable to read output [20:21:39] /sql on z-dat-s4-a is WARNING: DISK WARNING - free space: /sql 36253 MB (8% inode=99%): [20:21:57] hey guys, anyone know of a toolserver user / tool who might be accessing stats.grok.se rapidly right now? [20:22:16] Hey henrik. :-) [20:22:19] s4 replag on rosemary is WARNING: QUERY WARNING: SELECT ts_rc_age() returned 3585.000000 [20:22:38] hey :) [20:23:19] Are you getting slammed? Script should have a useful user-agent, otherwise block and wait for an e-mail? [20:23:38] That's what Wikimedia does. [20:23:49] yeah, my loadavg is >10 at the moment. [20:24:03] No user-agent :( [20:24:05] Single source, though? [20:24:28] And I'd hate to block the toolserver IPs, so I thought I'd check in here first. [20:24:54] DaBPunkt: ^ [20:25:08] You should only block it when there's no UA provided, if possible. [20:25:19] s4 replag on rosemary is CRITICAL: QUERY CRITICAL: SELECT ts_rc_age() returned 3602.000000 [20:25:46] not sending a UA is a valid block-reason, yes [20:26:36] It's coming from 91.198.174.[195, 202, 203, 221] [20:27:40] henrik: from where does it come at the moment? [20:27:54] DaBPunkt: all of the above IPs [20:28:12] at the same time? [20:28:13] Also .211 [20:28:15] yes [20:28:35] strange. what's your ip? [20:28:56] (of the server, of course) [20:29:22] stats.grok.se is at 46.253.202.68 [20:31:08] henrik: I will look [20:31:15] thanks [20:31:20] I guess I should really implement some sort of rate limiter too :) [20:37:50] On one of the JIRA tickets, it said Debian was installed on nightshade after it was wiped. Just wondering, is Debian going to be used on nightshade permanently now instead of solaris? (I am hoping so.) [20:38:08] * Seahorse personally dislikes solaris [20:40:30] Seahorse: yes [20:40:47] wonderful. Thank you. [20:51:12] DaBPunkt: finding any likely suspect? [20:51:43] henrik: not yet. The connections are opened and closed very fast, so I'm having problems finding the PID for the connection [20:53:56] I guess they aren't using keep-alive [20:56:10] henrik: I'd almost suggest to add a sleep(10); to the output ;-) [20:57:06] hehe [20:57:08] done :) [20:57:14] wait a moment [20:58:14] eh, hold on [20:58:18] henrik: did it stop now? [21:00:23] still getting requests [21:00:43] ok, 1 more moment [21:00:44] I think [21:00:53] harder to see them now :) [21:02:26] ok, I see some more [21:03:04] 91.198.174.211 - - [03/Mar/2012:22:01:35 +0100] "GET /json/en/201202/Operation_Calendar HTTP/1.0" 200 736 "-" "-" [21:03:09] was one [21:03:21] ok, I killed them all now AFAIS [21:04:20] thank you [21:04:22] Please tell the user to contact me; maybe we can work something better out. [21:04:51] I guess he just needs to fix his script… [21:05:05] henrik: how should he contact you? [21:05:33] either here on IRC or en:User talk:Henrik [21:06:01] henrik: ok, thank you very much for your patience [21:06:08] but maybe it was just a runaway script [21:06:26] no worries. Thanks for your help too [21:06:52] so what is the verdict on our cron jobs? 
it seems like every time I ask I get a different ansewr. [21:08:09] ok, then I can finaly go to dinner [21:09:36] DaBPunkt: enjoy :-) [21:11:44] lol it's like 10PM in Germany [21:11:53] or Denmark or Netherlands [21:11:57] wherever you're from >_> [21:12:58] SMF on willow is CRITICAL: ERROR - maintenance: svc:/network/puppetmasterd:default [21:14:18] s4 replag on rosemary is WARNING: QUERY WARNING: SELECT ts_rc_age() returned 3479.000000 [21:14:28] SMF on turnera is CRITICAL: ERROR - offline: svc:/system/cluster/scsymon-srv:default [21:14:48] SMF on damiana is CRITICAL: ERROR - offline: svc:/system/cluster/scsymon-srv:default [21:17:55] 3(created) [MAGNUS-304] UNKNOWN TAG : 'BR'!; Magnus' tools: CommonsHelper; Bug <10https://jira.toolserver.org/browse/MAGNUS-304> (Patrick Albrecht) [21:18:28] Sun Grid Engine execd on willow is WARNING: NRPE: Unable to read output [21:44:18] s4 replag on rosemary is OK: QUERY OK: SELECT ts_rc_age() returned 1691.000000 [21:48:03] Merlissimo * [Toolserver-l] Grid Engine config change [21:59:38] RAID on adenia is CRITICAL: CHECK_NRPE: Socket timeout after 30 seconds. [22:02:19] Load avg. on willow is WARNING: WARNING - load average: 15.50, 14.49, 13.17 [22:03:18] Load avg. on willow is OK: OK - load average: 13.86, 14.17, 13.14 [22:08:07] [[Job scheduling]] M 10https://wiki.toolserver.org/w/index.php?diff=6746&oldid=6744&rcid=8912 * Dab * (+1) (/* optional resources */ ) [22:12:18] Load avg. on willow is WARNING: WARNING - load average: 17.16, 15.67, 14.13 [22:12:26] Merlissimo: maxvmem is the value from qacct for virtual_free? [22:12:58] SMF on willow is CRITICAL: ERROR - maintenance: svc:/network/puppetmasterd:default [22:13:57] that is the maximum memory that the job used during runtime. so this values should be less than virtual_free [22:14:30] ok [22:14:38] SMF on turnera is CRITICAL: ERROR - offline: svc:/system/cluster/scsymon-srv:default [22:14:49] SMF on damiana is CRITICAL: ERROR - offline: svc:/system/cluster/scsymon-srv:default [22:18:38] Sun Grid Engine execd on willow is WARNING: NRPE: Unable to read output [22:23:19] Load avg. on willow is OK: OK - load average: 14.32, 14.99, 14.64 [22:23:49] Sun Grid Engine execd on ortelius is WARNING: short@ortelius exceedes load threshold: alarm hl:np_load_short=2.750000/1.00, alarm hl:np_load_long=1.221680/1.50, alarm hl:mem_free=21090.000000M/300M, alarm hl:available=1/0: all.q@ortelius exceedes load threshold: alarm hl:np_load_short=2.750000/1.10, alarm hl:np_load_long=1.221680/1.50, alarm hl:mem_free=21090.000000M/300M, alarm hl:available=1/0 [22:24:18] RAID on adenia is OK: OK - TOTAL: 2: FAILED: 0: DEGRADED: 0 [22:25:49] Sun Grid Engine execd on ortelius is OK: short@ortelius OK: all.q@ortelius OK [22:32:19] Load avg. on willow is WARNING: WARNING - load average: 16.11, 15.60, 14.83 [22:34:19] Load avg. on willow is OK: OK - load average: 13.31, 14.67, 14.57 [22:42:49] Sun Grid Engine execd on ortelius is WARNING: short@ortelius exceedes load threshold: alarm hl:np_load_short=3.238281/1.00, alarm hl:np_load_long=1.203125/1.50, alarm hl:mem_free=21280.000000M/300M, alarm hl:available=1/0: all.q@ortelius exceedes load threshold: alarm hl:np_load_short=3.238281/1.10, alarm hl:np_load_long=1.203125/1.50, alarm hl:mem_free=21280.000000M/300M, alarm hl:available=1/0 [22:47:03] K. 
Peachey * Re: [Toolserver-l] MMP interwiki-bot [22:47:52] nighty [23:04:22] [[Job scheduling]] M 10https://wiki.toolserver.org/w/index.php?diff=6747&oldid=6746&rcid=8913 * Dab * (+4) (/* arguments to qsub/qcronsub */ ) [23:08:28] Hi! Just found out my cron job (updating an rss feed) hasn't been running since Feb 17. Should I be using SGE instead or can I get it to work? [23:10:38] skagedal: you used nightshade before? [23:11:14] yes, iirc. I logged in to willow now and edited the crontab. that will work? [23:12:16] yes, but if you use sge for your job and use cronie on host submit it will also run if willow is down [23:13:08] SMF on willow is CRITICAL: ERROR - maintenance: svc:/network/puppetmasterd:default [23:13:34] Merlissimo: all right. seems like a good idea. is there a simple guide on how to do this? was a bit confused with the docs. just need to run a shell script once every day. [23:14:10] can you post the cron line you used before? [23:14:29] 5 * * * * /home/skagedal/fafafa/do_fafafa.sh [23:14:53] wait, that runs every hour right? [23:15:07] SMF on damiana is CRITICAL: ERROR - offline: svc:/system/cluster/scsymon-srv:default [23:15:18] yes [23:15:36] SMF on turnera is CRITICAL: ERROR - offline: svc:/system/cluster/scsymon-srv:default [23:15:47] yes, i remember now, i run it every hour to retry if it didn't work last time. [23:16:11] how long does this job run? [23:16:37] probably like 5 sec :) [23:16:41] max. [23:16:49] and how much memory? [23:17:56] the easiest way is to add all sge options at the top of your sh file. just add [23:18:03] Dr. Trigon * Re: [Toolserver-l] Grid Engine config change [23:18:33] don't know. how can i check? not much, it's a python tool that downloads a file or a couple and outputs an xml file. [23:18:39] #$ -N FaFaFa [23:18:40] #$ -l h_rt=5:00 [23:18:40] #$ -l virtual_free=50M [23:19:28] then add "5 * * * * qcronsub /home/skagedal/fafafa/do_fafafa.sh" to cronie -e on host submit [23:19:37] Sun Grid Engine execd on willow is WARNING: NRPE: Unable to read output [23:19:51] shouldn't be more than 50M [23:20:16] Load avg. on willow is WARNING: WARNING - load average: 17.77, 15.39, 15.39 [23:20:52] h_rt=5:00 means a maximum runtime of 5 minutes [23:21:25] Merlissimo: all right, thanks! so, just add these right after the #! line in do_fafafa.sh? [23:21:35] yes [23:22:19] (and remove the regular crontab from willow again, i assume) [23:22:44] sure, if you don't want your script to run twice [23:23:16] Load avg. on willow is OK: OK - load average: 13.35, 14.18, 14.90 [23:23:44] what was your problem with the documentation? [23:24:20] [[Job scheduling]] ! 10https://wiki.toolserver.org/w/index.php?diff=6748&oldid=6747&rcid=8914 * DrTrigon * (+1534) (/* Managing jobs / Advanced features */ added again from history and enhanced) [23:26:19] [[Job scheduling]] !M 10https://wiki.toolserver.org/w/index.php?diff=6749&oldid=6748&rcid=8915 * DrTrigon * (-20) (/* Managing jobs */ ) [23:27:16] Load avg. on willow is WARNING: WARNING - load average: 15.32, 14.80, 14.97 [23:27:34] Merlissimo: i guess I didn't get the part about cronie and submit.toolserver.org... i see it now... [23:28:16] Load avg. on willow is OK: OK - load average: 14.86, 14.69, 14.92 [23:28:53] [[Job scheduling]] ! 
10https://wiki.toolserver.org/w/index.php?diff=6750&oldid=6749&rcid=8916 * DrTrigon * (+98) (/* Managing jobs */ description of some states) [23:30:10] [[Job scheduling]] M 10https://wiki.toolserver.org/w/index.php?diff=6751&oldid=6750&rcid=8917 * Dab * (+2) (/* optional resources */ ) [23:32:40] [[Interwiki bot MMP planning]] !N 10https://wiki.toolserver.org/w/index.php?oldid=6752&rcid=8918 * MF-Warburg * (+2852) (Created page with "For planning the Interwiki bot Multi-Manager Project, according to [http://lists.wikimedia.org/pipermail/toolserver-l/2012-March/004755.html this suggestion]. Please list the in...") [23:34:03] MF-Warburg * Re: [Toolserver-l] MMP interwiki-bot [23:37:11] [[Interwiki bot MMP planning]] ! 10https://wiki.toolserver.org/w/index.php?diff=6753&oldid=6752&rcid=8919 * MF-Warburg * (+81) () [23:41:27] s4 replag on rosemary is WARNING: QUERY WARNING: SELECT ts_rc_age() returned 1894.000000 [23:48:28] s4 replag on rosemary is OK: QUERY OK: SELECT ts_rc_age() returned 1724.000000 [23:50:27] Load avg. on willow is WARNING: WARNING - load average: 20.03, 16.00, 15.76 [23:59:27] Load avg. on willow is OK: OK - load average: 13.43, 14.16, 14.93
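Putting Merlissimo's last pieces of advice together, the hourly feed job above ends up as SGE directives embedded at the top of the script plus a single cronie entry on host submit. A minimal sketch, assuming qcronsub honours embedded "#$" options the same way qsub does (which is how the instructions above read), with the script body left as a placeholder:

    #!/bin/sh
    #$ -N FaFaFa               # job name, as suggested above
    #$ -l h_rt=5:00            # hard runtime limit of 5 minutes
    #$ -l virtual_free=50M     # memory request of 50M
    # ... the existing feed-update commands from do_fafafa.sh go here ...

    # and in "cronie -e" on host submit (after removing the entry from willow's crontab):
    # 5 * * * * qcronsub /home/skagedal/fafafa/do_fafafa.sh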