[00:05:05] SMF on willow is CRITICAL: ERROR - maintenance: svc:/network/puppetmasterd:default [00:05:55] FMA on yarrow is CRITICAL: ERROR - unexpected output from snmpwalk [00:08:45] Sun Grid Engine execd on willow is WARNING: medium-sol@willow exceedes load threshold: alarm hl:np_load_short=1.026367/1.95, alarm hl:tmp_free=33273M/100M, alarm hl:np_load_avg=1.402344/2.0, alarm hl:mem_free=347.000000M/350M, alarm hl:available=1/0 [00:13:45] Sun Grid Engine execd on willow is OK: testqueue@willow disabled: medium-sol@willow OK: longrun-sol@willow OK [00:20:35] SMF on damiana is CRITICAL: ERROR - maintenance: svc:/network/ldap/client:default offline: svc:/system/cluster/scsymon-srv:default [00:24:05] SMF on turnera is CRITICAL: ERROR - offline: svc:/system/cluster/scsymon-srv:default [00:28:04] DiskSuite on turnera is CRITICAL: CRITICAL - submirror d42 of mirror d40 is Needs and submirror d32 of mirror d30 is Needs and submirror d22 of mirror d20 is Needs and submirror d12 of mirror d10 is Needs [00:33:02] Hersfold * Re: [Toolserver-l] Fairness on the toolserver [00:35:56] Sun Grid Engine execd on willow is WARNING: medium-sol@willow exceedes load threshold: alarm hl:np_load_short=1.095703/1.95, alarm hl:tmp_free=33221M/100M, alarm hl:np_load_avg=1.177734/2.0, alarm hl:mem_free=198.000000M/350M, alarm hl:available=1/0: longrun-sol@willow exceedes load threshold: alarm hl:np_load_short=1.095703/2.3, alarm hl:np_load_long=1.237305/2.5, alarm hl:cpu=68.000000/98, alarm hl:mem_free=198.000000M/200M, al [00:37:36] MySQL slave on z-dat-s6-a is CRITICAL: SLOW_SLAVE CRITICAL: Slave IO: Yes Slave SQL: Yes Seconds Behind Master: 48229 [00:38:54] Sun Grid Engine execd on willow is OK: testqueue@willow disabled: medium-sol@willow OK: longrun-sol@willow OK [00:46:04] /aux0 on hemlock is WARNING: DISK WARNING - free space: /aux0 319433 MB (6% inode=35%): [00:47:15] /aux0 on hemlock is CRITICAL: DISK CRITICAL - free space: /aux0 317507 MB (5% inode=34%): [00:50:15] 
fisheye.toolserver.org on web.amaranth is CRITICAL: CRITICAL - Socket timeout after 21 seconds [00:55:55] Sun Grid Engine execd on willow is WARNING: medium-sol@willow exceedes load threshold: alarm hl:np_load_short=2.365723/1.95, alarm hl:tmp_free=33190M/100M, alarm hl:np_load_avg=1.867676/2.0, alarm hl:mem_free=523.000000M/350M, alarm hl:available=1/0: longrun-sol@willow exceedes load threshold: alarm hl:np_load_short=2.365723/2.3, alarm hl:np_load_long=1.518066/2.5, alarm hl:cpu=71.900000/98, alarm hl:mem_free=523.000000M/200M, al [01:02:36] Load avg. on willow is WARNING: WARNING - load average: 16.14, 16.54, 13.45 [01:04:36] Load avg. on willow is OK: OK - load average: 10.20, 14.01, 12.87 [01:05:16] SMF on willow is CRITICAL: ERROR - maintenance: svc:/network/puppetmasterd:default [01:06:05] FMA on yarrow is CRITICAL: ERROR - unexpected output from snmpwalk [01:07:35] Load avg. on willow is WARNING: WARNING - load average: 19.83, 17.63, 14.51 [01:20:45] SMF on damiana is CRITICAL: ERROR - maintenance: svc:/network/ldap/client:default offline: svc:/system/cluster/scsymon-srv:default [01:24:03] Daniel Schwen * Re: [Toolserver-l] Fairness on the toolserver [01:24:14] SMF on turnera is CRITICAL: ERROR - offline: svc:/system/cluster/scsymon-srv:default [01:28:05] DiskSuite on turnera is CRITICAL: CRITICAL - submirror d42 of mirror d40 is Needs and submirror d32 of mirror d30 is Needs and submirror d22 of mirror d20 is Needs and submirror d12 of mirror d10 is Needs [01:37:48] MySQL slave on z-dat-s6-a is CRITICAL: SLOW_SLAVE CRITICAL: Slave IO: Yes Slave SQL: Yes Seconds Behind Master: 50739 [01:47:48] /aux0 on hemlock is CRITICAL: DISK CRITICAL - free space: /aux0 317437 MB (5% inode=34%): [01:50:15] fisheye.toolserver.org on web.amaranth is CRITICAL: CRITICAL - Socket timeout after 21 seconds [02:05:48] SMF on willow is CRITICAL: ERROR - maintenance: svc:/network/puppetmasterd:default [02:06:15] FMA on yarrow is CRITICAL: ERROR - unexpected output from snmpwalk 
[02:14:50] was willow rebooted? [02:14:56] Sun Grid Engine execd on ortelius is WARNING: medium-sol@ortelius exceedes load threshold: alarm hl:np_load_short=1.001953/1.00, alarm hl:np_load_long=0.847656/1.50, alarm hl:mem_free=10902.000000M/600M, alarm hl:tmp_free=13105M/100M, alarm hl:available=1/0 [02:15:56] Sun Grid Engine execd on ortelius is OK: short-sol@ortelius OK: medium-sol@ortelius OK [02:16:03] Betacommand: yesterday, yes [02:16:30] Merlissimo: about how many hours ago? [02:16:45] "yesterday" is a subjective term [02:17:58] Betacommand: https://jira.toolserver.org/browse/MNT-1240 [02:18:51] all sge jobs were successfully restarted [02:20:47] SMF on damiana is CRITICAL: ERROR - maintenance: svc:/network/ldap/client:default offline: svc:/system/cluster/scsymon-srv:default [02:22:56] Sun Grid Engine execd on ortelius is WARNING: short-sol@ortelius exceedes load threshold: alarm hl:np_load_short=1.115234/1.10, alarm hl:np_load_long=0.902344/1.55, alarm hl:mem_free=11337.000000M/500M, alarm hl:tmp_free=13107M/200M, alarm hl:available=1/0: medium-sol@ortelius exceedes load threshold: alarm hl:np_load_short=1.115234/1.00, alarm hl:np_load_long=0.902344/1.50, alarm hl:mem_free=11337.000000M/600M, alarm hl:tmp_free [02:23:47] RAID on adenia is CRITICAL: CHECK_NRPE: Socket timeout after 30 seconds. [02:24:48] SMF on turnera is CRITICAL: ERROR - offline: svc:/system/cluster/scsymon-srv:default [02:26:27] [[Kribo]] https://wiki.toolserver.org/w/index.php?diff=7242&oldid=6706&rcid=9661 * Krinkle * (-3700) (Replaced content with "'''Kribo''' is a small PHP-framework for creating simple IRC bots. 
By default it has little function but it's power is in the extensibility through plugins.== Documentation...") [02:26:43] [[Kribo]] https://wiki.toolserver.org/w/index.php?diff=7243&oldid=7242&rcid=9662 * Krinkle * (-1) () [02:28:36] RAID on adenia is OK: OK - TOTAL: 2: FAILED: 0: DEGRADED: 0 [02:28:48] DiskSuite on turnera is CRITICAL: CRITICAL - submirror d42 of mirror d40 is Needs and submirror d32 of mirror d30 is Needs and submirror d22 of mirror d20 is Needs and submirror d12 of mirror d10 is Needs [02:37:56] MySQL slave on z-dat-s6-a is CRITICAL: SLOW_SLAVE CRITICAL: Slave IO: Yes Slave SQL: Yes Seconds Behind Master: 53235 [02:39:10] dschwen: you haven't been reading the mailing list, http://www.mail-archive.com/toolserver-l@lists.wikimedia.org/msg01902.html [02:43:17] [[WmfDbBot]] https://wiki.toolserver.org/w/index.php?diff=7244&oldid=7003&rcid=9663 * Krinkle * (-2426) (Replaced content with "{{DISPLAYTITLE:wmfDbBot}}'''wmfDbBot''' is a PHP interface to retrieve information about the databases of Wikimedia Foundation wikis. 
== Documentation ==* https://github....") [02:48:47] /aux0 on hemlock is CRITICAL: DISK CRITICAL - free space: /aux0 315074 MB (5% inode=34%): [02:50:25] fisheye.toolserver.org on web.amaranth is CRITICAL: CRITICAL - Socket timeout after 21 seconds [03:06:15] FMA on yarrow is CRITICAL: ERROR - unexpected output from snmpwalk [03:06:47] SMF on willow is CRITICAL: ERROR - maintenance: svc:/network/puppetmasterd:default [03:12:26] [[Special:Log/delete]] delete * Krinkle * (deleted "[[Kribo]]": https://github.com/Krinkle/ts-krinkle-Kribo) [03:12:57] [[Special:Log/delete]] restore * Krinkle * (restored "[[Kribo]]": 5 revisions restored) [03:13:05] [[Kribo]] https://wiki.toolserver.org/w/index.php?diff=7245&oldid=7243&rcid=9666 * Krinkle * (-13) () [03:13:24] [[Special:Log/delete]] delete * Krinkle * (deleted "[[Dbbot-wm]]": [[mw:dbbot-wm]]) [03:20:57] SMF on damiana is CRITICAL: ERROR - maintenance: svc:/network/ldap/client:default offline: svc:/system/cluster/scsymon-srv:default [03:25:47] SMF on turnera is CRITICAL: ERROR - offline: svc:/system/cluster/scsymon-srv:default [03:28:49] DiskSuite on turnera is CRITICAL: CRITICAL - submirror d42 of mirror d40 is Needs and submirror d32 of mirror d30 is Needs and submirror d22 of mirror d20 is Needs and submirror d12 of mirror d10 is Needs [03:32:06] Sun Grid Engine execd on ortelius is WARNING: short-sol@ortelius exceedes load threshold: alarm hl:np_load_short=1.274414/1.10, alarm hl:np_load_long=0.910156/1.55, alarm hl:mem_free=11035.000000M/500M, alarm hl:tmp_free=14806M/200M, alarm hl:available=1/0: medium-sol@ortelius exceedes load threshold: alarm hl:np_load_short=1.274414/1.00, alarm hl:np_load_long=0.910156/1.50, alarm hl:mem_free=11035.000000M/600M, alarm hl:tmp_free [03:36:16] Sun Grid Engine execd on willow is WARNING: medium-sol@willow exceedes load threshold: alarm hl:np_load_short=0.966309/1.95, alarm hl:tmp_free=32924M/100M, alarm hl:np_load_avg=1.330078/2.0, alarm 
hl:mem_free=221.000000M/350M, alarm hl:available=1/0 [03:38:56] MySQL slave on z-dat-s6-a is CRITICAL: SLOW_SLAVE CRITICAL: Slave IO: Yes Slave SQL: Yes Seconds Behind Master: 55071 [03:39:06] Sun Grid Engine execd on ortelius is OK: short-sol@ortelius OK: medium-sol@ortelius OK [03:39:15] Sun Grid Engine execd on willow is OK: testqueue@willow disabled: medium-sol@willow OK: longrun-sol@willow OK [03:43:07] Sun Grid Engine execd on ortelius is WARNING: short-sol@ortelius exceedes load threshold: alarm hl:np_load_short=1.203125/1.10, alarm hl:np_load_long=1.052734/1.55, alarm hl:mem_free=10364.000000M/500M, alarm hl:tmp_free=14792M/200M, alarm hl:available=1/0: medium-sol@ortelius exceedes load threshold: alarm hl:np_load_short=1.203125/1.00, alarm hl:np_load_long=1.052734/1.50, alarm hl:mem_free=10364.000000M/600M, alarm hl:tmp_free [03:43:16] Sun Grid Engine execd on willow is WARNING: medium-sol@willow exceedes load threshold: alarm hl:np_load_short=3.041992/1.95, alarm hl:tmp_free=32914M/100M, alarm hl:np_load_avg=1.940430/2.0, alarm hl:mem_free=422.000000M/350M, alarm hl:available=1/0: longrun-sol@willow exceedes load threshold: alarm hl:np_load_short=3.041992/2.3, alarm hl:np_load_long=1.670899/2.5, alarm hl:cpu=78.500000/98, alarm hl:mem_free=422.000000M/200M, al [03:48:47] /aux0 on hemlock is CRITICAL: DISK CRITICAL - free space: /aux0 315246 MB (5% inode=34%): [03:50:36] fisheye.toolserver.org on web.amaranth is CRITICAL: CRITICAL - Socket timeout after 21 seconds [04:02:56] Load avg. on willow is WARNING: WARNING - load average: 14.27, 17.16, 14.48 [04:05:05] Load avg. 
on willow is OK: OK - load average: 7.94, 13.95, 13.61 [04:06:26] FMA on yarrow is CRITICAL: ERROR - unexpected output from snmpwalk [04:06:56] SMF on willow is CRITICAL: ERROR - maintenance: svc:/network/puppetmasterd:default [04:12:16] Sun Grid Engine execd on ortelius is WARNING: medium-sol@ortelius exceedes load threshold: alarm hl:np_load_short=1.001953/1.00, alarm hl:np_load_long=0.949219/1.50, alarm hl:mem_free=9807.000000M/600M, alarm hl:tmp_free=14774M/100M, alarm hl:available=1/0 [04:13:17] Sun Grid Engine execd on ortelius is OK: short-sol@ortelius OK: medium-sol@ortelius OK [04:21:16] SMF on damiana is CRITICAL: ERROR - maintenance: svc:/network/ldap/client:default offline: svc:/system/cluster/scsymon-srv:default [04:23:46] RAID on adenia is CRITICAL: CHECK_NRPE: Socket timeout after 30 seconds. [04:25:56] SMF on turnera is CRITICAL: ERROR - offline: svc:/system/cluster/scsymon-srv:default [04:28:56] DiskSuite on turnera is CRITICAL: CRITICAL - submirror d42 of mirror d40 is Needs and submirror d32 of mirror d30 is Needs and submirror d22 of mirror d20 is Needs and submirror d12 of mirror d10 is Needs [04:30:15] [[Wiki server assignments]] ! 
https://wiki.toolserver.org/w/index.php?diff=7246&oldid=7198&rcid=9668 * 91.198.174.202 * (-128) (updated page) [04:34:15] Sun Grid Engine execd on ortelius is WARNING: short-sol@ortelius exceedes load threshold: alarm hl:np_load_short=1.158203/1.10, alarm hl:np_load_long=0.886719/1.55, alarm hl:mem_free=8800.000000M/500M, alarm hl:tmp_free=14746M/200M, alarm hl:available=1/0: medium-sol@ortelius exceedes load threshold: alarm hl:np_load_short=1.158203/1.00, alarm hl:np_load_long=0.886719/1.50, alarm hl:mem_free=8800.000000M/600M, alarm hl:tmp_free=1 [04:39:07] MySQL slave on z-dat-s6-a is CRITICAL: SLOW_SLAVE CRITICAL: Slave IO: Yes Slave SQL: Yes Seconds Behind Master: 57045 [04:48:57] Sun Grid Engine execd on ortelius is OK: short-sol@ortelius OK: medium-sol@ortelius OK [04:49:07] /aux0 on hemlock is CRITICAL: DISK CRITICAL - free space: /aux0 315193 MB (5% inode=34%): [04:51:27] fisheye.toolserver.org on web.amaranth is CRITICAL: CRITICAL - Socket timeout after 21 seconds [05:05:07] Sun Grid Engine execd on ortelius is WARNING: short-sol@ortelius exceedes load threshold: alarm hl:np_load_short=1.135742/1.10, alarm hl:np_load_long=1.021484/1.55, alarm hl:mem_free=10926.000000M/500M, alarm hl:tmp_free=14709M/200M, alarm hl:available=1/0: medium-sol@ortelius exceedes load threshold: alarm hl:np_load_short=1.135742/1.00, alarm hl:np_load_long=1.021484/1.50, alarm hl:mem_free=10926.000000M/600M, alarm hl:tmp_free [05:06:08] Sun Grid Engine execd on ortelius is OK: short-sol@ortelius OK: medium-sol@ortelius OK [05:07:08] SMF on willow is CRITICAL: ERROR - maintenance: svc:/network/puppetmasterd:default [05:07:17] FMA on yarrow is CRITICAL: ERROR - unexpected output from snmpwalk [05:22:07] SMF on damiana is CRITICAL: ERROR - maintenance: svc:/network/ldap/client:default offline: svc:/system/cluster/scsymon-srv:default [05:26:07] SMF on turnera is CRITICAL: ERROR - offline: svc:/system/cluster/scsymon-srv:default [05:29:08] DiskSuite on turnera is CRITICAL: CRITICAL 
- submirror d42 of mirror d40 is Needs and submirror d32 of mirror d30 is Needs and submirror d22 of mirror d20 is Needs and submirror d12 of mirror d10 is Needs [05:33:07] Sun Grid Engine execd on willow is WARNING: medium-sol@willow exceedes load threshold: alarm hl:np_load_short=2.359375/1.95, alarm hl:tmp_free=32728M/100M, alarm hl:np_load_avg=2.515137/2.0, alarm hl:mem_free=522.000000M/350M, alarm hl:available=1/0: longrun-sol@willow exceedes load threshold: alarm hl:np_load_short=2.359375/2.3, alarm hl:np_load_long=1.998535/2.5, alarm hl:cpu=57.600000/98, alarm hl:mem_free=522.000000M/200M, al [05:33:16] Load avg. on willow is WARNING: WARNING - load average: 12.30, 18.19, 15.50 [05:35:08] Sun Grid Engine execd on willow is OK: testqueue@willow disabled: medium-sol@willow OK: longrun-sol@willow OK [05:38:17] Load avg. on willow is OK: OK - load average: 9.55, 13.55, 14.16 [05:39:18] MySQL slave on z-dat-s6-a is CRITICAL: SLOW_SLAVE CRITICAL: Slave IO: Yes Slave SQL: Yes Seconds Behind Master: 58272 [05:44:17] Load avg. 
on willow is WARNING: WARNING - load average: 14.62, 15.56, 14.77 [05:49:16] /aux0 on hemlock is CRITICAL: DISK CRITICAL - free space: /aux0 315114 MB (5% inode=34%): [05:51:47] fisheye.toolserver.org on web.amaranth is CRITICAL: CRITICAL - Socket timeout after 21 seconds [06:03:08] Sun Grid Engine execd on willow is WARNING: medium-sol@willow exceedes load threshold: alarm hl:np_load_short=2.822266/1.95, alarm hl:tmp_free=32575M/100M, alarm hl:np_load_avg=2.786621/2.0, alarm hl:mem_free=465.000000M/350M, alarm hl:available=1/0: longrun-sol@willow exceedes load threshold: alarm hl:np_load_short=2.822266/2.3, alarm hl:np_load_long=2.252441/2.5, alarm hl:cpu=91.200000/98, alarm hl:mem_free=465.000000M/200M, al [06:07:29] SMF on willow is CRITICAL: ERROR - maintenance: svc:/network/puppetmasterd:default [06:07:49] FMA on yarrow is CRITICAL: ERROR - unexpected output from snmpwalk [06:22:10] SMF on damiana is CRITICAL: ERROR - maintenance: svc:/network/ldap/client:default offline: svc:/system/cluster/scsymon-srv:default [06:26:19] SMF on turnera is CRITICAL: ERROR - offline: svc:/system/cluster/scsymon-srv:default [06:29:19] DiskSuite on turnera is CRITICAL: CRITICAL - submirror d42 of mirror d40 is Needs and submirror d32 of mirror d30 is Needs and submirror d22 of mirror d20 is Needs and submirror d12 of mirror d10 is Needs [06:39:20] MySQL slave on z-dat-s6-a is CRITICAL: SLOW_SLAVE CRITICAL: Slave IO: Yes Slave SQL: Yes Seconds Behind Master: 59250 [06:49:20] /aux0 on hemlock is CRITICAL: DISK CRITICAL - free space: /aux0 314963 MB (5% inode=34%): [06:52:09] fisheye.toolserver.org on web.amaranth is CRITICAL: CRITICAL - Socket timeout after 21 seconds [06:54:19] Sun Grid Engine execd on willow is OK: testqueue@willow disabled: medium-sol@willow OK: longrun-sol@willow OK [06:57:10] Sun Grid Engine execd on willow is WARNING: medium-sol@willow exceedes load threshold: alarm hl:np_load_short=1.541016/1.95, alarm hl:tmp_free=32419M/100M, alarm 
hl:np_load_avg=2.075195/2.0, alarm hl:mem_free=453.000000M/350M, alarm hl:available=1/0 [06:58:09] Sun Grid Engine execd on willow is OK: testqueue@willow disabled: medium-sol@willow OK: longrun-sol@willow OK [07:07:33] SMF on willow is CRITICAL: ERROR - maintenance: svc:/network/puppetmasterd:default [07:07:53] FMA on yarrow is CRITICAL: ERROR - unexpected output from snmpwalk [07:21:23] Load avg. on willow is WARNING: WARNING - load average: 11.24, 13.57, 15.30 [07:22:13] SMF on damiana is CRITICAL: ERROR - maintenance: svc:/network/ldap/client:default offline: svc:/system/cluster/scsymon-srv:default [07:22:23] Load avg. on willow is OK: OK - load average: 9.15, 12.47, 14.79 [07:26:23] Load avg. on willow is WARNING: WARNING - load average: 13.68, 14.28, 15.02 [07:26:23] SMF on turnera is CRITICAL: ERROR - offline: svc:/system/cluster/scsymon-srv:default [07:27:24] Load avg. on willow is OK: OK - load average: 10.63, 13.29, 14.62 [07:29:31] DiskSuite on turnera is CRITICAL: CRITICAL - submirror d42 of mirror d40 is Needs and submirror d32 of mirror d30 is Needs and submirror d22 of mirror d20 is Needs and submirror d12 of mirror d10 is Needs [07:39:23] MySQL slave on z-dat-s6-a is CRITICAL: SLOW_SLAVE CRITICAL: Slave IO: Yes Slave SQL: Yes Seconds Behind Master: 60132 [07:49:23] /aux0 on hemlock is CRITICAL: DISK CRITICAL - free space: /aux0 314858 MB (5% inode=34%): [07:52:13] fisheye.toolserver.org on web.amaranth is CRITICAL: CRITICAL - Socket timeout after 21 seconds [07:59:03] fisheye.toolserver.org on web.amaranth is OK: HTTP OK: HTTP/1.1 200 OK - 273 bytes in 11.339 second response time [07:59:53] /sql on z-dat-s4-a is WARNING: DISK WARNING - free space: /sql 40350 MB (9% inode=99%): [08:02:23] Load avg. 
on willow is WARNING: WARNING - load average: 19.75, 17.04, 14.35 [08:03:23] Sun Grid Engine execd on willow is WARNING: medium-sol@willow exceedes load threshold: alarm hl:np_load_short=1.815918/1.95, alarm hl:tmp_free=32294M/100M, alarm hl:np_load_avg=2.008789/2.0, alarm hl:mem_free=583.000000M/350M, alarm hl:available=1/0 [08:04:24] Sun Grid Engine execd on willow is OK: testqueue@willow disabled: medium-sol@willow OK: longrun-sol@willow OK [08:04:25] Load avg. on willow is OK: OK - load average: 9.47, 14.08, 13.57 [08:07:32] SMF on willow is CRITICAL: ERROR - maintenance: svc:/network/puppetmasterd:default [08:08:03] FMA on yarrow is CRITICAL: ERROR - unexpected output from snmpwalk [08:08:53] /sql on z-dat-s4-a is CRITICAL: DISK CRITICAL - free space: /sql 23564 MB (5% inode=99%): [08:14:23] Load avg. on willow is WARNING: WARNING - load average: 13.14, 16.11, 14.62 [08:17:43] (resolved) [TS-1380] Crontab for MMT cvn on willow is not running <https://jira.toolserver.org/browse/TS-1380> (Marlen Caemmerer) [08:22:24] SMF on damiana is CRITICAL: ERROR - maintenance: svc:/network/ldap/client:default offline: svc:/system/cluster/scsymon-srv:default [08:23:53] /sql on z-dat-s4-a is WARNING: DISK WARNING - free space: /sql 32965 MB (8% inode=99%): [08:24:53] /sql on z-dat-s4-a is CRITICAL: DISK CRITICAL - free space: /sql 24262 MB (5% inode=99%): [08:26:33] SMF on turnera is CRITICAL: ERROR - offline: svc:/system/cluster/scsymon-srv:default [08:30:23] DiskSuite on turnera is CRITICAL: CRITICAL - submirror d42 of mirror d40 is Needs and submirror d32 of mirror d30 is Needs and submirror d22 of mirror d20 is Needs and submirror d12 of mirror d10 is Needs [08:33:24] Sun Grid Engine execd on willow is WARNING: medium-sol@willow exceedes load threshold: alarm hl:np_load_short=2.122070/1.95, alarm hl:tmp_free=32225M/100M, alarm hl:np_load_avg=2.479980/2.0, alarm hl:mem_free=416.000000M/350M, alarm hl:available=1/0: longrun-sol@willow exceedes load threshold: alarm 
hl:np_load_short=2.122070/2.3, alarm hl:np_load_long=2.131348/2.5, alarm hl:cpu=98.600000/98, alarm hl:mem_free=416.000000M/200M, al [08:39:33] MySQL slave on z-dat-s6-a is CRITICAL: SLOW_SLAVE CRITICAL: Slave IO: Yes Slave SQL: Yes Seconds Behind Master: 61528 [08:39:34] Sun Grid Engine execd on willow is OK: testqueue@willow disabled: medium-sol@willow OK: longrun-sol@willow OK [08:49:33] /aux0 on hemlock is CRITICAL: DISK CRITICAL - free space: /aux0 313053 MB (5% inode=34%): [08:56:23] Environment IPMI on adenia is CRITICAL: CHECK_NRPE: Socket timeout after 30 seconds. [08:58:23] Environment IPMI on adenia is OK: ok: temperature ok fan ok voltage ok chassis ok [09:02:34] Sun Grid Engine execd on willow is WARNING: medium-sol@willow exceedes load threshold: alarm hl:np_load_short=1.994629/1.95, alarm hl:tmp_free=32152M/100M, alarm hl:np_load_avg=1.937500/2.0, alarm hl:mem_free=779.000000M/350M, alarm hl:available=1/0 [09:03:33] Sun Grid Engine execd on willow is OK: testqueue@willow disabled: medium-sol@willow OK: longrun-sol@willow OK [09:07:43] SMF on willow is CRITICAL: ERROR - maintenance: svc:/network/puppetmasterd:default [09:08:02] FMA on yarrow is CRITICAL: ERROR - unexpected output from snmpwalk [09:22:34] SMF on damiana is CRITICAL: ERROR - maintenance: svc:/network/ldap/client:default offline: svc:/system/cluster/scsymon-srv:default [09:23:43] Sun Grid Engine execd on ortelius is WARNING: medium-sol@ortelius exceedes load threshold: alarm hl:np_load_short=1.081055/1.00, alarm hl:np_load_long=0.695312/1.50, alarm hl:mem_free=12373.000000M/600M, alarm hl:tmp_free=14412M/100M, alarm hl:available=1/0 [09:24:43] Sun Grid Engine execd on ortelius is OK: short-sol@ortelius OK: medium-sol@ortelius OK [09:26:34] Sun Grid Engine execd on willow is WARNING: medium-sol@willow exceedes load threshold: alarm hl:np_load_short=2.797852/1.95, alarm hl:tmp_free=32115M/100M, alarm hl:np_load_avg=1.956055/2.0, alarm hl:mem_free=393.000000M/350M, alarm hl:available=1/0: 
longrun-sol@willow exceedes load threshold: alarm hl:np_load_short=2.797852/2.3, alarm hl:np_load_long=1.697266/2.5, alarm hl:cpu=54.000000/98, alarm hl:mem_free=393.000000M/200M, al [09:26:43] SMF on turnera is CRITICAL: ERROR - offline: svc:/system/cluster/scsymon-srv:default [09:30:44] DiskSuite on turnera is CRITICAL: CRITICAL - submirror d42 of mirror d40 is Needs and submirror d32 of mirror d30 is Needs and submirror d22 of mirror d20 is Needs and submirror d12 of mirror d10 is Needs [09:33:53] Sun Grid Engine execd on willow is OK: testqueue@willow disabled: medium-sol@willow OK: longrun-sol@willow OK [09:39:43] MySQL slave on z-dat-s6-a is CRITICAL: SLOW_SLAVE CRITICAL: Slave IO: Yes Slave SQL: Yes Seconds Behind Master: 62415 [09:49:43] /aux0 on hemlock is CRITICAL: DISK CRITICAL - free space: /aux0 312898 MB (5% inode=34%): [09:51:39] [[Interwiki bot MMP planning]] ! https://wiki.toolserver.org/w/index.php?diff=7247&oldid=7196&rcid=9669 * 77.250.0.39 * (+141) (/* Bot list */ ) [09:51:57] [[Interwiki bot MMP planning]] ! https://wiki.toolserver.org/w/index.php?diff=7248&oldid=7247&rcid=9670 * 77.250.0.39 * (+19) (/* Bot list */ ) [09:52:56] [[Interwiki bot MMP planning]] ! 
https://wiki.toolserver.org/w/index.php?diff=7249&oldid=7248&rcid=9671 * 77.250.0.39 * (+75) (/* Bot list */ ) [10:02:52] Sun Grid Engine execd on willow is WARNING: medium-sol@willow exceedes load threshold: alarm hl:np_load_short=2.107910/1.95, alarm hl:tmp_free=32039M/100M, alarm hl:np_load_avg=1.854004/2.0, alarm hl:mem_free=845.000000M/350M, alarm hl:available=1/0 [10:03:51] Sun Grid Engine execd on willow is OK: testqueue@willow disabled: medium-sol@willow OK: longrun-sol@willow OK [10:07:52] SMF on willow is CRITICAL: ERROR - maintenance: svc:/network/puppetmasterd:default [10:08:23] FMA on yarrow is CRITICAL: ERROR - unexpected output from snmpwalk [10:14:31] fisheye.toolserver.org on web.amaranth is WARNING: HTTP WARNING: HTTP/1.1 200 OK - 272 bytes in 15.108 second response time [10:14:55] turnera will be offline soon for hard disk replacement [10:22:52] SMF on damiana is CRITICAL: ERROR - maintenance: svc:/network/ldap/client:default offline: svc:/system/cluster/scsymon-srv:default [10:23:59] fisheye.toolserver.org on web.amaranth is CRITICAL: CRITICAL - Socket timeout after 21 seconds [10:26:39] Free Memory on damiana is CRITICAL: CRITICAL - 3.3% (138840 kB) free! [10:29:40] Free Memory on damiana is WARNING: WARNING - 5.1% (214620 kB) free! [10:40:55] Free Memory on damiana is OK: OK - 54.4% (2275616 kB) free. 
[10:41:46] MySQL slave on z-dat-s6-a is CRITICAL: SLOW_SLAVE CRITICAL: Slave IO: Yes Slave SQL: Yes Seconds Behind Master: 63858 [10:42:45] Sun Grid Engine execd on willow is CRITICAL: medium-sol@willow in error state: QERROR as result of job 2048871s failure: longrun-sol@willow in error state: QERROR as result of job 2048871s failure [10:50:15] /aux0 on hemlock is CRITICAL: DISK CRITICAL - free space: /aux0 312776 MB (5% inode=34%): [11:00:06] Sun Grid Engine execd on ortelius is WARNING: short-sol@ortelius exceedes load threshold: alarm hl:np_load_short=1.451172/1.10, alarm hl:np_load_long=0.671875/1.55, alarm hl:mem_free=11644.000000M/500M, alarm hl:tmp_free=14317M/200M, alarm hl:available=1/0: medium-sol@ortelius exceedes load threshold: alarm hl:np_load_short=1.451172/1.00, alarm hl:np_load_long=0.671875/1.50, alarm hl:mem_free=11644.000000M/600M, alarm hl:tmp_free [11:04:35] CAM on hemlock is CRITICAL: CRITICAL - Storage ts-array5 (1 error): null :OSGi.com.sun.storage.cam.agent(device.2530):event.ProblemEvent.REC_SAS_PORT_DEGRADED.description:S27:Tray.85.Controller.A.Port.2: [11:04:46] DiskSuite on turnera is CRITICAL: CRITICAL - submirror d42 of mirror d40 is Needs and submirror d32 of mirror d30 is Needs and submirror d22 of mirror d20 is Needs and submirror d12 of mirror d10 is Needs [11:04:46] SMF on turnera is CRITICAL: ERROR - offline: svc:/system/cluster/scsymon-srv:default [11:07:36] CAM on hemlock is OK: OK - cam detected no new errors [11:08:26] SMF on willow is CRITICAL: ERROR - maintenance: svc:/network/puppetmasterd:default [11:08:26] [[Admin:Service processors]] https://wiki.toolserver.org/w/index.php?diff=7250&oldid=6683&rcid=9672 * Nosy * (+296) (/* Sun X2** (ELOM) */ ) [11:08:46] FMA on yarrow is CRITICAL: ERROR - unexpected output from snmpwalk [11:21:06] Sun Grid Engine execd on ortelius is OK: short-sol@ortelius OK: medium-sol@ortelius OK [11:23:16] SMF on damiana is CRITICAL: ERROR - maintenance: svc:/network/ldap/client:default 
offline: svc:/system/cluster/scsymon-srv:default [11:24:36] fisheye.toolserver.org on web.amaranth is CRITICAL: CRITICAL - Socket timeout after 21 seconds [11:25:06] Sun Grid Engine execd on ortelius is WARNING: short-sol@ortelius exceedes load threshold: alarm hl:np_load_short=1.708984/1.10, alarm hl:np_load_long=1.210938/1.55, alarm hl:mem_free=11518.000000M/500M, alarm hl:tmp_free=14275M/200M, alarm hl:available=1/0: medium-sol@ortelius exceedes load threshold: alarm hl:np_load_short=1.708984/1.00, alarm hl:np_load_long=1.210938/1.50, alarm hl:mem_free=11518.000000M/600M, alarm hl:tmp_free [11:42:12] MySQL slave on z-dat-s6-a is CRITICAL: SLOW_SLAVE CRITICAL: Slave IO: Yes Slave SQL: Yes Seconds Behind Master: 64927 [11:43:11] Sun Grid Engine execd on willow is CRITICAL: medium-sol@willow in error state: QERROR as result of job 2048871s failure: longrun-sol@willow in error state: QERROR as result of job 2048871s failure [11:47:11] Sun Grid Engine execd on ortelius is OK: short-sol@ortelius OK: medium-sol@ortelius OK [11:51:12] /aux0 on hemlock is CRITICAL: DISK CRITICAL - free space: /aux0 312564 MB (5% inode=34%): [12:00:12] Sun Grid Engine execd on ortelius is WARNING: short-sol@ortelius exceedes load threshold: alarm hl:np_load_short=1.216797/1.10, alarm hl:np_load_long=1.238281/1.55, alarm hl:mem_free=11774.000000M/500M, alarm hl:tmp_free=14206M/200M, alarm hl:available=1/0: medium-sol@ortelius exceedes load threshold: alarm hl:np_load_short=1.216797/1.00, alarm hl:np_load_long=1.238281/1.50, alarm hl:mem_free=11774.000000M/600M, alarm hl:tmp_free [12:05:11] DiskSuite on turnera is CRITICAL: CRITICAL - submirror d42 of mirror d40 is Needs and submirror d32 of mirror d30 is Needs and submirror d22 of mirror d20 is Needs and submirror d12 of mirror d10 is Needs [12:05:11] SMF on turnera is CRITICAL: ERROR - offline: svc:/system/cluster/scsymon-srv:default [12:07:11] Sun Grid Engine execd on ortelius is OK: short-sol@ortelius OK: medium-sol@ortelius OK 
[12:08:40] SMF on willow is CRITICAL: ERROR - maintenance: svc:/network/puppetmasterd:default [12:09:41] FMA on yarrow is CRITICAL: ERROR - unexpected output from snmpwalk [12:13:11] Sun Grid Engine execd on willow is WARNING: medium-sol@willow exceedes load threshold: alarm hl:np_load_short=2.846191/1.95, alarm hl:tmp_free=31366M/100M, alarm hl:np_load_avg=1.877930/2.0, alarm hl:mem_free=402.000000M/350M, alarm hl:available=1/0: longrun-sol@willow exceedes load threshold: alarm hl:np_load_short=2.846191/2.3, alarm hl:np_load_long=1.607910/2.5, alarm hl:cpu=72.500000/98, alarm hl:mem_free=402.000000M/200M, al [12:14:11] Sun Grid Engine execd on willow is OK: testqueue@willow disabled: medium-sol@willow OK: longrun-sol@willow OK [12:23:20] SMF on damiana is CRITICAL: ERROR - maintenance: svc:/network/ldap/client:default offline: svc:/system/cluster/scsymon-srv:default [12:24:41] fisheye.toolserver.org on web.amaranth is CRITICAL: CRITICAL - Socket timeout after 21 seconds [12:43:08] MySQL slave on z-dat-s6-a is CRITICAL: SLOW_SLAVE CRITICAL: Slave IO: Yes Slave SQL: Yes Seconds Behind Master: 66456 [12:45:18] Sun Grid Engine execd on ortelius is WARNING: short-sol@ortelius exceedes load threshold: alarm hl:np_load_short=1.517578/1.10, alarm hl:np_load_long=0.935547/1.55, alarm hl:mem_free=11695.000000M/500M, alarm hl:tmp_free=14162M/200M, alarm hl:available=1/0: medium-sol@ortelius exceedes load threshold: alarm hl:np_load_short=1.517578/1.00, alarm hl:np_load_long=0.935547/1.50, alarm hl:mem_free=11695.000000M/600M, alarm hl:tmp_free [12:50:19] Sun Grid Engine execd on ortelius is OK: short-sol@ortelius OK: medium-sol@ortelius OK [12:51:18] /aux0 on hemlock is CRITICAL: DISK CRITICAL - free space: /aux0 312386 MB (5% inode=34%): [12:57:45] (commented) [TS-1299] Create user sgeadmin also local on each server <https://jira.toolserver.org/browse/TS-1299> (Marlen Caemmerer) [13:05:24] hello all [13:06:08] DiskSuite on turnera is CRITICAL: CRITICAL - submirror d42 of 
mirror d40 is Needs and submirror d32 of mirror d30 is Needs and submirror d22 of mirror d20 is Needs and submirror d12 of mirror d10 is Needs [13:06:08] SMF on turnera is CRITICAL: ERROR - offline: svc:/system/cluster/scsymon-srv:default [13:07:39] DaBPunkt: I was wondering how difficult it would be to generate toolserver resource usage reports on a regular basis? [13:08:57] Betacommand: in which way? [13:09:38] SMF on willow is CRITICAL: ERROR - maintenance: svc:/network/puppetmasterd:default [13:09:47] FMA on yarrow is CRITICAL: ERROR - unexpected output from snmpwalk [13:09:50] amount of RAM/CPU usage per user over time [13:12:27] sounds like a lot of work to me [13:13:18] Sun Grid Engine execd on willow is WARNING: medium-sol@willow exceedes load threshold: alarm hl:np_load_short=3.540527/1.95, alarm hl:tmp_free=31265M/100M, alarm hl:np_load_avg=2.129883/2.0, alarm hl:mem_free=405.000000M/350M, alarm hl:available=1/0: longrun-sol@willow exceedes load threshold: alarm hl:np_load_short=3.540527/2.3, alarm hl:np_load_long=1.764649/2.5, alarm hl:cpu=73.100000/98, alarm hl:mem_free=405.000000M/200M, al [13:13:19] DaBPunkt: I just thought there might have been tools already available for you to use [13:14:51] DaBPunkt: for example what was said in http://lists.wikimedia.org/pipermail/toolserver-l/2012-May/004957.html [13:14:57] Betacommand: there is a difference between "How much cpu-time is a user using at the moment" and "how much cpu-time did he/she use today". For the latter you have to collect this data (for all users) and a machine has to do that [13:15:17] Sun Grid Engine execd on willow is OK: testqueue@willow OK: medium-sol@willow OK: longrun-sol@willow OK [13:16:12] DaBPunkt: how difficult would it be to take a snapshot say every 10 minutes or so and graph those? 
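(Editorial sketch, not part of the channel traffic.) The periodic-snapshot idea raised above could look roughly like the following. This is only an illustration under stated assumptions: it assumes a `ps` that accepts `-eo user,pcpu,rss` and reports RSS in KiB, and the function names `summarize_ps` and `take_snapshot` are hypothetical. Nothing in the log says the Toolserver admins used this approach.

```python
import subprocess
import time
from collections import defaultdict


def summarize_ps(ps_output: str) -> dict:
    """Sum %CPU and resident memory (KiB) per user.

    Expects text shaped like the output of `ps -eo user,pcpu,rss`
    (an assumed invocation): one header line, then three
    whitespace-separated columns per process.
    """
    totals = defaultdict(lambda: {"cpu": 0.0, "rss_kb": 0})
    for line in ps_output.strip().splitlines():
        parts = line.split()
        if len(parts) != 3:
            continue  # skip blank or malformed lines
        user, pcpu, rss = parts
        try:
            cpu_value = float(pcpu)
            rss_value = int(rss)
        except ValueError:
            continue  # header line or non-numeric fields
        totals[user]["cpu"] += cpu_value
        totals[user]["rss_kb"] += rss_value
    return dict(totals)


def take_snapshot() -> dict:
    """Run ps once and return a timestamped per-user summary."""
    result = subprocess.run(
        ["ps", "-eo", "user,pcpu,rss"],
        capture_output=True, text=True, check=True,
    )
    return {"time": int(time.time()), "users": summarize_ps(result.stdout)}
```

Calling `take_snapshot()` from a scheduler every 10 minutes and appending each result to a file would give a time series that a graphing tool could plot. Note that `pcpu` is an instantaneous or averaged figure depending on the platform, so cumulative "CPU time used today" would still need the collected history, as pointed out in the discussion.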
[13:18:42] it depends on how machine-readable the data is [13:19:03] I have never worked with ulimit or sar before [13:23:28] SMF on damiana is CRITICAL: ERROR - maintenance: svc:/network/ldap/client:default offline: svc:/system/cluster/scsymon-srv:default [13:24:48] fisheye.toolserver.org on web.amaranth is CRITICAL: CRITICAL - Socket timeout after 21 seconds [13:34:39] fisheye.toolserver.org on web.amaranth is OK: HTTP OK: HTTP/1.1 200 OK - 274 bytes in 10.447 second response time [13:43:08] MySQL slave on z-dat-s6-a is CRITICAL: SLOW_SLAVE CRITICAL: Slave IO: Yes Slave SQL: Yes Seconds Behind Master: 67843 [13:44:16] DaBPunkt: sorry to bother you again, what would the procedure for changing the ssh key for my account be? I still have the old one but it's been 4 years and I haven't got a clue what its password is, sorry :( [13:44:54] Snowolf: open a jira-ticket in the ts-queue [13:45:00] DaBPunkt: many thanks [13:46:18] Sun Grid Engine execd on ortelius is WARNING: medium-sol@ortelius exceedes load threshold: alarm hl:np_load_short=1.008789/1.00, alarm hl:np_load_long=0.869140/1.50, alarm hl:mem_free=10899.000000M/600M, alarm hl:tmp_free=13902M/100M, alarm hl:available=1/0 [13:48:18] Sun Grid Engine execd on ortelius is OK: short-sol@ortelius OK: medium-sol@ortelius OK [13:50:18] Load avg. on adenia is WARNING: WARNING - load average: 17.81, 12.53, 7.78 [13:50:18] (done) [13:50:44] (created) [TS-1382] Change of ssh key for my account (snowolf); Toolserver: Accounts; Task <https://jira.toolserver.org/browse/TS-1382> (Snowolf ) [13:51:17] Load avg.
on adenia is OK: OK - load average: 13.09, 12.28, 8.00 [13:51:17] /aux0 on hemlock is CRITICAL: DISK CRITICAL - free space: /aux0 312205 MB (5% inode=34%): [14:06:08] DiskSuite on turnera is CRITICAL: CRITICAL - submirror d42 of mirror d40 is Needs and submirror d32 of mirror d30 is Needs and submirror d22 of mirror d20 is Needs and submirror d12 of mirror d10 is Needs [14:06:18] SMF on turnera is CRITICAL: ERROR - offline: svc:/system/cluster/scsymon-srv:default [14:09:37] SMF on willow is CRITICAL: ERROR - maintenance: svc:/network/puppetmasterd:default [14:09:48] FMA on yarrow is CRITICAL: ERROR - unexpected output from snmpwalk [14:13:43] 3(assigned) [TS-1382] Change of ssh key for my account (snowolf) <10https://jira.toolserver.org/browse/TS-1382> (DaB.) [14:23:08] Sun Grid Engine execd on willow is WARNING: medium-sol@willow exceedes load threshold: alarm hl:np_load_short=0.845703/1.95, alarm hl:tmp_free=31149M/100M, alarm hl:np_load_avg=1.188965/2.0, alarm hl:mem_free=278.000000M/350M, alarm hl:available=1/0 [14:23:28] SMF on damiana is CRITICAL: ERROR - maintenance: svc:/network/ldap/client:default offline: svc:/system/cluster/scsymon-srv:default [14:24:17] Sun Grid Engine execd on willow is OK: testqueue@willow OK: medium-sol@willow OK: longrun-sol@willow OK [14:30:47] 3(commented) [MNT-1239] Clean-up old osm-titles on hemlock <10https://jira.toolserver.org/browse/MNT-1239> (DaB.) [14:30:48] 3(resolved) [MNT-1239] Clean-up old osm-titles on hemlock <10https://jira.toolserver.org/browse/MNT-1239> (DaB.) [14:33:18] Load avg. 
on willow is WARNING: WARNING - load average: 11.62, 15.53, 13.35 [14:33:19] Sun Grid Engine execd on ortelius is WARNING: short-sol@ortelius exceedes load threshold: alarm hl:np_load_short=1.133789/1.10, alarm hl:np_load_long=0.795898/1.55, alarm hl:mem_free=10515.000000M/500M, alarm hl:tmp_free=13944M/200M, alarm hl:available=1/0: medium-sol@ortelius exceedes load threshold: alarm hl:np_load_short=1.133789/1.00, alarm hl:np_load_long=0.795898/1.50, alarm hl:mem_free=10515.000000M/600M, alarm hl:tmp_free [14:34:18] Load avg. on willow is OK: OK - load average: 8.28, 13.91, 12.92 [14:34:18] Sun Grid Engine execd on ortelius is OK: short-sol@ortelius OK: medium-sol@ortelius OK [14:43:11] MySQL slave on z-dat-s6-a is CRITICAL: SLOW_SLAVE CRITICAL: Slave IO: Yes Slave SQL: Yes Seconds Behind Master: 69378 [14:43:21] Sun Grid Engine execd on ortelius is WARNING: short-sol@ortelius exceedes load threshold: alarm hl:np_load_short=1.154297/1.10, alarm hl:np_load_long=0.803711/1.55, alarm hl:mem_free=10229.000000M/500M, alarm hl:tmp_free=13891M/200M, alarm hl:available=1/0: medium-sol@ortelius exceedes load threshold: alarm hl:np_load_short=1.154297/1.00, alarm hl:np_load_long=0.803711/1.50, alarm hl:mem_free=10229.000000M/600M, alarm hl:tmp_free [14:51:21] /aux0 on hemlock is CRITICAL: DISK CRITICAL - free space: /aux0 310683 MB (5% inode=34%): [14:54:17] so, I finally switched my bot runs to cronsub [14:54:26] is that system going to stay? [14:54:40] or is SGE going away in the near future? [14:55:04] dschwen sge is the way to go [14:55:27] and why does one page on the wiki suggest cronsub while another one suggests qcronsub? [14:55:39] this confuses my tiny little brain [14:55:59] dschwen linky?
[14:57:19] https://wiki.toolserver.org/view/Batch_job_scheduling#Running_interwiki_bots_from_cron [14:57:31] https://wiki.toolserver.org/view/Cron [14:57:36] dschwen don't run interwiki bots [14:57:51] I don't [14:57:57] I never have [14:58:00] I never will [14:58:12] However SGE replaced cron [14:58:17] just look at the pages, not the links [14:58:32] dschwen: qcronsub is a re-written version of cronsub. [14:58:38] but both use SGE [14:59:07] is qcronsub a drop-in replacement for cronsub? [14:59:19] if you find cronsub somewhere in the wiki, speak with Merlissimo so he can replace it with the right syntax [14:59:19] Yeah, I noticed they call qsub [14:59:38] here: https://wiki.toolserver.org/view/Cron [14:59:42] (AFAIR qcronsub is backwards-compatible) [15:01:19] DaBPunkt: any eta on nightshade and yarrow getting back up as login servers? [15:01:45] Betacommand: yes, next week probably [15:01:56] thanks [15:02:14] RAID on adenia is CRITICAL: CHECK_NRPE: Socket timeout after 30 seconds. [15:03:21] Load avg. on willow is WARNING: WARNING - load average: 14.01, 15.75, 14.16 [15:03:30] Snowolf: did my email arrive? [15:03:57] Betacommand: no need to. I'm also tired of the willow-nagios-warnings ;) [15:04:20] DaBPunkt: I had to /ignore it [15:04:21] Load avg. on willow is OK: OK - load average: 11.25, 14.56, 13.83 [15:04:34] I can't do that ;/ [15:05:46] (created) [ACCAPP-511] Need an access to read the data bases of articles; Account Approval; New Account <https://jira.toolserver.org/browse/ACCAPP-511> (Andriy Rodchenko) [15:06:00] qcronsub -N gpsexifbot -l sql-s2-user-readonly=1 $HOME/dschwen_bot/gps_exif_bot2.py [15:06:07] o hai Betacommand :3 [15:06:08] so that would be a valid command [15:06:11] DiskSuite on turnera is CRITICAL: CRITICAL - submirror d42 of mirror d40 is Needs and submirror d32 of mirror d30 is Needs and submirror d22 of mirror d20 is Needs and submirror d12 of mirror d10 is Needs [15:06:20] is there any way I can help you?
[15:06:56] ToAruShiroiNeko: No, I was just checking to see if/when server resources wouldn't be as tight as they are [15:07:11] SMF on turnera is CRITICAL: ERROR - offline: svc:/system/cluster/scsymon-srv:default [15:07:13] aw [15:07:24] I can shake pon pons to motivate TS servers [15:09:51] SMF on willow is CRITICAL: ERROR - maintenance: svc:/network/puppetmasterd:default [15:10:00] FMA on yarrow is CRITICAL: ERROR - unexpected output from snmpwalk [15:10:22] dschwen: looks good. But you can just try it (nothing will break on the SGE if something is wrong) [15:10:41] no, did not work [15:10:53] my job got scheduled on clematis and threw an error [15:10:58] dschwen: error-message? [15:11:01] ImportError: No module named PHPUnserialize [15:11:28] so apparently not all python modules are available on all cluster nodes [15:11:32] that does not look like a SHE-error [15:11:34] SGE [15:11:41] RAID on adenia is OK: OK - TOTAL: 2: FAILED: 0: DEGRADED: 0 [15:11:48] is that not clear? [15:11:58] my job got scheduled, ran and failed [15:12:09] because a python module was missing on the host it ran on [15:12:15] yes, I understand :) [15:12:21] oh, ok, sorry [15:13:09] I did not see a -l option to request python modules as a resource [15:13:30] there is a parameter to specify servers where the program is allowed to run AFAIR. ask Merlissimo for that – but only as a temporary solution! [15:13:31] Sun Grid Engine execd on ortelius is WARNING: medium-sol@ortelius exceedes load threshold: alarm hl:np_load_short=1.047851/1.00, alarm hl:np_load_long=0.828125/1.50, alarm hl:mem_free=11517.000000M/600M, alarm hl:tmp_free=13926M/100M, alarm hl:available=1/0 [15:13:36] so I suggest you just make sure everything that is installed on willow is also installed on all hosts that may run SGE jobs [15:14:16] this is not very satisfactory. it adds quite a bit of additional burden and uncertainty [15:14:20] Load avg.
on willow is WARNING: WARNING - load average: 14.61, 15.99, 14.64 [15:14:31] Sun Grid Engine execd on ortelius is OK: short-sol@ortelius OK: medium-sol@ortelius OK [15:14:35] dschwen: the problem is that clematis and hawthorn are only working as a temporary solution for SGE – normally these are not userland-servers [15:14:39] to be on the safe side I will only specify willow as a server [15:14:50] but that defeats a lot of the purpose of SGE [15:15:04] and wolfsbane and ortelius [15:15:14] is there a problem just installing a few more packages on those two servers [15:15:37] installable software only takes a small amount of HD space [15:15:48] yes. Especially the deinstallation [15:15:56] apt-get remove [15:15:58] ;-) [15:16:11] purge . but these are not debian-boxes… [15:16:48] oh man, Solaris may be really nice, but I think this was not the best long-term solution for the ts cluster... [15:16:48] in a week or 2 there will be debian-boxes again and then we can do this [15:17:03] yeah, some debian boxes [15:17:19] you need a homogeneous infrastructure [15:17:50] dschwen: little by little, most boxes will move to debian [15:17:53] sorry, not trying to tell you how to do your job. I'm grateful to have what we have! [15:18:14] sounds good [15:18:34] even though it will mean a bunch of recompilations for me [15:19:45] (commented) [TS-1382] Change of ssh key for my account (snowolf) <https://jira.toolserver.org/browse/TS-1382> (DaB.) [15:21:11] Environment IPMI on adenia is CRITICAL: CHECK_NRPE: Socket timeout after 30 seconds.
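[editor's note] The failure discussed above (a job scheduled onto a node where a Python module is not installed) can at least be made easier to diagnose by checking for required modules at the top of the job script and reporting which host is affected. The following is a minimal sketch under assumptions, not something taken from the Toolserver setup: it targets a modern Python 3, the helper name `missing_modules` is made up here, and `REQUIRED` lists only the one module named in the error message above.

```python
import importlib.util
import socket
import sys


def missing_modules(names):
    """Return the subset of top-level module names that cannot be found on this host."""
    missing = []
    for name in names:
        try:
            found = importlib.util.find_spec(name) is not None
        except (ImportError, ValueError):
            found = False  # treat lookup errors as "not available"
        if not found:
            missing.append(name)
    return missing


# Hypothetical requirement list for the job; adjust to what the bot imports.
REQUIRED = ["PHPUnserialize"]

if __name__ == "__main__":
    missing = missing_modules(REQUIRED)
    if missing:
        # Name the execution host so the admins know which node lacks the module.
        sys.stderr.write(
            "%s is missing modules: %s\n" % (socket.gethostname(), ", ".join(missing))
        )
        sys.exit(1)
```

This does not solve the underlying problem of non-homogeneous nodes; it only turns a traceback into a one-line message that names the host.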
[15:21:47] 3(commented) [TS-1382] Change of ssh key for my account (snowolf) <10https://jira.toolserver.org/browse/TS-1382> (Snowolf ) [15:21:50] Environment IPMI on adenia is OK: ok: temperature ok fan ok voltage ok chassis ok [15:23:30] SMF on damiana is CRITICAL: ERROR - maintenance: svc:/network/ldap/client:default offline: svc:/system/cluster/scsymon-srv:default [15:26:11] Sun Grid Engine execd on willow is WARNING: medium-sol@willow exceedes load threshold: alarm hl:np_load_short=1.276855/1.95, alarm hl:tmp_free=31044M/100M, alarm hl:np_load_avg=1.448730/2.0, alarm hl:mem_free=258.000000M/350M, alarm hl:available=1/0 [15:28:11] Sun Grid Engine execd on willow is OK: testqueue@willow OK: medium-sol@willow OK: longrun-sol@willow OK [15:32:11] Sun Grid Engine execd on willow is WARNING: medium-sol@willow exceedes load threshold: alarm hl:np_load_short=2.114746/1.95, alarm hl:tmp_free=31032M/100M, alarm hl:np_load_avg=1.888672/2.0, alarm hl:mem_free=470.000000M/350M, alarm hl:available=1/0 [15:43:11] MySQL slave on z-dat-s6-a is CRITICAL: SLOW_SLAVE CRITICAL: Slave IO: Yes Slave SQL: Yes Seconds Behind Master: 70582 [15:51:31] /aux0 on hemlock is CRITICAL: DISK CRITICAL - free space: /aux0 311161 MB (5% inode=34%): [16:06:11] DiskSuite on turnera is CRITICAL: CRITICAL - submirror d42 of mirror d40 is Needs and submirror d32 of mirror d30 is Needs and submirror d22 of mirror d20 is Needs and submirror d12 of mirror d10 is Needs [16:07:12] Sun Grid Engine execd on willow is WARNING: medium-sol@willow exceedes load threshold: alarm hl:np_load_short=3.455078/1.95, alarm hl:tmp_free=30973M/100M, alarm hl:np_load_avg=2.097168/2.0, alarm hl:mem_free=277.000000M/350M, alarm hl:available=1/0: longrun-sol@willow exceedes load threshold: alarm hl:np_load_short=3.455078/2.3, alarm hl:np_load_long=1.777344/2.5, alarm hl:cpu=86.200000/98, alarm hl:mem_free=277.000000M/200M, al [16:07:13] SMF on turnera is CRITICAL: ERROR - offline: svc:/system/cluster/scsymon-srv:default 
[16:09:11] Sun Grid Engine execd on willow is OK: testqueue@willow OK: medium-sol@willow OK: longrun-sol@willow OK [16:10:00] SMF on willow is CRITICAL: ERROR - maintenance: svc:/network/puppetmasterd:default [16:10:10] FMA on yarrow is CRITICAL: ERROR - unexpected output from snmpwalk [16:23:31] SMF on damiana is CRITICAL: ERROR - maintenance: svc:/network/ldap/client:default offline: svc:/system/cluster/scsymon-srv:default [16:27:11] Sun Grid Engine execd on willow is WARNING: medium-sol@willow exceedes load threshold: alarm hl:np_load_short=1.403320/1.95, alarm hl:tmp_free=30941M/100M, alarm hl:np_load_avg=1.445801/2.0, alarm hl:mem_free=236.000000M/350M, alarm hl:available=1/0 [16:43:11] MySQL slave on z-dat-s6-a is CRITICAL: SLOW_SLAVE CRITICAL: Slave IO: Yes Slave SQL: Yes Seconds Behind Master: 71797 [16:51:21] /sql on z-dat-s4-a is WARNING: DISK WARNING - free space: /sql 42860 MB (10% inode=99%): [16:51:30] /aux0 on hemlock is CRITICAL: DISK CRITICAL - free space: /aux0 310945 MB (5% inode=34%): [17:00:20] /sql on z-dat-s4-a is OK: DISK OK - free space: /sql 75034 MB (18% inode=99%): [17:03:20] / on wolfsbane is WARNING: DISK WARNING - free space: / 6233 MB (20% inode=93%): [17:06:11] DiskSuite on turnera is CRITICAL: CRITICAL - submirror d42 of mirror d40 is Needs and submirror d32 of mirror d30 is Needs and submirror d22 of mirror d20 is Needs and submirror d12 of mirror d10 is Needs [17:07:20] SMF on turnera is CRITICAL: ERROR - offline: svc:/system/cluster/scsymon-srv:default [17:10:01] SMF on willow is CRITICAL: ERROR - maintenance: svc:/network/puppetmasterd:default [17:10:11] FMA on yarrow is CRITICAL: ERROR - unexpected output from snmpwalk [17:23:40] SMF on damiana is CRITICAL: ERROR - maintenance: svc:/network/ldap/client:default offline: svc:/system/cluster/scsymon-srv:default [17:38:21] Sun Grid Engine execd on willow is WARNING: medium-sol@willow exceedes load threshold: alarm hl:np_load_short=1.553711/1.95, alarm hl:tmp_free=30821M/100M, 
alarm hl:np_load_avg=1.629883/2.0, alarm hl:mem_free=191.000000M/350M, alarm hl:available=1/0: longrun-sol@willow exceedes load threshold: alarm hl:np_load_short=1.553711/2.3, alarm hl:np_load_long=1.551270/2.5, alarm hl:cpu=74.200000/98, alarm hl:mem_free=191.000000M/200M, al [17:39:20] Sun Grid Engine execd on willow is OK: testqueue@willow OK: medium-sol@willow OK: longrun-sol@willow OK [17:43:11] MySQL slave on z-dat-s6-a is CRITICAL: SLOW_SLAVE CRITICAL: Slave IO: Yes Slave SQL: Yes Seconds Behind Master: 72767 [17:47:11] RAID on adenia is CRITICAL: CHECK_NRPE: Socket timeout after 30 seconds. [17:51:30] /aux0 on hemlock is CRITICAL: DISK CRITICAL - free space: /aux0 310795 MB (5% inode=34%): [17:51:50] RAID on adenia is OK: OK - TOTAL: 2: FAILED: 0: DEGRADED: 0 [17:55:20] SSH on z-dat-s4-a is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:55:21] RAID on hyacinth is CRITICAL: CHECK_NRPE: Socket timeout after 30 seconds. [17:55:21] SMTP on z-dat-s6-a is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:55:21] SMTP on z-dat-s7-a is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:55:21] SSH on z-dat-s7-a is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:55:21] SSH on z-dat-s3-a is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:55:21] SMTP on z-dat-s3-a is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:55:30] SSH on z-dat-s6-a is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:55:40] SMF on z-dat-s4-a is CRITICAL: CHECK_NRPE: Socket timeout after 30 seconds. [17:55:40] SMF on z-dat-s3-a is CRITICAL: CHECK_NRPE: Socket timeout after 30 seconds. [17:55:50] / on z-dat-s3-a is CRITICAL: CHECK_NRPE: Socket timeout after 30 seconds. [17:55:51] /sql on z-dat-s3-a is CRITICAL: CHECK_NRPE: Socket timeout after 30 seconds. [17:55:51] / on z-dat-s4-a is CRITICAL: CHECK_NRPE: Socket timeout after 30 seconds. [17:55:51] /sql on z-dat-s4-a is CRITICAL: CHECK_NRPE: Socket timeout after 30 seconds. 
[17:56:01] / on z-dat-s7-a is CRITICAL: CHECK_NRPE: Socket timeout after 30 seconds. [17:56:11] SMF on z-dat-s3-a is OK: OK - all services online [17:56:11] SMF on z-dat-s4-a is OK: OK - all services online [17:56:21] / on z-dat-s3-a is OK: DISK OK - free space: / 8317 MB (27% inode=85%): [17:56:21] /sql on z-dat-s3-a is OK: DISK OK - free space: /sql 146283 MB (15% inode=99%): [17:56:21] / on z-dat-s4-a is OK: DISK OK - free space: / 8317 MB (27% inode=85%): [17:56:21] SMTP on z-dat-s7-a is OK: SMTP OK - 6.663 sec. response time [17:56:21] /sql on z-dat-s4-a is OK: DISK OK - free space: /sql 80002 MB (19% inode=99%): [17:56:31] / on z-dat-s7-a is OK: DISK OK - free space: / 8317 MB (27% inode=85%): [17:56:41] s4 replag on z-dat-s4-a is CRITICAL: (Service Check Timed Out) [17:56:51] s4 replag on z-dat-s4-a is OK: QUERY OK: SELECT ts_rc_age() returned 226.000000 [17:57:00] RAID on hyacinth is OK: OK - TOTAL: 2: FAILED: 0: DEGRADED: 0 [17:57:11] SSH on z-dat-s4-a is OK: SSH OK - OpenSSH_5.8p2-hpn13v11 (protocol 2.0) [17:57:11] SMTP on z-dat-s6-a is OK: SMTP OK - 0.009 sec. response time [17:57:11] SMTP on z-dat-s3-a is OK: SMTP OK - 0.002 sec. response time [17:57:11] SSH on z-dat-s7-a is OK: SSH OK - OpenSSH_5.8p2-hpn13v11 (protocol 2.0) [17:57:11] SSH on z-dat-s3-a is OK: SSH OK - OpenSSH_5.8p2-hpn13v11 (protocol 2.0) [17:57:20] SSH on z-dat-s6-a is OK: SSH OK - OpenSSH_5.8p2-hpn13v11 (protocol 2.0) [18:02:20] Load avg. 
on willow is WARNING: WARNING - load average: 20.71, 18.37, 14.09 [18:03:21] Sun Grid Engine execd on willow is WARNING: medium-sol@willow exceedes load threshold: alarm hl:np_load_short=1.754395/1.95, alarm hl:tmp_free=30736M/100M, alarm hl:np_load_avg=2.122559/2.0, alarm hl:mem_free=810.000000M/350M, alarm hl:available=1/0 [18:03:21] / on wolfsbane is WARNING: DISK WARNING - free space: / 6151 MB (20% inode=93%): [18:04:21] Sun Grid Engine execd on willow is OK: testqueue@willow OK: medium-sol@willow OK: longrun-sol@willow OK [18:06:10] DiskSuite on turnera is CRITICAL: CRITICAL - submirror d42 of mirror d40 is Needs and submirror d32 of mirror d30 is Needs and submirror d22 of mirror d20 is Needs and submirror d12 of mirror d10 is Needs [18:07:21] SMF on turnera is CRITICAL: ERROR - offline: svc:/system/cluster/scsymon-srv:default [18:09:21] Load avg. on willow is OK: OK - load average: 10.82, 14.14, 13.57 [18:10:10] SMF on willow is CRITICAL: ERROR - maintenance: svc:/network/puppetmasterd:default [18:10:11] FMA on yarrow is CRITICAL: ERROR - unexpected output from snmpwalk [18:15:41] Sun Grid Engine execd on ortelius is WARNING: short-sol@ortelius exceedes load threshold: alarm hl:np_load_short=1.500976/1.10, alarm hl:np_load_long=0.899414/1.55, alarm hl:mem_free=10707.000000M/500M, alarm hl:tmp_free=13555M/200M, alarm hl:available=1/0: medium-sol@ortelius exceedes load threshold: alarm hl:np_load_short=1.500976/1.00, alarm hl:np_load_long=0.899414/1.50, alarm hl:mem_free=10707.000000M/600M, alarm hl:tmp_free [18:23:41] SMF on damiana is CRITICAL: ERROR - maintenance: svc:/network/ldap/client:default offline: svc:/system/cluster/scsymon-srv:default [18:26:41] Sun Grid Engine execd on ortelius is OK: short-sol@ortelius OK: medium-sol@ortelius OK [18:29:41] Sun Grid Engine execd on ortelius is WARNING: medium-sol@ortelius exceedes load threshold: alarm hl:np_load_short=1.085938/1.00, alarm hl:np_load_long=1.087891/1.50, alarm hl:mem_free=11356.000000M/600M, 
alarm hl:tmp_free=13551M/100M, alarm hl:available=1/0 [18:43:21] MySQL slave on z-dat-s6-a is CRITICAL: SLOW_SLAVE CRITICAL: Slave IO: Yes Slave SQL: Yes Seconds Behind Master: 74239 [18:50:21] Load avg. on willow is WARNING: WARNING - load average: 14.79, 16.00, 14.38 [18:51:20] Sun Grid Engine execd on willow is WARNING: medium-sol@willow exceedes load threshold: alarm hl:np_load_short=1.331543/1.95, alarm hl:tmp_free=30425M/100M, alarm hl:np_load_avg=1.853027/2.0, alarm hl:mem_free=330.000000M/350M, alarm hl:available=1/0 [18:51:21] Load avg. on willow is OK: OK - load average: 11.18, 14.62, 13.99 [18:51:30] /aux0 on hemlock is CRITICAL: DISK CRITICAL - free space: /aux0 310663 MB (5% inode=34%): [18:52:21] Sun Grid Engine execd on willow is OK: testqueue@willow OK: medium-sol@willow OK: longrun-sol@willow OK [18:56:21] Load avg. on willow is WARNING: WARNING - load average: 15.55, 15.46, 14.38 [19:03:21] Sun Grid Engine execd on willow is WARNING: medium-sol@willow exceedes load threshold: alarm hl:np_load_short=1.751953/1.95, alarm hl:tmp_free=30356M/100M, alarm hl:np_load_avg=2.249512/2.0, alarm hl:mem_free=251.000000M/350M, alarm hl:available=1/0 [19:03:21] / on wolfsbane is WARNING: DISK WARNING - free space: / 6072 MB (20% inode=93%): [19:06:21] DiskSuite on turnera is CRITICAL: CRITICAL - submirror d42 of mirror d40 is Needs and submirror d32 of mirror d30 is Needs and submirror d22 of mirror d20 is Needs and submirror d12 of mirror d10 is Needs [19:07:33] SMF on turnera is CRITICAL: ERROR - offline: svc:/system/cluster/scsymon-srv:default [19:10:11] SSH on hyacinth is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:10:11] SMF on willow is CRITICAL: ERROR - maintenance: svc:/network/puppetmasterd:default [19:10:21] FMA on yarrow is CRITICAL: ERROR - unexpected output from snmpwalk [19:10:26] SMTP on z-dat-s7-a is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:10:31] SSH on z-dat-s4-a is CRITICAL: CRITICAL - Socket timeout after 10 
seconds [19:10:31] SSH on z-dat-s7-a is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:10:31] SSH on z-dat-s3-a is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:10:31] SSH on z-dat-s6-a is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:10:31] RAID on hyacinth is CRITICAL: CHECK_NRPE: Socket timeout after 30 seconds. [19:10:31] Environment IPMI on hyacinth is CRITICAL: CHECK_NRPE: Socket timeout after 30 seconds. [19:10:51] Load avg. on hyacinth is CRITICAL: CHECK_NRPE: Socket timeout after 30 seconds. [19:11:20] s4 replag on z-dat-s4-a is CRITICAL: (Service Check Timed Out) [19:11:20] MySQL on z-dat-s6-a is CRITICAL: (Service Check Timed Out) [19:11:20] MySQL on z-dat-s7-a is CRITICAL: (Service Check Timed Out) [19:11:30] MySQL slave on z-dat-s7-a is CRITICAL: (Service Check Timed Out) [19:11:31] MySQL on z-dat-s3-a is CRITICAL: (Service Check Timed Out) [19:11:31] MySQL slave on z-dat-s3-a is CRITICAL: (Service Check Timed Out) [19:11:31] Load avg. on hyacinth is OK: OK - load average: 0.15, 1.71, 2.59 [19:11:31] MySQL on z-dat-s6-a is OK: Uptime: 1637443 Threads: 10 Questions: 367817129 Slow queries: 105188 Opens: 2421452 Flush tables: 2 Open tables: 2370 Queries per second avg: 224.628 [19:11:31] SSH on z-dat-s6-a is OK: SSH OK - OpenSSH_5.8p2-hpn13v11 (protocol 2.0) [19:11:31] MySQL on z-dat-s7-a is OK: Uptime: 1637445 Threads: 12 Questions: 589431474 Slow queries: 49560 Opens: 4830796 Flush tables: 1 Open tables: 6516 Queries per second avg: 359.970 [19:11:32] s4 replag on z-dat-s4-a is OK: QUERY OK: SELECT ts_rc_age() returned 192.000000 [19:11:32] Environment IPMI on hyacinth is OK: ok: temperature ok fan ok voltage ok chassis ok [19:11:40] MySQL on z-dat-s3-a is OK: Uptime: 1637452 Threads: 24 Questions: 1762533286 Slow queries: 110038 Opens: 16251208 Flush tables: 1 Open tables: 16384 Queries per second avg: 1076.387 [19:11:41] MySQL slave on z-dat-s3-a is OK: Uptime: 1637452 Threads: 23 Questions: 1762533287 Slow queries: 
110038 Opens: 16251208 Flush tables: 1 Open tables: 16384 Queries per second avg: 1076.387 Slave IO: Yes Slave SQL: Yes Seconds Behind Master: 273 [19:11:41] MySQL slave on z-dat-s7-a is OK: Uptime: 1637452 Threads: 11 Questions: 589432826 Slow queries: 49563 Opens: 4830799 Flush tables: 1 Open tables: 6516 Queries per second avg: 359.969 Slave IO: Yes Slave SQL: Yes Seconds Behind Master: 228 [19:12:01] SSH on hyacinth is OK: SSH OK - OpenSSH_5.8p2-hpn13v11 (protocol 2.0) [19:12:17] RAID on hyacinth is OK: OK - TOTAL: 2: FAILED: 0: DEGRADED: 0 [19:12:17] SMTP on z-dat-s7-a is OK: SMTP OK - 0.003 sec. response time [19:12:21] SSH on z-dat-s4-a is OK: SSH OK - OpenSSH_5.8p2-hpn13v11 (protocol 2.0) [19:12:21] SSH on z-dat-s3-a is OK: SSH OK - OpenSSH_5.8p2-hpn13v11 (protocol 2.0) [19:12:21] SSH on z-dat-s7-a is OK: SSH OK - OpenSSH_5.8p2-hpn13v11 (protocol 2.0) [19:13:41] Free Memory on damiana is WARNING: WARNING - 6.9% (288408 kB) free! [19:14:40] Free Memory on damiana is OK: OK - 7.2% (300896 kB) free. [19:23:50] SMF on damiana is CRITICAL: ERROR - maintenance: svc:/network/ldap/client:default offline: svc:/system/cluster/scsymon-srv:default [19:27:42] 3(created) [MNT-1241] run logadm (logrotate) on wolfsbane; Maintenance; Emergency work <10https://jira.toolserver.org/browse/MNT-1241> (DaB.) [19:28:21] / on wolfsbane is OK: DISK OK - free space: / 13001 MB (43% inode=93%): [19:33:02] DaB. * [Toolserver-l] Quota (was: Fairness on the toolserver) [19:37:13] RAID on adenia is CRITICAL: CHECK_NRPE: Socket timeout after 30 seconds. 
[19:43:51] MySQL slave on z-dat-s6-a is CRITICAL: SLOW_SLAVE CRITICAL: Slave IO: Yes Slave SQL: Yes Seconds Behind Master: 75244 [19:44:40] Sun Grid Engine execd on ortelius is WARNING: short-sol@ortelius exceedes load threshold: alarm hl:np_load_short=1.287109/1.10, alarm hl:np_load_long=0.800781/1.55, alarm hl:mem_free=11190.000000M/500M, alarm hl:tmp_free=13412M/200M, alarm hl:available=1/0: medium-sol@ortelius exceedes load threshold: alarm hl:np_load_short=1.287109/1.00, alarm hl:np_load_long=0.800781/1.50, alarm hl:mem_free=11190.000000M/600M, alarm hl:tmp_free [19:45:41] Sun Grid Engine execd on ortelius is OK: short-sol@ortelius OK: medium-sol@ortelius OK [19:51:49] /aux0 on hemlock is CRITICAL: DISK CRITICAL - free space: /aux0 310478 MB (5% inode=34%): [19:56:41] RAID on adenia is OK: OK - TOTAL: 2: FAILED: 0: DEGRADED: 0 [19:57:20] Sun Grid Engine execd on willow is WARNING: medium-sol@willow exceedes load threshold: alarm hl:np_load_short=0.988769/1.95, alarm hl:tmp_free=30129M/100M, alarm hl:np_load_avg=1.343750/2.0, alarm hl:mem_free=275.000000M/350M, alarm hl:available=1/0 [19:58:21] Sun Grid Engine execd on willow is OK: testqueue@willow OK: medium-sol@willow OK: longrun-sol@willow OK [20:06:21] DiskSuite on turnera is CRITICAL: CRITICAL - submirror d42 of mirror d40 is Needs and submirror d32 of mirror d30 is Needs and submirror d22 of mirror d20 is Needs and submirror d12 of mirror d10 is Needs [20:07:21] SMF on turnera is CRITICAL: ERROR - offline: svc:/system/cluster/scsymon-srv:default [20:08:21] Load avg. on willow is WARNING: WARNING - load average: 18.93, 17.01, 14.17 [20:10:10] SMF on willow is CRITICAL: ERROR - maintenance: svc:/network/puppetmasterd:default [20:10:31] FMA on yarrow is CRITICAL: ERROR - unexpected output from snmpwalk [20:11:21] Load avg. on willow is OK: OK - load average: 13.56, 14.79, 13.75 [20:14:21] Load avg. 
on willow is WARNING: WARNING - load average: 15.56, 16.46, 14.68 [20:23:51] SMF on damiana is CRITICAL: ERROR - maintenance: svc:/network/ldap/client:default offline: svc:/system/cluster/scsymon-srv:default [20:32:21] Sun Grid Engine execd on willow is WARNING: medium-sol@willow exceedes load threshold: alarm hl:np_load_short=2.366699/1.95, alarm hl:tmp_free=29929M/100M, alarm hl:np_load_avg=2.267578/2.0, alarm hl:mem_free=665.000000M/350M, alarm hl:available=1/0: longrun-sol@willow exceedes load threshold: alarm hl:np_load_short=2.366699/2.3, alarm hl:np_load_long=1.935547/2.5, alarm hl:cpu=65.200000/98, alarm hl:mem_free=665.000000M/200M, al [20:34:21] Sun Grid Engine execd on willow is OK: testqueue@willow OK: medium-sol@willow OK: longrun-sol@willow OK [20:43:41] Sun Grid Engine execd on ortelius is WARNING: short-sol@ortelius exceedes load threshold: alarm hl:np_load_short=1.199219/1.10, alarm hl:np_load_long=0.769531/1.55, alarm hl:mem_free=11167.000000M/500M, alarm hl:tmp_free=13338M/200M, alarm hl:available=1/0: medium-sol@ortelius exceedes load threshold: alarm hl:np_load_short=1.199219/1.00, alarm hl:np_load_long=0.769531/1.50, alarm hl:mem_free=11167.000000M/600M, alarm hl:tmp_free [20:44:00] MySQL slave on z-dat-s6-a is CRITICAL: SLOW_SLAVE CRITICAL: Slave IO: Yes Slave SQL: Yes Seconds Behind Master: 77259 [20:44:41] Free Memory on damiana is WARNING: WARNING - 7.0% (293600 kB) free! [20:45:40] Free Memory on damiana is OK: OK - 7.4% (310128 kB) free. [20:45:41] Sun Grid Engine execd on ortelius is OK: short-sol@ortelius OK: medium-sol@ortelius OK [20:51:32] /aux0 on hemlock is CRITICAL: DISK CRITICAL - free space: /aux0 310310 MB (5% inode=34%): [20:55:47] 3(resolved) [TS-1382] Change of ssh key for my account (snowolf) <10https://jira.toolserver.org/browse/TS-1382> (DaB.) [20:57:38] DaBPunkt: thanks! :) [20:57:46] no problem [21:02:32] Load avg. 
on willow is WARNING: WARNING - load average: 18.79, 17.32, 14.60 [21:02:32] Sun Grid Engine execd on willow is WARNING: medium-sol@willow exceedes load threshold: alarm hl:np_load_short=2.411133/1.95, alarm hl:tmp_free=29788M/100M, alarm hl:np_load_avg=2.160156/2.0, alarm hl:mem_free=954.000000M/350M, alarm hl:available=1/0: longrun-sol@willow exceedes load threshold: alarm hl:np_load_short=2.411133/2.3, alarm hl:np_load_long=1.815918/2.5, alarm hl:cpu=84.500000/98, alarm hl:mem_free=954.000000M/200M, al [21:04:32] Sun Grid Engine execd on willow is OK: testqueue@willow OK: medium-sol@willow OK: longrun-sol@willow OK [21:05:31] Load avg. on willow is OK: OK - load average: 12.98, 14.72, 14.02 [21:06:31] DiskSuite on turnera is CRITICAL: CRITICAL - submirror d42 of mirror d40 is Needs and submirror d32 of mirror d30 is Needs and submirror d22 of mirror d20 is Needs and submirror d12 of mirror d10 is Needs [21:07:32] SMF on turnera is CRITICAL: ERROR - offline: svc:/system/cluster/scsymon-srv:default [21:08:31] Load avg. on willow is WARNING: WARNING - load average: 18.15, 17.88, 15.50 [21:08:31] Sun Grid Engine execd on willow is WARNING: medium-sol@willow exceedes load threshold: alarm hl:np_load_short=2.309570/1.95, alarm hl:tmp_free=29745M/100M, alarm hl:np_load_avg=2.235840/2.0, alarm hl:mem_free=812.000000M/350M, alarm hl:available=1/0: longrun-sol@willow exceedes load threshold: alarm hl:np_load_short=2.309570/2.3, alarm hl:np_load_long=1.930664/2.5, alarm hl:cpu=89.000000/98, alarm hl:mem_free=812.000000M/200M, al [21:10:31] SMF on willow is CRITICAL: ERROR - maintenance: svc:/network/puppetmasterd:default [21:10:41] FMA on yarrow is CRITICAL: ERROR - unexpected output from snmpwalk [21:14:40] Free Memory on damiana is WARNING: WARNING - 7.0% (295036 kB) free! [21:16:41] Free Memory on damiana is OK: OK - 7.1% (295592 kB) free. 
[21:24:01] SMF on damiana is CRITICAL: ERROR - maintenance: svc:/network/ldap/client:default offline: svc:/system/cluster/scsymon-srv:default [21:24:41] Free Memory on damiana is WARNING: WARNING - 6.4% (267132 kB) free! [21:38:55] /home will be away for a moment [21:40:45] done [21:43:06] [[Special:Log/newusers]] create 10 * Nikolaj'u * (New user account) [21:44:01] MySQL slave on z-dat-s6-a is CRITICAL: SLOW_SLAVE CRITICAL: Slave IO: Yes Slave SQL: Yes Seconds Behind Master: 78355 [21:51:42] /aux0 on hemlock is CRITICAL: DISK CRITICAL - free space: /aux0 307149 MB (5% inode=34%): [22:06:43] DiskSuite on turnera is CRITICAL: CRITICAL - submirror d42 of mirror d40 is Needs and submirror d32 of mirror d30 is Needs and submirror d22 of mirror d20 is Needs and submirror d12 of mirror d10 is Needs [22:07:42] SMF on turnera is CRITICAL: ERROR - offline: svc:/system/cluster/scsymon-srv:default [22:10:34] SMF on willow is CRITICAL: ERROR - maintenance: svc:/network/puppetmasterd:default [22:11:04] FMA on yarrow is CRITICAL: ERROR - unexpected output from snmpwalk [22:15:34] nacht ts [22:24:04] SMF on damiana is CRITICAL: ERROR - maintenance: svc:/network/ldap/client:default offline: svc:/system/cluster/scsymon-srv:default [22:26:35] Load avg. on willow is WARNING: WARNING - load average: 15.11, 15.14, 13.63 [22:27:34] Load avg. on willow is OK: OK - load average: 12.66, 14.38, 13.45 [22:32:34] Load avg. on willow is WARNING: WARNING - load average: 15.08, 16.69, 14.64 [22:44:05] MySQL slave on z-dat-s6-a is CRITICAL: SLOW_SLAVE CRITICAL: Slave IO: Yes Slave SQL: Yes Seconds Behind Master: 80354 [22:51:44] /aux0 on hemlock is CRITICAL: DISK CRITICAL - free space: /aux0 300325 MB (5% inode=33%): [22:56:45] 3(created) [TS-1383] Add snowolf to stewardbots MMT; Toolserver; Task <10https://jira.toolserver.org/browse/TS-1383> [23:02:45] Load avg. 
on willow is WARNING: WARNING - load average: 13.14, 15.70, 13.75 [23:02:45] Sun Grid Engine execd on willow is WARNING: medium-sol@willow exceedes load threshold: alarm hl:np_load_short=2.006836/1.95, alarm hl:tmp_free=29522M/100M, alarm hl:np_load_avg=2.048340/2.0, alarm hl:mem_free=720.000000M/350M, alarm hl:available=1/0 [23:03:44] Load avg. on willow is OK: OK - load average: 9.55, 14.23, 13.34 [23:03:44] Sun Grid Engine execd on willow is OK: testqueue@willow OK: medium-sol@willow OK: longrun-sol@willow OK [23:06:45] DiskSuite on turnera is CRITICAL: CRITICAL - submirror d42 of mirror d40 is Needs and submirror d32 of mirror d30 is Needs and submirror d22 of mirror d20 is Needs and submirror d12 of mirror d10 is Needs [23:07:53] SMF on turnera is CRITICAL: ERROR - offline: svc:/system/cluster/scsymon-srv:default [23:10:44] SMF on willow is CRITICAL: ERROR - maintenance: svc:/network/puppetmasterd:default [23:11:23] FMA on yarrow is CRITICAL: ERROR - unexpected output from snmpwalk [23:13:45] Sun Grid Engine execd on willow is WARNING: medium-sol@willow exceedes load threshold: alarm hl:np_load_short=1.993164/1.95, alarm hl:tmp_free=29502M/100M, alarm hl:np_load_avg=1.795899/2.0, alarm hl:mem_free=462.000000M/350M, alarm hl:available=1/0 [23:24:14] SMF on damiana is CRITICAL: ERROR - maintenance: svc:/network/ldap/client:default offline: svc:/system/cluster/scsymon-srv:default [23:32:45] Load avg. on willow is WARNING: WARNING - load average: 13.34, 15.15, 14.24 [23:33:44] Load avg. on willow is OK: OK - load average: 9.96, 13.88, 13.84 [23:44:15] MySQL slave on z-dat-s6-a is CRITICAL: SLOW_SLAVE CRITICAL: Slave IO: Yes Slave SQL: Yes Seconds Behind Master: 81990 [23:51:44] /aux0 on hemlock is CRITICAL: DISK CRITICAL - free space: /aux0 300238 MB (5% inode=33%): [23:55:42] why do my stored procedures disappear sometimes (mysql db) [23:56:12] Any idea?