[02:04:46] PROBLEM - check_zombie on civi1001 is CRITICAL: CRITICAL - Plugin timed out after 10 seconds
[02:06:36] PROBLEM - check_load on civi1001 is CRITICAL: CRITICAL - load average: 35.31, 26.45, 20.57
[02:14:06] RECOVERY - check_zombie on civi1001 is OK: PROCS OK: 0 processes with STATE = Z
[02:27:16] PROBLEM - check_load on civi1001 is CRITICAL: CRITICAL - load average: 22.36, 21.65, 20.35
[02:48:06] PROBLEM - check_puppetrun on civi1001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[02:51:56] PROBLEM - check_puppetrun on civi1001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[02:58:06] PROBLEM - check_puppetrun on civi1001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[02:58:39] cwd or Jeff_Green - looks like raid is taking lots of CPU on civi1001
[02:59:12] seeing md3_raid1 and md3_resync consistently at the top of 'top'
[02:59:59] took a long time to log in
[03:00:13] guessing that's what's hanging the jobs and causing all the failmail
[03:02:56] PROBLEM - check_puppetrun on civi1001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[03:07:16] PROBLEM - check_load on civi1001 is CRITICAL: CRITICAL - load average: 18.99, 21.58, 20.08
[03:07:56] PROBLEM - check_puppetrun on civi1001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[03:08:19] cwd does that look like disk problems on civi1001?
[03:08:29] guessing we should shut down most of the jobs
[03:12:06] !log disabled all fundraising scheduled jobs - something that looks like disk issues on civi1001
[03:12:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[03:12:16] PROBLEM - check_puppetrun on civi1001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[03:18:26] PROBLEM - check_puppetrun on civi1001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[03:18:46] PROBLEM - check_puppetrun on civi1001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[03:23:26] RECOVERY - check_puppetrun on civi1001 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[03:25:41] yeah, anything involving disk access seems really slow
[03:27:35] ejegg: darn that sounds bad
[03:28:16] well, not much we can do about it from afar!
[03:28:29] Hrrrghh
[03:28:53] Who's in the data center this weekend?
[03:29:10] Ooh, that would be a good thing to know
[03:29:49] Which one is it? eqiad?
[03:29:59] Yeah, I think so
[03:30:21] It's still civi1001 and not mintaka, so yeah
[03:30:52] https://wikimediafoundation.org/profile/papaul-tshibamba/
[03:31:43] hmm, disk issues usually have a whole nother set of alarms though, right? S.M.A.R.T. monitor or something?
[03:32:06] I have no idea
[03:32:17] Maybe there's some dashboard?
[03:32:31] Did you get Papaul's name from an on-duty list someplace?
[03:32:59] No, just looking through the staff page, it's the one I saw that said "data center"
[03:35:26] Just asking about alerts on -operations
[03:37:50] Not a great time to get a response on IRC
[03:41:51] nah, I guess I'll probably just check back in the morning
[03:41:57] yeah
[03:42:10] good night!
[03:42:16] See you tomorrow!
[03:42:23] Rest well
[03:42:30] likewise
[03:42:33] :)
[03:43:18] Yeah I'm beat... Just left the kids off with their mom and then did a supermarket run, so yeah also gonna turn in for the night
[03:48:16] RECOVERY - check_load on civi1001 is OK: OK - load average: 0.78, 1.24, 4.81
[08:06:33] hey just opened my email & tonnes of failmail - is it sorted?
[14:58:53] ejegg|away: cwd: talking to volans in -operations now about the disk stuff
[15:20:48] ejegg|away: cwd it was a monthly mdadm check thing
[15:20:55] sending details via e-mail now
[15:37:00] fr-tech I think we can turn jobs back on (see e-mail)
[20:04:05] AndyRussG: thanks!
[20:27:24] ejegg: yw, thank u also looking at it and figuring out the general area that was borking
[20:27:40] hope I didn't destroy anything by turning jobs back on
[20:27:46] back in a few!
[23:51:50] AndyRussG: thanks a lot for sorting that out
[23:51:55] i was out of town