[03:44:04] ORES is acting up?
[03:44:18] Maybe. Checking on it. Looks like there are a lot of timeouts happening.
[03:44:21] o/ yuvipanda
[03:44:33] Know if anything weird is going on with labs?
[03:44:54] uh
[03:44:54] Usually when I see 2 minute downtime alerts, it's because of labs having a hiccup.
[03:44:56] no
[03:45:01] But, this time, it's too often.
[03:45:02] I've mostly managed to be unaware of anything
[03:45:06] kk
[03:45:08] in the last few weeks
[03:45:10] * halfak digs into workers
[03:45:18] am going to head off again now too, unfortunately :(
[03:45:19] sorry!
[03:45:35] * yuvipanda is back working from tuesday - friday again
[03:45:35] No worries.
[03:45:43] I got this.
[03:45:46] :)
[03:45:59] Thanks for responding. Enjoy vacation :D
[03:47:18] Hmm... Looks like we are getting ~twice as many requests per minute since last night.
[03:48:24] It doesn't look like we're overloaded.
[03:50:06] Yeah. Queue is good.
[03:50:14] Even cached requests are responding slowly.
[03:51:30] Load balancer isn't overloaded.
[03:51:46] Web-01 looks OK
[03:52:40] We've got all of our uwsgi processes running. Neither of the webnodes are overloaded
[03:52:44] Weird!
[03:56:48] OK! Now I'm getting somewhere.
[03:57:01] Looks like web-01 responds really slow, but web-02 responds fast
[03:57:25] * halfak checks the log
[03:57:40] Nothing fancy
[04:00:01] Ok. Restarting service.
[04:00:10] Already told morebots
[04:03:42] Woah. Now -02 is slow
[06:17:44] Hmm... Still not back to normal.
[06:18:06] Checking out worker-04, it seems that only one worker process was running (at 100%)
[06:22:31] Restarted workers on -04 and -02
[06:25:12] * halfak watches precached for a bit
[10:27:24] yuvipanda: hey, around?
[16:21:16] Arg. Looks like we're intermittently slow again
[16:22:02] Weird. Responses are *fast*
[16:23:15] Looks like the celery workers are down on worker-02
[16:23:25] We get into this state where there is one worker using 100% CPU
[16:28:27] Looks like ores-worker-03 flatlined at 24% CPU 24 hours ago
[16:28:34] I must have missed it in my restarts last night.
[16:29:51] Yeah. worker-04 looks to have been the same way.
[16:30:02] Weird since the top of -04 looked normal.
[16:31:09] Oh wait. I misread. -04 looks like it went back to normal with last night's reboots.
[16:31:30] precached suggests that we've solved something.
[16:32:04] OK. Back to day-off.
[16:32:05] o/