[11:33:32] Hi, today ores.wmflabs.org queries return 503 with the error code "server overloaded". Is it a known problem?
[13:04:08] Hey rotpunkt
[13:04:10] It is not.
[13:04:12] looking into it
[13:07:37] For some reason, the workers are all down.
[13:07:46] This might be due to the NFS issues in labs last night
[13:08:09] Bringing them up now
[13:11:00] OK. It looks like we have recovered.
[13:11:13] And... we're offline again.
[13:15:48] o/ Amir1
[13:16:09] halfak: hey :)
[13:16:12] sup?
[13:16:18] We're down :(
[13:16:27] Trying to figure out why
[13:16:50] Seriously considering re-routing to ores-staging
[13:17:13] halfak, ok thanks
[13:17:24] thanks for reporting rotpunkt
[13:17:34] We'll need to fix our monitoring so that I can catch outages like this one.
[13:19:53] I have some good news: since the UN closed the Iran nuclear program case, all of the bans related to the nuclear program will be lifted. This includes the ban on SWIFT, which means I can easily work for WMDE and get money from them directly
[13:19:53] we were
[13:19:54] I wanted to fix it but I didn't know what to do, now I checked and we're up
[13:20:22] Amir1, I'm restarting the workers after restarting redis.
[13:20:29] for example, http://ores.wmflabs.org/scores/itwiki/?models=reverted&revids=1 returns a 500 Internal Server Error now
[13:20:29] They don't seem to stay online for long.
[13:20:32] halfak: ^
[13:20:51] I get a 500
[13:21:21] for enwiki, instead, it's ok: http://ores.wmflabs.org/scores/enwiki/?models=reverted&revids=1
[13:22:17] rotpunkt, the enwiki one is failing for me
[13:22:31] Oh! rotpunkt that one is cached.
[13:22:37] Try any other rev_id
[13:22:43] http://ores.wmflabs.org/scores/enwiki/?models=reverted&revids=2
[13:23:25] I'm restarting the web nodes
[13:24:03] halfak: right, http://ores.wmflabs.org/scores/enwiki/?models=reverted&revids=2 returns 500
[13:24:19] Arg.
[13:25:29] Yup. It looks like we are fully derp'd
[13:25:47] I'm going to flip the proxy to the staging server and call Yuvi
[13:26:16] :(
[13:26:24] he will wake up soon I hope
[13:27:21] ok, it will be even better :)
[13:29:18] OK. We're "up", but we're on our lower-capacity staging server.
[13:29:23] Yikes.
[13:32:10] halfak: tools were down because of NFS hardware issues; it may be related
[13:32:14] I can't say for sure
[13:35:13] Yeah. Not sure. But it seems possible.
[13:35:43] All workers are down
[13:35:46] Damn.
[13:35:51] Not just offline, but fully off.
[13:36:27] I wonder if I should reboot the VMs.
[13:36:35] That seems like a fine thing to do.
[13:36:58] lol a worker just came back online by itself!
[13:37:58] halfak: do you want to share this situation with #wikimedia-labs people?
[13:39:03] * halfak imagines they are all still asleep
[13:43:19] * halfak watches workers come back online and considers aiming the proxy back at the prod cluster
[13:47:53] And the workers are suddenly offline again
[13:47:58] Switching the proxy back to staging
[13:49:20] wtf
[13:49:50] Seriously
[13:50:00] We're maxing out CPU on the staging box.
[13:50:03] Lots of requests.
[13:50:10] DDos?
[13:50:14] *DDoS?
[13:50:23] or crazy bots?
[13:50:37] probably good requests
[13:51:15] I just called Yuvi to wake him up
[13:51:22] :D
[13:51:27] * halfak feels awful waking him
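The exchange above hinges on the score cache: revids=1 had already been scored and cached, so it kept returning a result while the workers were down. A minimal sketch of that kind of manual probe, assuming the Python requests library; the rev_ids are arbitrary examples and the URL shape is taken from the links pasted above:

    import requests

    BASE = "http://ores.wmflabs.org/scores/enwiki/"

    # revids=1 had already been scored, so it can be answered from the cache;
    # revids=2 (an arbitrary, uncached example) forces the workers to do real work.
    for rev_id in (1, 2):
        resp = requests.get(BASE, params={"models": "reverted", "revids": rev_id},
                            timeout=30)
        print(rev_id, resp.status_code)  # a 500 here means scoring itself is failing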
[13:53:04] btw, I spent some time today making this tool a little bit better: http://tools.wmflabs.org/wd-analyst/index.php?p=P31&limit=24
[13:53:37] Can I be an admin in this project? https://wikitech.wikimedia.org/wiki/Nova_Resource:Revscoring
[13:53:57] I wanted to set up the proxy or add Adam but I couldn't, and everyone was asleep
[13:56:17] Amir1, will set up
[13:56:26] Thanks
[13:58:26] hey
[13:58:42] halfak: gimme the lowdown? what's up?
[13:58:47] workers dead?
[13:59:07] Workers keep going down and coming up on their own.
[13:59:16] I can restart and bring them up for short periods.
[13:59:23] ok
[13:59:28] Tried restarting web, worker and redis services.
[13:59:29] * YuviPanda suspects redis immediately
[13:59:33] anything on logs?
[13:59:35] Currently traffic is routed to staging.
[13:59:43] I didn't see anything obvious.
[13:59:59] Oh wait... no I did see some redis crap in the worker logs.
[14:00:03] Should have copy-pasted.
[14:01:13] ok, redis is up but 'monitor' shows no activity outside of a discover from somewhere
[14:01:14] redis "no space left on device"?
[14:01:30] redis.exceptions.ResponseError: MISCONF Errors writing to the AOF file: No space left on device
[14:01:53] Could it be that our lru policy is set wrong on the redis server?
[14:02:01] I restarted -01
[14:02:04] lololol
[14:02:40] I think the memory is writing checks the disk can't cache
[14:03:08] is it back now?
[14:03:31] can you tell me what the cache keys look like?
[14:03:33] Looks like all but one worker is still offline
[14:03:40] yes I started just one back up
[14:03:51] yeah and that one is dead again too
[14:03:53] ores:::
[14:04:51] ok
[14:06:13] lol. I just lost the connection to redis from the redis server itself.
[14:07:20] yes
[14:07:22] I restarted it
[14:07:42] It looks like our eviction policy is right.
[14:08:10] we just ran out of disk space, I think. I'm trying to bring redis back with disk persistence disabled
[14:08:36] YuviPanda, we'll want disk persistence back eventually, but +1 for now
[14:08:38] halfak: in the meantime, can you bring up an ores-redis-03? 'medium' instance, create it first, then apply the 'srv' role
[14:08:44] and *then* apply the redis role
[14:08:50] Gotcha.
[14:09:24] security group is "default"
[14:09:29] no "web", right?
[14:10:00] "failed to create instance"
[14:10:04] Out of CPUs?
[14:10:16] YuviPanda, ^
[14:10:25] oh really? I thought I gave us enough
[14:11:01] the message is non-specific
[14:11:05] could be something else?
[14:11:20] Our workers take up 16 CPUs by themselves
[14:11:23] hmm, can you go to 'manage project' and look at your quota?
[14:11:34] kk
[14:11:50] instances 10/10
[14:34:00] OK. We are back online and at full capacity
[14:57:40] \o/
[16:57:23] Here's the report: https://wikitech.wikimedia.org/wiki/Incident_documentation/20151216-ores
[16:57:28] Amir1, ^
[16:57:33] rotpunkt, ^
[17:07:13] See also https://meta.wikimedia.org/wiki/Talk:Objective_Revision_Evaluation_Service#3_hours_of_downtime_today
[17:08:27] great work, it's experience for when ORES will serve thousands of applications :)
[17:27:45] rotpunkt, +1
[17:27:54] Good to work out the potential issues with scaling now :)
[20:38:01] WOOO! Tests passing for diffs!
[20:38:23] * halfak is starting to get close to finishing his rewrite of the features structure.
[20:45:13] It's always a good sign when fixing up the tests gets faster with each minor refactoring.
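The MISCONF error quoted above means Redis could not write its append-only file because the disk was full. A minimal redis-py sketch of the checks discussed (eviction policy, persistence status, memory use) and of the temporary mitigation of disabling AOF persistence; the host and port are placeholders, not the actual ores-redis addresses:

    import redis

    # Placeholder connection details; the real ores-redis host/port may differ.
    r = redis.StrictRedis(host="ores-redis-01", port=6379)

    print(r.config_get("maxmemory-policy"))                     # eviction policy
    print(r.config_get("appendonly"))                           # is AOF persistence on?
    print(r.info("persistence").get("aof_last_write_status"))   # "err" when AOF writes fail
    print(r.info("memory").get("used_memory_human"))            # memory currently held by redis

    # Temporary mitigation from the log: turn AOF off so writes stop failing,
    # accepting the loss of disk persistence until the disk-space problem is fixed.
    r.config_set("appendonly", "no")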
[20:45:18] Suggests modularity
[20:52:15] halfak: one idea of mine is to dig through Wikidata revisions to count how many statements/values someone adds to a Wikidata item, and I think your tools would be useful for that
[20:59:15] halfak: just waking up, but more sleep thoughts:
[20:59:30] 1. we should separate the celery redis from the caching redis. the former is far more critical
[20:59:54] 2. we should move the config for both of these redises into the deployment setup rather than just the 'ores-redis' hack
[20:59:58] YuviPanda, +1
[21:00:00] Makes sense
[21:00:18] I'm going to go to the office today since it at least has heating
[21:00:29] once I can convince my body to get out from being enveloped by this blanket...
[21:00:34] YuviPanda, we can specify two different redises in the ORES config.
[21:00:45] halfak: awesome, so (2) is already there?
[21:00:48] How will we do staging without the ores-redis hack?
[21:00:58] we'll just need to push different config to staging
[21:01:05] need to figure that out
[21:01:10] Doesn't that defeat the purpose of staging?
[21:01:14] no?
[21:01:19] it's just getting different addresses
[21:01:22] Since it would be a different configuration?
[21:01:35] What if we forget to update one config or the other?
[21:01:45] oh yeah, so that's the 'figure that out' part :)
[21:01:52] Gotcha :)
[21:02:02] Maybe we can just do some simple merging like you wanted
[21:02:02] for grrrit-wm for example, there's a separate 'connections' config which is different from the bot config
[21:02:10] and then they get merged (sort of)
[21:02:16] and connections is varied but config is not
[21:02:55] Gotcha.
[21:03:04] * halfak files a feature request at yamlconf
[21:03:36] ok
[21:03:55] halfak: in prod we'll also need to use passwords for our redises
[21:04:02] but we can cross that bridge once we get there
[21:04:21] halfak: also, can you pass a port number to the redis config already? if not you should add that too :D
[21:04:38] halfak: I'm going to configure two redises - a celery one with no persistence and a caching one with rdb
[21:05:28] https://github.com/halfak/yamlconf/issues/3
[21:05:48] +1 for passwords
[21:05:56] We can pass ports
[21:06:04] ok
[21:31:17] halfak: managed to get out of bed. do you think we can get the config stuff done today so we can switch to a nicer redis setup?
[21:31:54] YuviPanda, could make the switch now, yeah.
[21:32:00] I'm a little worried about testing.
[21:32:09] Also, do you want to get rid of the ores-hack?
[21:32:15] *ores-redis hack
[21:32:24] so there's two ways to do it
[21:32:27] one keeps the hack
[21:32:29] That'll require a bit more work, but probably not too much.
[21:32:32] and in the future we do the config
[21:32:40] other does the right thing now
[21:32:42] I'm up for either
[21:32:50] depending on when you wanna do the config stuff
[21:33:04] How do you want to test it to make sure we don't take down ORES?
[21:33:08] New cluster and switch over?
[21:35:28] take one webserver out of rotation, switch that, do 2 workers, put the webserver back in rotation, take the other out of rotation, switch the rest, put everything back
[21:36:27] Maybe we can spin up an additional web server and worker nodes so that we don't lower capacity while we work?
[21:37:01] I don't think we're operating at >50% capacity, so just switching shouldn't cause problems
[21:37:36] Good point
[21:37:56] this is also why you should always aim to be under 50% utilization
[21:38:01] anyway
[21:38:04] I have to conquer this cold
[21:38:10] I'll brb when I'm warmer
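A minimal sketch of the two-redis split being proposed above: one Redis as the Celery broker (losing it only drops in-flight scoring tasks, so it needs no disk persistence) and a second, RDB-persisted Redis for the score cache. The hostnames, ports, and names below are illustrative, not ORES's actual configuration; per the discussion, both addresses (plus passwords in production) would come from a deployment-specific config rather than the 'ores-redis' hack:

    from celery import Celery
    import redis

    # Illustrative addresses only; real values would be injected at deploy time.
    CELERY_BROKER = "redis://ores-redis-celery:6379/0"   # broker: no disk persistence
    CACHE_HOST, CACHE_PORT = "ores-redis-cache", 6380    # score cache: RDB snapshots on

    app = Celery("ores", broker=CELERY_BROKER, backend=CELERY_BROKER)
    score_cache = redis.StrictRedis(host=CACHE_HOST, port=CACHE_PORT)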
[21:38:17] OK. I'll look at the config issues.
[21:38:22] ok
[21:38:49] halfak: I'm thinking of doing the switchover in about 2.5h (with or without the tools-redis hack, based on what you feel comfortable with)
[21:39:17] I need to be afk this evening.
[21:39:26] can do this tomorrow
[21:39:30] Jenny just passed her oral prelim (first major test of grad school)
[21:39:32] our hack will survive
[21:39:37] wooo! congrats
[21:39:39] I'll get prepared :)
[21:39:46] :)
[21:40:06] * YuviPanda is also completely bleary-eyed and a bit zombie and still dealing with fallout from the other two outages
[21:40:20] halfak: ok, so tomorrow then? about the same time as now?
[21:40:55] Yup
[21:40:57] Works for me
[21:41:20] halfak: can you make a calendar event so I don't forget and you don't get meeting-sniped?
[21:41:38] Already on it.
[21:41:44] halfak: also did you get alerted by page or by people?
[21:41:49] by people
[21:41:53] hmm
[21:41:56] I set up a task for adding monitoring to the workers
[21:41:58] so paging needs to be fixed too
[21:42:02] ORES web was up and happy
[21:42:05] right
[21:42:15] so we can't really easily do that in labs because everything sucks
[21:42:23] so we need a URL that'll hit the workers every time
[21:42:25] and not hit cache
[21:42:45] in mediawiki you do this with ?debug=true
[21:42:50] not sure if that's the best way to do it
[21:42:53] but food for thought
[21:42:58] +1
[21:43:01] Can do this.
[21:43:08] I still haven't managed to get out of the blanket.
[21:43:18] We have a dummy model now
[21:43:21] It'll work perfectly
[21:43:21] no willpower to deal with the 'cold'
[21:43:24] right
[21:43:51] /scores/testwiki/reverted/
[21:43:57] oooh
[21:44:00] yes
[21:44:02] that'll work
[21:44:02] :)
[21:44:13] I'll have to write a custom check for it, but that's doable
[21:44:20] put that in the task and cc me and daniel zahn?
[21:45:47] Will do.
[21:45:54] thanks
[21:48:18] halfak: before I go, I wanted to remind you not to re-enable puppet on ores-redis-01. just fyi
[21:48:35] Sure. I won't be running puppet there
[21:48:44] Is it turned of on a wikitech page?
[21:48:47] *off
[21:49:07] no
[21:49:10] sudo puppet agent --disable
[21:49:25] Gotcha
[22:23:06] * halfak sees email about ORES machines coming in :)
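A minimal sketch of the custom check discussed above, assuming the dummy testwiki/reverted model accepts arbitrary rev_ids. A random rev_id keeps the request from being answered out of the score cache, so the check actually exercises a worker rather than just the web layer. The URL and query parameters are inferred from the links in this log, and the exit codes follow the usual 0 = OK / 2 = CRITICAL monitoring convention:

    #!/usr/bin/env python
    import random
    import sys

    import requests

    URL = "http://ores.wmflabs.org/scores/testwiki/reverted/"

    def main():
        # A random rev_id is almost certainly not in the score cache, so the
        # request has to be handed to a celery worker to be scored.
        rev_id = random.randint(1, 10 ** 9)
        try:
            resp = requests.get(URL, params={"revids": rev_id}, timeout=30)
        except requests.RequestException as error:
            print("CRITICAL: could not reach ORES: %s" % error)
            return 2
        if resp.status_code != 200:
            print("CRITICAL: workers returned HTTP %d" % resp.status_code)
            return 2
        print("OK: workers scored rev_id %d" % rev_id)
        return 0

    if __name__ == "__main__":
        sys.exit(main())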