[17:06:20] o/ schana
[17:06:26] hey halfak
[17:06:39] When do you think you'd like to take a look at these downtime issues?
[17:06:44] In ORES
[17:06:57] * halfak is excited to bounce ideas off of you
[17:07:16] I'm free(ish) now
[17:08:54] Cool!
[17:08:54] * halfak gathers notes.
[17:09:28] I think that https://phabricator.wikimedia.org/T123678 is a good place to start
[17:10:27] I'm looking for a *maybe* related issue that we haven't worked out yet.
[17:11:24] Ahh yes. This one https://phabricator.wikimedia.org/T127975
[17:11:30] I'm not convinced this one is actually done.
[17:11:45] Amir1, are you around?
[17:11:53] (might be between University and home)
[17:13:58] schana, one more note. I've experienced an event where one of our two "web" nodes will become excessively slow.
[17:14:04] Or will not respond at all.
[17:14:42] I'm not quite sure how to investigate because the logs don't tell me much.
[17:15:00] Note that both the "web" and "worker" nodes use redis.
[17:15:29] The "worker" nodes use redis as a task queue & results store. The "web" node uses redis as a cache for already generated scores.
[17:16:57] schana, we've done some work in the past to set up socket timeouts for redis connections, and that seemed to buy us better uptime for a long while.
[17:17:04] * halfak gets links to code.
[17:17:16] I'm around now
[17:17:33] See cache socket timeout: https://github.com/wiki-ai/ores-wikimedia-config/blob/master/config/00-main.yaml#L17
[17:17:46] * schana keeps reading
[17:17:47] And broker socket timeout: https://github.com/wiki-ai/ores-wikimedia-config/blob/master/config/00-main.yaml#L53
[17:20:13] I think that's all the thoughts.
[17:20:45] If I had a bunch of time to devote to this, I'd be setting up a test cluster, kicking the redis node, and running tests.
[17:22:32] Figured you might have a better idea for next steps.
[17:23:01] I'm trying to draw a picture of it now (I'm more of a visual person)
[17:24:40] what's running on the web and worker instances?
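[Editor's note: the two redis roles halfak describes — score cache for the "web" nodes, task queue and results store for the "worker" nodes — can be sketched with stdlib stand-ins. This is a hypothetical illustration, not ORES code: a dict stands in for the redis cache, a queue.Queue for the celery broker, and all names (`score_request`, `worker_step`, the placeholder score) are invented for the example.]

```python
# Stdlib sketch of the two redis roles described above (hypothetical names).
# A dict stands in for the redis score cache; a Queue stands in for the
# celery broker; another dict stands in for the result backend.
from queue import Queue

score_cache = {}      # "web" node: redis as a cache of already generated scores
task_queue = Queue()  # "worker" nodes: redis as the celery task queue
result_store = {}     # "worker" nodes: redis as the celery results store

def score_request(rev_id):
    """Web-node logic: return a cached score, else enqueue a scoring task."""
    if rev_id in score_cache:
        return score_cache[rev_id]
    task_queue.put(rev_id)  # hand off to a worker via the broker
    return None             # caller would poll the result store later

def worker_step():
    """Worker-node logic: pull one task, score it, store and cache the result."""
    rev_id = task_queue.get()
    result_store[rev_id] = {"damaging": 0.03}   # placeholder score
    score_cache[rev_id] = result_store[rev_id]  # warm the cache

score_request(1234)   # cache miss: a task is enqueued
worker_step()         # a worker scores rev 1234
score_request(1234)   # cache hit now
```

In the real deployment both roles talk to the same redis host, which is why a redis stall can take down web and worker nodes at once.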
and what went down?
[17:26:55] halfak ^
[17:26:56] "what's running on the web and worker instances?" ??
[17:27:00] web is uwsgi
[17:27:03] worker is celeryd?
[17:27:07] Is that what you mean?
[17:27:30] what are they doing?
[17:27:44] or what task are they performing?
[17:28:02] Hmm. Have I ever given you an overview of the system before?
[17:28:11] not that I recall
[17:28:21] OK. I've given it so many times... I just misremembered.
[17:28:28] OK. Let me get a diagram.
[17:28:32] yay!
[17:28:42] * schana loves diagrams
[17:28:43] https://phabricator.wikimedia.org/T110072#1725724
[17:30:29] so in the second drawing, are the 'app servers' the ores-web instances and the 'celery workers' ores-worker?
[17:30:44] yup. That's right.
[17:30:58] * halfak works on a nicer diagram
[17:32:50] a sequence diagram may also be helpful https://en.wikipedia.org/wiki/Sequence_diagram
[17:35:44] schana, :) you're the first person who has asked for it! will do
[17:36:22] specifically, a sequence diagram would be helpful in describing the relationship between ores-web, ores-worker and ores-redis
[17:38:19] schana, roger
[17:38:22] halfak: so is the downtime specifically that the ores-worker instances are not accepting any more tasks from redis?
[17:38:30] schana, good question
[17:38:35] These are hypotheses of mine
[17:38:45] ah, so this needs investigating?
[17:39:19] :P
[17:39:30] * halfak looks at task title "Investigating downtime (2016-01-14)"
[17:39:49] :D I'm just trying to grasp exactly what needs investigating
[17:40:23] Gotcha.
[17:40:34] Intermittent downtime with no error in the logs.
[17:41:00] is the log message in the description from the redis host?
[17:41:28] ah, nvm, it looks like it's from ores-worker-01
[17:41:35] Yuvipanda didn't say.
[17:41:47] Oh yeah.
[17:41:49] I see it too :)
[17:42:26] one more: do we have any insight into the health of the redis host during the downtimes?
[17:42:59] schana, nothing in my notes.
[17:44:11] are the logs in the standard places on the machines?
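[Editor's note: schana's hypothesis — workers silently stop pulling tasks from redis — is exactly the failure mode that the socket timeouts linked earlier make visible. A hedged stdlib sketch of the idea, with queue.Queue standing in for the redis broker and all names invented for illustration: a timed `get` turns an indefinite hang into an exception that can be logged.]

```python
# Stdlib sketch (hypothetical names): a worker loop that pulls tasks with a
# timeout, so a dead or stalled broker surfaces as an exception rather than
# a silent, indefinite hang. queue.Queue stands in for redis here.
from queue import Queue, Empty

broker = Queue()

def pull_task(timeout=3.0):
    """Return the next task, or raise Empty if the broker stalls past `timeout`."""
    try:
        return broker.get(timeout=timeout)
    except Empty:
        # In the real setup, this is roughly what the redis socket timeout
        # in 00-main.yaml buys: a loggable error instead of a hung worker.
        raise

broker.put("score:enwiki:1234")
task = pull_task(timeout=0.1)
```

Without a timeout, a worker blocked on a dead redis socket produces no log output at all, which would be consistent with "intermittent downtime with no error in the logs."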
[17:44:37] schana, not sure what the standard is, but I've been instructed to use 'journalctl' to access them.
[17:45:18] * halfak gets example call
[17:46:15] on a web node: $ sudo journalctl -u uwsgi-ores-web | less
[17:46:30] That will load the logs into less on a web node.
[17:47:01] helpful, thanks
[17:47:30] On a worker node: $ sudo journalctl -u celery-ores-worker | less
[17:49:04] on github, is wiki-ai/ores what's on ores-web? and wiki-ai/revscoring what's on ores-worker?
[17:49:31] schana, negative. Both are running ores.
[17:50:02] Celery works in such a way that you configure your client the same way as you configure your workers.
[17:50:29] See https://github.com/wiki-ai/ores-wikimedia-config/blob/master/ores_wsgi.py
[17:50:35] And https://github.com/wiki-ai/ores-wikimedia-config/blob/master/ores_celery.py
[17:51:35] Both of those files leave a variable called "application" in the main namespace. This is a convention used by uwsgi and celeryd.
[17:51:52] I see, so the 'revscoring' repo is used for training models then?
[17:52:10] and not really applicable?
[17:52:16] (to this)
[17:52:44] Yeah. It's really unlikely that the issues we have will be due to a bug in revscoring itself.
[17:53:05] BTW: http://pythonhosted.org/revscoring/
[17:53:55] okay, I think I have enough info to get started - I'm sure I'll be asking more questions as they come up
[17:54:59] Sounds good. :)
[17:55:07] I'll have some more diagrams for you shortly.
[17:55:12] :)
[18:06:36] * halfak wonders how to depict async elements in a sequence diagram
[18:16:00] I think I did OK
[18:19:36] schana, here come some diagrams.
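[Editor's note: the "application" convention halfak mentions is easy to show with a minimal WSGI module. This is a generic sketch following the WSGI standard (PEP 3333), not the contents of ores_wsgi.py; the response body and module layout are invented for the example.]

```python
# Minimal sketch of the convention: uwsgi imports a module and looks for a
# top-level name "application" (celeryd does the analogous thing with a
# Celery app object). Generic WSGI example; not the actual ores_wsgi.py.
def application(environ, start_response):
    """Smallest possible WSGI app: respond 200 OK with a fixed JSON body."""
    body = b'{"status": "ok"}'
    start_response("200 OK", [("Content-Type", "application/json"),
                              ("Content-Length", str(len(body)))])
    return [body]

# uwsgi would be pointed at this module, e.g.:
#   uwsgi --http :8080 --module mymodule
```

Because celery clients and workers share one configuration, ores-web and ores-worker can run the same `ores` codebase and differ only in which entry-point module (`ores_wsgi.py` vs `ores_celery.py`) the service loads.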
[18:19:39] ORES cluster: https://upload.wikimedia.org/wikipedia/commons/0/07/ORES_cluster.svg
[18:19:50] Sequence for when the score is cached: https://commons.wikimedia.org/wiki/File:ORES.request_sequence_diagram_(cached).svg
[18:20:06] Sequence for when the score is not cached: https://commons.wikimedia.org/wiki/File:ORES.request_sequence_diagram_(not_cached).svg
[18:23:04] halfak: those are super helpful, thanks!
[18:23:21] No problem. Good reason to make them. Now to figure out where in our documentation they belong.
[18:23:25] Probably on wikitech.
[18:31:11] schana, for future reference, see https://wikitech.wikimedia.org/wiki/Nova_Resource:Ores/Documentation#Cluster_layout
[18:31:27] All the images I just linked to are there :)
[18:31:33] * halfak runs away for lunch
[18:31:34] o/
[19:34:32] halfak, DarTar: I was thinking of submitting an Inspire campaign proposal to beef up the efforts around Librarybase
[19:34:56] harej, interesting. How would you fit it into the review/curation theme?
[19:36:13] Will have to read the criteria more closely, but it is a significant challenge to curate the source metadata, and doing so will open up new opportunities, including improved source reviewing and source recommendations
[19:37:18] harej, I might also pitch the project as an exploration in Wikidata curation strategies.
[19:37:40] -- starting up a parallel wikibase and negotiating content loads separate from wikidata's database.
[19:37:58] As in, are you going to separately propose this, or are you recommending that I include it?
[19:38:04] If this works well, it may be a good new pattern for loading data into wikidata.
[19:38:13] harej, feel free to include it.
[19:40:16] I'm not sure how necessary it is as a general Wikidata strategy though. Most Wikidata imports aren't at this scale.
[19:40:43] I'm trying to take on the Census Bureau as a client, and even for them I would just recommend direct input into Wikidata.
[19:41:20] harej, fair enough.
I'd like to start documenting datasets with wikibase though.
[19:41:23] I'm taking this approach for source metadata because it's a uniquely complex problem and because I have a need for a citation database that may or may not be consistent with the vision for Wikidata
[19:41:28] I don't think that starting in wikidata is a good way to go.
[19:41:50] For your use case, perhaps.
[19:42:41] Yeah. And likely for others too.
[19:42:42] Not all.
[20:33:35] hey harej – just back from lunch. I’d love to read a proposal if you submit something. We’re making good progress with the organization of this thing in May, btw. I should be able to send out updates shortly
[20:34:24] agree WD is a fascinating testbed for curation/data imports