[17:06:20] <halfak>	 o/ schana
[17:06:26] <schana>	 hey halfak
[17:06:39] <halfak>	 When do you think you'd like to take a look at these downtime issues?
[17:06:44] <halfak>	 In ORES
[17:06:57] * halfak is excited to bounce ideas off of you
[17:07:16] <schana>	 I'm free(ish) now
[17:08:54] <halfak>	 Cool!  /me gathers notes. 
[17:09:28] <halfak>	 I think that https://phabricator.wikimedia.org/T123678 is a good place to start
[17:10:27] <halfak>	 I'm looking for a *maybe* related issue that we haven't worked out yet. 
[17:11:24] <halfak>	 Ahh yes.  This one https://phabricator.wikimedia.org/T127975
[17:11:30] <halfak>	 I'm not convinced this one is actually done. 
[17:11:45] <halfak>	 Amir1, are you around?
[17:11:53] <halfak>	 (might be between University and home)
[17:13:58] <halfak>	 schana, one more note.  I've experienced an event where one of our two "web" nodes will become excessively slow. 
[17:14:04] <halfak>	 Or will not respond at all. 
[17:14:42] <halfak>	 I'm not quite sure how to investigate because the logs don't tell me much. 
[17:15:00] <halfak>	 Note that both the "web" and "worker" nodes use redis. 
[17:15:29] <halfak>	 The "worker" nodes use redis as a task queue & results store.  The "web" node uses redis as a cache for already generated scores. 
[17:16:57] <halfak>	 schana, we've done some work in the past to set up socket timeouts for redis connections and that seemed to buy us some better uptime for a long while. 
[17:17:04] * halfak gets links to code. 
[17:17:16] <Amir1>	 I'm around now
[17:17:33] <halfak>	 See cache socket timeout: https://github.com/wiki-ai/ores-wikimedia-config/blob/master/config/00-main.yaml#L17
[17:17:46] * schana keeps reading
[17:17:47] <halfak>	 And broker socket timeout: https://github.com/wiki-ai/ores-wikimedia-config/blob/master/config/00-main.yaml#L53
[17:20:13] <halfak>	 I think that's all the thoughts. 
[17:20:45] <halfak>	 If I had a bunch of time to devote to this, I'd be setting up a test cluster, kicking the redis node and running tests. 
[17:22:32] <halfak>	 Figured you might have a better idea for next steps. 
[17:23:01] <schana>	 I'm trying to draw a picture of it now (I'm more of a visual person)
[17:24:40] <schana>	 what's running on the web and worker instances? and what went down?
[17:26:55] <schana>	 halfak ^
[17:26:56] <halfak>	 "what's running on the web and worker instances?"  ?? 
[17:27:00] <halfak>	 web is uwsgi
[17:27:03] <halfak>	 worker is celeryd?
[17:27:07] <halfak>	 Is that what you mean?
[17:27:30] <schana>	 what are they doing?
[17:27:44] <schana>	 or what task are they performing?
[17:28:02] <halfak>	 Hmm.  Have I ever given you an overview of the system before? 
[17:28:11] <schana>	 not that I recall
[17:28:21] <halfak>	 OK.  I've given it so many times... I just misremembered. 
[17:28:28] <halfak>	 OK.  Let me get a diagram. 
[17:28:32] <schana>	 yay!
[17:28:42] * schana loves diagrams
[17:28:43] <halfak>	 https://phabricator.wikimedia.org/T110072#1725724
[17:30:29] <schana>	 so in the second drawing, are the 'app servers' the ores-web instances and 'celery workers' ores-worker?
[17:30:44] <halfak>	 yup.  That's right./ 
[17:30:58] * halfak works on a nicer diagram 
[17:32:50] <schana>	 a sequence diagram may also be helpful https://en.wikipedia.org/wiki/Sequence_diagram
[17:35:44] <halfak>	 schana, :)  you're the first person who has asked for it!  will do
[17:36:22] <schana>	 specifically, a sequence diagram would be helpful in describing the relationship between ores-web, ores-worker and ores-redis
[17:38:19] <halfak>	 schana, roger
[17:38:22] <schana>	 halfak: so is the downtime specifically the ores-worker instances are not accepting any more tasks from redis?
[17:38:30] <halfak>	 schana, good question
[17:38:35] <halfak>	 These are hypotheses of mine
[17:38:45] <schana>	 ah, so this needs investigated?
[17:39:19] <halfak>	 :P 
[17:39:30] * halfak looks at task title "Investigating downtime (2016-01-14)"
[17:39:49] <schana>	 :D I'm just trying to grasp exactly what needs investigated
[17:40:23] <halfak>	 Gotcha.  
[17:40:34] <halfak>	 Intermittent downtime with no error in the logs. 
[17:41:00] <schana>	 is the log message in the description from the redis host?
[17:41:28] <schana>	 ah, nvm, it looks like ores-worker-01
[17:41:35] <halfak>	 Yuvipanda didn't say. 
[17:41:47] <halfak>	 Oh yeah. 
[17:41:49] <halfak>	 I see it too :) 
[17:42:26] <schana>	 one more: do we have any insight into the health of the redis host during the downtimes?
[17:42:59] <halfak>	 schana, nothing in my notes. 
[17:44:11] <schana>	 are logs in the standard places on the machines?
[17:44:37] <halfak>	 schana, not sure what the standard is, but I've been instructed to use 'journalctrl' to access them. 
[17:45:18] * halfak gets example call
[17:46:15] <halfak>	 on a web node: $ sudo journalctl -u uwsgi-ores-web | less
[17:46:30] <halfak>	 Will get load the logs into less on a web node. 
[17:47:01] <schana>	 helpful, thanks
[17:47:30] <halfak>	 On a worker node: $ sudo journalctl -u celery-ores-worker | less
[17:49:04] <schana>	 on github, is wiki-ai/ores what's on ores-web? and wiki-ai/revscoring what's on ores-worker?
[17:49:31] <halfak>	 schana, negative.  Both are running ores. 
[17:50:02] <halfak>	 Celery work in such a way that you configure your client the same way as you configure your workers. 
[17:50:29] <halfak>	 See https://github.com/wiki-ai/ores-wikimedia-config/blob/master/ores_wsgi.py
[17:50:35] <halfak>	 And https://github.com/wiki-ai/ores-wikimedia-config/blob/master/ores_celery.py
[17:51:35] <halfak>	 Both of those files leave a variable call "application" in the main namespace.  This is a convention used by uwsgi and celeryd. 
[17:51:52] <schana>	 I see, so the 'revscoring' repo is used for training models then?
[17:52:10] <schana>	 and not really applicable?
[17:52:16] <schana>	 (to this)
[17:52:44] <halfak>	 Yeah.  It's really unlikely that the issues we have will be due to a bug in revscoring itself. 
[17:53:05] <halfak>	 BTW: http://pythonhosted.org/revscoring/
[17:53:55] <schana>	 okay, I think I have enough info to get started - I'm sure I'll be asking more questions as they come up
[17:54:59] <halfak>	 Sounds good.  :) 
[17:55:07] <halfak>	 I'll have some more diagrams for you shortly. 
[17:55:12] <schana>	 :)
[18:06:36] * halfak wonders how to depict async elements in a sequence diagram
[18:16:00] <halfak>	 I think I did OK
[18:19:36] <halfak>	 schana, here come some diagrams. 
[18:19:39] <halfak>	 ORES cluster: https://upload.wikimedia.org/wikipedia/commons/0/07/ORES_cluster.svg
[18:19:50] <halfak>	 Sequence for when the score is cached: https://commons.wikimedia.org/wiki/File:ORES.request_sequence_diagram_(cached).svg
[18:20:06] <halfak>	 Sequence for when the score is not cached: https://commons.wikimedia.org/wiki/File:ORES.request_sequence_diagram_(not_cached).svg
[18:23:04] <schana>	 halfak: those are super helpful, thanks!
[18:23:21] <halfak>	 No problem.  Good reason to make them.  Now to figure out where in our documentation they belong. 
[18:23:25] <halfak>	 Probably on wikitech. 
[18:31:11] <halfak>	 schana, for future reference, see https://wikitech.wikimedia.org/wiki/Nova_Resource:Ores/Documentation#Cluster_layout
[18:31:27] <halfak>	 All the images I just linked to are there :) 
[18:31:33] * halfak runs away for lunch
[18:31:34] <halfak>	 o/
[19:34:32] <harej>	 halfak, DarTar: I was thinking of submitting an Inspire campaign proposal to beef up the efforts around Librarybase
[19:34:56] <halfak>	 harej, interesting.  How would you fit it into the review/curation theme?
[19:36:13] <harej>	 Will have to read the criteria more closely but it is a significant challenge to curate the source metadata and doing so will allow new opportunities including improved source reviewing and source recommendations
[19:37:18] <halfak>	 harej, I might also pitch the project as an exploration in Wikidata curation strategies. 
[19:37:40] <halfak>	 -- starting up a parallel wikibase and negotiating content loads separate from wikidata's database. 
[19:37:58] <harej>	 As in, you are going to separately propose this, or are you recommending that I include that?
[19:38:04] <halfak>	 If this works well, it may be good new pattern for loading data into wikidata. 
[19:38:13] <halfak>	 harej, feel free to include. 
[19:40:16] <harej>	 I'm not sure how necessary it is as a general Wikidata strategy though. Most Wikidata imports aren't at this scale.
[19:40:43] <harej>	 I'm trying to take on the Census Bureau as a client, and even for them I would just recommend direct input into Wikidata.
[19:41:20] <halfak>	 harej, fair enough.  I'd like to start documenting datasets with wikibase though. 
[19:41:23] <harej>	 I'm taking this approach for source metadata because it's a uniquely complex problem and because I have a need for a citation database that may or may not be consistent with the vision for Wikidata
[19:41:28] <halfak>	 I don't think that starting in wikidata is a good way to go. 
[19:41:50] <harej>	 For your use case, perhaps.
[19:42:41] <halfak>	 Yeah.  And likely for others too. 
[19:42:42] <halfak>	 Not all. 
[20:33:35] <DarTar>	 hey harej – just back from lunch. I’d love to read a proposal if you submit something. We’re making good progress with the organization of this thing in May, btw. I should be able to send out updates shortly
[20:34:24] <DarTar>	 agree WD is a fascinating testbed for curation/data imports