[01:21:38] halfak: https://github.com/halfak/Deltas/pull/6 [01:33:38] YuviPanda, {{merged}} [01:33:55] halfak: ty [01:34:21] halfak: did I jump the gun in https://phabricator.wikimedia.org/T106867#1520552 [01:36:09] Na. It's an aggressive timeline that I like. [01:36:35] I'm going to want to keep this tool in beta for a long time though. [01:37:14] We have some work to do studying bias and I'm hoping to ask aripstra to work with me on the UX. [01:37:26] halfak: oh yeah, totally. [01:37:33] halfak: betafeature [01:37:37] Yeah [01:37:41] Not beta labs [01:38:09] We should totally build a model for beta labs and announce a release on April 1st. [01:38:27] beta wiki? [01:38:53] halfak: wikitech.wikimedia.org/wiki/Labs_labs_labs [01:38:54] :D [01:41:56] Hey! I got it right. :) [01:42:31] halfak: :) [01:42:54] halfak: so we might need a performance and security review of this [01:43:00] security review from csteipp and performance review from ori [01:43:20] Oh man. I better get celery in shape before Ori has a look [01:43:26] I have a proposal there. [01:44:19] Hmm... Wanted to tell you about it before I got to work, but I didn't write it up. [01:45:05] So, the reason we're not catching dupe requests now is the delay between a request coming in and us hitting celery with a task. [01:45:36] We use this time during batch requests to hit the API. [01:45:47] During a single revision request, we do the same process, but it's a batch of 1. [01:46:07] So, I want to special case the single request. [01:46:24] I considered just ditching the batch request, but that's really good for (1) historical analysis [01:46:33] and (2) Special:UserContribs [01:47:16] It's also the fastest way we could populate old scores unless I figure out something clever with bot XML and API datasources. [01:47:24] *both [01:47:34] YuviPanda, ^ [01:48:30] So, in the special case for the single revision, we'd hit celery with the task immediately. [01:49:13] halfak: what do you mean by immediately? [01:49:27] Oh! The task itself will make the API call. [01:49:51] This will also substantially reduce the size of the arguments. [01:50:03] Since we're packaging up revision text for processing in the current setup. [01:50:22] ah [01:50:23] yes [01:50:35] halfak: actually, the task should always make the API call [01:50:38] halfak: why isn't it? [01:50:55] Can't batch and then split a task, can I? [01:51:02] aaarrrgh [01:51:03] yes [01:51:04] that [01:51:12] *BATCH*! [01:51:24] but you did say that fetching itself is instantaneous? [01:51:45] Practically -- compared to the time it takes to generate a diff. [01:52:23] halfak: and it's still instantaneous if you don't batch? [01:52:35] halfak: btw, the requests problem wasn't fixed, I responded on the ticket [01:52:42] WUT [01:52:45] Darn [01:53:20] OK. So yeah. We spend about 3 times as much time requesting 50 revs individually as we do requesting 50 in batch. [01:53:34] ok [01:53:40] And we would open 50 connections to the API all at once. [01:53:48] ok [01:53:51] fair enough [01:54:05] I just hate special cases. They end up complicating the mental model of how things work [01:54:08] It's still a little smelly. [01:54:09] Yeah. [01:54:24] halfak: what's our performance problem now? with RCStream based caching is this actually a big problem atm? [01:54:54] Well, if a bot hits us at RCStream speed, we'll double-generate a lot of scores. [01:55:15] hmm [01:55:17] ok [01:55:19] After this, we'll only double generate 1/LARGE_NUMBER [01:55:22] but is that an actual problem now/ [01:55:22] ? 
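The dedup plan halfak describes above (enqueue the scoring task immediately for a single-revision request, let the task itself fetch the revision from the API, and let a second request for the same revision attach to the already in-flight task) reduces to roughly the sketch below. This is a hedged illustration, not ORES's actual code: `fetch_text_from_api` and `run_model` are hypothetical stand-ins for the real extraction and scoring path, and the Redis "SET NX" marker is just one way to notice a duplicate before re-enqueueing.

```python
import celery
import redis

app = celery.Celery('ores_sketch',
                    broker='redis://localhost:6379/0',
                    backend='redis://localhost:6379/0')
cache = redis.Redis()


def fetch_text_from_api(wiki, rev_id):
    # Stand-in for the real MediaWiki API call.
    raise NotImplementedError


def run_model(wiki, model, text):
    # Stand-in for feature extraction + the trained model.
    raise NotImplementedError


@app.task
def score_revision(wiki, model, rev_id):
    # The task does the API fetch itself, so the web layer only ships three
    # small arguments to the queue instead of packaged-up revision text.
    text = fetch_text_from_api(wiki, rev_id)
    return run_model(wiki, model, text)


def request_score(wiki, model, rev_id):
    # Deterministic task id: two requests for the same (wiki, model, rev_id)
    # map to the same task and the same result.
    task_id = 'score:{0}:{1}:{2}'.format(wiki, model, rev_id)
    # SET NX marks the task as in flight; if the key already exists, skip the
    # enqueue and just attach to the result that is already being computed.
    if cache.set(task_id, 'in-flight', nx=True, ex=600):
        score_revision.apply_async(args=(wiki, model, rev_id), task_id=task_id)
    return app.AsyncResult(task_id)
```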
[01:55:27] we do have the capacity... [01:55:37] I can run about 8 precached simultaniously. [01:55:48] Oh wait. That might have been with 8 workers. [01:56:03] BTW, we should double the number of workers. [01:56:06] and with Extension:ORES we'll have mediawiki hit us the same time as it starts sending data out to RCStream so we have a few ms advantage there [01:56:11] and it'll be running on real hardware [01:56:15] halfak: are we using them up atm? [01:56:16] Yeah. [01:56:24] And probably right next to it topologically. [01:56:58] Oh! Don't double the CPUs. Just the workers :D [01:57:04] halfak: ah ok :) [01:57:06] halfak: http://tools.wmflabs.org/nagf/?project=ores btw [01:57:18] oooo [01:57:52] LOts of memory [01:58:10] Are we barely touching our CPU? [01:58:25] halfak: look at the individual numbers below, not at the project aggregate [01:59:24] * halfak whistles [02:00:06] So we spike up to 50% sometimes. [02:00:23] And at 9PM yesterday, there was a sudden rise in activity [02:00:39] Make that 8 [02:00:55] Except ores-worker-04 [02:01:18] are all the workers actually active? [02:01:28] OK. That's a problem. I need to figure out why these things are going offline. [02:01:31] we don't have alerts yet because we don't have graphite yet [02:02:12] Yeah. I might get that this weekend, but I've got Jenny's folks over. [02:02:51] Huh. ores-worker-04 doesn't appear in flower. [02:03:00] And 03 is offline. :\ [02:03:06] heh [02:03:15] madhu started on it and then got distracted [02:03:24] wut [02:03:27] >:( [02:03:53] Oh... I thought you meant took the machine down. I see she started on the worker offline issue. [02:05:03] Processed: 84544 :) [02:05:07] All back online [02:05:29] So, ores 03 was offline and has 0 failed. [02:05:39] sudo journalctl -u celery* -f for logs [02:05:42] well, last bits of logs [02:05:45] no -f for full logs [02:06:55] man... it's that darn is_bot error. I need to get these new models pushed to staging. [02:06:59] Darn other things! [02:07:10] I think when I come back [02:07:18] we can move off pip [02:07:48] Will we still be *available* through pip? [02:08:45] * halfak watches staging compile numpy [02:08:49] :C [02:09:33] * halfak watches flower monitor instead [02:11:24] Onto scipy [02:12:47] Gotta run. Jenny's folks just got here. [02:12:52] Have a good trip dude! [02:13:07] halfak: you too. [02:13:17] halfak: and yes, we'll still be available via numpy [09:11:42] hmm [09:11:58] watching numpy and scipy compile is quite hypnotic [09:12:19] do I get a barnstar for doing that for 10 hours? [10:01:30] ToAruShiroiNeko: watching scipy compile is something I don't wish on my worst enemies [10:04:38] hehe [10:04:49] why not just use ubuntu packages? [10:04:51] it seems like all I do with revision scoring at time [10:05:20] I dont think we use the same version [10:05:24] and I have THAT fail on me too [10:05:36] you should have seen the utter disblief on halfak's face [10:05:46] anaconda, then? [10:05:50] or was that numpy [10:24:58] I have just ogtten used to these tools :p [14:42:27] o/ [14:42:41] Wait... I didn't just create a tag now. [14:42:45] The bot lies [14:42:56] I did that two days ago! [14:43:10] https://github.com/wiki-ai/revscoring/commits/v0.4.10 [14:43:12] ! [14:52:47] halfak: maybe you only just pushed it? [14:54:42] Na. Even github thinks I did it a couple days ago [14:55:02] Oh wait! [14:55:04] Whut [14:55:15] weird. [14:55:43] Now I'm very confused. Because pypi has a 0.4.10 [14:57:50] Looks like I forgot to make a tag. 
[14:58:19] It seems that I somehow created a v0.4.10 *branch* [14:58:26] * halfak continues being confused [14:58:35] Oh well. back to reviewing papers. [15:25:35] {{done}} [15:25:47] OK. Now do I do celery performance or statd logging? [15:25:55] I think I'm going to do performance [15:50:32] wiki-ai-hack time! [16:27:14] halfak: I was gonna say statsd :p but ok [16:27:20] * YuviPanda falls back in bed [16:27:53] OK. I'll pick that up tomorrow morning regardless of the progress I make today. [16:28:07] * halfak hammers ores-staging with precached [16:28:17] Been going for a little over an hour without issue. [16:28:43] YuviPanda, BTW, that requests version error was due to an old version of revscoring being installed. [16:29:03] I don't recommend installing a new version on ores-web now though. [16:29:11] It will likely break the models due to pickling issues. [16:29:47] Yup [16:29:56] This is one of the problems with pickle I guess [16:30:16] It's really a problem with associating languages and features with a model. [16:30:20] We have deb packaged almost all your dependencies except scikitlearn [16:30:24] This is useful for helping to make sure we don't make mistake. [16:30:33] But equality needs to work cross-version :S [17:16:28] Woo! It works. Now to run my tests against it. [17:19:15] It works! [17:19:39] * halfak is running 4 simultaneous precached instances against his laptop [17:20:39] It looks like we need to have about 5ms between requests in order for celery to notice the dupe. [17:20:55] When running simultaneously, this works ~ 70% of the time/ [17:21:25] Nice [17:21:28] Yay [17:21:52] We should be doing really good when we are running from the same machine. [17:22:35] halfak: one thing we can eventually do is separate these into two queues for better reasonability. One is purely cpu bound and the other does io too [17:22:38] Not needed now [17:22:47] halfak: ya network penalties I guess [17:24:48] It seems like celery remembers it's queue when you bring it down and back up. [17:24:51] Interesting [17:24:55] Also a pain in the butt for testing! [17:25:29] halfak: you can flush redis [17:25:34] To prevent rhat [17:25:48] Also do we need pylru when redis is a requirement for celery? [17:27:44] Celery is not a requirement for ORES [17:27:56] You can run ORES with just a timeout [17:28:02] halfak: oh I see [17:28:03] Ok [17:28:16] I'm hoping to have some parallel instances set up soon. [17:28:24] Ones that are purely IO and SLOW [17:28:36] ? [17:28:46] E.g. https://meta.wikimedia.org/wiki/Grants:IEG/Automated_Notability_Detection [17:28:54] Aaaaah [17:28:57] Right [17:29:00] These guys are doing to score revisions, but they will mostly use external datasources :) [17:29:07] Yeah [17:29:19] I guess we will need separate queues [17:29:26] Celery has that built in I think [17:29:27] Yeah... That might work too. [17:29:35] I was imagining we'd have separate instances of ORES [17:29:38] Yeah lets not do parallel instances [17:29:47] A labs set that has experimental models [17:29:56] And a production instance that has models that are performant and vetted. [17:30:03] We can just have different configs for labs and prod [17:30:11] Yeah. that's what I mean :) [17:30:12] And enable disable appropriately [17:30:36] Some of those instances probably won't warrant a celery cluster. [17:30:43] Then again, you can run celery on the local machine [17:30:46] Hmm. [17:30:50] But by separate instances I thought you meant their own lb, redis, web, etc [17:31:07] Yeah... Well that too. 
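One way to turn the cross-version pickling problem mentioned above into a loud, early failure (rather than broken scores at request time) is to record the library version alongside the model when it is dumped and compare it at load time. This is only a sketch of that idea, not revscoring's actual model API; `REVSCORING_VERSION` is a placeholder for whatever version string the installed library reports.

```python
import pickle

REVSCORING_VERSION = '0.4.10'  # placeholder; in practice read from the installed library


def dump_model(model, f):
    # Record which library version produced this pickle.
    pickle.dump({'version': REVSCORING_VERSION, 'model': model}, f)


def load_model(f):
    blob = pickle.load(f)
    if blob['version'] != REVSCORING_VERSION:
        raise RuntimeError(
            'Model was pickled with revscoring {0} but {1} is installed; '
            're-generate the model file before serving scores.'.format(
                blob['version'], REVSCORING_VERSION))
    return blob['model']
```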
[17:31:19] Going to become a nightmare to manage [17:31:20] :) [17:31:23] I imagine the labs instance will need to have a separate lb, redis, etc. than prod [17:31:28] Gotta have ores-labs [17:31:29] Yes [17:31:43] But inside labs there shouldn't be one per model... [17:31:48] +1 [17:31:54] cool :) [17:32:07] I guess we can send tasks for model x to queue x [17:32:21] Though, I'd like it if Bluma can stand up an instance in labs to experiment with that just works out of the box and doesn't require setting up redis or celery. [17:32:23] And thus be able to for example times of overload disable only specific models [17:32:34] +1 [17:32:36] That's a good idea. [17:32:56] halfak: well I would rather solve that problem with good documentation and puppet roles than additing additional points of divergence. [17:32:58] Some models should get priority. E.g. reverted is super time sensitive, but article quality isn't. [17:33:25] You add an extra option for how a system is set up, that doubles the total number of possible configurations... [17:33:53] Indeed. But not to rag on Bluma, but she's not much of an engineer [17:34:02] They can already setup an instance by just applying the staging role [17:34:11] If we told her "just figure out the celery roles and set it up like our production instance"... [17:34:13] It is clicking a check box and waiting 20mins :l [17:34:15] :) [17:34:32] I guess. [17:34:34] Indeed, which is why that puppet role is written that way and why the vagrant setup exists [17:34:52] So you don't have to worry about all the systems stuff [17:35:01] So, what you're saying is that I shouldn't provide interchangable parts with common interfaces within ORES? [17:35:18] For things like redis, I think so. [17:35:48] Or more specifically [17:35:55] For the entire way the system works [17:36:03] E.g. you have three score processors to pick from. The one that will wait as long as you want and process scores sequentially. There's one where you can set a per-score timeout. And another where you can farm the work out to celery *and* set a timeout. [17:36:07] There should be one model - uwsgi+celery or just uwsgi [17:36:29] Supporting both is going to be somewhat painful because they have very very different characteristics [17:36:42] Not really. [17:36:52] You say that now.... :) [17:37:04] The timeout processor is 40 lines of code and shares most of those lines with celery processing internals because I need to manage timeouts anyway. [17:37:14] The basic processor shares it's code with both timeout and celery. [17:37:33] Celery is built in parts from the basic processor and timeout processor. [17:37:48] I'm not saying it doesn't work. It is just... Extra complexity with not much of an advantage [17:38:06] It's not more complex. [17:38:10] It's more abstract [17:38:13] What does the uwsgi timeout processor give you? Easier to setup? Vagrant and puppet [17:38:20] Same thing halfak :) [17:38:25] Rather, it maintains an abstraction. [17:39:01] Anyway, to step aallll the way back [17:39:04] The timeout processor provides essential functionality (literally every line of code) to the celery processor. [17:39:06] For new people developing new models [17:39:15] So it was a logical half-step to set up. [17:39:15] I would say that we reccomended that they use vagrant [17:39:29] And for test labs instances use the staging puppet role [17:39:34] Does that sound agreeable? [17:39:49] Not sure I think that vagrant is ready for dev reliance. 
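The "send tasks for model x to queue x" idea discussed above (so that, for example, slow article-quality scoring can be throttled during overload without touching the time-sensitive reverted model) can be done with plain Celery routing. A hypothetical sketch follows; the queue names and the model-to-queue map are invented, and `score_revision` stands in for the real scoring task.

```python
import celery

app = celery.Celery('ores_sketch', broker='redis://localhost:6379/0')


@app.task
def score_revision(wiki, model, rev_id):
    raise NotImplementedError  # stand-in for the real scoring task


# Time-sensitive models go to one queue; batch-friendly models to another.
MODEL_QUEUES = {
    'reverted': 'ores_realtime',
    'wp10': 'ores_batch',
}


def enqueue_score(wiki, model, rev_id):
    queue = MODEL_QUEUES.get(model, 'ores_default')
    return score_revision.apply_async(args=(wiki, model, rev_id), queue=queue)

# A worker started with `celery worker -Q ores_realtime` then only consumes
# the time-sensitive queue, so the other queues can be drained or disabled
# independently during times of overload.
```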
[17:40:03] Why [17:40:17] We still can't edit revscoring code from vagrant without doing an sshfs mount or something. [17:40:39] Also, it's a surprising pain in the ass to get working. [17:40:46] Err anything you edit in your host machine is already immediately reflected on /vagrant [17:41:00] It requires you to not use system provides packages [17:41:01] Yeah. But we can't just put revscoring in there. [17:41:17] We can. Awight was mentioning setup.py develop which we can use [17:41:27] Yeah. Once that is ready, we'll see. [17:41:30] Right now, it's not. [17:41:31] Ok [17:41:40] :( [17:42:09] Anyway, I have to go pack now. Sorry to drop out in the middle of conversation [17:42:26] Not at all. Thanks for hopping on the day you are leaving. [17:42:30] Have a good one! [17:42:31] :) [17:42:32] Ok [17:43:15] Just remember that this is a system with moving parts, and the less parts we can get away with the better. [17:43:18] Bye! [17:43:36] YuviPanda, +1 to that. [17:43:55] I just think we're having a difference on the definition of "moving part" :\ [17:44:01] :) [17:44:09] Either way, once we come together on the language, I'm sure we'll agree. [17:44:15] And I'll owe you a beer or something. ;) [17:44:34] That's OK. I can keep pushing to reduce what I think is moving parts and you can keep pushing back and I'm sure that'll end up in some version of optimum :) [17:44:43] Indeed :D [17:44:49] What's the fun if we keep agreeing on stuff [17:45:06] There are times I'm going to be wrong and times you are going to be wrong and then we learn... [17:45:13] OMG SPAM [17:45:15] Sorry [17:45:29] DELETE ALL THE BRANCHES [17:45:45] halfak: so before I go - one thing that would be nice (if it already isn't there) is to not crash if an optional library isn't found [17:46:03] halfak: like, pylru - ores should work if one of pylru or redis is found.. [17:46:07] If that makes sense [17:46:25] And if configured to use redis [17:46:29] It shouldn't even import pylru [17:46:34] Yeah... It should really work if configured to use one or the other. [17:46:34] +1 [17:46:52] The does not even import part is important [17:46:57] * halfak feels weird pulling imports form the top of the file. [17:47:01] So I can ignore packaging it as a dependency [17:47:11] GOod point. [17:47:22] And then having to keep it up to date, etc [17:47:26] I'll got make a quick PR to solve that one and see if there are others. [17:47:37] Awesome [17:47:39] Thanks [17:47:50] And now I go for realz [17:47:58] halfak: I'm not flying till Monday bte [17:47:59] Bte [17:48:00] Btw [17:48:09] But I am taking all my stuff with me so... :) [17:48:51] I suppose, but I should not bug you while you're building the next GSM tower -- or maybe a giant freedom robot. [17:49:48] * halfak plans to have all the ducks lined up for YuviPanda's return so that we can hit our schedule. [17:50:09] halfak: ya but Monday and Tuesday is OK I think and so is today and tomorrow [17:50:26] halfak: yup. Helder left some comments on legoktm's patch which was great [17:51:11] Yeah. I saw that. :) [17:52:05] :) [17:52:13] YuviPanda, should I remove those packages from requirements.txt then too? [17:52:16] o/ Helder [17:52:17] :) [17:53:07] halfak: so if we offer multiple modes of running ores with multiple sets of packages required we should have multiple requirements.txt no? [17:53:37] Maybe. These are "install requirements". [17:53:53] Some packages may be required for different functionality, but they are not necessary for install. 
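The "it shouldn't even import pylru" point above boils down to deferring the import until the configured cache backend is actually constructed, and raising a friendly error when that backend's package is missing. A minimal sketch, with invented config keys (not ORES's real configuration schema):

```python
def build_cache(config):
    """Build the configured cache backend, importing only what is needed."""
    if config.get('cache') == 'redis':
        try:
            import redis
        except ImportError:
            raise ImportError(
                "The 'redis' cache backend is configured, but the redis "
                "package is not installed.")
        return redis.StrictRedis(**config.get('redis', {}))
    else:
        try:
            import pylru
        except ImportError:
            raise ImportError(
                "The default in-process cache needs the pylru package; "
                "install it or configure the redis backend instead.")
        return pylru.lrucache(config.get('size', 1000))
```

Because the unused backend is never imported, it can also be dropped from the install requirements and packaged (or not) independently, which is the point YuviPanda raises about Debian packaging.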
[17:54:05] We can just have a separate prod-rewuiremenrs [17:54:09] .txt [17:54:31] But that will be Debian packages anyway soon [17:54:59] (Doesn't change anything for people using it from pip) [17:55:12] +1 [17:57:24] Sklearn is our big holdout now [17:57:26] For packaging [17:59:59] No getting around that one [18:01:22] halfak: ya just needs someone to figure out cython packaging [18:01:29] Shouldn't be too hard [18:01:39] Just relative to the other packages which were a lot easier [18:02:52] numpy and scipy already packages? [18:09:23] Hmm... Removing celery from the import is going to be painful due to the weird way that you register events for celery. [18:09:36] * halfak grumbles about module-level-only decorators. [18:09:40] halfak: yup [18:09:45] But I got pylru and redis to work nicely. [18:09:54] cool :) [18:09:56] With cute error messages that tell you why the import failed. [18:10:14] I'll leave celery as a req. and think more. [18:11:31] kk [18:17:51] OK. Gotta run. Have a good one folks! [18:17:52] o/ [19:40:05] YuviPanda, should I be sending statsd to graphite1001? [19:40:18] halfak: labmon1001.eqiad.wmnet [19:40:25] halfak: that'll show up in graphite.wmflabs.org [19:40:30] Gotcha. [19:40:41] * halfak starts looking at quarry for statsd use examples [19:40:48] halfak: quarry doesn't do statsd :( [19:40:53] there's a python library tho [19:41:04] What? [19:41:08] No logging in quarry? [19:41:38] halfak: https://pypi.python.org/pypi/statsd [19:41:43] halfak: no statsd metrics, no [19:41:46] terrible, I know... [19:41:58] statsd is metrics, not logging [19:42:20] What should I be setting up first? [19:42:43] I can make our use of 'logging' complete or start working with statsd [19:42:51] halfak: former, I think. [19:43:03] OK. On it, [19:43:05] halfak: depends on what you mean by 'complete'? [19:43:41] Well, right now, we don't have debug stuff everywhere. We also don't necessarily use logging.error everwhere we ought to. [19:44:01] I want to give it a pass and decide that it logs in a wholesome way. [19:44:07] Both revscoring and ores. [19:44:16] halfak: +1 [19:44:18] OK. [19:44:22] Doing that first then. [19:44:37] there's also https://phabricator.wikimedia.org/T108421 but that we can't do now because we don't have a way to receive it yet [19:45:24] Will we be sending our `logging` events to logstash? [19:45:52] halfak: yup, we'll just connect logging module to output to logstash [19:46:09] https://pypi.python.org/pypi/graypy probably [19:46:25] it should be an optional extra that you can configure [19:46:35] you don't want it for local dev or whatever [19:46:44] hahahaha [19:46:47] example code has [19:46:48] > puff_the_magic_dragon() [19:46:49] hehehe [19:46:58] lol [19:47:37] Hmm. I don't want to do much to OREs now to conflict with my last refactor. [19:47:46] Until that PR is merged. [19:47:50] +1 [19:48:01] So revscoring log messages it is! [19:48:04] :) [19:48:06] * YuviPanda has to go again [19:48:10] o/ [19:48:24] cya! [19:48:33] * YuviPanda would like to spend some time on Quarry again sometime soon [19:48:48] halfak: so when I'm in Italy I'd probably be deep in kubernetesland with _joe_ and not be of much use here [19:49:06] * halfak googles kubernetesland [19:49:12] haha [19:49:14] kubernetes -land [19:49:25] halfak: it's the GridEngine replacement / the 'future' [19:49:34] Oh! Cool :) [19:49:40] "Replacement" for Tool Labs? 
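For the statsd side of the metrics discussion above, the pypi `statsd` client YuviPanda links is enough to count and time scoring requests. A minimal sketch, with the prefix and metric names invented for illustration (labmon1001.eqiad.wmnet is the endpoint mentioned in the chat; `compute_score` is a hypothetical stand-in):

```python
import statsd

# Labs graphite endpoint mentioned above; prefix and metric names are invented.
metrics = statsd.StatsClient('labmon1001.eqiad.wmnet', 8125, prefix='ores')


def compute_score(wiki, model, rev_id):
    raise NotImplementedError  # stand-in for the real scoring path


def score_with_metrics(wiki, model, rev_id):
    metrics.incr('requests.{0}.{1}'.format(wiki, model))
    with metrics.timer('score_duration.{0}.{1}'.format(wiki, model)):
        return compute_score(wiki, model, rev_id)
```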
[19:49:44] err [19:49:47] gridengine replacement [19:49:47] Or part of it [19:50:15] so the idea being you'd define 3 services (redis, celery, uwsgi) as yaml files, and tell it to run X of each and it'll figure out what to do where [19:50:19] and keep them up and running [19:50:29] halfak: this is the thing YuviPanda and joe showed at the end of the hackathon [19:50:36] yeah [19:50:45] well, it's the underlying thing that enables that and a lof ot other things [19:50:59] our first goal is to make it run without any changes for 'webservice XXXX start' tho [19:51:02] * halfak tries to remember that [19:51:29] Oh wait. Have you showed me the UI for this before YuviPanda ? [19:51:52] halfak: yes, it was a competing project called Marathon but pretty much the same [19:52:03] Gotcha [20:27:07] halfak: I'm slowly changing the quarry puppet repo to match the lessons learnt from doing ORES :) [20:27:13] (it uses the celery module now) [20:27:53] :D [20:28:18] halfak: but I'm building it to be solely debian package based so that'll be nice to see how it goes too [20:28:22] * YuviPanda is looking into aptly.info [20:30:08] Should we be worried about making these available in a distro repo? [20:30:19] E.g. the comment by Mako re. debian [20:30:21] yeah [20:30:30] I think that's a good healthy thing for the medium to long temr [20:30:31] *term [20:30:39] and I'd love to have Mako do it [20:30:52] even for things like [20:30:56] MWparserfromell [20:31:06] legoktm: you should put mwparsefromhell into debian again :)