[01:50:40] Just finishing up testing the new ORES for deployment.
[01:50:49] Want to make sure that the caching works exactly right.
[02:53:36] And here we go!
[13:20:12] o/
[13:20:15] Hey Amir1
[13:20:16] :)
[13:20:25] o/ halfak
[13:20:29] good morning
[13:20:51] right now, I'm trying to enable ores roles for beta project
[13:20:54] Anything for me to review this morning?
[13:20:59] (my morning)
[13:21:00] do you know how to do it?
[13:21:11] Do you do it through wikitech?
[13:21:27] https://wikitech.wikimedia.org/w/index.php?title=Special:NovaInstance&action=configure&instanceid=e19c9cd0-c670-4486-9930-e29e18e15338&project=deployment-prep&region=eqiad
[13:21:30] You probably want https://wikitech.wikimedia.org/wiki/Special:NovaPuppetGroup
[13:21:49] it's not in list of roles
[13:21:50] It lets you select which puppet roles are available for instances of a project
[13:21:57] You "add class"
[13:21:59] great
[13:22:39] awesome
[13:23:33] thanks halfak, I was going through operations/puppet files!
[13:23:42] \o/
[13:23:50] Happy to help :_
[13:23:56] :)
[13:24:14] * halfak looks at the new ORES homepage and feels good about it
[13:25:12] I'm worried about our feature extractor performance.
[13:25:46] in general or when we inject feature values?
[13:25:54] In general
[13:26:27] https://github.com/wiki-ai/revscoring/issues/234
[13:26:33] "Profiling feature extractor"
[13:26:52] I checked that a while ago
[13:26:56] I would love to help
[13:27:30] Seems like this will need to go through a couple of iterations, but it might be very easy
[13:27:53] We already log how long the process() methods take to run.
[13:29:54] halfak: do you want to split the work?
[13:30:30] Maybe -- so long as we split it in sequence -- rather than parallel ;)
[13:32:28] :D
[13:36:41] halfak: btw. what is "role::labs::ores::lb"? I couldn't find the puppet class related to lb
[13:38:08] * halfak digs
[13:38:55] also redis roles should be enabled for workers?
[13:39:56] Hmm...
That doesn't sound right.
[13:40:41] https://github.com/wikimedia/operations-puppet/blob/production/modules/role/manifests/labs/ores/lb.pp
[13:41:28] Amir1, are you looking at the redisproxy?
[13:41:49] Because that just makes sure that the machine can talk to redis.
[13:42:29] " Simple nginx HTTP load balancer for ores"
[13:42:33] line one
[13:42:59] Yup
[13:43:04] what's the difference between a "Load balancer" and LVS?
[13:43:33] No idea what LVS is, but it seems that varnish can take over nginx's role
[13:46:20] https://en.wikipedia.org/wiki/Linux_Virtual_Server
[13:46:30] akosiaris can tell better of course :)
[13:47:33] halfak: what do you think if I configure redis proxies and lb in the only worker node in beta?
[13:49:05] I'd put them in the web node.
[13:49:18] * halfak thinks about the lb
[13:49:36] It would be great if we could *not* configure the lb at all.
[13:49:59] I already put flower in web node
[13:50:05] but yeah
[13:50:14] https://github.com/wikimedia/operations-puppet/blob/production/modules/ores/manifests/web.pp#L18
[13:50:24] If that said port 80 instead of 8080
[13:50:39] Amir1, why not set up a staging instance in beta?
[13:50:45] We already have a staging role
[13:51:10] https://github.com/wikimedia/operations-puppet/blob/production/modules/role/manifests/labs/ores/staging.pp
[13:51:36] hmm, not a bad idea, but the point is to make everything similar to prod
[13:51:55] but how much similar
[13:52:03] Gotcha. Then, we'll want to have a web/redis/worker nodes.
[13:52:07] (my guess is we need advice from ops)
[13:52:13] ^ Good point.
[13:52:23] yuvipanda would have a useful opinion here too
[13:52:27] Easy for him to comment too
[13:52:38] He'll likely be up in ~4 hours
[13:52:44] cool
[13:52:48] LVS is what Amir pasted the link to. In our case it's a Direct Routing layer with a weighted round robin implementation and health checking
[13:53:04] health checking and pooling/depooling happens via pybal
[13:53:06] akosiaris, gotcha.
So LVS would do our load balancing
[13:53:10] yes
[13:53:10] o/
[13:53:20] Great. :)
[13:53:31] akosiaris, any opinion about our beta setup?
[13:53:45] Could have one beefy machine that performs all roles
[13:53:54] Or could have several vms that fill individual roles
[13:55:47] IIRC, the SCA/B part exists already in beta
[13:56:05] probably a VM to host the worker and another VM to host a redis if there isn't already one
[13:56:12] so 3 VMs
[13:56:25] that mirrors more of less production (6 machines actually)
[13:56:31] more or less*
[13:56:47] btw, I 've been cleaning up the mode
[13:56:50] module*
[13:57:05] I am thinking no flower in production and I am wondering about the precached
[13:58:09] akosiaris, no flower in prod could be OK. If I can get graphite set up to track the workers in realtime nicely, I won't miss flower.
[13:58:15] Right now, it's a pain to use graphite.
[13:58:22] the idea of that is to allow the cache to be warmed up with all the new edits, right ? I am a bit worried it might end up eating cycles to generate scores for revisions that never get queried, lowering the cache hit rate
[13:58:22] I think I just need to work out grafana.
[13:58:50] akosiaris, the primary use-case of many of our models is recentchanges patrolling
[13:59:12] Fwiw, we have implemented a selective precaching strategy that will let us choose which models are precached.
[13:59:31] oh, if celery provides an API to get data out of it (it did not when I was playing with it 3 years ago), creating a diamond plugin and a grafana dashboard should be easy
[13:59:40] So we can, for example, only precache the 'damaging' model for enwiki rather than 'damaging', 'wp10', 'reverted', 'goodfaith' on every edit.
[13:59:56] akosiaris, we can use the statsd logs I already have set up.
[13:59:59] ok that definitely makes things better than precaching everything
[14:00:14] the nice thing about flower is that it gives a clean view of which workers are online/offline.
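The selective precaching idea from 13:59:12 and 13:59:40 can be sketched roughly as below. This is only an illustration under assumptions: the `PRECACHE` mapping and the `models_to_precache` function are hypothetical names, not the actual ORES configuration format or API. The point is that each new edit is scored only for the models listed for its wiki, so cycles are not spent on scores nobody queries.

```python
# Hypothetical per-wiki precache configuration (not the real ORES format).
# Only the models listed here are scored as soon as an edit is saved.
PRECACHE = {
    "enwiki": {"damaging"},
}


def models_to_precache(wiki, available_models, precache=PRECACHE):
    """Return the subset of available_models that should be precached for wiki.

    Wikis without an entry in the mapping get no precaching at all.
    The order of available_models is preserved.
    """
    wanted = precache.get(wiki, set())
    return [model for model in available_models if model in wanted]


if __name__ == "__main__":
    all_models = ["damaging", "wp10", "reverted", "goodfaith"]
    print(models_to_precache("enwiki", all_models))
    print(models_to_precache("fawiki", all_models))
```

With this shape, turning precaching on for another model or wiki is a configuration change rather than a code change.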
[14:00:34] oh that as well. statsd is fine as well
[14:00:42] If I can set up a grafana or whatever with a graph pre-configured to show the workers status, then no flower needed :)
[14:00:55] ok, I think we can achieve that
[14:13:00] akosiaris: is anything needed to merge these patches? https://gerrit.wikimedia.org/r/#/c/278989/
[14:13:04] specially on our side
[14:17:43] yes, cleaning up base a bit more. that git::clone thing for example, getting scap to work etc
[14:22:50] akosiaris: scap is working in beta, I was able to deploy it
[14:23:07] but I used puppet apply
[14:23:40] and I want to add them to operations/puppet
[14:24:01] but I don't know what to do about ssh credentials
[14:24:21] your opinion on this matter is very valuable akosiaris
[14:27:27] * Amir1 just made article of LVS in Persian Wikipedia :D
[14:27:56] Amir1, link?
[14:28:11] https://fa.wikipedia.org/wiki/%D8%B3%D8%B1%D9%88%D8%B1_%D9%85%D8%AC%D8%A7%D8%B2%DB%8C_%D9%84%DB%8C%D9%86%D9%88%DA%A9%D8%B3
[14:29:13] My translation gives me "Species: load balancing" lol
[14:30:18] :))))
[14:30:51] halfak: you won't believe what we made to make creating articles easier in Persian Wikipedia
[14:34:02] click on this link:
[14:34:02] https://fa.wikipedia.org/w/index.php?title=%D8%A8%D8%B1%D8%A7%DB%8C%D9%85_%D8%A7%D9%86%D8%AF%D8%A7%D8%B2%D9%87_%DB%8C%DA%A9_%D8%B1%D9%88%D8%AF_%DA%AF%D8%B1%DB%8C%D9%87_%DA%A9%D9%86&enName=Cry_Me_a_River_%28Justin_Timberlake_song%29&enOldid=711712797&redlink=1&action=edit&withJS=MediaWiki%3ATofawikiHelper.js&requestingPage=%D9%88%DB%8C%DA%A9%DB%8C%E2%80%8C%D9
[14:34:02] %BE%D8%AF%DB%8C%D8%A7%3A%D8%AF%D8%B1%D8%AE%D9%88%D8%A7%D8%B3%D8%AA+%D8%A7%DB%8C%D8%AC%D8%A7%D8%AF+%D9%85%D9%82%D8%A7%D9%84%D9%87+%28%D8%B1%D8%A8%D8%A7%D8%AA%DB%8C%DA%A9%29&editintro=%D8%A7%D9%84%DA%AF%D9%88%3A%D8%A8%D9%87+%D9%88%DB%8C%DA%A9%DB%8C%E2%80%8C%D9%81%D8%A7%2F%D8%A7%D8%AF%DB%8C%D8%AA%E2%80%8C%D9%86%D9%88%D8%AA%DB%8C%D8%B3&uselang=en
[14:34:10] oh
[14:34:17] too long
[14:36:59] halfak:
http://bit.ly/1Rp4Bau
[14:37:18] wait for some seconds and then click on show preview
[14:38:22] What does that JS do?
[14:38:29] Call an external service for translation?
[14:38:44] that calls a service in labs that we made, that service collects lots of information (from en.wp, wikidata, GND, ...) and gives back the whole page
[14:39:43] it doesn't translate, it generates
[14:40:18] (in some parts it does but it generates the text fully)
[14:44:08] http://bit.ly/1REPf0H
[14:44:12] service ^
[14:44:39] Gotcha. That's pretty cool actually.
[14:47:45] the last thing: it's not a service: it's several services, one for articles of cities, one for human bios, etc. I maintain some of them while my friends do other ones. 30% of new articles in fa.wp is being made using this tool now
[14:53:47] shoot, the web node doesn't have proper security settings set up
[14:53:57] I need to delete it and re-create it
[15:00:14] "30% of new articles in fa.wp is being made using this tool now" That's pretty awesome.
[15:03:45] :)
[15:08:30] akosiaris: one question: scb doesn't exist in beta project, should I use sca or make a node for worker instead (we have two sca nodes, but their settings and services enabled in them, citoid and graphoid are the same)
[15:18:29] afk for a while
[15:40:58] Amir1|afk: scb and sca don't differ much even in production (as far as your purposes go). sca is trusty and only exists to host some non migrated yet to jessie services, scb is jessie. Other than that (and some nodejs version changes - which are irrelevant on this one) they are the same. It's fine to re-use sca in beta
[15:55:57] thanks akosiaris
[16:01:23] Amir1: think we could get it to beta working this week ?
I sure would like to
[16:01:49] I can make it working by tomorrow :)
[16:02:12] the only thing I need to handle is SSH credentials
[16:02:44] https://www.irccloud.com/pastebin/tvFR67gq/
[16:03:07] akosiaris: this is the puppet, I don't know how to handle my public key
[16:04:28] ah the keyholder part ..
[16:04:37] yeah that part is a bit more convoluted
[16:06:42] lemme check if we actually got it working in beta
[16:06:43] thanks
[16:18:56] I've put up an etherpad for to-do of this work: https://etherpad.wikimedia.org/p/ores_prod_todo
[16:19:08] feel free to add/remove anything you want
[16:30:18] https://gerrit.wikimedia.org/r/280247
[16:30:30] (we need to make proper changes to the fabfile)
[17:08:01] halfak|lunch: https://github.com/wiki-ai/ores-wikimedia-config/pull/50
[17:08:16] when you're back
[17:08:50] Amir1: thanks for pushing a lot of these through with akosiaris :)
[17:11:29] \o/
[17:12:14] I'm happy I'm oding something, I can't wait until this happens
[17:12:19] *doing
[17:12:32] :)
[17:25:25] puppets in beta fail, https://wikitech.wikimedia.org/w/index.php?title=Special:NovaInstance&action=consoleoutput&instanceid=a00ae311-40cd-4883-a6d3-e5f1816ac005&project=deployment-prep&region=eqiad
[17:25:34] unable to locate two packages
[17:27:05] ncdu, dstat
[17:47:05] that's just fantastic, ncdu and dstat are already installed in redis and web nodes but puppet gives unable to locate error :))))
[17:52:14] maybe the error is something else
[17:56:55] these packages are in puppet: modules/base/manifests/standard_packages.pp
[18:07:13] akosiaris: https://github.com/wiki-ai/ores-wikimedia-config/pull/50#issuecomment-203014071 Is it okay to make the deploy repo in diffusion would be ores/config instead of ores/deploy?
[18:07:53] fixing that without breaking our current environment is virtually impossible
[18:08:17] breaking for how long ?
it might be ok to schedule a maint window
[18:08:28] I originally called it config to match mediawiki-config, fwiw
[18:09:03] depends on halfak|lunch
[18:09:08] heh, so those repo paths in https://github.com/wiki-ai/ores-wikimedia-config/pull/50/files
[18:09:15] are not going anyway to be the repo paths in production
[18:09:51] so I am not sure what we are talking about here
[18:10:00] akosiaris: one is the nltk dir
[18:10:11] https://github.com/wiki-ai/ores-wikimedia-config/pull/50/files#diff-77ae19360cfaa1aaaea7086fab249bf4L8
[18:10:27] and the other one is https://gerrit.wikimedia.org/r/#/c/280247/3/modules/ores/manifests/base.pp
[18:10:40] (my patch to remove the repo git clone)
[18:11:16] it's ores/deploy in scap configs
[18:11:26] yeah, the path in production actually is going to be /srv/deployment/ores/deploy for that repo
[18:11:50] the venv that will be constructed by pip installing the wheels will go elsewhere obviously
[18:12:09] but these are already parameterized in puppet
[18:12:49] so are overrideable per environment
[18:12:53] hmm
[18:12:59] okay, I get it
[18:13:11] let me make new patch sets and show them to you
[18:13:35] I am not sure what the problem you are trying to solve is tbh
[18:14:17] I'm trying to remove the git clone in puppet
[18:14:37] and move it to fabfile in order to keep it for labs instances
[18:15:00] so just remove the git::clone resource ?
[18:15:11] no need to change anything else
[18:15:29] that's what I'm doing now
[18:15:37] ok
[18:16:06] I was thinking in the mean time let's use ores/deploy instead of ores/config since our mirror would be like this (and consistency between labs and prod)
[18:32:16] o/
[18:32:31] So, I'm not sure we're stuck on the folder name ores/config, but it does make sense.
[18:32:44] Naming the folder the same as the branch is a little silly, right?
[18:34:11] akosiaris & Amir1 ^
[18:34:32] You can also choose the folder name during the git clone
[18:34:54] halfak: I don't have any preferences over that but I was told that it's standard to use repo/deploy
[18:35:11] yeah we try to standardize on repo/deploy for services
[18:35:17] so ores/deploy in this case
[18:35:21] As a folder name?
[18:35:28] yeah, nothing to do with the branch
[18:35:30] it seems in prod it'll be like ores/deploy and in labs it can be ores/config
[18:35:44] I think the only issue here is nltk directory
[18:35:47] ^ Let's just make it ores/deploy everywhere then.
[18:35:59] (in config folder)
[18:36:26] halfak: that's not possible without breaking, I guess
[18:37:35] Amir1, we can coordinate the move with some other downtime.
[18:38:03] kk
[18:39:00] one thing: we can merge these patches for now
[18:39:11] and make another one for moving to ores/deploy
[18:40:23] who is deadpool?
[18:40:26] :))))
[18:40:45] https://etherpad.wikimedia.org/p/ores_prod_todo
[18:44:44] Amir1, thinking about this now. Is there anything in puppet that references these folders?
[18:45:07] modules/ores/manifests/base.pp
[18:45:46] Great. So if we (1) change puppet, (2) change fabfile.py and (3) "mv config deploy" & restart, that should do it.
[18:47:51] as default value of a parameter
[18:47:51] halfak: should we move config to deploy (3) in targets one by one?
[18:48:02] Amir1, yeah, I think so.
[18:48:40] okay
[18:48:48] I can help with that
[18:49:33] halfak: what about the first question. Should we do another PR/change set or I need to do it in these ones?
[18:49:53] Depends on when we're doing it.
[18:50:02] I fly out to Jerusalem tomorrow
[18:50:12] So I'm not sure I want to schedule downtime soon.
[18:50:20] have a safe flight
[18:50:24] :)
[18:50:25] :)
[18:50:34] * halfak thinks
[18:50:46] I wonder if we can move the dir without experiencing downtime.
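Step (3) of the plan at 18:45:46, the `mv config deploy` on each target, could look something like the sketch below. It is an assumption-laden illustration, not the project's fabfile: `rename_checkout` is a hypothetical helper, and the service restart is deliberately left out because the log does not say how it is done. It does not address the open question of avoiding downtime.

```python
import tempfile
from pathlib import Path


def rename_checkout(base, old="config", new="deploy"):
    """Rename base/old to base/new and return the new path.

    Refuses to overwrite an existing target, so a host that was already
    migrated (or half migrated) is noticed instead of silently clobbered.
    """
    src = Path(base) / old
    dst = Path(base) / new
    if dst.exists():
        raise FileExistsError(f"{dst} already exists")
    src.rename(dst)  # same effect as `mv config deploy` inside base
    return dst


if __name__ == "__main__":
    # Demonstrate on a throwaway directory rather than a real checkout.
    with tempfile.TemporaryDirectory() as base:
        (Path(base) / "config").mkdir()
        print(rename_checkout(base).name)
```

Running it host by host, as suggested at 18:47:51, keeps the remaining targets serving while one is being moved and restarted.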
[18:50:51] I can do it while you are away
[18:51:00] Might end up being tricky
[18:53:02] hmm
[18:53:05] your call
[18:59:30] Let's not do it while I'm away. I'd like to not merge the change until we are ready.
[18:59:46] At least the part that switches config for deploy
[19:00:06] Wait... I forgot to ask, what does not merging this change block?
[19:00:08] Amir1, ^
[19:01:03] moving to prod
[19:01:21] we can live with ores/deploy in prod and ores/config in labs
[19:01:34] but once the git clone is there
[19:01:48] we can finish the change in prod
[19:02:15] (run puppets in target since they can't talk to github.com)
[19:02:22] halfak: ^
[19:03:20] *we can't
[19:04:36] Gerrit will mirror github, right?
[19:05:12] yes
[19:05:30] I don't about about progress of that
[19:05:40] s/about/know
[19:06:25] but it's not a good idea to use git clone in puppet
[19:06:53] these parts are already being done in scap3
[19:07:41] and in labs it's not needed except in case of initializing the server and in that case PR #50 is adding this functionality to the fabfile
[19:12:11] Amir1, how do deploys from git happen?
[19:13:08] you login into tin.eqiad.wmnet (or its counterpart in beta)
[19:13:23] go to /srv/deployment
[19:14:02] git clone phabricator.org/ores/deploy.git /srv/deployment/ores/deploy
[19:14:13] cd /srv/deployment/ores/deploy
[19:14:13] Oh,, so you do the git clone yourself?
[19:14:25] in tin yes
[19:14:30] then you do deploy
[19:14:42] that makes the tin a git repo for targets
[19:15:17] so targets do "git clone ssh://tin.eqiad.wmnet/ores/deploy.git etc"
[19:15:56] Lost now.
[19:16:10] What exactly is happening. How do we get code there without cloning?
[19:16:15] It looks like you are cloning
[19:16:41] we clone to an instance
[19:16:56] (that instance is made to do that)
[19:17:36] other instances like workers and web nodes actually get files from that instance
[19:17:58] (keep in mind tin acts like an internal web server)
[19:19:13] so "git clone https://deployment-tin.deployment-prep.eqiad.wmflabs/ores/deploy.git" works inside beta instances
[19:19:34] but not to the outside world
[19:21:06] So, the git clone is not happening in puppet?
[19:21:27] But it still somehow happening on the web/worker nodes?
[19:21:49] no, it happens in scap (for targets) and by hand (in tin)
[19:22:03] scap runs the git clone
[19:22:10] (in targets)
[19:22:17] But git clone will only happen once, right?
[19:22:38] Once per host
[19:22:43] no, it happens in every target
[19:22:47] yes
[19:23:37] for other ones scap does git fetch or git pull (not sure of that but I know it does)
[19:24:02] updates
[19:24:35] and for initializing the server, git clone is being done scap puppets
[19:25:14] OK. That makes sense.
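The "clone once per host, then only update" behaviour described between 19:21:49 and 19:24:02 can be sketched as below. This is not how scap is implemented; `sync_command` is a hypothetical helper that just picks the git command a target would need, and the choice of `git fetch` over `git pull` is an assumption, since the log itself is unsure which one scap uses. The URL and path in the usage example are the ones quoted in the conversation.

```python
import os


def sync_command(repo_url, dest):
    """Return the git command (as an argument list) needed to sync dest.

    If dest already holds a checkout (it contains a .git directory), only an
    update is needed; otherwise this is the first deploy to the host and a
    full clone is required.
    """
    if os.path.isdir(os.path.join(dest, ".git")):
        # Existing checkout: update it in place.
        return ["git", "-C", dest, "fetch", "origin"]
    # First deploy on this host: clone from the deployment server.
    return ["git", "clone", repo_url, dest]


if __name__ == "__main__":
    cmd = sync_command(
        "ssh://tin.eqiad.wmnet/ores/deploy.git",
        "/srv/deployment/ores/deploy",
    )
    print(" ".join(cmd))
```

The function only builds the command and does not run it, so the decision logic can be checked without network access or a real repository.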
[19:25:33] So really the only change is that we'll move git clone to the "initialize server" part of scap/fabfile.py
[19:25:54] exactly
[19:26:02] if you check the PR 50
[19:28:01] {{merged}
[19:28:02] }
[19:29:40] \o/
[19:29:46] awesome
[19:29:49] thanks
[19:30:25] akosiaris: since the PR 50 is merged it would be great if you check this https://gerrit.wikimedia.org/r/#/c/28024
[19:31:21] halfak: I still have trouble enabling roles for beta nodes
[19:31:28] that's pretty strange
[19:32:34] https://wikitech.wikimedia.org/w/index.php?title=Special:NovaInstance&action=consoleoutput&instanceid=a00ae311-40cd-4883-a6d3-e5f1816ac005&project=deployment-prep&region=eqiad
[19:38:03] hmm, it seems it's the issue for sudoers and user www-data
[19:53:10] I guess I know how to fix that
[19:56:26] halfak: now it returns "Error: Could not retrieve catalog from remote server: Error 400 on SERVER: Must pass server to Class[Role::Labs::Ores::Redisproxy] at /etc/puppet/modules/role/manifests/labs/ores/flower.pp:3 on node deployment-ores-web.deployment-prep.eqiad.wmflabs"
[19:56:40] redis proxy is not set properly
[20:00:54] I should do it in hiera I guess
[20:04:21] redis instance failed with this error "Starting MTA:2016-03-29 20:01:08 Exim configuration error in line 16 of /etc/exim4/...4.conf"
[21:10:45] Amir1 that gerrit change, https://gerrit.wikimedia.org/r/#/c/28024 makes zero sense to me. Probably c/p mistake ?
[21:14:45] akosiaris: https://gerrit.wikimedia.org/r/280247
[21:14:47] sorry
[21:14:50] :(
[21:15:31] ok thanks, I 'll have a look and merge tomorrow
[21:16:13] great
[21:16:28] I work on setupping redis in beta in the mean time
[22:38:52] ok, I fixed redis proxy setup, I need to run the redis itself
[23:06:47] okay, it seems, web is good, redis is good and sca01 also contains the worker :)
[23:07:24] I need to add scap::target to see if it works, the only thing here is the key holder
[23:07:55] I think I need to use saltmaster and minion master(???)
but I have no idea how they work
[23:08:05] go to bed