[15:17:13] o/
[15:17:21] Hey Amir1 & White_Cat
[15:17:45] o/ good morning halfak
[15:17:59] White_Cat, check out https://phabricator.wikimedia.org/T130213
[15:18:18] Note that each one of the blocking tasks has its own blocking task for completing the campaign.
[15:20:31] halfak: I'm writing the deployment guide
[15:20:52] Amir1, I'm debating whether to work on the OS version on our nodes or push on swagger
[15:22:00] where?
[15:22:38] Where to put the deployment guide?
[15:22:40] Wikitech
[15:22:42] For sure
[15:22:59] yeah in wikitech
[15:23:05] I'll save it once I'm done
[15:23:16] kk
[15:23:34] I mean where are you debating about swagger and OS
[15:23:41] I think I should look into this issue with server OS version, but I'm not going to try to solve it today
[15:24:01] Oh! I'm debating with myself -- internally :)
[15:24:24] On the one hand, I need to dig into this issue because it is preventing us from deploying.
[15:24:58] On the other, I probably shouldn't try to replicate our cluster and do a switch-over without some substantial support.
[15:25:20] I guess I can make the switch on the load balancer.
[15:25:27] That will allow me to stand up a parallel cluster.
[15:25:48] I was meaning to stand up an additional web node anyway.
[15:26:01] But for a little while, we're going to be pushing against our quota in labs.
[15:26:05] :D
[15:26:26] I think I'll document the issues I think are happening and then work on swagger.
[15:26:35] I can use the new web node as a test bed.
[15:27:39] I could also work on https://github.com/wiki-ai/ores/pull/115
[15:27:46] Meh. I'll leave that to arlolra
[15:27:55] Nice to have people hacking on ORES :)
[15:28:02] <3 collaborators
[15:28:11] \o/
[15:30:05] Hmmm... I actually need to test this on one of our live nodes.
[15:30:12] Oh! We have old staging... I think
[15:30:29] Nope
[15:30:31] I killed it
[15:30:32] :S
[15:30:42] I was going to say that
[15:30:48] * halfak sets up a virtualenv for running tests
[15:31:45] lol.
Can't run git. Not enough memory.
[15:32:01] Looks like I'm going to do some minor hacks to hiera first
[15:47:39] OK. I am now testing with my own user on web-01
[15:51:45] OK... hmm... I was able to install the wheels and start up a single uwsgi instance with all of the models loaded.
[15:51:56] So, I think that it could have been a fluke yesterday?
[15:53:05] Hmm... It seems like we might be able to try a deployment then.
[15:53:43] Well... that won't help us get to prod faster.
[15:54:12] The main benefit will be that (1) we get the better nlwiki model deployed, (2) arwiki and plwiki get `reverted` models.
[15:54:20] * halfak thinks
[15:54:35] Yeah... let's try this out.
[15:54:59] It's probably a good time of day to have damage detection systems going offline (assuming we even would)
[15:55:18] halfak: https://wikitech.wikimedia.org/wiki/Nova_Resource:Ores/Deployment
[15:55:34] It's far from complete but not very bad
[15:57:43] * halfak starts editing
[15:58:16] thanks
[16:00:51] OK. A few minor edits and the commands seem right.
[16:01:00] I like how you split the sections by what you need to do.
[16:01:14] Maybe we could have a high level overview of deploy process with detail beneath.
[16:01:19] I'm imagining the BAG page.
[16:01:21] * halfak gets link
[16:01:40] https://en.wikipedia.org/wiki/Wikipedia:Bots/Requests_for_approval
[16:01:48] That table has gotten very verbose
[16:12:13] Amir1, how do you associate a patchset with a phab task?
[16:13:36] halfak: you need to add a line like "Bug: T12345"
[16:13:37] at the end of commit message
[16:13:37] T12345: Create "annotation" namespace on Hebrew Wikisource - https://phabricator.wikimedia.org/T12345
[16:14:04] halfak: let me get an example
[16:14:25] * halfak tries that
[16:14:29] https://gerrit.wikimedia.org/r/#/c/274912/
[16:14:52] https://gerrit.wikimedia.org/r/#/c/278455/
[16:16:01] you added the wrong number
[16:16:07] 12345 was an example :D
[16:16:11] lolwoops
[16:16:15] halfak: ^
[16:17:11] Fixed
[16:17:19] * halfak feels silly just pasting and not looking
[16:17:30] o/ Platonides
[16:17:33] *plward
[16:17:42] Sorry Platonides. But hi to you too :)
[16:18:35] o/ halfak :)
[16:19:47] working with aetilley on figuring out the best way to train the PCFG on wikipedia revisions without having to manually create tree banks
[16:20:36] plward13, \o/
[16:20:54] Great! I'm glad you two are working together :)
[16:21:36] me too, he's really nice and passionate about the work
[16:21:49] halfak: I didn't know what was the correct phab task for that patchset otherwise I would give you the correct one
[16:22:26] Amir1, no problem. That was my silly mistake :)
[16:23:03] Hmm... Looks like it is taking a long time for the task/patchset to connect
[16:23:09] There must be a bot that runs periodically
[16:23:27] that's pretty fast usually
[16:25:50] * halfak watches web-01 installing our wheels
[16:25:57] On to 02!
[16:26:37] I'm updating the environment and I'll make sure it all works before we try restarting the services.
[16:28:43] awesome
[16:31:29] Oh no. We're blocked on a puppet change
[16:32:43] Amir1, do you know a good way to find all of my open patchsets on gerrit?
[16:33:06] Never mind. Gerrit homepage :)
[16:33:07] https://gerrit.wikimedia.org/r/#/c/278413/
[16:35:43] * halfak tries to find someone to merge this
[16:35:51] I guess I could go on a manual install campaign.
[16:36:50] Arg. I don't want to leave things here.
If, for some reason, we need to restart a service, it won't be able to recover without this puppet change
[16:36:52] grrr.
[16:40:45] OK. Yeah. I think I'm going to do this manually.
[16:44:47] Amir1, thinking about deployment versions.
[16:44:53] It seems like we should use tags
[16:45:06] E.g. after a successful deploy, we can tag the ores-wikimedia-config repo
[16:45:10] Thoughts?
[16:47:28] \o/ web-03 works as expected.
[16:47:41] Even though we haven't really updated the worker nodes yet :D
[16:47:54] OK time to deploy to the rest. Let's see how this goes.
[16:58:20] sorry, I was afk
[16:59:24] halfak: tags in phab or github?
[17:00:00] github
[17:00:02] In the repo
[17:00:09] So the tags would go anywhere the repo goes
[17:00:25] https://phabricator.wikimedia.org/T130463
[17:00:27] Amir1, FYI
[17:00:34] It looks like I will need to switch out our cluster
[17:00:41] * halfak starts up ores-worker-05
[17:00:43] let's do it for a while and see what happens
[17:03:09] The ores service is in a really weird state right now.
[17:03:13] This is nerve-racking
[17:03:19] 2 workers are fine
[17:03:21] 2 are down
[17:03:34] And the web nodes will behave weirdly as soon as we switch.
[17:03:43] But I haven't switched the lb yet
[17:03:55] So we're online and stable so long as I don't do the wrong thing. :/
[17:05:38] Over quota. I'm killing ores-misc-01
[17:05:46] I hope that doesn't lose any work :(
[17:06:21] And now we're overloaded
[17:06:22] arg!
[17:06:42] * Amir1 grabs coffee to help
[17:06:52] Amir1, can you post to the AI mailing list saying that we're doing some maintenance that seems to have taken the service down and we'll post more details soon?
[17:08:31] sure
[17:08:34] in a sec
[17:09:22] * halfak deletes old workers that were down anyway.
[17:11:17] * halfak initializes ores-worker-05/06
[17:12:35] and now ores-worker-07
[17:13:06] I think that the real problem is that we have some weird interactions in pip
[17:13:10] after installing our wheels.
[17:17:30] halfak: done
[17:17:40] Thanks Amir1
[17:17:53] back to the real issue here
[17:18:05] Making progress converting the cluster to jessie 8.3 instances.
[17:18:16] I'm still initializing web and worker nodes.
[17:18:25] 1- have you installed all of the C dependencies
[17:18:30] I think you did
[17:18:34] stupid question
[17:18:36] nvm
[17:18:36] Amir1, yeah.
[17:18:39] heh
[17:18:46] https://phabricator.wikimedia.org/T130463
[17:19:00] Something deep inside sklearn is weird.
[17:20:25] I think I saw this error before
[17:20:47] but I'm not sure where
[17:22:38] halfak: which node is still 8.1?
[17:22:48] (trying to test and see what happens)
[17:22:49] All of them were.
[17:23:10] Oh... um... web-03 is 8.1
[17:23:13] I'll leave it alone.
[17:23:18] Sorry
[17:23:22] web-02
[17:23:30] okay
[17:26:46] !log ores doesn't work with jessie 8.1, looking to find out why https://phabricator.wikimedia.org/T130463
[17:26:46] No hay log abierto en #wikimedia-ai - log on para abrirlo, log list para listar los logs disponibles.
[17:26:46] Not expecting to hear !log here
[17:27:05] Amir1, gotta do that in -labs
[17:27:22] we should make this happen here too
[17:27:26] let me fix that
[17:27:35] !log list
[17:27:35] No hay logs disponibles para #wikimedia-ai.
[17:27:35] Not expecting to hear !log here
[17:29:10] why is it talking Spanish :)))
[17:29:41] can't login to web-02
[17:29:45] logged in web-03
[17:29:53] 8.3
[17:29:57] try web-01
[17:30:01] kk
[17:30:06] I accidentally deleted web-02 instead of 01 ;)
[17:30:24] okay web-01 is in
[17:30:48] Careful with 01
[17:30:54] It's serving stuff right now
[17:30:57] Just a bit
[17:31:04] halfak: I want to be careful, what command should I run?
[17:31:09] fab deploy?
[17:31:14] Ack! No
[17:31:37] Work in your own venv for now
[17:31:48] Soon you'll be able to have full command of the machine with no worries.
[17:31:49] sure
[17:35:32] https://www.irccloud.com/pastebin/RRcBqqU3/
[17:35:35] halfak: ^
[17:37:28] https://www.irccloud.com/pastebin/rPSzno9p/
[17:37:35] halfak: and it's running
[17:37:40] Yeah.
[17:37:44] That's what I saw too.
[17:37:52] it's 8.1
[17:37:53] I'm starting to think that it was due to an issue with the old venv
[17:38:12] you should dispose of the old venv and make a new one
[17:38:18] it's venv
[17:38:35] should we do it with fabfile?
[17:38:55] Amir1, manually fixing everything now
[17:39:03] Will think about long term strategy later :)
[17:39:27] kk
[17:40:56] afk for dinner
[17:41:05] yummy juje kabab
[17:41:12] kk hopefully I'll have this finished up by the time you get back :)
[17:41:18] \o/
[17:44:42] lol this is going to fuck up our metrics.
[17:44:49] We're going to have new machines everywhere
[17:51:45] Alright. 4 workers online
[17:51:49] 2 web nodes online
[17:51:54] lb is aiming at the 2 web nodes
[17:52:08] Let's add a new web node since we were going to do that anyway.
[18:09:25] back
[18:09:33] still eating though :D
[18:11:57] kk
[18:12:02] We're back online, but really slow.
[18:12:13] I think we're getting hammered with requests, so I'm standing up more workers.
[18:12:52] Utilization on our workers is at 100%
[18:19:10] Amir1, so, I've learned that when installing from whl files, we need to clear out the old installed packages first
[18:19:14] pip won't do it for us.
[18:19:47] yes
[18:19:54] we should write that down somewhere
[18:23:01] I'm putting it in the fabfile now
[18:23:15] Also, we're *still* overloaded with an additional worker!
[18:23:19] Someone is hitting us hard.
[18:23:26] We don't have a good way to look into this now.
[18:23:38] * halfak starts up another worker.
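The lesson at 18:19 (clear out old installed packages before installing from wheel files) could be captured roughly as follows. This is only a sketch, not the actual fabfile change: the function name and all three paths are hypothetical, and it assumes the environment is a standard-library `venv` that can simply be rebuilt from scratch rather than cleaned package by package.

```python
import shlex


def fresh_wheel_install_commands(venv_dir, wheel_dir, requirements):
    """Build the commands for a clean install from local wheels.

    Hypothetical helper: it only builds argument lists, so it is safe
    to inspect or log before anything is actually run.
    """
    pip = f"{venv_dir}/bin/pip"
    return [
        # `venv --clear` deletes the contents of an existing environment
        # directory first, so no stale packages survive the reinstall.
        ["python3", "-m", "venv", "--clear", venv_dir],
        # --no-index plus --find-links makes pip resolve packages only
        # from the local wheel directory, never from a remote index.
        [pip, "install", "--no-index", "--find-links", wheel_dir,
         "-r", requirements],
    ]


if __name__ == "__main__":
    # Placeholder paths, not the real layout of the ORES nodes.
    for cmd in fresh_wheel_install_commands(
        "/srv/example/venv", "/srv/example/wheels", "/srv/example/requirements.txt"
    ):
        print(shlex.join(cmd))
```

Rebuilding the whole environment costs more time per node than uninstalling individual packages, but it removes any dependence on what an earlier deploy left behind.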
[18:24:39] probably beta cluster
[18:25:22] Could be
[18:25:32] Would make sense if there was a batch job running right now
[18:27:12] * halfak watches graphite
[18:27:46] halfak: https://phabricator.wikimedia.org/T130463#2136839
[18:27:50] added post mortem
[18:28:23] I think someone who really hates us saw the email :D
[18:28:54] lol
[18:32:02] We now have 6 worker nodes
[18:32:12] The queues are much smaller.
[18:33:02] awesome
[18:33:30] I think we should write a follow up in the AI-l
[18:33:35] +1
[18:33:42] Did you start the incident report?
[18:33:58] Yeah
[18:34:28] let me see if I can enable the "!log" here
[18:34:56] addshore: hey, Can you enable !log here?
[18:36:33] Amir1, I'll be killing web-01. OK?
[18:36:45] sure
[18:43:46] Amir1, I think it is possible we have a performance issue on feature extraction.
[18:43:52] That wasn't thoroughly tested.
[18:43:55] :/
[18:44:06] But qualitatively, it doesn't seem like this performance makes sense.
[18:44:18] let's wait and see
[18:44:25] e.g. I can score a single edit using the `revscoring` utility and it is much faster than ORES is right now
[18:44:32] if it is so poor we revert it
[18:44:33] the `revscoring` utility has no cache
[18:44:36] Yeah
[18:44:46] OMG
[18:44:53] Everything suddenly got more sane!
[18:45:02] Utilization just went down to 50%
[18:45:03] \o/
[18:46:16] \o/
[18:51:01] Found a bug
[18:51:06] Fixing in Ores
[18:51:13] Only affects v1 routes
[18:53:05] Amir1, going to need to update wheels to get this
[18:53:32] ok
[18:53:41] let me merge
[18:54:01] Will have a patchset in a couple minutes
[18:54:26] Getting a really bad network connection to eqiad
[19:02:14] This took a lot longer than expected.
[19:02:28] Amir1, https://gerrit.wikimedia.org/r/278457
[19:03:00] waiting to load
[19:03:27] halfak: delete the old one
[19:03:33] Crap thanks
[19:03:57] :)
[19:05:46] * halfak needs to remember to amend
[19:05:56] OK. Change is made
[19:07:33] Amir1, ^
[19:09:20] !!!
We're being hammered by precaching requests!!
[19:09:27] I figured out our performance issues
[19:09:32] This makes me feel better
[19:11:12] * halfak waits on new wheel
[19:11:59] sorry, was on a phone call
[19:12:06] No worries
[19:12:19] * halfak was just looking to add a tag to our deployment
[19:12:23] merged
[19:12:23] Not sure how we should name it.
[19:12:27] Great
[19:12:38] I'd love wiki-ai 1.1
[19:12:41] Amir1, I kind of want to just tag it with the current date
[19:12:43] or 2016.1
[19:12:49] why 1.1?
[19:12:59] 2016.3 (for the month?)
[19:12:59] starting from 0
[19:13:08] yeah
[19:13:12] Maybe 2016.3.0 for the first deploy this month
[19:13:28] we need to check with PEP440
[19:13:31] Maybe we should do 2016_3.0
[19:13:49] I know 2016.3 or 1.1 is okay
[19:14:00] but not sure with these you are suggesting
[19:14:01] What if we deploy twice in a month
[19:15:08] I like having the date in the version for our deployments
[19:15:26] https://www.python.org/dev/peps/pep-0440/
[19:15:48] per PEP440 it would be better to use 2016.01
[19:15:54] first deploy of 2016
[19:15:57] and we go up
[19:16:41] What about 2016_3.01?
[19:16:54] it doesn't fit in PEP440
[19:16:59] Damn
[19:17:10] we can just not care about it
[19:17:26] Damn. I introduced a typo while fixing ores.
[19:17:29] * halfak face palms
[19:17:34] * halfak slows down
[19:17:36] do this good
[19:18:16] :D
[19:18:21] where?
[19:22:28] model.info() --> modelformat_info(format="json")
[19:22:33] Notice the missing "."
[19:27:10] Amir1, https://gerrit.wikimedia.org/r/#/c/278458/
[19:27:26] deploying was a bad idea today. I should have worked on swagger :S
[19:28:11] {{done}}
[19:28:11] How cool, Amir1!
[19:28:18] you did a fine job halfak
[19:28:21] it happens
[19:28:46] don't worry
[19:29:05] fine is not a good word, you did a great job
[19:29:53] :) Running what I hope to be the last test on staging :)
[19:34:34] All looks good.
Off to deploy
[19:44:15] Deploy takes forever because of these restarts :(
[19:53:07] Looks like web-04 just mysteriously died
[19:59:29] OK it seems that we're good.
[20:00:15] Yup. I'm declaring victory
[20:00:20] Time to update the incident report
[20:00:27] \o/
[20:00:52] I got my email regarding HPI hackathon
[20:01:05] It's official I'm going to Berlin
[20:01:16] Woot! Alright :)
[20:01:18] Amir1, did you link me to your incident report?
[20:02:16] halfak: it's the phab card you posted
[20:02:38] https://wikitech.wikimedia.org/wiki/Incident_documentation
[20:02:42] I'll start one here
[20:02:51] cool
[20:10:34] https://wikitech.wikimedia.org/wiki/Incident_documentation/20160319-Ores
[20:12:24] OK Gotta run.
[20:29:54] o/
[20:30:21] o/ Helder
[20:30:41] Aaron is not around but I hope we can talk
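On the tag-naming question from 19:12 to 19:17 in this log, the candidates can be compared with a small check. This is a simplified sketch: it tests only the dotted-numeric release segment of PEP 440 (no epochs, pre/post/dev releases, or local versions), and both function names are made up for illustration. PEP 440 governs Python package versions, so a git tag is free to ignore it, as suggested at 19:17.

```python
import re

# Release segment only: one or more integers separated by dots.
# Full PEP 440 accepts more forms than this simplified pattern.
_RELEASE_ONLY = re.compile(r"\d+(\.\d+)*")


def is_plain_release(tag):
    """Return True if `tag` is a bare dotted-numeric release like 2016.3.0."""
    return _RELEASE_ONLY.fullmatch(tag) is not None


def deploy_tag(year, month, n):
    """Hypothetical naming scheme: year.month.n, with n counting deploys
    within the month starting from 0, so a second deploy is just n=1."""
    return f"{year}.{month}.{n}"


if __name__ == "__main__":
    for candidate in ["1.1", "2016.3", "2016.3.0", "2016.01", "2016_3.0", "2016_3.01"]:
        verdict = "ok" if is_plain_release(candidate) else "not a plain release"
        print(f"{candidate}: {verdict}")
```

The underscore forms fail because the release segment allows only dots between numbers, which matches the "it doesn't fit in PEP440" reply at 19:16.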