[13:38:53] o/
[13:48:22] YuviPanda, I think I want to do the ORES deploy. now a good time?
[13:48:58] halfak: yes
[13:49:03] halfak: we already put it on staging, right?
[13:49:05] kk.
[13:49:06] Yes
[13:49:08] halfak: this is just to production now
[13:49:10] halfak: k
[13:49:17] And staging has been getting hammered with precached :)
[13:50:12] nice
[13:51:47] * halfak watches updates run on celery workers
[13:52:04] Looks like weird compiling numpy
[13:52:07] *we're
[13:52:12] aaagh sigh
[13:52:19] sometimes my fingers just type the wrong word
[13:52:38] that might be a side effect of the ranges in requirements.txt - if they don't match exactly pip goes off
[13:52:44] Yup
[13:52:48] which is stupid
[13:52:55] it should just be like 'oh it already satisfies the range'
[13:52:57] and shut up
[13:53:01] but oh well
[13:53:05] Yeah.
[13:53:11] that's why I don't like ranges
[13:53:19] anyway, when we switch to debs we won't use it at all
[13:53:26] It would be better if we could specify some packages should always be upgraded
[13:53:35] Then we could just drop the --upgrade
[13:53:42] And get better behavior out of the ranges.
[13:53:56] Really, I think this could be done manually in the fabric file.
[13:54:16] deploy-compile is a nice time to reflect on deploy
[13:54:31] heh
[13:55:24] e.g. we could have update_git do the "pip install -r requirements" and then "pip install -r core-requirements.txt --upgrade"
[13:55:34] Or something like that
[13:56:09] it'll be purely debian packages from next week ayway
[13:56:11] *anyway
[13:56:14] since no pip in prod
[13:57:13] kk
[14:02:54] * halfak waits harder
[14:04:45] halfak: it might not succeed.
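[Editor's note: the split-requirements idea floated at 13:55 — install the ranged third-party dependencies normally, then force-upgrade only the first-party packages — could be sketched roughly as below. The file name `core-requirements.txt` follows the chat; the helper itself is hypothetical, not the real fabfile.]

```python
# Sketch of the proposed update_git change: two pip invocations instead of
# one. Ranged deps get no --upgrade (pip leaves any in-range installed
# version alone, avoiding numpy/scipy recompiles); first-party packages
# like revscoring/ores are always upgraded. Illustrative only.

def pip_commands(venv="/srv/ores/venv"):
    """Return the pip invocations the deploy step would run, in order."""
    pip = f"{venv}/bin/pip"
    return [
        # in-range deps: no --upgrade, so satisfied versions are kept as-is
        f"{pip} install -r requirements.txt",
        # first-party packages we always want at the latest release
        f"{pip} install -r core-requirements.txt --upgrade",
    ]
```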
I don't know if numpy will actually compile under our setup
[14:05:07] halfak: a 'fix' is to match the debian version with requirements.txt and not have it specify a range, which IMO is the 'right' thing to do and causes the least amount of problems
[14:05:07] It compiled on the staging server
[14:05:27] YuviPanda, then we tie other users to our debian version.
[14:05:33] Or the pain of compiling
[14:05:34] yes and the bad effect of that is?
[14:05:52] so you're self-inflicging the pain of compiling?
[14:06:05] everyone else who is using alternate methods is also going to run into the exact same problem
[14:06:16] of pip going for the latest one no matter what version they have
[14:06:20] I like the solution of specifying the range and not forcing --upgrade better.
[14:06:27] ok :)
[14:06:31] Isn't that better?
[14:06:32] have fun compiling :)
[14:06:33] no
[14:06:43] it isn't. version upgrades should be a deliberate action
[14:06:45] Well.. it will only compile when the version lands outside of that range.
[14:06:59] no
[14:07:06] it compiles when it finds a 'better' match
[14:07:07] afaict
[14:07:09] so if you specify
[14:07:13] Indeed. Right now, version upgrades deliberately happen all the time.
[14:07:25] YuviPanda, it does not. Only when you force --upgrade
[14:07:36] yes but if you don't do --upgrade if you change any versions at all it doesn't care
[14:07:38] Otherwise, it's happy with whatever is in the range.
[14:07:54] so if you force an upgrade from 0.4.8 to 0.5.1, if you don't specify --upgrade it won't actually upgrade
[14:08:16] Indeed. We would have to set the range appropriately.
[14:08:34] But I see what you mean. We're going to have to compile once unless the "best" version is the one that is installed.
[14:08:51] yes, so in effect you are tying people to the 'best' version, which according to pip is whatever's latest on pypi
[14:08:57] I think the trick may be to make sure that the packages are installed first.
[14:08:58] regardless of what the user has installed
[14:09:08] you need --upgrade since otherwise you won't get the version upgrades you do want
[14:09:08] YuviPanda, only if they also use --upgrade
[14:09:22] if you change the version of an unrelated component
[14:09:24] I can specify ranges to get the version upgrades we want
[14:09:25] you need to run --upgrade
[14:09:42] anyway, I don't care since it isn't going to be a problem for us from next week :)
[14:09:45] E.g. 0.4.8 is no good. We need 0.5.1. Just run it without --upgrade
[14:09:57] it doesn't actually upgrade that afaict
[14:09:59] ANd change the range to >= 0.5.1
[14:10:02] but maybe that's been fixed now
[14:10:02] Yes it will
[14:10:06] Long time
[14:10:12] It's been like that
[14:10:14] so why are we specifying --upgrade?
[14:10:27] if that works we can drop --upgrade from our fabfile
[14:10:31] and see if that works
[14:10:35] Because we want 'revscoring' and 'ores' to be upgraded.
[14:10:41] Yeah. That's what I'm saying.
[14:10:47] Just run --upgrade on a couple of versions.
[14:11:01] what do you mean by 'a couple of versions'?
[14:11:05] *couple of packages
[14:11:07] they're all the same requirements.txt file no?
[14:11:08] aaah
[14:11:10] can it do that?
[14:11:15] Yeah.
[14:11:17] withotu separating them out
[14:11:18] ?
[14:11:21] pip install revscoring --upgrade
[14:11:31] right but you're seperating those out now
[14:11:32] pip install -r upgrade-me.txt --upgrade
[14:11:39] Yes
[14:11:39] which is ugly but works, I guess
[14:11:57] anyway, is it still compiling?
[14:12:38] Yeah
[14:12:40] lol
[14:12:46] scipy now though
[14:12:48] heh
[14:12:53] this gonna happen on each node :P
[14:13:04] Does it do the nodes sequentially?
[14:13:06] yes
[14:13:09] No
[14:13:10] no parallelism
[14:13:11] :(
[14:13:16] * halfak will wait a long time.
[14:14:46] I started with the workers.
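[Editor's note: the behavior being argued about here can be captured in a toy model. Without `--upgrade`, pip only acts when the installed version fails the requirement (halfak's 0.4.8 → 0.5.1 example); with `--upgrade`, it fetches the newest release even when the installed one already satisfies the range. This sketch compares dotted versions as integer tuples and assumes the latest PyPI release satisfies the range — it is an illustration, not pip's real resolver.]

```python
# Toy model of pip's install/upgrade decision for a single package.

def parse(version):
    """Turn '0.4.8' into (0, 4, 8) for ordered comparison."""
    return tuple(int(part) for part in version.split("."))

def pip_would_install(installed, minimum, latest_on_pypi, upgrade=False):
    """Return the version pip would end up installing, or None if it
    leaves the currently installed version alone."""
    if (installed is not None
            and parse(installed) >= parse(minimum)
            and not upgrade):
        return None  # requirement already satisfied; no action
    # unsatisfied requirement, missing package, or explicit --upgrade:
    # pip fetches the newest matching release from PyPI
    return latest_on_pypi
```

So bumping the minimum to `>= 0.5.1` forces the upgrade even without `--upgrade`, while an in-range install stays untouched unless `--upgrade` is passed.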
[14:15:47] halfak: it might take less time to split out the requirements file and have the --upgrade ripped out
[14:15:59] Yes.
[14:16:20] * halfak look into that
[14:17:32] So, --upgrade seems to do it's work recursively.
[14:18:01] Yeah... It looks like I do specify minimum version for 'revscoring' in ores.
[14:18:09] So. --upgrade might be totally useless.
[14:18:31] I just have to make sure that the version of 'revscoring' I need is in pip when I stage/deploy
[14:18:56] YuviPanda, I'm considering dropping the --upgrade right now before it gets to the next worker.
[14:19:00] good idea/bad idea?
[14:19:05] halfak: do it and let's see what happens
[14:26:45] * halfak runs a quick test against staging
[14:26:56] Well... that was fast
[14:27:14] halfak: :D does it still work properly?
[14:27:47] Seems to
[14:28:02] Looks like we still need a new version of scipy
[14:28:45] it should always just use the version from debs...
[14:28:58] Yeah. it should.
[14:29:02] anyway, a problem that'll go away for me next week! :D
[14:30:32] The deb version is 0.14.0
[14:30:51] We have 0.14.1 as a minimum specified.
[14:30:54] for some reason
[14:30:59] Because we hate free time
[14:31:26] heh
[14:31:46] I wonder why we did that
[14:31:56] Do you remember anything about scipy 0.14.0?
[14:32:02] Maybe it's just a typo
[14:32:32] https://github.com/wiki-ai/revscoring/commit/4f1c1051a341221ecfa1f92f3ec4e65771bceff8
[14:32:46] "jiggle dependencies to fit available python deb packaging"
[14:32:52] looks like we had a regression in the range.
[14:33:03] halfak: ah, I think awight made it because scikitlearn wasn't building with 0.14.0
[14:33:09] Gotcha
[14:33:14] I'll leave it then.
[14:33:23] Looks like we can't just use the debs after all :/
[14:33:32] Or not the jessie deb
[14:33:35] 0.14.0
[14:33:39] halfak: but our current version of scikitlearn doesn't care...
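[Editor's note: the reason the jessie deb didn't help is visible in one comparison — scipy 0.14.0 falls just below the `>=0.14.1,<0.16.999` range pinned in revscoring, so pip ignores the system package and builds from source. A pure-stdlib illustration of that range check; real specifier matching lives in pip's resolver.]

```python
# Check whether a version satisfies the scipy range from the chat log.

def as_tuple(version):
    return tuple(int(part) for part in version.split("."))

def in_scipy_range(version, minimum="0.14.1", upper="0.16.999"):
    """True if `version` satisfies >=0.14.1,<0.16.999."""
    return as_tuple(minimum) <= as_tuple(version) < as_tuple(upper)
```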
[14:34:28] halfak: I can import 0.14.1 now to our repo if you want
[14:34:33] that's what we were gonna do
[14:35:00] kk :)
[14:35:09] kk -> 'yeah, import it'?
[14:37:13] Thought you already decided.
[14:37:20] I'm not sure of the implications
[14:37:34] But yeah. it should stop some of the upcomping compiles
[14:37:39] *upcoming
[14:38:25] heh
[14:38:26] doing
[14:40:34] Should I cancel my deploy?
[14:40:37] YuviPanda, ^
[14:41:43] halfak: yah
[14:42:09] I just installed 0.16.0 in the virtualenv on ores-worker-03
[14:42:18] Oh wait!
[14:42:23] It looks like all the workers are done
[14:42:34] 1-4 are updates to this new scipy
[14:42:44] and the fab deploy_celery worked for them
[14:44:09] halfak: hah. ok
[14:44:11] hmm
[14:45:19] halfak: ok, so no need for me to do any updates now?
[14:45:45] Hold off for a bit
[14:45:57] * halfak deploys to the web servers
[14:46:51] ok
[14:49:05] Hmm... And we're down.
[14:49:13] oh
[14:50:26] am looking
[14:51:00] --- no python application found, check your startup logs for errors ---
[14:51:02] ugh
[14:51:11] Yeah. Looks like a failure on startup.
[14:51:14] Weird.
[14:51:19] It's like the models didn't get pushed.
[14:51:25] But I did push master to deploy.
[14:51:36] * halfak checks it all.
[14:52:01] $ git push origin master:deploy
[14:52:02] Everything up-to-date
[14:52:04] hmm
[14:52:16] halfak: https://dpaste.de/JSOn
[14:52:18] exception
[14:52:23] yeah
[14:52:24] oh you already spotted
[14:52:25] ok
[14:52:43] halfak: is origin the 'right' repo?
[14:52:45] Yeah... That looks like old models to me.
[14:52:50] aka wiki-ai rather than halfak/?
[14:52:54] (if you use those)
[14:53:11] It was for staging
[14:53:19] staging just pushes master
[14:53:21] doesn't use the deploy branch
[14:53:22] oh
[14:53:23] right
[14:53:26] but remote is ok
[14:53:27] hmm
[14:53:38] yeah...
[14:53:55] halfak: commit ed7df0d8a8c7964b39fa82573ee0a2e17c9753f8
[14:53:59] is the last commit
[14:54:00] i see
[14:54:09] that's from aug 8
[14:54:57] halfak: https://github.com/wiki-ai/ores-wikimedia-config/tree/deploy
[14:54:59] is very old
[14:55:36] weird.
[14:55:46] git push origin master:deploy should work, no?
[14:55:48] I'm manually fixing up -web-01 now
[14:55:54] so we can bring it back up
[14:55:59] kk thanks
[14:56:01] but I don't know what this would've done to the celery workers
[14:56:09] I suspect the workers are going to be borked too.
[14:56:12] es
[14:56:13] yes
[14:56:37] halfak: is your logal master up to date?
[14:56:38] *local
[14:56:49] aaargh, uwsgi takes so long to reload :(
[14:57:47] halfak: different error now
[14:58:00] Sep 08 14:57:11 ores-web-01 uwsgi[16795]: AttributeError: Can't get attribute 'process_tokens' on '/srv/ores/venv/lib/python3.4/site-packages/revscoring/datasources/parent_revision.py
[14:58:25] Yeah. It's not just the models. The whole thing is behind.
[14:58:31] I don't understand why it is borked.
[14:58:36] I reset this to master
[14:59:00] run pip install -r requirements.txt again
[14:59:17] halfak: so I think maybe, it didn't actually upgrade revscoring
[14:59:38] Yeah
[14:59:52] aaargh
[14:59:59] it's fuckign compiling numpy again
[15:00:13] Did you do --upgrade?
[15:00:25] OK. that was very weird.
[15:00:46] halfak: no
[15:00:47] But I think I somehow made a mistake merging into deploy
[15:01:32] Weird.
[15:01:46] The dependency jiggle actually widened the range for numpy.
[15:01:50] * YuviPanda sobs a bit about virtualenv and pip
[15:02:05] I really think that ranges for versions is going to make everyone's lives hell
[15:02:10] and more complex to reason about
[15:02:19] since there is a large number of variations now
[15:02:24] but let's get out of this first
[15:02:25] YuviPanda, only if it doesn't work the way we intended.
[15:02:26] Yes
[15:02:28] please
[15:02:32] solve first problems first
[15:02:33] it never dos :)
[15:02:34] *does
[15:02:41] I don't know what I can do atm
[15:02:45] outside of waiting for it to compile
[15:03:00] hmm actually
[15:04:21] So.. yeah. it shouldn't be compiling something new :\
[15:04:43] * halfak just checked worker -04
[15:04:56] halfak: https://github.com/wiki-ai/revscoring/tree/no-ranges
[15:05:28] So... I dunno how you're going to get that picked up.
[15:05:41] You'll need to switch ORES dependency to this branch
[15:07:02] Downloading/unpacking scipy>=0.14.1,<0.16.999 (from revscoring==0.5.0)
[15:07:04] wtf pep
[15:07:08] pip
[15:07:16] Yeah.
[15:07:18] WTF
[15:07:20] sudo -u www-data ./venv/bin/pip install git+git://github.com/wiki-ai/revscoring.git@no-ranges
[15:07:22] should work
[15:07:36] INSTALL ALL THE SCIPYs
[15:07:47] even if i speciy a version
[15:07:48] aaargh
[15:07:50] \err
[15:07:52] a commitid
[15:08:28] * halfak watches scipy compile on the web nodes
[15:09:21] Now that the deploy repo is in good shape
[15:09:29] wtf
[15:09:37] even if I git clone it
[15:10:59] halfak: ah, apparently it had cached what the requirements for 'revscoring==0.5.0' meant and won't let go
[15:11:29] No worries. Let's finish this one by the book and fix the next one.
[15:11:51] Better to have a little bit of predictable downtime than bork a node.
[15:12:25] * halfak is running deploys
[15:12:33] Looks like the workers are easy
[15:12:41] Just the web nodes need some compiling.
[15:13:09] Looks like the celery workers are online and as numerous as I hoped.
[15:13:14] halfak: https://dpaste.de/PyyN
[15:13:19] halfak: different error now
[15:13:23] on the web node
[15:13:26] Still deploying to web
[15:13:38] ok
[15:13:53] compiling on web-01
[15:15:23] * YuviPanda stops fretting about the downtime
[15:15:48] It's OK. We'll need to get better at doing this soon though.
[15:16:00] No one has showed up waving a flag yet
[15:16:34] The good news is that the precached script runs *really* fast when we're down.
[15:16:35] I've come around to the general ops position that deploying things with pip is going to give you only nightmares :)
[15:16:35] ;)
[15:16:38] heh
[15:17:21] c'mon. Finish compiling dang it.
[15:23:56] Done compiling. Still down
[15:24:54] restarting uwsgi: Job for uwsgi.service failed. See 'systemctl status uwsgi.service' and 'journalctl -xn' for details
[15:25:52] Restarting takes foreever!
[15:26:05] yeah
[15:26:06] it does
[15:26:11] Do you know why?
[15:26:17] Starting up a local flask server is fast
[15:26:29] Even with all the models
[15:26:43] Maybe it needs to start up a few forks
[15:27:44] Looks like it isn't updating the ores package
[15:27:50] Yet it's supposed to pull right from master
[15:28:13] I think that's what --upgrade does and not specifying it doesn't do?
[15:28:22] God damn it.
[15:28:26] WTF
[15:28:27] (could be totally wrong, but that's why I put in the --upgrade there)
[15:28:32] No I bet you are right
[15:28:41] "I already installed from this repo. I'm sure I don'
[15:28:44] t need to do it again"
[15:28:47] and that's why I was saying we need to do --upgrade and then that fucks up ranges and other things
[15:28:48] yeah
[15:28:50] you can fake it
[15:28:58] sudo -u www-data /srv/ores/venv/pip uninstall ores
[15:29:01] and then again
[15:29:54] * halfak watches scipy compile again
[15:29:58] WTF!
[15:30:27] WHYU
[15:30:33] WHY WOULD YOU DO THAT?
[15:31:17] halfak: with or withotu --upgrade?
[15:31:26] withotu
[15:31:35] so one option...
[15:31:40] sudo -u www-data /srv/ores/venv/bin/pip install git+https://github.com/wiki-ai/ores@master#egg=ores
[15:31:49] did you do the uninstall first?
[15:32:18] yes
[15:32:39] halfak: so one option is to 1. pin the versions properly, 2. rm -rf /srv/ 3. puppet run 4. initialize_server
[15:33:04] Why did this all work fine on the staging server though?
[15:37:52] halfak: not sure.
[15:38:00] halfak: (sorry, got pulled into a meeting)
[15:38:07] halfak: ok so thoughts?
[15:38:14] halfak: should we just bring up a new node?
[15:38:15] It's up!
[15:38:32] I just cloned in ores and ran setup.py myself (as www-data)
[15:38:49] Still borked workers though
[15:39:07] sigh
[15:39:08] ok
[15:39:08] I think that in the short term, I'm going to get rid of this github.com requirement and make it pull from pip
[15:39:14] ok
[15:39:19] Then I'm going to run it by the books again.
[15:40:13] halfak: ok
[15:40:21] halfak: let's see how that goes
[15:42:30] OK. Didn't compile that time thankfully
[15:43:14] Looks like we're installing scipy on the workers though
[15:43:23] le sigh
[15:43:28] on web?
[15:43:46] Yeah. Didn't compile on 01 where I last manually compiled
[15:43:52] compiling on web-02 though
[15:44:04] And on the workers
[15:44:04] ok
[15:44:11] compile compile compile
[15:44:19] halfak: I wonder if that's because it is cached the no-range thing I tried
[15:44:34] * halfak doesn't think pip should be caching ANYTHING
[15:44:47] indeed
[15:45:30] Looks like out LB isn't behaving right
[15:45:39] One web is returning Internal server errors and the other isn't
[15:45:47] It seems both responses are being returned
[15:45:50] though the LB
[15:45:57] http://ores.wmflabs.org/
[15:45:59] oh
[15:46:03] Hit that and refresh a few times
[15:46:26] i see it
[15:46:35] We get a 500 with that Internal Server Error
[15:47:49] halfak: manually unpooled -02
[15:48:12] kk. It should be up again soon though.
[15:48:18] yeah
[15:48:23] Should our LB detect the issue and route all traffic to -01?
[15:48:23] tell me when it is and I'll bring it back up
[15:48:26] yes
[15:48:29] it should
[15:48:31] not sure why it didn't
[15:48:34] kk. Where to file bug?
[15:48:37] oh
[15:48:40] phab revscoring
[15:48:41] ?
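[Editor's note: the LB puzzle here — one backend serving 500s while still pooled — is what the `/health` endpoint discussed next addresses: a cheap URL that never touches models or scoring, so it only fails when the app itself is down, and a 500 there means "depool". A minimal WSGI sketch for illustration; the endpoint name and responses are assumptions, not the real ORES app.]

```python
# Hypothetical health-check endpoint: the load balancer polls /health,
# which does no real work, so occasional legitimate 500s elsewhere in
# the app never cause a depool.

def app(environ, start_response):
    if environ.get("PATH_INFO") == "/health":
        # always cheap, always 200 while the process is alive
        start_response("200 OK", [("Content-Type", "text/plain")])
        return [b"ok"]
    # everything else would be handled by the real application
    start_response("404 Not Found", [("Content-Type", "text/plain")])
    return [b"not found"]
```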
[15:48:44] I think it didn't count the 500 as failure
[15:48:49] ha
[15:48:54] it just re-routes on lack of connectivity
[15:48:55] yeah, pha
[15:48:55] b
[15:49:55] Hmm... That might actually be fine.
[15:50:13] We might sometimes 500 legitimately because 1/1000 requests produces a weird error.
[15:50:17] halfak: yeah
[15:50:20] halfak: so what we usually do
[15:50:25] halfak: is to have a simple url
[15:50:27] like /health
[15:50:31] that'll always be non 500
[15:50:37] and use that as health check
[15:50:38] Gotcha
[15:50:43] we have Special:BlankPage in mw for this
[15:50:49] if that fails, depool
[15:50:50] The base of ores.wmflabs.org is basically that right now.
[15:50:56] right
[15:51:50] https://phabricator.wikimedia.org/T111806
[15:51:53] YuviPanda, ^
[15:52:09] * halfak watches numpy compile for the 50th time
[15:55:33] restarting uwsgi on -02 now. It should be good to go.
[15:55:51] And the workers finished too.
[15:56:02] * halfak crosses fingers
[15:56:07] halfak: ok, should I repool it?
[15:56:11] Yeah
[15:56:20] I suppose we need a rollback strategy if this happens again next time.
[15:56:29] yes
[15:56:32] we also need a postmorteem
[15:56:34] IT WORKS!
[15:56:37] to figure out what exactly happened
[15:59:36] halfak: both pooled now
[16:11:12] Yeah.
[16:11:24] So. I'm just running my final tests.
[16:11:31] It looks like precached is doing good.
[16:18:07] I'll write up something today and post in on meta:ORES
[16:18:32] halfak: thanks
[18:05:19] halfak: hey, do you approve such scripts, (I talk about coding style and nasty hacks I'm doing)
[18:05:24] https://www.irccloud.com/pastebin/kAzmfxuZ/
[18:07:13] If you don't see a clearly better way, then hacks are better than nothing. Any code you feel queasy about should have a # TODO: type comment next to it. Test thoroughly.
[18:07:29] Try to incorporate some likely failures in your test cases.
[18:07:42] e.g.
"TheLetterQ10" [18:08:19] ok :) [18:10:19] hai [18:10:54] o/ jenelizabeth [19:10:18] https://commons.wikimedia.org/wiki/File:Revscoring_team_photo_(Wikimania_2015).jpg [19:10:29] Amir1, ToAruShiroiNeko_, YuviPanda ^ [19:10:44] Camera is potato [20:12:28] We've processed 7 million scores! [20:13:02] is it possible to break down those stats to e.g. per language? [20:13:52] do these numbers refer to unique revids? [20:15:30] Helder, I don't think we can do breakdowns, no. [20:15:47] They roughly work out to unique revids. [20:16:06] enwiki can get duplicate requests for the same ID for the two models it has. [20:16:12] A score isn't counted if it is cached :D [20:16:22] So this are only unique model applications. [20:18:53] ok [20:19:34] halfak, where do the stats come from? a db? some logs? [20:19:52] Celery Flower [20:20:00] Its a monitoring webapp [20:20:10] You need to tunnel to the web-01 node to view it. [21:02:24] So does anyone know if there is anyone currently invested in making wikinews non-trivial? [21:03:18] I guess I mean "regularly curated" .... [21:05:03] It seems like it could be such a good resource but appears to be mostly neglected. [21:06:35] Wikinews never took off. I think that's because WP:Verifiability doesn't work there. [21:07:02] * aetilley googles WP:Verifiability [21:07:20] oh [21:07:29] I think the general notion of "verifiability" powers most successful wikias too [21:08:17] It's interesting that Slovenian holds first place for number of articles. Ahead of English (at second) by more than a factor of three. [21:09:12] Yeah... That I don't understand well. [21:10:54] aetilley, see https://meta.wikimedia.org/wiki/User:LauraHale/Wikinews_Content_Import_Analysis [21:10:59] and https://meta.wikimedia.org/wiki/Research:Wikinews_Review_Analysis [21:17:19] Wikinews has a deadline, unlike Wikipedia. Wikipedia also happens to be better at news, oddly enough. 
[21:17:49] And only like 7 people write it so all the articles are about like Australian basketball
[21:21:24] ah
[21:23:47] I suppose Tufeki's comments on Ferguson got me thinking about potential uses for something like wikinews. But perhaps nothing valuable in the near future.
[21:24:23] Wikinews would be amazing for in-depth, news magazine stuff if they were up to it. But the website is controlled by a clique that sets all the rules.
[21:25:09] Well some people would say that about wikipedia. :)
[21:25:22] But yeah, a matter of degree.
[21:25:41] Wikipedia is controlled by many, many cliques. Wikinews is small enough to have one central controlling clique.
[21:26:12] Interesting.
[21:39:26] https://www.youtube.com/watch?v=0GIwTG8V-Ko&feature=youtu.be this is brilliant
[21:42:27] lol. Three Commas.