[00:32:10] https://github.com/wiki-ai/ores/pull/133/files
[00:32:24] Got annoyed seeing those 404s in the log.
[00:39:23] Here we go.
[00:40:12] I feel weird deploying new code to the worker nodes when the changes shouldn't affect them.
[00:40:30] But I don't want to think about the weird errors we might see if they were mismatched.
[00:40:36] Would be fun to debug, for sure.
[00:40:50] So deploying to all the nodes anyway is what I'm doing. :)
[00:42:19] halfak: that's also what we do for mediawiki
[00:47:08] Now to wait for the super-long reboots
[00:47:16] * halfak curses uwsgi.
[00:47:24] Should try to figure out what is happening.
[00:47:50] halfak: yeah :|
[00:47:56] Hmm... when it hangs, I see a huge number of uwsgi processes using a ton of CPU
[00:48:20] yuvipanda, I feel like we're maybe making a bad call by not using threads in uwsgi.
[00:48:31] shared memory is killing us.
[00:48:50] right now, we're using "processes" or forking
[00:49:04] We don't do any CPU work in uwsgi at all, but we load huge models.
[00:49:26] Hmmm... I wonder if I can just set a flag for loading the models on the worker vs. uwsgi.
[00:49:40] That'd probably solve the problem and we can keep using processes.
[00:49:58] * halfak uses -ai as his rubber duck
[00:50:07] and yuv.ipands :(
[00:50:09] indeed
[00:50:13] I'll stop pinging. :)
[00:50:15] nah :)
[00:50:17] it's ok
[00:50:53] If my understanding of how uwsgi implements threads is correct, it won't be of much use to us, since it'll either mean it is restricted to one process and has GIL limitations, or it'll spawn multiple processes anyway
[00:51:05] halfak: and remember the uwsgi stall isn't ores specific - other uwsgi services seem to have it too
[00:51:23] ^ good point
[00:51:36] gil limitations would be OK
[00:51:43] To a large extent
[00:52:18] well
[00:52:20] step one is to profile :)
[00:52:23] and figure out what they're doing
[00:52:39] we'll probably find surprising conclusions.
[00:52:47] the uwsgi processes?
[00:53:40] yeah
[00:54:05] well, and the python as well, while it is running under uwsgi
[00:54:39] Hmm... seems ores-web-03 is crawling.
[00:54:54] Deploy script is running *really* slow on it
[00:54:57] Even before uwsgi
[00:55:06] Resource usage looks OK
[00:55:11] 0.0 wa
[00:56:09] could be the instance is hosed
[00:56:18] if you see a lot of use in 'top'
[00:56:22] for irqbalance / ksoftirqd
[00:57:46] Nope. Hmmm
[00:57:56] Seems pip will only use 5% of CPU
[00:58:03] The machine is only using 58% total
[00:58:25] Yeah. Compiling scipy with 5% cpu lol
[00:58:25] fun
[00:58:27] Oh wait.
[00:58:30] That's not even compiling
[00:58:33] Just extracting!
[00:59:30] * halfak checks the other web nodes.
[01:00:11] Well... they aren't doing anything!
[01:00:16] What!?
[01:02:38] yuvipanda, I'm seeing lots of ksoftirqd on the web nodes, but at low CPU/mem. That a problem?
[01:03:15] yeah, it is a vague thing that seems to be happening these days.
[01:03:23] No idea if anyone actually evaluated it when I was gone
[01:03:26] can you file a bug?
[01:04:12] yuvipanda, this looks bad?
[01:04:13] http://pastebin.ca/3408755
[01:04:34] yeah
[01:04:39] but I've no idea what that process does
[01:04:46] kk
[01:05:48] OK if I kill this instance now?
[01:05:54] Or should I leave one alive?
[01:06:15] halfak: yeah, leave one alive probably.
[01:06:22] halfak: I think restarting them would fix them
[01:06:33] kk
[01:13:34] weird. ores-03 suddenly calmed down after the uwsgi process restarted.
[01:14:02] web-03 that is
[01:14:15] but the reboot doesn't seem to be working. lol
[01:14:32] Take my wikitech clicks
[01:15:37] (reboot_started)
[01:16:14] O_O (reboot_started)
[01:16:42] welcome to wikitech, where we give you hope and then slowly let it collapse on itself
[01:16:54] O_Olol
[01:17:04] also my flight just got cancelled!
[01:17:21] Booo!
[01:17:23] and then they announced that the flight cancellation was only partially true, and they'll update in 7 minutes
[01:17:32] lol
[01:17:37] wat
[01:17:57] What airline?
[01:18:17] Reboot complete!
[01:19:23] delta
[01:21:02] seems ores-web-03 was good after a restart.
[01:21:17] But I still see a bunch of ksoftirqd
[01:22:45] FYI: looks like round-trip to celery to perform a trivial computation and return in uwsgi is ~ 0.5 seconds.
[01:22:57] That should be way faster
[01:23:28] Oh... The homepage is taking 0.3 secs to load.
[01:23:29] lol
[01:25:38] I'm outta here. Good luck, yuvipanda. If you want to fly to MN instead and crash in my spare bedroom, let me know :)
[01:26:16] :D
[01:26:18] will do, halfak
[13:58:30] o/ akosiaris
[13:58:36] downtime in T-2 minutes?
[13:58:44] o/ schana
[13:58:45] ^
[13:59:23] akosiaris, when you get here, please also merge https://gerrit.wikimedia.org/r/#/c/278898/
[13:59:31] It'll be nice to have that done during the scheduled downtime.
[13:59:34] hi halfak
[13:59:46] Good morning :)
[14:01:07] ok. starting
[14:01:38] * halfak posts updates on the wiki.
[14:05:51] * halfak monitors error rates
[14:07:22] akosiaris, did you get the second patch too?
[14:07:28] all 3 of them
[14:07:31] Great
[14:08:05] hmmm redis on TCP 6380 is taking its sweet time to stop
[14:08:13] Heh. And it looks like we're down.
[14:10:20] Looks like we're getting "kombu.connections.InconsistencyError" on the web nodes
[14:12:53] Looks like 6380 is the regular cache.
[14:12:56] It's likely large
[14:13:34] Now getting redis.errors.ConnectionError on the web nodes
[14:13:41] (all expected)
[14:15:24] akosiaris, redis back up yet?
[14:15:41] halfak: almost
[14:16:14] kk
[14:18:34] o/
[14:18:35] halfak: It seems they are rebooting all instances in labs today
[14:18:42] queue is up, bringing cache up
[14:18:43] lol of course they are
[14:18:47] thanks akosiaris
[14:18:52] o/ Amir1
[14:19:02] (error) LOADING Redis is loading the dataset in memory
[14:19:07] lol.. ok in waiting mode
[14:19:18] Is that an error or a notification?
[14:19:31] notification
[14:19:49] kk
[14:20:01] \o/
[14:20:24] Oh! I think we're back online!
[14:20:37] yess
[14:20:37] seems like it
[14:22:02] * halfak finds vandalism while he tests ORES
[14:22:02] https://en.wikipedia.org/wiki/?diff=638307889
[14:23:32] so, that migration means redis is now in AOF mode for persistence
[14:23:39] http://redis.io/topics/persistence
[14:24:08] Hmm... We switched from AOF to RDB in the past
[14:24:34] filesize problems ?
[14:25:06] Maybe. Looking at the old tasks.
[14:25:11] Found this one, but it's not super helpful.
[14:25:14] https://phabricator.wikimedia.org/T122666
[14:25:24] Maybe I had it wrong and we intended to switch from RDB to AOF
[14:25:36] nope
[14:25:41] https://phabricator.wikimedia.org/T121658
[14:26:43] Hmm... If you think we should stick with AOF, that's fine with me.
[14:26:43] not very helpful indeed. what does minimize file usage mean ?
[14:26:47] just file size ?
[14:26:50] Good question.
[14:27:04] I was probably just copy-pasting from a chat with Yuvi
[14:28:21] akosiaris, maybe it was writing both RDB and AOF files.
[14:28:36] i.e. write AOF all the time, with periodic RDB.
[14:28:42] That would potentially speed up restarts.
[14:28:46] er, it does not work that way ...
[14:28:53] kk.
[14:29:00] * halfak wildly speculates.
[14:29:15] well in our setup
[14:29:20] I should have said that
[14:29:22] from the docs "It is possible to combine both AOF and RDB in the same instance. Notice that, in this case, when Redis restarts the AOF file will be used to reconstruct the original dataset since it is guaranteed to be the most complete."
[14:29:25] http://redis.io/topics/persistence
[14:29:35] in our setup it only writes the AOF file
[14:29:36] Weird. So what is the point of the RDB!
[14:29:49] it was the older way
[14:29:55] Anyway, these seem fine to me.
[14:30:03] Rather, going full AOF seems fine to me.
[14:30:46] I'll be monitoring file size and I/O performance on that box and reevaluate
[14:30:52] it's easy to switch back and forth anyway
[14:31:05] kk
[14:31:25] Lol. our http-->https redirect is broken
[14:31:38] Go to http://ores.wmflabs.org
[14:31:48] and you'll get redirected to https://oresweb/
[14:31:49] So...
[14:31:58] er, nope
[14:32:07] I am at ores.wmflabs.org
[14:32:25] ^note the protocol difference
[14:32:32] The Objective Revision Evaluation Service (ORES) is a web service running in Wikimedia Labs that provides machine learning as a service for Wikimedia Projects, yada yada yada
[14:32:38] If you have https everywhere running, it might mess it up
[14:32:39] I mean the redirect does not work at all
[14:32:55] ah yes indeed
[14:32:57] that was it
[14:33:05] ok, let's fix that
[14:33:48] halfak: It gives me proper results too, in http, without https anywhere
[14:33:50] Looks like "${HTTP_HOST}" resolves to "oresweb"
[14:34:06] but stays in http
[14:34:19] Amir1: HTTPS everywhere ?
[14:34:23] I get this on all my browsers.
[14:34:37] http://ores.wmflabs.org --> https://oresweb
[14:34:54] yeah :D
[14:35:23] Seems like we should revert this change akosiaris. Is that difficult?
[14:35:30] nope
[14:35:47] for me and akosiaris it stays at http, it doesn't redirect to ssl
[14:35:54] That's very weird
[14:36:08] Amir1: er, I disabled https everywhere and I have the same symptoms as halfak
[14:37:08] hmm so what if route-if => 'equal:${HTTP_X_FORWARDED_PROTO};http redirect-permanent:https://${HTTP_HOST}${REQUEST_URI}',
[14:37:12] becomes
[14:37:28] route-if => 'equal:${HTTP_X_FORWARDED_PROTO};http redirect-permanent:https://ores.wmflabs.org${REQUEST_URI}', ?
[14:37:39] akosiaris, doesn't sound crazy to me.
[14:37:43] Want to give it a try?
[14:37:56] ok
[14:38:27] akosiaris, should I make the change or are you doing it?
[14:38:32] I am
[14:38:35] kk
[14:38:46] halfak: https://phabricator.wikimedia.org/T130408#2143627
[14:38:57] you can shut down revscoring project :)
[14:39:23] we had one minute down time for mw-revscoring.wmflabs.org
[14:39:36] nice work :)
[14:39:57] btw, I am gonna have oresdbs ready today
[14:40:01] in production I mean
[14:40:13] I am wondering still about the scap3 parts though
[14:40:44] Have you seen Amir1's progress on scap3 and our deployment?
[14:40:51] yup, all 3 tasks
[14:40:57] kk
[14:41:20] akosiaris: I don't have access in prod but if I can help with anything, tell me and I'll do it
[14:41:35] e.g. making configurations, etc.
[14:41:43] oh, it's not about prod. it's about getting it deployed in beta
[14:41:53] the moment scap3 works in beta
[14:42:06] we do the exact same dance in production
[14:42:19] I have shell access in beta and I already deployed some stuff there :)
[14:42:24] grrr, why is gerrit so slow
[14:42:35] I can't upload my change
[14:43:05] So. I need to get on my bike and go to the university
[14:43:26] Would it be OK if I left this work to akosiaris & Amir1 for the next 45 minutes?
[14:43:30] ok
[14:43:36] sure :)
[14:43:38] Great. Sorry to run away.
[14:43:44] I forgot I'm giving a lecture today!
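The two route-if variants quoted above differ only in where the redirect host comes from. Here is a rough Python sketch of that difference. It is not the actual uwsgi configuration: `redirect_target` and its `canonical_host` parameter are hypothetical names, and the environ dict simply mirrors the variables named in the rule (`HTTP_X_FORWARDED_PROTO`, `HTTP_HOST`, `REQUEST_URI`).

```python
def redirect_target(environ, canonical_host=None):
    """Return the HTTPS URL a plain-HTTP request would be redirected to.

    Returns None when the request did not arrive over plain HTTP.
    canonical_host is a hypothetical stand-in for the hard-coded variant.
    """
    if environ.get("HTTP_X_FORWARDED_PROTO") != "http":
        return None
    # Without a hard-coded host, the rule trusts whatever Host the backend saw.
    host = canonical_host or environ["HTTP_HOST"]
    return "https://" + host + environ.get("REQUEST_URI", "/")


# What the backend saw in this chat: the proxy's upstream name, not the public host.
proxied = {
    "HTTP_X_FORWARDED_PROTO": "http",
    "HTTP_HOST": "oresweb",
    "REQUEST_URI": "/scores/",
}

print(redirect_target(proxied))
print(redirect_target(proxied, canonical_host="ores.wmflabs.org"))
```

The hard-coded variant produces the right public URL here, at the cost of ignoring whichever hostname the client actually asked for.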
[14:43:50] no worries, thanks for being around and helping
[14:43:52] I should have biked in this morning :/
[14:43:54] o/
[14:44:00] o/
[14:44:10] akosiaris: so, I try to deploy it in tin
[14:44:19] (tin in beta)
[14:44:35] but I need target(s) for flower, web and workers
[14:44:43] flower ?
[14:44:49] we need flower in production ?
[14:45:00] AFAIK
[14:45:03] hmmm
[14:45:06] it's in the fabfile
[14:45:26] https://github.com/wiki-ai/ores-wikimedia-config/blob/master/fabfile.py
[14:56:33] (CR) Thiemo Mättig (WMDE): "Question only." (1 comment) [extensions/ORES] - https://gerrit.wikimedia.org/r/264608 (https://phabricator.wikimedia.org/T122537) (owner: Awight)
[15:01:01] Amir1: so https://ores.wmflabs.org/scores/arwiki/reverted/19004618/ seems to be working just fine
[15:01:18] \o/
[15:01:18] the HTTP-> HTTPS conversion happens as well
[15:01:27] I think that problem is resolved :-)
[15:01:36] awesome, thank you
[15:01:49] I update the phab card
[15:06:20] (CR) Ladsgroup: [C: -2] Integrate with Special:Contributions (1 comment) [extensions/ORES] - https://gerrit.wikimedia.org/r/264608 (https://phabricator.wikimedia.org/T122537) (owner: Awight)
[15:34:44] o/ Amir1 & akosiaris
[15:34:52] what's the status?
[15:34:55] hey :)
[15:34:58] welcome back
[15:35:08] Looks like I'm still getting redirected to oresweb :/
[15:35:44] akosiaris wants to use beta to deploy, It's rather easy but first we need to determine what we need to do wrt flower
[15:35:59] hard refresh?
[15:36:01] Hmm... I think we should fix our broken service ASAP
[15:36:54] agreed
[15:37:30] Hmm.. Looks like the host is now hard-coded.
[15:37:37] I'm going to run puppet on the web nodes.
[15:40:15] OK. All are updated, but the issue persists. Going to manually restart the web nodes.
[15:41:36] And it works!
[15:41:39] halfak, er, I don't have the problem anymore
[15:41:55] I did not for some time...
[15:42:04] how come you still experienced it ?
[15:42:17] Weird. Maybe my browser is remembering the permanent redirect?
[15:42:36] You'd think that if I put the location in the address bar, it wouldn't make assumptions.
[15:42:44] ah, yes it does
[15:42:53] ok, so that explains it
[15:43:15] Just checked again with wget and it seems to work as expected.
[15:43:25] Also wget handles the redirect natively :)
[15:43:48] the oresweb btw is due to nginx having upstream oresweb {
[15:43:56] in its configuration
[15:43:59] but curl doesn't, isn't it weird?
[15:44:00] er, no
[15:44:06] proxy_pass http://oresweb;
[15:44:25] anyway
[15:44:32] yes curl needs -L IIRC
[15:44:38] akosiaris, ^ was thinking about that on the whole ride in.
[15:44:58] Where is "oresweb" and how did the config we copied it from document that a hostname must be set in such a way?
[15:45:40] so the correct way to fix that
[15:45:49] is to actually preserve the Host: header
[15:45:56] proxy_set_header Host $host;
[15:46:23] akosiaris, Whoops! Didn't mean to assign you the epic task in phab. I must have gotten lost when I posted that comment. Was aiming towards the sub-tasks.
[15:46:32] so that the backend HTTP request has the HTTP Host: header the client originally wanted
[15:46:46] I'll concoct a change and fix this the nice way
[15:47:43] Great! Thank you. I'll copy the notes from this chat into the task
[15:50:01] https://phabricator.wikimedia.org/T130618#2144579
[16:07:08] https://gerrit.wikimedia.org/r/#/c/279133/ merged
[16:07:34] btw, ores in production will have a password, I suppose the software supports that, right ?
[16:13:09] you mean when someone wants to look up a score?
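The Host header point above can be illustrated with a small Python simulation of what a reverse proxy forwards to its backend. This is a sketch, not nginx itself: `upstream_headers` and its parameters are hypothetical names. The one behaviour it relies on is that nginx, unless told otherwise, sends the name from `proxy_pass` (here `oresweb`) as the Host header, and that `proxy_set_header Host $host;` passes the client's host through instead.

```python
def upstream_headers(client_headers, proxy_host, preserve_host):
    """Build the headers a reverse proxy would send to its backend.

    proxy_host is the name used in proxy_pass (for example an upstream block).
    preserve_host=True mimics `proxy_set_header Host $host;`.
    """
    headers = dict(client_headers)
    if preserve_host:
        headers["Host"] = client_headers["Host"]
    else:
        # Default behaviour: the backend sees the upstream name, not the public one.
        headers["Host"] = proxy_host
    return headers


client = {"Host": "ores.wmflabs.org", "X-Forwarded-Proto": "http"}

default = upstream_headers(client, proxy_host="oresweb", preserve_host=False)
fixed = upstream_headers(client, proxy_host="oresweb", preserve_host=True)

print(default["Host"])
print(fixed["Host"])
```

With the header preserved, a backend rule that builds URLs from the Host it received no longer needs a hard-coded hostname.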
[16:16:17] akosiaris: ^
[16:17:12] and also I don't have access in https://wikitech.wikimedia.org/wiki/Nova_Resource:Beta, if you make a target for web and another target for worker, I start playing with them :)
[16:17:28] thank you :)
[16:17:55] if you don't have time, It's okay, I go find myself another thing to play with
[16:20:41] Amir1: er, no I mean that ores should send the AUTH command right after connecting to the redis servers. Otherwise it will not be able to do anything
[16:21:31] I think it does but I'm not so sure, halfak? ^
[16:22:47] Amir1: er, what's your labs username ?
[16:22:56] "Ladsgroup"
[16:23:15] thanks
[16:26:36] akosiaris, a redis password? I'm sure we can support that. We have merge-able configurations.
[16:26:49] Let me confirm how we'll deliver the password to the redis client
[16:27:16] Oh.. We put it in the URL. So yeah.
[16:27:23] We support it :)
[16:27:27] good to know
[16:27:34] so only puppet changes then
[16:27:36] great
[16:27:42] * halfak is proud of our configuration strategy
[16:27:46] It has served us well
[16:27:52] I'm as well
[16:28:15] I didn't contribute to it, that's the reason :D
[16:31:35] :P
[16:31:44] so, it looks like our swagger docs are still broken.
[16:32:03] In firefox, our redirect prints stupid crap instead of JSON :\
[16:32:05] * halfak fixes.
[16:32:16] I made a contribution in swagger, that's the reason :D
[16:32:29] Damn it, Amir1! ;)
[16:32:37] You and your RandomForest classifiers
[16:32:38] :P
[16:33:02] :)))))
[16:33:20] actually that would be the only reason we need to migrate to git_fat eventually
[16:34:08] BTW, did you notice we have a favicon now, Amir1?
[16:34:08] btw. I worked on Adam's patch on core https://gerrit.wikimedia.org/r/#/c/247249/
[16:34:32] yeah, I noticed it during PR merge
[16:34:37] :)
[16:34:50] :) Was fun to put together.
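"We put it in the URL" refers to carrying the redis password inside the connection URL. Below is a minimal Python sketch of how such a URL breaks down, using only the standard library. It does not show how ORES reads its configuration: `redis_settings` is a hypothetical helper, and the host and password are made up. The `redis://:password@host:port/db` shape is the one commonly accepted by redis client libraries.

```python
from urllib.parse import urlsplit


def redis_settings(url):
    """Split a redis:// URL into connection settings, including the password."""
    parts = urlsplit(url)
    if parts.scheme != "redis":
        raise ValueError("expected a redis:// URL, got %r" % parts.scheme)
    return {
        "host": parts.hostname,
        "port": parts.port or 6379,  # 6379 is the default redis port
        "db": int(parts.path.lstrip("/") or 0),
        "password": parts.password,  # None when the URL carries no password
    }


# Hypothetical values, for illustration only.
with_auth = redis_settings("redis://:not-a-real-secret@redis.example.org:6380/0")
without_auth = redis_settings("redis://redis.example.org:6380/0")

print(with_auth)
print(without_auth)
```

Because the password is just part of the URL, switching it on is a configuration change rather than a code change, which matches the "only puppet changes" conclusion above.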
[16:35:20] * halfak prepares a swagger fix to go to staging
[16:35:54] it's like the London Eye
[16:36:08] when it's small
[16:36:11] I think I should round out the corners a bit
[16:38:57] lolwoops. looks like the https redirect breaks staging sort-of
[16:39:05] lol & hardcoding
[17:08:05] I go grab something for dinner, be back in one hour
[17:18:12] hey halfak, are we still blocked on https://phabricator.wikimedia.org/T129420? (for the SoS)
[17:18:28] Amir1, ^
[17:19:49] schana, let's say "no" for now
[17:20:01] okay, halfak
[17:20:09] I don't think we're going to invest in git-(fat|annex|lfs) in the short term
[17:40:00] Swagger docs work again :)
[18:16:33] back
[18:16:39] yes