[15:24:02] Hey folks! Was sitting here with my client disconnected :|
[15:24:05] hehe
[15:24:13] No worries, I’m deep in PHP hell.
[15:35:12] Working on emails.
[15:35:23] Then I have reviews for codezee
[15:56:52] I’m done with validation, so have 3 PRs to review
[16:00:01] Nice.
[16:00:26] Just starting in on the OneVsRest stuff
[16:02:02] Awesome, and def a higher priority.
[16:02:27] OMG, I haven't started up my text editor in SO LONG
[16:02:32] It feels good :)
[16:02:47] hahaha
[16:03:34] akosiaris: I’m a bit uneasy about where we left the ores* cluster; it’s still doing production Celery work, but not web work, and AFAIK the last change was a failed rollback.
[16:03:38] Maybe harmless?
[16:10:16] awight: in a meeting, will be with ya in about 40-50 mins
[16:10:38] akosiaris: ty, not a huge rush but I’d like to stabilize by the end of day, one way or another...
[16:12:51] wiki-ai/revscoring#1415 (ensemble - 4d0a0c7 : halfak): The build was broken. https://travis-ci.org/wiki-ai/revscoring/builds/341015525
[16:19:29] Scoring-platform-team (Current), revscoring, artificial-intelligence: OneVsRest Classification for revscoring - https://phabricator.wikimedia.org/T185896#3927992 (Halfak) Added a test to highlight an issue.
[16:25:32] Reviewing complete. Working on a pitch to the board to continue to fund our work :)
[16:25:43] https://etherpad.wikimedia.org/p/how_ml_saved_wikipedia
[16:32:24] awight: I am around
[16:32:33] so the rollback succeeded on the second try
[16:32:40] which is .... worrisome
[16:33:02] the fetch_batch_size: 3 setting was used in both cases
[16:33:18] and it's low enough so I can't say it is to blame
[16:33:45] now the other worrisome thing
[16:33:56] the rollback took 15:59:03 Finished deploy [ores/deploy@f7e23f4]: T171851 (duration: 04m 20s)
[16:33:56] T171851: Reimage ores* hosts with Debian Stretch - https://phabricator.wikimedia.org/T171851
[16:34:25] and most of that seems to have been the fetch stage :-(
[16:34:37] That’s pretty slow, yeah...
[16:34:44] I am not sure what on earth scap needs to fetch in an already fetched revision
[16:34:48] and fetch should have been instantaneous
[16:34:54] exactly
[16:35:08] 15:54:43 Running remote deploy cmd ['/usr/bin/scap', 'deploy-local', '-v', '--repo', 'ores/deploy', '-g', 'default', 'fetch', '--refresh-config']
[16:35:08] 15:54:43 Using key: /etc/keyholder.d/deploy_service.pub
[16:35:08] ores/deploy: fetch stage(s): 100% (ok: 18; fail: 0; left: 0)
[16:35:08] 15:58:13 Running remote deploy cmd ['/usr/bin/scap', 'deploy-local', '-v', '--repo', 'ores/deploy', '-g', 'default', 'config_deploy', '--refresh-config']
[16:35:11] that's the output btw
[16:35:25] as you can see 3.5 mins goes to fetching
[16:35:42] I’ll look at the log for fun
[16:36:18] that being said, I think we are practically done with the migration
[16:36:33] there are a few open points, but mostly cleanup
[16:36:47] oh wait, I think I know what’s wrong.
[16:36:55] Rollback still involves a full virtualenv rebuild.
[16:37:12] taking a look in: scap deploy-log -f scap/log/scap-sync-2018-02-12-0009.log
[16:37:40] I thought that too
[16:37:49] but isn't that happening in the check phase?
[16:38:40] checks happen after each phase, I’m pretty sure that’s in the fetch-check
[16:38:52] the detailed log certainly shows a lot of fetching though
[16:39:06] but, maybe a 5s delay for each host?
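Editor's note: the "3.5 mins" figure quoted above can be checked against the two "Running remote deploy cmd" timestamps in the pasted scap output. This is a minimal sketch, assuming the HH:MM:SS prefix shown in the paste and that both lines fall on the same day; the helper names `timestamp` and `stage_seconds` are hypothetical and not part of scap.

```python
from datetime import datetime

# The two lines pasted at 16:35:08. The fetch stage is taken to run from the
# first command until the config_deploy command starts (an assumption).
FETCH_START = (
    "15:54:43 Running remote deploy cmd ['/usr/bin/scap', 'deploy-local', '-v', "
    "'--repo', 'ores/deploy', '-g', 'default', 'fetch', '--refresh-config']"
)
CONFIG_START = (
    "15:58:13 Running remote deploy cmd ['/usr/bin/scap', 'deploy-local', '-v', "
    "'--repo', 'ores/deploy', '-g', 'default', 'config_deploy', '--refresh-config']"
)

# Total reported by scap: "duration: 04m 20s".
TOTAL_SECONDS = 4 * 60 + 20


def timestamp(line):
    """Parse the leading HH:MM:SS of a log line."""
    return datetime.strptime(line.split(" ", 1)[0], "%H:%M:%S")


def stage_seconds(start_line, next_line):
    """Seconds between the start of one stage and the start of the next."""
    return (timestamp(next_line) - timestamp(start_line)).total_seconds()


fetch_seconds = stage_seconds(FETCH_START, CONFIG_START)
fetch_share = fetch_seconds / TOTAL_SECONDS

# prints: fetch stage: 210s (3.5 min), 81% of the deploy
print(f"fetch stage: {fetch_seconds:.0f}s ({fetch_seconds / 60:.1f} min), "
      f"{fetch_share:.0%} of the deploy")
```

The same subtraction works for any pair of stage-start lines, which makes it easy to see which stage dominates a slow deploy.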
[16:39:25] 15:55:21 [ores1009.eqiad.wmnet] : process started
[16:39:26] 15:56:21 [ores1002.eqiad.wmnet] : process completed
[16:39:34] ugh fetching it is, then...
[16:39:53] 15:56:54 [ores2001.codfw.wmnet] Revision directory already exists (use --force to override)
[16:39:53] 15:57:50 [ores1004.eqiad.wmnet] : process completed
[16:39:54] 15:58:13 [ores1006.eqiad.wmnet] : process completed
[16:39:59] Maybe it’s just not logging the fetch-check?
[16:41:04] no… the way checks.yaml is set up at the moment, it’s only a promote check
[16:41:07] Scoring-platform-team (Current), WMF-Communications: Pitch: How ML saved Wikipedia - https://phabricator.wikimedia.org/T187210#3968015 (Halfak)
[16:41:20] Scoring-platform-team (Current), WMF-Communications, artificial-intelligence: Pitch: How ML saved Wikipedia - https://phabricator.wikimedia.org/T187210#3968027 (Halfak)
[16:41:53] One unusual thing about this repo is that we’re aggressively pruning old revisions
[16:42:13] This config setting, fwiw: cache_revs: 3
[16:42:35] I did check btw that the deploy-cache directory did have the revision before rolling back
[16:43:36] ok ty
[16:43:48] I hate that the log seems to report two revisions
[16:43:52] 15:54:42 [tin] Deploying Rev: HEAD = a9618e1e9d6ca170be4f87a8a482f983f4277278
[16:43:52] yet
[16:43:55] 15:54:42 [tin] Started deploy [ores/deploy@f7e23f4]
[16:45:05] which is HEAD^
[16:46:05] ugh
[16:46:07] 18:57:52 [ores1008.eqiad.wmnet] Rolling back from revision None to 8d95adedfa0831f9c5b76879cb7ce25dc500281c
[16:46:11] That’s from last night’s log
[16:46:39] Was yesterday’s failure too much parallelism?
[16:46:55] ah that's me saying yes to scap's rollback prompt after my rollback failed
[16:47:01] Scoring-platform-team (Current), WMF-Communications, artificial-intelligence: Pitch: How ML saved Wikipedia - https://phabricator.wikimedia.org/T187210#3968077 (Halfak) FYI: @EdErhart-WMF & @MelodyKramer I'm thinking of writing this up for the Wikimedia Blog. I think it could be a fun part of a mu...
[16:47:21] yeesh
[16:47:46] akosiaris: What do you think? Should we block on a releng investigation, or go forward with this “good-enough” situation?
[16:48:48] I can tell you that tin did not complain so I don't think it was too much parallelism. At least not the one I've met before
[16:49:10] awight: I think we should proceed. We've held this back long enough and all these are clearly scap issues
[16:49:12] It’s weird that I don’t see the virtualenv being rebuilt
[16:49:20] which we can't know when they will be solved
[16:49:24] akosiaris: I’m okay with that, too.
[16:49:32] and on the plus side, both uwsgi and celery seem to work fine
[16:49:36] :)
[16:49:51] we've been essentially scoring on the new cluster for a day now with no issues
[16:50:24] Yeah the scary part is deployment and rollback, and that’s scary no matter what. This just makes it more unpredictable.
[16:50:36] If the worst-case scenario is that we have to roll back twice…
[16:50:47] We can just get dramatic and disable the ORES extension.
[16:50:56] s/twice/>1/
[16:51:01] harr. yeah
[16:51:38] fwiw, I’ve installed the scap vagrant environment so I can cause myself pain without anyone knowing.
[16:51:40] it's also true we do have other avenues for reacting to a full-blown mess as you point out
[16:52:04] If this is really just rollback, one workaround might be to explicitly deploy an earlier revision.
[16:52:31] You were demonstrating that normal deployment takes about 5 minutes, too?
[16:52:32] so just to be clear. what I did was create a dummy commit
[16:52:46] deployed it (~6 mins IIRC)
[16:52:55] and then did a git reset --hard HEAD^
[16:52:58] got it, so the actual git changes were minimal.
[16:52:58] and deployed again
[16:53:34] if that triggered some edge case in scap, I wouldn't know
[16:53:43] but it did sound like a valid way to roll back
[16:53:57] just make the offending commit dissappear and deploy again
[16:54:03] Scoring-platform-team (Current), ORES: Preliminary deployment of ORES to new cluster - https://phabricator.wikimedia.org/T185901#3968116 (Halfak) Looks like this is now done.
[16:54:03] disappear*
[16:54:44] Ah, I see now that “scap deploy” doesn’t have any explicit rollback feature.
[16:55:14] I could have also said -r HEAD^ I guess
[16:55:18] Scoring-platform-team (Current): Host Google-News-word2vec.bin publicly - https://phabricator.wikimedia.org/T185147#3907561 (Halfak) a:Halfak See: https://analytics.wikimedia.org/datasets/archive/public-datasets/all/ores/assets/
[16:55:29] or preferably specify the revision
[16:56:15] Good to know, though, that’s the same way I’d be using scap.
[16:56:26] Scoring-platform-team (Current), ORES, Operations, Patch-For-Review: Reimage ores* hosts with Debian Stretch - https://phabricator.wikimedia.org/T171851#3478159 (Halfak) It looks like this is done. Is that right?
[16:57:23] awight, what task can I point to in order to represent your work on mw-ext-JADE?
[16:58:02] E.g. is there a task for getting basic validation working?
[16:59:47] Scoring-platform-team (Current), JADE: Build prototype JADE extension - https://phabricator.wikimedia.org/T187216#3968157 (awight)
[16:59:50] Scoring-platform-team (Current), JADE: Build prototype JADE extension - https://phabricator.wikimedia.org/T187216#3968167 (awight) a:awight
[17:00:02] There’s one...
[17:00:30] akosiaris: Let me know when you want to try switching over the web pool, depooling scb*, what-have-you.
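Editor's note: the rollback approach described between 16:52 and 16:55 (dummy commit, `git reset --hard HEAD^`, or naming `HEAD^` as the revision) can be tried in a throwaway repository. This is a rough sketch, assuming `git` is on the PATH; it only shows which revision each option resolves to and does not invoke scap. The helper names `git` and `dummy_commit_then_reset` are hypothetical.

```python
import subprocess
import tempfile


def git(repo, *args):
    """Run a git command inside `repo` and return its trimmed stdout."""
    cmd = ["git",
           "-c", "user.name=demo",
           "-c", "user.email=demo@example.org",
           "-c", "commit.gpgsign=false",
           *args]
    result = subprocess.run(cmd, cwd=repo, check=True,
                            capture_output=True, text=True)
    return result.stdout.strip()


def dummy_commit_then_reset():
    """Return (good_rev, dummy_rev, parent_of_dummy, head_after_reset)."""
    with tempfile.TemporaryDirectory() as repo:
        git(repo, "init", "-q")
        git(repo, "commit", "-q", "--allow-empty", "-m", "known-good revision")
        good_rev = git(repo, "rev-parse", "HEAD")

        # The "dummy commit" that gets deployed and then rolled back.
        git(repo, "commit", "-q", "--allow-empty", "-m", "dummy commit")
        dummy_rev = git(repo, "rev-parse", "HEAD")

        # Option 1: leave history alone and name the parent explicitly.
        parent_of_dummy = git(repo, "rev-parse", "HEAD^")

        # Option 2: make the offending commit disappear from the branch.
        git(repo, "reset", "-q", "--hard", "HEAD^")
        head_after_reset = git(repo, "rev-parse", "HEAD")
    return good_rev, dummy_rev, parent_of_dummy, head_after_reset


if __name__ == "__main__":
    good, dummy, parent, head = dummy_commit_then_reset()
    print("both options point at the known-good revision:",
          parent == good and head == good)
```

Both options end at the same commit; the difference is that the reset rewrites the local branch, while naming the revision leaves history intact, which may be why specifying the revision was called preferable.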
[17:00:36] great thanks
[17:00:51] awight: ok I pool the rest of the ores* hosts now
[17:00:56] I'll pool
[17:00:58] wicked
[17:01:09] I'll leave the depooling of scb for tomorrow though
[17:01:37] Works for me.
[17:09:26] Scoring-platform-team (Current): Support for word2vec on ORES deployment - https://phabricator.wikimedia.org/T187217#3968196 (Sumit)
[17:34:40] akosiaris: Should I be seeing web worker action on ores*?
[17:56:03] Scoring-platform-team (Current), WMF-Communications, artificial-intelligence: Pitch: How ML saved Wikipedia - https://phabricator.wikimedia.org/T187210#3968403 (MelodyKramer) Looks good @halfak. I added a few points of clarification in your outline. Please let us know when it's ready for a look.
[18:10:10] doing kid things, then probably sleeping off a cold.
[18:41:15] Scoring-platform-team (Current), editquality-modeling, User-Ladsgroup, artificial-intelligence: Train/test reverted model for Catalan Wikipedia - https://phabricator.wikimedia.org/T182611#3968526 (Townie) We are finished with the [[ http://labels.wmflabs.org/stats/cawiki/46 | current labelling ca...
[19:17:26] meetings complete!
[19:17:34] Oh crap there's more coming.
[19:17:38] Taking lunch!
[19:56:24] Back into meetings :(
[21:13:22] Scoring-platform-team (Current), JADE: Build prototype JADE extension - https://phabricator.wikimedia.org/T187216#3969249 (awight)
[21:14:53] Captured some EventBus stuff.
[21:15:00] out again,
[21:17:43] wiki-ai/jade#7 (events - bb4f792 : Adam Wight): The build passed. https://travis-ci.org/wiki-ai/jade/builds/341141556
[21:29:24] Scoring-platform-team (Current), editquality-modeling, User-Ladsgroup, artificial-intelligence: Train/test reverted model for Catalan Wikipedia - https://phabricator.wikimedia.org/T182611#3969369 (Halfak) Wow! That was fast! We'll get to work on the next steps.
[21:30:03] Scoring-platform-team, ORES: Train and test damaging/goodfaith models for Catalan Wikipedia - https://phabricator.wikimedia.org/T186749#3969375 (Halfak)
[21:30:19] Scoring-platform-team, ORES, editquality-modeling, artificial-intelligence: Train and test damaging/goodfaith models for Catalan Wikipedia - https://phabricator.wikimedia.org/T186749#3969377 (Halfak)
[21:50:58] Scoring-platform-team (Current), Wikilabels, editquality-modeling, Bengali-Sites, artificial-intelligence: Edit quality campaign for Bengali Wikipedia - https://phabricator.wikimedia.org/T174878#3969484 (Halfak) a:Halfak
[21:52:00] Scoring-platform-team (Current), Wikilabels, editquality-modeling, Bengali-Sites, artificial-intelligence: Edit quality campaign for Bengali Wikipedia - https://phabricator.wikimedia.org/T174878#3575726 (Halfak) Hi @Bodhisattwa, could you translate "Edit quality (5k revisions)" to Bengali for...
[22:33:02] Scoring-platform-team (Current), Wikilabels, editquality-modeling, Bengali-Sites, artificial-intelligence: Edit quality campaign for Bengali Wikipedia - https://phabricator.wikimedia.org/T174878#3969555 (Halfak)
[22:37:03] Scoring-platform-team (Current), Wikilabels, editquality-modeling, Bengali-Sites, artificial-intelligence: Edit quality campaign for Bengali Wikipedia - https://phabricator.wikimedia.org/T174878#3969558 (Halfak) The campaign shows up here: http://labels.wmflabs.org/ui/bnwiki/ Once I have tha...
[22:37:13] Scoring-platform-team (Current), Wikilabels, editquality-modeling, Bengali-Sites, artificial-intelligence: Edit quality campaign for Bengali Wikipedia - https://phabricator.wikimedia.org/T174878#3969559 (Halfak)
[23:00:41] OK I think I'm ready to declare victory for the day.
[23:00:45] Have a good one folks.