[01:58:06] 10Scoring-platform-team (Current), 10MediaWiki-extensions-ORES, 10MW-1.31-release-notes (WMF-deploy-2018-01-09 (1.31.0-wmf.16)), 10Patch-For-Review, 10User-Ladsgroup: Add models when initializing the table - https://phabricator.wikimedia.org/T184127#3873267 (10awight) This looks like another edge case: *...
[08:40:10] (03CR) 10jenkins-bot: Minor fixes to ORES\Hooks [extensions/ORES] - 10https://gerrit.wikimedia.org/r/401834 (owner: 10Ladsgroup)
[14:15:34] halAFK: We can’t quite make transactional guarantees, but I think we would use this hook to send events from RevisionDelete: https://www.mediawiki.org/wiki/Manual:Hooks/ArticleRevisionVisibilitySet
[14:15:55] * halfak sips his morning coffee
[14:16:27] awight, what guarantee do you want and what's a practical problem with getting it?
[14:16:48] :)
[14:17:13] I want a guarantee that every RevisionDelete event in the Jade: namespace will send us an event.
[14:18:13] The RevisionDelete code as written doesn’t allow the hook to interrupt the workflow. We might be able to add that.
[14:19:06] I’m not even sure that’s the right thing to do, though. We don’t really want to prevent suppression if JADE is down, we want to schedule a job or something to ensure that we get the memo.
[14:19:39] 10Scoring-platform-team, 10Beta-Cluster-Infrastructure, 10ORES, 10Wikimedia-log-errors: Flood of ORES errors at Beta Cluster - https://phabricator.wikimedia.org/T184276#3878311 (10MarcoAurelio)
[14:19:47] uhwat
[14:20:01] oh yeah
[14:20:10] ores-beta is overloaded all the time now.
[14:20:19] awight, I was thinking about the JADE downtime situation too.
[14:20:45] Here’s a devious idea: include revisiondelete events in ChangePropagation
[14:20:46] Not sure how to think about that, but if we could maintain a queue and suppress on the MW side, I think it'd be OK.
[14:20:57] I think that has some decoupling.
[14:21:06] yes, what you said.
[14:21:22] re. ores-beta, any hypotheses as to why?
[14:21:31] Are we maxing CPU or something?
[14:21:40] Is celery crashing?
[14:21:52] no. maybe spikes of high volume wikidata experiments, then our service goes into retirement.
[14:21:55] lemme poke at it.
[14:22:48] * halfak looks too
[14:23:06] Looks like celery is offline
[14:23:08] OOM?
[14:23:26] 10Scoring-platform-team, 10Beta-Cluster-Infrastructure, 10ORES, 10Wikimedia-log-errors: Flood of ORES errors at Beta Cluster - https://phabricator.wikimedia.org/T184276#3878331 (10MarcoAurelio) https://logstash-beta.wmflabs.org/goto/3da590c69d2896cf4d4cd227616fcd29 is one of them, but you should check the...
[14:24:11] Celery last showed up in syslog on Jan 1
[14:25:24] 10Scoring-platform-team, 10Beta-Cluster-Infrastructure, 10ORES, 10Wikimedia-log-errors: Flood of ORES errors at Beta Cluster - https://phabricator.wikimedia.org/T184276#3878311 (10awight) @MarcoAurelio Thanks for the report! Our celery worker died three days ago, probably due to out-of-memory. It's not t...
[14:25:52] awight, I'm going to try restarting the celery service.
[14:26:07] halfak: Sure, that should work.
[14:27:00] 10Scoring-platform-team, 10Beta-Cluster-Infrastructure, 10ORES, 10Wikimedia-log-errors: Beta Cluster ORES celery worker dies - https://phabricator.wikimedia.org/T184276#3878342 (10awight)
[14:28:10] 10Scoring-platform-team, 10Beta-Cluster-Infrastructure, 10ORES, 10Wikimedia-log-errors: Beta Cluster ORES celery worker dies - https://phabricator.wikimedia.org/T184276#3878346 (10MarcoAurelio) Dear @awight; thanks for your quick response.
Yesterday @Krenair was discussing at -releng that there were a numb...
[14:28:26] https://grafana-labs.wikimedia.org/dashboard/db/ores-beta-cluster?orgId=1&from=now-7d&to=now
[14:29:26] We might just need more than 4 GB of memory on the beta node we're using
[14:31:14] 10Scoring-platform-team, 10Beta-Cluster-Infrastructure, 10ORES, 10Wikimedia-log-errors: Beta Cluster ORES celery worker dies - https://phabricator.wikimedia.org/T184276#3878311 (10Halfak) It looks like we might need more memory on sca03 (or whatever beta cluster node we're deploying to). Maybe it's time t...
[14:34:44] halfak: Or we could reduce the number of workers?
[14:35:03] It's already very few workers.
[14:35:03] I think we have 4
[14:35:35] Oh 8
[14:35:40] We could reduce it to 4 I think
[14:37:42] 10Scoring-platform-team, 10Beta-Cluster-Infrastructure, 10ORES, 10Wikimedia-log-errors: Beta Cluster ORES celery worker dies - https://phabricator.wikimedia.org/T184276#3878361 (10Halfak) Alternatively, we could also reduce the # of workers from 8 to 4. I think we could still handle beta-capacity with th...
[14:37:45] 10Scoring-platform-team, 10Beta-Cluster-Infrastructure, 10ORES, 10Wikimedia-log-errors: Beta Cluster ORES celery worker dies - https://phabricator.wikimedia.org/T184276#3878362 (10awight) Looking at /srv/log/ores/app.log, we've been down for at least 2 weeks. Any useful evidence has been rotated out of lo...
[14:38:20] halfak: We could also raise timeouts and increase queue size, but we might be straying too far from production-like in that case.
[14:38:29] Turns out this has been down since at least Dec 14
[14:38:45] awight, how would that address memory issues?
[14:38:50] celery was *down*
[14:39:09] It would let us run with fewer workers, and still handle large bursts of requests.
[14:39:17] Oh I see
[14:39:19] https://wikitech.wikimedia.org/wiki/Hiera:Deployment-prep
[14:39:25] Looks like the setting is there. 7 workers
[14:39:39] How about I hit it with some stress so we have a fresh crash log?
[14:40:48] Sure! :)
[14:40:50] brb
[14:41:17] * awight sheds off-duty hat and dons crash test dummy helmet
[14:53:24] lol
[14:53:28] I think uwsgi is falling over first:
[14:53:29] [2018-01-05T14:53:14] Fri Jan 5 14:53:14 2018 - *** uWSGI listen queue of socket "0.0.0.0:8081" (fd: 6) full !!! (100/100) ***
[14:54:42] ooh “revision not found"
[14:54:51] we need a different input file
[15:04:39] I’m seeing 45s response times for scoring wikidatawiki
[15:05:46] k I think I managed to kill celery
[15:06:23] yep
[15:06:24] Jan 5 14:59:29 deployment-sca03 kernel: [17095051.512522] Out of memory: Kill process 27361 (celery) score 304 or sacrifice child
[15:09:35] I manually lowered to 4 celery workers, it’s looking good so far
[15:09:48] only 600k free, though, so anything else on the machine could tip the scales
[15:10:26] uh 600MB
[15:13:42] Strange. I reduced to 3 workers, and the machine has 450MB free
[15:15:02] halfak: +1 4GB RAM is uncomfortable
[15:15:27] At least while we’re sharing a machine with nodejs apps.
[15:15:56] +1 We should get our own instance, I think and have at least 8GB
[15:16:13] 4GB is fine if it’s just ORES
[15:16:28] but obv. if 4 is fine, 8 is better.
[15:16:41] I do like not needing to think about limitations in beta :)
[15:16:46] yeah
[15:16:58] But yeah, if releng/cloud insists on 4GB, we can make it work
[15:18:38] Should I make a subtask, and request our own machine?
[15:21:45] {{done}}
[15:21:46] How cool, awight!
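The stress run awight describes above would look something like the sketch below: fire enough concurrent scoring requests at the beta service to back up the uWSGI listen queue and keep the celery workers busy while memory is watched on the host. The host name, wiki, revision IDs, and concurrency here are placeholders, not the values actually used.

```
# Load-generation sketch for the "hit it with some stress" step above.
# Assumptions: the beta host name, wiki, revision IDs, and concurrency are
# placeholders; the real run may have read revisions from an input file.
import concurrent.futures
import requests

BASE = "https://ores-beta.wmflabs.org"    # assumed beta cluster endpoint
WIKI = "wikidatawiki"                     # the slow case mentioned at 15:04
REV_IDS = range(600000000, 600000200)     # placeholder revision IDs

def score(rev_id):
    url = "{}/v3/scores/{}/{}".format(BASE, WIKI, rev_id)
    try:
        resp = requests.get(url, timeout=90)
        return rev_id, resp.status_code
    except requests.RequestException as exc:
        return rev_id, repr(exc)

# 50 parallel requests is plenty to fill a 100-slot uWSGI listen queue once
# the celery workers fall behind.
with concurrent.futures.ThreadPoolExecutor(max_workers=50) as pool:
    for rev_id, status in pool.map(score, REV_IDS):
        print(rev_id, status)
```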
[15:21:54] 10Scoring-platform-team, 10Beta-Cluster-Infrastructure, 10ORES, 10Wikimedia-log-errors: Move beta cluster ORES to its own machine - https://phabricator.wikimedia.org/T184282#3878447 (10awight)
[15:22:03] halfak: Please tag with whatever you think it needs ^
[15:23:04] halfak: EventBus uses Kafka and guarantees delivery. Let’s emit revisiondelete events and call it a day…
[15:23:24] awight, oh good point.
[15:23:35] Also re. kafka it can sometimes duplicate event delivery
[15:24:02] I’m lobbying *for* kafka :)
[15:24:08] at-least-once is fine for us, IMO
[15:24:10] So we'll need a way to confirm that we only apply an event once
[15:24:16] Right.
[15:24:19] kk
[15:24:33] much better problem to have than a sketchy hook that might not succeed
[15:24:33] this is great because if we have a way to only apply events once, that means we can apply events whenever we want.
[15:25:00] E.g. when calling JADE and JADE responds with an event, we can just apply before responding to the user
[15:25:05] me takes another big swig of Kafkaesque Kool-Aid
[15:25:21] We could possibly get an ordering issue, but in most cases, that won't be an issue.
[15:25:24] E.g. in suppression
[15:25:51] +1 I’m starting to hope we can do pure event streams with no distributed transaction
[15:26:18] awight, one issue with throwing the suppression event in kafka is that we won't get a response from JADE confirming that it worked.
[15:26:29] I think that’s okay
[15:26:33] Essentially we're putting an "action" in kafka that might fail.
[15:26:47] fail as in not produce an event
[15:27:03] wait how would it fail?
[15:27:30] We’ll definitely have the opportunity to read the revdelete event, and if we fail to deal with that we “put the message back"
[15:27:31] Good Q. Not sure, but let's say a constraint fails. E.g. someone already suppressed something.
[15:28:35] If it’s already suppressed, we can ignore the new suppression
[15:28:50] Fair point. Hmm.
[15:28:56] * halfak tries to think of a reason for failure.
[15:28:57] If a constraint fails… good point. We emit really loud errors.
[15:29:03] That means our DB is corrupt.
[15:29:11] We should probably stop all processing...
[15:29:15] Oh! An admin suppresses something but by the time that JADE gets the action, that user has been de-sysop'd.
[15:29:21] O_O
[15:29:36] Wheel warring -- it could happen
[15:29:50] I think we have to trust ChangePropagation?
[15:30:01] I think we want JADE to confirm in almost all cases. maybe suppression is the only exception
[15:30:05] I could accept that.
[15:30:09] if the revision is suppressed, it really is suppressed regardless of admin’s rights.
[15:30:15] If MW says the suppression should happen, JADE should always agree.
[15:30:24] If the suppression is reverted, we act on that too
[15:30:25] +1
[15:30:54] So this would only be true of requests that come from a TrustedClient. We'll just trust it.
[15:31:03] :-)
[15:31:14] Seems that all events are going to be trusted.
[15:31:27] OK I think this is OK. Reverts can always wait until JADE is back online. Suppressions can not.
[15:31:38] The only untrustworthy source is the API, which is validating rights before committing the event, I suppose
[15:31:45] Right.
[15:32:34] halfak: remember me?
[15:32:39] :D
[15:33:00] I’m about halfway done with the labels. They should be done by the end of this month.
[15:33:14] o/ Adotchar!
[15:33:17] great news :)
[15:33:28] Zppix wants to use them for a bot
[15:34:18] * halfak still does not understand what zppix has in mind.
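A minimal sketch of the at-least-once handling discussed above (15:23–15:31): each event is applied at most once, and a suppression that has already been applied is treated as a harmless no-op. The topic name, event fields, and in-memory stores are illustrative assumptions, not JADE's actual schema or code; a real consumer would persist its state.

```
# Illustrative only: consume revision-visibility events from Kafka and apply
# them idempotently, so duplicate delivery and replays are harmless.
import json
from kafka import KafkaConsumer  # kafka-python, used here only as an example client

applied_event_ids = set()     # stand-in for a persisted "already applied" table
suppressed_revisions = set()  # stand-in for JADE's own visibility state

def apply_visibility_event(event):
    event_id = event["meta"]["id"]   # assumed unique id per event
    if event_id in applied_event_ids:
        return  # duplicate delivery: already applied, ignore it
    rev_id = event["rev_id"]
    if event.get("suppressed"):
        # MediaWiki is trusted here; suppressing an already-suppressed
        # revision is simply a no-op, as discussed above.
        suppressed_revisions.add(rev_id)
    else:
        suppressed_revisions.discard(rev_id)
    applied_event_ids.add(event_id)

consumer = KafkaConsumer(
    "mediawiki.revision-visibility-change",  # assumed topic name
    bootstrap_servers=["localhost:9092"],    # placeholder broker address
    value_deserializer=lambda raw: json.loads(raw.decode("utf-8")),
)
for message in consumer:
    apply_visibility_event(message.value)
```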
[14:34:39] I think he might not understand that ORES does the hard work of making a prediction for him.
[15:35:34] Idk what he wants to do
[15:35:40] He just said he needs the labels to go further
[15:37:54] 10Scoring-platform-team, 10Beta-Cluster-Infrastructure, 10ORES, 10Wikimedia-log-errors: Move beta cluster ORES to its own machine - https://phabricator.wikimedia.org/T184282#3878472 (10Halfak) FWIW, our staging machine for our CloudVPS install for ORES is 16GB and usually runs with 9.2GB free. It has 8 ce...
[15:41:31] kk well we'll get there
[15:45:19] halfak: I think the events we need are already produced :)
[15:46:24] YEP
[15:50:46] fyi, https://github.com/wikimedia/mediawiki-extensions-EventBus/blob/master/EventBus.hooks.php
[15:51:18] * awight cooks a stack of pancakes for eevans and ottomata
[15:51:33] I'm AFK to head to the doc.
[15:51:43] (different one from yesterday)
[15:51:49] break a leg!
[15:51:53] This one is gonna try to make my old knees work better
[15:51:54] NO
[15:51:56] :P
[15:52:07] o/
[15:56:00] Will look at specific event flows next week… https://docs.google.com/drawings/d/1Lagl0BJWVWHNvHLy5y6RNNKvl0C1tdVrE5YniwgqFJY/edit
[16:18:21] 10Scoring-platform-team, 10JADE, 10MediaWiki-Vagrant: Vagrant role for JADE - https://phabricator.wikimedia.org/T182055#3878539 (10awight)
[17:13:04] Halfak o/ i need the data for the bot to predict stuff
[17:13:14] halfak like cluebot does
[18:16:23] o/
[18:17:25] Zppix: ORES makes predictions
[18:17:34] That's the whole point
[18:24:19] hey folks.
[18:24:33] I'm going to take out ores-staging-02 as part of the process of upgrading to stretch
[18:25:07] Should be harmless
[18:29:27] halfak: maybe im not using the right words
[18:29:47] halfak: basically i need the data for the bot so i can have it do what cluebot does
[20:46:15] 10Scoring-platform-team, 10ORES: Convert CloudVPS instances to stretch. - https://phabricator.wikimedia.org/T184296#3879071 (10Halfak)
[20:46:31] 10Scoring-platform-team, 10ORES: Convert CloudVPS instances to stretch. - https://phabricator.wikimedia.org/T184296#3879071 (10Halfak) ores-staging-01 is back online as a stretch instance.
[20:54:39] 10Scoring-platform-team (Current), 10ORES, 10Patch-For-Review: Rebuild ORES wheels on Stretch - https://phabricator.wikimedia.org/T184135#3879099 (10Halfak)
[20:55:16] 10Scoring-platform-team (Current), 10ORES: Convert CloudVPS instances to stretch. - https://phabricator.wikimedia.org/T184296#3879113 (10Halfak)
[20:55:31] 10Scoring-platform-team (Current), 10ORES: Convert CloudVPS instances to stretch. - https://phabricator.wikimedia.org/T184296#3879071 (10Halfak) a:03Halfak
[22:02:57] Amir1, could you take a quick look at https://gerrit.wikimedia.org/r/#/c/401822/ ?
[22:03:13] Once I have that, I can run a test on the new ores-staging-01
[22:03:43] just quick question, after merging this, we can't deploy to scb nodes anymore, right?
[22:03:54] as they are in jessie
[22:04:02] Oh good Q. We can, but we need to not update wheels in the prod repo
[22:04:12] I'll be working from the wmflabs repo
[22:04:19] For now, anyway.
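On the bot question above: ORES already serves the prediction, so a ClueBot-style client only needs to request a score per revision rather than train on the raw labels. Below is a sketch of that call against the v3 endpoint, assuming the `damaging` model; the revision ID and threshold are placeholders, and the exact response layout should be checked against the live API.

```
# Sketch of a bot consuming ORES predictions rather than raw label data.
# The URL pattern matches the v3 score URLs used elsewhere in this log; the
# revision ID and threshold are placeholders, and the response-parsing path
# is from memory and may need adjusting against the live API.
import requests

ORES_BASE = "https://ores.wikimedia.org/v3/scores"

def damaging_probability(wiki, rev_id):
    resp = requests.get(
        "{}/{}/{}/damaging".format(ORES_BASE, wiki, rev_id), timeout=30
    )
    resp.raise_for_status()
    score = resp.json()[wiki]["scores"][str(rev_id)]["damaging"]["score"]
    return score["probability"]["true"]

# Example: queue an edit for human review when ORES thinks it is likely damaging.
if damaging_probability("enwiki", 123456789) > 0.9:
    print("flag revision 123456789 for patrol")
```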
[22:06:03] I see
[22:06:05] cool
[22:06:09] (03CR) 10Ladsgroup: [C: 032] Rebuilds wheels using Debian Stretch [research/ores/wheels] - 10https://gerrit.wikimedia.org/r/401822 (https://phabricator.wikimedia.org/T184135) (owner: 10Halfak)
[22:07:42] (03CR) 10Ladsgroup: [C: 032] "Still doesn't merge :/" [research/ores/wheels] - 10https://gerrit.wikimedia.org/r/401822 (https://phabricator.wikimedia.org/T184135) (owner: 10Halfak)
[22:23:08] \o/
[22:23:14] * halfak starts working on ores-staging.
[22:34:03] Amir1, still around? https://github.com/wiki-ai/ores-wmflabs-deploy/pull/95 :D
[22:34:23] I'm always around
[22:34:53] Thanks :)
[22:35:03] Now for the moment of truth (assume I got puppet classes right)
[22:37:00] Damn. messed up the security groups
[22:37:28] I really need to get ores in beta cluster working
[22:37:44] to test the extension in more depth
[22:37:52] Amir1, we have an active thread about that
[22:38:06] If you need it working short term, restart the celery workers.
[22:38:09] It should run for a while.
[22:39:38] I did that once and they stopped working really soon
[22:40:36] adam cut the number of workers.
[22:40:40] Might work for longer now
[22:40:41] not sure
[22:41:30] Adam wanted to get some info about it crashing again to be sure we got it right.
[22:41:34] https://phabricator.wikimedia.org/T184276
[22:41:37] Found it!
[22:41:50] 10Scoring-platform-team (Current), 10MediaWiki-extensions-ORES, 10MW-1.31-release-notes (WMF-deploy-2018-01-09 (1.31.0-wmf.16)), 10Patch-For-Review, 10User-Ladsgroup: Add models when initializing the table - https://phabricator.wikimedia.org/T184127#3879349 (10Ladsgroup) Yeah, I already knew about this p...
[22:45:54] 10Scoring-platform-team (Current), 10ORES: Convert CloudVPS instances to stretch. - https://phabricator.wikimedia.org/T184296#3879370 (10Halfak) Working on this now :| ``` halfak@ores-staging-01:~$ virtualenv -bash: virtualenv: command not found ```
[22:57:39] 10Scoring-platform-team (Current), 10ORES: Convert CloudVPS instances to stretch. - https://phabricator.wikimedia.org/T184296#3879392 (10Halfak) Looks like I forgot to enable `role::labs::ores::staging`. Thought I'd clicked that :/
[23:09:02] 10Scoring-platform-team (Current), 10ORES: Convert CloudVPS instances to stretch. - https://phabricator.wikimedia.org/T184296#3879423 (10Halfak) Now I'm looking at: ``` [ores-staging-01.eqiad.wmflabs] sudo: /srv/ores/venv/bin/pip install --use-wheel --no-deps /srv/ores/config/submodules/wheels/pip-*.whl [ores...
[23:11:58] (03PS1) 10Halfak: Adds back the pip wheel. [research/ores/wheels] - 10https://gerrit.wikimedia.org/r/402448
[23:12:06] Amir1, git push --set-upstream origin add_back_pip
[23:12:08] Woops
[23:12:12] https://gerrit.wikimedia.org/r/402448
[23:12:13] ^
[23:12:29] copy paste error
[23:12:30] :))
[23:12:37] (03CR) 10Ladsgroup: [C: 032] Adds back the pip wheel. [research/ores/wheels] - 10https://gerrit.wikimedia.org/r/402448 (owner: 10Halfak)
[23:25:42] 10Scoring-platform-team (Current), 10ORES: Convert CloudVPS instances to stretch. - https://phabricator.wikimedia.org/T184296#3879445 (10Halfak) And we're online! See https://ores-staging.wmflabs.org/v3/scores/fiwiki/1241 But it's returning an error. It seems like we have the wrong version of celery running....
[23:28:13] Arg.
[23:28:15] So frustrating
[23:28:23] version nonsense is driving me a bit batty
[23:33:37] (03PS1) 10Halfak: Downgrade to celery 3.1 [research/ores/wheels] - 10https://gerrit.wikimedia.org/r/402450
[23:33:45] Amir1, https://gerrit.wikimedia.org/r/402450
[23:33:46] halfak: batters up?
[23:33:47] :D
[23:34:10] 🦇
[23:36:30] I think I'm going to leave it at this. Amir1, please merge when you can. I'll come back and try another deploy tomorrow when I wake up.
[23:38:30] (03CR) 10Ladsgroup: [V: 032 C: 032] Downgrade to celery 3.1 [research/ores/wheels] - 10https://gerrit.wikimedia.org/r/402450 (owner: 10Halfak)
[23:39:10] halfak: merged now
[23:39:14] sorry, I missed the ping
[23:40:38] \o/ I'm still here. Trying again!
[23:42:58] * halfak face-palms. I deleted the pip wheel again
[23:43:45] (03PS1) 10Halfak: Adds pip back (again) [research/ores/wheels] - 10https://gerrit.wikimedia.org/r/402451
[23:43:53] Amir1, https://gerrit.wikimedia.org/r/402451
[23:43:56] * halfak looks sheepish
[23:44:13] Oh wait.
[23:44:17] WTF.
[23:44:23] looks like the revscoring wheel broke too.
[23:45:47] (03CR) 10Ladsgroup: [C: 032] Adds pip back (again) [research/ores/wheels] - 10https://gerrit.wikimedia.org/r/402451 (owner: 10Halfak)
[23:46:52] halfak: sorry i needed the wheel for a minute :P
[23:49:41] (03PS1) 10Halfak: Updates revscoring wheel to 2.1.0 (again) [research/ores/wheels] - 10https://gerrit.wikimedia.org/r/402452
[23:49:50] Amir1, one more revscoring-2.0.11-py2.py3-none-any.whl
[23:49:52] Damn past
[23:49:54] paste
[23:49:58] https://gerrit.wikimedia.org/r/402452
[23:50:08] :)))
[23:50:12] If this isn't it, I officially give up for the evening.
[23:50:12] (03CR) 10Ladsgroup: [C: 032] Updates revscoring wheel to 2.1.0 (again) [research/ores/wheels] - 10https://gerrit.wikimedia.org/r/402452 (owner: 10Halfak)
[23:55:57] IT'S ALIVE: https://ores-staging.wmflabs.org/v3/scores/fiwiki/1241241
[23:56:11] 10Scoring-platform-team (Current), 10ORES: Convert CloudVPS instances to stretch. - https://phabricator.wikimedia.org/T184296#3879522 (10Halfak) IT'S ALIVE https://ores-staging.wmflabs.org/v3/scores/fiwiki/1241241
[23:56:29] 10Scoring-platform-team, 10ORES: Make sure ORES is compatible with stretch - https://phabricator.wikimedia.org/T182799#3879525 (10Halfak)
[23:56:30] 10Scoring-platform-team (Current), 10ORES: Convert CloudVPS instances to stretch. - https://phabricator.wikimedia.org/T184296#3879523 (10Halfak) 05Open>03Resolved
[23:56:43] halfak: I hope not
[23:57:06] 10Scoring-platform-team, 10ORES: Make sure ORES is compatible with stretch - https://phabricator.wikimedia.org/T182799#3834792 (10Halfak)
[23:57:08] 10Scoring-platform-team (Current), 10ORES: Convert CloudVPS instances to stretch. - https://phabricator.wikimedia.org/T184296#3879071 (10Halfak) 05Resolved>03Open Woops. I didn't do the main cluster. Leaving that for later.
[23:57:23] OK I'm off. Have a good weekend, folks!
[23:58:14] U2