[02:23:55] MediaWiki-extensions-ORES, Scoring-platform-team (Current), Patch-For-Review, User-Ladsgroup: Implement JS ORES client in mw-ORES extension - https://phabricator.wikimedia.org/T201691 (Halfak) This would be doubly-hard because we can't batch without specifying a set of models to use in the pool....
[02:26:52] ORES, Scoring-platform-team, Operations, Performance: Diagnose and fix 4.5k req/min ceiling for ores* requests - https://phabricator.wikimedia.org/T182249 (Halfak) I like the proposal of depooling one datacenter. What do you think @akosiaris? Is this crazy?
[02:32:37] ORES, Scoring-platform-team, Operations, Performance: Diagnose and fix 4.5k req/min ceiling for ores* requests - https://phabricator.wikimedia.org/T182249 (awight) Another thing to consider is that, although we're all very curious about our ceiling, it doesn't really matter until we see real traf...
[06:35:41] RECOVERY - ORES web node labs ores-web-02 on ores.wmflabs.org is OK: HTTP OK: HTTP/1.1 200 OK - 442 bytes in 0.140 second response time
[06:36:21] RECOVERY - ORES web node labs ores-web-01 on ores.wmflabs.org is OK: HTTP OK: HTTP/1.1 200 OK - 457 bytes in 0.036 second response time
[11:03:03] o/
[11:11:25] (PS1) Ladsgroup: Increase number of parallel connections to 9 [services/ores/deploy] - https://gerrit.wikimedia.org/r/472413 (https://phabricator.wikimedia.org/T191842)
[11:49:29] ORES, Scoring-platform-team, Operations, Performance: Diagnose and fix 4.5k req/min ceiling for ores* requests - https://phabricator.wikimedia.org/T182249 (akosiaris) >>! In T182249#4730735, @Halfak wrote: > I like the proposal of depooling one datacenter. What do you think @akosiaris? Is this...
[11:54:30] akosiaris: is there anything I can do to help migrating to k8s?
[12:02:03] Amir1: we need to come up with a plan together. ORES is one of the most complex services, so I am guessing it will require some extra thought on how to do that. It will also totally change the deployment method (no scap anymore)
[12:02:18] and hence the building method
[12:02:34] I'll try to whip up an etherpad today and send it your way for comments
[12:02:51] sure thing, that would be awesome
[12:02:59] I would love to do anything I can to help
[12:07:38] akosiaris: btw, regarding the increased response time on codfw, I can't find anything that could contribute to it except the cache hit ratio; it's really low on codfw (now 17%)
[12:10:31] that does sound reasonable. What's killing the cache hit ratio there? The external researchers?
[12:11:41] wait, are we talking about https://grafana.wikimedia.org/dashboard/db/ores?refresh=1m&panelId=43&fullscreen&orgId=1&from=now-3h&to=now-1m ?
[12:11:46] 13.44 secs average?
[12:12:21] yup
[12:12:46] the precache is okay, non-precache is horrible
[12:16:59] akosiaris: https://grafana.wikimedia.org/dashboard/db/ores?refresh=1m&panelId=12&fullscreen&orgId=1&from=now-24h&to=now-1m this might show some stuff
[13:50:04] ORES, Scoring-platform-team, User-Ladsgroup: Train/test wp10 model for fawiki - https://phabricator.wikimedia.org/T172629 (Ladsgroup) Open>Resolved a: Ladsgroup This is done
[14:33:14] ORES, Scoring-platform-team (Current), User-Ladsgroup: ores returns error for draftquality models - https://phabricator.wikimedia.org/T209060 (Ladsgroup)
[14:33:41] ORES, Scoring-platform-team (Current), User-Ladsgroup: ores returns error for draftquality models - https://phabricator.wikimedia.org/T209060 (Ladsgroup) I have the feeling that rebuilding the wheels with git lfs caused this (that was the only thing that was deployed yesterday)
[14:55:21] ORES, Scoring-platform-team (Current), User-Ladsgroup: ores returns error for draftquality models - https://phabricator.wikimedia.org/T209060 (Ladsgroup) Git lfs is broken: ``` ladsgroup@stat1007:~$ git clone ssh://ladsgroup@gerrit.wikimedia.org:29418/research/ores/wheels . . . ladsgroup@stat1007:~/w...
[14:58:40] ORES, Scoring-platform-team (Current), User-Ladsgroup: ores returns error for draftquality models - https://phabricator.wikimedia.org/T209060 (Ladsgroup) I can download it on my laptop, but my connection is super slow, and it's not downloading anything on labs nodes or prod stat machines :/
[15:09:12] akosiaris: Are you around to do a hot fix on prod? UBN!
[15:09:27] ORES, Scoring-platform-team (Current), User-Ladsgroup: ores returns error for draftquality models - https://phabricator.wikimedia.org/T209060 (Ladsgroup) p: Triage>Unbreak! Apparently in scap deploy, git lfs pull didn't work on all of the wheels assets (only on the *.whl files and not on the corpo...
[15:10:01] Amir1: yes
[15:10:02] akosiaris: https://phabricator.wikimedia.org/T209060#4732150 We need to do it on all nodes of ores in prod. I don't have permission to do it
[15:10:29] (then restarting both services)
[15:10:43] it seems scap deploy didn't do git lfs pull properly
[15:12:00] ouch
[15:12:55] so, how do we treat this?
[15:12:59] fun thing is it downloaded 200 MB of assets without any issues but choked on downloading 100 KB
[15:13:16] ladsgroup@deployment-ores01:/srv/deployment/ores/deploy/submodules/wheels$ sudo git lfs pull
[15:13:37] akosiaris: We basically need to do this node by node (or in parallel) ^
[15:13:52] this fixed beta
[15:16:32] sudo -u deploy-service git lfs pull
[15:16:32] Git LFS: (55 of 55 files) 96.63 KB / 96.63 KB
[15:16:34] on ores1001
[15:16:48] what the? why hadn't this happened?
[15:17:12] Exactly
[15:17:16] :D
[15:18:10] The fun thing is that if you clone freshly and do git lfs pull, it downloads 200 MB of files; it means scap did git lfs pull, but not for everything
[15:18:28] how, I have no clue
[15:18:43] * Amir1 wonders why he didn't become a movie critic
[15:19:43] Amir1: ok done
[15:20:07] Thanks. Should I restart the services?
[15:21:39] I am guessing uwsgi doesn't require a restart, right?
[15:21:53] yup
[15:21:54] it's the workers that need to be restarted
[15:22:03] sure, go for it
[15:22:09] thanks
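To keep the hotfix above in one place: re-fetch the LFS objects as the deploy user inside the wheels submodule on each ores node, then restart the Celery workers (per the log, uwsgi kept serving fine and did not need a restart). A minimal sketch using the paths shown in the log; the worker unit name is an illustrative assumption, not taken from the log:

```
# Minimal sketch of the manual hotfix, run on each ores node.
cd /srv/deployment/ores/deploy/submodules/wheels
# Re-fetch the LFS objects that the scap checkout missed, as the deploy user:
sudo -u deploy-service git lfs pull
# Only the celery workers need a restart; uwsgi keeps serving as-is
# (unit name below is an assumption, not from the log):
sudo systemctl restart celery-ores-worker
```

In practice this was applied host by host (or fanned out over ssh) rather than through scap, since scap's own git lfs pull is what had silently skipped part of the assets.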
[15:30:00] afk for lunch, will be back soon
[15:30:57] ORES, Scoring-platform-team (Current), User-Ladsgroup: ores returns error for draftquality models - https://phabricator.wikimedia.org/T209060 (Ladsgroup) p: Unbreak!>High With the hotfix, it's back to normal, but we can't deploy anything because it might break them again.
[16:21:40] ORES, Scoring-platform-team, Patch-For-Review: Implement twemproxy for ORES production Redis - https://phabricator.wikimedia.org/T122676 (Ladsgroup) >>! In T122676#3411664, @akosiaris wrote: > codfw has been migrated to use nutcracker and reverted. This has backfired majestically. The reason being >...
[16:29:57] ORES, Scoring-platform-team, Patch-For-Review: Implement twemproxy for ORES production Redis - https://phabricator.wikimedia.org/T122676 (akosiaris) For what it's worth, the upstream task is https://github.com/celery/celery/issues/3500. Closed WONTFIX, apparently.
[17:10:19] ORES, Scoring-platform-team, Patch-For-Review: Implement twemproxy for ORES production Redis - https://phabricator.wikimedia.org/T122676 (Ladsgroup) Which now brings us to the question of what's the next step? Should we reinvent the wheel and make a basic nutcracker that is able to handle transaction...
[17:19:46] ORES, Scoring-platform-team: Investigate what is creating Redis transactions and whether it can be fixed - https://phabricator.wikimedia.org/T196889 (Ladsgroup) Running redis-cli monitor on deployment-ores01 gives these kinds of transactions: ``` 1541697362.325121 [0 10.68.16.235:35308] "BRPOP" "celery"...
[21:49:42] (CR) jenkins-bot: Localisation updates from https://translatewiki.net. [extensions/JADE] - https://gerrit.wikimedia.org/r/472549 (owner: L10n-bot)
[21:52:09] hi, everybody! was the Monday hangout just cancelled?
[21:52:44] you got me thinking "today is Thursday…" xD
[21:55:37] @awight hi!
[21:55:38] Error: Command “awight” not recognized. Please review and correct what you’ve written.
[21:56:02] awight: hi!
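Returning to the nutcracker question earlier in the log (T122676 / T196889): twemproxy cannot proxy blocking reads, MULTI/EXEC transactions, or pub/sub, which is what Celery's Redis broker relies on. A hypothetical follow-up to the redis-cli monitor run quoted above, to see which of those command types ORES's Celery actually issues (host and port below are placeholders, not from the log):

```
# Hypothetical check: filter the live command stream for command types that
# twemproxy/nutcracker cannot proxy (blocking reads, transactions, pub/sub).
redis-cli -h 127.0.0.1 -p 6379 monitor \
  | grep -Ei '"(BRPOP|BLPOP|MULTI|EXEC|SUBSCRIBE|PSUBSCRIBE)"'
```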