[00:52:51] halAFK: perhaps we should see how we could make sure all dependencies are installed? Ill start researching it now if i come up with anything ill send you and email or ping you on IRC
[13:18:08] Scoring-platform-team, MediaWiki-extensions-ORES, Easy, Google-Code-in-2017: Wikicode is not interpreted in system message - https://phabricator.wikimedia.org/T142406#3815911 (Ladsgroup) Open>Resolved
[13:30:59] Scoring-platform-team, MediaWiki-Vagrant, Patch-For-Review: Clean up ORES vagrant role - https://phabricator.wikimedia.org/T181850#3815958 (Ladsgroup) Every time I clone ores-prod-deply with its submodules, it practically takes forever. I think we should finish the git-lfs and then come back here and...
[15:04:20] awight: no backlog sync?
[15:04:31] awight: halfak|Network_d o/
[15:04:42] oops, be right there!
[15:12:08] !log upload kubernetes_1.7.10-1_amd64 on apt.wikimedia.org/stretch-wikimedia/main T181489
[15:12:08] akosiaris: Not expecting to hear !log here
[15:12:08] No log is open in #wikimedia-ai - "log on" to open one, "log list" to list the available logs.
[15:12:09] T181489: Gaps in kubelet-reported Prometheus metrics - https://phabricator.wikimedia.org/T181489
[15:12:15] lol
[15:12:19] wrong channel
[15:23:11] Hey awight mind helping an GCI student for me?
[15:23:26] Hes in #wikimedia-dev
[15:26:51] halAFK: nickname *ahem*
[15:36:51] Hi, I got a psycopg2.ProgrammingError, when trying to fetch user with campaigns data in http://localhost:8080/users/555755/?campaigns , seems like wikilabels in wmflabs also has the same problem (http://labels.wmflabs.org/users/100/?campaigns).
[15:38:16] refeed[m]: what python version?
[15:38:42] python3Zppix
[15:38:49] Hmmm
[15:39:06] I dont see no error on wmflabs
[15:39:52] really ?
[15:40:01] http://labels.wmflabs.org/users/100/?campaigns it returns internal server error in mine
[15:40:06] * halfak went right to meeting
[15:40:45] Now it did
[15:41:02] halfak: refeed[m] just found a bug :/
[15:44:18] Well, I figured out the problem yesterday, there's a sql syntax error, the comma here (https://github.com/wiki-ai/wikilabels/blob/master/wikilabels/database/campaigns.py#L158 ) needs to be removed
[15:44:37] Zppix: ty, we’re stuck in a meeting for another 10min, thanks for the GCI ping.
[15:44:47] No problemo
[15:45:20] If theres one thing im good for its pinging people constantly xD
[15:47:41] so there is another meeting on this meeting? *joke
[15:50:58] (CR) Thiemo Mättig (WMDE): [C: 2] Wire ModelLookup using ServiceWiringFile, remove methods from Cache.php (1 comment) [extensions/ORES] - https://gerrit.wikimedia.org/r/394316 (https://phabricator.wikimedia.org/T181334) (owner: Ladsgroup)
[15:52:07] (CR) Thiemo Mättig (WMDE): [C: 2] Rename Stats to ThresholdLookup and make it a service [extensions/ORES] - https://gerrit.wikimedia.org/r/394760 (https://phabricator.wikimedia.org/T181892) (owner: Ladsgroup)
[15:52:41] (Merged) jenkins-bot: Wire ModelLookup using ServiceWiringFile, remove methods from Cache.php [extensions/ORES] - https://gerrit.wikimedia.org/r/394316 (https://phabricator.wikimedia.org/T181334) (owner: Ladsgroup)
[15:53:44] (Merged) jenkins-bot: Rename Stats to ThresholdLookup and make it a service [extensions/ORES] - https://gerrit.wikimedia.org/r/394760 (https://phabricator.wikimedia.org/T181892) (owner: Ladsgroup)
[15:58:24] Scoring-platform-team (Current), ORES, Wikimedia-Incident: Document Nov 28-29 ORES outage - https://phabricator.wikimedia.org/T182101#3816393 (Halfak) a:awight
[16:00:42] eisenhaus335: haha, I'm back from the meeting, if you need any help for the wikilabels stuff, poke me :)
[16:00:56] Scoring-platform-team, Operations, Wikimedia-Incident: Create an incident report for ORES overload incident 2017 - https://phabricator.wikimedia.org/T181795#3816414 (akosiaris)
[16:00:58] Scoring-platform-team (Current), ORES, Wikimedia-Incident: Document Nov 28-29 ORES outage - https://phabricator.wikimedia.org/T182101#3816416 (akosiaris)
[16:01:07] I 've merged the 2 tasks
[16:01:27] i am might poke you entire time *booooooo
[16:01:39] Scoring-platform-team, ORES, Operations, Wikimedia-Incident: Create an incident report for ORES overload incident 2017 - https://phabricator.wikimedia.org/T181795#3802542 (akosiaris)
[16:01:52] I just pushed the fix to the master: https://github.com/wiki-ai/wikilabels/commit/001f50d01f65262959e97125e74f9cad663a54af
[16:01:57] awight: ^
[16:02:21] eisenhaus335: that's why I'm here, don't worry :P
[16:02:39] Amir1: I commented for symbolic CR value
[16:03:00] and—I’ve made that exact typo in SQL at least a dozen times!
[16:07:31] SQL optimizers practicly rewrite the whole queries all the time but they can't ignore the comma, argh
[16:08:13] Amir1: is that why there's a 'master of introducing new bugs' in your phab account ww
[16:08:38] SQL: born 1986, died ?
[16:09:24] refeed[m]: exactly :P
[16:12:19] afk for coffee break
[16:13:39] how time is it on your (i mean all of you but i cannot find proper word for this) local place?
[16:16:01] Scoring-platform-team, ORES, Operations: Tuning profile::ores::celery parameters should cause a Celery service restart - https://phabricator.wikimedia.org/T182203#3816483 (awight)
[16:20:03] Scoring-platform-team, MediaWiki-extensions-ORES, ORES, Documentation: Elaborate documentation on how to deploy ORES to a new wiki - https://phabricator.wikimedia.org/T182054#3811391 (srodlund) I am beginning and audit of current and desired tech documentation for ORES. Will add this.
[16:51:00] I’m getting PTSD re-reading our IRC logs.
[16:56:56] awight: i hear some people use and they say alcohol would help with that xD
[17:02:24] arg. meeting went over
[17:02:25] back now
[17:14:53] I'm back for the late lunch
[17:36:03] Scoring-platform-team (Current), Wikimedia-Incident: How can we test all the wiki/page combinations that can be affected by ORES? - https://phabricator.wikimedia.org/T181830#3816795 (Halfak) I think we really need someone to do QA for this. Could we collab with @jmatazzoni for some QA time on beta durin...
[17:36:35] Scoring-platform-team, ORES, Operations, Release-Engineering-Team: Connection timeout from tin to new ores servers - https://phabricator.wikimedia.org/T181661#3816809 (Halfak)
[17:37:25] Scoring-platform-team, ORES, Operations, Release-Engineering-Team (Kanban), and 2 others: Git refusing to clone some ORES submodules - https://phabricator.wikimedia.org/T181552#3816812 (Halfak)
[17:38:03] Scoring-platform-team, Research Ideas, drafttopic-modeling, artificial-intelligence: Revscoring tune does not recognize a set of labels as target - https://phabricator.wikimedia.org/T181163#3816815 (Halfak)
[17:38:33] Scoring-platform-team (Current), Research Ideas, drafttopic-modeling, artificial-intelligence: Revscoring: Statistic for multilabel classification - https://phabricator.wikimedia.org/T181166#3816819 (Halfak) a:Sumit
[17:41:01] Scoring-platform-team (Current), ORES: Design JADE data storage schema - https://phabricator.wikimedia.org/T153152#3816833 (Halfak) I think this is done, right @awight?
[17:41:44] Scoring-platform-team (Current), User-Ladsgroup: Check catalan wikipedia status - https://phabricator.wikimedia.org/T178408#3816837 (Halfak) a:Ladsgroup
[17:41:55] Scoring-platform-team (Current), User-Ladsgroup: Check catalan wikipedia status - https://phabricator.wikimedia.org/T178408#3816839 (Halfak) Open>Resolved
[17:43:03] (CR) Halfak: [V: 2 C: 2] Put all the wheel tools into the Makefile [services/ores/deploy] - https://gerrit.wikimedia.org/r/391562 (https://phabricator.wikimedia.org/T180496) (owner: Awight)
[17:43:34] Scoring-platform-team (Current), ORES, Patch-For-Review: Clean up ORES wheels Makefile - https://phabricator.wikimedia.org/T180496#3816845 (Halfak) Open>Resolved
[17:43:41] halfak: Do you understand whether we were harming services other than ORES, on scb*?
[17:43:54] akosiaris: ^ in case you have any insight?
[17:44:52] Scoring-platform-team (Current), editquality-modeling, User-Ladsgroup, artificial-intelligence: Train/test reverted model for Icelandic - https://phabricator.wikimedia.org/T181099#3816861 (Halfak) Open>Resolved
[17:45:17] awight, I don't know if we have evidence that we were the ones to *cause* OOMs
[17:45:27] hmm
[17:45:28] did we have a jump in memory usage for ORES when the OOMs started?
[17:45:38] Good question.
[17:45:39] I suppose we might have harmed other services during the self-DOSing
[17:45:41] With CPU usage.
[17:45:59] * halfak reviews everything!
[17:47:59] Scoring-platform-team (Current), editquality-modeling, User-Ladsgroup, artificial-intelligence: Train/test reverted model for eswikiquote - https://phabricator.wikimedia.org/T182218#3816871 (Halfak)
[17:48:01] Scoring-platform-team (Current), editquality-modeling, User-Ladsgroup, artificial-intelligence: Train/test reverted model for eswikiquote - https://phabricator.wikimedia.org/T182218#3816882 (Halfak) https://github.com/wiki-ai/editquality/pull/108
[17:48:12] Scoring-platform-team (Current), editquality-modeling, User-Ladsgroup, artificial-intelligence: Train/test reverted model for eswikiquote - https://phabricator.wikimedia.org/T182218#3816883 (Halfak) Open>Resolved
[17:48:41] halfak: It looks like the OOM was caused by another service.
[17:49:06] Gotcha. I heard some grumblings about electron. Was that it?
[17:50:45] awight, I need lunch. Can you attend "ORES docs" without me today?
[17:50:49] Xmas came early
[17:50:55] all {{merged}}
[17:50:57] [4] https://meta.wikimedia.org/wiki/Template:merged
[17:51:13] Scoring-platform-team (Current), editquality-modeling, User-Ladsgroup, artificial-intelligence: Train/test reverted model for Icelandic - https://phabricator.wikimedia.org/T181099#3779529 (Halfak) Resolved>Open
[17:51:20] Scoring-platform-team (Current), editquality-modeling, User-Ladsgroup, artificial-intelligence: Train/test reverted model for eswikiquote - https://phabricator.wikimedia.org/T182218#3816871 (Halfak) Resolved>Open
[17:51:21] halfak: sure
[17:51:30] :D
[17:51:38] Thanks awight
[17:51:41] Amir1, :DDD
[17:52:01] Aww, electronpdf doesn’t include memory usage graphs
[17:52:24] Scoring-platform-team (Current), MediaWiki-extensions-ORES, MW-1.31-release-notes (WMF-deploy-2017-12-12 (1.31.0-wmf.12)), Patch-For-Review, User-Ladsgroup: Rewrite Stats.php - https://phabricator.wikimedia.org/T181892#3805825 (Halfak) @Ladsgroup looks like this is "done" or "pending deployme...
[17:52:47] Scoring-platform-team (Current), JADE, Epic: Implement basic path structure for JADE (judgements) - https://phabricator.wikimedia.org/T181098#3779510 (Halfak) Got feedback. It's on me to go through it.
[17:52:58] * halfak runs off to lunch
[17:52:59] o/
[17:54:13] Scoring-platform-team (Current), MediaWiki-extensions-ORES, MW-1.31-release-notes (WMF-deploy-2017-12-12 (1.31.0-wmf.12)), Patch-For-Review, User-Ladsgroup: Rewrite Stats.php - https://phabricator.wikimedia.org/T181892#3816924 (Ladsgroup) Nah, I just renamed the class, way more work is needed...
[18:00:06] Scoring-platform-team, ORES, Operations, Release-Engineering-Team (Kanban), and 2 others: Git refusing to clone some ORES submodules - https://phabricator.wikimedia.org/T181552#3816966 (mmodell) I'm not sure what to make of this one. I don't think T179013 ever effected production, so I'm not sure...
[18:02:56] Scoring-platform-team, ORES, monitoring, Wikimedia-Incident: Create Grafana graph to show number of ORES API requests per user-agent - https://phabricator.wikimedia.org/T182222#3816979 (awight)
[18:03:01] Scoring-platform-team, ORES, Operations, Release-Engineering-Team: Connection timeout from tin to new ores servers - https://phabricator.wikimedia.org/T181661#3816990 (mmodell) I'd like to push the latest scap code to production this week if I can get an opsen to upload the package. I'll create a...
[18:24:06] Scoring-platform-team, MediaWiki-extensions-ORES, ORES, Documentation: Elaborate documentation on how to deploy ORES to a new wiki - https://phabricator.wikimedia.org/T182054#3817077 (awight) To clarify, there are two things to document: * How to enable ORES on a wiki that didn't have it previous...
[18:25:11] And also `http://localhost:8080/campaigns/enwiki/?create` returns internal server error (ValueError: View function did not return a response), Traceback: https://dpaste.de/7qr2
[18:26:11] Happens in wmflabs: http://labels.wmflabs.org/campaigns/fawiki/?create
[18:27:24] awight: ^
[18:30:26] Amir1: wikilabels shenanigans ^
[18:30:40] Delegation at its finest
[18:31:05] Thanks :D
[18:31:10] refeed[m]: checking
[18:31:13] halfak|Lunch: I’m headed to SoS. Want to make any edits or review? https://wikitech.wikimedia.org/wiki/Incident_documentation/20171128-ORES
[18:33:09] Scoring-platform-team (Current), ORES, Operations, Patch-For-Review, and 2 others: Stress/capacity test new ores* cluster - https://phabricator.wikimedia.org/T169246#3817115 (awight)
[18:33:12] Scoring-platform-team (Current), ORES, Operations, Patch-For-Review: Review and fix file handle management in worker and celery processes - https://phabricator.wikimedia.org/T174402#3817113 (awight) Open>Resolved This can be resolved for now, it's a facet of the Celery 4 work.
[18:34:00] refeed[m]: You should not send 'create' flag to campaign route, I already removed one (which I think is your case) https://github.com/wiki-ai/wikilabels/pull/215/files
[18:34:15] overall you should not send GET stuff that does writes
[18:34:43] if you POST it, it should do the same
[18:41:50] Amir1: oh ok, yeah that is, I didn't pull the latest code before, sorry
[18:43:14] awight: yes we have. a few services lost their heartbeats under the heavy memory/CPU usage and ended up being restarted by the master service-runner process. Which is not ideal
[18:43:48] akosiaris: Good to know, I’ll mention that in the report!
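Editor's note on the 18:34 advice that GET requests should not perform writes: the rule can be sketched in a few lines of plain Python. This is a hypothetical illustration, not the wikilabels code; `handle_campaign_request`, the `store` list and the status codes chosen are all assumptions made for the example. It also always returns a response, which is the condition the "View function did not return a response" error complains about.

```python
# Hypothetical sketch: refuse write flags on safe HTTP methods.
# None of these names come from wikilabels; they are illustrative only.
SAFE_METHODS = {"GET", "HEAD", "OPTIONS"}


def handle_campaign_request(method, params, store):
    """Return a (status, body) pair for a campaign request.

    `params` is a dict of query/form parameters and `store` is a list
    standing in for the campaigns table.
    """
    wants_write = "create" in params
    if wants_write and method in SAFE_METHODS:
        # A GET must not change state, so reject instead of writing.
        return 405, {"error": "use POST to create a campaign"}
    if wants_write and method == "POST":
        store.append(params["create"])
        return 201, {"created": params["create"]}
    # Every other path still returns a response (never None).
    return 200, {"campaigns": list(store)}


campaigns = []
print(handle_campaign_request("GET", {"create": "demo"}, campaigns))
print(handle_campaign_request("POST", {"create": "demo"}, campaigns))
print(handle_campaign_request("GET", {}, campaigns))
```

The first call is rejected without touching `campaigns`; only the POST adds a row.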
[18:46:05] * halfak looks at report
[18:46:31] Scoring-platform-team, Wikilabels, Easy, Google-Code-in-2017: Error messages should not contain relative paths or error codes - https://phabricator.wikimedia.org/T175726#3817157 (rafidaslam) Open>Resolved Marking as resolved as https://github.com/wiki-ai/wikilabels/pull/214 has been merged
[18:47:16] Scoring-platform-team (Current), Wikilabels, Easy, Google-Code-in-2017: Error messages should not contain relative paths or error codes - https://phabricator.wikimedia.org/T175726#3817161 (Halfak)
[18:48:06] Scoring-platform-team (Current), Wikilabels, Easy, Google-Code-in-2017: Error messages should not contain relative paths or error codes - https://phabricator.wikimedia.org/T175726#3601563 (Halfak) Resolved>Open I'm opening this back up because it is pending deployment. There's no more fo...
[18:51:43] Amir1: When I was fetching the user that was not exist using `/users//` API endpoint, it didn't return 404, instead it just returns the `user_id` back to the requester, isn't this an issue ?
[18:52:35] for example http://labels.wmflabs.org/users/9999999/
[18:53:24] Scoring-platform-team, ORES, Patch-For-Review: Upgrade celery to 4.1.0 for ORES - https://phabricator.wikimedia.org/T178441#3817170 (awight) We decoupled this from the new cluster deployment, I'll disconnect the parent task.
[18:55:27] Scoring-platform-team (Current), ORES, Operations, Patch-For-Review: Review and fix file handle management in worker and celery processes - https://phabricator.wikimedia.org/T174402#3817185 (awight)
[18:55:30] Scoring-platform-team, ORES, Patch-For-Review: Upgrade celery to 4.1.0 for ORES - https://phabricator.wikimedia.org/T178441#3817184 (awight)
[18:55:32] Scoring-platform-team (Current), Wikilabels, Easy, Google-Code-in-2017: Error messages should not contain relative paths or error codes - https://phabricator.wikimedia.org/T175726#3817186 (rafidaslam) Oh okay, you're welcome.
[18:55:51] Scoring-platform-team, ORES: Clean up file handle and Redis connection management in ORES worker and celery processes - https://phabricator.wikimedia.org/T177036#3817190 (awight) Open>Resolved a:awight
[18:55:55] Scoring-platform-team (Current), ORES, Operations, Patch-For-Review: Review and fix file handle management in worker and celery processes - https://phabricator.wikimedia.org/T174402#3560572 (awight)
[18:56:31] Scoring-platform-team (Current), ORES, Operations, Patch-For-Review, and 2 others: Stress/capacity test new ores* cluster - https://phabricator.wikimedia.org/T169246#3817196 (awight)
[18:56:33] Scoring-platform-team (Current), ORES, Patch-For-Review: Wheels built on ores-misc-01 are incompatible with ores* and scb* - https://phabricator.wikimedia.org/T179095#3817195 (awight) Open>Resolved
[19:02:59] refeed[m], that's not an issue. You need to ask for something else.
[19:03:09] Well, then again, I guess it could 404.
[19:03:26] http://labels.wmflabs.org/users/9999999/?tasks
[19:03:39] Here, user 9999999 has not labeled any tasks.
[19:06:05] Scoring-platform-team, ORES, Operations, Release-Engineering-Team: Connection timeout from tin to new ores servers - https://phabricator.wikimedia.org/T181661#3817228 (akosiaris) I can handle that. I 'll try and build it tomorrow and upload it if successful
[19:11:20] halfak: So that's intended? although user `9999999` has not exist yet in the database (http://labels.wmflabs.org/users/) ?
[19:11:36] (PS1) Ladsgroup: Join decomposition of ores_model table [extensions/ORES] - https://gerrit.wikimedia.org/r/395811 (https://phabricator.wikimedia.org/T181334)
[19:11:46] refeed[m], right because a user exists before they log in.
[19:11:57] Logging in does not add a row to the DB since it's just via oauth.
[19:12:46] (CR) jerkins-bot: [V: -1] Join decomposition of ores_model table [extensions/ORES] - https://gerrit.wikimedia.org/r/395811 (https://phabricator.wikimedia.org/T181334) (owner: Ladsgroup)
[19:15:42] halfak: okay, did you mean 'a user exists before they register' ?
[19:16:03] no. This is not pre-registration. Registration happens on MediaWiki
[19:16:14] wikilabels just borrows MediaWiki's users.
[19:17:08] E.g. https://www.mediawiki.org/w/api.php?action=query&meta=globaluserinfo&guiid=9999999
[19:20:19] euhm, so the users in the database (http://labels.wmflabs.org/users/) are actually users from mediawiki that have logged in to Wikilabels before?
[19:23:29] halfak: two edits since you last looked at the incident report, fyi
[19:23:55] I’m about to mail it out.
[19:24:01] Scoring-platform-team (Current), Wikimedia-Incident: How can we test all the wiki/page combinations that can be affected by ORES? - https://phabricator.wikimedia.org/T181830#3817328 (jmatazzoni) Hi guys. This sounds valuable but I'm not sure I understand the project or the request fully. Can you say a li...
[19:28:55] halfak: there, the summary is better now. ?
[19:41:25] Scoring-platform-team (Current), Wikimedia-Incident: How can we test all the wiki/page combinations that can be affected by ORES? - https://phabricator.wikimedia.org/T181830#3817626 (Halfak) Essentially, it's possible that changes to ORES can affect RC Filters. We'd like Collab's help reviewing the filt...
[19:43:11] Scoring-platform-team, Bad-Words-Detection-System, revscoring, Patch-For-Review, artificial-intelligence: Experiment with using English Wikipedia models on Simple English - https://phabricator.wikimedia.org/T181848#3804284 (Catrope) Right now I'm not seeing the ORES filters in RC at all on si...
[19:44:25] awight: Ha, jinx. We both commented on my config patch saying it shouldn't be deployed yet
[19:44:42] I tried it in beta labs and couldn't get it to work at all, even though the config is already enabled there
[19:45:00] awight, sorry in meetings and side-tracked with phab.
[19:45:08] Should have been side-tracked with IRC
[19:45:12] RoanKattouw: k I was seeing fishiness, too
[19:46:17] awight, +1 on the incident report
[19:51:33] halfak: should i put the multilabel demo in master through a PR?
[19:51:42] the ipython one
[19:52:43] codezee, good Q. I don't think so. I think we should just get revscoring to handle it and go from there.
[19:55:07] ok, i'll keep the branch anyway for sometime till we get full support in revscoring + tests
[19:55:24] +1
[19:55:43] I have some time now. I could look at getting it in revscoring.
[19:58:30] I’m gonna stab ores* in the knees for a while, parallel stress tests.
[19:58:42] :(
[19:58:46] poor ores
[19:59:15] Nothing it won’t thank me for later.
[20:00:38] halfak: Oh interesting, thanks for doing some stress tests
[20:04:39] Scoring-platform-team, ORES: Switch ORES to dedicated cluster - https://phabricator.wikimedia.org/T168073#3817764 (awight)
[20:04:41] Scoring-platform-team, ORES, Operations: Reimage ores* hosts with Debian Stretch - https://phabricator.wikimedia.org/T171851#3817763 (awight)
[20:04:44] Scoring-platform-team (Current), ORES, Operations, Patch-For-Review, and 2 others: Stress/capacity test new ores* cluster - https://phabricator.wikimedia.org/T169246#3817761 (awight) Open>Resolved @halfak Let's declare this a win. We showed that the new cluster is capable of keeping up w...
[20:05:33] Scoring-platform-team, ORES, Operations, Release-Engineering-Team: Connection timeout from tin to new ores servers - https://phabricator.wikimedia.org/T181661#3817771 (awight)
[20:05:37] Scoring-platform-team, ORES, Operations, Release-Engineering-Team (Kanban), and 2 others: Git refusing to clone some ORES submodules - https://phabricator.wikimedia.org/T181552#3817772 (awight)
[20:05:40] Scoring-platform-team (Current), ORES, Operations, Patch-For-Review, and 2 others: Stress/capacity test new ores* cluster - https://phabricator.wikimedia.org/T169246#3817770 (awight)
[20:05:55] Scoring-platform-team, ORES: Switch ORES to dedicated cluster - https://phabricator.wikimedia.org/T168073#3355113 (awight)
[20:05:58] Scoring-platform-team, ORES, Operations, Release-Engineering-Team (Kanban), and 2 others: Git refusing to clone some ORES submodules - https://phabricator.wikimedia.org/T181552#3793896 (awight)
[20:06:04] Scoring-platform-team, ORES: Switch ORES to dedicated cluster - https://phabricator.wikimedia.org/T168073#3355113 (awight)
[20:06:07] Scoring-platform-team, ORES, Operations, Release-Engineering-Team: Connection timeout from tin to new ores servers - https://phabricator.wikimedia.org/T181661#3797349 (awight)
[20:07:19] Scoring-platform-team (Current), Wikimedia-Incident: How can we test all the wiki/page combinations that can be affected by ORES? - https://phabricator.wikimedia.org/T181830#3803654 (mmodell) This seems like it would be at last partially testable with selenium running against beta cluster.
[20:14:14] halfak: FYI i unilaterally decided we can unblock on stress testing.
[20:14:25] Relocating for 15min, hope to hear your thoughts.
[20:16:03] bah. Got jumped at my desk by an old friend ^_^
[20:16:32] codezee, I'm considering working on revscoring a little bit to see if I can make it work for multilabel based on your notebook.
[20:16:37] Have you started work there yet?
[20:24:12] halfak: yes, i'm working on an idea, i'll push it in a while w/o testing to show
[20:24:26] Sounds great
[20:24:31] I'll look back at JADE then :)
[20:31:24] net split tiem!
[20:34:40] rajaniemi and legiun got split
[20:36:25] Scoring-platform-team, Gerrit, ORES, Operations, and 2 others: Plan migration of ORES repos to git-lfs - https://phabricator.wikimedia.org/T181678#3817844 (demon)
[20:36:28] Scoring-platform-team, Diffusion, Gerrit, ORES, and 5 others: Add gitlab to proxies/whitelist for mirroring to phabricator - https://phabricator.wikimedia.org/T181835#3817842 (demon) Open>Resolved a:demon
[20:38:51] o/ awight
[20:39:04] bonjour
[20:39:13] Scoring-platform-team (Current), ORES, Operations, Patch-For-Review, and 2 others: Stress/capacity test new ores* cluster - https://phabricator.wikimedia.org/T169246#3817847 (Halfak) One reason we might not want to provision these machines is that we won't be able to safely load test again. If w...
[20:40:37] ^^
[20:40:41] brb
[20:48:32] wiki-ai/revscoring#1329 (multilabel-rf - 7943caa : Sumit Asthana): The build failed. https://travis-ci.org/wiki-ai/revscoring/builds/312615943
[20:52:04] halfak: i've not touched statistics handling code and this could use some more refactoring, but this is the basic idea - https://github.com/wiki-ai/revscoring/commit/7943caa5df9b
[20:52:15] Scoring-platform-team (Current), ORES, Operations, Patch-For-Review, and 2 others: Stress/capacity test new ores* cluster - https://phabricator.wikimedia.org/T169246#3817883 (awight) Point well taken. What if we temporarily depool some of the servers for future tests? Any single ores* machine c...
[20:52:53] let me know what do you think, i've subclassed Probability classifier and added an additional multilabel.fit_transform on labels in the subclass
[20:53:09] while trying to keep redundancies at a minimal
[20:55:23] back
[20:55:45] codezee, thanks for pushing. I'll have a look
[20:56:11] snuck a comment into the cluster thread.
[20:57:06] Amir1: halfak: Nothing to deploy?
[20:57:18] ?
[20:57:26] Lots of stuff in the pending column :)
[20:58:02] Scoring-platform-team (Current), ORES, Operations, Patch-For-Review, and 2 others: Stress/capacity test new ores* cluster - https://phabricator.wikimedia.org/T169246#3817916 (Halfak) We could de-pool a whole datacenter. That would allow us to not mix traffic and run tests. That would also allow...
[20:58:05] awight, you might have forgotten about ores200* :D
[20:58:10] :D
[20:58:19] gold.
[20:58:37] Scoring-platform-team (Current), ORES, Operations, Patch-For-Review, and 2 others: Stress/capacity test new ores* cluster - https://phabricator.wikimedia.org/T169246#3817917 (awight) Let's do it.
[20:58:42] Resolving?
[20:58:48] Yeah. Do it.
[20:58:57] But add another task for follow-up stress testing :)
[21:00:26] That netsplit was fun!
[21:00:53] awight, we're not ready to deploy any of the ORES stuff in the pending deployment column :/
[21:01:06] We'll need a deploy repo update first
[21:01:15] deploy repo update?
[21:01:23] I’m lost.
[21:01:53] https://phabricator.wikimedia.org/source/ores-deploy/
[21:02:06] To pull in the new reverted models in the "pending" column
[21:02:26] * awight pulls self out of incident report hole
[21:02:33] lol
[21:02:33] OK let’s skip it today, then.
[21:02:55] I could push to beta if anyone is feeling that
[21:03:21] +1
[21:04:57] halfak: Just thinking out loud, maybe the bottleneck is actually still the Celery workers count.
[21:05:21] 150*9 > 600
[21:05:27] right?
[21:05:55] How many web workers do we have?
[21:05:57] * halfak looks
[21:06:26] My clue is that, when a node of celery workers went down we would see the other hosts pick up more of the load, but never exceeding a low-ish ceiling.
[21:06:35] How long does one of these ?features requests take?
[21:06:54] 2.5 seconds at 95%
[21:07:07] u know the median?
[21:07:11] oh it’s just in the graph.
[21:07:42] right
[21:07:54] precached should be roughly the same as a ?features
[21:08:01] request-wise
[21:08:12] Scoring-platform-team, ORES, Performance: Diagnose and fix 4.5k req/min ceiling for ores* requests - https://phabricator.wikimedia.org/T182249#3817970 (awight)
[21:08:18] We have 2 uwsgi workers per core
[21:08:48] 1.34s to calculate a score.
[21:09:37] 40 processors!
[21:09:43] 9 nodes * 150 workers / node * (1 request / 1.34s)
[21:09:58] 80 workers * 9 machines = 720 web workers
[21:10:06] huh = 1000 req / second. Does not compute.
[21:10:12] how are we going any faster
[21:11:13] awight, median != mean
[21:11:18] :)
[21:11:22] true
[21:11:32] Our distribution is skewed AF
[21:11:42] kk
[21:11:52] well dammit Bones, how fast can she go
[21:11:53] * halfak gets avg
[21:11:56] lol
[21:13:38] Hmm... Mean is ~1.1secs
[21:13:39] hmm
[21:13:42] Still does not compute
[21:13:45] lul
[21:14:35] so uh
[21:14:41] * awight scratches
[21:14:54] right
[21:15:14] We're *really* fast
[21:15:16] Oh hey—although we’re screwed on this math
[21:15:22] there’s another thing that really bothered me
[21:15:39] When I set —delay=0.001 we should have immediately gone into overload
[21:15:42] we did not
[21:15:46] nothing complained
[21:15:51] on either end
[21:16:09] You saw the same thing when running 2 in parallel
[21:16:23] I think that means that we’re politely waiting for the HTTP request?
[21:16:28] and that is in fact the bottleneck
[21:16:44] humm no that doesn’t prove it
[21:16:52] cos the http request will wait until work is complete to return
[21:16:56] I think it's uwsgi waiting
[21:17:02] There's no available process
[21:17:06] So it fills a queue
[21:17:19] And because we don't know that the request came in, the timeout hasn't started.
[21:17:21] ok, so our stress testing can’t push the system harder than the web endpoint can go?
[21:17:29] A lack of web workers makes a ton of sense to me.
[21:17:43] Everything backs up in the web queue
[21:17:48] Which is not how we want to go down.
[21:17:51] Back to the arithmetic, then.
[21:18:04] Hence we set a limit in the celery queue so we *know* when we're going crazy.
[21:18:36] Web worker limits need to be higher than celery limits to that we can signal in a way that isn't just make-requests-take-forever.
[21:18:37] … and we only overload the queue when the web listeners can work faster than celery
[21:18:41] +1
[21:18:42] right
[21:18:57] This might doom the scb = uwsgi plan
[21:18:59] donno
[21:19:12] Maybe. not sure how I feel about that plan honestly.
[21:19:27] But then again uwsgi is much cheaper than celery per-worker.
[21:19:32] Less ram. less CPU
[21:20:05] I think symmetrical boxes is a nice design, and we can tune the uwsgi/celery ratio
[21:20:36] then adding and removing boxes is just a linear deal
[21:20:56] OTOH, we could also go in the opposite direction and have specialized boxes for e.g. high-memory models.
[21:21:20] awight, I think that's the idea of kubernetes
[21:21:44] WQoops. Time to go. I get to watch one of my (almost) advisees defend his thesis.
[21:22:06] I'll be back online in ~1 hour. There might be celebration afterward.
[21:22:28] oh my units are wrong on the math.
[21:22:29] min / sec
[21:22:31] This is my favorite rite to participate in academically. It's supposed to be a hard test, but it's really just a celebration
[21:22:33] oh!
[21:22:35] Yeah :)
[21:22:39] I’ll post later
[21:22:42] o/
[21:22:43] have fun!
[21:27:13] 9 nodes * 150 workers / node * (1 request / 1.17 node-s) * (60 s/min) = should be able to handle 69,000 req/min
[21:27:48] Or more considering cache no?
[21:29:05] Scoring-platform-team, ORES, Performance: Diagnose and fix 4.5k req/min ceiling for ores* requests - https://phabricator.wikimedia.org/T182249#3818074 (awight) At their current performance, the celery workers we have should be able to handle 9 nodes * 150 workers / node * (1 request / 1.17 node-s) *...
[21:29:41] Zppix: ^ probably less, considering CPU. Cached responses are only about 10% of our traffic.
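Editor's note: the back-of-envelope estimate at 21:27 can be reproduced with a short Python sketch. It uses only figures quoted in the channel (9 nodes, 150 Celery workers per node, about 1.17 s per score, 80 uwsgi workers per machine, and the 4.5k req/min ceiling named in T182249). The function and variable names are invented for illustration, and the model assumes every worker is busy all the time, so it is an upper bound rather than a prediction.

```python
# Sketch of the capacity arithmetic from the discussion above.
# Assumption: every worker is always busy, so this is only an upper bound.

def ceiling_per_minute(nodes, workers_per_node, seconds_per_request):
    """Theoretical requests/minute for a pool of fully busy workers."""
    total_workers = nodes * workers_per_node
    requests_per_second = total_workers / seconds_per_request
    return requests_per_second * 60  # convert req/s to req/min


celery_ceiling = ceiling_per_minute(9, 150, 1.17)
web_workers = 9 * 80          # "80 workers * 9 machines"
observed_ceiling = 4500       # req/min, from the T182249 task title

print(round(celery_ceiling))                        # roughly 69,000 req/min
print(web_workers)                                  # 720 web workers
print(round(celery_ceiling / observed_ceiling, 1))  # gap between theory and observation
```

The same function with 1.34 s per request and no factor of 60 gives the "1000 req / second" figure from 21:10, which is where the min/sec unit mix-up came from.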
[21:34:18] Thats what i meant
[21:34:51] Im multitasking so im not paying alot of attn on what i say
[21:35:27] good call though
[21:41:54] (PS1) Awight: Bump editquality and ores submodules [services/ores/deploy] - https://gerrit.wikimedia.org/r/395855
[21:42:13] (CR) Awight: [V: 2 C: 2] Bump editquality and ores submodules [services/ores/deploy] - https://gerrit.wikimedia.org/r/395855 (owner: Awight)
[21:47:16] Scoring-platform-team, MediaWiki-extensions-ORES, MW-1.31-release-notes (WMF-deploy-2017-11-28 (1.31.0-wmf.10)), Wikimedia-Incident: Rate limit thresholds requests when the service is down - https://phabricator.wikimedia.org/T181567#3818134 (awight)
[21:47:31] Scoring-platform-team, MediaWiki-extensions-ORES, MW-1.31-release-notes (WMF-deploy-2017-11-28 (1.31.0-wmf.10)), Wikimedia-Incident: Rate limit thresholds requests when the service is down - https://phabricator.wikimedia.org/T181567#3794478 (awight) Open>Resolved a:awight
[21:47:34] Scoring-platform-team, Operations, Patch-For-Review, Wikimedia-Incident: ORES overload incident, 2017-11-28 - https://phabricator.wikimedia.org/T181538#3818137 (awight)
[21:50:35] Scoring-platform-team, MediaWiki-extensions-ORES, Performance: Clean up ORES thresholds cache: pre-emptively check before expiry - https://phabricator.wikimedia.org/T182256#3818148 (awight)
[21:51:10] ORES beta is down thanks to my deployment.
[21:53:58] Scoring-platform-team, Operations, Wikimedia-Incident: ORES overload incident, 2017-11-28 - https://phabricator.wikimedia.org/T181538#3818175 (awight) Open>Resolved
[21:54:08] awight: well atleast it wasnt prod
[21:54:11] Less yelling
[21:54:21] awight: need some help?
[22:02:00] Scoring-platform-team, ORES, Scap: ORES virtualenv deployment step fails intermittently - https://phabricator.wikimedia.org/T182258#3818204 (awight)
[22:02:22] Zppix: Strange business. ^ is all.
[22:02:44] Scap... gotta love it
[22:19:02] ORES people, using your awesome creation, I've managed to score 500 of the "very old" Articles for Creation drafts in the enwp backlog - https://en.wikipedia.org/wiki/User:There%27sNoTime/AfC_very_old_draft_scores
[22:19:07] thank you! :-)
[22:21:12] Thank you! Glad we could help
[22:21:19] (Noone is here atm
[22:21:22] But me
[22:21:24] So yeah
[22:21:30] Ill answer for them
[22:21:35] well damn :P
[22:21:41] They are afk
[22:22:14] Aaron is listening to something with advisees awight left a few mins ago
[22:22:35] A-mir is probably afk (i think its late)
[22:25:52] Scoring-platform-team, ORES, Operations, Wikimedia-Incident: Create an incident report for ORES overload incident 2017 - https://phabricator.wikimedia.org/T181795#3802542 (greg) https://wikitech.wikimedia.org/wiki/Incident_documentation/20171128-ORES
[22:26:20] Scoring-platform-team, Operations, Wikimedia-Incident: ORES overload incident, 2017-11-28 - https://phabricator.wikimedia.org/T181538#3818306 (greg)
[22:26:25] Scoring-platform-team, ORES, Operations, Wikimedia-Incident: Create an incident report for ORES overload incident 2017 - https://phabricator.wikimedia.org/T181795#3818304 (greg) Open>Resolved a:awight
[22:29:32] Zppix yeh it's 11pm in germany.
[22:29:50] How inconsiderate
[22:29:53] though 11:30pm there and 10:30pm here.
[22:30:11] Lol sam
[22:32:49] Scoring-platform-team, Analytics, EventBus, ORES, and 3 others: Emit revision-score event to EventBus and expose in EventStreams - https://phabricator.wikimedia.org/T167180#3818312 (Jdlrobson) Are there plans to expose this in https://stream.wikimedia.org/v2/stream/recentchange ?
[23:58:42] Scoring-platform-team, Analytics, EventBus, ORES, and 3 others: Emit revision-score event to EventBus and expose in EventStreams - https://phabricator.wikimedia.org/T167180#3818583 (Ottomata) Yes! It's just not prioritized, so I only get to work on it when I have a bit of headroom to do so! :)
[23:58:57] Scoring-platform-team, Analytics, EventBus, ORES, and 3 others: Emit revision-score event to EventBus and expose in EventStreams - https://phabricator.wikimedia.org/T167180#3818584 (Ottomata) Wait, in recentchange? No, it will be its own stream.