[01:50:53] 10Scoring-platform-team, 10Edit-Review-Improvements-Integrated-Filters, 10Collaboration-Team-Triage (Collab-Team-Q1-Jul-Sep-2017), 10MW-1.31-release-notes (WMF-deploy-2017-10-03 (1.31.0-wmf.2)), and 2 others: PHP Warning: Attempted to serialize unserializab... - https://phabricator.wikimedia.org/T176236#3659452
[03:21:41] (03PS14) 10Awight: [WIP] Support new thresholds API [extensions/ORES] - 10https://gerrit.wikimedia.org/r/380893 (https://phabricator.wikimedia.org/T175053)
[03:22:47] (03CR) 10jerkins-bot: [V: 04-1] [WIP] Support new thresholds API [extensions/ORES] - 10https://gerrit.wikimedia.org/r/380893 (https://phabricator.wikimedia.org/T175053) (owner: 10Awight)
[16:30:45] o/
[16:37:26] o/ awight
[16:46:47] eek. Just took another look at T176456 with jmat
[16:46:47] T176456: ORES on Watchlist causes big slowdown—especially with 'Last revision' filter turned on - https://phabricator.wikimedia.org/T176456
[16:47:29] Unfortunately, parts of the query are getting cached so it's really hard to tell how changing the query impacts performance.
[16:49:38] Yeah that seems like a PITA.
[16:53:33] Gimme a sanity check on this—when you have an expensive query but are limiting the results to a small number of rows, doesn't MySQL do all the processing to create the full list, then return the requested size?
[16:54:09] especially when sorting results
[16:54:09] yeah. There's just no other way to do it (that I can think of)
[17:15:35] k, setting that down now.
[17:15:51] Back to Extension:ORES
[17:21:34] o/
[17:38:35] halfak: awight: we have three signups for the working group updates already https://meta.wikimedia.org/wiki/Global_message_delivery/Targets/JADE
[17:38:43] Right on!
[17:38:53] Yay. I'll keep following up on those messages, get y'all the draft of the email to lists in a bit
[17:44:13] (03PS15) 10Awight: Support new thresholds API [extensions/ORES] - 10https://gerrit.wikimedia.org/r/380893 (https://phabricator.wikimedia.org/T175053)
[17:44:31] (03PS16) 10Awight: Support new thresholds API [extensions/ORES] - 10https://gerrit.wikimedia.org/r/380893 (https://phabricator.wikimedia.org/T175053)
[17:46:18] halfak: Amir1: ^ That should be pretty close to what we need. The one important thing I'm not covering with a test yet is the new->old fallback itself, but there's a unit test for the internals that the fallback relies on, so I'd be happy to just smoke test.
[17:46:20] Doing that now.
[17:47:36] I'll get the wmflabs deployment out today. Maybe we can do the beta deployment too?
[17:50:23] Happy to help with any of that. We might get the extension merged by then, if you're thinking that too.
[17:51:04] Right. I don't want to deploy to beta until we have that merged.
[17:51:12] I'll start on the deploy to wmflabs ASAP.
[18:17:03] 10Scoring-platform-team, 10JADE, 10Epic: Deploy JADE MVP API in labs - https://phabricator.wikimedia.org/T176333#3661878 (10Keegan)
[18:17:05] 10Scoring-platform-team, 10Community-Liaisons, 10JADE: Set up working group for JADE - https://phabricator.wikimedia.org/T170954#3661876 (10Keegan) 05Open>03Resolved >>! In T170954#3656259, @Qgil wrote: > Moving this task back to the #community-liaisons backlog. Is this support request still active? Is i...
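A way to check halfak's answer (16:54) empirically, and to sidestep the caching problem from 16:47. Everything below is illustrative: the query is a simplified stand-in for the real watchlist query, and the connection parameters are made up. The key signal is "Using filesort" in EXPLAIN's Extra column, which means MySQL examines and sorts every matching row before LIMIT takes effect.

```python
# Hypothetical sketch, not the query from T176456. The join against
# ores_classification approximates how the ORES extension filters
# recentchanges; names of credentials and hosts are assumptions.
import pymysql

conn = pymysql.connect(host="localhost", user="research",
                       password="...", database="enwiki")

query = """SELECT rc_id
           FROM recentchanges
           JOIN ores_classification ON oresc_rev = rc_this_oldid
           WHERE oresc_probability > 0.8
           ORDER BY rc_timestamp DESC
           LIMIT 50"""

with conn.cursor() as cur:
    cur.execute("EXPLAIN " + query)
    for row in cur.fetchall():
        print(row)  # look for "Using filesort" / "Using temporary"
    # SQL_NO_CACHE keeps the (pre-MySQL-8.0) query cache out of timings.
    cur.execute(query.replace("SELECT", "SELECT SQL_NO_CACHE", 1))
```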
[18:17:40] 10Scoring-platform-team, 10Community-Liaisons, 10JADE: Set up working group for JADE - https://phabricator.wikimedia.org/T170954#3661880 (10Keegan)
[18:17:42] 10Scoring-platform-team (Current), 10JADE: Create list of ORES collaborators (focus on language asset helpers) - https://phabricator.wikimedia.org/T174685#3661879 (10Keegan) 05Open>03Resolved
[18:23:15] (03PS17) 10Awight: Support new thresholds API [extensions/ORES] - 10https://gerrit.wikimedia.org/r/380893 (https://phabricator.wikimedia.org/T175053)
[18:23:42] RoanKattouw: ^ That's ready for CR if you feel like it.
[18:25:02] Pretty nasty thing I did, but I use a magic empty result to allow us to cache a failure, so we only test the API endpoint compatibility once per minute.
[18:32:47] ^ halfak: Amir1: smoke tested and updated patch
[18:39:19] I'll start on the blog work update
[18:39:33] * awight catches a cricket and puts it in my taco
[18:40:43] Nice!
[18:41:10] Also ewwww.
[18:41:21] working on WMFlabs deployment
[18:44:10] halfak: Do you need me to CR or merge anything? Otherwise, I'm thinking of breaking for lunch in 10min
[18:44:28] awight, one quick thing since you are here.
[18:44:44] https://github.com/wiki-ai/ores-wmflabs-deploy/pull/88
[18:44:54] kk
[18:46:12] Just a little thing to make life easier that I'll lose if we don't do it now :)
[18:46:48] It could be merged, but there's at least one "deploy" in need of substitution
[18:46:58] halfak: when you get time https://www.mediawiki.org/wiki/User:Keegan_(WMF)/sandbox#Email_draft
[18:47:04] Short and to the point
[18:47:28] halfak: also search for the 'deploy' e.g. in the function default param value
[18:49:11] awight, got it
[18:49:22] * halfak reads draft
[18:50:11] merged
[18:50:45] thanks!
[18:50:55] I'll have wmflabs ready by the time you are done with lunch :D
[18:51:05] staging should be good now
[18:53:37] If you have down time, take a look at the horrors of my extension patch...
[18:53:59] lol @ downtime
[18:53:59] I'm sad about how kludgily I approached the migration
[18:54:05] ;)
[18:54:25] on the bright side, all that stuff can be torn out as soon as we deprecate the old code and config
[18:54:31] \o/ right :)
[18:56:33] (03CR) 10jerkins-bot: [V: 04-1] Support new thresholds API [extensions/ORES] - 10https://gerrit.wikimedia.org/r/380893 (https://phabricator.wikimedia.org/T175053) (owner: 10Awight)
[19:00:14] PROBLEM - ORES web node labs ores-web-03 on ores.wmflabs.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[19:00:49] Oh shaddap
[19:01:45] PROBLEM - ORES worker labs on ores.wmflabs.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[19:01:45] PROBLEM - ORES web node labs ores-web-05 on ores.wmflabs.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[19:03:13] We had a weird OOM on deploy
[19:05:38] Yup. Looks like we have outgrown our staging machine :/
[19:05:54] Oh wait. No, staging is fine. This is just a web node.
[19:06:02] hrrm...
[19:06:48] strange.
[19:07:20] We can fit both into 16GB of memory on staging, but can't fit just web into 8GB of memory on a web node.
[19:09:07] Looks like our per-process memory usage has increased by a factor of 3.
[19:09:10] :S
[19:09:37] uwsgi used to be 792MB and now it's 3.1GB
[19:09:51] Hmmmmmmhmmmhmmm
[19:10:10] Thresholds information takes up too much space.
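A minimal sketch of the "magic empty result" trick from 18:25, in Python rather than the extension's PHP: on failure, cache a sentinel value with a short TTL so the endpoint-compatibility probe runs at most once per minute. All names here are hypothetical, not the actual patch.

```python
import time

FAILURE = object()   # the "magic empty result" sentinel
TTL = 60             # seconds: at most one compatibility probe per minute
_cache = {}          # key -> (expires_at, value)

def cached_thresholds(fetch_from_new_api):
    entry = _cache.get("thresholds")
    if entry is None or entry[0] < time.time():
        try:
            value = fetch_from_new_api()
        except IOError:
            value = FAILURE          # cache the failure itself
        _cache["thresholds"] = (time.time() + TTL, value)
    else:
        value = entry[1]
    # A cached failure tells the caller to fall back to the old API
    # without re-probing the new endpoint on every request.
    return None if value is FAILURE else value
```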
[19:14:14] RECOVERY - ORES web node labs ores-web-03 on ores.wmflabs.org is OK: HTTP OK: HTTP/1.1 200 OK - 443 bytes in 1.127 second response time
[19:14:44] RECOVERY - ORES web node labs ores-web-05 on ores.wmflabs.org is OK: HTTP OK: HTTP/1.1 200 OK - 459 bytes in 0.650 second response time
[19:14:45] RECOVERY - ORES worker labs on ores.wmflabs.org is OK: HTTP OK: HTTP/1.1 200 OK - 459 bytes in 1.057 second response time
[19:37:10] OK, so this might be a blocker.
[19:37:32] I did not expect memory usage to jump so much.
[19:37:34] * halfak thinks
[20:30:21] 10Scoring-platform-team, 10ORES: Model information UI (graphs and statistics) - https://phabricator.wikimedia.org/T140364#3662288 (10Halfak)
[21:32:54] FYI, I'm stress testing the new ORES cluster
[21:33:10] I don't see any needles moving in Grafana, however.
[21:35:31] o/
[21:37:58] the cluster is ores* on grafana, right awight?
[21:38:10] Zppix: exactly
[21:38:14] Maybe the data just hit?
[21:38:23] awight: that's what I am thinking
[21:39:08] nice.
[21:39:57] awight: looks good on grafana, if I'm reading this data right (lord knows I suck at reading grafana)
[21:40:21] yes!
[21:40:53] Next test, I'll dial it up a bit. CPU usage is around 6%
[21:41:41] Increased the load 6x...
[21:44:03] halfak: I believe the new cluster is handling 100x the normal load on scb*
[21:44:23] Holy moley.
[21:44:27] lol, something broke
[21:44:41] * halfak loads up grafana
[21:45:03] This was supposed to be a background task, but it's fun :p
[21:45:51] I'm still recovering from the thought that our "thresholds" system requires a huge amount of additional memory.
[21:46:09] Probably Python's array storage is horribly inefficient
[21:46:16] I know it compresses really well.
[21:46:19] oh
[21:46:24] And we're using pseudo-dicts as rows!
[21:46:53] I wonder how many rows there are.
[21:47:49] 10Scoring-platform-team, 10ORES, 10revscoring, 10artificial-intelligence: Revscoring 2.0 takes up too much memory - https://phabricator.wikimedia.org/T177544#3662554 (10Halfak)
[21:48:18] 10Scoring-platform-team (Current), 10ORES, 10revscoring, 10artificial-intelligence: Revscoring 2.0 takes up too much memory - https://phabricator.wikimedia.org/T177544#3662569 (10Halfak)
[21:48:34] 10Scoring-platform-team (Current), 10ORES, 10revscoring, 10artificial-intelligence: Revscoring 2.0 takes up too much memory - https://phabricator.wikimedia.org/T177544#3662554 (10Halfak) My hypothesis is that it's all 'thresholds' because that's the new big thing.
[21:51:04] OK, so it looks like we're keeping data on ~20000 unique thresholds for every label.
[21:51:25] So basically we have a unique threshold for every single input observation in testing.
[21:52:54] 10Scoring-platform-team (Current), 10ORES, 10revscoring, 10artificial-intelligence: Revscoring 2.0 takes up too much memory - https://phabricator.wikimedia.org/T177544#3662588 (10Halfak) ``` >>> m = Model.load(open("submodules/editquality/models/enwiki.damaging.gradient_boosting.model")) >>> len(m.info['st...
[21:53:24] maybe a bit overboard
[21:54:15] Still, 20k * 10 floats shouldn't be that much data, right?
[21:54:39] Like, 160 kilobytes
[21:54:55] in ANSI C, maybe
[21:55:27] Are we using arbitrary precision? :)
[21:55:53] halfak: check this out, https://stackoverflow.com/questions/16972501/numpy-size-of-data-type
[21:55:53] Should be a 32-bit float
[21:55:58] It behaves like one
[21:57:08] Seems that really bad things happen when ORES overloads
[21:57:16] ?
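Back-of-envelope for the 21:51-21:54 exchange: ~20000 threshold rows of ~10 floats each is tiny as packed data, but each row stored as a Python dict of boxed floats costs an order of magnitude more. The field names below only approximate the statistics revscoring keeps per threshold, and exact sizes vary by CPython version.

```python
import sys

n_rows = 20000

# What "in ANSI C, maybe" would look like: packed 64-bit floats.
packed = n_rows * 10 * 8
print("packed: %.0f KB" % (packed / 1024))  # ~1562 KB

# One row as a pseudo-dict, roughly the shape revscoring uses.
row = {"threshold": 0.724, "match_rate": 0.2, "filter_rate": 0.8,
       "recall": 0.9, "!recall": 0.95, "precision": 0.3, "!precision": 0.97,
       "accuracy": 0.8, "fpr": 0.1, "f1": 0.45}
per_row = sys.getsizeof(row) + sum(sys.getsizeof(v) for v in row.values())
print("as dicts: %.1f MB" % (n_rows * per_row / 1024 ** 2))
# Roughly 10-15 MB per label on CPython 3.x, and that multiplies across
# labels, models, and every uwsgi/celery process holding its own copy.
```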
[21:57:48] https://grafana-admin.wikimedia.org/dashboard/db/ores?orgId=1&panelId=3&fullscreen&from=1507236425413&to=now&refresh=10s
[21:59:14] I'll let the celery queue run down this time
[21:59:21] and see if I need to restart services
[21:59:25] awight: any clue what that is on the non-admin URL? (no access)
[21:59:47] oops
[21:59:48] https://grafana.wikimedia.org/dashboard/db/ores?orgId=1&from=1507236425413&to=now&refresh=10s
[21:59:52] ty
[22:01:18] awight, not sure what you are showing me
[22:01:52] halfak: That most of the server nodes shut down and did nothing during the overload condition
[22:02:36] Shutting down is a surprise.
[22:03:02] If you did something to redis, that could explain it.
[22:03:12] It's important that redis doesn't get in a weird state
[22:03:20] Didn't you delete the queue a while back?
[22:08:00] Yeah... so a model doesn't take up that much space in memory. Hmm
[22:10:52] I haven't touched the queue today
[22:11:32] halfak: watch out for the n.b. in that datatype.itemsize hack. Apparently it only measures the size of the array elements, but not the array object attributes. Or something.
[22:11:45] halfak: How can I get redis back to a pristine state?
[22:12:04] flush all, restart workers and uwsgi
[22:12:05] Pretty sure this overload crash is something I've been seeing since June
[22:12:10] kk
[22:12:14] I've not seen workers stop working
[22:12:28] Except for that one instance of crazy downtime.
[22:12:30] I can flush from the redis console, then I'll deploy from tin, perhaps, to restart all services?
[22:12:38] Overload is intended.
[22:12:44] Right
[22:13:57] ok. flushed and restarted.
[22:15:14] halfak: "It's largely inspired by the false positive reporting work that currently occurs by hand on wiki pages." does that work for you?
[22:15:30] Fiddin' to send
[22:15:30] here goes another stress test.
[22:15:57] Keegan +1
[22:16:06] * Keegan thumbs up
[22:24:49] 10Scoring-platform-team (Current), 10ORES, 10revscoring, 10artificial-intelligence: Revscoring 2.0 takes up too much memory - https://phabricator.wikimedia.org/T177544#3662735 (10Halfak) With one model loaded (enwiki damaging): ``` PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAN...
[22:42:04] Baffling
[22:42:17] The load isn't well balanced
[22:42:23] servers haven't actually shut down
[22:42:45] mmm, ah, I'm wrong about what I'm doing.
[22:43:11] I'm hitting the wsgi endpoint on each server, but that's not where the actual work is performed.
[22:49:06] It's what the load balancer does.
[22:51:11] ooh, I'm looking at one of the machines currently not doing work (ores1003), and it was showing 1 celery thread at 100%, all others at 0. Now it's switched to loading all threads.
[22:52:38] Funky.
[22:57:33] So I've determined that info is probably taking up a bunch of space. But I don't know why that is. I'll be picking that up again tomorrow.
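What "flush all, restart workers and uwsgi" (22:12) looks like scripted with redis-py. In practice this was a FLUSHALL from the redis console plus a redeploy from tin; the host and service names below are assumptions, not the production commands.

```python
import subprocess
import redis

# Wipe the celery queue and the score cache together, so no stale
# state survives into the restarted workers.
r = redis.StrictRedis(host="localhost", port=6379)
r.flushall()

# Restart the services so every worker reconnects with a clean slate.
for service in ("celery-ores-worker", "uwsgi-ores"):
    subprocess.check_call(["sudo", "service", service, "restart"])
```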
[22:57:46] I have some ideas related to __slots__ :)
[22:57:47] o/
[23:03:55] 10Scoring-platform-team, 10MediaWiki-Vagrant, 10MediaWiki-extensions-ORES: Can't enable ores role in vagrant - https://phabricator.wikimedia.org/T177555#3662836 (10Mooeypoo)
[23:13:59] (03PS1) 10Sbisson: WLFilters: Temporarily stop respecting hideNonDamaging on WL with beta feature [extensions/ORES] - 10https://gerrit.wikimedia.org/r/382627
[23:22:29] 10Scoring-platform-team (Current), 10ORES, 10Operations, 10Patch-For-Review, and 2 others: Stress/capacity test new ores* cluster - https://phabricator.wikimedia.org/T169246#3662930 (10awight) Ran a few tests today, and found that the filehandle issue is not solved. The celery service died on several node...
[23:23:04] 10Scoring-platform-team (Current), 10ORES, 10Operations, 10Patch-For-Review, and 2 others: Stress/capacity test new ores* cluster - https://phabricator.wikimedia.org/T169246#3662933 (10awight)
[23:23:07] 10Scoring-platform-team (Current), 10ORES, 10Operations, 10Patch-For-Review, 10User-Ladsgroup: Review and fix file handle management in worker and celery processes - https://phabricator.wikimedia.org/T174402#3662931 (10awight) 05Resolved>03Open Reopening, I saw this error kill the celery worker on a...
[23:26:35] 10Scoring-platform-team, 10MediaWiki-Vagrant, 10MediaWiki-extensions-ORES: Can't enable ores role in vagrant - https://phabricator.wikimedia.org/T177555#3662836 (10Reedy) Have you done a git update recently? Is it working successfully? The problem doesn't look to be ores, it looks to be vagrant; it's not ru...
[23:52:14] 10Scoring-platform-team (Current), 10ORES, 10revscoring, 10artificial-intelligence: Revscoring 2.0 takes up too much memory - https://phabricator.wikimedia.org/T177544#3663066 (10Halfak) I just tried the article quality model for enwiki and found a much larger RES ``` PID USER PR NI VIRT RES...
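The __slots__ idea from 22:57, sketched: declaring slots removes each instance's per-object __dict__, which is most of the per-row cost when threshold rows are stored as objects. The classes below are illustrative, not revscoring's actual statistics types.

```python
import sys

class SlottedRow:
    # Fixed attribute slots: no __dict__ is allocated per instance.
    __slots__ = ("threshold", "precision", "recall", "f1")

    def __init__(self, threshold, precision, recall, f1):
        self.threshold = threshold
        self.precision = precision
        self.recall = recall
        self.f1 = f1

class PlainRow:
    def __init__(self, threshold, precision, recall, f1):
        self.threshold = threshold
        self.precision = precision
        self.recall = recall
        self.f1 = f1

s = SlottedRow(0.724, 0.3, 0.9, 0.45)
p = PlainRow(0.724, 0.3, 0.9, 0.45)
print(sys.getsizeof(s))                              # just slot pointers
print(sys.getsizeof(p) + sys.getsizeof(p.__dict__))  # instance + its dict
# The slotted version is typically a few times smaller per instance,
# before even counting the boxed float values themselves.
```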