[05:22:15] PROBLEM - ORES worker labs on ores.wmflabs.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[05:23:57] RECOVERY - ORES worker labs on ores.wmflabs.org is OK: HTTP OK: HTTP/1.1 200 OK - 419 bytes in 0.105 second response time
[06:41:25] PROBLEM - ORES worker labs on ores.wmflabs.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[06:43:07] RECOVERY - ORES worker labs on ores.wmflabs.org is OK: HTTP OK: HTTP/1.1 200 OK - 405 bytes in 0.142 second response time
[08:14:56] PROBLEM - ORES worker labs on ores.wmflabs.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[08:16:46] RECOVERY - ORES worker labs on ores.wmflabs.org is OK: HTTP OK: HTTP/1.1 200 OK - 368 bytes in 0.677 second response time
[09:39:11] Revision-Scoring-As-A-Service, ORES: [Investigate] Periodic redis related errors in wmflabs - https://phabricator.wikimedia.org/T141946#2517410 (Ladsgroup) @-jem-. One rather strange question. Why are you not using the production cluster (ores.wikimedia.org) which is more stable.
[09:53:19] Revision-Scoring-As-A-Service, ORES: [Investigate] Periodic redis related errors in wmflabs - https://phabricator.wikimedia.org/T141946#2532213 (akosiaris) FWIW, I am pretty certain we have pinpointed the issue and this can be marked as resolved
[09:54:47] Revision-Scoring-As-A-Service, ORES: [Investigate] Periodic redis related errors in wmflabs - https://phabricator.wikimedia.org/T141946#2532227 (Ladsgroup) Yup, it's in the "Done" column in our. We do it after the weekly meeting (which is later today)
[09:57:38] akosiaris: hey, Can you review this change? https://gerrit.wikimedia.org/r/#/c/303356/
[09:57:57] It's written in a backward-compatible way.
[09:58:16] I tested it in beta in both old and new ways
[10:01:38] thanks :)
[10:02:48] Amir1: hmm puppet did not restart either uwsgi nor celery
[10:03:00] that's okay
[10:03:00] I'll do that but I suppose it should
[10:03:27] because with the deployment we'll do a restart too
[10:03:38] ok
[10:07:09] Revision-Scoring-As-A-Service, ORES, Puppet: Change CP to do several models at once. - https://phabricator.wikimedia.org/T142360#2532251 (Ladsgroup)
[10:09:54] Revision-Scoring-As-A-Service, ORES, Puppet: Increase web and worker processes in production - https://phabricator.wikimedia.org/T142361#2532269 (Ladsgroup)
[10:10:09] Revision-Scoring-As-A-Service, ORES: Increase web and worker processes in production - https://phabricator.wikimedia.org/T142361#2532269 (Ladsgroup)
[12:27:19] Revision-Scoring-As-A-Service-Backlog, ORES: Announce deployment of wp10 models to ruwiki community - https://phabricator.wikimedia.org/T138623#2405672 (Johan) What's the status of this?
[12:30:49] Revision-Scoring-As-A-Service-Backlog, MediaWiki-extensions-ORES, Community-consensus-needed: Enable RC patrolling on trwiki - https://phabricator.wikimedia.org/T140475#2532909 (Johan) Sorry for late reply, was on vacation for a while. I'm going to spend some time looking at the processes we have ge...
[12:33:01] Revision-Scoring-As-A-Service-Backlog, ORES, Documentation: Provide a space for reporting bad predictions - https://phabricator.wikimedia.org/T140278#2458936 (Johan) Also, it would probably be helpful if the documentation (if not the reporting itself, because then you'd have to keep track of too much...
[13:11:31] o/
[13:12:36] OK we had a lot of momentary downtime pages over the weekend. I want to look into that today, but I think that, in order to do that, we need better logging from precached.
[13:12:45] So, I think I'll be looking into that.
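
The "momentary downtime pages" mentioned above can be sized from the PROBLEM/RECOVERY notifications at the top of this log. What follows is only a rough sketch: it assumes every event falls on the same day and that each PROBLEM is closed by the next RECOVERY. The `outage_seconds` helper and the `EVENT` pattern are hypothetical names, not part of Icinga or ORES.

```python
import re
from datetime import datetime

# The three PROBLEM/RECOVERY pairs copied (shortened) from the top of this log.
LOG = """\
[05:22:15] PROBLEM - ORES worker labs on ores.wmflabs.org is CRITICAL
[05:23:57] RECOVERY - ORES worker labs on ores.wmflabs.org is OK
[06:41:25] PROBLEM - ORES worker labs on ores.wmflabs.org is CRITICAL
[06:43:07] RECOVERY - ORES worker labs on ores.wmflabs.org is OK
[08:14:56] PROBLEM - ORES worker labs on ores.wmflabs.org is CRITICAL
[08:16:46] RECOVERY - ORES worker labs on ores.wmflabs.org is OK
"""

# Hypothetical pattern: a bracketed timestamp followed by the event type.
EVENT = re.compile(r"^\[(\d{2}:\d{2}:\d{2})\] (PROBLEM|RECOVERY) ")


def outage_seconds(log_text):
    """Pair each PROBLEM with the next RECOVERY; return durations in seconds."""
    durations = []
    started = None
    for line in log_text.splitlines():
        match = EVENT.match(line)
        if not match:
            continue  # ordinary chat lines are ignored
        stamp = datetime.strptime(match.group(1), "%H:%M:%S")
        if match.group(2) == "PROBLEM":
            if started is None:
                started = stamp  # keep the first PROBLEM of an outage
        elif started is not None:
            durations.append(int((stamp - started).total_seconds()))
            started = None
    return durations


print(outage_seconds(LOG))  # [102, 102, 110]
```

Each of the three outages lasted under two minutes, which fits the "momentary" description.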
[13:19:16] halfak: hey I was working on making the beta cluster working on ORES
[13:19:21] https://ores-beta.wmflabs.org/v1/scores/testwiki?models=damaging&revids=12345
[13:19:26] https://ores.wikimedia.org/v1/scores/testwiki?models=damaging&revids=12345
[13:19:35] Everything is fine except this ^
[13:21:48] model and revid are swapped halfak
[13:21:55] (after the refactor)
[13:23:29] Hey Amir1. Went AFK for a minute.
[13:23:55] okay. The good news is that the beta is fully functional now
[13:23:59] I don't see what you mean, re. model/revid swapped
[13:24:16] Oh weird
[13:24:18] Yes I do
[13:24:20] hmm
[13:24:45] Seems to me that we expect model-->revid
[13:25:18] the bad news is the refactor caused a bug
[13:25:36] Yes. Well, that's not very bad news. I think the bad news is it looks like we have a bug in prod
[13:26:22] why prod?
[13:26:32] It's v1
[13:26:33] I didn't think we ever had revid-->model as the result format
[13:26:48] it should stay consistent
[13:26:55] Yes it should
[13:27:02] I see that it is v1
[13:27:08] it was from the beginning. That's why the extension works
[13:27:25] Yes. So this is unexpected. But maybe I've just forgotten some of the v1 format
[13:28:44] I guess that's correct :D
[13:28:59] maybe we can find the old etherpad file for that
[13:29:25] https://etherpad.wikimedia.org/p/ores_response_structure
[13:29:31] model-->rev_id
[13:29:39] Looks like this is an *old* bug
[13:34:09] halfak: oh, I think you got it incorrectly, the prod. says model->revid but the beta (after the refactor) says revid-> model
[13:34:24] no I was wrong
[13:34:25] It should be model->revid
[13:34:39] Looks like prod has revid->model
[13:35:07] So, it looks like we're just going to write that into the spec :/
[13:38:11] halfak: so do you suggest we keep this way for v1?
[13:38:31] I don't get the duplicate "score" field when I make this request to the dev server running on ORES master: http://localhost:8080/v1/scores/testwiki?models=revid&revids=12345
[13:38:44] Amir1, we should stick with past behavior
[13:38:59] That's why I thought it was a caching issue
[13:39:22] yeah, it was a caching issue.
[13:39:53] by past do you mean model->revid or vice versa?
[13:40:37] Whatever it is that ores.wikimedia.org is doing right now
[13:41:01] :D
[13:41:05] okay
[13:42:53] I'll work on the response format quickly. Can you blow out the cache of ores-beta?
[13:43:39] let me check
[13:58:47] (CR) Daniel Kinzler: [C: 1] "Looks sane to me, but I know nothing about the logging interfaces involved." [extensions/ORES] - https://gerrit.wikimedia.org/r/302703 (https://phabricator.wikimedia.org/T141978) (owner: Ladsgroup)
[14:04:35] OK. I think I have the switch implemented. This is super weird.
[14:04:45] But I'm still confident that we should match old behaviors.
[14:07:14] Revision-Scoring-As-A-Service-Backlog, ORES, Documentation: Provide a space for reporting bad predictions - https://phabricator.wikimedia.org/T140278#2533089 (Halfak) There's no way we could manage that type of documentation. We currently support 20 wikis. Unless we somehow recruit a small army to...
[14:10:59] Amir1, https://github.com/wiki-ai/ores/pull/162
[14:11:46] nice, let's wait for travis
[14:18:23] halfak: {{merged}}. One question. By cache do you mean redis cache?
[14:18:29] Yes
[14:18:32] okay
[14:18:35] The port 6380 one
[14:18:42] I couldn't figure out how to actually connect to it
[14:18:49] 6379 is redis' managed cache
[14:18:54] Actually, let's blow out both :)
[14:19:05] *celery's managed cache
[14:20:20] I'll do it
[14:23:25] both should be done by now
[14:23:27] let me test
[14:24:28] https://ores-beta.wmflabs.org/v1/scores/enwiki/damaging/?revids=216123%7C32423423%7C3243242%7C234324
[14:24:51] behaves strangely
[14:24:58] lol woops.
Celery workers and uwsgi probably need a restart
[14:25:10] okay
[14:25:13] on it
[14:25:20] I bet there's a lock in there that is expected
[14:25:34] redis does *mostly safe* locks
[14:25:48] okay, restarted
[14:26:49] okay, back to "normal" :D
[14:27:24] \o/ "normal"!
[14:27:24] still, we need to deploy the recent change to beta and see if it works
[14:27:28] +1
[14:27:46] https://grafana.wikimedia.org/dashboard/db/ores-extension
[14:27:54] jobs done for the beta cluster = 0
[14:27:57] for a very long time
[14:27:58] :D
[14:31:21] halfak: btw. for future reference: (on deployment-sca03) redis-cli -h deployment-ores-redis.deployment-prep.eqiad.wmflabs -p 6379 -a areallysecretpassword flushall
[14:31:57] Yeah. No problem connecting to 6379, but 6380 wasn't working for me
[14:32:28] I did that for 6380 too, worked just fine
[14:32:38] probably you needed to retry
[14:32:44] Strangely, I'd just get timeouts on Friday
[14:32:47] (or Thursday)
[14:33:50] Oh! I was trying from ores-compute-01
[14:33:56] Since I needed to install redis-cli
[14:34:08] Must be that port 6379 is open but not 6380
[14:34:15] 6379 is default for redis
[14:34:56] redis-cli not found on deployment-sca03
[14:36:05] I can connect from deployment-ores-redis directly :)
[14:37:21] that's what I did :D
[14:37:41] brb
[14:46:53] the deployment is done
[14:53:00] the maintenance script works just fine
[14:53:07] let's wait and see for the jobs
[14:53:10] afk
[14:55:14] back
[15:13:07] I'm back, waiting for a SWAT for fawiki
[15:20:19] Amir1, I'm seeing some issues with our labs instance.
[15:20:26] Not sure about deploying this refactor soon
[15:20:33] I'll say more in the meeting
[15:20:43] Looks like we have periods where timeout errors are pretty common.
[15:20:54] It could be celery not queuing nicely.
[15:21:49] okay, we can do tests
[15:22:19] This might be related to our intermittent pages
[15:22:43] It seems that some scoring requests simply don't go through.
[15:22:53] And caching isn't helping the way that it ought to.
[15:35:01] halfak: we can test simply if these are related or not, we can check logs of #wikimedia-ai and see if the icinga nagging started after the deployment or it was happening before the deployment of the refactor too
[15:35:14] I guess it was happening before too and these are not related
[15:35:22] but we can test carefully
[15:41:09] Looks like our Hungarian model is totally broken too.
[15:41:10] Hmm
[15:47:26] halfak: nlwiki again
[15:47:33] okay. I will fix this
[15:47:49] Nope. Different
[15:47:51] See https://ores.wmflabs.org/v2/scores/huwiki/?models=reverted&revids=17811694
[15:48:02] looks like something is broken in our regex extractor
[15:48:09] I'm trying to figure out what it could be.
[15:48:20] oh, okay
[15:50:07] if the regex was broken how were we able to build the model?
[15:50:17] Doesn't make sense.
[15:50:27] Also, the model isn't broken
[15:50:35] Feature extraction is.
[15:50:49] But I can extract this badwords/informals feature in my local install
[15:52:51] Amir1, http://pastebin.ca/3678122
[15:52:58] I don't know why it doesn't work on the server
[15:55:35] halfak: it might be a bug and it got resolved
[15:55:43] the revscoring version is 1.2.6 in the server now
[15:55:57] Should be 1.2.8
[15:56:05] That's what's in wheels
[15:56:54] halfak: we haven't deployed it yet
[15:56:58] I can check again
[15:57:04] We have on labs
[15:58:41] yes
[15:58:51] in prod. I confirm it's 1.2.6
[15:58:59] Not looking at prod, dude
[15:59:28] Yeah, I realized that. I just was reassuring it's 1.2.6
[16:00:15] I'll join in a min
[16:02:46] halfak: ping!
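
The exchange above turns on which revscoring version each environment actually runs (both 1.2.6 and 1.2.8 are mentioned). A minimal sketch of checking that from Python follows, assuming plain dotted numeric version strings. `installed_version`, `parse_version` and `is_outdated` are hypothetical helper names; `importlib.metadata.version` is the standard-library call (Python 3.8+) for reading an installed package's version.

```python
from importlib import metadata


def installed_version(package):
    """Return the installed version string of a package, or None if absent."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


def parse_version(text):
    """Turn a dotted numeric version such as '1.2.6' into a comparable tuple."""
    return tuple(int(part) for part in text.split("."))


def is_outdated(installed, expected):
    """True when the installed version is older than the expected one."""
    return parse_version(installed) < parse_version(expected)


# Version numbers taken from the conversation above.
print(is_outdated("1.2.6", "1.2.8"))  # True
print(is_outdated("1.2.8", "1.2.8"))  # False
```

Comparing tuples rather than raw strings avoids the trap where "1.2.10" sorts before "1.2.8" alphabetically. Running `installed_version("revscoring")` on each host would show which release is in use there.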
[16:30:11] Revision-Scoring-As-A-Service, ORES: Extrapolate memory usage per worker forward 2 years - https://phabricator.wikimedia.org/T142046#2533740 (Halfak) a:Halfak
[16:58:56] Revision-Scoring-As-A-Service-Backlog, Research-and-Data, Research-outreach: Participate in the WSDM Cup 2017 challenge - https://phabricator.wikimedia.org/T142407#2533924 (DarTar)
[17:01:04] Revision-Scoring-As-A-Service-Backlog, Research-and-Data, Research-outreach: Participate in the WSDM Cup 2017 challenge - https://phabricator.wikimedia.org/T142407#2533948 (DarTar) Assigning this to you @halfak, feel free to close it come September if we don't have bandwidth or interest in participat...
[18:06:04] Revision-Scoring-As-A-Service-Backlog, Research-and-Data, Research-outreach: Participate in the WSDM Cup 2017 challenge - https://phabricator.wikimedia.org/T142407#2534241 (leila) Thanks, @DarTar. under my radar. :)
[18:19:19] Revision-Scoring-As-A-Service-Backlog, Collaboration-Team-Triage, Edit-Review-Improvements, MediaWiki-extensions-ORES: Include goodfaith model information in ORES review tool - https://phabricator.wikimedia.org/T137966#2534352 (Catrope)
[18:39:51] halfak: there are three things we should do today 1- find out (and fix probably) what's wrong with the labs setup 2- fix huwiki models 3- write the weekly update
[18:40:07] Do you want to split and take over one or two
[18:40:11] so we don't overlap
[18:40:16] +1
[18:40:21] I want to work on huwiki model
[18:40:26] It looks like deep revscoring stuff
[18:40:51] I just got done with meeting. I'm going to clean up lunch stuff and get back to work on that
[18:40:55] Where do you want to start?
[18:41:20] okay, I check the labs setup since it looks to be a little bit more operational
[18:41:42] and then each one of us got to the weekly update
[18:43:30] sounds good
[18:45:10] nice
[19:11:15] I'm declaring this hungarian issue crazy and I'm rebuilding the model.
[19:11:33] I suspect that the issue is the model was trained with an old version of revscoring.
[19:24:10] I'm waiting on revert detection and feature extraction, so I'm going to start on the weekly update
[19:25:12] okay, halfak
[19:25:20] I found this:
[19:25:21] https://icinga.wikimedia.org/cgi-bin/icinga/avail.cgi?t1=1470596539&t2=1470682939&show_log_entries=&full_log_entries=&host=ores.wmflabs.org&service=ORES+worker+labs&assumeinitialstates=yes&assumestateretention=yes&assumestatesduringnotrunning=yes&includesoftstates=no&initialassumedhoststate=0&initialassumedservicestate=0&timeperiod=thisweek&backtrack=4
[19:25:34] which is super helpful, I'm digging into logs
[19:25:57] can't connect to web-04, it's weird
[19:26:18] web-03 shows lots of time outs for precaching today
[19:26:32] 15K time outs to be exact
[19:26:44] Yikes
[19:37:09] in the last 24 hours web-04 didn't get even one request, honestly we should just depool it
[19:37:16] and throw it away
[19:37:26] Amir1, it's been thrown away and rebuilt.
[19:37:30] Something else is going on
[19:37:52] so there's something wrong with the number 04 :D
[19:38:07] I guess. That's the only thing that's been held constant!
[19:38:17] yeah!
[19:38:19] Amir1, https://etherpad.wikimedia.org/p/ores_weekly_update ready for review.
[19:38:32] nice
[19:41:20] halfak: it's great
[20:15:30] OK Posting update now
[20:27:35] o/ SMalyshev
[20:27:38] hey
[20:27:48] So yeah. I hear from DarTar that you might be able to squeeze in some work on ORES
[20:28:12] I figured it would be good to have a chat about that and to figure out how we might get started.
[20:28:35] sure
[20:29:11] I'm around. Trying to understand what's wrong with labs, ping me if I can do anything :)
[20:29:19] So, what kind of investment do you think is reasonable -- how might we best take advantage of your expertise.
[20:29:21] Will do Amir1
[20:29:33] Oh BTW, Amir1, SMalyshev, not sure you've met.
[20:29:50] Amir1, is my primary collaborator on the ORES project. He's the reason we have a bus factor of 2 :)
[20:29:57] hey!
[20:30:03] nice to meet you :)
[20:30:35] halfak: well, we haven't talked with Dario about much specifics except for somehow helping with architecture review
[20:30:58] SMalyshev, right now, I'm running 100% engineer mode and I'm hoping to get back to doing some research too.
[20:31:03] Amir1: nice to meet you :)
[20:31:46] so I understand ORES is now productized, so what is needed there?
[20:32:40] Well, we have a few places that the system is insufficiently optimized and that means our memory footprint is bigger than it needs to be.
[20:32:57] It also means that we don't quite score multiple edits as fast as we'd like.
[20:33:25] Optimally, I'd like to pull a senior engineer onto the team to work beside me and eventually take over the primary engineering needs of the project. But right now, we'll take what we can get.
[20:33:42] We used to have an engineer working with us, but he struggled with the complexity of the system.
[20:34:00] so what's the engineering needs? I mean, which areas - sysadminning, writing code, etc.?
[20:34:22] arranging services in nice pleasant configurations...
[20:34:28] Most complex bits are (1) celery, the queue-based distributed processing system and (2) the dependency injection-based feature extraction system.
[20:35:12] SMalyshev, not sure how to answer, but there's plenty of code to write. And some cleverness needed for getting around some of the hairy corners of serialization and scheduling of task execution.
[20:35:37] hmm... I worked some with celery on my $job-1
[20:35:45] Nice. That should be helpful.
[20:35:54] so is all the system python-based?
[20:36:15] Yeah. We have a few UI bits that use JS and OOJS-ui, but otherwise, mostly python.
[20:36:44] We use sklearn's models. Everything with ML is a thin wrapper around an off-the-shelf sklearn estimator.
[20:37:01] But a lot of what ORES does is getting the features extracted fast
[20:37:40] The DI system allows us to let model devs have flexibility while ORES/extraction can be engineered/optimized independently.
[20:40:29] ok, so if you need somebody to deal with the system on ongoing basis, we'd need to ask Powers That Be, but if it's just kind of looking at it and going through the arch and doing review and forming the base for what exactly is needed for ongoing basis - then we could do it now probably
[20:42:13] Amir1, quick note. Retraining the huwiki model just worked. WTF. I don't care. Victory!
[20:42:29] nice!
[20:42:30] SMalyshev, would you like to start with a phone call and a walkthrough of the code?
[20:43:46] halfak: yes, probably. You're not in SF, right?
[20:43:55] That's right. I'm in MN.
[20:44:05] I'm depooling ores-web-04 to run an experiment
[20:44:23] ah, ok then. Hangout it is :)
[20:44:37] SMalyshev, OK if I schedule a 25 minute call for tomorrow?
[20:44:42] Or you think we should aim for longer?
[20:45:20] halfak: I'd book an hour slot just in case. May not need it but for the first walkthrough better not to be in a rush
[20:45:35] Cool. Sounds great.
[20:45:41] my afternoon tomorrow is mostly open
[20:46:06] Would you rather meet right after The disco weekly or do you want a break?
[20:48:10] SMalyshev, ^
[20:48:53] halfak: I think better a bit later, ~1pm PDT
[20:49:11] 3 meetings in a row is enough :)
[20:49:46] Makes sense. I scheduled you in Lovelace. I hope you'll forgive me
[20:49:46] is it ok for you? if not we can do 11:30 still
[20:49:51] Nothing else available.
[20:49:57] heh
[20:57:57] halfak: after depooling ores-web-04 request timeout stopped
[20:58:03] WTF
[20:58:06] WHY
[20:58:11] :))))
[20:58:36] probably load balancer tries to connect as much as possible
[20:58:38] Although, I'll admit that I'm happy at that solution
[20:58:43] every minute
[20:59:04] OK.
So now we get to figure out why "ores-web-04" isn't an OK name
[20:59:08] Bah. meeting.
[20:59:10] Back in 1 hour
[20:59:27] :D
[21:00:15] I want to keep monitoring in the next 24 hours until the next deployment window and if it was alright we'll deploy
[21:06:27] halfak: I have a theory about ores-web-04, What if our project in labs is like a partition in HDD and have a bad sector (a real issue in hardware) and we run into it no matter what
[21:06:37] I need to talk to Andrew
[21:06:46] Amir1, +1 to that
[21:23:40] Okay, hardware won't stay the same
[21:23:53] it seems ores-web-04 got an out-of-memory error
[21:24:05] why this one got it and not other ones,
[21:24:10] I have no clue
[21:39:51] this is interesting, the only difference of ores-web-04 is that it's jessie 8.5 but other ones are 8.3
[21:53:20] I will get some sleep, ttyl
[21:55:18] Amir1, It was debian 8.3 before I rebuilt it
[21:55:20] Weird!
[21:55:26] Anyway, sleep well!
[21:56:25] oh
[21:59:42] SPAM
[21:59:45] Revision-Scoring-As-A-Service, Beta-Cluster-Infrastructure, ORES: Dashboard or pane for ORES service in beta - https://phabricator.wikimedia.org/T142294#2535110 (Ladsgroup) Open>Resolved
[21:59:47] Revision-Scoring-As-A-Service, Beta-Cluster-Infrastructure, MediaWiki-extensions-ORES: Dashboard or pane for ORES failed jobs on beta - https://phabricator.wikimedia.org/T142119#2535111 (Ladsgroup) Open>Resolved
[21:59:50] Revision-Scoring-As-A-Service, ORES: [Investigate] Periodic redis related errors in wmflabs - https://phabricator.wikimedia.org/T141946#2535113 (Ladsgroup) Open>Resolved
[21:59:52] Revision-Scoring-As-A-Service, ORES: Extrapolate memory usage per worker forward 2 years - https://phabricator.wikimedia.org/T142046#2535112 (Ladsgroup) Open>Resolved
[21:59:55] Revision-Scoring-As-A-Service, Beta-Cluster-Infrastructure, MediaWiki-extensions-ORES, ORES, Wikimedia-Incident: Config beta ORES extension to use the beta ORES service -
https://phabricator.wikimedia.org/T141825#2535114 (Ladsgroup) Open>Resolved
[22:00:07] Revision-Scoring-As-A-Service, MediaWiki-extensions-ORES: Make user-centered documentation for review tool - https://phabricator.wikimedia.org/T140150#2535119 (Ladsgroup) Open>Resolved
[22:04:02] o/
[22:06:34] See you!
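
The v1 response-format thread earlier in this log (13:21 to 13:40) contrasts two nestings of the same scores: model->revid and revid->model. As a rough illustration only, and not the change made in the pull request linked above, swapping between the two amounts to transposing a two-level dictionary. The `swap_nesting` name and the placeholder score values are assumptions; only the model name `damaging` and revision id `12345` come from the URLs in the log.

```python
def swap_nesting(scores):
    """Transpose a two-level dict: {outer: {inner: v}} -> {inner: {outer: v}}."""
    swapped = {}
    for outer_key, inner in scores.items():
        for inner_key, value in inner.items():
            swapped.setdefault(inner_key, {})[outer_key] = value
    return swapped


# Placeholder data: the score payloads are made up for illustration.
by_model = {
    "damaging": {
        "12345": {"prediction": False},
        "67890": {"prediction": True},
    }
}

by_revid = swap_nesting(by_model)
print(by_revid["12345"]["damaging"])  # {'prediction': False}

# Swapping twice gives back the original nesting.
print(swap_nesting(by_revid) == by_model)  # True
```

Because the transposition is its own inverse for fully populated data, a client can accept either nesting and normalize to whichever one it expects.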