[08:27:43] 10Revision-Scoring-As-A-Service-Backlog, 10MediaWiki-extensions-ORES: Make user-centered documentation for review tool - https://phabricator.wikimedia.org/T140150#2454899 (10Ladsgroup) I made this a very long time ago to be user-centered documentation. We can improve it or re-write it: https://www.mediawiki.or...
[09:20:45] 10Revision-Scoring-As-A-Service-Backlog, 10ORES, 07Wikimedia-log-errors: Model contains an error: ValueError: Failed to process datasource.wikibase.revision.parent.item_doc: Expecting value - https://phabricator.wikimedia.org/T139660#2439218 (10Ladsgroup) With a closer look at the error, it's clear what's wr...
[09:21:35] 10Revision-Scoring-As-A-Service-Backlog, 10MediaWiki-extensions-ORES, 10Wikidata, 07Wikimedia-log-errors: ORES extension score only main namespace edits for Wikidata - https://phabricator.wikimedia.org/T139660#2457033 (10Ladsgroup) p:05Triage>03Low
[11:24:36] 06Revision-Scoring-As-A-Service, 10MediaWiki-extensions-ORES, 10Wikimedia-Site-requests: Deploy ORES review tool in Turkish Wikipedia - https://phabricator.wikimedia.org/T139992#2457235 (10Ladsgroup)
[11:25:09] 06Revision-Scoring-As-A-Service, 10MediaWiki-extensions-ORES, 10Wikimedia-Site-requests: Deploy ORES review tool in Turkish Wikipedia - https://phabricator.wikimedia.org/T139992#2449140 (10Ladsgroup) It's a beta feature, thus removing #community-consensus-needed
[12:56:16] 06Revision-Scoring-As-A-Service, 10ORES: Explore growing memory usage of web workers on scb machines - https://phabricator.wikimedia.org/T140020#2450123 (10Ladsgroup) 12:54-56 UTC: P3413
[13:26:32] 06Revision-Scoring-As-A-Service, 10ORES: Explore growing memory usage of web workers on scb machines - https://phabricator.wikimedia.org/T140020#2457707 (10Ladsgroup) It seems we do have a memory leak on labs as well: https://grafana.wikimedia.org/dashboard/db/ores-labs
[13:33:46] 10Revision-Scoring-As-A-Service-Backlog, 10rsaas-editquality: [research] Why is the japanese 'reverted' model so bad? - https://phabricator.wikimedia.org/T133405#2457741 (10Miya) @Elitre I don't fully understand what you need. Is it something to do with the message by とある白い猫 posted to [[https://ja.wikipedia.o...
[13:49:04] 10Revision-Scoring-As-A-Service-Backlog, 10rsaas-editquality: [research] Why is the japanese 'reverted' model so bad? - https://phabricator.wikimedia.org/T133405#2457790 (10Ladsgroup) @Miya: Hey, we are working on building anti-vandalism tools for Japanese Wikipedia using AI (for example see ORES in [[https://...
[13:58:59] o/
[14:05:19] halfak: o/
[14:05:30] Just updated memory consumption graphs.
[14:05:42] Looks like uwsgi has a ceiling, but celery worker processes might not.
[14:05:51] nice, in two hours we will have the ORES review tool in trwiki
[14:06:01] Great!
[14:06:24] also we have a deadline from Ops to fix our memory consumption issue
[14:06:38] next Wednesday
[14:06:41] the 20th
[14:07:17] we need to determine a release date by then
[14:07:21] not fixed by then
[14:07:32] That's not going to work for me. I'm out of town
[14:07:32] halfak: ^
[14:07:40] What will they do if we don't fix it by then?
[14:07:58] relocating
[14:08:14] To?
[14:08:14] (and I guess it'll be painful for us)
[14:08:19] Also, we can drop uwsgi workers to 50 per machine
[14:08:34] IDK
[14:08:34] That's an easy, quick fix that will reduce our pressure substantially.
[14:08:54] Also, I was surprised that we only had 16 celery and 72 uwsgi.
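The "memory consumption graphs" mentioned above come down to per-process resident memory for the uwsgi and celery workers on each host. As a minimal illustrative sketch only (not the actual monitoring setup), assuming the third-party psutil library is available, something like the following sums RSS by process name on one machine:

from collections import defaultdict

import psutil


def rss_by_process_name(names=("uwsgi", "celery")):
    """Sum resident memory (RSS) and count processes per worker type."""
    totals = defaultdict(lambda: {"rss": 0, "count": 0})
    for proc in psutil.process_iter(["name", "memory_info"]):
        proc_name = (proc.info.get("name") or "").lower()
        mem = proc.info.get("memory_info")
        if mem is None:
            # Skip processes we were not allowed to inspect.
            continue
        for target in names:
            if target in proc_name:
                totals[target]["rss"] += mem.rss
                totals[target]["count"] += 1
    return dict(totals)


if __name__ == "__main__":
    for name, stats in rss_by_process_name().items():
        print("%s: %d processes, %.0f MiB RSS total"
              % (name, stats["count"], stats["rss"] / 2 ** 20))

Numbers like these are what makes "drop uwsgi workers to 50 per machine" a quick lever: fewer worker processes means a lower total memory ceiling.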
[14:09:13] We should have more celery, but let's leave it for now since we're not struggling with performance.
[14:09:55] halfak: there is a big issue around reducing to 50. we don't know the exact number of requests to ores
[14:10:07] lots of them hit the redis cache and we don't count those
[14:10:13] Our queue can't be bigger than 100 anyway
[14:10:24] Yes, that's right, but they'll also be blazing fast
[14:11:19] yeah, since I don't know the exact number of them, I feel we are walking in the dark. I don't even have an estimate
[14:11:40] we should have something for that soon
[14:11:45] I made a phab card
[14:14:06] Right now, 19/72 workers are doing *anything at all* on scb1001
[14:14:45] Now it is 15 workers
[14:15:52] halfak: can you get me a number for uwsgi workers?
[14:16:09] That is the number for uwsgi workers
[14:16:23] oh, you are right
[14:16:27] :D
[14:16:28] Still 15
[14:17:11] okay. I'll make a patch to bring it to 48 workers
[14:17:19] Generally, I think we need celery_workers + 5 for uwsgi to make sure we have at least 5 available for cache-only requests.
[14:17:20] halfak: is it okay for you?
[14:17:24] +1
[14:17:27] Let's do it asap.
[14:21:53] halfak: done. the patch is about to be merged
[14:22:04] OK. Graphs are updated. It looks like uwsgi clearly has a ceiling and celery has been migrating slowly upwards in memory usage.
[14:22:28] If the leak is in celery, I don't think it's our code, but maybe celery's
[14:22:30] I also think even though uwsgi seems super big, celery workers have spikes
[14:22:58] is there a way to put a memory limit on celery workers?
[14:23:12] Amir1, what do you mean?
[14:23:26] What should the program do when it runs into its memory limit?
[14:23:31] Refuse to process new requests?
[14:23:37] restart?
[14:23:49] harakiri in the sense of uwsgi :D
[14:24:29] http://docs.celeryproject.org/en/latest/configuration.html#std:setting-CELERYD_MAX_TASKS_PER_CHILD
[14:24:52] We could set that to some large number and then workers would be restarted periodically
[14:25:37] yeah, my idea
[14:30:20] 06Revision-Scoring-As-A-Service, 10ORES: Explore growing memory usage of web workers on scb machines - https://phabricator.wikimedia.org/T140020#2457899 (10Halfak) OK. Graphs are updated. It looks like uwsgi clearly has a ceiling and celery has been migrating slowly upwards in memory usage. If the leak is...
[14:30:31] I'm going to make a patch for that in the wmflabs deploy
[14:30:43] Also, let's not refer to this as a "leak"
[14:30:49] It doesn't seem to be a leak at all.
[14:31:00] "leak" is kind of derogatory.
[14:33:34] let's find a cool name :D "hungry ores"
[14:33:46] I don't think this is ores at all.
[14:33:52] This is most likely celeryd
[14:37:10] halfak: speaking of ores-wmflabs: https://github.com/wiki-ai/ores-wmflabs-deploy/pull/65
[14:40:20] thanks
[14:40:26] :)
[14:41:52] * halfak runs precaching against staging
[14:42:54] I messed with staging a little bit :/ I don't think it should have any problems
[14:43:06] * Amir1 makes a phab card to fix ores-experiment
[14:44:12] halfak: do you want to deploy and monitor?
[14:44:26] I'm deployed to staging now. Am testing there quick.
[14:44:36] nice
[14:44:44] Our performance improvements made a big difference :D
[14:44:47] Re. revscoring
[14:44:55] Staging can keep up with precached no problem
[14:45:11] :))
[14:45:30] * Amir1 gives the microphone to halfak so he can drop it
[14:46:26] Amir1, we'll get another big performance boost with precached in the refactor.
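The setting linked above, CELERYD_MAX_TASKS_PER_CHILD, is what carries the workaround being discussed: each celery worker process is recycled after a fixed number of tasks, so memory that creeps upward is handed back to the OS without anyone restarting the service. A minimal sketch of a config using it is below; the app name, the localhost broker URL, and the concrete numbers are placeholders, not the values from the ores deploy repositories.

from celery import Celery

# Placeholder app name and broker URL -- not the real ores celery setup.
app = Celery("ores_sketch", broker="redis://localhost:6379/0")

app.conf.update(
    # Old-style setting name from the Celery docs page linked in the chat;
    # newer Celery releases spell the same setting
    # `worker_max_tasks_per_child`. Each worker process is replaced after
    # this many tasks, so slowly growing memory gets released periodically.
    CELERYD_MAX_TASKS_PER_CHILD=5000,
    # Number of worker processes; 16 is the count mentioned in the chat.
    CELERYD_CONCURRENCY=16,
)

Setting the limit to "some large number" keeps the restart overhead negligible while still bounding how long any one worker can accumulate memory.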
[14:46:35] We can score multiple models at the same time for free.
[14:46:42] Just so long as they are looking at the same rev_id :)
[14:47:02] So we'll cut down CPU used in precaching enwiki by 4x
[14:47:09] wikidatawiki by 3x
[14:47:18] And those two together will be MASSIVE
[14:47:22] yess, I need to fix precaching in prod too
[14:47:30] it would be super easy
[14:47:31] goddamn separate configs
[14:48:03] speaking of which, I made configs for prod today
[14:48:12] they will merge it tomorrow
[14:48:28] https://gerrit.wikimedia.org/r/298707
[14:50:20] halfak: when you can confirm this change indeed fixes the hungriness of celery, tell me so I can happily send it to prod
[14:51:47] Amir1, that'll take hours. I just want to make sure the thing stays online and processing requests for ~15 minutes
[14:51:54] I want to make sure that at least one worker restarts.
[14:54:30] yup
[14:54:58] I can continue monitoring tomorrow morning (my time = in 12 hours)
[14:55:07] if you are busy
[14:57:11] Sure. That'll be great.
[14:57:22] Still waiting on a process to get cycled
[14:57:26] It'll take a little while
[14:57:35] We've got 16 workers.
[14:57:49] Oh! There goes the whole set.
[14:58:08] Hmm... they all transitioned at about the same time. I suppose that will become more scattered after a few restarts.
[15:00:19] going for deploy for trwiki
[15:13:12] OK, declaring victory on staging
[15:13:19] Going to labs
[15:17:30] yessss
[15:19:05] 06Revision-Scoring-As-A-Service, 10ORES: Explore growing memory usage of web workers on scb machines - https://phabricator.wikimedia.org/T140020#2458077 (10Halfak) I just deployed this to wmflabs.
[15:29:12] Hmm... deploy is stuck on -04
[15:29:36] https://grafana.wikimedia.org/dashboard/db/ores-labs?panelId=8&fullscreen
[15:31:40] Okay, time to send out the announcement
[15:31:56] Weird. Looks like -05 is failing to restart too
[15:32:16] This looks like the old restarts-take-forever behavior
[15:34:07] we never ever encountered those 1:30 issues, like ever, after the fix
[15:34:26] I can't emphasize "ever" enough :D
[15:36:08] PROBLEM - ORES web node labs ores-web-05 on ores.wmflabs.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[15:36:54] * halfak tries to log into ores-web-05
[15:38:27] well... looks like -05 is down
[15:39:31] Rebooting
[15:40:52] Soft rebooting isn't working.
[15:40:56] Trying a hard reboot
[15:41:42] 06Revision-Scoring-As-A-Service, 10MediaWiki-extensions-ORES, 10Wikimedia-Site-requests, 13Patch-For-Review: Deploy ORES review tool in Turkish Wikipedia - https://phabricator.wikimedia.org/T139992#2458141 (10Ladsgroup) Announcement: https://tr.wikipedia.org/w/index.php?title=Vikipedi:K%C3%B6y_%C3%A7e%C5%9...
[15:58:27] !log ores repooled ores-web-04
[15:58:27] No log is open in #wikimedia-ai - log on to open one, log list to list the available logs.
[15:58:27] Not expecting to hear !log here
[15:58:32] Whoops
[16:03:46] halfak: https://grafana.wikimedia.org/dashboard/db/ores-labs?panelId=8&fullscreen I can see the restarts
[16:03:48] :D
[16:05:38] \o/
[16:06:03] * halfak punches web-05 in its face
[16:40:55] halfak: btw, it seems we already have a per-web-node check in LVS. I'm waiting for Daniel Zahn to confirm that
[16:41:15] Amir1, great news
[16:43:13] (or potentially great news)
[16:44:34] halfak: one thing: https://ores.wmflabs.org/ works fine for me but https://ores.wmflabs.org/v1/scores/enwiki/damaging/?revids=21914234230|3243242|234324 gives me an internal server error
[16:44:53] same there? workers acting crazy?
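The PROBLEM and RECOVERY lines above come from an external HTTP check run against each labs web node. In spirit, that check boils down to something like the sketch below; it is illustrative only, assumes the requests library, and the per-node URL and the 10-second timeout mirror text that appears in the log rather than the real icinga configuration.

import sys

import requests

# URL taken from a per-node check that appears later in the log; the timeout
# matches the "Socket timeout after 10 seconds" in the alert above.
CHECK_URL = "https://ores.wmflabs.org/node/ores-web-05/"
TIMEOUT = 10


def check(url=CHECK_URL, timeout=TIMEOUT):
    """Return a (nagios_status, message) pair for one HTTP check."""
    try:
        response = requests.get(url, timeout=timeout)
    except requests.RequestException as exc:
        return 2, "CRITICAL - %s" % exc
    if response.status_code != 200:
        return 2, "CRITICAL - HTTP %d" % response.status_code
    return 0, "OK - %d bytes in %.3f second response time" % (
        len(response.content), response.elapsed.total_seconds())


if __name__ == "__main__":
    status, message = check()
    print(message)
    sys.exit(status)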
[16:44:59] So, I accidentally killed ores-worker-05 while trying to kill ores-web-05
[16:45:07] and now I can't seem to re-initialize it.
[16:45:13] Getting a weird fabric error
[16:47:45] :D
[16:48:22] WTF. Some of my fixes to the fabfile from when I last did this didn't get merged
[16:48:24] Arg!
[16:49:57] OK. Fix applied again
[16:50:18] Arg. Not going to get to eat lunch today :(
[16:50:41] it still gives me 500 errors
[16:50:47] halfak: don't worry
[16:50:50] I'll take care of it
[16:51:10] go get lunch
[16:51:17] * halfak finishes initializing web-05
[16:51:36] tell me when you're done and I'll take it over
[16:51:54] halfak: ^
[16:52:00] I'm serious
[16:52:59] Error during puppet run: Error: Cannot create /srv/log/ores; parent directory /srv/log does not exist
[16:53:17] you need to make that dir by hand
[16:53:20] Looks like some puppet breakage happened recently.
[16:53:21] (use sudo)
[16:53:24] That's broken :(
[16:53:45] Puppet should be able to run on a bare system
[16:53:48] I encountered this error before while making sca03 in beta
[16:53:59] I thought I'd fixed it
[16:54:03] but then I forgot
[16:54:17] 06Revision-Scoring-As-A-Service: Puppet fails on new web node - https://phabricator.wikimedia.org/T140265#2458591 (10Halfak)
[16:54:26] 06Revision-Scoring-As-A-Service, 10ORES: Puppet fails on new web node - https://phabricator.wikimedia.org/T140265#2458604 (10Halfak)
[16:54:31] https://phabricator.wikimedia.org/T140265
[16:54:51] 06Revision-Scoring-As-A-Service, 10ORES, 07Easy: Puppet fails on new web node - https://phabricator.wikimedia.org/T140265#2458605 (10Ladsgroup) a:03Ladsgroup
[16:54:57] claimed it :D
[16:55:09] it's 15 minutes of work
[16:55:22] 06Revision-Scoring-As-A-Service, 10ORES, 07Easy, 07Puppet: Puppet fails on new web node - https://phabricator.wikimedia.org/T140265#2458608 (10Ladsgroup)
[16:55:33] Amir1, could you make it "unbreak now" too?
[16:56:01] it's not a big deal. It happens only on new instances
[16:56:52] Looks like our puppet might be incompatible with the new Jessie image
[16:57:02] I switched to 8.5, but I think we need to go back to 8.3
[16:57:45] Amir1, I'd like to leave this with you. The task is to delete and rebuild ores-web-05 and ores-worker-05 with Jessie 8.3
[16:57:58] It might be that puppet is just plain broken because of some scap stuff or something.
[16:58:10] So if you could check on that, it would be great if we could stay on 8.5
[16:58:39] Amir1, here's the failed puppet run: https://phabricator.wikimedia.org/P3417
[16:58:43] * halfak runs away to feed himself
[16:58:44] okay
[16:58:50] I'll do it
[17:20:30] halfak: Okay, I fixed puppet on ores-web-05
[17:20:37] going to deploy
[17:26:25] the restart on ores-web-04 is taking a tremendous amount of time
[17:26:36] 2 or 3 mins by now
[17:29:26] I need to come back and fix 04 too
[17:33:40] I just deleted and re-made ores-web-05 and ores-worker-05
[17:34:01] in order to fix ores-web you need to add this line to /etc/hosts on the instance
[17:34:08] and run the puppet agent twice
[17:34:45] 10.68.17.240(tab character)tin.eqiad.wmnet
[17:34:51] I got to go
[17:34:56] be back in ten min.
[17:47:33] o/ Amir1
[17:47:34] just got back
[17:49:23] Uh oh. ores-worker-05 is m1.small and it needs to be m1.large.
[17:49:43] ores-web-05 should be m1.medium
[17:49:56] I'm going to recreate them as the right size
[17:50:28] I've got 10 mins
[17:50:33] Then I must run away again
[17:52:22] OK. Instances recreated
[17:55:55] halfak: I'm still working my way through the review and will continue tomorrow
[17:56:19] but with it being large, it may be helpful to have someone more familiar also review
[17:57:04] Hey schana.
[17:57:12] and, if you have the time, to have a hangout where you walk through the changes and give your thought process behind them
[17:57:17] Sorry I've been slow to respond to your feedback & questions.
[17:57:23] no worries
[17:57:28] Dealing with a critical memory issue in prod
[17:57:31] and downtime in labs :)
[17:57:47] I've been reading the backscroll - sounds like a handful
[17:57:55] (and not the cute furry kind)
[17:58:14] The labs stuff is my fault.
[17:58:22] No deleting instances until the coffee is completely gone.
[17:59:15] running puppet on the new instances results in really weird errors.
[17:59:21] Error: Could not start Service[exim4]: Execution of '/usr/sbin/service exim4 start' returned 1: Job for exim4.service failed. See 'systemctl status exim4.service' and 'journalctl -xn' for details.
[17:59:23] 06Revision-Scoring-As-A-Service: Provide a way to report false positives for ORES tool - https://phabricator.wikimedia.org/T140278#2458936 (10Superyetkin)
[17:59:29] This is on ores-worker-05
[17:59:36] just back
[17:59:52] Good timing. I have 1 more minute :(
[18:00:21] halfak: okay, where are you now?
[18:00:21] WTF is this exim4.service?
[18:00:32] doesn't matter
[18:00:36] mailing service
[18:00:39] I have edited /etc/hosts on web-05
[18:00:45] I have configured both instances
[18:01:02] okay
[18:01:06] puppet agent?
[18:01:08] I have run initialize_*_server on both and they ran to completion but failed when trying to restart the service
[18:01:40] I get the copy-pasted error when running puppet agent on worker-05 and am now running it on web-05
[18:01:59] Same error with exim4.service
[18:02:02] And then puppet dies
[18:02:19] OK. Now I must go. I leave it to you.
[18:02:24] if that's exim4, it's okay
[18:02:28] Schana, we'll need to catch up tomorrow during your afternoon.
[18:02:29] o/
[18:02:33] o/
[18:02:40] (BTW, I should be back in 1.5 hours)
[18:02:46] Taking the puppy to the vet!
[18:02:47] kk
[18:02:52] have fun
[18:08:17] halfak: for when you're back, the hosts part actually had a typo. I fixed it
[18:13:24] curl 0.0.0.0:8080/ on ores-web-05 works just fine but https://ores.wmflabs.org/node/ores-web-05/ doesn't (maybe some nginx caching?)
[18:20:19] PROBLEM - ORES web node labs ores-web-05 on ores.wmflabs.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - 325 bytes in 0.031 second response time
[18:21:29] why the f
[18:32:00] ladsgroup@ores-web-05:~$ curl 0.0.0.0:8080/scores/nlwiki/damaging/789 works
[18:32:08] why :D
[18:39:03] okay, hiera settings
[18:39:22] fixed, and ores web is totally okay
[18:39:22] RECOVERY - ORES web node labs ores-web-05 on ores.wmflabs.org is OK: HTTP OK: HTTP/1.1 200 OK - 353 bytes in 0.594 second response time
[18:39:49] workers? not so much
[18:50:35] https://ores.wmflabs.org/v1/scores/plwiki/reverted/?revids=21914234230|3243242|234324
[18:50:46] doesn't work
[18:51:00] but https://ores.wmflabs.org/v1/scores/testwiki/reverted/?revids=21914234230|3243242|234324 works!
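One quick way to narrow down a failing batch request like the plwiki one above is to score the same rev_ids one at a time and see which single request reproduces the 500 (which is what the later logstash digging confirms). This is only an illustrative sketch, assuming the requests library; the base URL and rev_ids are the ones quoted in the log.

import requests

BASE_URL = "https://ores.wmflabs.org/v1/scores"
# The rev_ids from the failing plwiki URL quoted above.
REV_IDS = [21914234230, 3243242, 234324]


def score_one_at_a_time(wiki="plwiki", model="reverted", rev_ids=REV_IDS):
    """Request each rev_id separately to see which one triggers the error."""
    for rev_id in rev_ids:
        url = "%s/%s/%s/" % (BASE_URL, wiki, model)
        response = requests.get(url, params={"revids": rev_id}, timeout=60)
        print(rev_id, response.status_code)


if __name__ == "__main__":
    score_one_at_a_time()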
[18:51:08] models have issues
[18:51:41] https://ores.wikimedia.org/v1/scores/plwiki/reverted/?revids=21914234230|3243242|234324
[18:51:50] it doesn't work in prod
[18:53:26] other revs/models work
[18:53:37] even in labs
[18:53:43] relieved
[18:54:29] if the requested revision doesn't exist, labs and prod return an internal server error
[18:54:35] it shouldn't happen
[18:56:00] it was different before
[20:12:42] Back.
[20:12:46] OK, logging into logstash
[20:24:07] It's actually not the missing revision. It's another in the set, and it looks like the problem is that the request to pl.wikipedia.org/w/api.php takes too long and the whole thing errors out.
[20:24:45] Looks like we might want to increase the timeout for the Extractor
[20:24:55] We also might want to handle this type of error better.
[20:25:00] I'll file some tasks.
[20:30:09] 06Revision-Scoring-As-A-Service, 10ORES: Respond with useful error information with all recoverable errors - https://phabricator.wikimedia.org/T140301#2459606 (10Halfak)
[20:35:34] 06Revision-Scoring-As-A-Service, 10ORES: Respond with useful error information with all recoverable errors - https://phabricator.wikimedia.org/T140301#2459648 (10Halfak) See T140302 for the specific request that caused the timeout error.
[21:12:07] 06Revision-Scoring-As-A-Service, 10ORES: Respond with useful error information with all recoverable errors - https://phabricator.wikimedia.org/T140301#2459783 (10Halfak) a:03Halfak
[21:12:26] 06Revision-Scoring-As-A-Service, 10ORES: Respond with useful error information with all recoverable errors - https://phabricator.wikimedia.org/T140301#2459606 (10Halfak) https://github.com/wiki-ai/ores/pull/158 The error is still gross, but it is more useful than the bare 500.
[22:20:47] \o/ I actually made it through PR feedback. WOOO
[22:37:50] 06Revision-Scoring-As-A-Service, 10ORES: Explore growing memory usage of web workers on scb machines - https://phabricator.wikimedia.org/T140020#2460077 (10Halfak) While it's not my favorite solution, I think this is good for now.
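For context on the "useful error information" task above: the idea is that a recoverable failure, such as a timed-out request to pl.wikipedia.org/w/api.php, should surface as a structured error document instead of a bare 500. The sketch below is not the change in the linked pull request, just an illustration of the pattern; score_revision is a hypothetical stand-in for the real scoring path, the requests library is assumed, and the timeout value is a placeholder.

import requests

# Placeholder value; the log only says the Extractor timeout may need raising.
EXTRACTOR_TIMEOUT = 30


def score_or_error(score_revision, wiki, model, rev_id):
    """Return an (http_status, body) pair instead of letting a timeout become a bare 500.

    `score_revision` is a hypothetical callable standing in for the real
    scoring path; it is expected to raise requests exceptions on failure.
    """
    try:
        score = score_revision(wiki, model, rev_id, timeout=EXTRACTOR_TIMEOUT)
    except requests.Timeout:
        return 504, {"error": {
            "code": "timeout",
            "message": "Timed out fetching revision %r from the %s API; "
                       "try again later." % (rev_id, wiki),
        }}
    except requests.RequestException as exc:
        return 500, {"error": {"code": "upstream error", "message": str(exc)}}
    return 200, {"score": score}

The point of the shape is that callers get a machine-readable error code and a human-readable message, rather than having to guess why a batch request failed.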