[03:44:10] (CR) Krinkle: [C: -1] Introduce ext.ores.api (1 comment) [extensions/ORES] - https://gerrit.wikimedia.org/r/459549 (https://phabricator.wikimedia.org/T201691) (owner: Ladsgroup)
[03:56:13] (PS3) Krinkle: Use $this->setTemporaryHook() in tests [extensions/ORES] - https://gerrit.wikimedia.org/r/456040 (owner: Legoktm)
[03:56:16] (CR) Krinkle: [C: +2] Use $this->setTemporaryHook() in tests [extensions/ORES] - https://gerrit.wikimedia.org/r/456040 (owner: Legoktm)
[04:20:22] (Merged) jenkins-bot: Use $this->setTemporaryHook() in tests [extensions/ORES] - https://gerrit.wikimedia.org/r/456040 (owner: Legoktm)
[04:47:22] (CR) jenkins-bot: Use $this->setTemporaryHook() in tests [extensions/ORES] - https://gerrit.wikimedia.org/r/456040 (owner: Legoktm)
[10:28:11] (PS4) Ladsgroup: Introduce ext.ores.api [extensions/ORES] - https://gerrit.wikimedia.org/r/459549 (https://phabricator.wikimedia.org/T201691)
[10:28:16] (CR) Ladsgroup: Introduce ext.ores.api (1 comment) [extensions/ORES] - https://gerrit.wikimedia.org/r/459549 (https://phabricator.wikimedia.org/T201691) (owner: Ladsgroup)
[14:02:20] Technical Advice IRC meeting starting in 60 minutes in channel #wikimedia-tech, hosts: @CFisch_WMDE & @amir1 - all questions welcome, more infos: https://www.mediawiki.org/wiki/Technical_Advice_IRC_Meeting
[14:19:39] PROBLEM - https://grafana.wikimedia.org/dashboard/db/ores-extension grafana alert on einsteinium is CRITICAL: CRITICAL: https://grafana.wikimedia.org/dashboard/db/ores-extension is alerting: Service hits for obtaining thresholds alert.
[14:20:52] datacenter switchover in progress
[14:21:16] good, it scared me
[14:21:20] Thanks apergos
[14:21:25] yw
[14:21:45] awight: btw. for when you're back. Regarding one of my commit messages: https://www.youtube.com/watch?v=ac_iZc0gGxk
[14:28:08] halfak: both PRs are ready for review
[14:28:24] Cool. Will look at that today.
[14:36:41] RECOVERY - https://grafana.wikimedia.org/dashboard/db/ores-extension grafana alert on einsteinium is OK: OK: https://grafana.wikimedia.org/dashboard/db/ores-extension is not alerting.
[14:41:50] PROBLEM - https://grafana.wikimedia.org/dashboard/db/ores-extension grafana alert on einsteinium is CRITICAL: CRITICAL: https://grafana.wikimedia.org/dashboard/db/ores-extension is alerting: Service hits for obtaining thresholds alert.
[14:45:00] RECOVERY - https://grafana.wikimedia.org/dashboard/db/ores-extension grafana alert on einsteinium is OK: OK: https://grafana.wikimedia.org/dashboard/db/ores-extension is not alerting.
[14:50:27] Amir1, are you still working on the ORES client lib for JS?
[14:50:49] I'd like to fix ArticleQuality.js because our DOS protection "broke" it :)
[14:51:31] halfak: I'm waiting for the patch in the extension to get merged before I do anything on it
[14:52:14] Technical Advice IRC meeting starting in 10 minutes in channel #wikimedia-tech, hosts: @CFisch_WMDE & @amir1 - all questions welcome, more infos: https://www.mediawiki.org/wiki/Technical_Advice_IRC_Meeting
[14:52:59] https://gerrit.wikimedia.org/r/c/mediawiki/extensions/ORES/+/459549
[14:53:02] wait, what
[14:54:59] "OresApi" Is this some style thing? Both ORES and API are acronyms and I'd expect them to be all caps.
[14:55:17] it's styling
[14:55:38] It's really gross :(
[14:55:39] OK
[14:55:48] about ores.Api()?
[14:55:55] To look like mw.Api()?
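The ext.ores.api module under review is a client for the ORES service; the score part and batching come up just below. For orientation, a minimal sketch of the kind of batched scoring request such a client would eventually issue, written in Python for illustration rather than the extension's JavaScript; the /v3/scores endpoint is the public ORES API, but the helper name and batch size are assumptions.

```python
# Illustrative only: the raw ORES request a client module like ext.ores.api
# would wrap. fetch_scores() and the batch size of 50 are assumptions for the
# sketch, not anything taken from the patch itself.
import requests

def fetch_scores(context, rev_ids, models=("damaging",), batch_size=50):
    """Yield ORES score maps for rev_ids, batched to stay under request limits."""
    for i in range(0, len(rev_ids), batch_size):
        batch = rev_ids[i:i + batch_size]
        resp = requests.get(
            "https://ores.wikimedia.org/v3/scores/{}".format(context),
            params={
                "models": "|".join(models),
                "revids": "|".join(str(r) for r in batch),
            },
            headers={"User-Agent": "ores-client-sketch/0.1"},
        )
        resp.raise_for_status()
        yield resp.json()[context]["scores"]

# e.g. scores = list(fetch_scores("enwiki", [123456, 123457]))
```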
[14:57:00] that would mean introducing global variable of ores and it's highly discouraged, eslint doesn't let you do that
[14:58:56] ext.Ores makes more sense, I don't mind. I think it's another world and we play by js rules :D
[14:59:12] Oh. Sure. ext.ores.Api()
[14:59:30] This is not JS. This is goofy ass MW JS :P
[14:59:36] o/
[15:00:08] o/
[15:00:39] Amir1, looks like this doesn't handle batches. Am I reading that right?
[15:00:39] Amir1: I'm looking forward to this commit message
[15:01:32] halfak: yes, this the first iteration on it
[15:01:48] Ahh so it's not useful for me yet
[15:01:53] once merged I will add score part
[15:01:56] yeah
[15:02:28] Amir1: There are two legitimately fun CR patches waiting for you btw ;-)
[15:02:45] https://gerrit.wikimedia.org/r/#/c/mediawiki/extensions/JADE/+/456078/
[15:02:53] oh sure
[15:03:02] https://gerrit.wikimedia.org/r/#/c/mediawiki/extensions/JADE/+/464737/
[15:03:07] at technical advice irc meeting atm
[15:03:34] your calendar is full, no worries. Just wanted to sneak-shuffle to the top of the queue.
[15:12:24] (PS5) Ladsgroup: Introduce ext.ores.api [extensions/ORES] - https://gerrit.wikimedia.org/r/459549 (https://phabricator.wikimedia.org/T201691)
[15:28:32] (PS6) Awight: Tests to demonstrate SpamBlacklist integration [extensions/JADE] - https://gerrit.wikimedia.org/r/464727 (https://phabricator.wikimedia.org/T206255)
[15:28:34] (PS3) Awight: Split user schema into ID, IP, or username; validate [extensions/JADE] - https://gerrit.wikimedia.org/r/461502 (https://phabricator.wikimedia.org/T206573)
[15:28:45] halfak: There's the schema we discussed yesterday ^
[15:29:28] I think you might want to use oneOf
[15:29:42] Also it seems like username doesn't belong unless you have discovered magic.
[15:31:06] awight, Amir1: either of you heading to SoS now?
[15:31:26] halfak: already there
[15:31:47] halfak: I did discover magic, dropped a note in IRC but I think it washed away in the backscroll. https://phabricator.wikimedia.org/T206573#4653369
[15:32:43] I was considering oneOf but it seems polite to allow clients to include {username, id} if they want.
[15:32:44] awight, what if my username is "Awight_live_at_"?
[15:33:07] Strange but what's the harm?
[15:33:16] awight, so we have a mix of some {username, id}, some with just {username} and some with just {id}?
[15:34:04] awight, re. the DOXXING, it is a common and destructive harassment method.
[15:35:03] I'm currently allowing a mix of username,id or just one.
[15:35:17] Right. That seems like a mess. Why do that?
[15:35:47] Doxxing is a problem with any free-text, yes. None of the fields get special treatment, so user.username doxxing is equivalent to judgment.notes doxxing.
[15:36:08] I have no strong opinions about the oneOf, actually.
[15:36:15] Not exactly. Suppression allows usernames to be removed independent of the content.
[15:36:21] I'm going with robustness principle by default.
[15:36:31] What principle is this?
[15:37:01] https://en.wikipedia.org/wiki/Robustness_principle
[15:37:29] username,id seems like a nice courtesy to allow
[15:37:37] Are we having clients edit the JSON directly?
[15:37:40] username for hupeople
[15:37:59] I disagree re. nice courtesy. Since username is not a persistent identifier.
[15:38:01] clients and editors are allowed to edit JSON for sure
[15:38:04] Usernames change.
[15:38:14] So we're being liberal with what we do.
[15:38:31] We can hook into user renames
[15:38:36] Oh wait. I misread.
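A minimal sketch of the oneOf shape suggested above, with a user identified by either an ID or an IP address and usernames left out entirely. The field names are illustrative rather than the actual JADE schema, and the patch set that follows ends up expressing the same constraint with min/maxProperties instead of oneOf.

```python
# Sketch of the oneOf approach discussed above: identify a user by either an ID
# or an IP, never a username. Field names are assumptions for illustration only.
import jsonschema

USER_SCHEMA = {
    "oneOf": [
        {
            "type": "object",
            "properties": {"id": {"type": "integer", "minimum": 1}},
            "required": ["id"],
            "additionalProperties": False,
        },
        {
            "type": "object",
            "properties": {"ip": {"type": "string"}},
            "required": ["ip"],
            "additionalProperties": False,
        },
    ]
}

jsonschema.validate({"id": 42}, USER_SCHEMA)           # passes
jsonschema.validate({"ip": "192.0.2.7"}, USER_SCHEMA)  # passes
# A mixed {"id": ..., "ip": ...} object or a bare {"username": ...} object
# fails both branches and raises jsonschema.ValidationError.
```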
[15:38:51] Editing content based on user-renames is intractible.
[15:39:06] Why is that?
[15:39:22] Hmm yeah signatures break too, I suppose
[15:39:56] Yes. Signatures are broken. This is part of the reason people want something like Structured Discussions.
[15:40:09] haha *some* people. but yes
[15:40:20] Yes. the entire internet except for us.
[15:40:31] Okay, I'm convinced, let's do it your way
[15:40:43] ty for reviewing
[15:41:23] Just to clarify, are you saying that you want to disallow usernames entirely, or have oneOf the three options?
[15:41:51] I'm saying that I don't understand how usernames are good or maintainable.
[15:42:07] And I suggest a oneOf strategy for legitimate user-identifiers.
[15:42:14] kk
[15:47:42] (PS4) Awight: Split user schema into ID or IP; validate [extensions/JADE] - https://gerrit.wikimedia.org/r/461502 (https://phabricator.wikimedia.org/T206573)
[15:47:45] {{done}}!
[15:47:45] You rule, awight!
[15:49:45] I see it's not oneOf, but it works.
[15:51:12] I think I could have used oneOf by specifying the ID and IP as separate subschemas, not sure which is more readable.
[15:52:46] Took me a minute to read through min/maxProperties to confirm that it worked. But otherwise, it's not too complicated.
[15:56:02] hair-raisingly, I'm discovering that CentralAuth is full of undocumented hooks.
[15:59:20] https://www.mediawiki.org/w/index.php?title=Extension:CentralAuth&type=revision&diff=2917044&oldid=2865714&diffmode=source
[16:01:41] Staff meeting!
[16:01:53] Amir1, harej (if available), awight, etc.
[16:02:24] It looks like we could make user renaming tractable by indexing username usage.
[16:45:20] ORES, Scoring-platform-team (Current): ORES workers using dramatically higher CPU, increasing linearly with time - https://phabricator.wikimedia.org/T206654 (awight)
[16:45:50] ORES, Scoring-platform-team (Current): ORES workers using dramatically higher CPU, increasing linearly with time - https://phabricator.wikimedia.org/T206654 (awight) p:Triage>High
[17:09:19] ORES, Scoring-platform-team (Current): ORES workers using dramatically higher CPU, increasing linearly with time - https://phabricator.wikimedia.org/T206654 (awight) Looks like the trouble started on Sept 22nd, 16:00 UTC. Restarting celery workers on each codfw box reset the usage back to zero, which bu...
[17:26:24] Hey folks! https://wikimediafoundation.org/2018/10/10/mitigating-biases-in-artificial-intelligences-the-wikipedian-way/
[17:26:28] Finally published!
[18:15:00] JADE, MediaWiki-ContentHandler, TechCom-RFC: Content model version field to accompany content model - https://phabricator.wikimedia.org/T205921 (Krinkle) I agree with Daniel that introducing a new top-level primitive for "content model version" seems undesirable as it creates additional expectations...
[18:17:47] bravo!
[18:22:46] https://phabricator.wikimedia.org/project/board/3613/
[18:22:50] \o/
[18:33:56] There are several funny questions about the cpu usage: The cpu usage actually went up two days after the deployment of pool counter, why?
Secondly why ores1009 that is not getting any traffic is going up too (don't restart it, I want to run some checks on it)
[18:38:05] ORES, Scoring-platform-team (Current), Growth-Team, MediaWiki-extensions-PageCuration, and 2 others: Merge articlequality and itemquality - https://phabricator.wikimedia.org/T206037 (awight) p:Triage>High a:awight
[18:38:47] I'm ruling out poolcounter
[18:38:51] it's something else
[18:39:22] the real poolcounter deploy was at 19th, 21 was another deployment
[18:39:29] https://grafana.wikimedia.org/dashboard/db/ores?panelId=5&fullscreen&orgId=1&from=1537477850460&to=1537720679875
[18:39:38] The cpu started to go up at 22
[18:43:26] \o/
[18:43:50] nothing is in SAL of 22 :/
[18:44:04] Looks like CPU is currently stable
[18:45:43] I remember we had something for celery nodes that they would get automatically restarted after doing 70 tasks or something like that
[18:45:48] It was in puppet
[18:45:55] let me check if that got removed
[18:47:05] ORES, Scoring-platform-team (Current), Growth-Team, MediaWiki-extensions-PageCuration, and 2 others: Merge articlequality and itemquality - https://phabricator.wikimedia.org/T206037 (awight) I'm going to push this forward, but would first like to reopen discussion of the name. "contentquality" i...
[18:47:20] halfak: harej: Amir1: ^ https://phabricator.wikimedia.org/T206037#4656098
[18:48:36] ORES, Scoring-platform-team (Current), Growth-Team, MediaWiki-extensions-PageCuration, and 2 others: Merge articlequality and itemquality - https://phabricator.wikimedia.org/T206037 (awight) Additionally, the `itemquality` -> `articlequality` is much easier, only affects Wikidata clients, and doe...
[18:48:53] Amir1: That's a great lead
[18:49:13] There wasn't any change in puppet in 22 https://github.com/wikimedia/puppet/commits/production?after=8c2e94731dc053dccd189a953a79e9d1f8eea448+384
[18:49:27] ORES, Scoring-platform-team (Current), Growth-Team, MediaWiki-extensions-PageCuration, and 2 others: Merge articlequality and itemquality - https://phabricator.wikimedia.org/T206037 (Halfak) "Content" is commonly used among Wikipedians. "Content pages" is an official definition used in Wiki stats.
[18:49:56] was it a holiday? everything is quiet
[18:50:19] It was a Saturday
[18:51:31] I thought it was Monday
[18:52:28] oh!
[18:52:31] * halfak double-checks
[18:52:48] This is Sept 22nd, right?
[18:52:51] That is a Saturday
[18:52:55] Oct 22nd is a Monday
[18:55:42] yup, that was the reason I was mistaken, I looked at October
[19:07:57] ORES, Scoring-platform-team (Current): ORES workers using dramatically higher CPU, increasing linearly with time - https://phabricator.wikimedia.org/T206654 (Ladsgroup) Some detailed graph of the jump: https://grafana.wikimedia.org/dashboard/db/ores?refresh=1m&panelId=5&fullscreen&orgId=1&from=1537477850...
[19:14:49] ORES, Scoring-platform-team (Current), Growth-Team, MediaWiki-extensions-PageCuration, and 2 others: Merge articlequality and itemquality - https://phabricator.wikimedia.org/T206037 (awight) >>! In T206037#4656107, @Halfak wrote: > "Content" is commonly used among Wikipedians. "Content pages" is...
[19:20:10] we are probably being DDoSed
[19:20:27] ORES, Scoring-platform-team (Current), Growth-Team, MediaWiki-extensions-PageCuration, and 2 others: Merge articlequality and itemquality - https://phabricator.wikimedia.org/T206037 (Halfak) "talk quality"? Generally we don't refer to discussions as "content".
"Content pages" does not include... [19:20:49] I know some researchers are using ORES now. [19:26:00] halfak: any particular researcher from Sweden? [19:26:36] Hmm. No [19:26:38] DDoS'd how? The graphs looks steady, https://grafana.wikimedia.org/dashboard/db/ores?refresh=1m&panelId=10&fullscreen&orgId=1&from=now-90d&to=now-1m [19:26:39] PoolCounter basically prevented ORES to go down just brought our cpu usage to highest possible [19:26:57] I don't think that is an adequate explanation [19:27:40] The switchover looks really nice on that graph btw, nice work team and especially Amir1! [19:28:49] I'm checking hadoop at the moment [19:29:09] awight: I didn't do anything, I guess you meant akosiaris ? :D [19:29:53] For sure, but you correctly anticipated that our DC architecture was safe :) [19:30:12] :) [19:30:24] https://grafana.wikimedia.org/dashboard/db/ores?refresh=1m&panelId=3&fullscreen&orgId=1&from=now-90d&to=now-1m [19:30:45] We had an increase in number of requests in the past couple of weeks for sure [19:31:00] akosiaris: since you're around: https://phabricator.wikimedia.org/T206654 [19:31:02] Right. I think it is researchers. [19:33:51] interesting that everything went back to normal right around the time of the mediawiki switchover [19:34:15] https://grafana.wikimedia.org/dashboard/db/ores?panelId=5&fullscreen&orgId=1&from=1539189934551&to=1539192046300 [19:34:30] ah no wait [19:34:32] that's UTC [19:34:42] And awight restarted celery :D [19:34:57] ^ that [19:34:58] yup, that was when we started to restart celery [19:35:09] ok nothing to do with the switchover [19:35:17] * awight facepalms for not SAL'ing [19:36:37] so what ? celery went haywire ? [19:36:45] akosiaris: do you know where requests to eqiad comes from? we had some requests when eqiad was passive [19:37:13] Everything started on 22 of September: https://grafana.wikimedia.org/dashboard/db/ores?refresh=1m&panelId=5&fullscreen&orgId=1&from=1537477850460&to=1537720679875 [19:37:17] external ones ? that can only happen if someone hardcodes the IP in their client [19:37:18] It was rainy day [19:38:15] I'm checking hadoop, there is one case that requests around 100k a day [19:40:02] Mountain View? ;-) [19:40:18] LOL [19:42:45] 100k per day is OK [19:42:59] I expect we get up to 1-2m per day from responsible researchers who are gathering data. [19:46:00] this is looking more like some misbehaving celery than anything [19:46:15] if it was a reaction to user traffic it would start picking up pretty quickly [19:46:35] and would not have that very smooth slope [19:47:05] it needed 10days to climb from 5% to 50% [19:47:12] cpu usage that is [19:47:33] akosiaris: is there any recent changes to puppet of celery? or that's a red herring [19:47:43] this is a useful dashboard btw https://grafana.wikimedia.org/dashboard/db/prometheus-cluster-breakdown?orgId=1&from=now-30d&to=now&var-datasource=codfw%20prometheus%2Fops&var-cluster=ores&var-instance=All [19:48:02] breakdown per host of the entire cluster and on top the sums [19:48:44] Amir1: nothing in puppet for modules celery since 2017 [19:49:22] hunspell was installed on the other hand on ores nodes on Mon Sep 24 [19:49:31] em [19:49:35] hunspell-gl [19:49:47] akosiaris: might be something with poolcounter that has latency [19:49:52] but that does not add up [19:50:08] the Mon Sep 24 I mean... dates you 've given me are Sep 22 [19:50:18] poolcounter ? why ? 
[19:50:56] * akosiaris needs to work on that poolcounter prometheus exporter
[19:51:02] PoolCounter was deployed in 19
[19:51:30] I’m trying to open the blog post and I’m getting a WordPress login screen
[19:53:58] JADE, MediaWiki-ContentHandler, TechCom-RFC: Content model version field to accompany content model - https://phabricator.wikimedia.org/T205921 (Anomie) >>! In T205921#4655970, @Krinkle wrote: > I don't think there will ever be a case where we drop support for handling content models that once existe...
[19:54:35] well there isn't any increase from Sep 19 to Sep 22 18:00 UTC so it would be weird if indeed it was that
[19:55:10] then there is some, then it drops on Sep 27th only to pick up again until today
[19:55:51] 27th we deployed something which means celery got restarted
[19:56:13] ok, so this is going to manifest again
[19:57:42] heh, I already see a process on ores1001 that has consumed 248h of CPU time since Sep30
[19:57:51] pid 24526
[19:58:23] it's doing some mmap and munmap
[19:59:16] yeah it did not die from the looks of it today
[19:59:20] lemme see why
[19:59:42] should it die from time to time?
[19:59:55] if awight restarted it should definitely had
[20:00:10] all the other ones are from today
[20:01:34] well, I thought awight did only codfw
[20:01:53] because we are not getting traffic in eqiad
[20:02:39] aaah good point
[20:02:40] akosiaris: ^
[20:02:46] sorry :D
[20:02:56] no, my bad I should have remembered that
[20:04:08] that thing is doing something weird and it does not look like it's calculating scores
[20:04:38] I 'll try and gdb to it just for the heck of it
[20:05:20] I also checked if we deployed anything in ores beside poolcounter but everything is just home page redesign
[20:10:12] Scoring-platform-team, MediaWiki-API, MediaWiki-Database, Wikimedia-production-error: Certain prop=revisions API queries timeout with "internal_api_error_DBQueryError" - https://phabricator.wikimedia.org/T121333 (Krinkle)
[20:11:34] https://github.com/celery/celery/issues/1558
[20:16:07] I don't know why this issue is biting us now but upgrading kombu from 3.0.37 (production atm) to 4.0.1 would solve it
[20:17:17] hmm, maybe we never noticed it because we were deploying often?
[20:17:57] ORES, Scoring-platform-team (Current): ORES workers using dramatically higher CPU, increasing linearly with time - https://phabricator.wikimedia.org/T206654 (Ladsgroup) Might be related: https://github.com/celery/celery/issues/1558 Our current kombu version is 3.0.37, people reported such issues with tha...
[20:18:15] [back from IRL attack]
[20:21:29] harej: Maybe that's part of the new comms rationing measures ;-)
[20:21:59] JADE, MediaWiki-ContentHandler, TechCom-RFC: Content model version field to accompany content model - https://phabricator.wikimedia.org/T205921 (Krinkle)
[20:26:38] ORES, Scoring-platform-team (Current): ORES workers using dramatically higher CPU, increasing linearly with time - https://phabricator.wikimedia.org/T206654 (Ladsgroup) And also https://github.com/celery/celery/issues/2142 Both of them basically mean we need to upgrade kombu, celery 3.1.26 (latest pre 4...
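The kombu/celery issues linked above come down to an event loop that keeps polling a file descriptor that is no longer valid. A standalone demonstration of why that burns CPU: poll() on a closed descriptor returns POLLNVAL immediately instead of blocking, so a loop that never unregisters the fd spins flat out.

```python
# Demonstration (not ORES code): poll() does not block on a closed fd, it
# reports POLLNVAL right away, which is how a stale fd left registered in an
# event loop turns into a 100% CPU busy loop.
import os
import select

r, w = os.pipe()
poller = select.poll()
poller.register(r, select.POLLIN)
os.close(r)  # the connection goes away, but the fd is never unregistered

events = poller.poll(5000)  # returns immediately, not after 5 seconds
for fd, flag in events:
    if flag & select.POLLNVAL:
        print("fd %d is invalid; a correct loop would unregister it here" % fd)
os.close(w)
```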
[20:27:09] I'm done for the day
[20:27:14] maybe, not sure
[20:27:50] akosiaris: https://github.com/celery/celery/issues/2142 and https://github.com/celery/celery/issues/1558
[20:27:57] It seems like a known issue
[20:28:55] aha
[20:29:01] thanks for the pointer
[20:31:09] akosiaris: I don't know if you know, but I'm working to upgrade ores to celery 4, it will change lots of things and probably makes everything a little bit more robust
[20:32:29] will eat something and be back from home
[20:35:26] I was vaguely aware
[20:46:20] ORES, Scoring-platform-team (Current): ORES workers using dramatically higher CPU, increasing linearly with time - https://phabricator.wikimedia.org/T206654 (awight) Both issues mention a poll syscall returning POLLNVAL that we should be able to see in the strace if it's the same problem. Is it the work...
[21:00:39] (CR) jenkins-bot: Localisation updates from https://translatewiki.net. [extensions/ORES] - https://gerrit.wikimedia.org/r/465739 (owner: L10n-bot)
[21:04:49] ORES, Scoring-platform-team (Current), WMF-JobQueue, MW-1.32-notes (WMF-deploy-2018-10-16 (1.32.0-wmf.26)), and 2 others: Failed executing job: ORESFetchScoreJob - https://phabricator.wikimedia.org/T204753 (Krinkle)
[21:07:30] Amir1, if you're around, check out https://gist.github.com/halfak/ebbf452eb50b6664bb3eded07c3a60cb
[21:07:37] I added the batch processing to the bottom.
[21:07:51] ORES, Scoring-platform-team (Current): ORES workers using dramatically higher CPU, increasing linearly with time - https://phabricator.wikimedia.org/T206654 (akosiaris) ``` mmap(NULL, 262144, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7fbfd8867000 munmap(0x7fbfd8867000, 262144)...
[21:08:39] ORES, Scoring-platform-team (Current): ORES workers using dramatically higher CPU, increasing linearly with time - https://phabricator.wikimedia.org/T206654 (akosiaris) And gdb bt output ``` #0 0x00005628905a4140 in ?? () #1 0x00007fbf68c51b8b in Tokenizer_push_textbuffer (self=self@entry=0x7fbfd9867...
[21:08:48] halfak: I'm waiting in McDonald's. That seems nice, we should make a follow up patch to gerrit
[21:09:03] Sure. I'd love to have it in soon so I can use it :)
[21:09:11] Also, it's totally untested. Just captures the idea.
[21:09:21] Probably has typos and stuff
[21:10:01] ORES, Scoring-platform-team (Current): ORES workers using dramatically higher CPU, increasing linearly with time - https://phabricator.wikimedia.org/T206654 (akosiaris) Unfortunately due to the venv I could not get (yet) the niceties of `py-bt` and `py-list` working but there's already an indication that...
[21:10:06] Hmm, we should add qunit tests
[21:12:14] I am starting to think this is busy looping working on some text
[21:13:01] It's possible that we have a new regex from hell.
[21:13:04] that strlen("wiki_markup") over at the ltrace of the mwparserfromhell dominating the first half of the backtrace means it's pretty surely restricted to us and has nothing to do with the other bugs
[21:13:04] It's happened before.
[21:13:39] the ltrace pretty much repeats itself btw
[21:13:48] I 've just copied one part of it
[21:14:08] the gdb output is full, but it's a point in time
[21:15:09] Could you dump your notes into the task? I think updating mwparserfromhell could be the easiest potential fix to test.
[21:15:16] already done
[21:15:19] If only there were some way to track what revision was being processed...
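One way to act on that wish after the fact: once a suspect revision ID is known, pull its wikitext and time mwparserfromhell on it directly, since the backtrace above points straight at the C tokenizer. A sketch; the action=query parameters are the standard MediaWiki API ones and the revision ID is hypothetical.

```python
# Time mwparserfromhell on a single revision's wikitext to check for the
# suspected pathological tokenizer input. Sketch only; rev_id is hypothetical.
import time
import requests
import mwparserfromhell

def time_parse(rev_id, api="https://en.wikipedia.org/w/api.php"):
    resp = requests.get(api, params={
        "action": "query",
        "prop": "revisions",
        "revids": rev_id,
        "rvprop": "content",
        "rvslots": "main",
        "format": "json",
        "formatversion": "2",
    }, headers={"User-Agent": "ores-debug-sketch/0.1"})
    resp.raise_for_status()
    page = resp.json()["query"]["pages"][0]
    text = page["revisions"][0]["slots"]["main"]["content"]
    start = time.time()
    mwparserfromhell.parse(text)  # the same code path the stuck worker is in
    return time.time() - start

print(time_parse(123456))  # hypothetical revision ID
```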
[21:15:29] yeah that's what I'm trying to figure out
[21:15:46] #81 0x00007fbf68c52640 in Tokenizer_tokenize (self=0x7fbfd9867718, args=0x7fbf540a95e8)
[21:15:46] at mwparserfromhell/parser/ctokenizer/tokenizer.c:167
[21:15:46] #82 0x000056289059d6df in PyCFunction_Call ()
[21:15:54] this is the point we jump from python to pure C
[21:16:25] now to figure out how to get the input parameters of that thing
[21:16:33] I still can't read the logs on the machine X(
[21:16:46] How is it that we still don't have perms to read ORES main.log
[21:17:18] hmmm
[21:17:42] I vaguely remember some issues with uwsgi ... got to find the task
[21:17:44] MediaWiki-extensions-ORES, Scoring-platform-team (Current), Patch-For-Review, User-Ladsgroup: Implement JS ORES client in mw-ORES extension - https://phabricator.wikimedia.org/T201691 (Halfak) I did some work to implement batching. See https://gist.github.com/halfak/ebbf452eb50b6664bb3eded07c3a60cb
[21:18:19] akosiaris, this was showing up in celery
[21:18:24] celery was using a ton of CPU
[21:18:26] FWIW
[21:18:43] yeah but app.log is uwsgi
[21:20:22] Ahh I see.
[21:22:59] 24526 Sun Sep 30 11:16:36 2018 <= that's when the process started. No idea when it went haywire
[21:24:04] I'm not sure if it's helpful, but if we install the https://pypi.org/project/setproctitle/ module, Celery workers will change their process title to show you what they're doing.
[21:25:15] doubtful they will go into the level of detail we 'd currently need
[21:25:30] not against it though. Could prove useful
[21:27:58] Build a list of tokens from a string of wikicode and return it.
[21:28:06] that's what Tokenizer_tokenize does
[21:28:15] https://github.com/earwig/mwparserfromhell/blob/develop/mwparserfromhell/parser/ctokenizer/tokenizer.c#L137
[21:28:45] args is a PyObject
[21:28:46] hmm
[21:29:10] If we enable celery debug logging, I think we'll see the revision IDs
[21:29:30] It's a flood, but we can handle a few hours of it IIRC
[21:30:02] well we would have better luck if we could reproduce it somehow
[21:30:14] this seems like it slowly manifests over weeks
[21:30:16] [back in 15]
[21:30:23] +1 and it ramps up in sudden bursts
[21:40:56] awight: i'm here at the tech engagement offsite and i came up with this weird but interesting use case for JADE for wikitech/mw.org. A gadget on the bottom of doc pages "Is this page helpful?" yes/no, explanation for no, with JADE as the backend.
[21:45:57] harej: That's very cool. & we can train a model to predict whether pages will be helpful?
[21:58:14] awight: if you can figure it out, sure! But I wouldn’t consider that to be a precondition. (What would be would be having JADE in production. Also a schema for page helpfulness.)
[22:03:24] if we get logstash stuff reviewed, we probably can see more stuff
[22:03:34] *cough* halfak *cough* awight
[22:08:03] (PS1) Krinkle: When service fails to respond, retry the job [extensions/ORES] (wmf/1.32.0-wmf.24) - https://gerrit.wikimedia.org/r/465775 (https://phabricator.wikimedia.org/T204753)
[22:13:30] Scoring-platform-team (Current), revscoring, User-Ladsgroup, artificial-intelligence: Rewrite scoring libraries to replace pywikibase with mwbase - https://phabricator.wikimedia.org/T194758 (Halfak) https://github.com/wikimedia/revscoring/pull/406 This should be working now. I made several impr...
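On the setproctitle suggestion above: celery picks the module up automatically when it is installed, and a signal handler can go one step further and stamp the current task's arguments, such as the revision ID, into the process title so ps shows what each worker is chewing on. A sketch; how revision IDs appear in the ORES task arguments is an assumption.

```python
# Stamp the running task and its args into the worker process title so `ps`
# shows what each worker is doing. Sketch only; the args layout is an
# assumption about the ORES task signatures.
from celery.signals import task_prerun, task_postrun
from setproctitle import setproctitle

@task_prerun.connect
def show_current_task(sender=None, task_id=None, task=None, args=None, kwargs=None, **_):
    setproctitle("celery worker: %s %r" % (task.name if task else "?", args))

@task_postrun.connect
def show_idle(**_):
    setproctitle("celery worker: idle")
```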
[22:15:37] (CR) Krinkle: [C: +2] When service fails to respond, retry the job [extensions/ORES] (wmf/1.32.0-wmf.24) - https://gerrit.wikimedia.org/r/465775 (https://phabricator.wikimedia.org/T204753) (owner: Krinkle)
[22:16:23] Amir1, I see you :P
[22:16:39] I'm still catching up on stuff I said I'd do months ago.
[22:16:48] Should review more tomorrow.
[22:17:06] Amir1, do you know if we break stuff if we edit our i18n files directly?
[22:17:16] I got some translation fixes through email and could just push them.
[22:17:40] halfak: it would be rewritten by the bot AFAIK
[22:17:46] except qqq and en
[22:17:56] Ahh. OK. So it's a bad idea.
[22:17:58] Damn it.
[22:18:09] you can add them directly to translatewiki
[22:18:34] Oh? That'd work.
[22:19:57] yup
[22:20:14] Amir1: logstash is great, but we can already just set celery to DEBUG in logging_config.yaml
[22:23:12] Dieting is great but you can already just give yourself food poisoning :P
[22:23:30] OK bad joke. I still can't figure out how to get any information about celery from our logs.
[22:23:32] Debug or not.
[22:24:00] halfak: we can't basically
[22:24:09] have you tried /srv/logs/ores/ ?
[22:24:14] yup
[22:24:19] loggers.celery.level = DEBUG doesn't work?
[22:24:27] we can't read
[22:24:31] app.log has nothing from uwsgi and I still can't read main.log
[22:24:35] permission
[22:24:57] yeah we can just get someone with root /me glances around
[22:25:03] mwbase tests are passing!
[22:25:05] YES
[22:25:05] 1-time need and all
[22:26:04] Alex can do it but I guess he's done for the day
[22:26:44] when we're ready to debug production, we can probably ask any opsen?
[22:27:47] yeah I think
[22:30:35] I'm out of here too. Time for evening chores. Moving a dresser! Always fun.
[22:30:38] Have a good night.
[22:31:39] (Merged) jenkins-bot: When service fails to respond, retry the job [extensions/ORES] (wmf/1.32.0-wmf.24) - https://gerrit.wikimedia.org/r/465775 (https://phabricator.wikimedia.org/T204753) (owner: Krinkle)
[22:33:09] ORES, Scoring-platform-team (Current), Growth-Team, MediaWiki-extensions-PageCuration, and 2 others: Merge articlequality and itemquality - https://phabricator.wikimedia.org/T206037 (awight) >>! In T206037#4656194, @Halfak wrote: > "talk quality"? Generally we don't refer to discussions as "con...
[22:54:44] I'm done for the day
[22:54:50] for real
[22:57:10] Seems that you worked two days
[22:57:14] o/
[23:05:44] (CR) jenkins-bot: When service fails to respond, retry the job [extensions/ORES] (wmf/1.32.0-wmf.24) - https://gerrit.wikimedia.org/r/465775 (https://phabricator.wikimedia.org/T204753) (owner: Krinkle)
[23:15:15] ORES, Scoring-platform-team (Current): ORES workers using dramatically higher CPU, increasing linearly with time - https://phabricator.wikimedia.org/T206654 (awight) I found this config on production, so in theory our workers should be restarting themselves: CELERYD_MAX_TASKS_PER_CHILD: 100
[23:18:04] ORES, Scoring-platform-team (Current): ORES workers using dramatically higher CPU, increasing linearly with time - https://phabricator.wikimedia.org/T206654 (awight) Following up from IRC, I set `loggers.celery.level` to `DEBUG` in `logging_config.yaml`, and was able to demonstrate locally that celery de...
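For reference, the two knobs mentioned in the last comments above, expressed in plain celery/Python terms: the worker-recycling limit found on production and the celery logger turned up to DEBUG. This is a sketch with an assumed broker URL; on the real cluster both live in config files rather than in code.

```python
# Sketch of the two settings discussed above, applied directly to a celery 3.x
# app. The broker URL is an assumption; production configures these through its
# own yaml/puppet config rather than in code like this.
import logging
from celery import Celery

app = Celery("ores_sketch", broker="redis://localhost:6379/0")  # assumed broker
app.conf.update(
    CELERYD_MAX_TASKS_PER_CHILD=100,  # recycle each worker child after 100 tasks
)

# Rough equivalent of loggers.celery.level: DEBUG -- a flood, but it should
# include the task arguments (and therefore the revision IDs) for each request.
logging.getLogger("celery").setLevel(logging.DEBUG)
```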
[23:24:15] JADE, Scoring-platform-team (Current), Patch-For-Review: Validate global user ID; revisit user schema - https://phabricator.wikimedia.org/T206573 (awight) a:awight
[23:25:09] ORES, Scoring-platform-team (Current), Operations, Scap, and 2 others: [Epic] ORES should use a git large file plugin for storing serialized binaries - https://phabricator.wikimedia.org/T171619 (awight)
[23:25:17] ORES, Scoring-platform-team (Current), Operations, Scap, and 2 others: [Epic] ORES should use a git large file plugin for storing serialized binaries - https://phabricator.wikimedia.org/T171619 (awight)
[23:26:24] Scoring-platform-team (Current), Patch-For-Review, User-Ladsgroup: [Epic] Use LFS for large ORES files - https://phabricator.wikimedia.org/T197096 (awight) This task should be deduped or linked to T171619
[23:27:04] ORES, Scoring-platform-team (Current), Documentation: Draft of ORES threshold optimization documentation - https://phabricator.wikimedia.org/T198232 (awight) a:awight>Halfak
[23:28:33] MediaWiki-extensions-ORES, Scoring-platform-team (Current), Global-Collaboration: Provide a user-accessible page showing current thresholds for each sensitivity level - https://phabricator.wikimedia.org/T195083 (awight) Open>Resolved a:awight Nice hustle, this is deployed! https://en.wiki...
[23:29:21] Scoring-platform-team (Current), editquality-modeling, artificial-intelligence: Simplify and modularize the Makefile template - https://phabricator.wikimedia.org/T190968 (awight) Can we shelf this in the backlog?
[23:29:55] Scoring-platform-team, edittypes-modeling, artificial-intelligence: Edittypes repo setup - https://phabricator.wikimedia.org/T191214 (awight) a:Sumit>None Unassigning.