[01:20:18] (PS5) Awight: Hooks to maintain judgment link tables [extensions/JADE] - https://gerrit.wikimedia.org/r/466806 (https://phabricator.wikimedia.org/T202596)
[01:20:20] (PS3) Awight: Maintenance scripts for judgment indexes [extensions/JADE] - https://gerrit.wikimedia.org/r/466808 (https://phabricator.wikimedia.org/T202596)
[04:40:45] (CR) Legoktm: [C: 2] build: Updating npm dependencies for security issues [extensions/ORES] - https://gerrit.wikimedia.org/r/466262 (owner: Libraryupgrader)
[04:42:13] (CR) jenkins-bot: build: Updating npm dependencies for security issues [extensions/ORES] - https://gerrit.wikimedia.org/r/466262 (owner: Libraryupgrader)
[07:30:19] (PS1) Awight: Spaces to tabs in JSON files [extensions/JADE] - https://gerrit.wikimedia.org/r/466829
[09:21:39] o/
[10:06:36] akosiaris: hey, do you think we can merge this? https://gerrit.wikimedia.org/r/#/c/466716/
[10:06:52] sure
[10:08:28] 99-main.yaml just got
[10:08:30] +logging:
[10:08:30] + handlers:
[10:08:30] + logstash:
[10:08:30] + host: logstash.svc.eqiad.wmnet
[10:08:30] + port: 12201
[10:08:52] should be in all of the cluster in the next 30 mins
[10:11:06] btw, I 've tested all last 500 revision of https://es.wikipedia.org/wiki/Usuario:Danielalfredo/Taller with mwparserfromhell 0.5.1 and none cause the issue
[10:11:26] s/last/latest/
[10:12:54] but this is weird. cause clearly this is stuck in mwparserfromhell.parse and https://phabricator.wikimedia.org/P7669 does indeed point to that page
[10:15:55] Oh thanks!
[10:15:59] let me dig
[10:18:19] it's a rabbithole mind you. I am trying to figure it out for years
[10:18:22] hours*
[10:19:58] :)) the best typo ever
[10:20:32] yeah, that subconscious thing is funny
[10:20:55] it probably does look that long to me already
[10:21:37] that's something about those tables
[10:21:52] half the stack trace is Tokenizer_parse_table and Tokenizer_handle_table_row
[10:25:37] akosiaris: let's just delete the page :D
[10:25:44] lol
[11:37:11] (PS1) Ladsgroup: Start using logstash [services/ores/deploy] - https://gerrit.wikimedia.org/r/466857 (https://phabricator.wikimedia.org/T181546)
[11:42:48] (CR) Ladsgroup: [V: 2 C: 2] Start using logstash [services/ores/deploy] - https://gerrit.wikimedia.org/r/466857 (https://phabricator.wikimedia.org/T181546) (owner: Ladsgroup)
[11:54:59] akosiaris: https://logstash-beta.wmflabs.org/goto/80b35b5099d8d95f993a2e7ec18754ff
[11:56:53] Scoring-platform-team (Current), Operations, Wikimedia-Logstash, monitoring, and 3 others: Send celery and wsgi service logs to logstash - https://phabricator.wikimedia.org/T181630 (Ladsgroup) It works: https://logstash-beta.wmflabs.org/goto/f9a3fea8c95c02724813fb7cbebb6471 Shall we go prod? let'...
[12:57:07] YES!!! I 've finally reproduce it!
[12:57:20] * akosiaris adding info to phab
[12:57:57] wanna know the best part ? I think I know why timeouts don't work
[12:58:14] mwparserfromhell does something with signals I guess
[12:58:23] cause I can't even Ctrl-C the process
[12:59:04] Coool
[12:59:08] akosiaris: https://github.com/glenfant/stopit
[12:59:17] "Will not stop the execution of blocking Python atomic instructions that acquire the GIL. In example, if the destination thread is actually executing a time.sleep(20), the asynchronous exception is effective after its execution."
[12:59:23] Is it related?
[13:01:50] no, I don't think so
[13:02:36] there must be some malformed character in all of that btw that causes all this pain
[13:03:23] oh good (you saved me from going to a rabbit hole)
[13:21:49] ORES, Scoring-platform-team (Current): ORES workers using dramatically higher CPU, increasing linearly with time - https://phabricator.wikimedia.org/T206654 (akosiaris) I think I 've managed to reproduce the problem. After fetching and trying the last 500 revisions from https://es.wikipedia.org/wiki/Usua...
[13:29:38] ORES, Scoring-platform-team (Current): ORES workers using dramatically higher CPU, increasing linearly with time - https://phabricator.wikimedia.org/T206654 (akosiaris) The dumped memory when converted back to utf8 has the exact same MD5 as https://es.wikipedia.org/w/index.php?title=Usuario:Danielalfredo...
[13:41:59] o/
[13:42:54] akosiaris, great catch! No nested signals! It explains the problem.
[13:44:25] oh... wait. Can't trigger the bug again with that revision? Hmmm.
[13:44:30] can't find any evidence of signal handling in mwparserfromhell however
[13:44:42] How did you get it to spin out of control?
[13:44:49] (CR) Ladsgroup: [C: 2] Always require damaging and goodfaith together [extensions/JADE] - https://gerrit.wikimedia.org/r/466781 (owner: Awight)
[13:44:50] look at the memory dump file
[13:45:01] download it and parse that with the sample script I provided
[13:45:03] (CR) Ladsgroup: [C: 2] Followup: convert user subschema to use oneOf [extensions/JADE] - https://gerrit.wikimedia.org/r/466786 (owner: Awight)
[13:46:42] OK. So somehow this is in memory but it doesn't match the revid in question exactly?
[13:47:02] This might be enough to file a bug and get earwig involved.
[13:47:52] (CR) Ladsgroup: Split user schema into ID or IP; validate (1 comment) [extensions/JADE] - https://gerrit.wikimedia.org/r/461502 (https://phabricator.wikimedia.org/T206573) (owner: Awight)
[13:49:27] Scoring-platform-team, Wikilabels, articlequality-modeling, artificial-intelligence: Build article quality model for Galician Wikipedia - https://phabricator.wikimedia.org/T201146 (Halfak) FWIW, I'll be working on the ArticleQuality.js documentation today so you'll be able to translate and make u...
[13:53:25] this is very very weird. The internal memory representation of this thing reproduces the bug, but only that. Why ?
[13:56:12] I'm struggling with this memory dump. The encoding is strange somehow.
[13:56:36] Ahh. UTF-16?
[13:56:47] Emojis?
[13:57:22] UTF-16 ~ UTF-8
[13:57:37] Just forces early ascii chars to use more bytes.
[13:57:39] I think?
[13:58:22] the memory dump is in UCS2 cause that's what python3 uses internal for this specific string (since pep393 (python3.3) it's variable and entirely depends on the string)
[13:58:48] for converting from ucs2 to a native python string, using UTC16 is fine (the inverse doesn't work)
[13:58:59] I just converted it to UTF-8 and the signal timeout worked just fine.
[13:59:08] yup I have that too
[13:59:14] but only if I write the file first
[13:59:27] for example if I add a line under text = data.decode('utf16')
[13:59:28] So, there's something about loading it as USC2... :|
[13:59:35] text = data.encode('utf8')
[13:59:46] and feed that variable to mwparserfromhell I still reproduce
[13:59:59] that former being of type str and the later of type bytes
[14:00:12] but if I write that to a file and try to reproduce with that.. no dice
[14:00:23] Oh interesting. Let me load this in as bytes.
[14:00:26] WHAT ON EARTH ?
[14:00:59] I wonder if pickle is doing something weird. We use pickle as a transport between uwsgi and celery.
[14:01:11] At some point, this text will get pickled and unpickled.
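The decode/encode round trip being described can be sketched with stdlib-only code. The sample string below is a hypothetical stand-in for the dumped page text (the real dump is attached to T206654); variable names follow the chat.

```python
# Sketch of the round trip under discussion. `text` stands in for the
# dumped eswiki page text; the real dump is the raw string buffer that
# CPython keeps in its compact internal form (PEP 393).
text = "{| hola\n| celda\n"          # hypothetical wikitext fragment

# A raw memory dump of such a buffer can be recovered by decoding it as
# UTF-16 (the UCS-2 -> UTF-16 direction is lossless for these strings):
data = text.encode("utf-16")         # bytes, as pulled out of memory
recovered = data.decode("utf-16")    # type str -- feedable to a parser

# Re-encoding gives bytes again -- a different type, which is what the
# "former being of type str and the later of type bytes" remark is about:
utf8 = recovered.encode("utf8")
assert recovered == text
assert isinstance(recovered, str) and isinstance(utf8, bytes)
```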
[14:02:09] A straight up pickle/unpickle doesn't affect anything
[14:07:54] akosiaris, so I'm struggling to find a way to reproduce.
[14:08:01] I can't even get the process to hang.
[14:08:09] * halfak pastes code.
[14:08:37] https://phabricator.wikimedia.org/P7672
[14:08:42] Do you see something I am missing?
[14:08:46] I only reproduced up to now in the virtualenv on ores1001 btw
[14:08:56] Oh interesting.
[14:09:05] I'll try that. I have been running in my localhost.
[14:10:40] with a few minor modification (I 've removed the timeout call) it does indeed reproduce on ores1001
[14:11:36] (Merged) jenkins-bot: Always require damaging and goodfaith together [extensions/JADE] - https://gerrit.wikimedia.org/r/466781 (owner: Awight)
[14:11:38] (Merged) jenkins-bot: Followup: convert user subschema to use oneOf [extensions/JADE] - https://gerrit.wikimedia.org/r/466786 (owner: Awight)
[14:12:11] ok my brain is fried. I am gonna step away a few hours
[14:12:18] totally reasonable.
[14:12:23] maybe I 'll have some epiphany in the meantime
[14:12:32] So maybe the timeout *is* working after all but the job just keeps getting resubmitted.
[14:12:48] Regardless, mwparserfromhell shouldn't be spinning out of control.
[14:12:50] no that definitely not it
[14:13:03] it's a busy loop for sure
[14:13:33] I don't know why the timeout does not catch it but the fact Ctrl-c does not work makes me think something weird is going on
[14:13:41] (CR) Ladsgroup: Render Judgment pages as wikitext (4 comments) [extensions/JADE] - https://gerrit.wikimedia.org/r/464737 (https://phabricator.wikimedia.org/T206346) (owner: Awight)
[14:13:48] normally that would happen if a library/application was catching signals and handling them
[14:14:00] maybe there's something esoteric about binary modules in python
[14:14:22] Amir had a point a few hours ago about an exception to the stopit module
[14:14:26] something about atomic operations
[14:14:34] akosiaris, but you said it spun once you removed the timeout.
[14:14:42] I fail to see how mwparserfromhell is an atomic operation
[14:14:42] akosiaris, that is for ThreadedTimeouts.
[14:14:54] SignalTimeouts can stop anything supposedly.
[14:15:02] Even "atomic operations"
[14:15:13] yeah I removed the the timeout because ores.util was not directly importable
[14:15:32] probably because I am running it from my homedir
[14:15:43] even though the venv is activated
[14:15:49] * halfak SCP's up a cloned git.
[14:15:49] lemme retry that
[14:16:46] but later, RL calls
[14:16:59] kk
[14:17:10] Will give a good ores1001 test in the meantime.
[14:17:39] (CR) jenkins-bot: Always require damaging and goodfaith together [extensions/JADE] - https://gerrit.wikimedia.org/r/466781 (owner: Awight)
[14:20:40] (CR) jenkins-bot: Followup: convert user subschema to use oneOf [extensions/JADE] - https://gerrit.wikimedia.org/r/466786 (owner: Awight)
[14:22:38] I just ran the darn'ed thing on ores1001 and it didn't even spin. The timeout didn't even get caught.
[14:22:43] * halfak sighs
[14:34:05] Aha! I wasn't using the right file. I have also managed to replicate!
[14:34:38] Indeed the signal does not work to stop it.
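The paste P7672 isn't reproduced in this log, but the shape of the test harness being passed around is roughly the following stdlib-only sketch. The `parse()` stand-in replaces `mwparserfromhell.parse`; a busy loop in Python bytecode like this one *is* stoppable by the SIGALRM handler, which is exactly what distinguishes it from the hang inside the C tokenizer.

```python
import signal

def parse(text):
    # Stand-in for mwparserfromhell.parse(): a busy loop in Python
    # bytecode, which (unlike a loop inside a C extension) the signal
    # handler below can interrupt.
    while True:
        pass

def on_alarm(signum, frame):
    raise TimeoutError("parse timed out")

old = signal.signal(signal.SIGALRM, on_alarm)
signal.alarm(1)                      # the mechanism stopit's SignalTimeout uses
timed_out = False
try:
    parse("{| an unclosed table ...")
except TimeoutError:
    timed_out = True
finally:
    signal.alarm(0)                  # cancel any pending alarm
    signal.signal(signal.SIGALRM, old)
print("timed out:", timed_out)
```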
[14:34:55] Confirmed that ^C does nothing.
[14:35:40] OK. So now, I want to see if I can process this revision in labs without anything going crazy.
[14:37:53] Confirmed. I can reproduce on ores-staging with http://ores-staging.wmflabs.org/v3/scores/eswiki/111186880/damaging
[14:38:54] OK! We're getting somewhere.
[14:39:07] Now can I reproduce on my localhost in ORES?
[14:39:32] This is second time eswiki is bringing ores down :D
[14:39:40] haha.
[14:41:28] (PS6) Ladsgroup: Introduce ext.ores.api [extensions/ORES] - https://gerrit.wikimedia.org/r/459549 (https://phabricator.wikimedia.org/T201691)
[14:42:53] ^ nice
[14:43:31] Amir1, I really want to get that batch processing in place so that I can fix ArticleQuality.js. I'm seriously considering just implementing it within the user-script.
[14:44:49] I recommend doing it but in a way that's easily portable to the extension.
[14:46:38] (PS7) Ladsgroup: Introduce ext.ores.api [extensions/ORES] - https://gerrit.wikimedia.org/r/459549 (https://phabricator.wikimedia.org/T201691)
[14:47:07] Hmm. Not sure what might make it easily portable.
[14:47:19] (CR) Ladsgroup: "Thanks." (3 comments) [extensions/ORES] - https://gerrit.wikimedia.org/r/459549 (https://phabricator.wikimedia.org/T201691) (owner: Ladsgroup)
[14:47:29] Other than just wholesale grabbing your code and extending it in the user-script.
[14:48:12] That would work :D
[14:49:01] Aha! So, the problem persists on my local machine too (ubuntu 16.04)
[14:49:45] I need to kill -9 the process to get it to stop
[14:49:53] * halfak copies notes.
[14:59:57] that's interesting
[15:05:53] is the text anywhere I could try it?
[15:06:43] Platonides, there's a file attached to https://phabricator.wikimedia.org/T206654#4661442
[15:07:05] That contains the strangely encoded text that fails. akosiaris pulled it out of memory.
[15:07:29] ORES, Scoring-platform-team (Current): ORES workers using dramatically higher CPU, increasing linearly with time - https://phabricator.wikimedia.org/T206654 (Halfak) OK trying to replicate... I have found that you absolutely *must* use the file provided by @akosiaris directly. Even re-saving it UTF-16...
[15:08:56] which version of mwparserfromhell and ores.util should I install?
[15:09:20] ores.util from git clone of https://github.com/wikimedia/ore
[15:09:26] mwparserfromhell==0.5.1
[15:10:35] * halfak attempts running the same thing with mwparserfromhell==0.5.0
[15:11:43] ORES, Scoring-platform-team (Current): ORES workers using dramatically higher CPU, increasing linearly with time - https://phabricator.wikimedia.org/T206654 (Halfak) OK trying to replicate... I have found that you absolutely *must* use the file provided by @akosiaris directly. Even re-saving it UTF-16...
[15:12:58] I've confirmed that mwparserfromhell==0.5.0 still spins out of control but that the signal timeout works.
[15:13:21] sigh No module named 'stopit'
[15:13:57] ok, it certainly runs
[15:14:02] install ores requirement :P
[15:14:28] pip install -r requirements.txt in the ores base dir.
[15:16:02] This is so weird.
[15:16:49] the revision doesn't seem to stop, either
[15:17:29] Interesting. Are you parsing the revision directly or using the file that akosiaris produced?
[15:17:32] it has been running for more than 5 seconds with the version downloaded from wikipedia
[15:17:47] I should re-try that.
[15:17:48] probably because I downloaded it directly
[15:17:59] wget "https://es.wikipedia.org/w/index.php?title=Usuario:Danielalfredo/Taller&oldid=111186880&action=raw" -O 111186880.txt
[15:18:21] maybe you copied and pasted it instead?
[15:18:54] Maybe. I'll use mwapi.
[15:23:28] Ha! Confirmed. Nice work Platonides!
[15:23:34] So it's not celery!
[15:23:40] ORES, Scoring-platform-team (Current): ORES workers using dramatically higher CPU, increasing linearly with time - https://phabricator.wikimedia.org/T206654 (Platonides) I can reproduce it with the version downloaded from wikipedia: $ wget "https://es.wikipedia.org/w/index.php?title=Usuario:Danielalfr...
[15:23:42] That makes life way better.
[15:23:56] :)
[15:24:09] We can now file a bug against mwparserfromhell.
[15:24:23] I need to make a nice minimal version that also shows that signal timeouts don't work.
[15:29:37] hmm
[15:29:44] signaltimeouts seem to use alarm
[15:29:52] Right.
[15:30:04] I don't see why it would faild :S
[15:30:12] I don't have much left to do for today (and this helps to reduce some overtime) I might work later though if there are some stuff to do :D
[15:30:25] Have a nice weekend people
[15:30:40] you too, Amir1
[15:30:44] o/ Amir1
[15:30:47] Thanks for your work on this.
[15:31:14] Amir1, if you have a minute (no worries if you don't), I don't understand why you removed the try-catch around stopit.
[15:31:20] the alarm is generated
[15:31:51] the problem may be that mwparserfromhell is allocating and deallocating a lot of memory
[15:32:01] halfak: sure, basically because it already raises
[15:32:04] (brk,munmap,mmap)
[15:32:32] and thus, the alarm interrupts one of those internal syscalls
[15:32:51] https://github.com/wikimedia/ores/pull/273/files#diff-3d43799c54cb415864fb94bd365dcb60R32
[15:33:13] halfak: ^
[15:33:25] 17:32:22.320229 mmap(NULL, 262144, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x723fe97b0000
[15:33:29] 17:32:22.329990 munmap(0x723fe97b0000, 262144) = 0
[15:33:31] 17:32:22.330182 mmap(NULL, 262144, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x723fe97b0000
[15:33:34] 17:32:22.344833 --- SIGALRM {si_signo=SIGALRM, si_code=SI_KERNEL} ---
[15:33:36] 17:32:22.344974 rt_sigreturn() = 125618827244528
[15:33:39] 17:32:22.355381 brk(0x3591000) = 0x3591000
[15:33:42] 17:32:22.358384 munmap(0x723fe97b0000, 262144) = 0
[15:33:56] I wonder if it reaches python code even
[15:34:04] or if the loop is in the C part
[15:34:38] Amir1: Hey, thanks for the CR!
[15:35:15] Platonides, should be that we can still alarm during the C execution.
[15:35:19] *should*
[15:35:20] :)
[15:35:34] OK I have a simple script that works great.
[15:35:47] Filing bug report
[15:36:05] In the meantime, we can downgrade to 0.5.0 and expect this specific degenerate condition to go away.
[15:40:18] halfak: I was thinking that perhaps it , ,
[15:40:21] https://github.com/earwig/mwparserfromhell/issues/206
[15:41:01] if you run kill -15 it ignores you, too
[15:41:11] but kill -TERM works
[15:41:24] Platonides, there's also the problem of nested signals. Basically, they are not allowed. So if mwparserfromhell is doing anything with signals, then it would overwrite what we're trying to do.
[15:43:37] Platonides, yes. Same experience.
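The strace output above shows the kernel *does* deliver SIGALRM (the `--- SIGALRM ---` / `rt_sigreturn()` pair). What it can't show is that CPython's C-level handler only sets a pending flag; the Python-level handler runs when the interpreter next executes bytecode, which a loop inside the C tokenizer never lets happen. A small sketch of that deferral (stdlib only; `signal.raise_signal` needs Python 3.8+):

```python
import signal

seen = []
signal.signal(signal.SIGALRM, lambda signum, frame: seen.append("python handler ran"))

# Deliver the signal to ourselves. At the OS level it is handled
# immediately (that is the rt_sigreturn() in the strace); CPython's
# C handler just records it as pending, and the Python handler runs
# once control returns to the bytecode eval loop.
signal.raise_signal(signal.SIGALRM)

# By the time this Python-level line executes, the handler has run:
assert seen == ["python handler ran"]
```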
[15:45:30] * halfak looks through difference between 0.5.0 and 0.5.1
[15:45:41] I don't see them using any signal code
[15:45:55] and in fact, there was no signal syscall
[15:45:56] Me either.
[15:46:07] (in the strace output), and it was delivered...
[15:49:16] Yeah... I really don't understand how the signal is getting fudged.
[15:51:34] I wonder if that revision may contain a malformed utf-8 character
[15:51:54] I'm guessing something like that is the case.
[15:52:18] Maybe the code that reads in the bytes does something weird with signal. But then why would 0.5.1 behave any differently from 0.5.0.
[15:52:48] $ LANG=C iconv -f utf-8 -t iso-8859-15 < 111186880.txt > 111186880.iso
[15:52:48] iconv: illegal input sequence at position 196065
[15:53:49] I don't see anything odd there
[15:54:01] although I wonder how does iconv count positions...
[15:57:03] ORES, Scoring-platform-team (Current), Growth-Team, MediaWiki-extensions-PageCuration, and 2 others: Merge articlequality and itemquality - https://phabricator.wikimedia.org/T206037 (Halfak) articlequality and itemquality have very much in common. They are all based on similar scales and refer t...
[15:58:18] that iconv was just a special -
[15:58:33] U+2013 EN DASH
[15:59:14] I note that this article has many unclosed tables
[15:59:47] maybe it's not a loop
[16:00:03] but just makes it go a way that takes too long
[16:02:03] 170 {| but only 120 |}
[16:03:59] the problem was likely introduced in 86c805d59b835146e792504550da860f95b11c9a
[16:06:58] hmmm, nope
[16:08:05] Why "nope"?
[16:08:30] I'm testing with several versions from mwpfh git
[16:08:54] Scoring-platform-team, editquality-modeling, artificial-intelligence: Simplify and modularize the Makefile template - https://phabricator.wikimedia.org/T190968 (Halfak) a:Halfak>None Yeah. Was looking into it and it'll take some work to revive it. It doesn't seem pressing now.
[16:09:25] Gotcha.
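The "170 {| but only 120 |}" observation above is a simple marker count; it can be checked with a couple of lines. The wikitext below is a tiny made-up stand-in, not the real revision:

```python
# Count wikitable open/close markers. Every "{|" without a matching "|}"
# leaves a table open, so later tables nest inside it -- the deeply
# nested structure the tokenizer chokes on.
wikitext = "{|\n| a\n{|\n| nested\n|}\n{|\n| never closed\n"
opens, closes = wikitext.count("{|"), wikitext.count("|}")
assert (opens, closes) == (3, 1)   # more opens than closes => unclosed tables
```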
[16:10:04] awight, have you been following this stuff ^ ?
[16:10:10] Hi Channel. halfak, awight or Amir1, does anyone have 30 minutes today to talk over hangouts and give me a walkthrough of the anatomy of a [campaign]quality repo? I've been stepping through without documentation, but it would be a bit easier just to do it synchronously with an expert. Also we can setup a time for next week.
[16:10:28] halfak: yes, looks like great fun so far.
[16:11:09] I do enjoy a mystery, if you want to hand off your baton at any point
[16:11:21] awight, I propose we downgrade to mwparserfromhell==0.5.0 in the short term.
[16:11:39] Because then our timeouts will continue to work.
[16:12:00] And prevent this degenerative behavior.
[16:12:04] halfak: I couldn't quite parse what you had discovered--ah okay, so it still gets stuck in something like an infinite loop, but we can break out?
[16:12:27] & you confirmed 0.5.0 behavior on ores1001?
[16:12:57] I am still seeing it on mwparserfromhell version: 0.5
[16:13:01] fwiw we've been using 0.5.1 since mid-April
[16:13:19] awight, Platonides you are not able to get a signal timeout to work in 0.5?
[16:13:25] no
[16:13:31] 0.5 compiled from git
[16:13:40] Oh. It works for me. I'm able to demo
[16:14:10] * halfak double-checks on ores1001
[16:14:56] (CR) Awight: "Note to self: AbstractContentHandler::getSecondaryDataUpdates might be an alternative to some of these hooks." [extensions/JADE] - https://gerrit.wikimedia.org/r/466806 (https://phabricator.wikimedia.org/T202596) (owner: Awight)
[16:16:22] fails with mwparserfromhell==0.5.0 from pip, too
[16:17:23] Platonides, define "fails"
[16:17:46] I need to send a kill externally
[16:17:53] and time spent > 5s
[16:17:55] eg. 16s
[16:18:07] Weird. Not my experience.
[16:18:16] $ time python3 test3.py 111186880.txt
[16:18:16] mwparserfromhell version: 0.5
[16:18:16] PID: 3423
[16:18:16] Terminado
[16:18:16] real 0m16.761s
[16:18:30] different environments, probably
[16:19:12] I can try from scratch on a new VM
[16:22:45] * halfak tries from Jessie 9.3
[16:23:03] Well..
[16:23:15] Indeed this doesn't work. Let me try again from my Ubuntu 16.04
[16:23:44] yes, it is failing on a clean VM too
[16:24:40] Confirmed. 0.5.0 is interruptable in Ubuntu 16.04 but not Debian Jessie 9.3
[16:24:49] funny
[16:25:47] which parameters are used on 16.04 when compiling the tokenizer?
[16:27:01] Platonides, I'm running "pip install mwparserfromhell==0.5.0"
[16:27:36] yes, but when I do that I see that it is doing eg.
[16:27:40] x86_64-linux-gnu-gcc -pthread -DNDEBUG -g -fwrapv -O2 -Wall -Wstrict-prototypes -g -fstack-protector-strong -Wformat -Werror=format-security -D_FORTIFY_SOURCE=2 -fPIC -I/usr/include/python3.4m -c mwparserfromhell/parser/ctokenizer/tok_support.c -o build/temp.linux-x86_64-3.4/mwparserfromhell/parser/ctokenizer/tok_support.o
[16:29:18] * halfak tries a verbose install
[16:29:49] For debian 9.3: x86_64-linux-gnu-gcc -pthread -DNDEBUG -g -fwrapv -O2 -Wall -Wstrict-prototypes -g -fdebug-prefix-map=/build/python3.5-3.5.3=. -fstack-protector-strong -Wformat -Werror=format-security -Wdate-time -D_FORTIFY_SOURCE=2 -fPIC -I/usr/include/python3.5m -I/home/halfak/venv/include/python3.5m -c mwparserfromhell/parser/ctokenizer/tokenizer.c -o build/temp.linux-x86_64-3.5/mwparserfromhell/parser/ctokenizer/tokenizer.o
[16:29:57] I didn't need anything special :/
[16:30:27] btw, 9.3 is not Jessie but stretch
[16:31:08] Yes. Thank you.
[16:31:13] Ooh maybe a breakthrough.
[16:31:23] "compilation terminated. error: command 'x86_64-linux-gnu-gcc' failed with exit status 1" Falling back to pure Python mode."
[16:31:33] * halfak runs to meeting
[16:31:36] back in 30
[16:31:53] :)
[16:32:01] I wonder if pure-python mode is the ticket.
[16:32:02] could be the defference
[16:32:12] *difference
[16:32:26] I noticed that if I add a signal handler for SIGTERM it doesn't work either
[16:33:13] Interesting note in the CHANGELOG:
[16:33:14] +- Fixed the parser getting stuck in deeply nested HTML tags with unclosed,
[16:33:17] + quoted attributes. (#190)
[16:33:49] isn't that the one I mentioned earlier?
[16:34:47] hmmm 0.4.4 is working :)
[16:35:04] * Platonides bisects
[16:35:22] Platonides: Ah, good! I did see your note about the unclosed HTML tables but missed the issue #
[16:36:09] I mean when I said:
[16:36:09] 16:02:45 < Platonides> the problem was likely introduced in 86c805d59b835146e792504550da860f95b11c9a
[16:36:40] oops yeah and I recognize the problem report as already linked from our Phab task
[16:37:01] Of course, the *fix* is supposedly in 0.5.1
[16:38:45] bisection lead me to 6ee61789da11a23720e743e14d856f7d5ed1c234 is the first bad commit
[16:39:00] but that's just a commit making the compilation work
[16:42:06] O_O
[16:44:44] if compilation of C code fails, it falls back to python, which works :P
[16:45:46] forcing compilation with -std=c99, the problem seems to be this other:
[16:45:51] $ git bisect good
[16:45:51] 8a9c9224be6cb2020ed4ad67a401081096dd21d1 is the first bad commit
[16:45:51] commit 8a9c9224be6cb2020ed4ad67a401081096dd21d1
[16:45:52] Author: Ben Kurtovic
[16:45:52] Date: Fri Jun 23 01:08:19 2017 -0400
[16:45:54] Speed up parsing deeply nested syntax by caching bad routes (fixes #42)
[16:45:57]
[16:45:59] Also removed the max cycles stop-gap, allowing much more complex pages
[16:46:02] to be parsed quickly without losing nodes at the end
[16:46:05]
[16:46:07] Also fixes #65, fixes #102, fixes #165, fixes #183
[16:46:09] Also fixes #81 (Rafael Nadal parsing bug)
[16:46:12] Also fixes #53, fixes #58, fixes #88, fixes #152 (duplicate issues)
[16:46:14] :100644 100644 7d34015e6617e8fe64bd9e7b944aed16d0bfd197 bebacbf7f150ba65808db796170305d440c5e007 M CHANGELOG
[16:46:17] :100644 100644 230bc5c0ddb9faf37c2a443fe21de7ca95f23f6f 588e73714318b294c252a556089ef56be4aedf4f M LICENSE
[16:46:20] :040000 040000 0948488c8bdf20dad56f5fc9319027b82eec7c86 42357d1ea27a29f466b7055952ae44b398c4eeea M docs
[16:46:23] :040000 040000 5f76a541f8c350d3dc8b87ebe2039f07463fcddb d6ebf96d1e3f99bd563a1ea187f69fee519448dd M mwparserfromhell
[16:46:26] :040000 040000 83761869237f086c06b82cc3677dddd3c4825b34 5b27b8f4fde83ae6ab21fb7c1bfc06e7c176d67a M tests
[16:46:54] seems logical to be that
[16:48:39] Looks exciting! I'm confused about which tokenizer we're using, though-- halfak seems to be saying that our production boxes are using the Python tokenizer?
[16:48:43] and the article indeed has many nested tables (as they aren't closed)
[16:50:16] oh good, commit 8a9c9224be6 includes the same "fixes" to the Python parser
[16:52:05] Production is showing _tokenizer.cpython-35m-x86_64-linux-gnu.so, I assume that means the wheel includes correctly compiled C.
[16:54:08] The relevant point is probably that the fallback python tokenizer behaves differently when receiving a signal, so that invalidates some of our sandbox tests.
[16:58:39] Back!
[16:59:10] awight, I think my local Ubuntu install was using the python tokenizer.
[16:59:44] halfak: +1 so that would account for why signals behaved correctly
[16:59:48] Right.
[17:00:31] Though, I'm really confused about why we can't interrupt the C-code with signals. They should be independent of the process itself. Maybe the C-code needs to handle signals in order to *be* interruptable.
[17:00:41] rats. Platonides found a pretty compelling trail of breadcrumbs, luckily.
[17:01:15] Yeah arcane stuff happens around signals. I'm happy to investigate that aspect of the mystery, if it's helpful.
[17:02:15] maybe it's malloc() the one that scrubs it :/
[17:02:32] happy to have been helpful
[17:02:39] Very helpful ^_^
[17:09:17] "A long-running calculation implemented purely in C (such as regular expression matching on a large body of text) may run uninterrupted for an arbitrary amount of time, regardless of any signals received. The Python signal handlers will be called when the calculation finishes."
[17:09:20] https://docs.python.org/3/library/signal.html
[17:09:22] Well shit.
[17:09:32] Why ever use signals then?
[17:10:33] ORES, Scoring-platform-team (Current), Documentation: Draft of ORES threshold optimization documentation - https://phabricator.wikimedia.org/T198232 (Halfak) Open>Resolved
[17:10:52] https://stackoverflow.com/questions/14271697/ctrlc-doesnt-interrupt-call-to-shared-library-using-ctypes-in-python
[17:11:11] We just need to implement signal handling in the C module, it seems.
[17:11:36] I like this barbaric approach, too: https://stackoverflow.com/a/45570686
[17:11:51] can you get the timeout library to send SIGTERM?
[17:12:04] I think that's exactly what we're doing.
[17:12:24] not SIGINT? hmm
[17:12:36] Oh. Not sure.
[17:13:16] SIGALRM
[17:13:22] https://github.com/glenfant/stopit/blob/master/src/stopit/signalstop.py#L33
[17:15:24] We could subclass and override which signal is used...
[17:16:53] Would SIGTERM just kill the whole process?
[17:17:54] no, apparently it's handleable as well
[17:18:11] Donno yet why it's rumored to have more of an effect on the non-GIL call
[17:18:21] reading cpython source...
[17:19:23] ORES, Scoring-platform-team (Current): ORES workers using dramatically higher CPU, increasing linearly with time - https://phabricator.wikimedia.org/T206654 (Halfak) OK we have a good working demo script and an issue filed against mwparserfromhell. See https://github.com/earwig/mwparserfromhell/issues/...
[17:19:29] OK need a break.
Getting lunch
[17:29:07] It's really too bad that Guido thought the world needed a new, "modern" language which wasn't thread-safe.
[17:29:22] threading too hard? Just don't support it :-/
[17:45:05] I wonder what would happen to malloc'd memory if we signal out of a C function.
[18:23:37] awight, new proposal. Let's use the pure-python implementation of mwparserfromhell
[18:23:47] * halfak checks out performance.
[18:24:59] devious
[18:26:26] "--without-extension" seems like what we want
[18:29:33] Confirmed that timeouts work still.
[19:48:18] (PS6) Awight: Split user schema into ID or IP; validate [extensions/JADE] - https://gerrit.wikimedia.org/r/461502 (https://phabricator.wikimedia.org/T206573)
[19:48:20] (PS8) Awight: Render Judgment pages as wikitext [extensions/JADE] - https://gerrit.wikimedia.org/r/464737 (https://phabricator.wikimedia.org/T206346)
[19:48:22] (PS8) Awight: Tests to demonstrate SpamBlacklist integration [extensions/JADE] - https://gerrit.wikimedia.org/r/464727 (https://phabricator.wikimedia.org/T206255)
[20:01:04] (CR) Awight: Render Judgment pages as wikitext (4 comments) [extensions/JADE] - https://gerrit.wikimedia.org/r/464737 (https://phabricator.wikimedia.org/T206346) (owner: Awight)
[20:12:55] ORES, Scoring-platform-team (Current): ORES workers using dramatically higher CPU, increasing linearly with time - https://phabricator.wikimedia.org/T206654 (Halfak) OK so my proposal is that we deploy the python-only build of mwparserfromhell so that we can timeout effectively. I've run a basic perform...
[20:13:07] And with ^, I'm changing locations. Back online in about 30 minutes.
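The "subclass and override which signal is used" idea floated earlier would look something like this stdlib-only sketch. `SignalTimeout` here is a minimal stand-in, not stopit's actual code; since `alarm()` only works for SIGALRM, a timer thread stands in for it when another signal (e.g. SIGTERM) is chosen. Note it has the same fundamental limitation discussed above: it cannot interrupt a C call that never returns to the eval loop.

```python
import os
import signal
import threading
import time

class Timeout(Exception):
    pass

class SignalTimeout:
    """Minimal stand-in for stopit.SignalTimeout with a configurable
    signal (hypothetical; main thread only, like the original)."""

    def __init__(self, seconds, signum=signal.SIGALRM):
        self.seconds, self.signum = seconds, signum
        self._timer = None

    def _raise(self, signum, frame):
        raise Timeout()

    def __enter__(self):
        self._old = signal.signal(self.signum, self._raise)
        if self.signum == signal.SIGALRM:
            signal.alarm(self.seconds)
        else:
            # No alarm() for other signals: send one from a timer thread.
            self._timer = threading.Timer(
                self.seconds, os.kill, (os.getpid(), self.signum))
            self._timer.start()
        return self

    def __exit__(self, exc_type, exc, tb):
        if self._timer is not None:
            self._timer.cancel()
        else:
            signal.alarm(0)
        signal.signal(self.signum, self._old)
        return exc_type is Timeout   # swallow only our own timeout

start = time.monotonic()
with SignalTimeout(1, signum=signal.SIGTERM):
    time.sleep(10)                   # interrupted after ~1 second
elapsed = time.monotonic() - start
print("interrupted after %.1fs" % elapsed)
```

Handling SIGTERM like this means an external `kill` would be swallowed while the block is active, which is one reason stopit sticks to SIGALRM.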