[03:04:31] Anybody here?
[19:58:28] TimStarling: Hm. that includes our Shell.php in core.
[19:58:37] It only has an alternate implementation for Windows
[19:59:03] e.g. Shell::escape
[19:59:05] yeah I know, I'm saying that's fixable
[19:59:12] right
[19:59:25] I said it when that bug was current but for some reason the solution was not liked
[19:59:55] all you really need is $arg = str_replace("'", "'\''", $arg);
[20:00:04] So breaking escapeshellarg() is the only thing blocking us adopting LANG/LC_ALL=C universally or smth like that?
[20:00:43] And we can't require C.UTF-8 to be present cross-platform? Don't know how long a tail has it missing but that could be a consideration as well.
[20:00:54] that's the only thing that stops us from using C internally, the other issue is what environment to use for external shell commands since they have their own bugs with LANG=C
[20:02:36] Right, we don't distinguish that right now. We use the same for both (via wgShellLocale), not overridden by Shell.php et al.
[20:03:34] Although I suppose if we can't require C.UTF-8 then distinguishing between what we set for MW proc vs sub proc presumably won't help. The one for MW proc itself can work with C but the difficult one remaining then is sub procs.
[20:05:24] * Krinkle reads https://sourceware.org/glibc/wiki/Proposals/C.UTF-8
[20:06:42] Ah, this is much newer than I thought
[20:06:57] yeah, `locale -a` on macOS for me doesn't contain C.UTF-8 for example
[20:07:44] although it has `en_US.UTF-8`. I don't know if there are distros we need/want to support that would exclude that intentionally or something like that? If not, that might be good enough as a fallback.
[20:10:00] If we're not sure, I suppose we could do a Pingback bucket for this to gather some data in the wild over a period of time.
[20:13:14] 2011 in Debian, 2015 in Redhat
[20:16:40] just reading those bugs to figure out whether it is default and/or optional
[20:18:37] From the macOS-related thread linked from sourceware, I gather that on macOS "C" behaves as "C.UTF-8" despite not being called that
[20:19:12] LANG=C LC_ALL=C LC_COLLATE=C php ~/Documents/Temp/tmp.php
[20:19:13] string(6) "'dög'"
[20:20:16] confirmed also via /usr/bin/locale that it doesn't normalise to something different. It seems to identify fully as "C".
[20:27:00] whatever is running behind 3v4l.org doesn't have C.UTF-8 or C-like-UTF-8 it seems https://3v4l.org/u3D8X
[20:27:36] I was curious about the mac case so I looked at the implementation in PHP
[20:28:31] escapeshellarg() calls mbrlen() which gives you a locale-sensitive number of bytes in the next character in a string
[20:29:09] with LANG=C in linux it appears that mbrlen() on a non-ASCII character gives an error, "invalid multibyte sequence"
[20:29:20] which causes PHP to skip it
[20:30:02] so if OSX's mbrlen() just always returns 1, like an 8-bit clean locale, then it would pass through UTF-8
[20:30:37] * Krinkle has looked at more C code this months than the past 5 years prior
[20:30:41] month*
[20:31:15] right, I follow you there. I'm trying to get a hello world .c file to run now to test that in isolation.
[20:34:22] I'm impressed my two lines of code can produce such a long wall of errors
[20:34:29] lol
[20:36:40] for internal usage the correct fallback sequence is probably C.UTF-8 -> C, with escapeshellarg() usage replaced with our own thing
[20:37:06] for external commands, I guess C.UTF-8 -> en_US.UTF-8 -> C
[20:38:24] OK. I copied a C++ program and was using gcc instead of g++. Fair enough.
[20:39:28] if we're too lazy to replace escapeshellarg(), then I guess C.UTF-8 should be required except if we're on OSX
[20:40:19] OSX could be detected by just doing escapeshellarg('ⓒ') === "'ⓒ'", i.e. a feature test instead of an OS test
[20:41:21] Right, yeah, that is assuming C.UTF-8 has proliferated enough for our needs. Requiring PHP 7.2+ like we do and it only applying to the next release, while obvious, does narrow it down quite a bit.
[20:44:06] I've been working on a CPP project recently and one thing I did fall in love with immediately is the compiler and its super helpful warnings and errors (not joking). I don't know if real gcc is better or worse than the clang alias macOS ships, but it's spot on every time and easy to use.
[20:44:43] clang claims to have the best errors, gcc's are not bad though
[20:45:06] better than they used to be
[20:45:10] Like, it never tells me something generic like "Syntax error, was expecting random thing you most definitely didn't want to do" as PHP or JS would.
[20:45:40] but instead it tells me "you need a pointer here, use * to make it so" or "missing semicolon"
[20:46:25] this virtual offsite idea is kind of working, although I'm not sure what to do about meals when I've been up for 5 hours and it's only 7:45am
[20:46:53] I think I'll get something before techcom but not sure what
[20:51:34] At that sort of extent, unless you're gonna have like breakfast with the kids, have whatever takes your fancy
[20:52:46] well, I'm not sure I did this right, but; https://gist.github.com/Krinkle/6c2fc025d0bdba08143329264f1f4034
[20:52:58] Yeah, on LANG=C/LC_ALL=C, ⓒ produces len 1
[20:59:23] I think your mbstate_t needs to be zeroed out, that syntax would give uninitialized stack garbage in s
[21:03:07] e.g. mbstate_t s = {0};
[21:06:13] the manual suggests using memset(), see https://www.gnu.org/software/libc/manual/html_node/Converting-a-Character.html#index-mbrlen
[23:56:07] TimStarling: hm. no protection/warning against that huh? ok. (done) still seems to behave the same. Although trying out the counting approach (so that Hello yields 5 instead of 1) didn't work for me, got -1 instead.
[23:56:31] TimStarling: btw, not sure if you can squeeze this in, but could really use a hand here at least to confirm or rule out my existing theory - https://phabricator.wikimedia.org/T239724#5729709