[13:41:19] <_joe_> anyone around that could help me understand how to configure the mw exception handler to show our error pages as a result of catching a fatal?
[13:42:02] <_joe_> context is https://phabricator.wikimedia.org/T187147#5295715
[14:21:19] _joe_: The change from 'fatal' to 'exception' is due to php7 allowing most of these (formerly "fatal" errors) to be caught via the new Throwable concept.
[14:21:46] including Timeout and Nomethod
[14:21:48] <_joe_> Krinkle: yes, I got that
[14:22:19] We treat those two channels the same in all our processing, so not too worried there.
[14:22:25] We could probably merge them.
[14:22:34] We consider all uncaught exceptions as "fatals" anyway.
[14:22:45] <_joe_> you mean exception.log and fatal.log?
[14:22:54] Yeah, the distinction is not useful to developers in any way.
[14:23:08] <_joe_> ok, good
[14:23:10] they are all cases where an exception was uncaught, and HTTP 500 presented with an error page.
[14:23:28] <_joe_> so remains the problem of how to show a nicer page to the user :)
[14:23:53] a very very tiny fraction of them (OOM, and Segfault) are really "uncatchable/non-recoverable", but not sure it's worth having a separate channel over. Would certainly make dashboard queries easier etc.
[14:24:08] _joe_: Yeah, the semi-blank page, what did that look like?
[14:24:27] Afaik MediaWiki only uses that if there is a cascading failure where there was an exception when trying to render the error page.
[14:24:34] <_joe_>

[XRnhuQpAAC4AAHD2Y2UAAAAC] 2019-07-01 10:35:33: Fatal exception of type "WMFTimeoutException"

[14:24:49] <_joe_> that's all the content
[14:25:06] <_joe_> of I mean
[14:25:46] <_joe_> for segfaults there isn't much we can do it seems, php-fpm closes the socket connection abruptly
[14:26:05] Ah, that's for PHP 7.
[14:26:07] Interesting.
[14:26:14] &action=nomethod
[14:26:20] renders a mostly blank page indeed
[14:28:00] _joe_: That might be due to fatal-error.php not initiating the skin. So it might be fine.
[14:28:23] That is, if it really came from mediawiki, it would render a page with sidebar and this stuff in the content area under "Internal error", like a DB error.
[14:28:38] The reason it gets this page, is because it is a new area we haven't previously had before...
[14:28:57] The "error" page for Wikimedia comes from either Varnish, or from php7-wmerror, or from hhvm/fatal-error.php.
[14:29:10] action=nomethod results in uncatchable fatal on hhvm/php5, so gets the nice error page there.
[14:29:25] <_joe_> yes
[14:29:42] for PHP 7, it is caught internally by MW's error handler, so it prints its own page, which is without skin for /w/fatal-error.php because that mock entry point doesn't initialise the skin first, so this is expected more or less.
[14:29:53] <_joe_> ahhhh
[14:29:57] <_joe_> gotcha
[14:30:25] It would've happened previously on HHVM as well for when a fatal happens before the skin initialises.
[14:30:41] but that we can catch. I don't know of one like that though, off hand, but the code did exist before, anyway.
[14:31:00] Main thing left is to get the stack trace right :)
[14:31:45] _joe_: which server is it installed on?
[14:32:14] <_joe_> Krinkle: wmerrors is installed everywhere, but I restarted only the mwdebugs
[14:32:24] <_joe_> so, mwdebug* is where it now works
[14:32:28] #0 /srv/mediawiki/w/fatal-error.php(138): {closure}(integer)
[14:32:28] #1 /srv/mediawiki/w/fatal-error.php(80): CauseFatalError::doTimeout()
[14:32:28] #2 /srv/mediawiki/w/fatal-error.php(20): CauseFatalError::go()
[14:32:28] #3 {main}
[14:32:36] That trace looks good.
[14:32:37] for timeout
[14:32:39] <_joe_> yeah
[14:32:48] <_joe_> and nomethod too IIRC
[14:32:52] the first line is weird
[14:32:52] WMFTimeoutException from line 39 of /srv/mediawiki/wmf-config/set-time-limit.php: the execution time limit of 60 seconds was exceeded
[14:32:55] <_joe_> and so does oom
[14:32:56] but the trace is good
[14:33:09] unavoidable I suppose, no problem.
[14:33:09] <_joe_> well, that's expected, that's where the exception is set up
[14:33:20] yeah
[14:33:31] will make dashboarding harder and grouping less useful.
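Note on the grouping concern at [14:33:31]: since WMFTimeoutException is always reported from line 39 of set-time-limit.php, grouping on the exception's own file and line would lump every timeout together. Below is a minimal sketch, assuming a dashboard could instead group on frame #0 of the trace. The names `first_frame` and `grouping_key` are hypothetical and are not part of MediaWiki, wmerrors, or Logstash; the trace text is the one pasted at [14:32:28].

```python
import re
from typing import Optional

# Matches frames such as
# "#0 /srv/mediawiki/w/fatal-error.php(138): {closure}(integer)".
# The final "#3 {main}" line has no file/line part and is skipped.
FRAME_RE = re.compile(r"^#(?P<idx>\d+) (?P<file>\S+)\((?P<line>\d+)\): (?P<call>.+)$")

EXAMPLE_TRACE = """\
#0 /srv/mediawiki/w/fatal-error.php(138): {closure}(integer)
#1 /srv/mediawiki/w/fatal-error.php(80): CauseFatalError::doTimeout()
#2 /srv/mediawiki/w/fatal-error.php(20): CauseFatalError::go()
#3 {main}"""


def first_frame(trace: str) -> Optional[dict]:
    """Return file, line and call of frame #0, or None if it cannot be parsed."""
    for raw in trace.splitlines():
        match = FRAME_RE.match(raw.strip())
        if match and match.group("idx") == "0":
            return {
                "file": match.group("file"),
                "line": int(match.group("line")),
                "call": match.group("call"),
            }
    return None


def grouping_key(exc_file: str, exc_line: int, trace: str) -> str:
    """Prefer frame #0 of the trace; fall back to where the exception was created."""
    frame = first_frame(trace)
    if frame is None:
        return f"{exc_file}:{exc_line}"
    return f"{frame['file']}:{frame['line']}"


if __name__ == "__main__":
    print(grouping_key("/srv/mediawiki/wmf-config/set-time-limit.php", 39, EXAMPLE_TRACE))
```

With this key, the timeout above would be attributed to the code that was running rather than to the file where the exception is constructed.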
[14:33:39] The HHVM timeout attributes directly to the #0 of the original trace
[14:33:42] <_joe_> so wmerrors also fixes the true fatals like oom, excluding segfaults
[14:34:06] action=oom, php7, mwdebug1002, now gives me the wmerror page indeed.
[14:34:36] ah, but logstash gets no useful trace
[14:34:51] <_joe_> uhm I must have been mistaken before
[14:35:05] <_joe_> I suggest you check fatal.log first
[14:35:18] <_joe_> it's less probable there is lag there
[14:35:26] https://logstash.wikimedia.org/app/kibana#/dashboard/mwdebug1002
[14:35:34] has the error handler trace instead of the real code
[14:35:36] for action=pom
[14:35:38] oom*
[14:36:37] <_joe_> right
[14:36:44] no trace on mwlog1001 either,
[14:36:48] checked just in case it was different
[14:37:18] <_joe_> no you're right, I must have seen there was a trace and overlooked the content
[14:37:34] 'nomethod' and 'timeout' are great.
[14:37:47] segfault is still meh, indeed. no exception ID or log entry, full failure of the php-fpm layer
[14:38:01] Something I guess PHP7 just isn't able to catch internally
[14:38:05] so nothing we can do there
[14:38:17] <_joe_> https://phabricator.wikimedia.org/T223336#5296591
[14:38:43] cool
[14:38:46] furl is alive
[14:38:50] <_joe_> Krinkle: modified version of furl coming your way :P
[14:39:02] dejavu
[14:39:22] do we get a php-fpm/syslog entry at least for the segfault having occurred at all?
[14:39:33] <_joe_> I think so
[14:39:35] <_joe_> lemme see
[14:40:14] Would be good if we can find a not-too-noisy logstash query for that so that we can still include them in fatal-monitor and in Scap canary Logstash checker
[14:40:15] <_joe_> [WARNING] [pool www] child 28641 exited on signal 11 (SIGSEGV) after 14209.822144 seconds from start
[14:40:20] <_joe_> sigh
[14:40:23] to halt deployments if they were to become more common
[14:42:46] type: apache2 channel: proxy_fcgi
[14:42:51] mm not much
[14:45:31] looking at that cluster-wide, quite a few of these in the logs:
[14:45:33] > AH01079: failed to make connection to backend:
[14:47:16] https://logstash.wikimedia.org/goto/0492a4a6caf9bbb74fc2561f42b830aa
[14:49:59] <_joe_> that is for both hhvm and php7 though
[14:50:25] <_joe_> also almost all from codfw
[14:50:42] <_joe_> heh
[14:52:28] <_joe_> Krinkle: https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/520025
[21:57:29] TimStarling: any thoughts about making https://gerrit.wikimedia.org/r/c/mediawiki/core/+/510637 use a feature flag instead of being WIP?
[21:57:49] it would be nice if the API would be testable out of the box on vagrant
[22:58:05] we can have a feature flag, but I think the flag should be removed rather than making it on by default, when the time for that comes
[22:58:34] the action API and the AJAX API had feature flags, and non-WMF users would break client-side features by turning them off
[22:59:36] I have a hilariously huge change almost ready to support isReadMode() and isWriteMode()
[23:02:28] agreed, no point in having a flag once the API is properly released
[23:24:23] Yeah, we're /still/ removing the last traces of "disable the API" feature flags.
[23:28:57] Is the entry point expected to be useable/stable/similar in the 1.34 cycle?
[23:30:21] I've got a meeting now
[23:33:26] Reedy: Which one?
[23:52:44] Reedy: possibly stable enough to use solely as a facility for the new parsoid
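
Note on the segfault log line pasted at [14:40:15]: that php-fpm warning appears to be the only trace a segfault leaves, so a monitor would have to match on it. Below is a minimal sketch of a low-noise filter for it. The pattern is derived only from the single line quoted in this log, so other php-fpm versions or configurations may format it differently. The names `parse_child_exit` and `is_segfault` are hypothetical and do not come from fatal-monitor, Scap, or Logstash.

```python
import re
from typing import Optional

# Pattern based solely on the one line quoted in the log:
# "[WARNING] [pool www] child 28641 exited on signal 11 (SIGSEGV)
#  after 14209.822144 seconds from start"
CHILD_EXIT_RE = re.compile(
    r"\[WARNING\] \[pool (?P<pool>\S+)\] child (?P<pid>\d+) exited on signal "
    r"(?P<signo>\d+) \((?P<signame>SIG[A-Z0-9]+)\) "
    r"after (?P<uptime>[\d.]+) seconds from start"
)


def parse_child_exit(line: str) -> Optional[dict]:
    """Extract pool, pid, signal and uptime from a child-exit warning, else None."""
    match = CHILD_EXIT_RE.search(line)
    if match is None:
        return None
    return {
        "pool": match.group("pool"),
        "pid": int(match.group("pid")),
        "signo": int(match.group("signo")),
        "signame": match.group("signame"),
        "uptime_seconds": float(match.group("uptime")),
    }


def is_segfault(line: str) -> bool:
    """True only for child exits caused by SIGSEGV, ignoring other signals."""
    event = parse_child_exit(line)
    return event is not None and event["signame"] == "SIGSEGV"


if __name__ == "__main__":
    sample = (
        "[WARNING] [pool www] child 28641 exited on signal 11 (SIGSEGV) "
        "after 14209.822144 seconds from start"
    )
    print(parse_child_exit(sample))
    print(is_segfault(sample))
```

Matching on the signal name rather than on any "exited" message keeps routine worker recycling out of the count, which is the not-too-noisy property asked for at [14:40:14].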