[12:59:19] PROBLEM - English Wikipedia Main page on beta-cluster is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal Server Error - string 'Wikipedia' not found on 'https://en.wikipedia.beta.wmflabs.org:443/wiki/Main_Page?debug=true' - 1164 bytes in 0.072 second response time
[13:01:16] PROBLEM - English Wikipedia Mobile Main page on beta-cluster is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal Server Error - string 'Wikipedia' not found on 'https://en.m.wikipedia.beta.wmflabs.org:443/wiki/Main_Page?debug=true' - 1164 bytes in 0.072 second response time
[13:01:18] PROBLEM - App Server Main HTTP Response on deployment-mediawiki04 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal Server Error - string 'Wikipedia' not found on 'http://en.wikipedia.beta.wmflabs.org:80/wiki/Main_Page?debug=true' - 728 bytes in 0.075 second response time
[13:04:18] Project selenium-Math » chrome,beta,Linux,contintLabsSlave && UbuntuTrusty build #215: FAILURE in 17 sec: https://integration.wikimedia.org/ci/job/selenium-Math/BROWSER=chrome,MEDIAWIKI_ENVIRONMENT=beta,PLATFORM=Linux,label=contintLabsSlave%20&&%20UbuntuTrusty/215/
[13:04:19] Project selenium-Math » firefox,beta,Linux,contintLabsSlave && UbuntuTrusty build #215: FAILURE in 17 sec: https://integration.wikimedia.org/ci/job/selenium-Math/BROWSER=firefox,MEDIAWIKI_ENVIRONMENT=beta,PLATFORM=Linux,label=contintLabsSlave%20&&%20UbuntuTrusty/215/
[13:20:48] (PS1) Paladox: Update test-requirements.txt to use jessie and not precise [integration/config] - https://gerrit.wikimedia.org/r/322500
[13:21:23] (CR) jenkins-bot: [V: -1] Update test-requirements.txt to use jessie and not precise [integration/config] - https://gerrit.wikimedia.org/r/322500 (owner: Paladox)
[13:23:57] (PS2) Paladox: Update test-requirements.txt to use jessie and not precise [integration/config] - https://gerrit.wikimedia.org/r/322500
[13:45:27] Project selenium-VisualEditor » firefox,beta,Linux,contintLabsSlave && UbuntuTrusty build #219: FAILURE in 1 min 27 sec: https://integration.wikimedia.org/ci/job/selenium-VisualEditor/BROWSER=firefox,MEDIAWIKI_ENVIRONMENT=beta,PLATFORM=Linux,label=contintLabsSlave%20&&%20UbuntuTrusty/219/
[14:32:23] Project selenium-WikiLove » firefox,beta,Linux,contintLabsSlave && UbuntuTrusty build #215: FAILURE in 23 sec: https://integration.wikimedia.org/ci/job/selenium-WikiLove/BROWSER=firefox,MEDIAWIKI_ENVIRONMENT=beta,PLATFORM=Linux,label=contintLabsSlave%20&&%20UbuntuTrusty/215/
[15:38:06] Project selenium-MobileFrontend » firefox,beta,Linux,contintLabsSlave && UbuntuTrusty build #234: FAILURE in 16 min: https://integration.wikimedia.org/ci/job/selenium-MobileFrontend/BROWSER=firefox,MEDIAWIKI_ENVIRONMENT=beta,PLATFORM=Linux,label=contintLabsSlave%20&&%20UbuntuTrusty/234/
[16:01:29] Project selenium-CentralNotice » chrome,beta,Windows 7,contintLabsSlave && UbuntuTrusty build #218: FAILURE in 28 sec: https://integration.wikimedia.org/ci/job/selenium-CentralNotice/BROWSER=chrome,MEDIAWIKI_ENVIRONMENT=beta,PLATFORM=Windows%207,label=contintLabsSlave%20&&%20UbuntuTrusty/218/
[16:01:30] Project selenium-CentralNotice » chrome,beta,Linux,contintLabsSlave && UbuntuTrusty build #218: FAILURE in 29 sec: https://integration.wikimedia.org/ci/job/selenium-CentralNotice/BROWSER=chrome,MEDIAWIKI_ENVIRONMENT=beta,PLATFORM=Linux,label=contintLabsSlave%20&&%20UbuntuTrusty/218/
[16:01:35] Project selenium-CentralNotice » firefox,beta,Linux,contintLabsSlave && UbuntuTrusty build #218: FAILURE in 34 sec: https://integration.wikimedia.org/ci/job/selenium-CentralNotice/BROWSER=firefox,MEDIAWIKI_ENVIRONMENT=beta,PLATFORM=Linux,label=contintLabsSlave%20&&%20UbuntuTrusty/218/
[16:01:42] Project selenium-CentralNotice » firefox,beta,Windows 7,contintLabsSlave && UbuntuTrusty build #218: FAILURE in 42 sec: https://integration.wikimedia.org/ci/job/selenium-CentralNotice/BROWSER=firefox,MEDIAWIKI_ENVIRONMENT=beta,PLATFORM=Windows%207,label=contintLabsSlave%20&&%20UbuntuTrusty/218/
[16:01:55] Project selenium-CentralNotice » chrome,beta,OS X 10.9,contintLabsSlave && UbuntuTrusty build #218: FAILURE in 54 sec: https://integration.wikimedia.org/ci/job/selenium-CentralNotice/BROWSER=chrome,MEDIAWIKI_ENVIRONMENT=beta,PLATFORM=OS%20X%2010.9,label=contintLabsSlave%20&&%20UbuntuTrusty/218/
[16:02:14] Project selenium-CentralNotice » firefox,beta,OS X 10.9,contintLabsSlave && UbuntuTrusty build #218: FAILURE in 1 min 14 sec: https://integration.wikimedia.org/ci/job/selenium-CentralNotice/BROWSER=firefox,MEDIAWIKI_ENVIRONMENT=beta,PLATFORM=OS%20X%2010.9,label=contintLabsSlave%20&&%20UbuntuTrusty/218/
[16:28:24] PROBLEM - Puppet run on integration-slave-docker-1000 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [0.0]
[17:35:51] A user has reported in -labs that the beta cluster is down
[17:38:54] yup, for me too
[17:39:09] according to the errors earlier, it looks like it has been down a bit longer
[17:39:11] I will create a task
[17:39:36] Yep
[17:39:41] PROBLEM - English Wikipedia Main page on beta-cluster is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal Server Error - string 'Wikipedia' not found on 'https://en.wikipedia.beta.wmflabs.org:443/wiki/Main_Page?debug=true' - 1164 bytes in 0.072 second response time
[17:39:45] Sagan ^^
[17:40:08] Beta-Cluster-Infrastructure, Release-Engineering-Team: The Beta-Cluster is down - https://phabricator.wikimedia.org/T151159#2809241 (Luke081515)
[17:40:20] Beta-Cluster-Infrastructure, Release-Engineering-Team: The Beta-Cluster is down - https://phabricator.wikimedia.org/T151159#2809253 (Luke081515) p:Triage>Unbreak!
[17:41:24] Beta-Cluster-Infrastructure, Release-Engineering-Team: The Beta-Cluster is down - https://phabricator.wikimedia.org/T151159#2809255 (Paladox) shinken-wm reported problem first happening at 12:59 pm utc + 0 time. PROBLEM - English Wikipedia Main page on beta-cluster is CRITICAL: HTTP CRITICA...
[17:41:37] Beta-Cluster-Infrastructure, Release-Engineering-Team: The Beta-Cluster is down - https://phabricator.wikimedia.org/T151159#2809256 (Luke081515) deployment & de show a different message, zh shows the same as at en.
[17:41:53] two wikibugs?
[17:42:08] ah, a good idea to prevent flood
[17:42:38] Sagan could you report two wikibugs in -labs please?
[17:42:44] they need restarting
[17:42:59] Sagan i now get http 500 error
[17:43:12] do they? actually there is not a problem, is there? They don't send every message twice
[17:43:35] Sagan if there are two wikibugs it is likely it will say the message twice
[17:44:07] it looks like they have access to the same queue
[17:44:18] so they can send more messages at once, but won't send a message twice
[17:44:24] Oh
[17:44:34] so I'm not sure if that effect is expected, but it's not a negative effect
[17:45:20] Oh
[18:24:22] PROBLEM - Host integration-puppetmaster is DOWN: CRITICAL - Host Unreachable (10.68.16.42)
[20:38:03] The last Puppet run was at Tue Oct 18 02:21:12 UTC 2016 (48616 minutes ago).
[20:38:21] no logs on fluorine
[20:39:40] No logs on fluorine?
[20:39:45] helpful
[20:43:33] Project selenium-Echo » chrome,beta,Linux,contintLabsSlave && UbuntuTrusty build #217: FAILURE in 2 min 32 sec: https://integration.wikimedia.org/ci/job/selenium-Echo/BROWSER=chrome,MEDIAWIKI_ENVIRONMENT=beta,PLATFORM=Linux,label=contintLabsSlave%20&&%20UbuntuTrusty/217/
[20:43:39] Project selenium-Echo » firefox,beta,Linux,contintLabsSlave && UbuntuTrusty build #217: FAILURE in 2 min 38 sec: https://integration.wikimedia.org/ci/job/selenium-Echo/BROWSER=firefox,MEDIAWIKI_ENVIRONMENT=beta,PLATFORM=Linux,label=contintLabsSlave%20&&%20UbuntuTrusty/217/
[20:48:48] PROBLEM - Puppet run on deployment-fluorine02 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [0.0]
[20:49:29] Error: Failed to apply catalog: Could not find dependent Service[ganglia-monitor] for File[/usr/lib/ganglia/python_modules/udp_stats.py] at /etc/puppet/modules/ganglia/manifests/plugin/python.pp:28
[20:57:31] Reedy https://github.com/wikimedia/operations-puppet/blob/e959321aa620b77403cc9379db2e86080323c6e8/modules/ganglia/manifests/monitor/decommission.pp
[20:57:43] it should not be available in labs
[21:01:50] All of the error logs are empty on the apaches
[21:01:57] What is this shit, seriously?
[21:02:13] Reedy enable that debug thing for mw?
[21:02:26] It's enabled
[21:02:33] I think I filed a bug for this one before
[21:02:53] Oh
[21:03:12] Reedy is the bug the one where it wouldn't debug mediawiki updater?
[21:03:16] I thought it was fixed
[21:04:46] I don't think so
[21:05:53] Oh
[21:07:10] Reedy try http://stackoverflow.com/questions/17693391/500-internal-server-error-for-php-file-not-for-html ?
[21:07:17] ini_set('display_errors', 1);
[21:07:22] no
[21:07:28] the API is fine if you browse it
[21:07:46] oh
[21:07:57] load.php gives 503
[21:08:06] Reedy https://en.wikipedia.beta.wmflabs.org/api.php doesn't load for me
[21:08:08] i get 404
[21:08:15] w/api.php
[21:08:17] as it always is
[21:08:19] oh thanks
[21:08:42] Reedy maybe it is an extension?
[21:08:49] Why?
[21:09:09] Because api.php is working
[21:09:22] https://phabricator.wikimedia.org/T148957
[21:09:27] That makes no sense
[21:10:28] Reedy https://github.com/wikimedia/mediawiki/commit/eeb382e3c7addd13ee70a58542412628e884f8ec
[21:10:29] ?
[21:10:35] that shows a 500 status code
[21:10:42] That's not a new commit
[21:11:13] Nope, but it seems that the config has been broken a while
[21:12:30] Reedy but the $wgShowExceptionDetails config is set in that file
[21:12:52] It's set in DefaultSettings
[21:12:57] And it's true in eval.php
[21:13:13] Reedy https://github.com/wikimedia/mediawiki/blob/eeb382e3c7addd13ee70a58542412628e884f8ec/includes/exception/MWException.php#L106
[21:13:25] PROBLEM - Long lived cherry-picks on puppetmaster on deployment-puppetmaster02 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [0.0]
[21:13:30] yeah, and?
[21:13:36] that's not setting anything
[21:13:58] Oh
[21:22:18] Reedy it seems like it is related to the change done in https://github.com/wikimedia/mediawiki/commit/00bee029718f3215396e984d04b9450bc3872503
[21:29:54] Reedy should we make this https://phabricator.wikimedia.org/T148957 high priority? Since if you can't debug your wikis then how can someone fix something they can't find broken?
[21:30:07] well, if it logged to the error logs, it'd be alright
[21:30:14] but it doesn't do that either
[21:30:49] Reedy i get errors to my log
[21:30:56] That doesn't help
[21:30:59] There's none on beta
[21:31:02] Oh
[21:33:20] it doesn't even get to the place it uses that
[21:33:36] Oh
[21:44:11] paladox: Did you link those commits on that task?
[21:44:20] Nope
[21:44:36] Original exception: [WDIZMQpEE4AAAEoeUxYAAAAG] /wiki/?foobar DBQueryError from line 1054 of /srv/mediawiki/php-master/includes/libs/rdbms/database/Database.php: A database query error has occurred. Did you forget to run your application's database schema updater after upgrading?
[21:44:39] Query: SHOW SLAVE STATUS
[21:44:41] Function: DatabaseMysqlBase::getLagFromSlaveStatus
[21:44:43] Error: 1227 Access denied; you need (at least one of) the SUPER, REPLICATION CLIENT privilege(s) for this operation (10.68.18.35)
[21:44:59] Oh
[21:45:11] Where did you get that?
[21:45:14] Krenair: ^ Haven't you touched this recently?
[21:45:15] By hacking it
[21:45:53] Beta-Cluster-Infrastructure, Release-Engineering-Team: The Beta-Cluster is down - https://phabricator.wikimedia.org/T151159#2809241 (Reedy) Various hacking to get this done ``` MediaWiki internal error. Original exception: [WDIZMQpEE4AAAEoeUxYAAAAG] /wiki/?foobar DBQueryError from line 1054 of /srv/med...
[21:46:27] Reedy what hacks did you do to get ^^ working
[21:46:35] Commenting out some stuff
[21:46:38] Oh
[21:46:42] to force the backtraces to be written out
[21:46:48] Oh :)
[21:47:36] Reedy would you be able to do a paste please? I would be interested to see the stuff you crossed out that got it working
[21:47:41] we could link it to the task
[21:47:45] I already did
[21:47:48] oh
[21:48:25] I meant the code changes
[21:48:30] I did change some grants around
[21:49:01] I think i hit the same errors as you described but i thought it was fixed as my wiki started working after updating it again
[21:49:06] to a version i forgot
[21:49:07] I thought the ones I did were close to production's
[21:51:14] Reedy, so it seems wikiadmin can do 'show slave status' but not wikiuser
[21:52:22] so
[21:52:24] prod has this
[21:52:27] GRANT PROCESS, REPLICATION CLIENT ON *.* TO 'wikiuser'@'10.64.%' IDENTIFIED BY PASSWORD
[21:52:34] Beta-Cluster-Infrastructure, Release-Engineering-Team: deployment-fluorine02 puppet broken - https://phabricator.wikimedia.org/T151169#2809529 (Reedy)
[21:52:53] and we do not
[21:52:56] so I'll fix that
[21:53:11] thanks
[21:53:17] Reedy i get this error
[21:53:18] [20-Nov-2016 21:28:00 Europe/London] PHP Fatal error: Call to undefined method OutputPage::setContext() in /home/randomwi/public_html/en/includes/OutputPage.php on line 312
[21:53:28] That's nice
[21:53:31] File a task?
[21:53:35] Ok
[21:53:50] Reedy, try now
[21:54:04] yup, that fixes it, ta
[21:54:06] Reedy strange thing though, it hasn't said that error for 40+ mins now
[21:54:21] RECOVERY - English Wikipedia Main page on beta-cluster is OK: HTTP OK: HTTP/1.1 200 OK - 44599 bytes in 1.523 second response time
[21:56:17] RECOVERY - English Wikipedia Mobile Main page on beta-cluster is OK: HTTP OK: HTTP/1.1 200 OK - 17430 bytes in 2.277 second response time
[21:56:19] RECOVERY - App Server Main HTTP Response on deployment-mediawiki04 is OK: HTTP OK: HTTP/1.1 200 OK - 44165 bytes in 1.342 second response time
[21:58:04] I just ran `GRANT PROCESS, REPLICATION CLIENT ON *.* TO 'wikiuser'@'10.%';` as mysql root on deployment-db03
[21:58:37] Project selenium-Core » firefox,beta,Linux,contintLabsSlave && UbuntuTrusty build #225: FAILURE in 6 min 36 sec: https://integration.wikimedia.org/ci/job/selenium-Core/BROWSER=firefox,MEDIAWIKI_ENVIRONMENT=beta,PLATFORM=Linux,label=contintLabsSlave%20&&%20UbuntuTrusty/225/
[21:59:05] Krenair: Mind commenting it on the task? https://phabricator.wikimedia.org/T151159
[22:01:25] Beta-Cluster-Infrastructure: deployment-fluorine02 does not have logs - https://phabricator.wikimedia.org/T146723#2669408 (Reedy) Puppet is rather busted on this server too... T151169
[22:01:30] ok
[22:01:53] Beta-Cluster-Infrastructure, Release-Engineering-Team: The Beta-Cluster is down - https://phabricator.wikimedia.org/T151159#2809561 (Krenair) Open>Resolved a:Krenair This was probably due to my messing with the grants etc. - beta used to use wikiadmin for *everything*, after fixing up the gra...
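Editor's note: the outage came down to `wikiuser` lacking a global privilege that `SHOW SLAVE STATUS` requires (error 1227 names SUPER or REPLICATION CLIENT). A minimal Python sketch of how one might check a list of `SHOW GRANTS` output lines for that privilege follows. The function name, the regex, and the sample grant lines are illustrative assumptions, not anything run during the incident, and the parser deliberately ignores column-level grants.

```python
import re

# Privileges that satisfy SHOW SLAVE STATUS, per the error 1227 text in the log.
REQUIRED_ANY = {"SUPER", "REPLICATION CLIENT"}

# Matches lines shaped like: GRANT <privs> ON <scope> TO <account> ...
GRANT_RE = re.compile(
    r"^GRANT\s+(?P<privs>.+?)\s+ON\s+(?P<scope>\S+)\s+TO\s", re.IGNORECASE
)


def can_show_slave_status(grant_lines):
    """Return True if any global (*.*) grant carries a sufficient privilege.

    Hypothetical helper: takes the text rows of SHOW GRANTS as plain strings.
    """
    for line in grant_lines:
        match = GRANT_RE.match(line.strip())
        if not match or match.group("scope") != "*.*":
            continue  # only global grants can carry these privileges
        privs = {p.strip().upper() for p in match.group("privs").split(",")}
        if privs & REQUIRED_ANY or "ALL PRIVILEGES" in privs:
            return True
    return False


# Illustrative grant lines (made up for this example).
before_fix = ["GRANT SELECT, INSERT ON `enwiki`.* TO 'wikiuser'@'10.%'"]
after_fix = before_fix + [
    "GRANT PROCESS, REPLICATION CLIENT ON *.* TO 'wikiuser'@'10.%'"
]
```

With these sample inputs, `can_show_slave_status(before_fix)` is false and `can_show_slave_status(after_fix)` is true, which mirrors the state of beta before and after the grant was added.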
[22:04:20] Beta-Cluster-Infrastructure, Puppet, Tracking: Deployment-prep hosts with puppet errors (tracking) - https://phabricator.wikimedia.org/T132259#2809567 (Krenair)
[22:04:21] Beta-Cluster-Infrastructure, Release-Engineering-Team: deployment-fluorine02 puppet broken - https://phabricator.wikimedia.org/T151169#2809566 (Krenair)
[22:04:55] Beta-Cluster-Infrastructure, Release-Engineering-Team: deployment-fluorine02 puppet broken - https://phabricator.wikimedia.org/T151169#2809529 (Krenair) see also T134808
[22:06:37] Beta-Cluster-Infrastructure, Goal, Puppet: Remove all ::beta roles in puppet - https://phabricator.wikimedia.org/T86644#973295 (Krenair) I've been doing some of this recently: https://gerrit.wikimedia.org/r/#/c/322403/ - role::beta::uploadservice https://gerrit.wikimedia.org/r/#/c/322404/ - role::bet...
[22:06:48] Reedy i noticed that in installHandler
[22:06:56] it's calling non-existing functions
[22:06:56] like
[22:06:57] set_exception_handler( 'MWExceptionHandler::handleException' );
[22:06:57] set_error_handler( 'MWExceptionHandler::handleError' );
[22:06:57] // Reserve 16k of memory so we can report OOM fatals
[22:06:58] self::$reservedMemory = str_repeat( ' ', 16384 );
[22:06:58] register_shutdown_function( 'MWExceptionHandler::handleFatalError' );
[22:07:16] Never mind
[22:07:22] I was looking at it wrong
[22:07:44] Beta-Cluster-Infrastructure, Release-Engineering-Team: hhvm error logs on beta showing no errors when cluster down - https://phabricator.wikimedia.org/T151171#2809575 (Reedy)
[22:11:14] Beta-Cluster-Infrastructure, Staging, Wikimedia-Incident: Rework beta apache config - https://phabricator.wikimedia.org/T1256#2809595 (Krenair) I've uploaded some patches recently to bring beta closer to prod in this area: * https://gerrit.wikimedia.org/r/#/c/322413/ - mediawiki: move redirects site...
[22:16:41] Beta-Cluster-Infrastructure, Staging, Wikimedia-Incident: Rework beta apache config - https://phabricator.wikimedia.org/T1256#2809598 (Krenair) Or we could go the other way: Turn production's config files into templates, make them work with beta then get rid of beta's ones. (except for the one or two...
[22:19:34] Reedy this https://github.com/wikimedia/mediawiki/commit/00bee029718f3215396e984d04b9450bc3872503 definitely broke it
[22:19:38] by looking at it
[22:19:49] it changes all the calls from a config to calling a function
[22:20:17] what?
[22:21:40] Reedy breaking the config that helps debug
[22:21:55] Why is calling a function bad?
[22:22:01] Why don't you try testing it?
[22:22:13] I did, but i have nothing broken to test
[22:22:17] Do something that should be handled, check it isn't, then git revert that patch
[22:22:26] Oh
[22:22:38] I have found this function showBackTrace
[22:22:45] I don't think it was lost
[22:22:50] I think you can just throw an MWException in a random place
[22:23:01] Which does, return ($wgShowExceptionDetails &&( !( $e instanceof DBError ) || $wgShowDBErrorBacktrace ) );
[23:05:17] Beta-Cluster-Infrastructure, Staging, Wikimedia-Incident: Rework beta apache config - https://phabricator.wikimedia.org/T1256#2809655 (Krenair) Actually I prefer the other way, uploading patches for that.
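Editor's note: the condition quoted at [22:23:01] may be easier to read as a small truth function. The Python sketch below mirrors only that boolean expression. The two exception classes are stand-ins defined here for illustration (the sketch assumes DBQueryError is a kind of DBError), and the parameter names paraphrase `$wgShowExceptionDetails` and `$wgShowDBErrorBacktrace`.

```python
class DBError(Exception):
    """Stand-in for MediaWiki's DBError; not the real class."""


class DBQueryError(DBError):
    """Stand-in; assumed here to be a DBError subclass."""


def show_backtrace(exc, show_exception_details, show_db_error_backtrace):
    """Mirror of the quoted expression:

    $wgShowExceptionDetails && ( !( $e instanceof DBError ) || $wgShowDBErrorBacktrace )
    """
    return bool(
        show_exception_details
        and (not isinstance(exc, DBError) or show_db_error_backtrace)
    )
```

Read this way, exception details alone are not enough for database errors: under the stated assumption, a `DBQueryError` yields a backtrace only when the DB-specific flag is also on, while a non-database exception needs only the first flag.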