[00:52:02] Hi! Whom should I ask for a user to get additional permissions on the beta cluster? [01:06:32] 06Release-Engineering-Team (Deployment-Blockers), 05MW-1.30-release-notes (WMF-deploy-2017-05-09_(1.30.0-wmf.1)), 05Release: MW-1.30.0-wmf.1 deployment blockers - https://phabricator.wikimedia.org/T162954#3253896 (10mmodell) [01:06:42] 06Release-Engineering-Team (Deployment-Blockers), 05MW-1.30-release-notes (WMF-deploy-2017-05-09_(1.30.0-wmf.1)), 05Release: MW-1.30.0-wmf.1 deployment blockers - https://phabricator.wikimedia.org/T162954#3181337 (10mmodell) [01:32:53] 06Release-Engineering-Team (Deployment-Blockers), 05MW-1.30-release-notes (WMF-deploy-2017-05-09_(1.30.0-wmf.1)), 05Release: MW-1.30.0-wmf.1 deployment blockers - https://phabricator.wikimedia.org/T162954#3253975 (10mmodell) [06:05:12] 10Deployment-Systems, 10Scap (Scap3-MediaWiki-MVP), 06Operations, 13Patch-For-Review, 15User-Joe: Install conftool on deployment masters - https://phabricator.wikimedia.org/T163565#3254105 (10Joe) >>! In T163565#3214272, @mmodell wrote: > @joe: That all seems reasonable. I don't particularly want to dupl... [06:23:23] Yippee, build fixed! [06:23:24] Project selenium-Wikibase » chrome,test,Linux,BrowserTests build #357: 09FIXED in 1 hr 43 min: https://integration.wikimedia.org/ci/job/selenium-Wikibase/BROWSER=chrome,MEDIAWIKI_ENVIRONMENT=test,PLATFORM=Linux,label=BrowserTests/357/ [06:25:17] 10Scap (Scap3-MediaWiki-MVP), 10scap2, 06Operations: Depool proxies temporarily while scap is ongoing to avoid taxing those nodes - https://phabricator.wikimedia.org/T125629#3254119 (10Joe) [06:25:19] 10Deployment-Systems, 10Scap (Scap3-MediaWiki-MVP), 06Operations, 13Patch-For-Review, 15User-Joe: Install conftool on deployment masters - https://phabricator.wikimedia.org/T163565#3254117 (10Joe) 05Open>03Resolved a:03Joe [06:25:21] 10Scap (Scap3-MediaWiki-MVP), 03releng-201617-q4, 10scap2, 06Operations, and 2 others: Make scap able to depool/repool servers via the conftool API - https://phabricator.wikimedia.org/T104352#3254120 (10Joe) [06:34:46] PROBLEM - Puppet errors on deployment-conf03 is CRITICAL: CRITICAL: 44.44% of data above the critical threshold [0.0] [06:54:36] 10Deployment-Systems, 10Scap (Scap3-MediaWiki-MVP), 06Operations, 13Patch-For-Review, 15User-Joe: Install conftool on deployment masters - https://phabricator.wikimedia.org/T163565#3254140 (10mmodell) Thanks @joe! [07:04:01] 06Release-Engineering-Team (Deployment-Blockers), 05MW-1.30-release-notes (WMF-deploy-2017-05-09_(1.30.0-wmf.1)), 05Release: MW-1.30.0-wmf.1 deployment blockers - https://phabricator.wikimedia.org/T162954#3254169 (10mmodell) [07:09:45] RECOVERY - Puppet errors on deployment-conf03 is OK: OK: Less than 1.00% above the threshold [0.0] [08:50:40] hashar: o/ [08:51:02] hashar: do you think that we could deploy https://gerrit.wikimedia.org/r/#/c/353247/ today? [08:51:26] elukey: the labs one? yes definitely [08:51:30] and see what happens :-} [08:51:59] nice! [08:52:00] I thought that was already the case. Sorry I have not closely followed the state of the redis/jobrunner things [08:52:39] there is two job runners instances: deployment-jobrunner02.deployment-prep.eqiad.wmflabs [08:52:44] and deployment-tmh01.deployment-prep.eqiad.wmflabs [08:52:58] but the second one is a videoscaler right [08:52:59] ? 
[08:53:01] and I am not sure which redis db they end up hitting [08:53:02] yes [08:53:21] "tmh" probably stands for TimedMediaHandler [08:53:26] there were some rdb instances in labs, let me chec [08:53:27] the mediawiki extension that handles transcoding of video [08:53:28] check [08:53:49] that show up in the file changed by the patch above [08:53:49] https://gerrit.wikimedia.org/r/#/c/353247/1/wmf-config/jobqueue-labs.php [08:53:55] deployment-redis01 apparently [08:54:04] then I dont think beta is affected by the socket timeout [08:54:45] yeah [08:55:12] but we can measure the TCP time waits and see the number of connections [08:58:27] elukey: most probably we will want to deploy that on a single production jobrunner [08:58:40] monitor it for a few and see whether there is any impact [08:59:00] elukey: would be for later unfortunately. I am not around today :\ [08:59:57] okok let me know when you want to test it! I tried to live hack some days ago the jobrunner but didn't find any joy [09:06:56] elukey: maybe others can assist. For now I am off for rest of the day sorry ! [09:06:59] maybe tomorrow :} [09:07:02] * hashar waves [09:19:08] (03PS1) 10Addshore: Add extension-qunit-generic for TwoColConflict [integration/config] - 10https://gerrit.wikimedia.org/r/353258 (https://phabricator.wikimedia.org/T165021) [09:34:48] (03CR) 10WMDE-leszek: [C: 031] Add extension-qunit-generic for TwoColConflict [integration/config] - 10https://gerrit.wikimedia.org/r/353258 (https://phabricator.wikimedia.org/T165021) (owner: 10Addshore) [09:35:24] (03CR) 10Thiemo Mättig (WMDE): [C: 031] Add extension-qunit-generic for TwoColConflict [integration/config] - 10https://gerrit.wikimedia.org/r/353258 (https://phabricator.wikimedia.org/T165021) (owner: 10Addshore) [11:21:07] (03CR) 10Tobias Gritschacher: [C: 032] Add extension-qunit-generic for TwoColConflict [integration/config] - 10https://gerrit.wikimedia.org/r/353258 (https://phabricator.wikimedia.org/T165021) (owner: 10Addshore) [11:22:13] (03Merged) 10jenkins-bot: Add extension-qunit-generic for TwoColConflict [integration/config] - 10https://gerrit.wikimedia.org/r/353258 (https://phabricator.wikimedia.org/T165021) (owner: 10Addshore) [12:34:20] RECOVERY - Puppet errors on deployment-cache-upload04 is OK: OK: Less than 1.00% above the threshold [0.0] [12:34:34] RECOVERY - Puppet errors on deployment-cache-text04 is OK: OK: Less than 1.00% above the threshold [0.0] [12:35:13] RECOVERY - Puppet errors on deployment-aqs03 is OK: OK: Less than 1.00% above the threshold [0.0] [12:38:10] RECOVERY - Puppet errors on deployment-aqs02 is OK: OK: Less than 1.00% above the threshold [0.0] [12:40:29] RECOVERY - Puppet errors on deployment-aqs01 is OK: OK: Less than 1.00% above the threshold [0.0] [12:57:12] !log cherry-pick https://gerrit.wikimedia.org/r/#/c/353282/ [12:57:15] Logged the message at https://wikitech.wikimedia.org/wiki/Release_Engineering/SAL [13:44:46] 06Release-Engineering-Team (Deployment-Blockers), 05MW-1.30-release-notes (WMF-deploy-2017-05-09_(1.30.0-wmf.1)), 05Release: MW-1.30.0-wmf.1 deployment blockers - https://phabricator.wikimedia.org/T162954#3255112 (10Addshore) [13:46:39] Yippee, build fixed! 
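For the "measure the TCP time waits" idea at 08:55, a minimal sketch of checking it directly on deployment-jobrunner02 (the default redis port 6379 is an assumption; the graphite series deployment-prep.deployment-jobrunner02.network.connections.TIME_WAIT checked later tracks the same thing over time):

    # sockets in TIME-WAIT toward the redis job queue (port 6379 assumed)
    ss -tan state time-wait | grep -c ':6379'
    # total TIME-WAIT count every 5s, to compare before/after the jobqueue-labs.php change
    watch -n5 'ss -tan state time-wait | wc -l'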
[13:46:39] Project selenium-VisualEditor » firefox,beta,Linux,BrowserTests build #394: 09FIXED in 2 min 38 sec: https://integration.wikimedia.org/ci/job/selenium-VisualEditor/BROWSER=firefox,MEDIAWIKI_ENVIRONMENT=beta,PLATFORM=Linux,label=BrowserTests/394/ [13:57:51] There are quite some reports today about new JS/Gadget/RL breakage which I fail to debug (no helpful output in browser's DevTools). See https://phabricator.wikimedia.org/maniphest/?ids=165040,165031,165015#R for a list so far. [13:58:22] ^ Krinkle: FYI (if you have somebody better/else in mind please share names :) [14:16:51] Krinkle,AaronSchulz - https://gerrit.wikimedia.org/r/#/c/353247/1/wmf-config/jobqueue-labs.php got merged and it is now on deployment-jobrunner02.deployment-prep.eqiad.wmflabs, but I am not really seeing less connections in TIME-WAIT as I was expecting.. Am I missing something or is it intended to be in this way? [14:29:36] PROBLEM - Host deployment-phab02 is DOWN: CRITICAL - Host Unreachable (10.68.19.232) [14:41:52] (I am checking deployment-prep.deployment-jobrunner02.network.connections.TIME_WAIT) [14:42:00] (in https://graphite-labs.wikimedia.org/) [15:20:12] Going to deploy a fix for T165011 (cc twentyafterfour ) [15:20:12] T165011: Global default 'hard' is invalid for field oresDamagingPref - https://phabricator.wikimedia.org/T165011 [15:38:51] 06Release-Engineering-Team (Deployment-Blockers), 05MW-1.30-release-notes (WMF-deploy-2017-05-09_(1.30.0-wmf.1)), 05Release: MW-1.30.0-wmf.1 deployment blockers - https://phabricator.wikimedia.org/T162954#3255500 (10Ladsgroup) [15:49:11] 10Scap, 13Patch-For-Review: scap should always announce when it halts a sync due to error rate - https://phabricator.wikimedia.org/T164981#3255532 (10thcipriani) 05Open>03Resolved [16:17:37] 10Continuous-Integration-Infrastructure, 10MediaWiki-Unit-tests, 10Wikibase-Quality, 10Wikibase-Quality-Constraints, and 2 others: Segmentation fault in mwext-testextension-hhvm-composer-jessie builds - https://phabricator.wikimedia.org/T165064#3255706 (10Lucas_Werkmeister_WMDE) [16:51:03] Mmm could someone remind me of whom I might ask to get permissions on the beta cluster (short of just updating the beta cluster db directly, which I suppose I could do)? [16:51:08] thx in advance!!! [16:51:29] AndyRussG: If you've got access, you can just do that [16:51:40] But what permissions do you want/need on what wikis? :) [16:52:48] Reedy: Ah K, thx... Mmm I need to give User:Pcoombe (WMF) Central notice administrator rights on meta.wikimedia.beta.wmflabs.org [16:53:38] Yeah I do have ssh access, so if directly updating the db is the "right" way... [16:54:00] 16:53, 11 May 2017 Reedy (talk | contribs | block) changed group membership for Pcoombe (WMF) from petitiondata to petitiondata and central notice administrator [16:54:17] It's not the right way, but no one is likely to care if you were to do so [16:54:26] Ah K... [16:54:54] Reedy: thx so much!!!!! :) [16:55:28] no problem! [16:55:30] * AndyRussG hides puritan concerns about correctness behind a rock 8p [16:55:54] :) [16:55:58] If you did it on the production wikis... 
Unless for very good reason, yes someone would probably complain :P [16:57:10] Heh yeah rightly so [16:57:48] Reedy: I complain regardless xD [16:58:21] 06Release-Engineering-Team (Deployment-Blockers), 05MW-1.30-release-notes (WMF-deploy-2017-05-09_(1.30.0-wmf.1)), 05Release: MW-1.30.0-wmf.1 deployment blockers - https://phabricator.wikimedia.org/T162954#3255927 (10matmarex) [17:04:10] PROBLEM - Puppet errors on integration-slave-docker-1000 is CRITICAL: CRITICAL: 44.44% of data above the critical threshold [0.0] [17:19:17] is T165069 within releng's scope? [17:19:19] T165069: Update swat deployers documation - https://phabricator.wikimedia.org/T165069 [17:31:44] Zppix: yes [17:32:00] Zppix: a better explainantion of the problem would be helpful. What was missing etc. [17:33:06] otherwise that's simply bug1/task2001 [17:39:07] RECOVERY - Puppet errors on integration-slave-docker-1000 is OK: OK: Less than 1.00% above the threshold [0.0] [17:40:18] Hey, https://integration.wikimedia.org/ci/job/mediawiki-extensions-hhvm-jessie looks pretty unwell (about 90% of jobs are failing with "Lost parent, LightProcess exiting"). Nothing looked likely on Phab directly, though T145819 has the same error. [17:40:19] T145819: Jobs invoking SiteConfiguration::getConfig cause HHVM to fail updating the bytecode cache due to being filesize limited to 512MBytes - https://phabricator.wikimedia.org/T145819 [17:41:12] It's likely it's hhvm related, indeed [17:41:20] Upgrade test of 3.18.2 [17:42:24] 10Continuous-Integration-Infrastructure, 10Wikidata, 07HHVM, 07Jenkins: CI tests failing with segfault - https://phabricator.wikimedia.org/T165074#3256111 (10daniel) [17:42:43] 10Continuous-Integration-Infrastructure, 10Wikidata, 07HHVM, 07Jenkins: CI tests failing with segfault - https://phabricator.wikimedia.org/T165074#3256127 (10daniel) [17:43:20] Ah. [17:44:16] 10Continuous-Integration-Infrastructure, 10Wikidata, 07HHVM, 07Jenkins: CI tests failing with segfault - https://phabricator.wikimedia.org/T165074#3256111 (10Jdforrester-WMF) Not just Wikibase. Errors in VE-MW and MobileFrontend. :-( Can we revert the test for now to see if that fixes it? [17:45:08] ugh, that error [17:45:27] I think it's not so easy to revert... As it's in the apt repo [17:45:49] Oh. Did we just do a cluster upgrade of HHVM? [17:45:56] Not just [17:45:58] Earlier today [17:46:02] And in production, it's not all servers [17:46:03] So, yes. [17:46:04] who? moritz? [17:46:15] or joe? [17:46:24] Moritz [17:46:41] I know there was some prod testing of the new HHVM which had been going well, but I hadn't heard anything for a couple of weeks. 
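A quick sketch for confirming which HHVM build a given CI slave or beta host actually ended up with, and from which apt component it was pulled (plain Debian tooling, nothing WMF-specific assumed):

    hhvm --version            # the installed HHVM build
    apt-cache policy hhvm     # installed vs. candidate version, and the repo (main vs. experimental) offering it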
[17:46:43] I guess, specifically this is the problem [17:46:45] 15:18 moritzm: uploaded HHVM 3.18.2 and HHVM extensions to apt.wikimedia.org/main (previously only in experimental) [17:46:48] 06Release-Engineering-Team (Deployment-Blockers), 05MW-1.30-release-notes (WMF-deploy-2017-05-09_(1.30.0-wmf.1)), 05Release: MW-1.30.0-wmf.1 deployment blockers - https://phabricator.wikimedia.org/T162954#3256150 (10Krinkle) [17:46:51] from yesterday [17:49:19] 10Continuous-Integration-Infrastructure, 10Wikidata, 07HHVM, 07Jenkins: CI tests failing with segfault - https://phabricator.wikimedia.org/T165074#3256153 (10greg) @MoritzMuehlenhoff HHVM upgrade is causing segfaults in CI [17:49:35] 10Continuous-Integration-Infrastructure, 06Operations, 10Wikidata, 07HHVM, 07Jenkins: CI tests failing with segfault - https://phabricator.wikimedia.org/T165074#3256158 (10greg) [17:53:50] 10Continuous-Integration-Infrastructure, 06Operations, 10Wikidata, 07HHVM, 07Jenkins: CI tests failing with segfault - https://phabricator.wikimedia.org/T165074#3256111 (10Paladox) May be related T165043 [17:55:41] 06Release-Engineering-Team, 07Documentation, 15User-Zppix: Update swat deployers documentation - https://phabricator.wikimedia.org/T165069#3256178 (10Zppix) per @greg in IRC [17:56:47] 06Release-Engineering-Team (Deployment-Blockers), 05MW-1.30-release-notes (WMF-deploy-2017-05-09_(1.30.0-wmf.1)), 05Release: MW-1.30.0-wmf.1 deployment blockers - https://phabricator.wikimedia.org/T162954#3256193 (10Krinkle) [17:58:47] 06Release-Engineering-Team, 07Documentation, 15User-Zppix: Update swat deployers documentation - https://phabricator.wikimedia.org/T165069#3256197 (10Zppix) [17:58:59] greg-g: Updated task description ^ [17:59:05] 10Continuous-Integration-Infrastructure, 06Operations, 10Wikidata, 07HHVM, 07Jenkins: CI tests failing with segfault - https://phabricator.wikimedia.org/T165074#3256198 (10greg) Can someone create a simple repo case? Or at least a backtrace? [17:59:30] greg-g: Hey, added another window for cleaning the table: https://wikitech.wikimedia.org/wiki/Deployments#Week_of_May_8th I hope that's okay for you [17:59:51] typical, ci-jessie-wikimedia-658297 has disappeared [18:01:24] sorry for mistagging greg :/ I must of misunderstood your answer to my first question. [18:02:58] Hmmmmmmm [18:03:05] well, you asked about SWAT deploys, which is us, but your actual question/issue is with the script, not SWATs [18:03:12] typical X/Y problem [18:04:18] Zppix: you aren't making any sense. [18:04:29] How do we ssh onto the ci slaves? [18:04:49] greg-g: What do you have questions upon? [18:05:04] exactly* [18:05:24] THAT'S WHAT I'M ASKING YOU [18:06:08] Zppix: seriously, you don't appear to know what the issue is, so please just let others take care of it if they need to. [18:06:48] 10Continuous-Integration-Infrastructure, 05Security: SSH Host Key Verifiers are not configured for all SSH slaves on this Jenkins instance - https://phabricator.wikimedia.org/T165075#3256226 (10Reedy) [18:27:07] 10Continuous-Integration-Infrastructure, 06Operations, 10Wikidata, 07HHVM, 07Jenkins: CI tests failing with segfault - https://phabricator.wikimedia.org/T165074#3256310 (10Reedy) >>! In T165074#3256198, @greg wrote: > Can someone create a simple repo case? Or at least a backtrace? Do we have a jessie hh... [18:27:47] Reedy: a host in beta cluster I guess? 
[18:28:09] Preferably somewhere we can trivially run the php unit script under gdb or something [18:29:54] I guess beta cluster should work [18:29:56] * Reedy looks at tin [18:33:55] No phpunit installed on tin [18:37:49] Reedy: Is this failure related to that? https://integration.wikimedia.org/ci/job/mwext-testextension-hhvm-composer-jessie/2278/console [18:38:44] "Segmentation fault" [18:39:07] Very likely, yeah [18:41:29] I saw that and guessed it was a different HHVM upgrade bug. [18:41:31] But yes. [18:43:37] (03PS1) 10Reedy: Branch LoginNotify [tools/release] - 10https://gerrit.wikimedia.org/r/353348 [18:43:42] Niharika: ^ [18:44:43] Reedy: Thanks! I also need to add it to extension-list too? [18:44:59] Yup, remove it from extension-list-labs too [18:45:28] Reedy: What all does one need to do for adding a new extension? 1. Add to tools/release 2. Add to extension-list 3. Add to CS/IS [18:45:30] Anything else? [18:47:42] https://wikitech.wikimedia.org/w/index.php?title=How_to_deploy_code [18:48:04] That's pretty much it... [18:48:27] tools/release needed because whereas beta has access to all the extensions, production only has that list [18:48:50] extension-list (or moving from beta to productions -- beta uses productions too) for scap and localisation update to include the messages [18:48:55] Then CS/IS for loading/configuration [18:50:00] greg-g: I wonder if beta should just have a server provisioned like tin but with phpunit... Or just put phpunit on tin? [18:50:13] * greg-g shrugs [18:50:17] probably easier to do option 2 [18:50:28] the ci boxes include it via composer [18:50:39] I'm sure we don't really want to apt-get install it... or pear [18:50:43] * Reedy wget's phpunit.phar [18:52:19] 10Continuous-Integration-Infrastructure, 06Operations, 10Wikidata, 07HHVM, 07Jenkins: CI tests failing with segfault - https://phabricator.wikimedia.org/T165074#3256387 (10Jdforrester-WMF) p:05Triage>03High This is at least High, as it's stopping merges into master in most repos. [18:52:22] 10Continuous-Integration-Infrastructure, 10MediaWiki-Unit-tests, 10Wikibase-Quality, 10Wikibase-Quality-Constraints, and 2 others: Segmentation fault in mwext-testextension-hhvm-composer-jessie builds - https://phabricator.wikimedia.org/T165064#3255706 (10Mattflaschen-WMF) I'm getting them as well: E.g. h... [18:52:59] Of course, this is made harder by having no phpunit.php flag for the phar anymoe [18:53:17] So i have this in my dev wiki... [18:53:18] if ( defined( 'MW_PHPUNIT_TEST' ) && MW_PHPUNIT_TEST ) { [18:53:18] include_once ( '/var/www/wiki/mediawiki/phpunit-old.phar' ); [18:53:18] } [18:55:56] Why do I feel I'm over thinking this [18:57:49] Reedy: Is there a way to resurrect a merged (and the reverted) patch? Like https://gerrit.wikimedia.org/r/#/c/351195/ [18:57:58] Or do I have to make a new one? [18:57:59] 10Continuous-Integration-Infrastructure, 10MediaWiki-Unit-tests, 10Wikibase-Quality, 10Wikibase-Quality-Constraints, and 2 others: Segmentation fault in mwext-testextension-hhvm-composer-jessie builds - https://phabricator.wikimedia.org/T165064#3256438 (10Paladox) [18:58:01] 10Continuous-Integration-Infrastructure, 06Operations, 10Wikidata, 07HHVM, 07Jenkins: CI tests failing with segfault - https://phabricator.wikimedia.org/T165074#3256436 (10Paladox) [18:58:13] Niharika: You can revert the revert [18:58:27] And edit the commit summary, to make it nicer [18:58:35] Ah, nice. 
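Condensing the new-extension checklist above into one rough sketch (file names follow the tools/release and mediawiki-config conventions; the branch name, URL and scap invocation are illustrative, and the submodule step is the manual "add it to the branch" part discussed a bit later):

    # 1. tools/release: add the extension to make-wmf-branch's config so future wmf branches include it
    # 2. mediawiki-config: move it from wmf-config/extension-list-labs to wmf-config/extension-list,
    #    then wire up wfLoadExtension()/settings in CommonSettings.php and InitialiseSettings.php
    # 3. for an already-cut branch, add the submodule by hand on the deployment host and sync:
    cd /srv/mediawiki-staging/php-1.30.0-wmf.1/extensions
    git submodule add https://gerrit.wikimedia.org/r/mediawiki/extensions/LoginNotify
    git commit -m "Add LoginNotify submodule"
    scap sync "Add LoginNotify to wmf.1"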
[18:58:38] "Enable LoginNotify on testwiki (take 2)" [18:58:39] or similar [19:00:19] And of course, tin doesn't use hhvm by default [19:01:45] Niharika: I didn't realize you were deploying LoginNotify today [19:02:30] kaldari: I added it to the swat but didn't add it to tools/release. [19:02:35] So reverted for now. [19:02:42] Can retry in evening swat. [19:02:59] You'll need to manually add it to the branch, and run scap too obviously when you want to deploy it [19:03:04] Niharika: Yeah, I wrote up some instructions at https://phabricator.wikimedia.org/T165007 [19:03:48] Niharika: There is more documentation here: https://wikitech.wikimedia.org/wiki/How_to_deploy_code#Add_new_extension_to_extension-list_and_release_tools [19:05:17] Niharika: Sorry, I didn't mention that part to you :P [19:25:06] PROBLEM - Puppet errors on deployment-puppetmaster02 is CRITICAL: CRITICAL: 33.33% of data above the critical threshold [0.0] [19:44:53] 10Beta-Cluster-Infrastructure, 10Continuous-Integration-Infrastructure: Have a way to run phpunit (etc) manually on a machine in beta - https://phabricator.wikimedia.org/T165088#3256575 (10Reedy) [19:51:57] Well...] [19:52:06] I've got a fairly minimal replication case for facebook [19:52:09] use our vagrant [19:52:12] run phpunit in hhvm [19:52:15] segfault [19:53:09] Reedy: I put all of the related changes on https://gerrit.wikimedia.org/r/#/c/353352/ (all changes in config that is) [19:53:30] Reedy: How do I "manually add it to the branch"? [19:53:38] Niharika: git submodule add... [19:53:45] I think it's on the how to deploy code page [19:53:53] Ah, okay. [19:54:25] Or it was, it may have been removed at some point :P [19:54:31] Reedy: MaxSem and ebernhardson were pretty good at getting useful traces for HHVM crashes in the past if I'm remembering correctly [19:55:09] I've just gotta wait for 500MB of debug stuff to install ;) [19:55:57] PROBLEM - Puppet errors on deployment-mira is CRITICAL: CRITICAL: 60.00% of data above the critical threshold [0.0] [20:01:24] ^ aww, why now? [20:01:32] was fine yesterday [20:04:07] Notice: /Stage[main]/Confd/Base::Service_unit[confd]/Service[confd]/ensure: ensure changed 'stopped' to 'running' [20:04:13] Notice: Finished catalog run in 96.21 seconds [20:05:05] RECOVERY - Puppet errors on deployment-puppetmaster02 is OK: OK: Less than 1.00% above the threshold [0.0] [20:06:06] .. ok then .. [20:16:00] RECOVERY - Puppet errors on deployment-mira is OK: OK: Less than 1.00% above the threshold [0.0] [20:16:38] Getting a segmentation fault from mw-phpunit.sh: https://gerrit.wikimedia.org/r/#/c/353341/ [20:17:07] Ah, looks like it's due to https://phabricator.wikimedia.org/T165074 [20:17:08] kaldari: known issue atm [20:17:09] :) [20:17:11] :( [20:17:24] :| [20:40:20] 10Continuous-Integration-Infrastructure, 06Operations, 10Wikidata, 07HHVM, 07Jenkins: CI tests failing with segfault - https://phabricator.wikimedia.org/T165074#3256780 (10hashar) p:05High>03Unbreak! That is caused by the upgrade of HHVM {T158176}. 3.18 has been uploaded to apt.wikimedia.org under j... [20:42:21] 10Continuous-Integration-Infrastructure, 06Operations, 10Wikidata, 07HHVM, 07Jenkins: CI tests failing with segfault - https://phabricator.wikimedia.org/T165074#3256789 (10hashar) The snapshots we have: | ID | Provider | Image | Hostname | Version | Image ID... [20:43:40] !log nodepool: delete today jessie image snapshot. It comes with HHVM 3.18 which segfault with MediaWiki/PHPUnit. Rolled back to snapshot-ci-jessie-1494425642 from 30 hours ago. 
T165074 [20:43:44] Logged the message at https://wikitech.wikimedia.org/wiki/Release_Engineering/SAL [20:43:44] T165074: CI tests failing with segfault - https://phabricator.wikimedia.org/T165074 [20:44:05] thcipriani: ^^ [20:44:21] heh [20:44:23] some new version of HHVM ends up segfaulting on PHPUnit so I have deleted nodepool jessie image [20:44:29] yeah, we know [20:44:30] :P [20:44:35] in theory it should rollback to the image from yesterday [20:44:40] 10Continuous-Integration-Infrastructure, 06Operations, 10Wikidata, 07HHVM, 07Jenkins: CI tests failing with segfault - https://phabricator.wikimedia.org/T165074#3256794 (10greg) @MoritzMuehlenhoff we should probably downgrade the HHVM version from Beta and CI and work on repro'ing elsewhere. This is prev... [20:44:44] I'm currently fighting it to get a backtrace out of it [20:44:51] ah, right, I remember that "fix" from last time we upgraded hhvm :\ [20:45:21] moritz told me there is a known bug in the new version but systemd restarts it and it's not critical for now [20:45:52] so some (automatic) restarts would still be in the known category, but manual ones should not be needed [20:45:53] yeah [20:45:58] We've possibly found other segfaults [20:46:02] ugh, ok [20:46:33] trying to get a backtrace... to see if it's one of the others we know about [20:46:41] and if so, a minimal replication case [20:46:43] or if it's a new one [20:46:55] mutante: i would need the old hhvm 3.12 to be uploaded to jessie-wikimedia/main [20:47:09] else hhvm 3.18 is going to be reinstalled again tomorrow :/ [20:49:11] 10Continuous-Integration-Infrastructure, 06Operations, 10Wikidata, 07HHVM, 07Jenkins: CI tests failing with segfault - https://phabricator.wikimedia.org/T165074#3256803 (10hashar) Jessie instances are now being booted from `snapshot-ci-jessie-1494425642` which should have the previous HHVM version. What... [20:50:01] hashar: if it is urgent for right now i'd rather make the phone call, instead of trying to downgrade and remove new version from reprepro. [20:50:21] 10Continuous-Integration-Infrastructure, 06Operations, 10Wikidata, 07HHVM, 07Jenkins: CI tests failing with segfault - https://phabricator.wikimedia.org/T165074#3256804 (10Reedy) Ok, so a clean vagrant vm (with 4GB ram!), will segfault by running phpunit with no extensions From gdb attached... ``` Cont... [20:50:21] Reedy: should be good now [20:50:38] mutante: well I think CI is fine now [20:50:39] 10Continuous-Integration-Infrastructure, 06Operations, 10Wikidata, 07HHVM, 07Jenkins: CI tests failing with segfault - https://phabricator.wikimedia.org/T165074#3256805 (10MoritzMuehlenhoff) We can't easily downgrade the HHVM package in the main repo, it's otherwise working fine in production and running... [20:50:42] i have tried that for releases/misc before and became a looong issue.. including caching [20:50:49] hashar: pheew. ok! great [20:50:49] mutante: I will circle back with moritz tomorrow morning [20:50:58] hashar: awesome [20:53:32] 10Continuous-Integration-Infrastructure, 06Operations, 10Wikidata, 07HHVM, 07Jenkins: CI tests failing with segfault - https://phabricator.wikimedia.org/T165074#3256807 (10greg) We should really use the new HHVM in testing first before going to production. If the tests are broken it means fix the tests/t... 
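The minimal reproduction described at 19:52 and in the 20:50 task comment above, as a sketch (MediaWiki-Vagrant paths; the VM wants ~4GB of RAM and no extra extensions loaded):

    cd /vagrant/mediawiki
    hhvm tests/phpunit/phpunit.php --wiki=wiki
    # all ~14947 tests run to completion, then HHVM segfaults while sweeping memory at shutdown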
[20:56:19] 10Continuous-Integration-Infrastructure, 06Operations, 10Wikidata, 07HHVM, 07Jenkins: CI tests failing with segfault - https://phabricator.wikimedia.org/T165074#3256810 (10hashar) p:05Unbreak!>03High a:03hashar CI instances have been rollbacked to the last known snapshot which uses HHVM 3.12.14. I... [20:56:57] mutante: yeah new jobs definitely run on 3.12 so it is all fine and there is no need to page :] [20:59:09] 10Continuous-Integration-Infrastructure, 06Operations, 10Wikidata, 07HHVM, 07Jenkins: CI tests failing with segfault - https://phabricator.wikimedia.org/T165074#3256816 (10MoritzMuehlenhoff) The new HHVM version has been extensively tested on five canary servers in production for 5-6 week now. As per Ree... [21:00:01] hashar: yay:) [21:01:29] mutante: moritz has been in touch with me about it almost on a daily basis [21:01:35] we will figure out something tomorrow :] [21:02:17] it's fun that we had a bug when running wikipedia and then just running the testsuite exposes it :P [21:03:02] Platonides: yuuup ;) [21:03:13] Can't seem to identify which test is causing it to segfault [21:03:15] It's one of the last few [21:03:38] ........................................................... 14927 / 14947 ( 99%) [21:03:38] .................... [21:03:58] maybe it's related to the cleanup rather than the actual tests executed ? [21:04:03] It kinda seems to be [21:04:06] there's 20 dots after [21:04:20] so, all the tests are running [21:04:20] https://github.com/facebook/hhvm/issues/7779#issuecomment-300914747 [21:04:39] I guess these are telling [21:04:40] #0 HPHP::UserFile::close (this=0x7f1afd461090) at /tmp/buildd/hhvm-3.18.2+dfsg/hphp/runtime/base/user-file.cpp:212 [21:04:40] #1 0x0000000001a74ec2 in HPHP::XMLReader::close (this=0x7f1afd411ec0) at /tmp/buildd/hhvm-3.18.2+dfsg/hphp/runtime/ext/xmlreader/ext_xmlreader.cpp:95 [21:04:40] #2 0x0000000002133130 in HPHP::MemoryManager::sweep (this=0x7f1b25a8c840, this@entry=) [21:04:40] at /tmp/buildd/hhvm-3.18.2+dfsg/hphp/runtime/base/memory-manager.cpp:471 [21:05:00] I had seen the comment :) [21:05:16] seems a bug in xml extension [21:08:07] Yup, narrowed case [21:08:08] tests/phpunit/includes/import/ImportTest.php [21:10:04] How do I get phpunit to run individual tests in the file? [21:11:06] --filter apparently.. [21:11:25] testUnknownXMLTags [21:17:01] Reedy: I dont think it is a specific test. it is most probably late when hhvm clean up the memory [21:17:10] hashar: I know [21:17:16] ok ok :] [21:17:20] But I'm narrowing the test case for hhvm people [21:17:29] Rather than saying "run all our phpunit tests!" [21:17:30] :P [21:17:30] neat! [21:17:42] they're not quick to run, as we know [21:17:44] that seems like a test which doesn't even need a db [21:17:56] so having only one test file to run... which takes seconds [21:18:11] so that would help building the build environment [21:18:21] Reedy: have you seen https://phabricator.wikimedia.org/T156923 ? "New HHVM 3.12.11 segfault at end of MediaWiki PHPUnit tests" [21:18:30] mentions xmlreader as well [21:18:51] The stack trace looks very similar [21:19:13] Reedy: that tasks has log of my debugging / repro journey [21:19:30] Of course, this just means it's not been fixed in hhvm ;) [21:19:45] hhvm -v Eval.Jit=false tests/phpunit/phpunit.php tests/phpunit/includes/import/ [21:19:48] try that one maybe? 
[21:20:03] that should hit includes/import/WikiImporter.php / XMLReader [21:20:09] it might just be that bug surfacing again [21:20:13] [492dbf97df60049a66692f6d] [no req] Wikimedia\Rdbms\DBConnectionError from line 769 of /vagrant/mediawiki/includes/libs/rdbms/database/Database.php: Cannot access the database: Unknown database 'tests/phpunit/includes/import/' (127.0.0.1) [21:20:14] :D [21:20:33] ???!!!! [21:21:40] vagrant [21:22:00] 10Continuous-Integration-Infrastructure, 06Operations, 10Wikidata, 07HHVM, 07Jenkins: CI tests failing with segfault - https://phabricator.wikimedia.org/T165074#3256869 (10hashar) Might be {T156923} surfacing again which mentionned XMLReader. [21:22:02] ;) [21:22:06] hhvm -v Eval.Jit=false tests/phpunit/phpunit.php --wiki=wiki tests/phpunit/includes/import/ [21:22:18] But I don't get core dumps [21:22:22] Which is annoying as hell [21:22:40] gotta enable them and set the core file max size [21:22:55] ResourceLimit.CoreFileSize + some ulimit [21:23:02] ulimit -c unlimited [21:23:04] that's enough [21:23:07] ah [21:23:08] :] [21:23:14] before... the vm just didn't have enough memory :P [21:25:56] "/vagrant/mediawiki/core": not in executable format: File format not recognized [21:26:00] silly hhvm-gdb [22:00:50] 06Release-Engineering-Team (Deployment-Blockers), 05Release: MW-1.30.0-wmf.2 deployment blockers - https://phabricator.wikimedia.org/T163512#3257029 (10mmodell) [22:09:38] 06Release-Engineering-Team (Deployment-Blockers), 05MW-1.30-release-notes (WMF-deploy-2017-05-09_(1.30.0-wmf.1)), 05Release: MW-1.30.0-wmf.1 deployment blockers - https://phabricator.wikimedia.org/T162954#3257055 (10mmodell) [23:15:12] 06Release-Engineering-Team (Deployment-Blockers), 05MW-1.30-release-notes (WMF-deploy-2017-05-09_(1.30.0-wmf.1)), 05Release: MW-1.30.0-wmf.1 deployment blockers - https://phabricator.wikimedia.org/T162954#3257236 (10mmodell) 05Open>03Resolved [23:42:27] PROBLEM - Long lived cherry-picks on puppetmaster on deployment-puppetmaster02 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [0.0]
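Pulling the narrowed-down repro and core-dump steps from 21:08-21:26 together as one sketch (where the core file lands depends on kernel.core_pattern; /usr/bin/hhvm is the usual Debian path, and the hhvm-gdb wrapper mentioned at 21:26 is an alternative front end):

    cd /vagrant/mediawiki
    ulimit -c unlimited        # as noted at 21:23, enough to get a core file
    hhvm -v Eval.Jit=false tests/phpunit/phpunit.php --wiki=wiki tests/phpunit/includes/import/
    # or narrow to the single suspect test with --filter:
    hhvm tests/phpunit/phpunit.php --wiki=wiki --filter testUnknownXMLTags tests/phpunit/includes/import/ImportTest.php
    gdb /usr/bin/hhvm core     # then `bt` for the backtrace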