[02:11:59] PROBLEM - Puppet staleness on integration-slave-precise-1012 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [43200.0] [02:13:30] PROBLEM - Puppet staleness on integration-slave-precise-1002 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [43200.0] [02:14:06] PROBLEM - Puppet staleness on integration-slave-precise-1011 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [43200.0] [03:03:08] (03PS1) 10Reedy: Simplify config, remove additions/removal upto REL1_23 [tools/release] - 10https://gerrit.wikimedia.org/r/343589 [03:05:24] (03PS2) 10Reedy: Simplify config, remove additions/removal upto REL1_23 [tools/release] - 10https://gerrit.wikimedia.org/r/343589 [03:05:26] (03PS1) 10Reedy: After REL1_23 is obsolete, simplify config further [tools/release] - 10https://gerrit.wikimedia.org/r/343591 [03:05:47] (03CR) 10Reedy: [C: 04-2] "Probably shouldn't be merged till ~May when REL1_23 goes EOL" [tools/release] - 10https://gerrit.wikimedia.org/r/343591 (owner: 10Reedy) [03:45:01] 10Gerrit, 06Operations: Decide how to support polygerrit - https://phabricator.wikimedia.org/T158479#3113507 (10Dzahn) Very nice that we don't need the rewrites anymore :) [04:04:33] PROBLEM - Puppet staleness on deployment-aqs02 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [43200.0] [04:07:45] Project selenium-MultimediaViewer » safari,beta,OS X 10.9,BrowserTests build #335: 04FAILURE in 11 min: https://integration.wikimedia.org/ci/job/selenium-MultimediaViewer/BROWSER=safari,MEDIAWIKI_ENVIRONMENT=beta,PLATFORM=OS%20X%2010.9,label=BrowserTests/335/ [04:17:56] Project selenium-MultimediaViewer » firefox,beta,Linux,BrowserTests build #335: 04FAILURE in 21 min: https://integration.wikimedia.org/ci/job/selenium-MultimediaViewer/BROWSER=firefox,MEDIAWIKI_ENVIRONMENT=beta,PLATFORM=Linux,label=BrowserTests/335/ [05:48:19] Project mediawiki-core-code-coverage build #2644: 04STILL FAILING in 2 hr 48 min: https://integration.wikimedia.org/ci/job/mediawiki-core-code-coverage/2644/ [08:51:03] !log Jenkins: depooling / deleting Precise instances. [08:51:06] Logged the message at https://wikitech.wikimedia.org/wiki/Release_Engineering/SAL [08:52:08] PROBLEM - Host integration-slave-precise-1012 is DOWN: CRITICAL - Host Unreachable (10.68.17.174) [08:52:49] 10Continuous-Integration-Infrastructure, 06Release-Engineering-Team, 13Patch-For-Review: Depool precise jenkins instances - https://phabricator.wikimedia.org/T158652#3113710 (10hashar) Removed from Jenkins, puppet master and salt master. I have deleted the three instances via Horizon. 
[08:53:14] PROBLEM - Host integration-slave-precise-1011 is DOWN: CRITICAL - Host Unreachable (10.68.17.70) [08:53:36] PROBLEM - Host integration-slave-precise-1002 is DOWN: CRITICAL - Host Unreachable (10.68.17.87) [10:09:44] (03PS2) 10Hashar: [operations/switchdc] add tox-jessie [integration/config] - 10https://gerrit.wikimedia.org/r/343005 (owner: 10Volans) [10:09:52] (03CR) 10Hashar: [C: 032] [operations/switchdc] add tox-jessie [integration/config] - 10https://gerrit.wikimedia.org/r/343005 (owner: 10Volans) [10:11:41] (03Merged) 10jenkins-bot: [operations/switchdc] add tox-jessie [integration/config] - 10https://gerrit.wikimedia.org/r/343005 (owner: 10Volans) [10:12:55] (03PS3) 10Hashar: test: invoke rspec directly [selenium] - 10https://gerrit.wikimedia.org/r/330856 (https://phabricator.wikimedia.org/T137112) [10:12:57] (03PS2) 10Hashar: (WIP) have cucumber to auto install phantomjs (WIP) [selenium] - 10https://gerrit.wikimedia.org/r/330864 [13:25:29] twentyafterfour: hi, are you ready to switch phab search to codfw? [13:26:11] dcausse: yeah, I just need to make a config change patch and get +2 in ops/puppet [13:26:52] twentyafterfour: ok, we are going to first depool eqiad so that mw stops writing to it, then we will shutdown the cluster [13:27:05] so please switch when you want [13:29:53] dcausse: ok I'll put the change through right away [13:30:19] thanks :) [13:32:50] dcausse: https://gerrit.wikimedia.org/r/#/c/343635/ [13:36:24] twentyafterfour: will ask gehel to +2 when we're ready but not sure what to do about this tox failure :/ [13:36:53] yeah I don't know what's up with that either, I can't actually see anything but warnings [13:36:59] * gehel is having alook... [13:37:57] yeah, the 140 char per line is a new rule. The checks are run only on files that are changed, so we need to fix it. [13:39:12] fix coming up... [13:39:45] that shouldn't be listed as a warning if it's gonna fail and it should only be enforced on lines of code that changed [13:40:05] agreed and agreed, but that's not the case... [13:46:40] Yippee, build fixed! 
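The 140-characters-per-line rule being discussed is a flake8 maximum line length check. As a rough illustration only (the exact file and section in operations/puppet may differ from this sketch), such a rule is usually declared in the repository's flake8/tox configuration:

    [flake8]
    max-line-length = 140

Because the tox job only lints the files a patch touches, a pre-existing over-long line starts failing as soon as any change edits that file, which is why a separate fix was needed before the puppet config change above could pass.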
[13:46:40] Project selenium-VisualEditor » firefox,beta,Linux,BrowserTests build #342: 09FIXED in 2 min 39 sec: https://integration.wikimedia.org/ci/job/selenium-VisualEditor/BROWSER=firefox,MEDIAWIKI_ENVIRONMENT=beta,PLATFORM=Linux,label=BrowserTests/342/ [13:49:09] PROBLEM - Puppet run on buildlog is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [0.0] [13:49:31] RECOVERY - Puppet staleness on deployment-aqs02 is OK: OK: Less than 1.00% above the threshold [3600.0] [13:52:17] PROBLEM - English Wikipedia Mobile Main page on beta-cluster is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal Server Error - 3816 bytes in 0.198 second response time [13:52:17] PROBLEM - App Server Main HTTP Response on deployment-mediawiki04 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal Server Error - 3250 bytes in 0.092 second response time [13:52:20] PROBLEM - English Wikipedia Main page on beta-cluster is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal Server Error - 3816 bytes in 0.185 second response time [13:57:16] RECOVERY - English Wikipedia Mobile Main page on beta-cluster is OK: HTTP OK: HTTP/1.1 200 OK - 33422 bytes in 0.957 second response time [13:57:16] RECOVERY - App Server Main HTTP Response on deployment-mediawiki04 is OK: HTTP OK: HTTP/1.1 200 OK - 45865 bytes in 0.769 second response time [13:57:18] RECOVERY - English Wikipedia Main page on beta-cluster is OK: HTTP OK: HTTP/1.1 200 OK - 46474 bytes in 0.678 second response time [15:33:26] 10Scap: If aborting a scap due to test canary error rate, output some errors for reference - https://phabricator.wikimedia.org/T159991#3114626 (10thcipriani) p:05Triage>03Normal It does seem like showing some errors would be helpful; although possibly a little tricky to get at that information. The reason s... [15:43:29] 10Deployment-Systems, 10Scap, 15User-bd808: sync-wikiversions not syncing wikiversions.json with mira - https://phabricator.wikimedia.org/T121585#3114671 (10thcipriani) 05Open>03Resolved a:03bd808 Haven't noticed the problem since {rMSCA9fbd6f8f486a7405e3df7125bd50ad03164f5512} merged. [15:50:20] RainbowSprinkles hi, I've managed to fix polygerrit for prefixed urls fully (there may be some places I haven't fixed it but it should work in almost all cases in my testing.) I have found significant performance improvements using polygerrit compared to gwt. That includes cherry pick which worked immediately for me. This doesn't measure up to prod though.
:) (this dosen't require rewrites either only a new config in gerrit.config) [15:53:20] Project selenium-MobileFrontend » firefox,beta,Linux,BrowserTests build #365: 04FAILURE in 31 min: https://integration.wikimedia.org/ci/job/selenium-MobileFrontend/BROWSER=firefox,MEDIAWIKI_ENVIRONMENT=beta,PLATFORM=Linux,label=BrowserTests/365/ [15:54:27] PROBLEM - Long lived cherry-picks on puppetmaster on deployment-puppetmaster02 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [0.0] [16:22:38] (03PS4) 10Paladox: [operations-puppet-catalog-compiler] Adding it to jenkins job builder [integration/config] - 10https://gerrit.wikimedia.org/r/315994 (https://phabricator.wikimedia.org/T97513) [16:25:47] (03PS5) 10Paladox: Create a test that deploy's jjb changes without needing to ssh in and deploy your self [integration/config] - 10https://gerrit.wikimedia.org/r/323198 [16:25:51] (03PS6) 10Paladox: Create a test that deploy's jjb changes without needing to ssh in and deploy your self [integration/config] - 10https://gerrit.wikimedia.org/r/323198 [16:26:14] (03PS7) 10Paladox: Create a test that deploy's jjb changes without needing to ssh in and deploy your self [integration/config] - 10https://gerrit.wikimedia.org/r/323198 [16:32:23] 10Continuous-Integration-Infrastructure (Little Steps Sprint): For operations/puppet : merge tox / rake jobs in a single job? - https://phabricator.wikimedia.org/T160923#3114888 (10hashar) [16:35:17] 10Continuous-Integration-Infrastructure (Little Steps Sprint): Get rid of zend tests for wmf branches - https://phabricator.wikimedia.org/T94149#3114924 (10hashar) Talked about it again during the release engineering meeting. We believe that although maintenance script are still using Zend php5.5, running tests... [16:37:51] (03PS4) 10Paladox: Revert "Rely on `php` in assert-phpflavor macro" [integration/config] - 10https://gerrit.wikimedia.org/r/337225 (https://phabricator.wikimedia.org/T157750) [16:40:23] (03PS3) 10Paladox: Whitelist tosfos [integration/config] - 10https://gerrit.wikimedia.org/r/343403 [16:44:00] (03Abandoned) 10Paladox: Reuse phplint code in job-template.yaml [integration/config] - 10https://gerrit.wikimedia.org/r/313230 (owner: 10Paladox) [16:52:00] anybody here that knows about browser tests for mediawiki? [16:52:40] SMalyshev: zeljkof / marxarelli. Though we are in a team meeting right now [16:52:56] and we have a patch to migrate the testruner to javascript :] [16:53:49] https://gerrit.wikimedia.org/r/#/c/328191/ which will land this week [16:56:52] SMalyshev: what's the question? [16:57:40] zelikof: I got an impression from https://www.mediawiki.org/wiki/Browser_testing/Writing_tests#API_login that browser tests don't support logging in via browser session [16:57:46] is that true [16:58:00] zeljkof: ^? [16:58:18] SMalyshev: oh man, where to start [16:58:34] not sure how you stumbled upon that page, or why it isn't marked as obsolete [16:58:56] I just looked for any docs on our browser tests. Are there better ones? [16:58:57] in short, you can log in via the API or via the web interface [16:59:19] SMalyshev: https://www.mediawiki.org/wiki/Selenium [16:59:34] so, at the moment we have two stacks, ruby and nodejs [16:59:54] we are phasing out ruby stack, and introducing nodejs stack [16:59:58] what do you want to do? [16:59:59] zeljkof: ah, so you can log in via web int. Cool! [17:00:06] I'm using cucumber tests so that'd be ruby I imagine [17:00:28] what are you trying to do? what are you testing? 
[17:00:31] since all other tests we have for Cirrus are in cucumber I guess I need that one too [17:00:50] ok, yes, cirrus has big investment in ruby stack [17:00:59] zeljkof: a feature in Special:Undelete. Unfortunately it is a) only available in GUI and b) requires admin to see [17:01:27] you can log in via the api and then insert a cookie in the browser and be logged in [17:01:33] I think it is the fastest way to do it [17:01:34] you can see it here: https://gerrit.wikimedia.org/r/#/c/281077/19/tests/browser/features/update_general_api.feature [17:01:59] zeljkof: are there any examples in the code of how to do it? [17:02:21] SMalyshev: sure, looking for them, it should be in the mediawiki_selenium gem [17:02:57] https://phabricator.wikimedia.org/diffusion/MSEL/browse/master/lib/mediawiki_selenium/step_definitions/login_steps.rb [17:03:08] I think you can use it like this, it has been a while... [17:03:37] in the feature file: Given I am logged in as Admin [17:04:07] aha sure I will try that. Thanks! [17:04:26] and it should just work, I mean log you in as Admin, if you have Admin and credentials in env variables or config file [17:05:03] it's 6pm here so I will probably be away, but feel free to ask, or send me e-mail [17:05:25] zeljkof: sure, thanks for your help! [17:25:45] RainbowSprinkles, hi im wondering do you know how i can get java to look into json like data: {["fields": {"name": "test"}]}? [17:26:00] Im trying to do it here https://gerrit-review.googlesource.com/#/c/98611/12/src/main/java/com/googlesource/gerrit/plugins/its/phabricator/conduit/Conduit.java [17:26:09] on replacing phabricator deprecated code in the its-phabricator plugin but carn't find a way to do it [17:26:18] Whatever json library they have bundled in, I suppose [17:26:21] I dunno what it is [17:26:25] gson [17:26:28] Yeah that then [17:27:02] Yep, been looking at a lot of docs which kind of do what i want to do but in the end it dosen't do it. [17:53:25] RainbowSprinkles fixed it by removing the if checks. [17:56:37] RainbowSprinkles i've tested https://gerrit-review.googlesource.com/#/c/98576/ and https://gerrit-review.googlesource.com/#/c/98611/ and https://gerrit-review.googlesource.com/#/c/98613/ [17:56:44] twentyafterfour ^^ [18:06:23] (03PS1) 10Reedy: Start branching CollaborationKit [tools/release] - 10https://gerrit.wikimedia.org/r/343686 (https://phabricator.wikimedia.org/T138326) [18:19:05] anybody knows why git review -d may fail with 404? [18:19:06] Cannot query patchset information [18:19:06] The following command failed with exit code 104 [18:19:07] "GET https://gerrit.wikimedia.org/changes/?q=Id6099fe9fbf18481068a6f0a329bbde0d218135f&o=CURRENT_REVISION" [18:19:46] looks like maybe some old config but no idea where it comes from... [18:22:49] git-review does dumb things sometimes :( [18:23:07] known workaround: ssh as your remote instead of https [18:23:20] best workaround: stop using git-review [18:23:33] not really an option here I think [18:24:28] Not using git-review isn't an option? [18:24:56] well, how do I download gerrit change without git-review? 
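To make zeljkof's earlier suggestion about Special:Undelete concrete: with the Ruby mediawiki_selenium stack, the admin login can go straight into the feature file using the step he quotes from login_steps.rb. A minimal sketch, assuming the Admin credentials are supplied through the gem's environment configuration (environment variables or the config file, as noted above); the non-login steps are hypothetical and would still need their own step definitions and page object:

    Feature: Special:Undelete
      Scenario: An admin can search deleted revisions
        Given I am logged in as Admin
        When I navigate to the Special:Undelete page
        Then I should see the deleted revision search form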
[18:25:30] The top right corner of a change has a box that says "Download" [18:25:39] Gives you copy+pasteable commands for checkout / cherry pick / pull [18:25:43] That's what I use all day :) [18:25:46] Project mediawiki-core-code-coverage build #2645: 04STILL FAILING in 3 hr 25 min: https://integration.wikimedia.org/ci/job/mediawiki-core-code-coverage/2645/ [18:25:59] unfortunately python scripts are not great at finding boxes and copy-pasting things [18:26:35] eg: git fetch https://gerrit.wikimedia.org/r/operations/puppet refs/changes/88/342788/2 && git checkout FETCH_HEAD [18:26:48] The only part you don't know from the CLI is the /2 for the current PS # [18:27:02] riight. And also I need to get there from change id [18:27:09] that's why there is git-review? [18:27:15] 88 comes from the last 2 digits from the changeid (342788 -> 88, 1234 -> 32) [18:27:26] * RainbowSprinkles shrugs [18:27:44] 342788 is not change id. Change id is something like Id6099fe9fbf18481068a6f0a329bbde0d218135f [18:28:11] anyway, I'm not looking for a way to redesign the whole system, just to make git-review work as it worked before [18:28:41] I dunno, point is this is a known issue and the only known workaround is to use ssh :\ [18:29:01] ok tahnks, will try to use ssh then [18:29:08] Basically, git-review assumes you're running from the docroot and strips the /r/ sometimes [18:29:14] (but not always!) [18:29:28] is there phab task about it? [18:29:33] I think.... [18:30:02] I thought so, hmm.... [18:31:19] T159869 / T100987 / T154760 [18:31:20] T154760: 404 downloading any changes with https remote url: "The requested URL /changes/ was not found" - https://phabricator.wikimedia.org/T154760 [18:31:20] T159869: git review -d fails with 404 on https://gerrit.wikimedia.org/changes/ endpoint - https://phabricator.wikimedia.org/T159869 [18:31:20] T100987: "git review -d XXX" doesn't work for http gerrit - https://phabricator.wikimedia.org/T100987 [18:32:17] T100987 is the best one with workarounds & such [18:34:11] RainbowSprinkles: thank you! [18:34:16] yw [18:37:39] (03CR) 10Chad: [C: 032] Start branching CollaborationKit [tools/release] - 10https://gerrit.wikimedia.org/r/343686 (https://phabricator.wikimedia.org/T138326) (owner: 10Reedy) [18:45:52] (03CR) 10Harej: [C: 031] Start branching CollaborationKit [tools/release] - 10https://gerrit.wikimedia.org/r/343686 (https://phabricator.wikimedia.org/T138326) (owner: 10Reedy) [18:49:16] (03Merged) 10jenkins-bot: Start branching CollaborationKit [tools/release] - 10https://gerrit.wikimedia.org/r/343686 (https://phabricator.wikimedia.org/T138326) (owner: 10Reedy) [18:49:32] 06Release-Engineering-Team, 10TimedMediaHandler, 10TimedMediaHandler-Transcode, 07Wikimedia-maintenance-script-run: Please mass-reset the video transcodes of tens of thousand videos stuck in "Unknown" state - https://phabricator.wikimedia.org/T151199#2810203 (10brion) I've started a job running `requeueTra... [18:49:33] hasharAway: thanks for taking care of those precise integration instances [19:22:29] chasemp: ah yeah Precise is all gone from beta and integration finally!!!! 
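For a script that only needs what `git review -d` does, the lookup can also be done directly against the Gerrit REST API, which sidesteps the /changes/ vs /r/changes/ path confusion behind the 404s above. A rough sketch, not the code git-review actually runs, assuming anonymous read access to gerrit.wikimedia.org and a public change:

    import json
    import subprocess
    import urllib.request

    GERRIT = 'https://gerrit.wikimedia.org/r'
    change_id = 'Id6099fe9fbf18481068a6f0a329bbde0d218135f'

    url = '%s/changes/?q=%s&o=CURRENT_REVISION' % (GERRIT, change_id)
    with urllib.request.urlopen(url) as resp:
        body = resp.read().decode('utf-8')
    # Gerrit prefixes JSON responses with ")]}'" to guard against XSSI; drop that first line.
    change = json.loads(body.split('\n', 1)[1])[0]

    number = change['_number']
    patchset = change['revisions'][change['current_revision']]['_number']
    project = change['project']

    # refs/changes/<last two digits, zero-padded>/<change number>/<patchset>
    ref = 'refs/changes/%02d/%d/%d' % (number % 100, number, patchset)
    subprocess.check_call(['git', 'fetch', '%s/%s' % (GERRIT, project), ref])
    subprocess.check_call(['git', 'checkout', 'FETCH_HEAD'])

This is essentially the "Download" box command with the change number and current patchset filled in programmatically.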
:] [19:23:46] chasemp: my next step will be to phase out Trusty entirely and then try to migrate to bootstrap-vz for the base images [19:24:03] bootstrap-vz is pretty slick [19:25:17] I gave it a short try last week, it lacks a lot of features I could use but then it is python so that is easily hackable [19:25:38] probably a good opportuniy to simplify the base images we use currently [19:47:49] hashar: Hey, it seems jenkins times out every time on this: https://gerrit.wikimedia.org/r/#/c/343661/ [19:47:55] Can you take a look [19:48:01] or it was just bad luck [19:49:13] E_TOOMANYTESTS [19:52:07] That's Wikibase ;) [19:52:38] So now jenkins won't anything on master because there are too many tests? [19:52:53] (03PS1) 10Umherirrender: [FileImporter] Add npm job [integration/config] - 10https://gerrit.wikimedia.org/r/343734 [19:53:53] jerkins is a dick [19:54:31] Maybe up it to 40mins? [19:58:24] (03PS2) 10Umherirrender: [FileImporter] Add npm job [integration/config] - 10https://gerrit.wikimedia.org/r/343734 [19:59:32] (03PS1) 10Umherirrender: [FileExporter] Add npm job [integration/config] - 10https://gerrit.wikimedia.org/r/343737 [19:59:36] (03PS1) 10Ladsgroup: Increase timeout for Wikidata jobs from 30 minutes to 40 minutes [integration/config] - 10https://gerrit.wikimedia.org/r/343738 [19:59:51] paladox: https://gerrit.wikimedia.org/r/343738 [19:59:56] afak [19:59:59] *afk [20:00:03] Thanks [20:00:22] (03CR) 10Paladox: [C: 031] "Needed due to ci getting slower at peak times." [integration/config] - 10https://gerrit.wikimedia.org/r/343738 (owner: 10Ladsgroup) [20:00:55] Amir1: Another solution is to delete a bunch of tests ;-) [20:03:51] (03PS1) 10Umherirrender: [TheWikipediaLibrary] Add npm job [integration/config] - 10https://gerrit.wikimedia.org/r/343740 [20:04:01] and they're merging [20:09:28] !log Update mobileapps to c0ab01d [20:09:31] Logged the message at https://wikitech.wikimedia.org/wiki/Release_Engineering/SAL [20:11:53] I would like to cause some downtime for the following beta VMs: deployment-pdf01, deployment-puppetmaster02, deployment-urldownloader [20:11:58] does anyone object, or have words of wisdom? [20:12:01] thcipriani for example? [20:12:13] :) [20:12:51] I do not object, puppetmaster02 is going to be noisy to take down in here, but should be fine. [20:13:11] thcipriani: ok, I'll do that one first. Right now OK? [20:13:48] andrewbogott: should be fine now. [20:14:03] thcipriani: ok! 
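For context on the timeout patch Amir1 links above (343738): Jenkins Job Builder exposes the Build Timeout plugin as a job wrapper, so bumping a job's limit is essentially a one-number change in YAML. A rough sketch of what such a wrapper looks like; the actual template names and options used in integration/config may differ:

    wrappers:
      - timeout:
          timeout: 40   # minutes, up from the 30 the Wikidata jobs were hitting
          fail: true    # mark the build as failed rather than aborted

As the later discussion about CPU and I/O starvation suggests, raising the limit works around slow nodes rather than fixing them.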
Thanks [20:17:45] PROBLEM - Host deployment-puppetmaster02 is DOWN: CRITICAL - Host Unreachable (10.68.21.200) [20:18:11] PROBLEM - Puppet run on deployment-jobrunner02 is CRITICAL: CRITICAL: 22.22% of data above the critical threshold [0.0] [20:18:54] PROBLEM - Puppet run on deployment-fluorine02 is CRITICAL: CRITICAL: 30.00% of data above the critical threshold [0.0] [20:19:06] PROBLEM - Puppet run on deployment-phab02 is CRITICAL: CRITICAL: 44.44% of data above the critical threshold [0.0] [20:19:20] PROBLEM - Puppet run on deployment-kafka04 is CRITICAL: CRITICAL: 22.22% of data above the critical threshold [0.0] [20:19:32] PROBLEM - Puppet run on deployment-pdfrender02 is CRITICAL: CRITICAL: 30.00% of data above the critical threshold [0.0] [20:20:00] PROBLEM - Puppet run on deployment-mira is CRITICAL: CRITICAL: 30.00% of data above the critical threshold [0.0] [20:20:14] PROBLEM - Puppet run on deployment-zotero01 is CRITICAL: CRITICAL: 33.33% of data above the critical threshold [0.0] [20:20:30] PROBLEM - Puppet run on deployment-zookeeper01 is CRITICAL: CRITICAL: 55.56% of data above the critical threshold [0.0] [20:20:46] PROBLEM - Puppet run on deployment-sca03 is CRITICAL: CRITICAL: 40.00% of data above the critical threshold [0.0] [20:21:57] PROBLEM - Puppet run on deployment-phab01 is CRITICAL: CRITICAL: 50.00% of data above the critical threshold [0.0] [20:22:57] PROBLEM - Puppet run on deployment-prometheus01 is CRITICAL: CRITICAL: 40.00% of data above the critical threshold [0.0] [20:23:05] PROBLEM - Puppet run on deployment-apertium02 is CRITICAL: CRITICAL: 55.56% of data above the critical threshold [0.0] [20:23:11] PROBLEM - Puppet run on deployment-salt02 is CRITICAL: CRITICAL: 66.67% of data above the critical threshold [0.0] [20:23:33] PROBLEM - Puppet run on deployment-elastic06 is CRITICAL: CRITICAL: 60.00% of data above the critical threshold [0.0] [20:23:43] PROBLEM - Puppet run on deployment-puppetdb01 is CRITICAL: CRITICAL: 40.00% of data above the critical threshold [0.0] [20:24:04] PROBLEM - Puppet run on deployment-kafka05 is CRITICAL: CRITICAL: 33.33% of data above the critical threshold [0.0] [20:24:13] PROBLEM - Puppet run on deployment-mediawiki04 is CRITICAL: CRITICAL: 22.22% of data above the critical threshold [0.0] [20:25:24] PROBLEM - Puppet run on deployment-changeprop is CRITICAL: CRITICAL: 33.33% of data above the critical threshold [0.0] [20:25:30] PROBLEM - Puppet run on deployment-stream is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [0.0] [20:25:48] PROBLEM - Puppet run on deployment-mx is CRITICAL: CRITICAL: 50.00% of data above the critical threshold [0.0] [20:26:06] PROBLEM - Puppet run on deployment-aqs01 is CRITICAL: CRITICAL: 22.22% of data above the critical threshold [0.0] [20:27:04] PROBLEM - Puppet run on deployment-ores-redis is CRITICAL: CRITICAL: 66.67% of data above the critical threshold [0.0] [20:27:28] PROBLEM - Puppet run on deployment-tin is CRITICAL: CRITICAL: 40.00% of data above the critical threshold [0.0] [20:28:18] PROBLEM - Puppet run on deployment-trending01 is CRITICAL: CRITICAL: 33.33% of data above the critical threshold [0.0] [20:28:30] PROBLEM - Puppet run on deployment-poolcounter04 is CRITICAL: CRITICAL: 22.22% of data above the critical threshold [0.0] [20:29:05] PROBLEM - Puppet run on deployment-ms-be01 is CRITICAL: CRITICAL: 22.22% of data above the critical threshold [0.0] [20:29:16] PROBLEM - Puppet run on deployment-ircd is CRITICAL: CRITICAL: 22.22% of data above the critical 
threshold [0.0] [20:29:21] PROBLEM - Puppet run on deployment-sca01 is CRITICAL: CRITICAL: 33.33% of data above the critical threshold [0.0] [20:29:31] PROBLEM - Puppet run on deployment-sca02 is CRITICAL: CRITICAL: 50.00% of data above the critical threshold [0.0] [20:32:07] PROBLEM - Puppet run on deployment-restbase02 is CRITICAL: CRITICAL: 55.56% of data above the critical threshold [0.0] [20:32:13] PROBLEM - Puppet run on deployment-memc04 is CRITICAL: CRITICAL: 33.33% of data above the critical threshold [0.0] [20:32:13] PROBLEM - Puppet run on deployment-kafka03 is CRITICAL: CRITICAL: 22.22% of data above the critical threshold [0.0] [20:32:38] PROBLEM - Puppet run on deployment-db03 is CRITICAL: CRITICAL: 22.22% of data above the critical threshold [0.0] [20:33:16] PROBLEM - Puppet run on deployment-sentry01 is CRITICAL: CRITICAL: 66.67% of data above the critical threshold [0.0] [20:33:36] PROBLEM - Puppet run on deployment-pdf01 is CRITICAL: CRITICAL: 40.00% of data above the critical threshold [0.0] [20:34:20] PROBLEM - Puppet run on deployment-eventlogging03 is CRITICAL: CRITICAL: 55.56% of data above the critical threshold [0.0] [20:34:20] PROBLEM - Puppet run on deployment-tmh01 is CRITICAL: CRITICAL: 33.33% of data above the critical threshold [0.0] [20:34:26] PROBLEM - Puppet run on deployment-secureredirexperiment is CRITICAL: CRITICAL: 60.00% of data above the critical threshold [0.0] [20:34:45] thcipriani: am I somehow causing a jenkins outage by migrating that beta VM? [20:35:04] PROBLEM - Puppet run on deployment-kafka01 is CRITICAL: CRITICAL: 55.56% of data above the critical threshold [0.0] [20:35:30] PROBLEM - Puppet run on deployment-aqs02 is CRITICAL: CRITICAL: 55.56% of data above the critical threshold [0.0] [20:35:42] PROBLEM - Puppet run on deployment-memc05 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [0.0] [20:35:50] andrewbogott: you shouldn't be... 
[20:36:36] maybe it's just the normal mid-afternoon jenkins jam [20:36:39] deployment-puppetmaster02 and jenkins are not connected in any way except that it's the puppetmaster for beta and jenkins deploys stuff on beta [20:37:07] PROBLEM - Puppet run on deployment-ms-fe01 is CRITICAL: CRITICAL: 55.56% of data above the critical threshold [0.0] [20:37:09] things appear to be moving normally through zuul [20:37:19] PROBLEM - Puppet run on deployment-elastic07 is CRITICAL: CRITICAL: 55.56% of data above the critical threshold [0.0] [20:37:21] PROBLEM - Puppet run on deployment-elastic05 is CRITICAL: CRITICAL: 33.33% of data above the critical threshold [0.0] [20:38:03] PROBLEM - Puppet run on deployment-eventlogging04 is CRITICAL: CRITICAL: 60.00% of data above the critical threshold [0.0] [20:39:17] PROBLEM - Puppet run on deployment-db04 is CRITICAL: CRITICAL: 44.44% of data above the critical threshold [0.0] [20:39:41] PROBLEM - Puppet run on deployment-ms-be02 is CRITICAL: CRITICAL: 60.00% of data above the critical threshold [0.0] [20:40:30] PROBLEM - Puppet run on deployment-redis01 is CRITICAL: CRITICAL: 60.00% of data above the critical threshold [0.0] [20:40:30] PROBLEM - Puppet run on deployment-cache-upload04 is CRITICAL: CRITICAL: 66.67% of data above the critical threshold [0.0] [20:42:46] PROBLEM - Puppet run on deployment-restbase01 is CRITICAL: CRITICAL: 55.56% of data above the critical threshold [0.0] [20:42:50] RECOVERY - Host deployment-puppetmaster02 is UP: PING OK - Packet loss = 0%, RTA = 1.10 ms [20:44:47] PROBLEM - Puppet run on deployment-cache-text04 is CRITICAL: CRITICAL: 30.00% of data above the critical threshold [0.0] [20:44:53] PROBLEM - Puppet run on deployment-mathoid is CRITICAL: CRITICAL: 33.33% of data above the critical threshold [0.0] [20:46:23] PROBLEM - Puppet run on deployment-mediawiki06 is CRITICAL: CRITICAL: 66.67% of data above the critical threshold [0.0] [20:46:45] PROBLEM - Host deployment-pdf01 is DOWN: CRITICAL - Host Unreachable (10.68.16.73) [20:52:17] andrewbogott: thcipriani: looks like that is just the instance deployment-puppetmaster02 is DOWN [20:52:28] so puppet fails on all beta cluster instances which is probably not a big deal [20:52:29] RECOVERY - Host deployment-pdf01 is UP: PING OK - Packet loss = 0%, RTA = 0.78 ms [20:52:37] hashar: that was me, migrating it. Should be back and happy by now. [20:52:55] \o/ [20:52:56] indeed, starting to see recoveries [20:54:02] thcipriani, hashar, I guess I never explained… one of the labvirts is having goofy IO problems so I'm trying to evacuate it. Moving the staff-managed stuff first but at some point I'll probably have to make a list post about it. [20:54:16] the symptoms are weird, generally undetectable from within a VM but still concerning. [20:54:23] which labvirt ? [20:55:30] RECOVERY - Puppet run on deployment-zookeeper01 is OK: OK: Less than 1.00% above the threshold [0.0] [20:55:45] PROBLEM - Host deployment-urldownloader is DOWN: CRITICAL - Host Unreachable (10.68.16.135) [20:55:58] hashar: 1001 [20:56:09] It's already out of the scheduler pool, so nothing new should land there [20:56:52] RainbowSprinkles: It seems you maintain gerritbot in phab. It has been a week that it doesn't make a comment when I make a patch. It does when they get merged though. 
And it happens only to me more strangely [20:57:17] https://phabricator.wikimedia.org/T160462#3105216 [20:57:29] https://phabricator.wikimedia.org/T151194 [20:57:45] https://phabricator.wikimedia.org/T160613 [20:58:03] RECOVERY - Puppet run on deployment-apertium02 is OK: OK: Less than 1.00% above the threshold [0.0] [20:58:07] RECOVERY - Puppet run on deployment-jobrunner02 is OK: OK: Less than 1.00% above the threshold [0.0] [20:58:13] RECOVERY - Puppet run on deployment-salt02 is OK: OK: Less than 1.00% above the threshold [0.0] [20:58:35] RECOVERY - Puppet run on deployment-elastic06 is OK: OK: Less than 1.00% above the threshold [0.0] [20:58:45] Amir1: I do not :) [20:58:53] RECOVERY - Puppet run on deployment-fluorine02 is OK: OK: Less than 1.00% above the threshold [0.0] [20:58:55] I don't touch the bot [20:59:02] https://wikitech.wikimedia.org/wiki/Gerrit_Notification_Bot [20:59:09] RECOVERY - Puppet run on deployment-phab02 is OK: OK: Less than 1.00% above the threshold [0.0] [20:59:10] If there's a problem, file a bug in Phabricator and bother Chad or Christian. [20:59:17] I'm the CollaborationKit person. I make myself available for questions, complaints, and other assorted commentary. [20:59:23] RECOVERY - Puppet run on deployment-kafka04 is OK: OK: Less than 1.00% above the threshold [0.0] [20:59:33] RECOVERY - Puppet run on deployment-pdfrender02 is OK: OK: Less than 1.00% above the threshold [0.0] [20:59:39] Amir1: That's old text :p [20:59:44] I've actually never touched it [20:59:59] RECOVERY - Puppet run on deployment-mira is OK: OK: Less than 1.00% above the threshold [0.0] [21:00:03] loool [21:00:08] Do you know who does? :D [21:00:14] RECOVERY - Puppet run on deployment-zotero01 is OK: OK: Less than 1.00% above the threshold [0.0] [21:00:22] RECOVERY - Puppet run on deployment-changeprop is OK: OK: Less than 1.00% above the threshold [0.0] [21:00:29] Amir1: Paladox is the one who works on its-phabricator (the phab/gerrit bridge) [21:00:38] RECOVERY - Host deployment-urldownloader is UP: PING OK - Packet loss = 0%, RTA = 1.32 ms [21:00:48] RECOVERY - Puppet run on deployment-mx is OK: OK: Less than 1.00% above the threshold [0.0] [21:00:48] RECOVERY - Puppet run on deployment-sca03 is OK: OK: Less than 1.00% above the threshold [0.0] [21:00:51] I worked on its-bugzilla, the prior version. But since the rewrite never looked at it :) [21:01:06] RECOVERY - Puppet run on deployment-aqs01 is OK: OK: Less than 1.00% above the threshold [0.0] [21:01:28] Thanks. Sorry for bothering [21:01:58] RECOVERY - Puppet run on deployment-phab01 is OK: OK: Less than 1.00% above the threshold [0.0] [21:02:00] No worries. 
I'm the likely suspect :p [21:02:06] RECOVERY - Puppet run on deployment-ores-redis is OK: OK: Less than 1.00% above the threshold [0.0] [21:02:38] :))) [21:02:56] RECOVERY - Puppet run on deployment-prometheus01 is OK: OK: Less than 1.00% above the threshold [0.0] [21:03:18] RECOVERY - Puppet run on deployment-trending01 is OK: OK: Less than 1.00% above the threshold [0.0] [21:03:35] RECOVERY - Puppet run on deployment-pdf01 is OK: OK: Less than 1.00% above the threshold [0.0] [21:03:45] RECOVERY - Puppet run on deployment-puppetdb01 is OK: OK: Less than 1.00% above the threshold [0.0] [21:04:05] RECOVERY - Puppet run on deployment-kafka05 is OK: OK: Less than 1.00% above the threshold [0.0] [21:04:15] RECOVERY - Puppet run on deployment-mediawiki04 is OK: OK: Less than 1.00% above the threshold [0.0] [21:04:19] RECOVERY - Puppet run on deployment-eventlogging03 is OK: OK: Less than 1.00% above the threshold [0.0] [21:05:30] RECOVERY - Puppet run on deployment-stream is OK: OK: Less than 1.00% above the threshold [0.0] [21:07:06] RECOVERY - Puppet run on deployment-restbase02 is OK: OK: Less than 1.00% above the threshold [0.0] [21:07:12] RECOVERY - Puppet run on deployment-kafka03 is OK: OK: Less than 1.00% above the threshold [0.0] [21:07:26] RECOVERY - Puppet run on deployment-tin is OK: OK: Less than 1.00% above the threshold [0.0] [21:08:17] RECOVERY - Puppet run on deployment-sentry01 is OK: OK: Less than 1.00% above the threshold [0.0] [21:08:33] RECOVERY - Puppet run on deployment-poolcounter04 is OK: OK: Less than 1.00% above the threshold [0.0] [21:09:07] RECOVERY - Puppet run on deployment-ms-be01 is OK: OK: Less than 1.00% above the threshold [0.0] [21:09:09] RECOVERY - Puppet run on deployment-ircd is OK: OK: Less than 1.00% above the threshold [0.0] [21:09:17] RECOVERY - Puppet run on deployment-sca01 is OK: OK: Less than 1.00% above the threshold [0.0] [21:09:25] RECOVERY - Puppet run on deployment-secureredirexperiment is OK: OK: Less than 1.00% above the threshold [0.0] [21:09:29] RECOVERY - Puppet run on deployment-sca02 is OK: OK: Less than 1.00% above the threshold [0.0] [21:09:37] Amir1 hi, what is your patch url please? [21:10:05] RECOVERY - Puppet run on deployment-kafka01 is OK: OK: Less than 1.00% above the threshold [0.0] [21:10:12] paladox: I commented in the phab cards [21:10:28] or when they got merged, you can see gerrit-bot's comment [21:10:30] RECOVERY - Puppet run on deployment-aqs02 is OK: OK: Less than 1.00% above the threshold [0.0] [21:11:29] Hmm strange [21:11:44] RainbowSprinkles seems the bot is not working on a mediawiki/core change [21:12:06] RECOVERY - Puppet run on deployment-ms-fe01 is OK: OK: Less than 1.00% above the threshold [0.0] [21:12:15] RECOVERY - Puppet run on deployment-memc04 is OK: OK: Less than 1.00% above the threshold [0.0] [21:12:18] RECOVERY - Puppet run on deployment-elastic07 is OK: OK: Less than 1.00% above the threshold [0.0] [21:12:37] RECOVERY - Puppet run on deployment-db03 is OK: OK: Less than 1.00% above the threshold [0.0] [21:13:00] RECOVERY - Puppet run on deployment-eventlogging04 is OK: OK: Less than 1.00% above the threshold [0.0] [21:13:12] Amir1 does it work on any other changes? [21:13:35] paladox: It doesn't work in any repos but it's just for me [21:13:43] other patches get through [21:13:49] Amir1 dosen't work for me either. 
[21:14:19] RECOVERY - Puppet run on deployment-db04 is OK: OK: Less than 1.00% above the threshold [0.0] [21:14:21] RECOVERY - Puppet run on deployment-tmh01 is OK: OK: Less than 1.00% above the threshold [0.0] [21:14:39] RECOVERY - Puppet run on deployment-ms-be02 is OK: OK: Less than 1.00% above the threshold [0.0] [21:15:00] Scratch that it works for me here https://phabricator.wikimedia.org/T86229#3116139 [21:15:03] but on my change [21:15:31] RECOVERY - Puppet run on deployment-redis01 is OK: OK: Less than 1.00% above the threshold [0.0] [21:15:31] RECOVERY - Puppet run on deployment-cache-upload04 is OK: OK: Less than 1.00% above the threshold [0.0] [21:15:40] RECOVERY - Puppet run on deployment-memc05 is OK: OK: Less than 1.00% above the threshold [0.0] [21:16:46] Amir1 must be a bug or something. As it works on one of my mediawiki/core changes but doesn't with yours [21:17:22] RECOVERY - Puppet run on deployment-elastic05 is OK: OK: Less than 1.00% above the threshold [0.0] [21:17:48] RECOVERY - Puppet run on deployment-restbase01 is OK: OK: Less than 1.00% above the threshold [0.0] [21:19:39] * paladox wonders if there are any errors to do with the gerritbot in the last few mins in the error log [21:21:23] RECOVERY - Puppet run on deployment-mediawiki06 is OK: OK: Less than 1.00% above the threshold [0.0] [21:21:47] paladox: It happens in all repos, zuul config/wikibase/core/etc. [21:22:18] Hmm, it must be all your changes as mine worked. hashar did linking to bug reports work for you today? [21:24:47] RECOVERY - Puppet run on deployment-cache-text04 is OK: OK: Less than 1.00% above the threshold [0.0] [21:24:53] RECOVERY - Puppet run on deployment-mathoid is OK: OK: Less than 1.00% above the threshold [0.0] [21:31:39] so, who clogged jenkins today? ;) [21:32:02] andrewbogott: labvirt1001 is having an issue, could it be the reason the instances took a while to boot a couple weeks ago? [21:32:04] At peak times it will get slow [21:32:19] andrewbogott: iirc you ended up dropping labvirt1001 and labvirt1002 from the scheduler pool because of slow response [21:32:24] MatmaRex: depends :D [21:32:39] MatmaRex: yeah too many patches https://integration.wikimedia.org/zuul/ :( [21:32:45] we have a sprint to try to mitigate that [21:32:53] https://integration.wikimedia.org/zuul/ [21:32:59] the short rule is for now CR+2 get priority above everything else [21:33:12] we will eventually prioritize patches to operations/puppet and wmf branches [21:33:34] some oojs/ui builds are suspiciously timing out: https://integration.wikimedia.org/ci/job/oojs-ui-npm-node-6-jessie/628/consoleFull – our build is terribly slow, but it's not half-an-hour slow [21:33:49] npm install took 13 minutes? [21:33:56] hashar: that's part of it, yes, scheduling on 1001 was super slow. I'm not convinced that 1002 is actually troubled but it needs more study. [21:34:11] hashar what about merging mediawiki-extensions-qunit-jessie and mediawiki-core-qunit-jessie together? [21:34:17] karma runs took 2 and 3 minutes? [21:34:38] As we could do an if check to run mediawiki/core specific checks. Which will also leave a few more nodes available for use [21:34:48] andrewbogott: good luck with whatever hardware/driver/kernel issue you might end up hitting :( [21:35:05] thanks. Mostly I'm just trying to clear it out so I can drop it in Chris's lap [21:35:08] MatmaRex: potentially that might be some networking delay as well [21:35:38] svg2png and image are also slower than i would expect.
these are cpu-heavy [21:35:50] and yeah eventually some jobs will get merged so we avoid doing the same git clones / composer install / npm install for each of the jobs [21:37:20] MatmaRex: there is some labvirt with high cpu usage (labvirt1006) https://grafana.wikimedia.org/dashboard/db/labs-capacity-planning?panelId=5&fullscreen&from=now-24h&to=now [21:37:29] svg2png taking 158 seconds. that's insane. it takes 60 seconds on my machine, and my machine has an intel core 2 duo. [21:37:33] yeah [21:37:44] that is for oojs/ui right? [21:37:50] yes [21:37:57] numbers from here specifically: https://integration.wikimedia.org/ci/job/oojs-ui-npm-node-6-jessie/628/consoleFull [21:38:10] we have to streamline the three jobs it is triggering in a single one :/ [21:38:18] * paladox wonders what labvirt1014 is. It's at 0% [21:38:24] each patchset ends up invoking svg2png three times (via three different jobs) [21:38:38] (which is a failed gate build for https://gerrit.wikimedia.org/r/#/c/343748/) [21:38:42] that caused some nice overload last tuesday with 30+ patches being sent [21:39:07] paladox: I assume it is a spare machine in case one of the other labvirt explodes [21:39:13] oh [21:39:16] thanks [21:39:42] yeah, i was here last tuesday too ;) but we didn't make quite as many patchsets this time (and we started merging them 24 hours early ;) ) [21:39:47] MatmaRex: yeah and on that build npm install took 13 minutes apparently [21:40:35] if the machines can't handle all the parallel cpu-heavy jobs, can they be made to run fewer jobs in parallel? [21:41:05] it's silly for jobs to time out because of that. it would be better if they are slow, but actually finish [21:41:39] Wouldn't concurrent fix that? [21:41:46] ie set concurrent to false [21:43:04] MatmaRex: I can't remember the default timeout, but it is probably 30 minutes [21:43:45] it is. that job timed out after 30 minutes [21:44:07] it should totally be able to complete in 30 minutes though. it takes like 10 usually [21:44:29] unless we face CPU or I/O starvation on the node the job happened to run [21:46:36] We could put it at 40? [21:46:54] (03CR) 10Thiemo Mättig (WMDE): [C: 031] Increase timeout for Wikidata jobs from 30 minutes to 40 minutes [integration/config] - 10https://gerrit.wikimedia.org/r/343738 (owner: 10Ladsgroup) [21:48:43] thcipriani: also on friday I crafted a new Grafana board showing the Nodepool pool details https://grafana.wikimedia.org/dashboard/db/nodepool-pool-details [21:48:49] eg split the pool graph in a graph per metric [21:51:08] nice [21:51:23] so in practice there are 17 nodepool instances ready at any time [21:52:05] yeah in nodepool.yaml that is the min-ready parameter [21:52:11] so 12 jessie and 5 trusty [21:52:35] we could probably lower the trusty baseline [21:52:48] 10Continuous-Integration-Config, 06Release-Engineering-Team, 13Patch-For-Review: Switch MediaWiki coverage job from Trusty/Zend PHP 5.5 to Jessie/Zend PHP 7.0 - https://phabricator.wikimedia.org/T147778#3116264 (10Krinkle) [21:52:50] and like 10 are needed for every core job :( [21:52:51] 10Continuous-Integration-Config, 06Release-Engineering-Team, 10MediaWiki-Unit-tests: MediaWiki code coverage no longer runs parser tests - https://phabricator.wikimedia.org/T147779#3116262 (10Krinkle) 05Open>03Invalid I don't know about coverage, but both before and after the aforementioned changes, Pars...
[21:52:56] core g+s [21:53:06] yeah gotta merge a bunch of those jobs [21:53:21] ie make the job that runs phpunit tests first run composer test [21:53:23] 10Continuous-Integration-Config, 06Release-Engineering-Team, 10MediaWiki-Unit-tests: MediaWiki code coverage no longer runs parser tests - https://phabricator.wikimedia.org/T147779#3116266 (10Krinkle) [21:53:25] 10Continuous-Integration-Config, 06Release-Engineering-Team, 13Patch-For-Review: Switch MediaWiki coverage job from Trusty/Zend PHP 5.5 to Jessie/Zend PHP 7.0 - https://phabricator.wikimedia.org/T147778#2703156 (10Krinkle) [21:53:33] and we could run npm test as well in the same job [21:53:38] indeed, optimizing for instance use vs optimizing for fast parallel permanent agents [21:53:42] so we would do: composer test, npm test, phpunit [21:53:49] and if composer fails early, the rest is skipped entirely [21:54:03] MatmaRex James_F they are warning users who use the cdn for phantomjs to enable the caching feature https://github.com/Medium/phantomjs/commit/d8ebc23016c784fe84e5d2b29ae57d157c8c5d84 [21:54:22] thcipriani: and mediawiki-extensions-hhvm has to be overhauled. It is too slow [21:55:16] though i have no idea how it will work for us [21:55:46] anyway I am heading to bed, it is well over time [21:56:03] will work on the little steps sprint tomorrow. and most probably deploy the high prio test pipeline [22:01:20] getting out. more CI refactoring tomorrow [22:50:48] oojs/ui is hogging the nodepool instances again :/ [22:56:27] (03PS4) 10Ejegg: Use upstream civicrm-buildkit [integration/config] - 10https://gerrit.wikimedia.org/r/336960 [22:57:30] (03CR) 10Ejegg: "This is no longer blocked! We now have mcrypt on the integration servers along with the rest of the extensions that upstream buildkit asks" [integration/config] - 10https://gerrit.wikimedia.org/r/336960 (owner: 10Ejegg) [23:17:47] Hm.. yeah, took 45 minutes to get a response on a GuidedTour patch. [23:17:50] each job only taking 2 minutes [23:17:55] but it got stalled behind the queue for a long time [23:17:58] https://grafana.wikimedia.org/dashboard/db/nodepool [23:18:10] 25 instances in total does seem fairly low though [23:18:20] Seems worth increasing quota just in general? [23:48:54] Krinkle: the last time we bumped the nodepool quota it just leaked runners faster
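Tying the last two threads together: the 17 ready instances hashar mentions come from per-label min-ready settings in nodepool.yaml, while the ceiling Krinkle runs into is the provider-level cap (backed by the OpenStack project quota). A rough sketch of that shape of configuration; the label, image and provider names here are assumptions rather than the exact values in the production file, and the provider's auth and image settings are omitted:

    labels:
      - name: ci-jessie-wikimedia
        image: ci-jessie-wikimedia
        min-ready: 12
        providers:
          - name: wmflabs
      - name: ci-trusty-wikimedia
        image: ci-trusty-wikimedia
        min-ready: 5
        providers:
          - name: wmflabs

    providers:
      - name: wmflabs
        max-servers: 25   # overall cap; raising it also needs quota on the OpenStack side

As the closing remark points out, a higher cap only helps if instances are actually recycled; the previous quota bump mostly produced leaked servers faster.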