[00:10:45] hashar_: we can probably survive as is until your tomorrow, right? [00:10:53] yeah [00:11:05] I think I will give up [00:11:15] :/ [00:11:18] sleep is good [00:11:18] and just focus on finishing the npm/node to nodepool [00:11:36] well, do you want us/anyone to create new trusty instances tomorrow with more ram? [00:11:38] then write down a list of actions to enhance ci [00:12:34] I think we gotta drop all the ci.medium instances we created with timo a few hours ago [00:12:43] figure out how to consume less ram [00:12:49] (i.e. the tmpfs used 35% of 2GB of ram) [00:13:05] and yeah add instances that have the same size as the ones we know are working [01:07:02] 10Beta-Cluster-Infrastructure, 6Labs: New labs instance fails running `block-for-project-export` before running mount - https://phabricator.wikimedia.org/T126568#2017518 (10thcipriani) 3NEW [01:39:01] and then again when it recovers [01:39:01] Oh so there is monitoring [01:39:06] but not on every failure [01:39:14] Sure [01:39:19] That would get noisy pretty quickly [01:39:20] it's jenkins irc notifications [01:39:22] But regular repeats might be helpful [01:40:54] apparently jenkins doesn't support that -- https://integration.wikimedia.org/ci/view/Beta/job/beta-scap-eqiad/configure [01:41:14] it's "new failure and fixed" right now [01:41:44] the only other levels look to alert on every failure [01:42:02] which honestly for beta-scap might be a good idea [01:42:17] because it has gone for hours and hours without being noticed before [01:43:17] Yeah the main thing I'm annoyed about right now is the 18 hours part [01:44:00] I merge a change on one day, Elena QAs it the next day, it doesn't work, and we spend an hour being confused before we realize beta updates are broken [01:44:49] bd808: Also, do we have Jenkins reporting here for the sub-jobs? [01:45:34] Because the most recent time I remember this happening (~3 weeks ago, I just came back from 2 weeks of vacation+conference) the cause was a missing extension breaking the timer sub-job which caused the main beta-scap-eqiad job to just never be triggered [01:46:31] I think they all alert, yes [01:46:48] but I may be wrong. You can check the yaml config [01:46:50] jjb [01:48:53] 10Beta-Cluster-Infrastructure, 6Release-Engineering-Team, 6Discovery, 7Blocked-on-Operations, and 2 others: Beta: submodule update reverts new portals commits - https://phabricator.wikimedia.org/T126061#2003572 (10JGirault) Thanks! @thcipriani ! [02:17:35] 10Beta-Cluster-Infrastructure: Beta labs updates broken again - https://phabricator.wikimedia.org/T126573#2017647 (10Catrope) 3NEW [02:17:44] 10Beta-Cluster-Infrastructure: Beta labs updates broken again - https://phabricator.wikimedia.org/T126573#2017654 (10Catrope) p:5Triage>3Unbreak! [02:21:16] 10Continuous-Integration-Infrastructure, 6Release-Engineering-Team, 3Wikipedia-Android-App: Wikipedia Android CI tests are failing - https://phabricator.wikimedia.org/T126532#2017661 (10Niedzielski) @hashar, I believe the issue is here[0] but I don't know what implications changing to Jessie would have. How...
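The notification levels discussed above map to the Jenkins IRC plugin's notification strategies, which Jenkins Job Builder exposes per job. A minimal sketch, assuming the job is defined through JJB as the "check the yaml config / jjb" exchange implies; the surrounding job definition is hypothetical, only the strategy names are from the plugin:

```yaml
# Hypothetical JJB fragment, not the actual beta-scap-eqiad definition.
# The ircbot publisher's strategy controls how chatty notifications are.
- job:
    name: beta-scap-eqiad
    publishers:
      - ircbot:
          strategy: new-failure-and-fixed  # current behaviour per the log
          # strategy: all                  # would report every failure instead
          notify-start: false
```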
[02:42:28] 10Continuous-Integration-Infrastructure, 6Release-Engineering-Team, 3Wikipedia-Android-App: Wikipedia Android CI tests are failing - https://phabricator.wikimedia.org/T126532#2017693 (10Legoktm) 5Resolved>3Open a:5Niedzielski>3Legoktm [02:42:53] (03PS1) 10Legoktm: Switch apps/android/wikipedia to use tox-jessie [integration/config] - 10https://gerrit.wikimedia.org/r/269893 (https://phabricator.wikimedia.org/T126532) [02:43:27] (03CR) 10Legoktm: [C: 032] Don't run 5.3 tests for HtmlFormatter [integration/config] - 10https://gerrit.wikimedia.org/r/269853 (owner: 10MaxSem) [02:43:43] (03CR) 10Legoktm: [C: 032] Switch apps/android/wikipedia to use tox-jessie [integration/config] - 10https://gerrit.wikimedia.org/r/269893 (https://phabricator.wikimedia.org/T126532) (owner: 10Legoktm) [02:44:26] (03Merged) 10jenkins-bot: Don't run 5.3 tests for HtmlFormatter [integration/config] - 10https://gerrit.wikimedia.org/r/269853 (owner: 10MaxSem) [02:45:04] (03Merged) 10jenkins-bot: Switch apps/android/wikipedia to use tox-jessie [integration/config] - 10https://gerrit.wikimedia.org/r/269893 (https://phabricator.wikimedia.org/T126532) (owner: 10Legoktm) [02:45:20] !log deploying https://gerrit.wikimedia.org/r/269853 https://gerrit.wikimedia.org/r/269893 [02:45:25] Logged the message at https://wikitech.wikimedia.org/wiki/Release_Engineering/SAL, Master [02:47:59] 10Continuous-Integration-Infrastructure, 6Release-Engineering-Team, 5Patch-For-Review, 3Wikipedia-Android-App: Wikipedia Android CI tests are failing - https://phabricator.wikimedia.org/T126532#2017699 (10Legoktm) 5Open>3Resolved Tested on , pass using tox-jessie. [02:51:27] (03PS1) 10Legoktm: Simplify node definitions to only use phpflavor, not OS [integration/config] - 10https://gerrit.wikimedia.org/r/269895 [02:52:33] (03CR) 10Legoktm: [C: 032] Simplify node definitions to only use phpflavor, not OS [integration/config] - 10https://gerrit.wikimedia.org/r/269895 (owner: 10Legoktm) [02:55:21] (03Merged) 10jenkins-bot: Simplify node definitions to only use phpflavor, not OS [integration/config] - 10https://gerrit.wikimedia.org/r/269895 (owner: 10Legoktm) [03:15:25] PROBLEM - Host deployment-mediawiki02 is DOWN: CRITICAL - Host Unreachable (10.68.16.127) [03:17:53] RECOVERY - Host deployment-mediawiki02 is UP: PING OK - Packet loss = 0%, RTA = 0.68 ms [03:21:29] 10Beta-Cluster-Infrastructure: Beta labs updates broken again - https://phabricator.wikimedia.org/T126573#2017725 (10greg) Tyler is working to fix this, which unfortunately means creating a new Jessie bastion in Beta Cluster. See T126568. [03:22:28] 10Beta-Cluster-Infrastructure: Beta Cluster updates broken due to php incompatibilities - https://phabricator.wikimedia.org/T126573#2017727 (10greg) [04:43:44] 10Beta-Cluster-Infrastructure, 6Labs: New labs instance fails running `block-for-project-export` before running mount - https://phabricator.wikimedia.org/T126568#2017817 (10greg) [04:43:46] 10Beta-Cluster-Infrastructure: Beta Cluster updates broken due to php incompatibilities - https://phabricator.wikimedia.org/T126573#2017816 (10greg) [04:48:00] 10Beta-Cluster-Infrastructure, 6Labs: New labs instance fails running `block-for-project-export` before running mount - https://phabricator.wikimedia.org/T126568#2017820 (10bd808) Did @yuvipanda start making changes to remove NFS from the non-MediaWiki nodes in beta cluster? I can't find a task but remember an... 
[07:16:37] 3Scap3, 10scap: Scap should touch symlinks when originals are touched - https://phabricator.wikimedia.org/T126306#2017889 (10Tgr) >>! In T126306#2010715, @bd808 wrote: > Is there anything other than wmf-config/PrivateSettings.php that would really benefit from this protection? Not that I know of. [08:10:29] 10Beta-Cluster-Infrastructure, 6Labs: Completely remove Beta Cluster dependency on NFS - https://phabricator.wikimedia.org/T102953#2017949 (10hashar) [08:10:32] 10Beta-Cluster-Infrastructure, 6Labs: Disable /data/project for instances in deployment-prep that do not need it - https://phabricator.wikimedia.org/T125624#2017948 (10hashar) [08:11:02] 10Beta-Cluster-Infrastructure, 6Labs: Disable /data/project for instances in deployment-prep that do not need it - https://phabricator.wikimedia.org/T125624#1992944 (10hashar) Similar to T102953 itself blocked on Swift [08:12:00] 10Beta-Cluster-Infrastructure, 6Labs: New labs instance fails running `block-for-project-export` before running mount - https://phabricator.wikimedia.org/T126568#2017962 (10hashar) A recent task is {T125624} [08:12:56] 10Beta-Cluster-Infrastructure, 6Labs: New labs instance fails running `block-for-project-export` before running mount - https://phabricator.wikimedia.org/T126568#2017966 (10hashar) Forgot to quote the relevant part from Yuvi: > The way to disable this would be to set mount_nfs hiera variable to false in Hiera:... [08:20:26] 10Beta-Cluster-Infrastructure, 6Labs: New labs instance fails running `block-for-project-export` before running mount - https://phabricator.wikimedia.org/T126568#2017975 (10hashar) IIRC the NFS server whitelists instances of a project on a per IP basis. So potentially deployment-tin with IP 10.68.17.240 would... [08:22:05] 3Scap3, 6Phabricator, 5Patch-For-Review, 7WorkType-Maintenance: Refactor phabricator module in puppet to remove git tag pinning behavior - https://phabricator.wikimedia.org/T125851#2017977 (10Paladox) [09:11:40] 10Continuous-Integration-Infrastructure, 10Math: Math test fail for php55 - https://phabricator.wikimedia.org/T126422#2018008 (10Paladox) It looks like texlive-generic-extra is not installed on trusty. Could that package be causing the issue. [09:13:12] 10Continuous-Integration-Infrastructure, 10Math: Math test fail for php55 - https://phabricator.wikimedia.org/T126422#2018011 (10Paladox) Also it looks like these 'texlive-math-extra', 'texlive-pictures', 'texlive-pstricks', 'texlive-publishers', 'texlive-generic-extra' Aren't installed on trusty either. Th... [09:18:41] 7Browser-Tests, 10Wikidata: Sitelink browser test fails with firefox - https://phabricator.wikimedia.org/T126585#2018020 (10adrianheine) 3NEW [09:24:52] 7Browser-Tests, 10Wikidata: Sitelink browser test sometimes fails with firefox - https://phabricator.wikimedia.org/T126585#2018039 (10adrianheine) [09:34:00] 7Browser-Tests, 10Wikidata: Sitelink browser test sometimes fails with firefox - https://phabricator.wikimedia.org/T126585#2018043 (10adrianheine) This is a rate limiting problem. Whenever the test fails, I have the following in my debug log: ``` [ratelimit] User '0:0:0:0:0:0:0:1' (IP ::1) tripped my_wiki:limi... [09:36:29] 7Browser-Tests, 10Wikidata: Sitelink browser test sometimes fails with firefox - https://phabricator.wikimedia.org/T126585#2018047 (10Addshore) Probably something relating to https://www.mediawiki.org/wiki/Manual:$wgRateLimits ? 
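Yuvi's quoted instruction above is cut off after "Hiera:", but the key itself is named. A sketch of what the project's Hiera page would carry, assuming nothing beyond the one attested variable:

```yaml
# Hiera sketch for deployment-prep: the mount_nfs key is from Yuvi's quote;
# any other keys on the real page are unknown here.
mount_nfs: false
```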
[09:49:30] 7Browser-Tests, 10Wikidata: Sitelink browser test sometimes fails with firefox - https://phabricator.wikimedia.org/T126585#2018077 (10adrianheine) @tobi_WMDE_SW @addshore @janzerebecki Rate limiting settings for beta need to be adapted, either by IP or in general. [09:52:24] 7Browser-Tests, 10Wikidata: Sitelink browser test sometimes fails with firefox - https://phabricator.wikimedia.org/T126585#2018080 (10Tobi_WMDE_SW) Probably @aude can help with this? [09:55:19] 10Continuous-Integration-Config, 5Continuous-Integration-Scaling, 15User-greg: Migrate leftover tox jobs to CI Nodepool - https://phabricator.wikimedia.org/T126588#2018085 (10hashar) 3NEW a:3zeljkofilipin [10:04:41] ok lets try to reclaim memory [10:06:27] !log disabling puppet on all CI slaves. Trying to lower tmpfs 512MB to 128MB ( https://gerrit.wikimedia.org/r/#/c/269880/ ) [10:06:29] Logged the message at https://wikitech.wikimedia.org/wiki/Release_Engineering/SAL, Master [10:13:59] 7Browser-Tests, 10MediaWiki-extensions-RelatedArticles: Fix failed RelatedArticles Selenium Jenkins jobs - https://phabricator.wikimedia.org/T126589#2018099 (10zeljkofilipin) 3NEW [10:14:25] oh [10:16:48] !log pooling back integration-slave-trusty-1009 and integration-slave-trusty-1010 (tmpfs shrunken) [10:16:50] Logged the message at https://wikitech.wikimedia.org/wiki/Release_Engineering/SAL, Master [10:16:50] 10Browser-Tests-Infrastructure, 5Release-Engineering-Epics, 7Epic, 7Tracking: Fix or delete failing browser tests Jenkins jobs - https://phabricator.wikimedia.org/T94150#2018106 (10zeljkofilipin) [10:17:29] 10Continuous-Integration-Infrastructure, 5Patch-For-Review: CI trusty slaves running out of memory - https://phabricator.wikimedia.org/T126545#2018109 (10hashar) I have disabled puppet, cherry picked the patch for tmpfs to 128MB and pooled back integration-slave-trusty-1009 and integration-slave-trusty-1010 w... [10:33:24] 3Scap3, 10scap, 5Patch-For-Review: Make puppet provider for scap3 - https://phabricator.wikimedia.org/T113072#2018151 (10mmodell) 5Open>3Resolved [10:33:27] 3Scap3, 6Phabricator, 5Patch-For-Review, 7WorkType-Maintenance: Refactor phabricator module in puppet to remove git tag pinning behavior - https://phabricator.wikimedia.org/T125851#1998830 (10mmodell) [10:36:57] 3Scap3, 10scap: Parameterize global /etc/scap.cfg in ops/puppet - https://phabricator.wikimedia.org/T126259#2018165 (10mmodell) p:5Normal>3High [10:40:15] 7Browser-Tests, 10MediaWiki-extensions-RelatedArticles: Fix failed RelatedArticles Selenium Jenkins jobs - https://phabricator.wikimedia.org/T126589#2018168 (10zeljkofilipin) a:3zeljkofilipin [10:46:17] PROBLEM - Puppet failure on integration-slave-trusty-1015 is CRITICAL: CRITICAL: 14.29% of data above the critical threshold [0.0] [10:50:58] !log reenabled puppet on CI. All transitioned to a 128MB tmpfs (was 512MB) [10:51:00] Logged the message at https://wikitech.wikimedia.org/wiki/Release_Engineering/SAL, Master [10:52:11] 10Continuous-Integration-Infrastructure, 5Patch-For-Review: CI trusty slaves running out of memory - https://phabricator.wikimedia.org/T126545#2018181 (10hashar) All slaves now have 128MB tmpfs instead of 512MB. I pooled back the various ci.medium slaves we have created yesterday. 
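For context on the !log entries above: shrinking the tmpfs from 512MB to 128MB (the change cherry-picked from https://gerrit.wikimedia.org/r/269880) amounts to a one-option edit in a mount definition. A hedged sketch, since the actual puppet code is not shown in the log and the mount point is an assumption:

```puppet
# Hypothetical sketch of a tmpfs size reduction on a CI slave; the log
# only says MySQL runs on the tmpfs, so the mount point is assumed.
mount { '/var/lib/mysql':
    ensure  => mounted,
    device  => 'tmpfs',
    fstype  => 'tmpfs',
    options => 'defaults,size=128M',  # previously size=512M
}
```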
[11:00:42] 7Browser-Tests, 10Wikidata: Sitelink browser test sometimes fails with firefox - https://phabricator.wikimedia.org/T126585#2018185 (10aude) I think https://github.com/wikimedia/operations-mediawiki-config/blob/master/wmf-config/InitialiseSettings-labs.php#L555-L557 is where this can be set. https://webcache.g... [11:00:44] 10Continuous-Integration-Infrastructure: Disable HHVM fcgi server on CI slaves - https://phabricator.wikimedia.org/T126594#2018186 (10hashar) 3NEW [11:01:17] 7Browser-Tests, 10Wikidata: Sitelink browser test sometimes fails with firefox - https://phabricator.wikimedia.org/T126585#2018192 (10aude) do we need more than saucelabs ip range for this? e.g. our office IP range if we run these also from the office? [11:31:58] !log disabling hhvm service on CI slaves ( https://phabricator.wikimedia.org/T126594 , cherry picked both patches ) [11:32:01] Logged the message at https://wikitech.wikimedia.org/wiki/Release_Engineering/SAL, Master [11:40:41] 10Continuous-Integration-Infrastructure, 5Patch-For-Review: Disable HHVM fcgi server on CI slaves - https://phabricator.wikimedia.org/T126594#2018281 (10hashar) a:3hashar I have cherry picked both patches on the integration puppetmaster. That got rid of the hhvm service. Now pending ops review / merging of... [11:42:18] RECOVERY - Puppet failure on integration-slave-trusty-1015 is OK: OK: Less than 1.00% above the threshold [0.0] [11:43:57] 10Continuous-Integration-Infrastructure, 10Math: Math test fail for php55 - https://phabricator.wikimedia.org/T126422#2018289 (10JanZerebecki) What base image is the vagrant using where this works? [11:46:14] !log salt -v '*' cmd.run '/etc/init.d/apache2 restart' might help for Wikidata browser tests failing [11:46:17] Logged the message at https://wikitech.wikimedia.org/wiki/Release_Engineering/SAL, Master [12:14:49] PROBLEM - Free space - all mounts on deployment-bastion is CRITICAL: CRITICAL: deployment-prep.deployment-bastion.diskspace._var.byte_percentfree (<12.50%) [12:18:34] Yippee, build fixed! [12:18:35] Project browsertests-VisualEditor-en.wikipedia.beta.wmflabs.org-linux-firefox-sauce-T94162 build #2: 09FIXED in 2 min 31 sec: https://integration.wikimedia.org/ci/job/browsertests-VisualEditor-en.wikipedia.beta.wmflabs.org-linux-firefox-sauce-T94162/2/ [12:22:44] (03PS2) 10Zfilipin: Recreate VisualEditor Selenium Jenkins job [integration/config] - 10https://gerrit.wikimedia.org/r/269732 (https://phabricator.wikimedia.org/T94162) [12:22:52] (03PS3) 10Zfilipin: Recreate VisualEditor Selenium Jenkins job [integration/config] - 10https://gerrit.wikimedia.org/r/269732 (https://phabricator.wikimedia.org/T94162) [12:23:14] (03PS4) 10Zfilipin: Recreate VisualEditor Selenium Jenkins job [integration/config] - 10https://gerrit.wikimedia.org/r/269732 (https://phabricator.wikimedia.org/T94162) [12:59:24] 10Continuous-Integration-Infrastructure, 10Math: Math test fail for php55 - https://phabricator.wikimedia.org/T126422#2018449 (10Paladox) @JanZerebecki or @Hashar Would this https://github.com/wikimedia/operations-puppet/commit/f02218aee730a78284a44bdbf0c6197d7f9e5774 have caused the problem by removing that. [13:13:24] 10Continuous-Integration-Infrastructure, 10Math: Math test fail for php55 - https://phabricator.wikimedia.org/T126422#2018460 (10hashar) The `mediawiki-math-texvc` package is installed on all slaves with version `2:1.0+git20140526-1`. Confirmed via: `salt -v '*slave*' pkg.version mediawiki-math-texvc` It co...
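aude's pointer above is to the beta-specific overrides in wmf-config/InitialiseSettings-labs.php. A sketch of the kind of adjustment under discussion; $wgRateLimits and $wgRateLimitsExcludedIPs are real MediaWiki settings, but the IP range, the limit key, and the values below are placeholders, not the repository's contents:

```php
<?php
// Sketch only: stop the browser-test runner from tripping rate limits on
// beta. The range and limit values here are hypothetical.
$wgRateLimitsExcludedIPs[] = '192.0.2.0/24'; // e.g. the saucelabs range
$wgRateLimits['edit']['ip'] = [ 60, 60 ];    // loosen a limit: 60 actions per 60s
```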
[13:14:31] 10Continuous-Integration-Infrastructure, 10Math: Math test fail for php55 - https://phabricator.wikimedia.org/T126422#2018464 (10hashar) [13:19:15] 10Continuous-Integration-Infrastructure, 10Math: Math test fail for php55 - https://phabricator.wikimedia.org/T126422#2018474 (10hashar) Edited task detail and copy pasted the failures. For `\Digamma` there is T29754 which got filed when we had Ubuntu Hardy. It references `teubner` which is found on Trusty:... [13:25:33] 10Continuous-Integration-Infrastructure, 10Math: Math test fail for php55 - https://phabricator.wikimedia.org/T126422#2018481 (10Paladox) >>! In T126422#2016586, @JanZerebecki wrote: > ``` > integration-slave-trusty-1012:~$ dpkg -s 'ocaml-native-compilers' 'texlive' 'texlive-bibtex-extra' 'texlive-font-utils... [13:29:41] 10Continuous-Integration-Infrastructure, 10Math: Math test fail for php55 - https://phabricator.wikimedia.org/T126422#2018485 (10hashar) The old patch for texvc: https://www.mediawiki.org/wiki/Special:Code/MediaWiki/97014 that brings back a previous change https://www.mediawiki.org/wiki/Special:Code/MediaWiki... [13:38:20] Yippee, build fixed! [13:38:21] Project browsertests-RelatedArticles-en.m.wikipedia.beta.wmflabs.org-linux-chrome-sauce build #37: 09FIXED in 1 min 9 sec: https://integration.wikimedia.org/ci/job/browsertests-RelatedArticles-en.m.wikipedia.beta.wmflabs.org-linux-chrome-sauce/37/ [13:39:25] 7Browser-Tests, 10MediaWiki-extensions-RelatedArticles, 5Patch-For-Review: Fix failed RelatedArticles Selenium Jenkins jobs - https://phabricator.wikimedia.org/T126589#2018501 (10zeljkofilipin) 5Open>3Resolved [13:39:28] 10Browser-Tests-Infrastructure, 5Release-Engineering-Epics, 7Epic, 7Tracking: Fix or delete failing browser tests Jenkins jobs - https://phabricator.wikimedia.org/T94150#2018502 (10zeljkofilipin) [13:43:45] 7Browser-Tests, 10Continuous-Integration-Config, 10MediaWiki-extensions-RelatedArticles: RelatedArticles browser tests should run on a commit basis - https://phabricator.wikimedia.org/T120715#2018504 (10zeljkofilipin) I have found where the test fails: https://integration.wikimedia.org/ci/job/mwext-mw-seleni... [13:45:19] 7Browser-Tests, 10Continuous-Integration-Config, 10MediaWiki-extensions-RelatedArticles: RelatedArticles browser tests should run on a commit basis - https://phabricator.wikimedia.org/T120715#2018508 (10bmansurov) >>! In T120715#2008249, @zeljkofilipin wrote: > How can I reproduce the problem? I do not see R... [13:46:58] 7Browser-Tests, 10Continuous-Integration-Config, 10MediaWiki-extensions-RelatedArticles: RelatedArticles browser tests should run on a commit basis - https://phabricator.wikimedia.org/T120715#2018509 (10zeljkofilipin) Looks like Vector skin is not available in CI ([[ https://integration.wikimedia.org/ci/job/... [13:47:36] hashar: any idea why Vector is not there?
^ [13:51:52] zeljkof: Wikidata has a similar issue [13:52:05] zeljkof: maybe skins are no longer automatically loaded [13:52:17] hm [13:52:26] zeljkof: you would want to look at the console output to verify whether skins/Vector is properly cloned [13:52:34] let me check [13:52:39] MediaWiki is supposed to automatically / magically load skins under /skins/ [13:53:04] https://integration.wikimedia.org/ci/job/mwext-mw-selenium/3602/consoleFull [13:53:16] 00:03:01.546 + zuul-cloner --color --verbose --map /srv/deployment/integration/slave-scripts/etc/zuul-clonemap.yaml --workspace src https://gerrit.wikimedia.org/r/p mediawiki/skins/Vector [13:53:59] 00:03:02.793 DEBUG:zuul.Cloner:Project mediawiki/skins/Vector in Zuul does not have ref refs/zuul/master/Zfc3202bc003048d0aa7b5a644eeafcbf [13:54:51] not sure what it says :/ [13:55:29] that is just a debug [13:55:38] the repo is cloned [13:55:49] you can confirm by logging in to the instance and browsing the workspace [13:55:56] should be then under ./src/skins/Vector [13:56:04] no clue why Vector would not load it though [13:58:39] AH HHHHHHH [13:58:49] I finally found a bug we had in our puppet recipes for a while  \O//// [14:17:53] * zeljkof claps [14:21:06] 7Browser-Tests, 10Continuous-Integration-Config, 10MediaWiki-extensions-RelatedArticles: RelatedArticles browser tests should run on a commit basis - https://phabricator.wikimedia.org/T120715#2018557 (10zeljkofilipin) When I run the last scenario targeting beta, this is what I get: ``` $ MEDIAWIKI_ENVIRONM... [15:19:39] (03PS1) 10Krinkle: Remove mediawiki-core-jslint job [integration/config] - 10https://gerrit.wikimedia.org/r/269976 [15:20:00] (03CR) 10Krinkle: "for example https://gerrit.wikimedia.org/r/#/c/269940/" [integration/config] - 10https://gerrit.wikimedia.org/r/269976 (owner: 10Krinkle) [15:20:07] (03CR) 10Krinkle: [C: 032] Remove mediawiki-core-jslint job [integration/config] - 10https://gerrit.wikimedia.org/r/269976 (owner: 10Krinkle) [15:22:52] (03Merged) 10jenkins-bot: Remove mediawiki-core-jslint job [integration/config] - 10https://gerrit.wikimedia.org/r/269976 (owner: 10Krinkle) [15:24:05] zeljkof: By default Zuul clones all the repos to their master branch, but before it does that for mediawiki/core and mw/skins/Vector, it first tries to find the current Gerrit change that triggered the test in that repo. This way, when you submit a change for mw-core, it will check out your change alongside mw-vector master. However the logic is agnostic [15:24:05] and does not know which of the two repos the change is for. Hence that irrelevant debug warning about not finding the change id. After that warning it just checks out master, which is what we want [15:24:16] This allows you to have a single job used by both mw-core and mw-skins-vector [15:24:27] and both ways it'll replace the one with the relevant change [15:24:43] It's a little odd, but that's how it works [15:27:03] !log Deploy Zuul config change https://gerrit.wikimedia.org/r/269976 [15:27:06] Logged the message at https://wikitech.wikimedia.org/wiki/Release_Engineering/SAL, Master [15:29:56] Krinkle: thanks [15:55:54] hashar: How would we enable/privilege teubner? [15:56:27] Krinkle: why the hell have you deleted the integration-dev instance????
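Krinkle's explanation above is driven by the clone map zuul-cloner is invoked with in the console output (/srv/deployment/integration/slave-scripts/etc/zuul-clonemap.yaml). That file's contents are not shown in the log, so the entries below are assumptions inferred from the ./src/skins/Vector path mentioned:

```yaml
# Assumed zuul-cloner clone map: each repo lands in the workspace, and
# whichever repo the triggering Gerrit change belongs to gets that change
# checked out, while the rest fall back to master.
clonemap:
  - name: mediawiki/core
    dest: .
  - name: mediawiki/skins/(.*)
    dest: ./skins/\1
```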
[15:59:00] hashar it seems mysql is down for https://integration.wikimedia.org/ci/job/mwext-testextension-php55/801/consoleFull [15:59:07] 10Continuous-Integration-Infrastructure, 7WorkType-Maintenance: Rebuild integration-dev (instance to build images) - https://phabricator.wikimedia.org/T126613#2018772 (10hashar) 3NEW [15:59:23] paladox: please fill it as a task [15:59:55] hashar ok. i will do so after i do a recheck to make sure it works; if not i will file a task. [16:00:14] paladox: and mention slave ( integration-slave-trusty-1021 ) as well as date of event ( 12:16:09 UTC ) and the error "DB connection error: Access denied for user 'jenkins_u0'@'localhost' (using password: YES) (127.0.0.1:3306)" [16:00:30] hashar: Ok. [16:00:41] !log rebuilding integration-dev https://phabricator.wikimedia.org/T126613 [16:00:44] Logged the message at https://wikitech.wikimedia.org/wiki/Release_Engineering/SAL, Master [16:01:55] hashar: Since we switched to php55 on the master branch for all extensions, would we also need to update the non-voting rules for those extensions, since mwext-AbuseFilter-testextension-php53 would not work with php55? [16:02:19] 10Continuous-Integration-Infrastructure, 7WorkType-Maintenance: Rebuild integration-dev (instance to build images) - https://phabricator.wikimedia.org/T126613#2018783 (10Paladox) [16:02:37] paladox: I think we should just phase out the non-voting mwext-.*-testextension-php53 jobs [16:02:56] and use the zuul template that has no testextension job on them [16:03:06] then folks can check experimental to look whether the job actually passes [16:03:22] hashar: Yes. [16:03:45] hashar: Yes, definitely a mysql issue. So I'm filing a task now. [16:04:50] 3Scap3, 10scap: Scap should touch symlinks when originals are touched - https://phabricator.wikimedia.org/T126306#2018799 (10bd808) How about instead of a fancy generic mechanism we take the current "touch CommonSettings.php" code and turn it into something that touches a list of files that can be configured... [16:07:44] 10Continuous-Integration-Config, 10Continuous-Integration-Infrastructure: MySQL down on integration-slave-trusty-1020 - https://phabricator.wikimedia.org/T126615#2018801 (10Paladox) [16:07:53] hashar: I've filed the bug here https://phabricator.wikimedia.org/T126615 [16:09:16] hashar: fyi, i am making a new wikidata build that i think fixes https://integration.wikimedia.org/ci/job/mediawiki-extensions-qunit/30465/console [16:09:44] (realized that our tests were causing gate and submit fail for other things like core) [16:11:58] https://integration.wikimedia.org/ci/job/mediawiki-extensions-qunit/30465/artifact/log/mw-error.log/*view*/ [16:13:21] Krinkle: we are aware of that, but it's just a notice and shouldn't cause fail [16:13:36] https://phabricator.wikimedia.org/T126596 [16:18:08] aude: you folks are amazing thank you ! [16:19:11] legoktm hashar it seems that looking at https://integration.wikimedia.org/ci/job/mediawiki-extensions-hhvm/50200/consoleFull with this saying (phpflavor-hhvm contintLabsSlave UbuntuTrusty phpflavor-php55) it should be (phpflavor-hhvm contintLabsSlave UbuntuTrusty phpflavor-hhvm) [16:19:29] thanks thcipriani [16:20:06] andrewbogott: Krenair put you up for SWAT :) [16:28:19] hashar: That slave 1020 is used for hhvm.
And it seems it is happening to other tests such as wikidata https://integration.wikimedia.org/ci/job/mediawiki-extensions-hhvm/50209/console https://gerrit.wikimedia.org/r/#/c/269990/ [16:30:18] 10Continuous-Integration-Infrastructure, 10IPSet, 5Patch-For-Review: IPSet::__construct() in gets into infinite loop when called from curl on a CI host - https://phabricator.wikimedia.org/T126495#2018889 (10hashar) [16:32:26] Reedy andrewbogott doesn't https://wikitech-static.wikimedia.org/wiki/Special:Version need to be updated to php 5.5 or hhvm because of the migration to php5.5? [16:33:59] 10Continuous-Integration-Infrastructure, 10IPSet, 5Patch-For-Review: IPSet::__construct() in gets into infinite loop when called from curl on a CI host - https://phabricator.wikimedia.org/T126495#2018931 (10Tgr) Maybe large IPSet rules should be precompiled? 100+ levels of recursion sounds expensive... [16:36:01] andrewbogott when going to https://wikitech-static.wikimedia.org/wiki/Special:SpecialPages it returns a 500 error. [16:44:36] 10Continuous-Integration-Config, 10Continuous-Integration-Infrastructure: MySQL down on integration-slave-trusty-1020 - https://phabricator.wikimedia.org/T126615#2018975 (10JanZerebecki) Also 1021 https://integration.wikimedia.org/ci/job/mwext-testextension-hhvm-composer/1065/consoleFull . [16:46:35] 10Continuous-Integration-Config, 10Continuous-Integration-Infrastructure: MySQL down on integration-slave-trusty-(1020|1021) - https://phabricator.wikimedia.org/T126615#2018980 (10Paladox) [16:47:27] hashar: Why is it showing the links that it is downloading from https://integration.wikimedia.org/ci/job/mwext-testextension-hhvm-composer/1065/consoleFull [16:47:56] i have no idea, file it as a bug as well please [16:48:26] hashar: Ok. [16:51:34] 10Continuous-Integration-Config, 10Continuous-Integration-Infrastructure: etension tests are now showing the links which should not be shown or should be coloured - https://phabricator.wikimedia.org/T126624#2019019 (10Paladox) 3NEW [16:51:43] hashar: I've filed the task at https://phabricator.wikimedia.org/T126624 [17:04:09] 10Continuous-Integration-Config, 10Continuous-Integration-Infrastructure: MySQL down on integration-slave-trusty-(1020|1021) - https://phabricator.wikimedia.org/T126615#2019070 (10aude) this blocks merging a new wikidata build (https://gerrit.wikimedia.org/r/#/c/269990/) which is needed for gate-and-submit (e.... [17:04:22] 10Continuous-Integration-Config, 10Continuous-Integration-Infrastructure: MySQL down on integration-slave-trusty-(1020|1021) - https://phabricator.wikimedia.org/T126615#2019071 (10aude) p:5Triage>3High [17:10:49] 10Continuous-Integration-Config, 10Continuous-Integration-Infrastructure: etension tests are now showing the links which should not be shown or should be coloured - https://phabricator.wikimedia.org/T126624#2019087 (10hashar) 5Open>3Resolved a:3hashar It is the instance local cache warming up while downl... [17:13:36] 10Continuous-Integration-Config, 10Continuous-Integration-Infrastructure: etension tests are now showing the links which should not be shown or should be coloured - https://phabricator.wikimedia.org/T126624#2019093 (10Paladox) Oh ok thanks. [17:13:47] 10Continuous-Integration-Config, 10Continuous-Integration-Infrastructure: extension tests are now showing the links which should not be shown or should be coloured - https://phabricator.wikimedia.org/T126624#2019094 (10Paladox) [17:19:00] hashar: Should we enable qunit jobs for skins?
[17:20:00] hashar: Also if we were to switch from using npm to npm-node-4.2 would that let us use the latest npm version? [17:20:12] yeah I would do that [17:20:18] if only we were facing so many crazy issues [17:21:50] hashar: Ok, were you referring to skins qunit? [17:21:52] weren't* [17:22:56] hashar: Error: 10 disk I/O error https://integration.wikimedia.org/ci/job/mwext-Wikibase-client-tests-sqlite-hhvm/8224/console [17:25:08] 10Continuous-Integration-Config, 10Continuous-Integration-Infrastructure: MySQL down on integration-slave-trusty-(1020|1021) - https://phabricator.wikimedia.org/T126615#2019131 (10JanZerebecki) This doesn't seem to be a disk space issue: ``` integration-slave-trusty-1020:~$ df -h Filesystem... [17:26:44] jzerebecki: ^ [17:27:04] !log I tried. The ci.medium instances are too small and MediaWiki tests really need 1.5GBytes of memory :-( [17:27:06] Logged the message at https://wikitech.wikimedia.org/wiki/Release_Engineering/SAL, Master [17:27:26] !log Depooling all the ci.medium slaves and deleting them. [17:27:28] Logged the message at https://wikitech.wikimedia.org/wiki/Release_Engineering/SAL, Master [17:27:40] can't even remember the task # [17:29:20] 10Continuous-Integration-Infrastructure, 5Patch-For-Review: CI trusty slaves running out of memory - https://phabricator.wikimedia.org/T126545#2019145 (10hashar) [17:29:22] 10Continuous-Integration-Config, 10Continuous-Integration-Infrastructure: MySQL down on integration-slave-trusty-(1020|1021) - https://phabricator.wikimedia.org/T126615#2019144 (10hashar) [17:29:43] 10Continuous-Integration-Config, 10Continuous-Integration-Infrastructure: MySQL down on integration-slave-trusty-(1020|1021) - https://phabricator.wikimedia.org/T126615#2018801 (10hashar) The instances we added yesterday do not have enough memory. That is T126545. I am getting rid of them. [17:30:04] 10Beta-Cluster-Infrastructure, 6Labs: New labs instance fails running `block-for-project-export` before running mount - https://phabricator.wikimedia.org/T126568#2019152 (10greg) p:5Triage>3Unbreak! [17:30:27] 10Beta-Cluster-Infrastructure, 6Labs: New labs instance fails running `block-for-project-export` before running mount - https://phabricator.wikimedia.org/T126568#2017518 (10greg) UBN! because this is blocking Beta Cluster updates. [17:30:54] 10Continuous-Integration-Infrastructure, 5Patch-For-Review: CI trusty slaves running out of memory - https://phabricator.wikimedia.org/T126545#2019155 (10Paladox) @hashar since you're getting rid of those instances, does that mean the load will get high again, or will they be replaced with ones that have more memory? [17:32:02] 10Continuous-Integration-Infrastructure, 5Patch-For-Review: CI trusty slaves running out of memory - https://phabricator.wikimedia.org/T126545#2019159 (10hashar) The CI slaves we added yesterday do not have enough memory. An example of Linux triggering the OOM: [Thu Feb 11 16:59:51 2016] Killed process 27671... [17:32:53] 10Continuous-Integration-Infrastructure, 5Patch-For-Review: CI trusty slaves running out of memory - https://phabricator.wikimedia.org/T126545#2019160 (10hashar) 5Open>3Resolved a:3hashar Fixed by deleting all c1.medium instances. There are still blockers but they are not really blocking anymore since th... [17:33:20] 10Continuous-Integration-Infrastructure, 5Patch-For-Review: CI trusty slaves running out of memory - https://phabricator.wikimedia.org/T126545#2019164 (10Paladox) Ok. Will they be replaced with ones that have more memory?
[17:33:57] 10Continuous-Integration-Infrastructure, 5Patch-For-Review: CI trusty slaves running out of memory - https://phabricator.wikimedia.org/T126545#2019174 (10greg) yes, that's the point :) [17:34:20] PROBLEM - Puppet failure on deployment-mediawiki01 is CRITICAL: CRITICAL: 37.50% of data above the critical threshold [0.0] [17:34:54] 10Continuous-Integration-Infrastructure, 6Labs, 10Labs-Infrastructure: Bump labs quota for 'integration' project - https://phabricator.wikimedia.org/T126557#2019177 (10Paladox) Where would we change this so it is raised. [17:35:29] 10Continuous-Integration-Infrastructure, 5Patch-For-Review: CI trusty slaves running out of memory - https://phabricator.wikimedia.org/T126545#2019182 (10Paladox) @greg thanks for replying. [17:36:30] 10Continuous-Integration-Infrastructure, 6Labs, 10Labs-Infrastructure: Bump labs quota for 'integration' project - https://phabricator.wikimedia.org/T126557#2019188 (10greg) @Paladox: this task is for the WMF Labs admins to take care of. Quotas are handled in the OpenStackManager on wikitech (iow: not in Ger... [17:36:59] andrewbogott: if you haven't seen it already: https://phabricator.wikimedia.org/T126557 would love your action if you can [17:37:00] 10Continuous-Integration-Infrastructure, 6Labs, 10Labs-Infrastructure: Bump labs quota for 'integration' project - https://phabricator.wikimedia.org/T126557#2019190 (10Paladox) @greg Ok. [17:37:29] PROBLEM - Host integration-slave-trusty-1019 is DOWN: CRITICAL - Host Unreachable (10.68.17.61) [17:38:09] PROBLEM - Host integration-slave-trusty-1022 is DOWN: CRITICAL - Host Unreachable (10.68.17.12) [17:38:36] 10Deployment-Systems, 10Adminbot, 6operations: [[wikitech:Server_admin_log]] should not rely on freenode irc for logmsgbot entries - https://phabricator.wikimedia.org/T46791#2019192 (10greg) [17:39:01] PROBLEM - Host integration-slave-trusty-1020 is DOWN: CRITICAL - Host Unreachable (10.68.17.66) [17:39:55] PROBLEM - Host integration-slave-trusty-1021 is DOWN: CRITICAL - Host Unreachable (10.68.17.118) [17:40:09] RECOVERY - Host integration-slave-trusty-1020 is UP: PING OK - Packet loss = 0%, RTA = 1.24 ms [17:42:13] RECOVERY - Host integration-slave-trusty-1022 is UP: PING OK - Packet loss = 0%, RTA = 0.97 ms [17:42:19] RECOVERY - Host integration-slave-trusty-1019 is UP: PING OK - Packet loss = 0%, RTA = 2.86 ms [17:42:39] !log Currently testing https://gerrit.wikimedia.org/r/#/c/268802/ in Beta Labs [17:42:42] Logged the message at https://wikitech.wikimedia.org/wiki/Release_Engineering/SAL, Master [17:42:47] greg-g: I thought I did that yesterday… is this yet more? [17:42:50] (I don’t mind, just unsure) [17:43:22] andrewbogott: uhhh, not sure? /me looks at our quota on wikitech [17:43:35] someone pinged me yesterday, I hadn’t noticed the ticket [17:43:40] 10Continuous-Integration-Config, 10MediaWiki-extensions-OpenStackManager, 5Patch-For-Review, 5WMF-deploy-2016-02-02_(1.27.0-wmf.12), 7WorkType-Maintenance: ApiDocumentationTest failure: Undefined property: AuthPlugin::$boundAs - https://phabricator.wikimedia.org/T124613#2019217 (10hashar) [17:43:40] so probably it’s half done. 
I’ll check [17:43:43] 10Continuous-Integration-Infrastructure, 10MediaWiki-extensions-OpenStackManager, 5Patch-For-Review: CI slaves need package php5-ldap for OpenStackManager/LdapAuthentication - https://phabricator.wikimedia.org/T125158#2019215 (10hashar) 5Open>3Resolved Patch has been merged in operations/puppet [17:43:57] 10Continuous-Integration-Infrastructure, 7WorkType-Maintenance: integration-lightslave-jessie-1002 went out of disk space - https://phabricator.wikimedia.org/T113474#2019218 (10hashar) a:5hashar>3None [17:44:14] 10Continuous-Integration-Infrastructure, 7Jenkins: Jenkins files under /var/lib/jenkins/config-history/config need to be garbage collected - https://phabricator.wikimedia.org/T126552#2019220 (10hashar) a:5hashar>3None [17:44:15] greg-g: Seems beta isn't updating mediawiki-config correctly. New files created yesterday still don't exist in beta. https://integration.wikimedia.org/ci/job/beta-scap-eqiad/ is broken [17:44:20] fails on mw php requirement check [17:44:33] PROBLEM - Host integration-slave-precise-1004 is DOWN: CRITICAL - Host Unreachable (10.68.17.54) [17:45:06] !sal [17:45:07] https://tools.wmflabs.org/sal/releng [17:45:26] Krinkle: beta is no more updated at all [17:45:32] Yes [17:45:35] Krinkle: the bastion is still Precise and has Zend 5.3 but MediaWiki dies claiming 5.5 [17:45:41] Yup [17:45:45] so that is pending a rebuild / migration to Jessie [17:46:01] 10Continuous-Integration-Infrastructure, 5Patch-For-Review: CI trusty slaves running out of memory - https://phabricator.wikimedia.org/T126545#2019228 (10Andrew) [17:46:04] 10Continuous-Integration-Infrastructure, 6Labs, 10Labs-Infrastructure: Bump labs quota for 'integration' project - https://phabricator.wikimedia.org/T126557#2019225 (10Andrew) 5Open>3Resolved a:3Andrew done! [17:46:08] https://phabricator.wikimedia.org/T126568#2017518 [17:46:15] that's the blocker ^ [17:46:24] andrewbogott: thanks! [17:46:32] 10Continuous-Integration-Infrastructure, 6Labs, 10Labs-Infrastructure: Bump labs quota for 'integration' project - https://phabricator.wikimedia.org/T126557#2019229 (10hashar) Wonderful @Andrew \O/ Let us know if integration/contintcloud cause too much stress on labs infra. [17:47:37] 6Release-Engineering-Team, 10scap, 10MediaWiki-Database, 10MediaWiki-JobRunner, and 2 others: Scap should restart job runners to pick up new config - https://phabricator.wikimedia.org/T126632#2019230 (10demon) 3NEW [17:48:09] Krinkle and hashar i think that there is a task on this. [17:48:21] greg-g: I have ditched all the instances we created yesterday. 2GB is definitely not enough [17:48:31] RECOVERY - Host integration-slave-precise-1004 is UP: PING OK - Packet loss = 0%, RTA = 1.11 ms [17:48:55] paladox: I just linked it cc Krinkle [17:49:41] I'm going to email about Beta Cluster not updating to wikitech-l and engineering [17:50:00] PROBLEM - Host integration-slave-trusty-1020 is DOWN: CRITICAL - Host Unreachable (10.68.17.66) [17:50:17] hashar: Hm.. not even for 1 executor?
[17:50:26] Krinkle: yeah [17:50:28] k [17:50:41] Krinkle: I have removed a few daemons on the instances, setup monitoring [17:50:51] and some of the MediaWiki related jobs definitely are memory hogs [17:51:03] so system + daemon + tmpfs takes roughly 700 MB a [17:51:04] RECOVERY - Host integration-slave-trusty-1020 is UP: PING OK - Packet loss = 0%, RTA = 0.97 ms [17:51:11] and somehow the 1.3GBytes are consumed entirely [17:51:16] at that point the tmpfs misbehaves [17:51:26] and mysql is gone (cause it runs on tmpfs) [17:51:29] no clue what is going on exactly. But more ram fixes it [17:51:31] :-( [17:52:40] !log Created integration-slave-trusty-{1019-1026} as m1.large (note 1023 is an exception it is for Android). Applied role::ci::slave , let's wait for puppet to finish [17:52:43] Logged the message at https://wikitech.wikimedia.org/wiki/Release_Engineering/SAL, Master [17:53:26] Krinkle: don't worry about integration-dev . Partly my fault to not have added a role:: class to it which would have indicated the usage :-} [17:54:14] RECOVERY - Puppet failure on deployment-mediawiki01 is OK: OK: Less than 1.00% above the threshold [0.0] [17:54:27] I got one interesting question: You got scap, and scap 3. Why not scap2? [17:54:50] scap2 is called scap [17:54:59] hashar: OK. Nice find. So what's the plan for now. Back to m1.large with 4 executors (and share the minimum ram footprint that way)? [17:55:00] first was bash scripts (scap1) [17:55:13] rewritten to python (scap2) [17:55:23] a thing to replace trebuchet (scap3) [17:55:36] which will hopefully soon be renamed "deploy" [17:55:43] we got a proof of concept in haskell [17:55:48] bd808: thanks for the explanation [17:56:06] Krinkle: yeah m1.large which has 8GB ram [17:56:10] Krinkle: since they are known to work [17:56:18] cool [17:56:26] Krinkle: php 5.5 also might have its own issue. I noticed a few weird things going on ;-. [17:57:09] Luke081515: there is a walkthrough of what scap1 looked like at -- https://www.mediawiki.org/wiki/Deployment_tooling/Notes/What_does_scap_do [17:57:24] great [17:57:33] bah I have used wrong names [17:58:24] greg-g: Thanks.
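Laid out, the memory budget hashar describes above explains why a 2GB ci.medium cannot host even one MediaWiki test run; the figures are from the log, only the subtraction is added:

```
system + daemons + tmpfs      ~0.7 GB
left for the job on 2 GB      ~2.0 - 0.7 = 1.3 GB
one MediaWiki test run needs  ~1.5 GB  ->  OOM, tmpfs misbehaves, MySQL dies
an 8 GB m1.large              ~7.3 GB free, room for several executors
```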
[17:59:38] !log recreated instances with proper names: integration-slave-trusty-{1001-1006} [17:59:41] Logged the message at https://wikitech.wikimedia.org/wiki/Release_Engineering/SAL, Master [17:59:56] * hashar wave [17:59:56] RECOVERY - Host integration-slave-trusty-1021 is UP: PING OK - Packet loss = 0%, RTA = 0.66 ms [18:00:12] PROBLEM - Host integration-slave-trusty-1009 is DOWN: CRITICAL - Host Unreachable (10.68.16.159) [18:00:26] 10Continuous-Integration-Infrastructure, 6Release-Engineering-Team: CI incidents - week of Feb 8th - https://phabricator.wikimedia.org/T126634#2019273 (10greg) 3NEW [18:01:52] 10Beta-Cluster-Infrastructure, 6Labs: New labs instance fails running `block-for-project-export` before running mount - https://phabricator.wikimedia.org/T126568#2019281 (10Paladox) [18:02:55] RECOVERY - Host integration-slave-trusty-1009 is UP: PING OK - Packet loss = 0%, RTA = 1.88 ms [18:04:27] PROBLEM - Puppet failure on deployment-elastic06 is CRITICAL: CRITICAL: 70.00% of data above the critical threshold [0.0] [18:04:59] PROBLEM - Puppet failure on integration-trusty-1021 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [0.0] [18:05:31] 10Continuous-Integration-Infrastructure, 6Release-Engineering-Team: CI incidents - week of Feb 8th - https://phabricator.wikimedia.org/T126634#2019296 (10greg) [18:06:32] RECOVERY - Puppet failure on deployment-tin is OK: OK: Less than 1.00% above the threshold [0.0] [18:08:06] PROBLEM - Puppet failure on integration-slave-trusty-1001 is CRITICAL: CRITICAL: 33.33% of data above the critical threshold [0.0] [18:08:10] 10Continuous-Integration-Infrastructure, 6Release-Engineering-Team: CI incidents - week of Feb 8th - https://phabricator.wikimedia.org/T126634#2019316 (10greg) [18:08:35] I'll continue to fill that task out and convert to an incident report later ^^ [18:10:42] PROBLEM - Puppet failure on integration-slave-trusty-1006 is CRITICAL: CRITICAL: 25.00% of data above the critical threshold [0.0] [18:12:24] PROBLEM - Puppet failure on integration-slave-trusty-1002 is CRITICAL: CRITICAL: 30.00% of data above the critical threshold [0.0] [18:12:24] PROBLEM - Puppet failure on integration-slave-trusty-1005 is CRITICAL: CRITICAL: 14.29% of data above the critical threshold [0.0] [18:13:08] RECOVERY - Puppet failure on integration-slave-trusty-1001 is OK: OK: Less than 1.00% above the threshold [0.0] [18:13:20] PROBLEM - Puppet failure on integration-slave-trusty-1003 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [0.0] [18:13:57] 10Continuous-Integration-Infrastructure, 10IPSet, 5Patch-For-Review: IPSet::__construct() in gets into infinite loop when called from curl on a CI host - https://phabricator.wikimedia.org/T126495#2019426 (10BBlack) It's only ~128 levels, and it's not that expensive really. And any pre-compilation of course... [18:14:43] 10Continuous-Integration-Infrastructure, 10IPSet, 5Patch-For-Review: IPSet::__construct() in gets into infinite loop when called from curl on a CI host - https://phabricator.wikimedia.org/T126495#2019431 (10BBlack) meta-note: it would be best if IPSet's documentation made a note that it requires the interpre... 
[18:15:00] PROBLEM - Host integration-trusty-1021 is DOWN: CRITICAL - Host Unreachable (10.68.17.66) [18:15:36] RECOVERY - Puppet failure on integration-slave-trusty-1006 is OK: OK: Less than 1.00% above the threshold [0.0] [18:17:31] RECOVERY - Puppet failure on integration-slave-trusty-1005 is OK: OK: Less than 1.00% above the threshold [0.0] [18:18:42] heh, email collision with antoine [18:20:03] RECOVERY - Host integration-trusty-1021 is UP: PING OK - Packet loss = 0%, RTA = 6.20 ms [18:22:25] RECOVERY - Puppet failure on integration-slave-trusty-1002 is OK: OK: Less than 1.00% above the threshold [0.0] [18:23:27] RECOVERY - Puppet failure on integration-slave-trusty-1003 is OK: OK: Less than 1.00% above the threshold [0.0] [18:37:09] PROBLEM - Puppet failure on integration-slave-trusty-1004 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [0.0] [18:43:21] PROBLEM - Puppet failure on integration-slave-trusty-1002 is CRITICAL: CRITICAL: 12.50% of data above the critical threshold [0.0] [18:49:08] PROBLEM - Puppet failure on integration-slave-trusty-1001 is CRITICAL: CRITICAL: 22.22% of data above the critical threshold [0.0] [18:59:27] PROBLEM - Puppet failure on integration-slave-trusty-1005 is CRITICAL: CRITICAL: 37.50% of data above the critical threshold [0.0] [19:07:12] RECOVERY - Puppet failure on integration-slave-trusty-1004 is OK: OK: Less than 1.00% above the threshold [0.0] [19:10:27] PROBLEM - Puppet failure on integration-slave-trusty-1003 is CRITICAL: CRITICAL: 66.67% of data above the critical threshold [0.0] [19:10:46] 10Beta-Cluster-Infrastructure: rebuild deployment-bastion on trusty - https://phabricator.wikimedia.org/T126537#2019780 (10mmodell) [19:13:56] 10Continuous-Integration-Infrastructure, 10IPSet, 5Patch-For-Review: IPSet::__construct() in gets into infinite loop when called from curl on a CI host - https://phabricator.wikimedia.org/T126495#2019792 (10BBlack) Circling back around to this and looking again: of course it's possible that a better answer h... [19:24:28] RECOVERY - Puppet failure on integration-slave-trusty-1005 is OK: OK: Less than 1.00% above the threshold [0.0] [19:24:30] !log moving deployment-bastion to deployment-tin [19:24:33] Logged the message at https://wikitech.wikimedia.org/wiki/Release_Engineering/SAL, Master [19:25:26] RECOVERY - Puppet failure on integration-slave-trusty-1003 is OK: OK: Less than 1.00% above the threshold [0.0] [19:27:29] !log Keeping notes on the ticket: https://phabricator.wikimedia.org/T126537 [19:27:32] Logged the message at https://wikitech.wikimedia.org/wiki/Release_Engineering/SAL, Master [19:27:55] !log running sudo salt -b '10%' '*' cmd.run 'puppet agent -t' from deployment-salt [19:27:58] Logged the message at https://wikitech.wikimedia.org/wiki/Release_Engineering/SAL, Master [19:29:04] 10scap: deploy-local (TargetContext) should not default to utils.get_real_username() - https://phabricator.wikimedia.org/T126489#2019841 (10mmodell) @dduvall: I think the user still matters on the master, just not on the targets. 
[19:34:49] PROBLEM - Puppet failure on deployment-kafka04 is CRITICAL: CRITICAL: 12.50% of data above the critical threshold [0.0] [19:35:26] !log modifying deployment server node in jenkins to point to deployment-tin [19:35:28] Logged the message at https://wikitech.wikimedia.org/wiki/Release_Engineering/SAL, Master [19:36:16] PROBLEM - Puppet failure on deployment-tmh01 is CRITICAL: CRITICAL: 44.44% of data above the critical threshold [0.0] [19:46:34] PROBLEM - Puppet failure on deployment-memc02 is CRITICAL: CRITICAL: 37.50% of data above the critical threshold [0.0] [19:54:45] RECOVERY - Puppet failure on deployment-kafka04 is OK: OK: Less than 1.00% above the threshold [0.0] [20:14:46] !log pooling integration-slave-trusty-1001 integration-slave-trusty-1002 integration-slave-trusty-1003 [20:14:49] Logged the message at https://wikitech.wikimedia.org/wiki/Release_Engineering/SAL, Master [20:16:27] 6Release-Engineering-Team, 5Patch-For-Review, 5WMF-deploy-2016-02-09_(1.27.0-wmf.13): MW 1.27.0-wmf.13 deployment blockers - https://phabricator.wikimedia.org/T125596#2019962 (10demon) a:5hashar>3demon I ended up doing the rest of the train. [20:16:32] 6Release-Engineering-Team, 5Patch-For-Review, 5WMF-deploy-2016-02-09_(1.27.0-wmf.13): MW 1.27.0-wmf.13 deployment blockers - https://phabricator.wikimedia.org/T125596#2019964 (10demon) 5Open>3Resolved [20:18:32] ostriches: \O/ [20:20:26] Yippee, build fixed! [20:20:27] Project beta-update-databases-eqiad build #6407: 09FIXED in 25 sec: https://integration.wikimedia.org/ci/job/beta-update-databases-eqiad/6407/ [20:25:08] 10Beta-Cluster-Infrastructure: rebuild deployment-bastion on trusty - https://phabricator.wikimedia.org/T126537#2019986 (10dduvall) [20:25:27] hashar: Nice! Does that mean there's progress towards fixing beta-scap-eqiad? [20:26:06] RoanKattouw: I have no clue :-D [20:26:17] but apparently yeah Jenkins looks happy [20:26:50] Hmm so you just rebuild deployment-bastion on trusty [20:26:52] RoanKattouw: mighty Tyler is rebuilding it to Jessie [20:27:02] or trusty [20:27:03] yeah no clue [20:27:05] But https://phabricator.wikimedia.org/T126573#2017725 says this needs rebuilding it on Jessie instead [20:27:05] OK [20:27:09] stuff happens for sure! [20:27:12] * hashar blames 5.5 [20:29:09] !log pooling integration-slave-trusty-1004 integration-slave-trusty-1005 integration-slave-trusty-1006 [20:29:11] Logged the message at https://wikitech.wikimedia.org/wiki/Release_Engineering/SAL, Master [20:29:42] andrewbogott: we have plenty of Trusty instances for CI now!! 
thank you a lot : https://integration.wikimedia.org/ci/label/UbuntuTrusty/ [20:32:49] 20:24:49 [Thu Feb 11 20:24:49 2016] [hphp] [14131:7f7c01f42d00:0:000001] [] Uncaught exception: Could not open extension /usr/lib/x86_64-linux-gnu/hhvm/extensions/current/luasandbox.so: /usr/lib/x86_64-linux-gnu/hhvm/extensions/current/luasandbox.so: cannot open shared object file: No such file or directory [20:32:57] On integration-slave-trusty-1002 [20:33:03] https://integration.wikimedia.org/ci/job/mwext-Echo-testextension-hhvm/42/console [20:33:06] I'll try rerunning it first [20:33:17] RoanKattouw: na not going to pass [20:33:26] RoanKattouw: i guess the hhvm-luasandbox package is broken [20:33:33] or the link is populated by puppet and that is broken as well [20:33:37] :/ [20:33:42] or the hhvm conf is wrong bla [20:34:04] !task /usr/lib/x86_64-linux-gnu/hhvm/extensions/current/luasandbox.so: /usr/lib/x86_64-linux-gnu/hhvm/extensions/current/luasandbox.so +continuous-integration-infrastructure [20:34:04] https://phabricator.wikimedia.org//usr/lib/x86_64-linux-gnu/hhvm/extensions/current/luasandbox.so: [20:34:07] ... [20:34:18] twentyafterfour: phabricator is broken, I can't file a task from IRC? :-D [20:35:05] !log depooling the six recent slaves: /usr/lib/x86_64-linux-gnu/hhvm/extensions/current/luasandbox.so cannot open shared object file [20:35:07] Logged the message at https://wikitech.wikimedia.org/wiki/Release_Engineering/SAL, Master [20:40:00] 10Continuous-Integration-Config: Remove php55 jobs from "test" pipeline - https://phabricator.wikimedia.org/T126655#2020061 (10Legoktm) 3NEW a:3Legoktm [20:40:35] 10Continuous-Integration-Config: Remove php55 jobs from "test" pipeline - https://phabricator.wikimedia.org/T126655#2020072 (10Paladox) [20:41:13] 10Continuous-Integration-Config: Remove php55 jobs from "test" pipeline - https://phabricator.wikimedia.org/T126655#2020061 (10Paladox) Is this for all tests or just for mediawiki/core. Because php53 is in the test pipeline for most tests. [20:41:58] 10Continuous-Integration-Config: Remove php55 jobs from "test" pipeline - https://phabricator.wikimedia.org/T126655#2020090 (10Legoktm) Sorry, mediawiki/core and extension-gate. [20:42:09] 10Continuous-Integration-Config: Remove php55 jobs from "test" pipeline for mediawiki/core and extension-gate - https://phabricator.wikimedia.org/T126655#2020101 (10Legoktm) [20:43:10] 10Continuous-Integration-Infrastructure, 6operations, 7HHVM: /usr/lib/x86_64-linux-gnu/hhvm/extensions/current/luasandbox.so no such file or directory - https://phabricator.wikimedia.org/T126658#2020104 (10hashar) 3NEW [20:43:52] 10Continuous-Integration-Config: Remove php55 jobs from "test" pipeline for mediawiki/core and extension-gate - https://phabricator.wikimedia.org/T126655#2020122 (10Paladox) @Legoktm thanks for replying. Do we remove mediawiki-core-php55lint, mediawiki-phpunit-php55, mediawiki-phpunit-parsertests-php55 from the...
[20:44:17] 10scap: include refreshCdbJsonFiles in scap's debian package - https://phabricator.wikimedia.org/T126660#2020123 (10mmodell) 3NEW [20:44:28] 10scap: include refreshCdbJsonFiles in scap's debian package - https://phabricator.wikimedia.org/T126660#2020130 (10mmodell) p:5Triage>3High [20:44:39] 10Continuous-Integration-Config: Remove php55 jobs from "test" pipeline for mediawiki/core and extension-gate - https://phabricator.wikimedia.org/T126655#2020133 (10hashar) Well tried, but yeah 5.5 is not that much faster after all :-( antoine-approve [20:45:11] 10Continuous-Integration-Infrastructure, 6operations, 7HHVM: /usr/lib/x86_64-linux-gnu/hhvm/extensions/current/luasandbox.so no such file or directory - https://phabricator.wikimedia.org/T126658#2020138 (10Paladox) [20:45:22] 10scap: include refreshCdbJsonFiles in scap's debian package - https://phabricator.wikimedia.org/T126660#2020123 (10mmodell) [20:45:27] 3Scap3, 10scap, 6Phabricator, 7WorkType-NewFunctionality: Deploy Phabricator with scap3 - https://phabricator.wikimedia.org/T114363#2020139 (10mmodell) [20:46:40] (03PS1) 10Legoktm: Remove php55 jobs from test pipeline in mw/core and extension-gate repos [integration/config] - 10https://gerrit.wikimedia.org/r/270098 (https://phabricator.wikimedia.org/T126655) [20:47:27] (03PS1) 10Paladox: Remove php55 from test pipeline in templates mediawiki/core and extension-gate [integration/config] - 10https://gerrit.wikimedia.org/r/270100 (https://phabricator.wikimedia.org/T126655) [20:47:52] 3Scap3: include refreshCdbJsonFiles in scap's debian package - https://phabricator.wikimedia.org/T126660#2020163 (10mmodell) [20:47:58] (03Abandoned) 10Paladox: Remove php55 from test pipeline in templates mediawiki/core and extension-gate [integration/config] - 10https://gerrit.wikimedia.org/r/270100 (https://phabricator.wikimedia.org/T126655) (owner: 10Paladox) [20:48:58] PROBLEM - Puppet failure on mira is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [0.0] [20:50:19] (03CR) 10Legoktm: [C: 032] Remove php55 jobs from test pipeline in mw/core and extension-gate repos [integration/config] - 10https://gerrit.wikimedia.org/r/270098 (https://phabricator.wikimedia.org/T126655) (owner: 10Legoktm) [20:50:27] (03CR) 10Paladox: [C: 031] Remove php55 jobs from test pipeline in mw/core and extension-gate repos [integration/config] - 10https://gerrit.wikimedia.org/r/270098 (https://phabricator.wikimedia.org/T126655) (owner: 10Legoktm) [20:52:02] (03Merged) 10jenkins-bot: Remove php55 jobs from test pipeline in mw/core and extension-gate repos [integration/config] - 10https://gerrit.wikimedia.org/r/270098 (https://phabricator.wikimedia.org/T126655) (owner: 10Legoktm) [20:52:05] 10Continuous-Integration-Infrastructure, 6operations, 7HHVM: /usr/lib/x86_64-linux-gnu/hhvm/extensions/current/luasandbox.so no such file or directory - https://phabricator.wikimedia.org/T126658#2020180 (10hashar) The link is provided by puppet. `modules/hhvm/manifests/init.pp` has: ``` hhvm... 
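hashar's quote of modules/hhvm/manifests/init.pp above is truncated right after `hhvm`. For context, a hedged sketch of the kind of symlink resource at issue, not the actual ops/puppet code; note the 21:18 !log below finds the link is in fact only provided by the upstart script:

```puppet
# Hypothetical sketch: expose the versioned HHVM extension directory at a
# stable 'current' path so luasandbox.so resolves. The target name below
# is assumed, not taken from ops/puppet.
file { '/usr/lib/x86_64-linux-gnu/hhvm/extensions/current':
    ensure => link,
    target => '/usr/lib/x86_64-linux-gnu/hhvm/extensions/20150212',
}
```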
[20:52:50] !log deploying https://gerrit.wikimedia.org/r/270098 [20:52:53] Logged the message at https://wikitech.wikimedia.org/wiki/Release_Engineering/SAL, Master [20:54:48] (03Abandoned) 10Paladox: Add php55 to mwext-testextension- [integration/config] - 10https://gerrit.wikimedia.org/r/269189 (owner: 10Paladox) [20:55:45] 10Continuous-Integration-Config, 5Patch-For-Review: Remove php55 jobs from "test" pipeline for mediawiki/core and extension-gate - https://phabricator.wikimedia.org/T126655#2020202 (10Legoktm) 5Open>3Resolved [20:55:52] legoktm: Could you review https://gerrit.wikimedia.org/r/#/c/269188/ please, it is to do with php55 [20:57:15] paladox: I just changed it so you comment "check php5" and it will run the php55 or php53 tests depending on the branch [20:57:30] legoktm: Ok thanks. [20:59:01] PROBLEM - Puppet failure on deployment-mediawiki02 is CRITICAL: CRITICAL: 50.00% of data above the critical threshold [0.0] [21:01:02] paladox: legoktm does that still support 'check zend' ? :} [21:01:12] yes [21:01:18] +1 [21:01:35] check (php53?|zend) [21:01:37] good to see php55 jobs running around [21:01:42] yeah that looks like a good regex [21:02:03] legoktm: would 'check php' check both php versions? Would check-hhvm also work? [21:02:52] 10Continuous-Integration-Infrastructure, 6operations, 7HHVM: /usr/lib/x86_64-linux-gnu/hhvm/extensions/current/luasandbox.so no such file or directory - https://phabricator.wikimedia.org/T126658#2020239 (10hashar) [21:02:54] 10Continuous-Integration-Infrastructure, 5Patch-For-Review: Disable HHVM fcgi server on CI slaves - https://phabricator.wikimedia.org/T126594#2020238 (10hashar) [21:03:28] paladox: no, it only runs the ones that are set up for that branch. there's no point in check-hhvm since those are run automatically for every patch [21:03:45] legoktm: Ok thanks. [21:05:23] legoktm: The change we just did to 'check php': in Wikibase it has php53:, would 'check php53' work there or would we need to change it? [21:06:22] legoktm: Also check zend and check php5 are not working for me https://gerrit.wikimedia.org/r/#/c/266119/ [21:07:54] paladox: because that repository doesn't have a php5 pipeline. Only mediawiki/core, and extensions in the extension-gate repository do [21:10:07] legoktm: Oh ok. Looks like mediawiki/core doesn't have php55: but has php53: [21:10:28] Should we change that to php5 and include the php5[35] tests. [21:10:41] er, I just changed it to php5 [21:10:48] see https://gerrit.wikimedia.org/r/270098 [21:11:11] whoops, never mind, sorry. I didn't rebase. [21:11:16] I've rebased now. [21:11:38] 3Scap3: include refreshCdbJsonFiles in scap's debian package - https://phabricator.wikimedia.org/T126660#2020282 (10mmodell) [21:14:46] (03PS1) 10Paladox: Adding php55 tests to wikibase repo [integration/config] - 10https://gerrit.wikimedia.org/r/270109 [21:15:17] legoktm: Could you review https://gerrit.wikimedia.org/r/#/c/270109/ please. It is about adding php55 tests to wikibase [21:15:45] (03CR) 10Paladox: "Ive added the tests to experimental: to make sure they work and pass before upping them to test."
[integration/config] - 10https://gerrit.wikimedia.org/r/270109 (owner: 10Paladox) [21:17:30] 10Continuous-Integration-Infrastructure, 5Patch-For-Review: CI trusty slaves running out of memory - https://phabricator.wikimedia.org/T126545#2020298 (10hashar) [21:17:32] 10Continuous-Integration-Infrastructure, 5Patch-For-Review: Disable HHVM fcgi server on CI slaves - https://phabricator.wikimedia.org/T126594#2020296 (10hashar) 5Open>3stalled stalled / pending symlink fix :D [21:18:25] !log re enabling hhvm service on slaves ( https://phabricator.wikimedia.org/T126594 ) Some symlink is missing and only provided by the upstart script grrrrrrr https://phabricator.wikimedia.org/T126658 [21:18:27] Logged the message at https://wikitech.wikimedia.org/wiki/Release_Engineering/SAL, Master [21:19:14] PROBLEM - Puppet failure on deployment-mediawiki01 is CRITICAL: CRITICAL: 88.89% of data above the critical threshold [0.0] [21:28:36] !log pooling back slaves 1001 to 1006 [21:28:39] Logged the message at https://wikitech.wikimedia.org/wiki/Release_Engineering/SAL, Master [21:28:48] RoanKattouw: luasandbox should be found by hhvm now [21:28:58] hashar: Would we be able to migrate the jsonlint test to Trusty, since jshint is on Trusty? [21:30:04] na jshint should disappear [21:30:18] I also don't think we want to push more jobs to trusty, at least not right now [21:30:29] PROBLEM - Puppet failure on integration-slave-trusty-1002 is CRITICAL: CRITICAL: 42.86% of data above the critical threshold [0.0] [21:30:42] I have doubled the number of executors :D [21:30:46] oh :D [21:30:55] (03PS1) 10Paladox: Migrate jsonlint to UbuntuTrusty [integration/config] - 10https://gerrit.wikimedia.org/r/270111 [21:31:18] legoktm hashar ok. [21:31:26] (03Abandoned) 10Paladox: Migrate jsonlint to UbuntuTrusty [integration/config] - 10https://gerrit.wikimedia.org/r/270111 (owner: 10Paladox) [21:32:27] PROBLEM - Puppet failure on integration-slave-trusty-1005 is CRITICAL: CRITICAL: 44.44% of data above the critical threshold [0.0] [21:33:07] paladox: no, I think it's a good idea, but maybe in a few days, not today [21:33:37] legoktm: Ok, I will reopen it for review whenever it is ready to be reviewed. [21:34:02] (03Restored) 10Paladox: Migrate jsonlint to UbuntuTrusty [integration/config] - 10https://gerrit.wikimedia.org/r/270111 (owner: 10Paladox) [21:34:06] 10Continuous-Integration-Infrastructure: Investigate installing php5.3 on trusty and/or debian instance - https://phabricator.wikimedia.org/T103786#2020339 (10hashar) [21:34:08] 10Continuous-Integration-Infrastructure, 5Patch-For-Review: Make /usr/bin/php a wrapper that picks the right PHP version on CI slaves - https://phabricator.wikimedia.org/T126211#2007557 (10hashar) [21:35:11] legoktm: Does this patch look ok for wikidata? https://gerrit.wikimedia.org/r/#/c/270109/ I've added it to experimental: to make sure it works. [21:35:48] 10Continuous-Integration-Infrastructure: Investigate installing php5.3 on trusty and/or debian instance - https://phabricator.wikimedia.org/T103786#1399172 (10hashar) T126211 provided us with a shell script that is set as the Debian alternative for `/usr/bin/php`. Setting env variable `PHP_BIN` lets us switch to...
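As an aside on the trigger being discussed: Zuul starts jobs when a Gerrit comment matches a pipeline's configured regular expression, so the behavior of the `check (php53?|zend)` pattern above can be sanity-checked with plain Python. A small sketch (only the pattern comes from the discussion; the harness around it is illustrative):

```python
import re

# Pattern from the discussion: matches "check php5", "check php53" and "check zend".
TRIGGER = re.compile(r'^check (php53?|zend)$')

for comment in ('check php5', 'check php53', 'check zend', 'check hhvm'):
    # "check hhvm" should not match: HHVM jobs already run on every patch.
    print('%s -> %s' % (comment, bool(TRIGGER.match(comment))))
```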
[21:36:01] paladox: looks correct, if Jan doesn't review it in a few days I can merge it [21:36:38] 10Continuous-Integration-Infrastructure: Investigate installing php5.3 on trusty and/or debian instance - https://phabricator.wikimedia.org/T103786#2020357 (10hashar) [21:36:40] 10Continuous-Integration-Config, 5Continuous-Integration-Scaling, 3releng-201516-q3, 7WorkType-NewFunctionality: [keyresult] Migrate php composer (Zend and HHVM) CI jobs to Nodepool - https://phabricator.wikimedia.org/T119139#2020356 (10hashar) [21:37:30] legoktm: Ok thanks. After we check to make sure it passes, we will probably want to split it into a template so we can blacklist branches 1_2[3-6]. Otherwise it might fail on those branches. [21:37:49] (03CR) 10Legoktm: [C: 031] "Lets do this in a day or two once the trusty slave situation settles down" [integration/config] - 10https://gerrit.wikimedia.org/r/270111 (owner: 10Paladox) [21:37:50] 10Continuous-Integration-Config, 5Continuous-Integration-Scaling, 3releng-201516-q3, 7WorkType-NewFunctionality: [keyresult] Migrate php composer (Zend and HHVM) CI jobs to Nodepool - https://phabricator.wikimedia.org/T119139#1819085 (10hashar) >>! In T119139#1997906, @hashar wrote: > We first need bits to... [21:39:39] (03PS2) 10Paladox: Adding php55 tests to wikibase repo [integration/config] - 10https://gerrit.wikimedia.org/r/270109 (https://phabricator.wikimedia.org/T126441) [21:40:09] 10Continuous-Integration-Config: Have CI set `$wgScribuntoDefaultEngine = 'luasandbox` to speed up parser tests - https://phabricator.wikimedia.org/T126670#2020381 (10hashar) 3NEW [21:40:23] RECOVERY - Puppet failure on integration-slave-trusty-1002 is OK: OK: Less than 1.00% above the threshold [0.0] [21:40:33] 10Browser-Tests-Infrastructure, 7JavaScript: Create a few tests using Nightwatch.js - https://phabricator.wikimedia.org/T126435#2020390 (10greg) What issues did you run up against? I know there is a general dislike of Java, but curious what your blockers were. [21:40:49] 10Continuous-Integration-Config: Have CI set `$wgScribuntoDefaultEngine = 'luasandbox` to speed up parser tests - https://phabricator.wikimedia.org/T126670#2020381 (10hashar) [21:40:56] (03CR) 10Paladox: "In a follow up patch if everything passes we would need to split the php55 tests for wikibase into a separate template so we can blacklist" [integration/config] - 10https://gerrit.wikimedia.org/r/270109 (https://phabricator.wikimedia.org/T126441) (owner: 10Paladox) [21:43:13] 10Continuous-Integration-Config: Have CI set `$wgScribuntoDefaultEngine = 'luasandbox` to speed up parser tests - https://phabricator.wikimedia.org/T126670#2020396 (10Paladox) How would it be set in there, and do we create a new php file for that config? [21:43:21] 10Continuous-Integration-Config: Have CI set `$wgScribuntoDefaultEngine = 'luasandbox` to speed up parser tests - https://phabricator.wikimedia.org/T126670#2020397 (10Anomie) BTW, if you don't already have the php-luasandbox or hhvm-luasandbox packages installed (matching whichever PHP you're running) on the CI... [21:43:24] 10Continuous-Integration-Config, 5Patch-For-Review: Deprecate global CodeSniffer rules repo - https://phabricator.wikimedia.org/T66371#2020399 (10hashar) 5Open>3Resolved a:3hashar @Dzahn has merged the patch \O/ [21:44:43] 10Continuous-Integration-Config: translatewiki.net phplint job should allow PHP 5.4 syntax - https://phabricator.wikimedia.org/T97889#2020406 (10hashar) Looks like we can phase out Zend jobs for translatewiki.git repo and solely rely on hhvm.
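For reference, the `1_2[3-6]` fragment mentioned above would blacklist the REL1_23 through REL1_26 release branches in a job template. A rough illustration of how such a branch filter classifies names (only the `1_2[3-6]` fragment comes from the discussion; the full pattern and harness are assumptions):

```python
import re

# Hypothetical blacklist built around the 1_2[3-6] fragment discussed above.
BLACKLIST = re.compile(r'^REL1_2[3-6]$')

for branch in ('master', 'REL1_23', 'REL1_26', 'REL1_27'):
    action = 'skip php55 job' if BLACKLIST.match(branch) else 'run php55 job'
    print('%s: %s' % (branch, action))
```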
[21:44:55] 10Continuous-Integration-Config: translatewiki.net phplint job should allow PHP 5.4 syntax - https://phabricator.wikimedia.org/T97889#2020407 (10hashar) a:5Nikerabbit>3None [21:48:47] Yippee, build fixed! [21:48:47] Project beta-scap-eqiad build #89528: 09FIXED in 14 min: https://integration.wikimedia.org/ci/job/beta-scap-eqiad/89528/ [21:49:10] 10Continuous-Integration-Config: Have CI set `$wgScribuntoDefaultEngine = 'luasandbox` to speed up parser tests - https://phabricator.wikimedia.org/T126670#2020422 (10hashar) The CI slaves are provisioned via the Puppet classes `mediawiki::packages` which also provision the Wikimedia production application serve... [21:49:16] thcipriani: Yippee fixed! [21:49:45] hashar: yippee! hacked together :( [21:50:02] thcipriani: what do you mean? [21:50:38] hashar: pretty sure none of the non-mediawiki deployments will work at the moment. [21:50:50] still trying to dig into why [21:51:41] mostly because it doesn't seem to think that the deployment_server hiera variable is set to the instance FQDN [21:54:53] yay! [21:55:18] (too late yay, not yay'ing re other failures, obvs ;) ) [22:00:45] 10Browser-Tests-Infrastructure, 10Reading-Web, 5Patch-For-Review: Fix failed MobileFrontend browsertests Jenkins jobs - https://phabricator.wikimedia.org/T94156#2020474 (10Jdlrobson) [22:00:55] 10Browser-Tests-Infrastructure, 10Reading-Web, 5Patch-For-Review: Fix failed MobileFrontend browsertests Jenkins jobs - https://phabricator.wikimedia.org/T94156#1156443 (10Jdlrobson) [22:06:12] earlier today I was talking to someone at the coworking place about how I am happy to not be a sysadmin [22:06:23] to which he asked me: so what are you doing today? [22:06:27] me: sysadmin [22:07:07] 10Beta-Cluster-Infrastructure, 6Labs: New labs instance fails running `block-for-project-export` before running mount - https://phabricator.wikimedia.org/T126568#2020557 (10greg) [22:07:09] 10Beta-Cluster-Infrastructure: rebuild deployment-bastion on trusty - https://phabricator.wikimedia.org/T126537#2020558 (10greg) [22:07:11] 10Beta-Cluster-Infrastructure: Beta Cluster updates broken due to php incompatibilities - https://phabricator.wikimedia.org/T126573#2020556 (10greg) [22:07:18] 10Continuous-Integration-Config, 7Easy: translatewiki.net phplint job should allow PHP 5.4 syntax - https://phabricator.wikimedia.org/T97889#2020563 (10hashar) [22:07:23] 10Beta-Cluster-Infrastructure, 6Labs: New labs instance fails running `block-for-project-export` before running mount - https://phabricator.wikimedia.org/T126568#2017518 (10greg) [22:07:25] 10Beta-Cluster-Infrastructure: rebuild deployment-bastion on trusty - https://phabricator.wikimedia.org/T126537#2016744 (10greg) [22:08:00] hashar: :) [22:11:26] greg-g: so status, we have roughly doubled the number of Trusty slaves [22:11:35] woohooo! [22:11:47] I really wanted to see what happened on the small 2GB ram slaves [22:11:53] and have been monitoring them all day [22:12:14] investigated a bunch of weird issues and blamed php 5.5 / facepalmed a lot [22:12:26] I even swore [22:12:43] in the end, deleted all the small ones, created new big slaves and pooled them after dinner [22:13:41] hashar: With this many slaves can we start running CI on i18n commits eventually? ;-) [22:14:37] hehe [22:14:41] I think we're still constrained by disk space on that [22:15:05] :-( [22:15:08] basically https://phabricator.wikimedia.org/T93703 [22:15:40] Could we shard the checkouts? [22:15:47] 10Beta-Cluster-Infrastructure: Beta Cluster updates broken due to php incompatibilities - https://phabricator.wikimedia.org/T126573#2020644 (10greg) ``` 21:48 < wmf-insec> Yippee, build fixed! 21:48 < wmf-insec> Project beta-scap-eqiad build #89528: FIXED in 14 min: https://integration.wikime... [22:16:12] So all slaves get mediawiki-gate extensions, and then one set of slaves gets remaining extensions A-E, another gets F-M, etc.? [22:16:23] we probably need to finish https://phabricator.wikimedia.org/T94327 [22:17:04] Is there a still-open ticket related to finishing that? [22:18:45] I guess not [22:18:46] * legoktm files [22:22:03] 10Continuous-Integration-Config: Convert all MediaWiki extension phpunit jobs to use generic jobs - https://phabricator.wikimedia.org/T126682#2020689 (10Legoktm) 3NEW [22:22:57] 10Continuous-Integration-Infrastructure, 5Patch-For-Review: CI trusty slaves running out of memory - https://phabricator.wikimedia.org/T126545#2020702 (10hashar) This is definitely solved. Here is the summary: On Feb 10th we have pooled ci.medium instances that only have 2GB of RAM. That was to accommodate th... [22:23:57] 10Beta-Cluster-Infrastructure, 15User-greg: Beta Cluster updates broken due to php incompatibilities - https://phabricator.wikimedia.org/T126573#2020715 (10greg) 5Open>3Resolved a:3greg And is continuing to run successfully: https://integration.wikimedia.org/ci/job/beta-scap-eqiad/ [22:24:02] 10Continuous-Integration-Infrastructure, 6Release-Engineering-Team: CI incidents - week of Feb 8th - https://phabricator.wikimedia.org/T126634#2020719 (10hashar) [22:24:10] 10Beta-Cluster-Infrastructure: Beta Cluster updates broken due to php incompatibilities - https://phabricator.wikimedia.org/T126573#2020721 (10greg) a:5greg>3thcipriani [22:24:52] 10Continuous-Integration-Infrastructure, 6Release-Engineering-Team: CI incidents - week of Feb 8th - https://phabricator.wikimedia.org/T126634#2019273 (10hashar) Added a quick report for {T126545} at T126545#2020702 [22:24:53] thcipriani: y'all are still working on the issues with services deploying in BC, right? [22:25:19] greg-g: yeah, also need to make patches for ops/puppet when finished. [22:25:35] kk, I might have gotten trigger happy with that resolving [22:25:49] thcipriani: should I hold off on sending a "BC is updating again" email? [22:26:20] greg-g: no, go for it. it is in fact updating. [22:26:30] word [22:26:34] thanks a ton, all [22:26:49] greg-g: should mention there is ongoing work and that deployment-bastion is no longer going to be a thing [22:27:38] 10Continuous-Integration-Infrastructure: Consider increasing number of trusty CI slaves - https://phabricator.wikimedia.org/T126423#2020766 (10hashar) The ci.medium flavor turned out to lack memory so we got m1.large instances instead. Report at T126545#2020702 Last 24 hours view for UbuntuTrusty: {F3334488 si...
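The sharding question above amounts to assigning each extension to a slave set by its first letter, with the gated extensions checked out everywhere. A minimal sketch of the bucketing, assuming the A-E / F-M split from the question (all names here are illustrative, not an actual CI config):

```python
# Hypothetical alphabetical shards for the long tail of extensions;
# gated extensions would be checked out on every slave regardless.
BUCKETS = {'A-E': 'ABCDE', 'F-M': 'FGHIJKLM', 'N-Z': 'NOPQRSTUVWXYZ'}

def bucket_for(extension):
    first = extension[0].upper()
    for name, letters in BUCKETS.items():
        if first in letters:
            return name
    return 'N-Z'  # digits and anything unusual land in the last bucket

print(bucket_for('Cite'))            # -> A-E
print(bucket_for('MobileFrontend'))  # -> F-M
```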
[22:27:54] oh [22:27:58] I was too quick [22:28:25] :D [22:29:02] greg-g: I will try to get an incident report written [22:29:26] still have to report actionables for wmf.12 / fosdem and the cache storage system doh [22:29:29] lot of documents [22:29:50] if you weren't so busy fighting fires! [22:29:52] :( [22:29:58] thank you [22:30:05] thcipriani: have you had any trouble with adding deployment-tin as a Jenkins slave? [22:30:24] really, even though the last few weeks have had their fair share of crappy fires, y'all have been working really well together and... yeah [22:30:54] yeah I have the feeling we are dealing with all the mess pretty well with good communication [22:30:56] hashar: marxarelli did that in a few minutes, so it didn't seem like it. Got the public key from ldap, seems to work. [22:31:18] marxarelli: I appoint you as the new teacher for 102 Jenkins - pooling slaves [22:31:32] thcipriani: greaaaat [22:31:44] guess we will be able to have a job running on deployment-tin to drive scap soonish [22:32:02] i.e. cd /srv/deployment/restbase/restbase && scap deploy [22:32:11] then grab popcorn [22:32:24] :D [22:32:56] hashar: :D [22:33:42] thcipriani: accepted D128 [22:34:38] 10Continuous-Integration-Config: Have CI set `$wgScribuntoDefaultEngine = 'luasandbox` to speed up parser tests - https://phabricator.wikimedia.org/T126670#2020829 (10Anomie) It'll throw an exception with the message "The luasandbox extension is not present, this engine cannot be used." if you try to use that en... [22:34:51] legoktm and hashar: Could you review https://gerrit.wikimedia.org/r/#/c/267540/ please. It's about adding parallel lint to mediawiki/core. [22:35:32] paladox: i think composer is still broken [22:35:59] hashar: Oh that https://gerrit.wikimedia.org/r/#/c/267540/ is following phpcs. [22:36:53] don't we have a php lint job still? [22:39:07] Project beta-scap-eqiad build #89534: 04FAILURE in 4 min 24 sec: https://integration.wikimedia.org/ci/job/beta-scap-eqiad/89534/ [22:41:17] hashar, legoktm: the test_zuul_scheduler.py file needs updating for php55; looking at it, it shows [22:41:23] def assertProjectHasPhplint(self, name, definition, pipeline): [22:41:24] self.assertTrue( [22:41:24] any([job for job in definition [22:41:24] if job.endswith('php53lint') or [22:41:24] job.startswith('composer-')]), [22:41:24] 'Project %s pipeline %s must have either ' [22:41:26] 'phplint or a composer-* job' [22:41:28] % (name, pipeline)) [22:41:46] hmm [22:42:06] Because they all do have php53lint and php55lint. [22:44:05] legoktm: But it uses job.endswith('php53lint'), so if php53 is not included in mediawiki/core it will cause it to fail: even though we would go for php55, it wants us to use php53. [22:44:14] (03CR) 10Hashar: "I think we are better sticking with the find | xargs php -l for now. This patch looks fine but it has too much overhead:" [integration/config] - 10https://gerrit.wikimedia.org/r/267540 (owner: 10Paladox) [22:44:28] reviewed [22:44:41] paladox: good catch :) [22:45:02] hashar: Thanks. [22:45:11] the trouble is that we now vary based on branch [22:45:13] :-/ [22:46:13] hashar: we can do this: if job.endswith(('php53lint', 'php55lint')) [22:46:16] I think we have some test example to check what jobs are triggered on a change [22:46:36] yeah maybe that is good enough [22:46:59] thcipriani: did you get scap on deployment-tin working? [22:47:09] marxarelli: :\ [22:47:14] no [22:47:21] oh.
that seemed like a yes [22:47:24] :P [22:48:14] (03PS3) 10Paladox: Add new test mediawiki-core-parallel-lint to mediawiki/core [integration/config] - 10https://gerrit.wikimedia.org/r/267540 [22:48:45] hashar: I've rebased and made some changes to https://gerrit.wikimedia.org/r/#/c/267540/ [22:48:52] marxarelli: it's not going to "work" until the redis returner works. I still can't figure out why it's stuck in read-only slave mode [22:49:20] is it still "connection refused"? [22:49:34] because i noticed that redis is only bound to localhost [22:49:56] and afaict the request is being made to the public interface [22:50:11] marxarelli: yeah, that's also part of the problem [22:50:36] but even if it were on a public interface, the fact that it thinks it's a read-only slave means that it wouldn't accept any writes [22:50:39] oh, fun. more servers more problems [22:50:50] (03PS1) 10Paladox: Update test for php55 [integration/config] - 10https://gerrit.wikimedia.org/r/270125 [22:51:00] also a problem [22:51:10] hashar, legoktm: I've updated the test at https://gerrit.wikimedia.org/r/270125 [22:52:08] thcipriani: maybe poke ops ? [22:52:17] there must be some reddit guru around [22:52:19] is there any way to just shutdown an instance without deleting it? I'd like to turn off deployment-bastion to make sure it's not interfering with anything, but I'd like to keep it around in case I need anything from it. [22:52:19] redis [22:52:33] yeah shutdown -H now or something like that [22:52:42] i.e. just like you would shutdown a server / your machine [22:53:02] it should then show up in state SHUTDOWN [22:53:22] kk [22:53:22] else poke labs folks to have them shut it down via the OpenStack cli [22:53:26] thank you :) [22:53:43] !log shutting down deployment-bastion [22:53:45] Logged the message at https://wikitech.wikimedia.org/wiki/Release_Engineering/SAL, Master [22:55:18] heh, never seen this before: molly-guard: SSH session detected! [22:55:18] hashar: https://gerrit.wikimedia.org/r/#/c/270125/ [22:55:23] thcipriani: looks like it might be another problem with the fqdn [22:55:44] [14:52:17] there must be some reddit guru around <-- ha! [22:55:52] marxarelli: you mean in deployment::redis? [22:55:58] ::deployment::redis tests whether $::fqdn != $deployment_server [22:55:59] yeah [22:56:15] PROBLEM - Host deployment-bastion is DOWN: CRITICAL - Host Unreachable (10.68.16.58) [22:56:43] thcipriani: molly-guard hijacks shutdown to prevent you from killing the wrong server. E.g. when attempting to shut down tin.eqiad.wmnet, writing out the hostname will probably make you notice you are killing the prod server [22:56:45] marxarelli: but you determined earlier that facter -p fqdn == deployment-tin.deployment-prep.eqiad.wmflabs [22:56:56] true. [22:57:09] paladox: will do some review tomorrow ;) [22:57:17] hashar: Ok. [22:57:30] thcipriani: let's add logging to that class to make sure puppet is getting the same facts that we are [22:57:52] marxarelli: do eet. [22:57:58] kk [22:58:43] paladox: speaking of reviews, I have a nice bookmark to list changes for which I am a reviewer and I haven't voted on yet. https://gerrit.wikimedia.org/r/#/q/is:open+reviewer:self+label:code-review%253D0%252Cself+NOT+owner:self,n,z [22:59:13] paladox: looks like you are popular! [22:59:33] hashar: Yes.
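For reference, combining the test_zuul_scheduler.py excerpt pasted above with the suggested endswith tuple, the updated assertion would plausibly read as follows (a sketch of the proposal, meant to live inside the existing test class, not necessarily the code as merged):

```python
def assertProjectHasPhplint(self, name, definition, pipeline):
    # Accept either PHP lint flavor, since lint jobs now vary by branch
    # (php53 on old release branches, php55 elsewhere).
    self.assertTrue(
        any([job for job in definition
             if job.endswith(('php53lint', 'php55lint')) or
             job.startswith('composer-')]),
        'Project %s pipeline %s must have either '
        'phplint or a composer-* job'
        % (name, pipeline))
```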
[23:03:35] RECOVERY - Puppet failure on deployment-memc02 is OK: OK: Less than 1.00% above the threshold [0.0] [23:04:26] thcipriani: "fqdn: 'deployment-tin.deployment-prep.eqiad.wmflabs' deployment_server: 'deployment-tin.deployment-prep.eqiad.wmflabs'" [23:04:33] indeed they are the same [23:05:14] time to use notify? [23:06:19] hashar: that's from notify [23:06:26] greaaat [23:06:30] :) [23:06:32] i love that hack [23:06:57] thcipriani: so i don't think puppet is managing /etc/redis/redis.conf on that server [23:07:24] but it _is_ managing /etc/redis/tcp_6379.conf which includes /etc/redis/redis.conf [23:08:26] marxarelli: yeah, it's redis-instance_tcp-6379 (name of the service or some such) [23:08:45] redis-instance-tcp_6379 [23:09:02] sometimes /var/lib/puppet/state/resources.txt can help. It lists all resources known to puppet [23:09:21] eg: file[/etc/redis/tcp_6379.conf] [23:09:30] file[/etc/init/redis-instance-tcp_6379.conf] [23:10:42] marxarelli: if you check out the process list, /usr/bin/redis-server /etc/redis/tcp_6379.conf isn't actually running even though it seems to start [23:12:58] ah, nvmd :( [23:13:39] i think either way, the problem is that redis.conf from the package defaults it to a read-only slave [23:14:09] `dpkg --fsys-tarfile /var/cache/apt/archives/redis-server_2%3a2.8.4-2+wmf1_amd64.deb | tar -xOf - ./etc/redis/redis.conf | grep read-only` [23:15:18] yeah, it gets included in the tcp_6379.conf file [23:15:22] redis-cli config get slave-read-only [23:15:24] "yes" [23:15:39] or wait, does this setting not even matter if it's not set up as a slave? [23:15:44] :/ [23:15:55] * marxarelli rtfms [23:16:21] marxarelli: it should. Plus the fact that it's bound to localhost is set up in that file [23:16:28] https://phabricator.wikimedia.org/T124720 [23:18:05] but there's no `slaveof` [23:23:06] is jenkins/zuul stalled again? and if so is it because dns is still screwed up? [23:23:11] marxarelli: Imma turn back on deployment-bastion. see what it looked like there. [23:23:57] thcipriani: cool. i'm curious to see whether deployment-bastion resolves to 127.0.0.1 which would explain why redis bound to lo wouldn't be a problem there [23:24:53] andrewbogott: doesn't seem particularly busy [23:25:10] the issue we were having with dns was affecting beta cluster [23:25:14] oh, ok [23:25:23] I'm just wondering why https://gerrit.wikimedia.org/r/#/c/268834/ isn't merging [23:25:26] but I can be patient [23:25:33] RECOVERY - Host deployment-bastion is UP: PING OK - Packet loss = 0%, RTA = 4.39 ms [23:26:00] andrewbogott: it has a depends-on [23:26:18] oh, of course. crap [23:26:22] I guess I'll just remove that [23:26:52] marxarelli: check out deployment-bastion, /etc/redis/redis.conf is good there [23:26:57] andrewbogott: but since you brought it up, any chance we could get a fix of those ptr records? :) [23:27:07] marxarelli: I did, I think? [23:27:11] oh! [23:27:13] i'll check [23:27:39] andrewbogott: you did! [23:27:41] thanks! [23:27:45] sure [23:27:59] it worries me that it happened, but it's not happening /right now/ so I'm not sure how to proceed [23:28:03] other than just mop up the duplicates [23:28:21] yeah, it seemed very random [23:28:29] the names it gave back anyway [23:28:45] all dead instances if that helps narrow it down any [23:29:45] what happens is that the 'clean up dns instances' code fails for a while, leaking dns entries [23:29:47] then, months later... [23:30:00] new instances try to reuse those addresses and there are collisions [23:30:12] dns ghosts ...
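The `redis-cli config get slave-read-only` check above can also be run from a script, which is handy when comparing several hosts. A rough sketch using the redis-py client, assuming it is installed and run locally, since the packaged config binds redis only to 127.0.0.1:

```python
import redis  # assumes the redis-py package is available

# Connect over loopback: the packaged redis.conf has "bind 127.0.0.1".
r = redis.StrictRedis(host='127.0.0.1', port=6379)

# Expected {'slave-read-only': 'yes'} on the broken host, per the log.
print(r.config_get('slave-read-only'))

# 'role' tells whether slave-read-only even applies; a role of 'slave'
# with no slaveof line would be the confusing case discussed above.
print(r.info('replication').get('role'))
```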
[23:30:14] So it's always, "this was broken for a while, sometime a few months ago" [23:30:15] * marxarelli shudders [23:30:30] i guess they're more like zombies [23:30:38] I should maybe make a cronjob that compares dns with the nova list and purges anything nova doesn't know about [23:31:39] ah, so when an instance is deleted, the deletion of the ptr record fails? [23:32:38] thcipriani: yeah, you're right. the packaged version of redis.conf is different for precise [23:33:13] marxarelli: yeah, I think this may be a package file difference, but if that were the case how are production trebuchet deployments working? [23:37:15] thcipriani: redis is listening on 0.0.0.0:6379 on mira [23:37:32] RECOVERY - Puppet failure on deployment-elastic06 is OK: OK: Less than 1.00% above the threshold [0.0] [23:39:50] Yippee, build fixed! [23:39:50] Project beta-scap-eqiad build #89541: 09FIXED in 5 min 8 sec: https://integration.wikimedia.org/ci/job/beta-scap-eqiad/89541/ [23:39:55] marxarelli: I have a thought. [23:40:09] I think that our redis is installed from the system and not the wikimedia repo [23:40:12] thcipriani: oh, neat. i was just typing 'i have no clue' [23:40:33] they're both 2.8.4-2+wmf1 [23:40:39] according to dpkg -l [23:41:05] yeah, just looked at that :\ [23:41:19] ostriches: MaxSem's new repo doesn't seem to be replicated to GitHub -- https://gerrit.wikimedia.org/r/#/admin/projects/HtmlFormatter [23:41:32] "I have no clue" got ported ...... https://github.com/nvbn/thefuck#the-fuck- [23:42:10] lol [23:42:16] bd808: creations aren't automatic anymore [23:42:39] Create and it'll replicate :) [23:43:02] marxarelli: thcipriani: what is your issue with redis ? [23:43:25] hashar: the /etc/redis/redis.conf file is set up to be a slave instance and it's only listening on localhost [23:43:33] redis.conf:bind 127.0.0.1 [23:43:38] that is the localhost part [23:44:03] looks like the /etc/redis/redis.conf is provided by the .deb package [23:44:10] and puppet no longer / never overrides it [23:44:20] so sounds like that got live hacked somehow maybe [23:44:51] hashar: the package contains the `bind 127.0.0.1` line [23:45:07] but mira is using the same package and is listening on 0.0.0.0 [23:45:27] hashar: <+marxarelli> `dpkg --fsys-tarfile /var/cache/apt/archives/redis-server_2%3a2.8.4-2+wmf1_amd64.deb | tar -xOf - ./etc/redis/redis.conf | grep read-only` [23:45:32] download both /etc/redis and colordiff -ru them ? [23:46:01] hashar: er, permission denied :) [23:46:08] pfff [23:46:21] let's claim we can get root on tin/mira ;-} [23:46:49] 3 pints of beer it got live hacked on tin originally [23:46:52] then copy-pasted to mira [23:47:42] I am not sure what dpkg --fsys-tarfile does [23:48:02] but potentially the package might have another conf/patch file that is injected when the package is installed [23:48:16] i guess we should just add a bind => '0.0.0.0' to deployment::redis [23:48:30] it will ensure it's consistent at least [23:49:44] thcipriani: ^ is there an open patch dealing with deployment-tin that we can slap that on? [23:50:01] ostriches: so I just make an empty repo on the github side?
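The cleanup cron idea floated above (compare DNS against the nova list, purge anything nova doesn't know about) reduces to a set difference. A minimal sketch with both lookups stubbed out, since the real DNS and nova APIs are not shown in the log; all names here are illustrative:

```python
def dns_hostnames():
    # Stub: in reality this would enumerate records in the labs DNS zone.
    return {'deployment-tin', 'deployment-bastion', 'old-ghost-instance'}

def nova_hostnames():
    # Stub: in reality this would come from `nova list` for each project.
    return {'deployment-tin', 'deployment-bastion'}

# Leaked records are DNS entries with no backing instance: purge candidates.
leaked = dns_hostnames() - nova_hostnames()
for name in sorted(leaked):
    print('would purge stale DNS record: %s' % name)
```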
[23:50:02] so [23:50:02] I went to http://packages.ubuntu.com/trusty/redis-server [23:50:07] marxarelli: all patches that came out of this adventure are still on deployment-puppetmaster [23:50:07] bd808: Yerppppp [23:50:08] got the debian dir at http://archive.ubuntu.com/ubuntu/pool/universe/r/redis/redis_2.8.4-2.debian.tar.gz [23:50:32] marxarelli: the debian package has a debian/redis.conf with: bind 127.0.0.1 [23:50:39] thcipriani: oh, fantastic. any objection to or reason why we shouldn't do that? [23:50:43] hashar: yeah [23:50:45] and slave-read-only yes [23:50:52] so mira must have been live hacked [23:51:10] that seems like a legit inference [23:51:20] :/ [23:51:33] i like your idea of overriding it at the tcp_xxx_instance.conf level [23:51:39] i.e. bind => '0.0.0.0' to deployment::redis [23:51:48] that is self-explanatory when looking in puppet [23:52:11] marxarelli: no objections. [23:52:32] okey doke. let's add it to deployment-puppetmaster for now [23:53:32] we will eventually want to clarify how it is set up on prod mira [23:56:17] marxarelli: looking at the diff on deployment-puppetmaster, you may have a mistake unless you're not done... [23:56:29] thcipriani: just saw that :) [23:56:44] applying again [23:56:50] have a good setup! sleepiness arrives here [23:56:50] * thcipriani ran git stalk marxarelli [23:57:04] * hashar TIL git stalk [23:57:04] hashar: goodnight! [23:57:37] :D later hashar [23:58:14] * greg-g waves to hashar [23:59:09] * marxarelli made another mistake. re-running puppet agent ...
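For completeness: once the bind => '0.0.0.0' override lands via deployment::redis, a plain TCP connect to the instance's non-loopback address is a quick way to confirm redis is no longer bound only to localhost. A sketch under that assumption (the FQDN is the one mentioned earlier in the log; the check itself is illustrative):

```python
import socket

# Target from the log; this only succeeds if redis is bound beyond 127.0.0.1
# (or if the script is run on the host itself).
HOST, PORT = 'deployment-tin.deployment-prep.eqiad.wmflabs', 6379

try:
    sock = socket.create_connection((HOST, PORT), timeout=5)
    sock.close()
    print('redis reachable on %s:%d' % (HOST, PORT))
except socket.error as exc:
    print('connect failed: %s' % exc)
```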