[00:10:26] Release-Engineering, MediaWiki-Debug-Logging, MediaWiki-General-or-Unknown, MW-1.23-release, and 2 others: Create a minimal backport of PSR-3 logging to MediaWiki 1.23 LTS - https://phabricator.wikimedia.org/T91653#1202022 (bd808) a:bd808
[00:10:48] Release-Engineering, MediaWiki-Debug-Logging, MediaWiki-General-or-Unknown, MediaWiki-Tarball-Backports, and 3 others: Create a minimal backport of PSR-3 logging to MediaWiki 1.23 LTS - https://phabricator.wikimedia.org/T91653#1092396 (bd808)
[00:10:52] RECOVERY - Free space - all mounts on deployment-bastion is OK: OK: All targets OK
[02:44:59] bd808: thanks
[02:47:41] Beta-Cluster, Deployment-Systems: l10nupdate on Beta Cluster filled /var - https://phabricator.wikimedia.org/T95868#1202066 (greg) NEW
[02:48:25] Beta-Cluster, Deployment-Systems: l10nupdate on Beta Cluster filled /var - https://phabricator.wikimedia.org/T95868#1202073 (greg)
[03:28:03] Beta-Cluster, database: Use External Store on Beta Labs - https://phabricator.wikimedia.org/T95871#1202118 (Mattflaschen) NEW
[03:39:43] Beta-Cluster, database: Use External Store on Beta Cluster - https://phabricator.wikimedia.org/T95871#1202132 (greg)
[06:56:08] PROBLEM - Puppet failure on deployment-parsoid05 is CRITICAL: CRITICAL: 55.56% of data above the critical threshold [0.0]
[07:16:23] Beta-Cluster, Blocked-on-RelEng, ContentTranslation-Deployments, MediaWiki-extensions-ContentTranslation, and 3 others: Setup new wikis in Beta Cluster for Content Translation - https://phabricator.wikimedia.org/T90683#1202285 (Arrbee)
[07:21:10] RECOVERY - Puppet failure on deployment-parsoid05 is OK: OK: Less than 1.00% above the threshold [0.0]
[07:39:43] Beta-Cluster, Patch-For-Review: /var/lib/l10nupdate fills up deployment-bastion /var partition - https://phabricator.wikimedia.org/T95564#1202291 (hashar)
[07:39:43] Beta-Cluster, Deployment-Systems: l10nupdate on Beta Cluster filled /var - https://phabricator.wikimedia.org/T95868#1202290 (hashar)
[07:40:03] Beta-Cluster, Patch-For-Review: /var/lib/l10nupdate fills up deployment-bastion /var partition - https://phabricator.wikimedia.org/T95564#1194574 (hashar)
[07:41:34] Beta-Cluster, Patch-For-Review: /var/lib/l10nupdate fills up deployment-bastion /var partition - https://phabricator.wikimedia.org/T95564#1194574 (hashar) As @bd808 mentioned, the puppet class recreates the directory, whereas it used to be a symlink from /var/ to the extended disk space /srv/ Potentially...
[08:03:54] morning
[08:17:53] !sal
[08:17:53] https://wikitech.wikimedia.org/wiki/Release_Engineering/SAL
[08:34:47] !log restarting stuck Jenkins
[08:34:53] Logged the message, Master
[08:42:46] !log kill -9 jenkins because it was stuck in some deadlock related to the IRC plugin :(
[08:42:49] Logged the message, Master
[08:46:32] !log jenkins removed #wikimedia-qa IRC channel from the global configuration
[08:46:35] Logged the message, Master
[09:04:35] zeljkof: modules/zuul/ modules/contint manifests/roles/ci.pp modules/jenkins
[09:04:57] git clone -o gerrit ssh://gerrit.wikimedia.org:29418/operations/puppet.git
[09:05:24] hashar: I lost you
[09:05:28] yeah
[09:05:46] hashar: did you unplug the camera? :)
[09:56:09] Continuous-Integration: Re-create ci slaves (April 2015) - https://phabricator.wikimedia.org/T94916#1202404 (hashar)
[10:02:43] Continuous-Integration: Zuul: python git assert error assert len(fetch_info_lines) == len(fetch_head_info) - https://phabricator.wikimedia.org/T61991#1202413 (hashar) That is apparently solved in v3.2. Now pending upgrade of Zuul.
[10:03:32] Continuous-Integration, Continuous-Integration-Isolation, operations, Blocked-on-Operations, and 2 others: Create a Debian package for Zuul - https://phabricator.wikimedia.org/T48552#1202415 (hashar)
[10:03:33] Continuous-Integration: Zuul: python git assert error assert len(fetch_info_lines) == len(fetch_head_info) - https://phabricator.wikimedia.org/T61991#1202414 (hashar)
[10:03:49] Continuous-Integration: Upgrade Zuul server to latest upstream - https://phabricator.wikimedia.org/T94409#1202417 (hashar)
[10:03:50] Continuous-Integration: Zuul: python git assert error assert len(fetch_info_lines) == len(fetch_head_info) - https://phabricator.wikimedia.org/T61991#646666 (hashar)
[10:13:15] Continuous-Integration: job creation permission on jenkins for WMDE-Fisch - https://phabricator.wikimedia.org/T95546#1202463 (Lydia_Pintscher) Ok for @WMDE-Fisch to be added to the WMDE group.
[10:16:39] Continuous-Integration: Store Jenkins build output outside Jenkins (e.g. static storage) - https://phabricator.wikimedia.org/T53447#1202474 (hashar) Potentially, we can already write a wrapper that would send the log to a central storage. OpenStack uses swift with the settings being held directly in Zuul con...
[10:25:10] (PS18) Hashar: Jenkins job builder definition for CRM job [integration/config] - https://gerrit.wikimedia.org/r/195063 (https://phabricator.wikimedia.org/T91895) (owner: Awight)
[10:27:37] (CR) Hashar: [C: 2] "The job has been deployed by awight and is already triggered by Zuul." [integration/config] - https://gerrit.wikimedia.org/r/195063 (https://phabricator.wikimedia.org/T91895) (owner: Awight)
[10:27:46] (PS7) Hashar: wikimedia-fundraising-civicrm tests are voting [integration/config] - https://gerrit.wikimedia.org/r/195343 (https://phabricator.wikimedia.org/T91895) (owner: Awight)
[10:41:05] (CR) jenkins-bot: [V: -1] Jenkins job builder definition for CRM job [integration/config] - https://gerrit.wikimedia.org/r/195063 (https://phabricator.wikimedia.org/T91895) (owner: Awight)
[10:41:25] (CR) Hashar: [C: 2] Jenkins job builder definition for CRM job [integration/config] - https://gerrit.wikimedia.org/r/195063 (https://phabricator.wikimedia.org/T91895) (owner: Awight)
[10:42:47] !log reducing number of executors from 5 to 4
[10:43:09] Logged the message, Master
[10:43:42] !log reducing number of executors on Precise instances from 5 to 4 and on Trusty instances from 6 to 4. The Jenkins scheduler tends to assign the unified jobs to the same slave, which overloads a single slave while others are idling.
[10:43:44] Logged the message, Master
[10:44:54] (Merged) jenkins-bot: Jenkins job builder definition for CRM job [integration/config] - https://gerrit.wikimedia.org/r/195063 (https://phabricator.wikimedia.org/T91895) (owner: Awight)
[10:46:50] (CR) Hashar: [C: 2] "Confirmed it is passing at https://integration.wikimedia.org/ci/job/wikimedia-fundraising-civicrm/ :) Congratulations!" [integration/config] - https://gerrit.wikimedia.org/r/195343 (https://phabricator.wikimedia.org/T91895) (owner: Awight)
[10:48:29] (Merged) jenkins-bot: wikimedia-fundraising-civicrm tests are voting [integration/config] - https://gerrit.wikimedia.org/r/195343 (https://phabricator.wikimedia.org/T91895) (owner: Awight)
[10:54:11] (PS1) Hashar: zuul: pipelines graphs with drawNullAsZero=1 [integration/docroot] - https://gerrit.wikimedia.org/r/203802
[10:54:25] (CR) Hashar: [C: 2] zuul: pipelines graphs with drawNullAsZero=1 [integration/docroot] - https://gerrit.wikimedia.org/r/203802 (owner: Hashar)
[10:54:27] (Merged) jenkins-bot: zuul: pipelines graphs with drawNullAsZero=1 [integration/docroot] - https://gerrit.wikimedia.org/r/203802 (owner: Hashar)
[10:56:34] (PS1) Hashar: zuul: fix case for drawNullAsZero [integration/docroot] - https://gerrit.wikimedia.org/r/203803
[10:56:55] (CR) Hashar: [C: 2] zuul: fix case for drawNullAsZero [integration/docroot] - https://gerrit.wikimedia.org/r/203803 (owner: Hashar)
[10:56:57] (Merged) jenkins-bot: zuul: fix case for drawNullAsZero [integration/docroot] - https://gerrit.wikimedia.org/r/203803 (owner: Hashar)
[10:59:28] (PS1) Hashar: zuul: rephrase zuul processing graph [integration/docroot] - https://gerrit.wikimedia.org/r/203804
[10:59:44] (CR) Hashar: [C: 2] zuul: rephrase zuul processing graph [integration/docroot] - https://gerrit.wikimedia.org/r/203804 (owner: Hashar)
[11:18:22] Continuous-Integration, Fundraising Sprint House of Pain, Fundraising Tech Backlog, Wikimedia-Fundraising-CiviCRM, and 2 others: Write Jenkins job builder definition for CiviCRM CI job - https://phabricator.wikimedia.org/T91895#1202568 (hashar) @awight congratulations on acing the Jenkins configur...
[11:18:29] Continuous-Integration, Wikimedia-Fundraising-CiviCRM: CI for Civi: provision and run tests under Jenkins/Zuul - https://phabricator.wikimedia.org/T86103#1202571 (hashar)
[11:18:31] Continuous-Integration, Fundraising Sprint House of Pain, Fundraising Tech Backlog, Wikimedia-Fundraising-CiviCRM, and 2 others: Write Jenkins job builder definition for CiviCRM CI job - https://phabricator.wikimedia.org/T91895#1202569 (hashar) Open>Resolved a:hashar
[11:18:52] Continuous-Integration, Fundraising Sprint House of Pain, Fundraising Tech Backlog, Wikimedia-Fundraising-CiviCRM, and 2 others: Write Jenkins job builder definition for CiviCRM CI job - https://phabricator.wikimedia.org/T91895#1098353 (hashar) a:hashar>awight
[11:22:04] (PS2) Hashar: tests: factor out fake classes in a module [integration/config] - https://gerrit.wikimedia.org/r/203336
[11:24:56] (CR) Hashar: [C: 2] tests: factor out fake classes in a module [integration/config] - https://gerrit.wikimedia.org/r/203336 (owner: Hashar)
[11:26:45] (Merged) jenkins-bot: tests: factor out fake classes in a module [integration/config] - https://gerrit.wikimedia.org/r/203336 (owner: Hashar)
[11:30:05] (PS1) Hashar: mediawiki: set $wgDebugTimestamps [integration/jenkins] - https://gerrit.wikimedia.org/r/203806
[12:09:21] PROBLEM - Content Translation Server on deployment-cxserver03 is CRITICAL: Connection refused
[12:10:49] Continuous-Integration, operations: job creation permission on jenkins for WMDE-Fisch - https://phabricator.wikimedia.org/T95546#1202642 (JanZerebecki)
[12:19:22] RECOVERY - Content Translation Server on deployment-cxserver03 is OK: HTTP OK: HTTP/1.1 200 OK - 1103 bytes in 0.024 second response time
[12:54:54] Yippee, build fixed!
[12:54:55] Project browsertests-GettingStarted-en.wikipedia.beta.wmflabs.org-linux-firefox-sauce build #440: FIXED in 54 sec: https://integration.wikimedia.org/ci/job/browsertests-GettingStarted-en.wikipedia.beta.wmflabs.org-linux-firefox-sauce/440/
[13:02:07] (CR) Hashar: "If $HOME/.amp is just some cache of download packages it is probably safe to keep around. Whatever command generate content there can pro" [integration/config] - https://gerrit.wikimedia.org/r/203187 (https://phabricator.wikimedia.org/T91895) (owner: Awight)
[13:05:12] (PS5) Hashar: Fix failed Internet Explorer browser test jobs [integration/config] - https://gerrit.wikimedia.org/r/203063 (https://phabricator.wikimedia.org/T95398) (owner: Zfilipin)
[13:06:18] (CR) Hashar: [C: 2] Fix failed Internet Explorer browser test jobs [integration/config] - https://gerrit.wikimedia.org/r/203063 (https://phabricator.wikimedia.org/T95398) (owner: Zfilipin)
[13:06:35] (PS6) Hashar: Set version on Internet Explorer browser test jobs [integration/config] - https://gerrit.wikimedia.org/r/203063 (https://phabricator.wikimedia.org/T95398) (owner: Zfilipin)
[13:06:57] (CR) Hashar: [C: 2] Set version on Internet Explorer browser test jobs [integration/config] - https://gerrit.wikimedia.org/r/203063 (https://phabricator.wikimedia.org/T95398) (owner: Zfilipin)
[13:09:02] (CR) Hashar: "I have deleted the old jobs" [integration/config] - https://gerrit.wikimedia.org/r/203063 (https://phabricator.wikimedia.org/T95398) (owner: Zfilipin)
[13:15:51] Continuous-Integration: Remove pywiki* and integration* from 'mediawiki' gate queue. - https://phabricator.wikimedia.org/T93304#1202679 (hashar) Open>declined a:hashar Not much we can do. The root cause is that a lot of different projects share the same job and Zuul. @legoktm wrote a patch for Zuul...
[13:16:45] (Abandoned) Hashar: Make gate-and-submit an independent pipeline [integration/config] - https://gerrit.wikimedia.org/r/202958 (https://phabricator.wikimedia.org/T94322) (owner: Legoktm)
[13:18:14] (Merged) jenkins-bot: Set version on Internet Explorer browser test jobs [integration/config] - https://gerrit.wikimedia.org/r/203063 (https://phabricator.wikimedia.org/T95398) (owner: Zfilipin)
[13:19:08] Continuous-Integration, Patch-For-Review: Re-evaluate use of "Dependent Pipeline" in Zuul for gate-and-submit in the short term - https://phabricator.wikimedia.org/T94322#1202689 (hashar) >>! In T94322#1192568, @Legoktm wrote: > @hashar: What you described is an ideal situation, but the reality is that an...
[13:24:15] Browser-Tests, Patch-For-Review: IE Browser tests job have no test being run due to a mistake in cucumber tag - https://phabricator.wikimedia.org/T95398#1202693 (hashar) Seems jobs have been refreshed and should work now.
[13:25:08] (CR) Hashar: "That is nice! Thanks :)" [integration/config] - https://gerrit.wikimedia.org/r/201677 (https://phabricator.wikimedia.org/T93558) (owner: Krinkle)
[13:27:33] Beta-Cluster, Blocked-on-RelEng, ContentTranslation-Deployments, MediaWiki-extensions-ContentTranslation, and 3 others: Setup new wikis in Beta Cluster for Content Translation - https://phabricator.wikimedia.org/T90683#1202698 (hashar) I don't think anything is blocked on Releng. The [[ https://wik...
[13:34:11] hashar: Fun word pun.
[13:34:11] converting "fake" classes into real classes
[13:34:11] but still fake
[13:37:21] Release-Engineering, Jenkins: [Quarterly Goal] Jenkins Performance improvements - https://phabricator.wikimedia.org/T422#1202730 (hashar)
[13:37:22] Continuous-Integration, Release-Engineering: Create list of performance-related improvements for Jenkins jobs - https://phabricator.wikimedia.org/T423#1202727 (hashar) Open>declined a:hashar I don't think there is any point in keeping that task around.
[13:39:10] Browser-Tests, Continuous-Integration, Release-Engineering: Map operations/mediawiki-config/extension-list entries to Jenkins browser test job - https://phabricator.wikimedia.org/T456#1202732 (hashar) @greg Would you mind clarifying the task at hand? I am wondering what we are trying to achieve here :-]
[13:39:46] Beta-Cluster, Blocked-on-RelEng, ContentTranslation-Deployments, MediaWiki-extensions-ContentTranslation, and 3 others: Setup new wikis in Beta Cluster for Content Translation - https://phabricator.wikimedia.org/T90683#1202734 (KartikMistry) @hashar Just https://gerrit.wikimedia.org/r/#/c/202689 a...
[13:45:40] Project browsertests-UploadWizard-commons.wikimedia.beta.wmflabs.org-linux-firefox-sauce build #587: FAILURE in 14 min: https://integration.wikimedia.org/ci/job/browsertests-UploadWizard-commons.wikimedia.beta.wmflabs.org-linux-firefox-sauce/587/
[13:49:36] Continuous-Integration, Ops-Access-Requests, operations: Add user wmde-fisch to LDAP group wmde - https://phabricator.wikimedia.org/T95546#1202755 (hashar)
[13:50:07] Continuous-Integration, Ops-Access-Requests, operations: Add user wmde-fisch to LDAP group wmde - https://phabricator.wikimedia.org/T95546#1194224 (hashar) To make ops work easier, I have rephrased the task title and added some context to its description.
[13:55:42] hashar: ping. Okay to delete integration-slave100x? I deleted the old 140x trusty slaves already last week after 2 days of having the new ones run fine. I'm asking you this time because I see slave1004 was updated today.
[13:56:07] Krinkle: go go go :)
[13:56:17] Krinkle: also I lowered the # of executors on the slaves
[13:56:23] from 5 to 4 for Precise
[13:56:28] and from 6 to 4 for Trusty
[13:56:32] !log Delete old integration-slave1001...1004 (T94916)
[13:56:37] Logged the message, Master
[13:56:48] Jenkins has the annoying practice of assigning jobs to an instance that already ran a job
[13:57:03] hashar: Yes?
[13:57:11] so this morning I had 5 jobs on the 1001 precise slave while others were idling :/
[13:57:33] hashar: But if a 6th job comes in, it will use the other slave just fine.
[13:57:36] Why is that a problem?
[13:57:37] yup
[13:57:49] though the issue this morning was that 4 jobs were running mediawiki core tests
[13:57:54] This way we reduce duplicate workspaces.
[13:57:55] which caused the instance to be overloaded
[13:58:10] We do not currently have the capacity to host all workspaces on all slaves.
[13:58:10] and my lame tox-flake8 was occupying the last executor and took 12+ minutes to run :)
[13:58:19] PROBLEM - Host integration-slave1001 is DOWN: CRITICAL - Host Unreachable (10.68.16.60)
[13:58:23] PROBLEM - Host integration-slave1002 is DOWN: CRITICAL - Host Unreachable (10.68.16.175)
[13:58:24] PROBLEM - Host integration-slave1004 is DOWN: CRITICAL - Host Unreachable (10.68.17.4)
[13:58:39] This is why I re-created our pool +1 (4 instead of 5 precise slaves, and 6 instead of 5 trusty slaves)
[13:58:43] and also increased the executors by one
[13:59:02] yeah
[13:59:22] but if you end up with 4+ heavy jobs on the same instance, they are all suffering a long delay
[13:59:24] PROBLEM - Host integration-slave1003 is DOWN: CRITICAL - Host Unreachable (10.68.17.138)
[13:59:28] so from 4x5=20 precise to 5x6=30 precise
[13:59:32] adding more slaves is a good workaround
[14:00:20] hashar: Are you saying CPU will go 100% with only 5 jobs active?
[14:00:48] most probably yeah
[14:00:51] depends on the jobs
[14:00:56] some are heavy CPU
[14:01:02] others are mostly waiting for network io
[14:01:22] Hm..
[14:01:29] hashar: You made trusty slaves 4 executors too
[14:01:31] not 5
[14:01:35] yeah
[14:01:39] so it went down 2
[14:01:40] they have 4 CPU don't they ?
[14:01:47] so you don't want 5 jobs contending for 4 cpus
[14:01:54] we now have less capacity than 2 months ago
[14:02:08] I don't think it will be a problem
[14:02:11] Trusty always had one more
[14:02:23] though we would need a way to monitor the executors occupation
[14:02:39] 5x5=25, 4x6=24
[14:02:41] if we find we are having trouble, let's add more instances
[14:02:50] or maybe we can ask to get instances with 8 CPU instead :)
[14:03:16] What will happen is that with our continued increase in load, things are now going to be slower with even larger queues and backlog.
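The contention argument above (more busy executors than CPUs) can be sketched with an idealized processor-sharing model. This is a toy illustration, not a measurement of the actual slaves: it assumes every job is single-threaded and fully CPU-bound and that the cores are shared evenly, and the function name `slowdown` is made up for the example. Jobs that mostly wait on network I/O, as mentioned in the discussion, would be affected less.

```python
def slowdown(running_jobs, cpus):
    """Factor by which each CPU-bound job is stretched under even sharing.

    Hypothetical helper: with no more jobs than CPUs every job gets a full
    core (factor 1.0); beyond that, each job gets cpus / running_jobs of a core.
    """
    if running_jobs < 0 or cpus <= 0:
        raise ValueError("running_jobs must be >= 0 and cpus must be > 0")
    return max(1.0, running_jobs / cpus)


# Compare the executor counts discussed above on a 4-CPU instance.
for executors in (4, 5, 6):
    print(f"{executors} busy executors on 4 CPUs: x{slowdown(executors, 4):.2f}")
```

Under these assumptions, 5 busy executors on 4 CPUs stretch each job by a factor of 1.25, and 6 by a factor of 1.5. It says nothing about queue length, which is the other side of the trade-off raised in the conversation.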
[14:03:51] my point is that having 6 executors on a 4 CPU instance will cause each job to take way longer
[14:04:07] We should add more instances before decreasing executors, not the other way around.
[14:04:12] I don't have time for that.
[14:04:41] or increase CPU if needed. Whatever.
[14:05:01] Running slow is better than not running at all
[14:05:07] Under peak load.
[14:05:18] Such as with wikidata exploding jobs out of nowhere
[14:05:24] (what the hell happened there anyway?)
[14:06:09] https://integration.wikimedia.org/ci/load-statistics might give some clue
[14:06:14] though it doesn't have a long history :(
[14:06:22] They changed the job configuration.
[14:06:43] The problem is that we have too many repositories to host workspaces for on one slave.
[14:07:10] We need more than 4 executors as otherwise disk goes full.
[14:07:14] With 4 executors, the same job will go to different slaves at different times.
[14:07:22] We've been there already.
[14:08:22] This is also why Lego and I had to remove zuul-cloner from many jobs because zuul-cloner doesn't support workspace wipe, doesn't support submodules, and (most importantly) doesn't support shallow clone.
[14:08:36] I don't understand the relationship you claim between # of executors and disk space
[14:08:50] hashar: Imagine every slave has 1 executor.
[14:09:02] and a job gets triggered. It goes to a slave.
[14:09:15] Then next time, that slave is already used, so it has to go to a different slave instead.
[14:09:23] This means that job's workspace is now on two slaves.
[14:09:55] Jenkins tries to re-use the same slave each build for a job to reduce duplicated git clones.
[14:10:38] It is currently already the case that we have too many jobs/repositories to host workspaces for on one slave.
[14:10:38] Reducing the executors significantly increases duplicate workspaces and disk usage.
[14:11:00] ok
[14:11:04] This happened in January, and last year twice as well.
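The thought experiment above (a job's preferred slave is busy, so its workspace ends up on a second slave) can be sketched as a small deterministic model. This is a toy under stated assumptions, not Jenkins' actual scheduler: a job goes to a slave that already holds its workspace if that slave has a free executor, otherwise to the first slave with a free executor, and all executors are freed between rounds. The function `count_workspaces` and the job names are hypothetical.

```python
from collections import defaultdict


def count_workspaces(num_slaves, executors_per_slave, rounds):
    """Return how many (job, slave) workspaces exist after all rounds.

    `rounds` is a list of lists; the jobs in one inner list run concurrently.
    """
    workspaces = defaultdict(set)  # job name -> slave indexes holding its workspace
    for jobs in rounds:
        free = [executors_per_slave] * num_slaves  # executors are freed each round
        for job in jobs:
            # Prefer a slave that already holds this job's workspace.
            candidates = [s for s in sorted(workspaces[job]) if free[s] > 0]
            if not candidates:
                # Assumed fallback: the first slave with a free executor.
                candidates = [s for s in range(num_slaves) if free[s] > 0]
            if not candidates:
                raise RuntimeError("no free executor; a real scheduler would queue the build")
            slave = candidates[0]
            free[slave] -= 1
            workspaces[job].add(slave)
    return sum(len(slaves) for slaves in workspaces.values())


# Jobs A and B run together, then C and A run together, on two slaves.
rounds = [["A", "B"], ["C", "A"]]
print("1 executor per slave: ", count_workspaces(2, 1, rounds), "workspaces")
print("2 executors per slave:", count_workspaces(2, 2, rounds), "workspaces")
```

In this schedule, one executor per slave leaves job A with a workspace on both slaves (4 workspaces in total), while two executors per slave keep every job on a single slave (3 in total). The same fallback rule also packs jobs onto the first slave, which is the overload concern raised on the other side of the discussion.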
[14:11:04] Each time I believe it was a consequence of executors having been lowered after I increased it.
[14:11:16] so with six executors you have the same problem eventually
[14:11:16] I should've documented that better, but it was for a good reason.
[14:11:28] when all executors are busy on the first slave
[14:11:34] No
[14:11:40] Because jenkins does this
[14:11:46] a way would be to create shards of extensions and bind them to specific slaves
[14:11:49] It keeps track of where jobs run.
[14:11:50] or stop cloning core
[14:12:01] It's not perfect, but good enough.
[14:12:10] Good enough that in reality (not theory) we do not get full disk.
[14:12:17] That's all I know.
[14:12:22] got it :)
[14:12:53] we can go with the hack I have published at https://phabricator.wikimedia.org/T93703#1144542
[14:12:56] namely use a local mirror
[14:13:04] we can increase CPU, get more slaves, more disk, get local git cache per instance, reduce core clones etc.
[14:13:07] but until that happens, we need this.
[14:13:09] would also need to adjust zuul-cloner code to be able to pass custom options to git clone
[14:13:18] it is hardcoded to always do a copy :/
[14:13:35] the # of executors is really a hack around the way Jenkins schedules jobs
[14:13:35] hashar: Travis CI changed several years ago to always do --depth=10 for all git clones
[14:13:42] It's the default, and there is no way to modify it
[14:13:47] And nobody complains
[14:13:50] Works fine :)
[14:13:55] Full workspace wipe every time
[14:14:02] and re-clone shallow
[14:14:12] No git cache even
[14:14:22] Although our local git cache will make it even faster
[14:14:28] but I think doing wipe and depth will be a good start.
[14:14:30] What do you think?
[14:18:45] Continuous-Integration: Phase out yamllint jobs (tracking) - https://phabricator.wikimedia.org/T95890#1202800 (hashar) NEW
[14:20:49] Continuous-Integration, operations: Provide lint for yaml files in operations repository - https://phabricator.wikimedia.org/T91496#1202809 (hashar) Open>declined a:hashar Per my previous comment, the yaml linting should be done by a test suite in the operations/puppet.git repository. CI would in...
[14:21:19] Krinkle: I am not sure how the --depth will work to be honest
[14:21:27] I can't remember how zuul-cloner clones the repo
[14:21:34] I guess it clones whatever origin/HEAD is
[14:21:59] then depending on the patchset branch that triggered the change, it does a checkout of ZUUL_BRANCH
[14:22:01] so
[14:22:39] if you send a patch to REL1_24 of mediawiki/core I am not sure how it is going to work
[14:23:37] hashar: We'd wipe workspace each build, then zuul will detect directory doesn't exist and reclone all relevant repos (core, extensions etc.)
[14:23:50] and instead of "git clone .. " it would do "git clone --depth 10 .. "
[14:24:06] You can also pass branch or ref to git-clone so you don't need a separate "git checkout "
[14:28:07] Continuous-Integration: Phase out yamllint jobs (tracking) - https://phabricator.wikimedia.org/T95890#1202846 (hashar)
[14:28:23] Should be trivial to add the extra argument, right?
[14:32:13] (PS1) Hashar: Remove commented out yamllint from translatewiki [integration/config] - https://gerrit.wikimedia.org/r/203827 (https://phabricator.wikimedia.org/T86194)
[14:32:28] Continuous-Integration, Patch-For-Review: Support multiple documents in yamlllint - https://phabricator.wikimedia.org/T86194#1202855 (hashar) Phase out of the yamllint generic job is tracked by {T95890}.
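The "wipe the workspace, then re-clone shallow on the right branch" proposal above can be sketched as follows. This is a minimal illustration and not zuul-cloner code: the helper names, the example URL and the workspace path are hypothetical. It relies only on `git clone --depth` and `git clone --branch`; `--branch` takes a branch or tag name, and whether that is enough for the refs Zuul prepares is the open question in the discussion.

```python
import shutil
import subprocess
from pathlib import Path


def shallow_clone_command(url, dest, branch=None, depth=10):
    """Build the argv for a shallow clone, optionally of one branch or tag."""
    cmd = ["git", "clone", "--depth", str(depth)]
    if branch:
        # Checks out the named branch or tag directly, so no separate
        # "git checkout" is needed afterwards.
        cmd += ["--branch", branch]
    cmd += [url, str(dest)]
    return cmd


def wipe_and_clone(url, dest, branch=None, depth=10):
    """Delete any existing workspace, then re-clone it shallowly."""
    dest = Path(dest)
    if dest.exists():
        shutil.rmtree(dest)  # full workspace wipe before every build
    subprocess.run(shallow_clone_command(url, dest, branch, depth), check=True)


# Only print the command here; wipe_and_clone() would actually run it.
print(" ".join(shallow_clone_command(
    "https://example.org/mediawiki/core.git",  # placeholder URL
    "workspace/core",
    branch="REL1_24",
)))
```

Building the argument list separately from running it keeps the extra option easy to test, which matches the "should be trivial to add the extra argument" remark, at least for the plain branch case.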
[14:32:35] Project browsertests-UploadWizard-commons.wikimedia.beta.wmflabs.org-linux-chrome-sauce build #578: FAILURE in 13 min: https://integration.wikimedia.org/ci/job/browsertests-UploadWizard-commons.wikimedia.beta.wmflabs.org-linux-chrome-sauce/578/
[14:34:38] Release-Engineering: Read "Vagrant: Up and Running" book - https://phabricator.wikimedia.org/T95401#1202858 (zeljkofilipin) Another repository created: https://github.com/zeljkofilipin/my_vagrant_plugin
[14:36:21] Continuous-Integration: Phase out yamllint jobs (tracking) - https://phabricator.wikimedia.org/T95890#1202800 (hashar)
[14:45:37] Project browsertests-Wikidata-SmokeTests-linux-firefox-sauce build #217: FAILURE in 28 min: https://integration.wikimedia.org/ci/job/browsertests-Wikidata-SmokeTests-linux-firefox-sauce/217/
[14:50:59] (PS2) Hashar: Remove commented out yamllint from translatewiki [integration/config] - https://gerrit.wikimedia.org/r/203827 (https://phabricator.wikimedia.org/T86194)
[14:51:28] (CR) Hashar: [C: 2] Remove commented out yamllint from translatewiki [integration/config] - https://gerrit.wikimedia.org/r/203827 (https://phabricator.wikimedia.org/T86194) (owner: Hashar)
[14:53:07] (Merged) jenkins-bot: Remove commented out yamllint from translatewiki [integration/config] - https://gerrit.wikimedia.org/r/203827 (https://phabricator.wikimedia.org/T86194) (owner: Hashar)
[14:54:33] (Abandoned) Hashar: wikimedia-fundraising-civicrm [integration/config] - https://gerrit.wikimedia.org/r/166031 (owner: Hashar)
[15:00:20] (PS2) Hashar: (WIP) debian-glue job for Zuul (WIP) [integration/config] - https://gerrit.wikimedia.org/r/203347
[15:05:42] Continuous-Integration, Wikidata: the changed job configuration extension-unittests -> extension-unittests-generic for Wikidata.git makes it not run all tests and fail - https://phabricator.wikimedia.org/T95897#1203016 (JanZerebecki) NEW
[15:09:11] Hm.. wikibugs not working?
[15:10:00] Ah, it's just slow
[15:13:51] g'morn
[15:32:12] Deployment-Systems, Services, operations: Evaluate Docker as a container deployment tool - https://phabricator.wikimedia.org/T93439#1203111 (GWicke)
[15:36:38] Continuous-Integration, Release-Engineering, Mobile-Web: mwext-MobileFrontend-qunit-mobile issues again - https://phabricator.wikimedia.org/T95430#1203130 (Krinkle)
[15:36:40] Continuous-Integration, Upstream: Zuul-cloner failing to acquire .git/config lock sometimes - https://phabricator.wikimedia.org/T86730#1203131 (Krinkle)
[15:37:02] Continuous-Integration, Release-Engineering, Mobile-Web: mwext-MobileFrontend-qunit-mobile issues again - https://phabricator.wikimedia.org/T95430#1189699 (Krinkle) Intermittent snafu.
[15:40:38] Continuous-Integration: Add a Gerrit check for file line endings - https://phabricator.wikimedia.org/T53754#1203160 (Krinkle) Open>declined a:Krinkle This should not be handled by a separate job entirely. That's overkill and maintenance overhead for #contint. Individual projects are free to use what...
[15:41:48] Continuous-Integration, Jenkins, Upstream: /etc/init.d/jenkins script provided by Debian doesn't work properly - https://phabricator.wikimedia.org/T53817#1203180 (Krinkle)
[15:42:25] Continuous-Integration: Zuul should not run jenkins-bot on changes for refs/meta/* - https://phabricator.wikimedia.org/T52389#1203192 (Krinkle)
[15:54:21] PROBLEM - Puppet failure on deployment-pdf02 is CRITICAL: CRITICAL: 44.44% of data above the critical threshold [0.0]
[16:16:28] Continuous-Integration, I18n, Patch-For-Review, Pywikibot-i18n: Jenkins job to validate JSON files submitted to Gerrit repo pywikibot/i18n - https://phabricator.wikimedia.org/T85335#1203335 (Krinkle) Open>declined >>! In T85335#1177020, @jayvdb wrote: > @legoktm, new JS files (e.g. https://ger...
[16:16:42] Continuous-Integration, I18n, Pywikibot-i18n: Jenkins job to validate JSON files submitted to Gerrit repo pywikibot/i18n - https://phabricator.wikimedia.org/T85335#1203339 (Krinkle) declined>Resolved
[16:21:28] Continuous-Integration, Labs: Purge graphite data for deleted integration instances and nonexistent metrics - https://phabricator.wikimedia.org/T95569#1203374 (Krinkle)
[16:21:58] Continuous-Integration, Labs: Purge graphite data for deleted integration instances and nonexistent metrics - https://phabricator.wikimedia.org/T95569#1194719 (Krinkle) >>! In T95569#1194807, @yuvipanda wrote: > Should I just delete all the data under the integration project, and let it start again from s...
[16:24:17] RECOVERY - Puppet failure on deployment-pdf02 is OK: OK: Less than 1.00% above the threshold [0.0]
[16:30:06] PROBLEM - Puppet failure on integration-dev is CRITICAL: CRITICAL: 50.00% of data above the critical threshold [0.0]
[16:38:37] (PS1) Krinkle: Clean up integration/* report message overrides [integration/config] - https://gerrit.wikimedia.org/r/203856
[16:45:04] Continuous-Integration, Labs, Wikimedia-Labs-Infrastructure: Diamond metrics for cpu.system suddenly up 100% after a reboot - https://phabricator.wikimedia.org/T95912#1203449 (Krinkle) NEW
[16:51:44] Continuous-Integration, Wikidata, Wikidata-Sprint-2015-04-07: the changed job configuration extension-unittests -> extension-unittests-generic for Wikidata.git makes it not run all tests and fail - https://phabricator.wikimedia.org/T95897#1203481 (JanZerebecki)
[16:52:27] Continuous-Integration, Community-consensus-needed: Create a trigger to run extension tests on test coverage extension - https://phabricator.wikimedia.org/T89333#1203486 (Krinkle) Open>declined a:Krinkle I don't think that's desirable. Putting everything on the stack of changes to master of media...
[16:52:28] Continuous-Integration, MediaWiki-extensions-MathSearch, Patch-For-Review: MathSearch tests fail - https://phabricator.wikimedia.org/T89237#1203489 (Krinkle)
[16:55:13] RECOVERY - Puppet failure on integration-dev is OK: OK: Less than 1.00% above the threshold [0.0]
[16:55:34] (PS1) Legoktm: Use non-generic job for Wikidata extension [integration/config] - https://gerrit.wikimedia.org/r/203858 (https://phabricator.wikimedia.org/T95897)
[16:56:34] Continuous-Integration, MediaWiki-extensions-SemanticForms: SemanticForms unit tests fail - https://phabricator.wikimedia.org/T68052#1203512 (Krinkle)
[16:57:30] (CR) Legoktm: [C: 2] Use non-generic job for Wikidata extension [integration/config] - https://gerrit.wikimedia.org/r/203858 (https://phabricator.wikimedia.org/T95897) (owner: Legoktm)
[16:58:24] Beta-Cluster, Continuous-Integration, Math: beta-recompile-math-texvc-eqiad job fails with "/usr/local/bin/scap-recompile: No such file or directory" - https://phabricator.wikimedia.org/T91191#1203522 (Krinkle) p:Normal>High
[16:59:27] (Merged) jenkins-bot: Use non-generic job for Wikidata extension [integration/config] - https://gerrit.wikimedia.org/r/203858 (https://phabricator.wikimedia.org/T95897) (owner: Legoktm)
[17:01:25] !log deploying https://gerrit.wikimedia.org/r/203858
[17:02:23] Logged the message, Master
[17:02:41] Continuous-Integration, Wikidata, Wikidata-Sprint-2015-04-07: the changed job configuration extension-unittests -> extension-unittests-generic for Wikidata.git makes it not run all tests and fail - https://phabricator.wikimedia.org/T95897#1203533 (Legoktm)
[17:03:26] Continuous-Integration, Wikidata, Wikidata-Sprint-2015-04-07: the changed job configuration extension-unittests -> extension-unittests-generic for Wikidata.git makes it not run all tests and fail - https://phabricator.wikimedia.org/T95897#1203016 (Legoktm) Reverted for now, the generic jobs are exper...
[18:00:47] PROBLEM - App Server Main HTTP Response on deployment-mediawiki02 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[18:00:49] PROBLEM - App Server Main HTTP Response on deployment-mediawiki01 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[18:00:50] PROBLEM - App Server bits response on deployment-mediawiki01 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[18:00:50] PROBLEM - App Server bits response on deployment-mediawiki03 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[18:00:53] PROBLEM - English Wikipedia Main page on beta-cluster is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[18:00:53] PROBLEM - English Wikipedia Mobile Main page on beta-cluster is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[18:00:54] PROBLEM - App Server Main HTTP Response on deployment-mediawiki03 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[18:00:54] PROBLEM - App Server bits response on deployment-mediawiki02 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[18:00:57] Project beta-scap-eqiad build #48920: FAILURE in 3 min 12 sec: https://integration.wikimedia.org/ci/job/beta-scap-eqiad/48920/
[18:01:07] RECOVERY - English Wikipedia Mobile Main page on beta-cluster is OK: HTTP OK: HTTP/1.1 200 OK - 28677 bytes in 0.565 second response time
[18:02:14] PROBLEM - Puppet failure on deployment-pdf02 is CRITICAL: CRITICAL: 22.22% of data above the critical threshold [0.0]
[18:02:15] Project browsertests-Math-en.wikipedia.beta.wmflabs.org-linux-firefox-sauce build #508: FAILURE in 2 min 39 sec: https://integration.wikimedia.org/ci/job/browsertests-Math-en.wikipedia.beta.wmflabs.org-linux-firefox-sauce/508/
[18:02:15] PROBLEM - Puppet failure on deployment-parsoid05 is CRITICAL: CRITICAL: 22.22% of data above the critical threshold [0.0]
[18:08:18] thcipriani, ^d, marxarelli: I wrote up the homework assignments on http://etherpad.wikimedia.org/p/deployworkinggroup
[18:08:20] <^d> thx
[18:08:20] PROBLEM - Puppet failure on deployment-salt is CRITICAL: CRITICAL: 55.56% of data above the critical threshold [0.0]
[18:08:20] PROBLEM - Puppet failure on integration-slave-trusty-1015 is CRITICAL: CRITICAL: 44.44% of data above the critical threshold [0.0]
[18:08:21] twentyafterfour: thanks.
[18:12:08] poor NFS, taking a beating, seemingly.
[18:12:08] see -labs
[18:12:09] PROBLEM - English Wikipedia Mobile Main page on beta-cluster is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[18:19:22] (CR) 20after4: "I guess this one can get merged without waiting for the config change to happen, right?" [tools/release] - https://gerrit.wikimedia.org/r/203642 (https://phabricator.wikimedia.org/T15712) (owner: devunt)
[18:24:27] here's a concrete example of the problem with trebuchet being owned by ops, which makes something like ansible so appealing: https://gerrit.wikimedia.org/r/#/c/201344/
[18:24:28] RECOVERY - App Server Main HTTP Response on deployment-mediawiki02 is OK: HTTP OK: HTTP/1.1 200 OK - 47862 bytes in 5.641 second response time
[18:25:44] RECOVERY - Puppet failure on deployment-pdf02 is OK: OK: Less than 1.00% above the threshold [0.0]
[18:27:42] Yippee, build fixed!
[18:27:42] Project beta-scap-eqiad build #48925: FIXED in 2 min 0 sec: https://integration.wikimedia.org/ci/job/beta-scap-eqiad/48925/ [18:28:01] 10Continuous-Integration, 10Ops-Access-Requests, 6operations: Add user wmde-fisch to LDAP group wmde - https://phabricator.wikimedia.org/T95546#1203802 (10Dzahn) a:3Dzahn [18:49:54] PROBLEM - App Server Main HTTP Response on deployment-mediawiki02 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [18:49:57] RECOVERY - Puppet failure on deployment-salt is OK: OK: Less than 1.00% above the threshold [0.0] [18:49:57] RECOVERY - Puppet failure on deployment-parsoid05 is OK: OK: Less than 1.00% above the threshold [0.0] [18:49:58] RECOVERY - Puppet failure on integration-slave-trusty-1015 is OK: OK: Less than 1.00% above the threshold [0.0] [18:49:59] Project browsertests-CentralNotice-en.wikipedia.beta.wmflabs.org-linux-firefox-sauce build #285: FAILURE in 6 min 11 sec: https://integration.wikimedia.org/ci/job/browsertests-CentralNotice-en.wikipedia.beta.wmflabs.org-linux-firefox-sauce/285/ [18:50:00] Project beta-scap-eqiad build #48926: FAILURE in 2 min 53 sec: https://integration.wikimedia.org/ci/job/beta-scap-eqiad/48926/ [18:50:01] twentyafterfour: sweet. ty [18:50:05] why's beta-scap failing? [18:50:05] oh, nfs still? 
[18:50:06] greg-g: yea [18:50:07] greg-g: although, theoretically beta should be independent of NFS now [18:50:07] not sure why it’s failing, actually [18:50:07] keys are on localdisk as well [18:50:10] (03CR) 10Chad: [C: 032] Add Josa extension to make-wmf-branch/default.conf [tools/release] - 10https://gerrit.wikimedia.org/r/203642 (https://phabricator.wikimedia.org/T15712) (owner: 10devunt) [18:55:24] RECOVERY - App Server Main HTTP Response on deployment-mediawiki02 is OK: HTTP OK: HTTP/1.1 200 OK - 47865 bytes in 0.610 second response time [18:55:30] Krinkle: I removed all the current integration data into archive.integration [18:59:23] Project browsertests-CentralNotice-en.wikipedia.beta.wmflabs.org-linux-chrome-sauce build #284: FAILURE in 6 min 23 sec: https://integration.wikimedia.org/ci/job/browsertests-CentralNotice-en.wikipedia.beta.wmflabs.org-linux-chrome-sauce/284/ [19:04:03] PROBLEM - App Server Main HTTP Response on deployment-mediawiki02 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:04:06] YuviPanda: thx [19:05:07] RECOVERY - App Server Main HTTP Response on deployment-mediawiki01 is OK: HTTP OK: HTTP/1.1 200 OK - 47862 bytes in 0.597 second response time [19:05:26] RECOVERY - English Wikipedia Mobile Main page on beta-cluster is OK: HTTP OK: HTTP/1.1 200 OK - 28677 bytes in 0.627 second response time [19:06:05] Yippee, build fixed! 
[19:06:06] Project beta-scap-eqiad build #48929: FIXED in 2 min 4 sec: https://integration.wikimedia.org/ci/job/beta-scap-eqiad/48929/ [19:06:17] (03Merged) 10jenkins-bot: Add Josa extension to make-wmf-branch/default.conf [tools/release] - 10https://gerrit.wikimedia.org/r/203642 (https://phabricator.wikimedia.org/T15712) (owner: 10devunt) [19:06:37] RECOVERY - App Server bits response on deployment-mediawiki01 is OK: HTTP OK: HTTP/1.1 200 OK - 3895 bytes in 5.396 second response time [19:06:37] RECOVERY - App Server bits response on deployment-mediawiki03 is OK: HTTP OK: HTTP/1.1 200 OK - 3895 bytes in 1.936 second response time [19:08:57] PROBLEM - Puppet failure on integration-labsvagrant is CRITICAL: CRITICAL: 33.33% of data above the critical threshold [0.0] [20:15:05] Project beta-scap-eqiad build #48930: FAILURE in 2 min 40 sec: https://integration.wikimedia.org/ci/job/beta-scap-eqiad/48930/ [20:15:05] PROBLEM - English Wikipedia Mobile Main page on beta-cluster is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:15:05] PROBLEM - App Server Main HTTP Response on deployment-mediawiki01 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:15:05] PROBLEM - App Server bits response on deployment-mediawiki01 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:15:05] PROBLEM - App Server bits response on deployment-mediawiki03 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:15:11] !log Restarting Zuul, Jenkins and aborting all builds. 
Everything got stuck following NFS outage in lab [20:15:14] PROBLEM - zuul_gearman_service on gallium is CRITICAL: Connection refused [20:15:15] PROBLEM - zuul_service_running on gallium is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/local/bin/zuul-server [20:15:20] PROBLEM - Puppet failure on integration-slave-jessie-1001 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [0.0] [20:15:32] (03PS1) 10Krinkle: zuul: Don't raise "abort" as error to the user (2) [integration/docroot] - 10https://gerrit.wikimedia.org/r/203890 [20:15:35] 10Continuous-Integration, 6Labs: Continuous integration should not depend on labs NFS - https://phabricator.wikimedia.org/T90610#1204249 (10Krinkle) p:5Low>3High [20:17:06] RECOVERY - App Server Main HTTP Response on deployment-mediawiki01 is OK: HTTP OK: HTTP/1.1 200 OK - 47862 bytes in 0.560 second response time [20:17:33] RECOVERY - App Server bits response on deployment-mediawiki01 is OK: HTTP OK: HTTP/1.1 200 OK - 3895 bytes in 2.319 second response time [20:17:35] RECOVERY - App Server bits response on deployment-mediawiki03 is OK: HTTP OK: HTTP/1.1 200 OK - 3895 bytes in 1.053 second response time [20:20:24] RECOVERY - English Wikipedia Main page on beta-cluster is OK: HTTP OK: HTTP/1.1 200 OK - 48054 bytes in 0.748 second response time [20:21:36] Project beta-code-update-eqiad build #51673: FAILURE in 15 sec: https://integration.wikimedia.org/ci/job/beta-code-update-eqiad/51673/ [20:21:44] Project beta-update-databases-eqiad build #8896: FAILURE in 24 sec: https://integration.wikimedia.org/ci/job/beta-update-databases-eqiad/8896/ [20:22:00] ^ manual abort [20:24:46] RECOVERY - zuul_service_running on gallium is OK: PROCS OK: 2 processes with regex args ^/usr/bin/python /usr/local/bin/zuul-server [20:26:03] RECOVERY - zuul_gearman_service on gallium is OK: TCP OK - 0.000 second response time on port 4730 [20:26:20] (03CR) 10Krinkle: [C: 032] zuul: Don't raise "abort" as error to the 
user (2) [integration/docroot] - 10https://gerrit.wikimedia.org/r/203890 (owner: 10Krinkle) [20:26:25] (03Merged) 10jenkins-bot: zuul: Don't raise "abort" as error to the user (2) [integration/docroot] - 10https://gerrit.wikimedia.org/r/203890 (owner: 10Krinkle) [20:29:00] RECOVERY - Puppet failure on integration-labsvagrant is OK: OK: Less than 1.00% above the threshold [0.0] [20:32:46] 3Continuous-Integration-Isolation, 6operations: Remove hashar root access on to be installed labnodepool1001 - https://phabricator.wikimedia.org/T95303#1204502 (10chasemp) p:5Normal>3Low [20:33:11] 3Continuous-Integration-Isolation, 6operations: Remove hashar root access on to be installed labnodepool1001 - https://phabricator.wikimedia.org/T95303#1186041 (10chasemp) 5Open>3stalled Please don't close this as it is to remind me to ensure this access is revoked as appropriate. [20:33:13] 3Continuous-Integration-Isolation, 6operations: install/deploy labnodepool1001 - https://phabricator.wikimedia.org/T95045#1204509 (10chasemp) [20:34:30] 3Continuous-Integration-Isolation, 6operations: Remove hashar root access on to be installed labnodepool1001 - https://phabricator.wikimedia.org/T95303#1204512 (10hashar) I can confirm the access works just fine. Thanks! [20:35:32] 3Continuous-Integration-Isolation, 6operations: Remove hashar root access on to be installed labnodepool1001 - https://phabricator.wikimedia.org/T95303#1204538 (10Legoktm) >>! In T95303#1204130, @RobH wrote: > Ops meeting disucssion resulted in approval, with conditions that Chase is aware of. As he will hand... [20:36:51] 3Continuous-Integration-Isolation, 6operations: Remove hashar root access on to be installed labnodepool1001 - https://phabricator.wikimedia.org/T95303#1204540 (10chasemp) >>! In T95303#1204538, @Legoktm wrote: >>>! In T95303#1204130, @RobH wrote: >> Ops meeting disucssion resulted in approval, with conditions... 
[20:37:41] legoktm: the idea is to setup labnodepool more or less puppetized [20:37:41] legoktm: and eventually trash the box and rebuild it completely [20:37:46] ah [20:37:47] making it production grade [20:37:48] gotcha [20:37:50] 10Continuous-Integration, 3Continuous-Integration-Isolation, 10hardware-requests, 6operations: eqiad: (1) allocate server to migrate Zuul server to - https://phabricator.wikimedia.org/T95760#1204548 (10RobH) 'hardware is getting old' is not a valid reasoning. So this cannot be easily upgraded in place, an... [20:37:55] and ideally I would not need root access on it [20:38:09] the only reason I get root on gallium is to be able to debug Jenkins :( [20:38:16] or fix basic stuffs [20:38:43] legoktm: such root approvals are discussed privately between ops and extra care is taken [20:38:55] that is more or less supposed to remain private I guess so each op member can speak openly [20:39:02] without the fear of hurting the recipient (me) [20:40:19] legoktm: I have 0 chance to review your zuul patch this week. But you can try proposing it to upstream [20:41:16] their Gerrit is https://review.openstack.org/ , project openstack-infra/zuul . You need an account on launchpad and might have to sign a cla [20:41:18] hashar: I spent about an hour going through openstack's CLA process on friday but then it didn't like my ssh key nor https password so I gave up for now [20:41:23] 10Continuous-Integration, 3Continuous-Integration-Isolation, 10hardware-requests, 6operations: eqiad: (1) allocate server to migrate Zuul server to - https://phabricator.wikimedia.org/T95760#1204574 (10chasemp) >>! In T95760#1204548, @RobH wrote: > 'hardware is getting old' is not a valid reasoning. > > S... 
[20:41:30] legoktm: oh that is a pity :((( [20:43:01] legoktm: note that the https pass is separate from the normal login password [20:43:14] legoktm: it's the one in settings » http pass or something [20:44:04] 10Continuous-Integration, 3Continuous-Integration-Isolation, 10hardware-requests, 6operations: eqiad: (1) allocate server to migrate Zuul server to - https://phabricator.wikimedia.org/T95760#1204592 (10RobH) Gallium is the following: Single CPU: Intel(R) Xeon(R) CPU X3450 @ 2.67GHz Dual 500GB S... [20:49:24] PROBLEM - App Server Main HTTP Response on deployment-mediawiki01 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:53:40] Krinkle: btw, let me know when I can permanently get rid of the archive [20:53:40] YuviPanda: You can do so now. [20:53:43] Krinkle: cool. [20:53:49] :) [20:54:07] RECOVERY - App Server Main HTTP Response on deployment-mediawiki01 is OK: HTTP OK: HTTP/1.1 200 OK - 47862 bytes in 0.543 second response time [20:54:56] Krinkle: all gone now [20:56:06] legoktm: Yeah, upstream is actively interested in these queue things. They're probably already working on it. [20:56:06] I think commits to our zuul should be restricted to backporting fixes from upstream. [20:56:06] it's sufficiently complex that we should not maintain our own patches without upstream review [20:56:06] valhallasw`cloud: yeah, that's the one I used [20:56:06] legoktm: hm. odd :( [20:56:06] Krinkle: ok, I'll try again tonight and see if I can get it to like my ssh key [20:56:13] 10Continuous-Integration, 6Labs: Purge graphite data for deleted integration instances and nonexistent metrics - https://phabricator.wikimedia.org/T95569#1204631 (10yuvipanda) 5Open>3Resolved a:3yuvipanda All done now. FTR, the way to do this is: # Move the appropriate metrics (found in `/srv/carbon/w...
[20:58:25] 10Continuous-Integration, 3Continuous-Integration-Isolation, 10hardware-requests, 6operations: eqiad: (1) allocate server to migrate Zuul server to - https://phabricator.wikimedia.org/T95760#1204644 (10RobH) a:3Cmjohnson I'm thinking about allocating system cobalt for this, but I need to assign this task... [20:59:00] 10Continuous-Integration, 3Continuous-Integration-Isolation, 10hardware-requests, 6operations: eqiad: (1) allocate server to migrate Zuul server to - https://phabricator.wikimedia.org/T95760#1204647 (10RobH) p:5High>3Normal @andrew: You set this to high priority, but it seems to be generally not any hi... [20:59:06] hashar: are you close to https://gerrit.wikimedia.org/r/#/c/201728/ not being wip? [21:00:55] 10Continuous-Integration, 3Continuous-Integration-Isolation, 10hardware-requests, 6operations: eqiad: (1) allocate server to migrate Zuul server to - https://phabricator.wikimedia.org/T95760#1204652 (10RobH) a:5Cmjohnson>3RobH [21:01:21] 10Continuous-Integration, 3Continuous-Integration-Isolation, 10hardware-requests, 6operations: eqiad: (1) allocate server to migrate Zuul server to - https://phabricator.wikimedia.org/T95760#1204654 (10hashar) Gallium has some SSD disk but the process that makes use of it are moving to some other machines.... [21:04:57] chasemp: need the nodepool debian package [21:05:29] ok I'm going to resign from review there until and can you please re-add me? I'm weird and try to keep a clean review queue [21:05:42] ohh [21:05:45] you can -1 it! [21:06:07] and use as a Gerrit homepage the magic: https://gerrit.wikimedia.org/r/#/q/is:open+reviewer:self+label:Code-Review%253D0%252Cuser%253Dself,n,z [21:06:08] is:open reviewer:self label:Code-Review=0,user=self [21:06:08] [21:06:21] that only lists changes for which you are a reviewer and have a CR vote of 0 [21:06:30] so if you vote -1 or +1 ...
the change disappears from that query [21:15:01] I don't want to do that tho as I need to keep up on my own +1's and -1's [21:15:03] :) [21:26:34] chasemp: so the puppet change is pending the nodepool debian package [21:26:41] which I have uploaded a sec ago :) [21:26:50] k [21:27:20] I'm not the best to review it; let's add godog [21:27:26] I'll ping him about it tomorrow [21:28:26] 10Continuous-Integration, 3Continuous-Integration-Isolation: Puppetize Nodepool configuration - https://phabricator.wikimedia.org/T89143#1204795 (10hashar) [21:28:27] 10Continuous-Integration, 3Continuous-Integration-Isolation, 6operations, 5Patch-For-Review, 7Upstream: Create a Debian package for NodePool - https://phabricator.wikimedia.org/T89142#1028174 (10hashar) [21:28:52] 10Continuous-Integration, 3Continuous-Integration-Isolation: Puppetize Nodepool configuration - https://phabricator.wikimedia.org/T89143#1028190 (10hashar) I have a first Gerrit draft at https://gerrit.wikimedia.org/r/#/c/201728/ It depends on {T89142}.
[21:29:15] chasemp: not sure how much bandwidth he has for a review [21:29:27] I am pairing with him on Thursday to get the Zuul Debian package approved [21:29:35] the nodepool one is really a first draft :( [21:34:30] 10Continuous-Integration, 3Continuous-Integration-Isolation, 6operations: install/setup/deploy cobalt as replacement for gallium - https://phabricator.wikimedia.org/T95959#1204862 (10RobH) 3NEW a:3RobH [21:34:44] 10Continuous-Integration, 3Continuous-Integration-Isolation, 10hardware-requests, 6operations: eqiad: (1) allocate server to migrate Zuul server to - https://phabricator.wikimedia.org/T95760#1199645 (10RobH) [21:34:46] 10Continuous-Integration, 3Continuous-Integration-Isolation, 6operations: install/setup/deploy cobalt as replacement for gallium - https://phabricator.wikimedia.org/T95959#1204877 (10RobH) [21:35:18] 10Continuous-Integration, 3Continuous-Integration-Isolation, 6operations: install/setup/deploy cobalt as replacement for gallium - https://phabricator.wikimedia.org/T95959#1204862 (10RobH) [21:35:20] 10Continuous-Integration: Phase out gallium.wikimedia.org - https://phabricator.wikimedia.org/T95757#1204884 (10RobH) [21:35:22] 10Continuous-Integration, 3Continuous-Integration-Isolation, 10hardware-requests, 6operations: eqiad: (1) allocate server to migrate Zuul server to - https://phabricator.wikimedia.org/T95760#1204879 (10RobH) 5Open>3Resolved Cobalt is allocated for this task. System setup will proceed on T95959. Resol... [21:36:01] hashar: Hm... it seems tests run slower for some reason. -qunit often times out from Apache (30 seconds timeout) [21:36:01] I don't think it actually is trying for 30 seconds though. That seems like a lot of time for the simple apache on a slave. [21:36:01] Probably some other issue like network or disk [21:36:01] ?
[21:36:32] Krinkle: labs died so it is probably degraded somehow [21:36:33] Compared to yesterday [21:36:36] disk would not surprise me [21:36:55] dpkg-source: info: local changes detected, the modified files are: [21:36:55] [21:37:04] F*** YOU DEBIAN [21:37:49] hashar: are you using git-buildpackage? [21:41:15] chasemp: yes [21:41:30] chasemp: turns out I forgot to update the changelog version [21:41:31] I too have suffered that pain [21:41:34] ah [21:41:38] so it used some old tarball in my parent directory [21:41:49] and complained the local source tree was not matching the parent tarball [21:41:50] I hate it [21:42:12] I have been working on making it easier to build package via Jenkins [21:47:35] 10Continuous-Integration, 3Continuous-Integration-Isolation, 6operations: install/setup/deploy cobalt as replacement for gallium - https://phabricator.wikimedia.org/T95959#1204957 (10RobH) @chasemp will be chasing down the network requirements. Cobalt needs to talk to labs hosts, which means it would curren... [21:48:47] 10Continuous-Integration, 3Continuous-Integration-Isolation, 6operations: install/setup/deploy cobalt as replacement for gallium - https://phabricator.wikimedia.org/T95959#1204964 (10chasemp) a:5RobH>3chasemp [21:57:59] 10Continuous-Integration, 3Continuous-Integration-Isolation, 6operations: install/setup/deploy cobalt as replacement for gallium - https://phabricator.wikimedia.org/T95959#1205022 (10RobH) for whoever does the install server update (I didn't do it yet, since we aren't yet certain of the fqdn.) NIC1 Ethernet... [21:58:40] 10Continuous-Integration, 3Continuous-Integration-Isolation, 6operations: install/setup/deploy cobalt as replacement for gallium - https://phabricator.wikimedia.org/T95959#1205026 (10RobH) [22:04:41] Krinkle: we will migrate zuul scheduler out of gallium [22:04:50] with the aim of phasing out gallium :D [22:04:51] more tomorrow!
[22:06:10] 7Blocked-on-RelEng, 10Browser-Tests, 10Continuous-Integration, 6Collaboration-Team, and 2 others: Pass MEDIAWIKI_CAPTCHA_BYPASS_PASSWORD in on Jenkins so GettingStarted browser tests pass - https://phabricator.wikimedia.org/T91220#1205043 (10Mattflaschen) >>! In T91220#1185271, @hashar wrote: > The DNS fai... [22:07:09] 10Continuous-Integration, 3Continuous-Integration-Isolation, 6operations: install/setup/deploy cobalt as replacement for gallium - https://phabricator.wikimedia.org/T95959#1205044 (10hashar) Do we have any 30000 feet network diagrams of our vlan / zones / whatever? That would assist in figuring out how machi... [22:07:11] 10Browser-Tests, 10Continuous-Integration, 7Tracking: Fix or delete browsertests* Jenkins jobs that are failing for more than a week (tracking) - https://phabricator.wikimedia.org/T94150#1205045 (10Mattflaschen) [22:07:37] 10Continuous-Integration, 3Continuous-Integration-Isolation, 6operations: install/setup/deploy cobalt as replacement for gallium - https://phabricator.wikimedia.org/T95959#1205047 (10RobH) [22:07:37] nodepool_0.0.1-104-gddd6003_amd64.deb !! [22:07:41] 7Blocked-on-RelEng, 10Browser-Tests, 10Continuous-Integration, 6Collaboration-Team, and 2 others: Pass MEDIAWIKI_CAPTCHA_BYPASS_PASSWORD in on Jenkins so GettingStarted browser tests pass - https://phabricator.wikimedia.org/T91220#1205048 (10Mattflaschen) Forgot to note, I verified it is now green: https:/... [22:09:58] hashar: Could you maybe revisit https://phabricator.wikimedia.org/T94138 tomorrow? [22:10:04] What is left to do there? 
[22:11:41] Krinkle: either disable core dumps [22:11:55] or figure out why dvips / dvipng segfaults [22:19:28] 10Continuous-Integration, 6Phabricator: Create a yellow project for 'nodepool' - https://phabricator.wikimedia.org/T95965#1205120 (10hashar) 3NEW [22:19:43] 3Continuous-Integration-Isolation, 6Phabricator: Create a yellow project for 'nodepool' - https://phabricator.wikimedia.org/T95965#1205128 (10hashar) [22:22:41] 10Continuous-Integration, 3Continuous-Integration-Isolation, 6operations, 5Patch-For-Review, 7Upstream: Create a Debian package for NodePool - https://phabricator.wikimedia.org/T89142#1205139 (10hashar) I have created the Gerrit repository [[ https://gerrit.wikimedia.org/r/#/admin/projects/operations/deb... [22:24:57] 10Continuous-Integration, 3Continuous-Integration-Isolation, 6operations, 5Patch-For-Review, 7Upstream: Create a Debian package for NodePool on Debian Jessie - https://phabricator.wikimedia.org/T89142#1205146 (10hashar) [22:26:27] 10Continuous-Integration, 3Continuous-Integration-Isolation, 6operations, 5Patch-For-Review, 7Upstream: Create a Debian package for NodePool on Debian Jessie - https://phabricator.wikimedia.org/T89142#1028174 (10hashar) [22:26:51] chasemp: I have more or less packaged nodepool https://phabricator.wikimedia.org/T89142#1205139 :D [22:45:33] 10Continuous-Integration, 3Continuous-Integration-Isolation, 10Wikimedia-Labs-Infrastructure: Support dedicating a specific virt node to a specific nova project - https://phabricator.wikimedia.org/T84989#1205215 (10hashar) p:5Normal>3Low Lowering priority, at the start I guess we can afford having our sm... [22:53:35] 10Continuous-Integration, 3Continuous-Integration-Isolation, 10Wikimedia-Labs-Infrastructure: Support dedicating a specific virt node to a specific nova project - https://phabricator.wikimedia.org/T84989#1205248 (10Andrew) Just now chase and I have confirmed that the proper mechanism to direct particular VMs... 
[23:02:06] ok enough [23:02:09] have a good day! [23:33:44] 10Beta-Cluster, 6Collaboration-Team, 10incident-20150410-flowdataloss, 7database: Use External Store on Beta Cluster - https://phabricator.wikimedia.org/T95871#1205360 (10greg) p:5Triage>3Normal [23:35:07] 10Beta-Cluster, 10Staging, 6Collaboration-Team, 10incident-20150410-flowdataloss, 7database: Use External Store on Beta Cluster - https://phabricator.wikimedia.org/T95871#1202118 (10greg) [23:36:57] 10Continuous-Integration, 5Patch-For-Review: Status of Jouncebot and dropping the yamllint Jenkins job - https://phabricator.wikimedia.org/T95894#1205364 (10greg) p:5Triage>3Normal