[00:10:26] <wikibugs>	 Release-Engineering, MediaWiki-Debug-Logging, MediaWiki-General-or-Unknown, MW-1.23-release, and 2 others: Create a minimal backport of PSR-3 logging to MediaWiki 1.23 LTS - https://phabricator.wikimedia.org/T91653#1202022 (bd808) a:bd808
[00:10:48] <wikibugs>	 Release-Engineering, MediaWiki-Debug-Logging, MediaWiki-General-or-Unknown, MediaWiki-Tarball-Backports, and 3 others: Create a minimal backport of PSR-3 logging to MediaWiki 1.23 LTS - https://phabricator.wikimedia.org/T91653#1092396 (bd808)
[00:10:52] <shinken-wm>	 RECOVERY - Free space - all mounts on deployment-bastion is OK: OK: All targets OK  
[02:44:59] <greg-g>	 bd808: thanks
[02:47:41] <wikibugs>	 Beta-Cluster, Deployment-Systems: l10nupdate on Beta Cluster filled /var - https://phabricator.wikimedia.org/T95868#1202066 (greg) NEW
[02:48:25] <wikibugs>	 Beta-Cluster, Deployment-Systems: l10nupdate on Beta Cluster filled /var - https://phabricator.wikimedia.org/T95868#1202073 (greg)
[03:28:03] <wikibugs>	 Beta-Cluster, database: Use External Store on Beta Labs - https://phabricator.wikimedia.org/T95871#1202118 (Mattflaschen) NEW
[03:39:43] <wikibugs>	 Beta-Cluster, database: Use External Store on Beta Cluster - https://phabricator.wikimedia.org/T95871#1202132 (greg)
[06:56:08] <shinken-wm>	 PROBLEM - Puppet failure on deployment-parsoid05 is CRITICAL: CRITICAL: 55.56% of data above the critical threshold [0.0]  
[07:16:23] <wikibugs>	 Beta-Cluster, Blocked-on-RelEng, ContentTranslation-Deployments, MediaWiki-extensions-ContentTranslation, and 3 others: Setup new wikis in Beta Cluster for Content Translation - https://phabricator.wikimedia.org/T90683#1202285 (Arrbee)
[07:21:10] <shinken-wm>	 RECOVERY - Puppet failure on deployment-parsoid05 is OK: OK: Less than 1.00% above the threshold [0.0]  
[07:39:43] <wikibugs>	 Beta-Cluster, Patch-For-Review: /var/lib/l10nupdate fills up deployment-bastion /var partition - https://phabricator.wikimedia.org/T95564#1202291 (hashar)
[07:39:43] <wikibugs>	 Beta-Cluster, Deployment-Systems: l10nupdate on Beta Cluster filled /var - https://phabricator.wikimedia.org/T95868#1202290 (hashar)
[07:40:03] <wikibugs>	 Beta-Cluster, Patch-For-Review: /var/lib/l10nupdate fills up deployment-bastion /var partition - https://phabricator.wikimedia.org/T95564#1194574 (hashar)
[07:41:34] <wikibugs>	 Beta-Cluster, Patch-For-Review: /var/lib/l10nupdate fills up deployment-bastion /var partition - https://phabricator.wikimedia.org/T95564#1194574 (hashar) As @bd808  mentioned, the puppet class recreate the directory whereas it used to be a symlink from /var/ to the extended disk space /srv/  Potentially...
[08:03:54] <werdna>	 morning
[08:17:53] <hashar>	 !sal
[08:17:53] <wm-bot>	 https://wikitech.wikimedia.org/wiki/Release_Engineering/SAL
[08:34:47] <zeljkof>	 !log restarting stuck Jenkins
[08:34:53] <qa-morebots>	 Logged the message, Master
[08:42:46] <hashar>	 !log kill -9 jenkins because it was stuck in some deadlock related to the IRC plugin :(
[08:42:49] <qa-morebots>	 Logged the message, Master
[08:46:32] <hashar>	 !log jenkins removed #wikimedia-qa IRC channel from the global configuration
[08:46:35] <qa-morebots>	 Logged the message, Master
[09:04:35] <hashar>	 zeljkof: modules/zuul/ modules/contint  manifests/roles/ci.pp  modules/jenkins
[09:04:57] <hashar>	 git clone -o gerrit ssh://gerrit.wikimedia.org:29418/operations/puppet.git
[09:05:24] <zeljkof>	 hashar: I lost you
[09:05:28] <hashar>	 yeah
[09:05:46] <zeljkof>	 hashar: did you unplug the camera? :)
[09:56:09] <wikibugs>	 Continuous-Integration: Re-create ci slaves (April 2015) - https://phabricator.wikimedia.org/T94916#1202404 (hashar)
[10:02:43] <wikibugs>	 Continuous-Integration: Zuul: python git assert error assert len(fetch_info_lines) == len(fetch_head_info) - https://phabricator.wikimedia.org/T61991#1202413 (hashar) That is apparently solved in v3.2.  Now pending upgrade of Zuul.
[10:03:32] <wikibugs>	 Continuous-Integration, Continuous-Integration-Isolation, operations, Blocked-on-Operations, and 2 others: Create a Debian package for Zuul - https://phabricator.wikimedia.org/T48552#1202415 (hashar)
[10:03:33] <wikibugs>	 Continuous-Integration: Zuul: python git assert error assert len(fetch_info_lines) == len(fetch_head_info) - https://phabricator.wikimedia.org/T61991#1202414 (hashar)
[10:03:49] <wikibugs>	 Continuous-Integration: Upgrade Zuul server to latest upstream - https://phabricator.wikimedia.org/T94409#1202417 (hashar)
[10:03:50] <wikibugs>	 Continuous-Integration: Zuul: python git assert error assert len(fetch_info_lines) == len(fetch_head_info) - https://phabricator.wikimedia.org/T61991#646666 (hashar)
[10:13:15] <wikibugs>	 Continuous-Integration: job creation permission on jenkins for WMDE-Fisch - https://phabricator.wikimedia.org/T95546#1202463 (Lydia_Pintscher) Ok for @WMDE-Fisch to be added to the WMDE group.
[10:16:39] <wikibugs>	 Continuous-Integration: Store Jenkins build output outside Jenkins (e.g. static storage) - https://phabricator.wikimedia.org/T53447#1202474 (hashar) Potentially, we can already write a wrapper that would send the log to a central storage. OpenStack uses swift with the settings being held directly in Zuul con...
[10:25:10] <grrrit-wm>	 (PS18) Hashar: Jenkins job builder definition for CRM job [integration/config] - https://gerrit.wikimedia.org/r/195063 (https://phabricator.wikimedia.org/T91895) (owner: Awight)
[10:27:37] <grrrit-wm>	 (CR) Hashar: [C: 2] "The job has been deployed by awight and is already triggered by Zuul." [integration/config] - https://gerrit.wikimedia.org/r/195063 (https://phabricator.wikimedia.org/T91895) (owner: Awight)
[10:27:46] <grrrit-wm>	 (PS7) Hashar: wikimedia-fundraising-civicrm tests are voting [integration/config] - https://gerrit.wikimedia.org/r/195343 (https://phabricator.wikimedia.org/T91895) (owner: Awight)
[10:41:05] <grrrit-wm>	 (CR) jenkins-bot: [V: -1] Jenkins job builder definition for CRM job [integration/config] - https://gerrit.wikimedia.org/r/195063 (https://phabricator.wikimedia.org/T91895) (owner: Awight)
[10:41:25] <grrrit-wm>	 (CR) Hashar: [C: 2] Jenkins job builder definition for CRM job [integration/config] - https://gerrit.wikimedia.org/r/195063 (https://phabricator.wikimedia.org/T91895) (owner: Awight)
[10:42:47] <hashar>	 !log reducing number of executors from 5 to 4
[10:43:09] <qa-morebots>	 Logged the message, Master
[10:43:42] <hashar>	 !log reducing number of executors on Precise instances from 5 to 4 and on Trusty instances from 6 to 4.   The Jenkins scheduler tends to assign the unified jobs to the same slave, which overloads a single slave while others are idling.
[10:43:44] <qa-morebots>	 Logged the message, Master
[10:44:54] <grrrit-wm>	 (Merged) jenkins-bot: Jenkins job builder definition for CRM job [integration/config] - https://gerrit.wikimedia.org/r/195063 (https://phabricator.wikimedia.org/T91895) (owner: Awight)
[10:46:50] <grrrit-wm>	 (CR) Hashar: [C: 2] "Confirmed it is passing at https://integration.wikimedia.org/ci/job/wikimedia-fundraising-civicrm/ :) Congratulations!" [integration/config] - https://gerrit.wikimedia.org/r/195343 (https://phabricator.wikimedia.org/T91895) (owner: Awight)
[10:48:29] <grrrit-wm>	 (Merged) jenkins-bot: wikimedia-fundraising-civicrm tests are voting [integration/config] - https://gerrit.wikimedia.org/r/195343 (https://phabricator.wikimedia.org/T91895) (owner: Awight)
[10:54:11] <grrrit-wm>	 (PS1) Hashar: zuul: pipelines graphs with drawNullAsZero=1 [integration/docroot] - https://gerrit.wikimedia.org/r/203802
[10:54:25] <grrrit-wm>	 (CR) Hashar: [C: 2] zuul: pipelines graphs with drawNullAsZero=1 [integration/docroot] - https://gerrit.wikimedia.org/r/203802 (owner: Hashar)
[10:54:27] <grrrit-wm>	 (Merged) jenkins-bot: zuul: pipelines graphs with drawNullAsZero=1 [integration/docroot] - https://gerrit.wikimedia.org/r/203802 (owner: Hashar)
[10:56:34] <grrrit-wm>	 (PS1) Hashar: zuul: fix case for drawNullAsZero [integration/docroot] - https://gerrit.wikimedia.org/r/203803
[10:56:55] <grrrit-wm>	 (CR) Hashar: [C: 2] zuul: fix case for drawNullAsZero [integration/docroot] - https://gerrit.wikimedia.org/r/203803 (owner: Hashar)
[10:56:57] <grrrit-wm>	 (Merged) jenkins-bot: zuul: fix case for drawNullAsZero [integration/docroot] - https://gerrit.wikimedia.org/r/203803 (owner: Hashar)
[10:59:28] <grrrit-wm>	 (PS1) Hashar: zuul: rephrase zuul processing graph [integration/docroot] - https://gerrit.wikimedia.org/r/203804
[10:59:44] <grrrit-wm>	 (CR) Hashar: [C: 2] zuul: rephrase zuul processing graph [integration/docroot] - https://gerrit.wikimedia.org/r/203804 (owner: Hashar)
[11:18:22] <wikibugs>	 Continuous-Integration, Fundraising Sprint House of Pain, Fundraising Tech Backlog, Wikimedia-Fundraising-CiviCRM, and 2 others: Write Jenkins job builder definition for CiviCRM CI job - https://phabricator.wikimedia.org/T91895#1202568 (hashar) @awight congratulations on acing the Jenkins configur...
[11:18:29] <wikibugs>	 Continuous-Integration, Wikimedia-Fundraising-CiviCRM: CI for Civi: provision and run tests under Jenkins/Zuul - https://phabricator.wikimedia.org/T86103#1202571 (hashar)
[11:18:31] <wikibugs>	 Continuous-Integration, Fundraising Sprint House of Pain, Fundraising Tech Backlog, Wikimedia-Fundraising-CiviCRM, and 2 others: Write Jenkins job builder definition for CiviCRM CI job - https://phabricator.wikimedia.org/T91895#1202569 (hashar) Open>Resolved a:hashar
[11:18:52] <wikibugs>	 Continuous-Integration, Fundraising Sprint House of Pain, Fundraising Tech Backlog, Wikimedia-Fundraising-CiviCRM, and 2 others: Write Jenkins job builder definition for CiviCRM CI job - https://phabricator.wikimedia.org/T91895#1098353 (hashar) a:hashar>awight
[11:22:04] <grrrit-wm>	 (PS2) Hashar: tests: factor out fake classes in a module [integration/config] - https://gerrit.wikimedia.org/r/203336
[11:24:56] <grrrit-wm>	 (CR) Hashar: [C: 2] tests: factor out fake classes in a module [integration/config] - https://gerrit.wikimedia.org/r/203336 (owner: Hashar)
[11:26:45] <grrrit-wm>	 (Merged) jenkins-bot: tests: factor out fake classes in a module [integration/config] - https://gerrit.wikimedia.org/r/203336 (owner: Hashar)
[11:30:05] <grrrit-wm>	 (PS1) Hashar: mediawiki: set $wgDebugTimestamps [integration/jenkins] - https://gerrit.wikimedia.org/r/203806
[12:09:21] <shinken-wm>	 PROBLEM - Content Translation Server on deployment-cxserver03 is CRITICAL: Connection refused  
[12:10:49] <wikibugs>	 Continuous-Integration, operations: job creation permission on jenkins for WMDE-Fisch - https://phabricator.wikimedia.org/T95546#1202642 (JanZerebecki)
[12:19:22] <shinken-wm>	 RECOVERY - Content Translation Server on deployment-cxserver03 is OK: HTTP OK: HTTP/1.1 200 OK - 1103 bytes in 0.024 second response time  
[12:54:54] <wmf-insecte>	 Yippee, build fixed!
[12:54:55] <wmf-insecte>	 Project browsertests-GettingStarted-en.wikipedia.beta.wmflabs.org-linux-firefox-sauce build #440: FIXED in 54 sec: https://integration.wikimedia.org/ci/job/browsertests-GettingStarted-en.wikipedia.beta.wmflabs.org-linux-firefox-sauce/440/
[13:02:07] <grrrit-wm>	 (CR) Hashar: "If $HOME/.amp is just some cache of download packages it is probably safe to keep around. Whatever command generate content there can pro" [integration/config] - https://gerrit.wikimedia.org/r/203187 (https://phabricator.wikimedia.org/T91895) (owner: Awight)
[13:05:12] <grrrit-wm>	 (PS5) Hashar: Fix failed Internet Explorer browser test jobs [integration/config] - https://gerrit.wikimedia.org/r/203063 (https://phabricator.wikimedia.org/T95398) (owner: Zfilipin)
[13:06:18] <grrrit-wm>	 (CR) Hashar: [C: 2] Fix failed Internet Explorer browser test jobs [integration/config] - https://gerrit.wikimedia.org/r/203063 (https://phabricator.wikimedia.org/T95398) (owner: Zfilipin)
[13:06:35] <grrrit-wm>	 (PS6) Hashar: Set version on Internet Explorer browser test jobs [integration/config] - https://gerrit.wikimedia.org/r/203063 (https://phabricator.wikimedia.org/T95398) (owner: Zfilipin)
[13:06:57] <grrrit-wm>	 (CR) Hashar: [C: 2] Set version on Internet Explorer browser test jobs [integration/config] - https://gerrit.wikimedia.org/r/203063 (https://phabricator.wikimedia.org/T95398) (owner: Zfilipin)
[13:09:02] <grrrit-wm>	 (CR) Hashar: "I have deleted the old jobs" [integration/config] - https://gerrit.wikimedia.org/r/203063 (https://phabricator.wikimedia.org/T95398) (owner: Zfilipin)
[13:15:51] <wikibugs>	 Continuous-Integration: Remove pywiki* and integration* from 'mediawiki' gate queue. - https://phabricator.wikimedia.org/T93304#1202679 (hashar) Open>declined a:hashar Not much we can do. The root cause is that a lot of different projects share the same job and Zuul.  @legoktm wrote a patch for Zuul...
[13:16:45] <grrrit-wm>	 (Abandoned) Hashar: Make gate-and-submit an independent pipeline [integration/config] - https://gerrit.wikimedia.org/r/202958 (https://phabricator.wikimedia.org/T94322) (owner: Legoktm)
[13:18:14] <grrrit-wm>	 (Merged) jenkins-bot: Set version on Internet Explorer browser test jobs [integration/config] - https://gerrit.wikimedia.org/r/203063 (https://phabricator.wikimedia.org/T95398) (owner: Zfilipin)
[13:19:08] <wikibugs>	 Continuous-Integration, Patch-For-Review: Re-evaluate use of "Dependent Pipeline" in Zuul for gate-and-submit in the short term - https://phabricator.wikimedia.org/T94322#1202689 (hashar) >>! In T94322#1192568, @Legoktm wrote: > @hashar: What you described is an ideal situation, but the reality is that an...
[13:24:15] <wikibugs>	 Browser-Tests, Patch-For-Review: IE Browser tests job have no test being run due to a mistake in cucumber tag - https://phabricator.wikimedia.org/T95398#1202693 (hashar) Seems jobs have been refreshed and should work now.
[13:25:08] <grrrit-wm>	 (CR) Hashar: "That is nice! Thanks :)" [integration/config] - https://gerrit.wikimedia.org/r/201677 (https://phabricator.wikimedia.org/T93558) (owner: Krinkle)
[13:27:33] <wikibugs>	 Beta-Cluster, Blocked-on-RelEng, ContentTranslation-Deployments, MediaWiki-extensions-ContentTranslation, and 3 others: Setup new wikis in Beta Cluster for Content Translation - https://phabricator.wikimedia.org/T90683#1202698 (hashar) I dont think anything is blocked on Releng. The [[ https://wik...
[13:34:11] <Krinkle>	 hashar: Fun word pun.
[13:34:11] <Krinkle>	 converting "fake" classes into real classes
[13:34:11] <Krinkle>	 but still fake
[13:37:21] <wikibugs>	 Release-Engineering, Jenkins: [Quarterly Goal] Jenkins Performance improvements - https://phabricator.wikimedia.org/T422#1202730 (hashar)
[13:37:22] <wikibugs>	 Continuous-Integration, Release-Engineering: Create list of performance-related improvements for Jenkins jobs - https://phabricator.wikimedia.org/T423#1202727 (hashar) Open>declined a:hashar I dont think there is any point in keeping that task around.
[13:39:10] <wikibugs>	 Browser-Tests, Continuous-Integration, Release-Engineering: Map operations/mediawiki-config/extension-list entries to Jenkins browser test job - https://phabricator.wikimedia.org/T456#1202732 (hashar) @greg Would you mind clarifying the task at hand?  I am wondering what we are trying to achieve here :-]
[13:39:46] <wikibugs>	 Beta-Cluster, Blocked-on-RelEng, ContentTranslation-Deployments, MediaWiki-extensions-ContentTranslation, and 3 others: Setup new wikis in Beta Cluster for Content Translation - https://phabricator.wikimedia.org/T90683#1202734 (KartikMistry) @hashar Just https://gerrit.wikimedia.org/r/#/c/202689 a...
[13:45:40] <wmf-insecte>	 Project browsertests-UploadWizard-commons.wikimedia.beta.wmflabs.org-linux-firefox-sauce build #587: FAILURE in 14 min: https://integration.wikimedia.org/ci/job/browsertests-UploadWizard-commons.wikimedia.beta.wmflabs.org-linux-firefox-sauce/587/
[13:49:36] <wikibugs>	 Continuous-Integration, Ops-Access-Requests, operations: Add user wmde-fisch to LDAP group wmde - https://phabricator.wikimedia.org/T95546#1202755 (hashar)
[13:50:07] <wikibugs>	 Continuous-Integration, Ops-Access-Requests, operations: Add user wmde-fisch to LDAP group wmde - https://phabricator.wikimedia.org/T95546#1194224 (hashar) To make ops work easier, I have rephrased the task title and added some context to its description.
[13:55:42] <Krinkle>	 hashar: ping. Okay to delete integration-slave100x? I deleted the old 140x trusty slaves already last week after 2 days of having the new ones run fine. I'm asking you this time because I see slave1004 was updated today.
[13:56:07] <hashar>	 Krinkle: go go go  :)
[13:56:17] <hashar>	 Krinkle: also I lowered the # of executors on the slaves
[13:56:23] <hashar>	 from 5 to 4 for Precise
[13:56:28] <hashar>	 and from 6 to 4 for Trusty
[13:56:32] <Krinkle>	 !log Delete old integration-slave1001...1004 (T94916)
[13:56:37] <qa-morebots>	 Logged the message, Master
[13:56:48] <hashar>	 Jenkins has the annoying practice of assigning jobs to an instance that already ran a job
[13:57:03] <Krinkle>	 hashar: Yes?
[13:57:11] <hashar>	 so this morning I had 5 jobs on the 1001 precise slave  while others were idling :/
[13:57:33] <Krinkle>	 hashar: But if a 6th job comes in, it will use the other slave just fine.
[13:57:36] <Krinkle>	 Why is that a problem.
[13:57:37] <hashar>	 yup
[13:57:49] <hashar>	 though the issue this morning was that 4 jobs were running mediawiki core tests
[13:57:54] <Krinkle>	 This way we reduce duplicate workspaces.
[13:57:55] <hashar>	 which caused the instance to be overloaded
[13:58:10] <Krinkle>	 We do not currently have the capacity to host all workspaces on all slaves.
[13:58:10] <hashar>	 and my lame tox-flake8 was occupying the last executor and took 12+ minutes to run :)
[13:58:19] <shinken-wm>	 PROBLEM - Host integration-slave1001 is DOWN: CRITICAL - Host Unreachable (10.68.16.60)  
[13:58:23] <shinken-wm>	 PROBLEM - Host integration-slave1002 is DOWN: CRITICAL - Host Unreachable (10.68.16.175)  
[13:58:24] <shinken-wm>	 PROBLEM - Host integration-slave1004 is DOWN: CRITICAL - Host Unreachable (10.68.17.4)  
[13:58:39] <Krinkle>	 This is why I re-created our pool +1 (4 instead of 5 precise slaves, and 6 instead of 5 trusty slaves)
[13:58:43] <Krinkle>	 and also increased the executors by one
[13:59:02] <hashar>	 yeah
[13:59:22] <hashar>	 but if you end up with 4+ heavy jobs on the same instance, they are all suffering a long delay
[13:59:24] <shinken-wm>	 PROBLEM - Host integration-slave1003 is DOWN: CRITICAL - Host Unreachable (10.68.17.138)  
[13:59:28] <Krinkle>	 so from 4x5=20 precise to 5x6 =30 precise
[13:59:32] <hashar>	 adding more slaves is a good workaround
[14:00:20] <Krinkle>	 hashar: Are you saying CPU will go 100% with only 5 jobs active?
[14:00:48] <hashar>	 most probably yeah
[14:00:51] <hashar>	 depends on the jobs
[14:00:56] <hashar>	 some are heavy CPU
[14:01:02] <hashar>	 others are mostly waiting for network io
[14:01:22] <Krinkle>	 Hm.. 
[14:01:29] <Krinkle>	 hashar: You made trusty slaves 4 executors too
[14:01:31] <Krinkle>	 not 5
[14:01:35] <hashar>	 yeah
[14:01:39] <Krinkle>	 so it went down 2
[14:01:40] <hashar>	 they have 4 CPU don't they ?
[14:01:47] <hashar>	 so you don't want 5 jobs contending for 4 cpus
[14:01:54] <Krinkle>	 we now have less capacity than 2 months ago
[14:02:08] <hashar>	 I don't think it will be a problem
[14:02:11] <Krinkle>	 Trusty always had one more
[14:02:23] <hashar>	 though we would need a way to monitor the executors occupation
[14:02:39] <Krinkle>	 5x5=25, 4x6=24
[14:02:41] <hashar>	 if we find we are having trouble, let's add more instances
[14:02:50] <hashar>	 or maybe we can ask to get instances with 8 CPU instead :)
[14:03:16] <Krinkle>	 What will happen is that with our continued increase in load, things are now going to be slower with even larger queues and backlog.
[14:03:51] <hashar>	 my point is that having 6 executors on a 4 CPU instance will cause each job to take way longer
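hashar's contention argument can be put in numbers with a rough sketch. It assumes every job is single-threaded and fully CPU-bound and that the instance shares its CPUs evenly; jobs that mostly wait on network I/O, as mentioned earlier in the discussion, would be slowed much less. The function name is hypothetical and nothing here was measured on the actual slaves.

```python
def cpu_bound_slowdown(concurrent_jobs: int, cpus: int) -> float:
    """Factor by which each job's wall-clock time grows (toy model).

    Assumes single-threaded, fully CPU-bound jobs and fair CPU sharing.
    """
    if concurrent_jobs <= 0 or cpus <= 0:
        raise ValueError("concurrent_jobs and cpus must be positive")
    # With fewer jobs than CPUs nobody waits; beyond that, time scales with the ratio.
    return max(1.0, concurrent_jobs / cpus)


# 6 busy executors on a 4 CPU instance versus 4 busy executors
six_on_four = cpu_bound_slowdown(6, 4)
four_on_four = cpu_bound_slowdown(4, 4)
```

Under these assumptions, six busy executors on four CPUs make each job take about 1.5 times as long, while four executors run without contention.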
[14:04:07] <Krinkle>	 We should add more instances before decreasing executors, not the other way around.
[14:04:12] <Krinkle>	 I don't have time for that.
[14:04:41] <Krinkle>	 or increase CPU if needed. Whatever.
[14:05:01] <Krinkle>	 Running slow is better than not running at all 
[14:05:07] <Krinkle>	 Under peak load.
[14:05:18] <Krinkle>	 Such as with wikidata exploding jobs out of nowhere
[14:05:24] <Krinkle>	 (what the hell happened there anyway?)
[14:06:09] <hashar>	 https://integration.wikimedia.org/ci/load-statistics  might give some clue
[14:06:14] <hashar>	 though it doesn't have a long history :(
[14:06:22] <Krinkle>	 They changed the job configuration.
[14:06:43] <Krinkle>	 The problem is that we have too many repositories to host workspaces for on one slave.
[14:07:10] <Krinkle>	 We need more than 4 executors as otherwise disk goes full.
[14:07:14] <Krinkle>	 With 4 executors, the same job will go to different slaves at different times.
[14:07:22] <Krinkle>	 We've been there already.
[14:08:22] <Krinkle>	 This is also why Lego and I had to remove zuul-cloner from many jobs because zuul-cloner doesn't support workspace wipe, doesn't support submodules, and (most importantly) doesn't support shallow clone.
[14:08:36] <hashar>	 I don't understand the relationship you claim between # of executors and disk space
[14:08:50] <Krinkle>	 hashar: Imagine every slave has 1 executor.
[14:09:02] <Krinkle>	 and a job gets triggered. It goes to a slave.
[14:09:15] <Krinkle>	 Then next time, that slave is already used, so it has to go to a different slave instead.
[14:09:23] <Krinkle>	 This means that jobs' workspace is now on two slaves.
[14:09:55] <Krinkle>	 Jenkins tries to re-use the same slave each build for a job to reduce duplicated git clones.
[14:10:38] <Krinkle>	 It is currently already the case that we have too many jobs/repositories to host workspaces for on one slave.
[14:10:38] <Krinkle>	 Reducing the executors significantly increases duplicate workspaces and disk usage.
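Krinkle's one-executor thought experiment can be played through with a small toy model. It is only a sketch of the reasoning above, not of the real Jenkins scheduler: it assumes a build prefers a slave that already holds its workspace, otherwise takes the first slave with a free executor, and that all executors are free again at the start of each round. The function name and the job names are hypothetical.

```python
def count_workspaces(rounds, num_slaves, executors_per_slave):
    """Count distinct (job, slave) workspaces created in a toy sticky scheduler.

    rounds is a list of lists; each inner list holds the jobs running at the same time.
    """
    workspaces = set()
    for concurrent_jobs in rounds:
        free = [executors_per_slave] * num_slaves  # all executors idle at round start
        for job in concurrent_jobs:
            # Prefer a slave that already has this job's workspace and a free executor.
            preferred = [s for s in range(num_slaves)
                         if (job, s) in workspaces and free[s] > 0]
            candidates = preferred or [s for s in range(num_slaves) if free[s] > 0]
            if not candidates:
                continue  # build would wait in the queue; ignored in this toy model
            slave = candidates[0]
            free[slave] -= 1
            workspaces.add((job, slave))
    return len(workspaces)


rounds = [["a", "b"], ["c", "a"]]
with_one_executor = count_workspaces(rounds, num_slaves=2, executors_per_slave=1)
with_two_executors = count_workspaces(rounds, num_slaves=2, executors_per_slave=2)
```

In this example, one executor per slave leaves job "a" with a workspace on both slaves (4 workspaces for 3 jobs), while two executors per slave keep it at 3.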
[14:11:00] <hashar>	 ok 
[14:11:04] <Krinkle>	 This happened January, and last year twice as well.
[14:11:04] <Krinkle>	 Each time I believe it was a consequence of executors having been lowered after I increased it.
[14:11:16] <hashar>	 so with six executors you have the same problem eventually
[14:11:16] <Krinkle>	 I should've documented that better, but it was for a good reason.
[14:11:28] <hashar>	 when all executors are busy on the first slave
[14:11:34] <Krinkle>	 No
[14:11:40] <Krinkle>	 Because jenkins does this
[14:11:46] <hashar>	 a way would be to create shards of extensions and bind them to specific slaves
[14:11:49] <Krinkle>	 It keeps track of where jobs run.
[14:11:50] <hashar>	 or stop cloning core
[14:12:01] <Krinkle>	 It's not perfect, but good enough.
[14:12:10] <Krinkle>	 Good enough that in reality (not theory) we do not get full disk.
[14:12:17] <Krinkle>	 That's all I know.
[14:12:22] <hashar>	 got it :)
[14:12:53] <hashar>	 we can go with the hack I have published at https://phabricator.wikimedia.org/T93703#1144542
[14:12:56] <hashar>	 namely use a local mirror
[14:13:04] <Krinkle>	 we can increase CPU, get more slaves, more disk, get local git cache per instance, reduce core clones etc.
[14:13:07] <Krinkle>	 but until that happens, we need this.
[14:13:09] <hashar>	 would also need to adjust zuul-cloner code to be able to pass custom options to git clone
[14:13:18] <hashar>	 it is hardcoded to always do a copy :/
[14:13:35] <hashar>	 the # of executors is really a hack around the way Jenkins schedules jobs
[14:13:35] <Krinkle>	 hashar: Travis CI changed several years ago to always do --depth=10 for all git clones
[14:13:42] <Krinkle>	 It's the default, and there is no way to modify it
[14:13:47] <Krinkle>	 And nobody complains
[14:13:50] <Krinkle>	 Works fine :)
[14:13:55] <Krinkle>	 Full workspace wipe every time
[14:14:02] <Krinkle>	 and re-clone shallow
[14:14:12] <Krinkle>	 No git cache even
[14:14:22] <Krinkle>	 Although our local git cache will make it even faster
[14:14:28] <Krinkle>	 but I think doing wipe and depth will be a good start.
[14:14:30] <Krinkle>	 What do you think?
[14:18:45] <wikibugs>	 Continuous-Integration: Phase out yamllint jobs (tracking) - https://phabricator.wikimedia.org/T95890#1202800 (hashar) NEW
[14:20:49] <wikibugs>	 Continuous-Integration, operations: Provide lint for yaml files in operations repository - https://phabricator.wikimedia.org/T91496#1202809 (hashar) Open>declined a:hashar Per my previous comment, the yaml linting should be done by a test suite in the operations/puppet.git repository. CI would in...
[14:21:19] <hashar>	 Krinkle: I am not sure how the --depth will work to be honest
[14:21:27] <hashar>	 I can't remember how zuul-cloner clones the repo
[14:21:34] <hashar>	 I guess it clones whatever origin/HEAD is
[14:21:59] <hashar>	 then depending on the patchset branch that triggered the change, it does a checkout of ZUUL_BRANCH
[14:22:01] <hashar>	 so 
[14:22:39] <hashar>	 if you send a patch to REL1_24 of mediawiki/core  I am not sure how it is going to work
[14:23:37] <Krinkle>	 hashar: We'd wipe workspace each build, then zuul will detect directory doesn't exist and reclone all relevant repos (core, extensions etc.)
[14:23:50] <Krinkle>	 and instead of "git clone .. " it would do "git clone --depth 10 .. "
[14:24:06] <Krinkle>	 You can also pass branch or ref to git-clone so you don't need a separate "git checkout "
[14:28:07] <wikibugs>	 Continuous-Integration: Phase out yamllint jobs (tracking) - https://phabricator.wikimedia.org/T95890#1202846 (hashar)
[14:28:23] <Krinkle>	 Should be trivial to add the extra argument, right?
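The clone Krinkle describes could look roughly like the sketch below. It only builds the plain `git clone --depth 10 --branch ...` command line and is not zuul-cloner's actual code; the helper names and the example URL are hypothetical. Note that `--branch` accepts a branch or tag name, so an arbitrary ref such as a Gerrit patchset ref would presumably still need a separate fetch.

```python
import subprocess


def shallow_clone_command(url, dest, ref=None, depth=10):
    """Build the argument list for a shallow clone (hypothetical helper)."""
    cmd = ["git", "clone", "--depth", str(depth)]
    if ref is not None:
        # Check out the given branch or tag directly, no separate "git checkout".
        cmd += ["--branch", ref]
    cmd += [url, dest]
    return cmd


def shallow_clone(url, dest, ref=None, depth=10):
    """Run the shallow clone; raises CalledProcessError if git fails."""
    subprocess.run(shallow_clone_command(url, dest, ref, depth), check=True)


# Example: the REL1_24 case hashar asks about (URL is a placeholder).
example = shallow_clone_command(
    "https://example.org/mediawiki/core.git", "core", ref="REL1_24")
```

Keeping the command construction separate from the call to `subprocess.run` makes the extra argument easy to check without touching the network.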
[14:32:13] <grrrit-wm>	 (PS1) Hashar: Remove commented out yamllint from translatewiki [integration/config] - https://gerrit.wikimedia.org/r/203827 (https://phabricator.wikimedia.org/T86194)
[14:32:28] <wikibugs>	 Continuous-Integration, Patch-For-Review: Support multiple documents in yamlllint - https://phabricator.wikimedia.org/T86194#1202855 (hashar) Phase out of the yamllint generic job is tracked by {T95890}.
[14:32:35] <wmf-insecte>	 Project browsertests-UploadWizard-commons.wikimedia.beta.wmflabs.org-linux-chrome-sauce build #578: FAILURE in 13 min: https://integration.wikimedia.org/ci/job/browsertests-UploadWizard-commons.wikimedia.beta.wmflabs.org-linux-chrome-sauce/578/
[14:34:38] <wikibugs>	 Release-Engineering: Read "Vagrant: Up and Running" book - https://phabricator.wikimedia.org/T95401#1202858 (zeljkofilipin) Another repository created: https://github.com/zeljkofilipin/my_vagrant_plugin
[14:36:21] <wikibugs>	 Continuous-Integration: Phase out yamllint jobs (tracking) - https://phabricator.wikimedia.org/T95890#1202800 (hashar)
[14:45:37] <wmf-insecte>	 Project browsertests-Wikidata-SmokeTests-linux-firefox-sauce build #217: FAILURE in 28 min: https://integration.wikimedia.org/ci/job/browsertests-Wikidata-SmokeTests-linux-firefox-sauce/217/
[14:50:59] <grrrit-wm>	 (PS2) Hashar: Remove commented out yamllint from translatewiki [integration/config] - https://gerrit.wikimedia.org/r/203827 (https://phabricator.wikimedia.org/T86194)
[14:51:28] <grrrit-wm>	 (CR) Hashar: [C: 2] Remove commented out yamllint from translatewiki [integration/config] - https://gerrit.wikimedia.org/r/203827 (https://phabricator.wikimedia.org/T86194) (owner: Hashar)
[14:53:07] <grrrit-wm>	 (Merged) jenkins-bot: Remove commented out yamllint from translatewiki [integration/config] - https://gerrit.wikimedia.org/r/203827 (https://phabricator.wikimedia.org/T86194) (owner: Hashar)
[14:54:33] <grrrit-wm>	 (Abandoned) Hashar: wikimedia-fundraising-civicrm [integration/config] - https://gerrit.wikimedia.org/r/166031 (owner: Hashar)
[15:00:20] <grrrit-wm>	 (PS2) Hashar: (WIP) debian-glue job for Zuul (WIP) [integration/config] - https://gerrit.wikimedia.org/r/203347
[15:05:42] <wikibugs>	 Continuous-Integration, Wikidata: the changed job configuration extension-unittests -> extension-unittests-generic for Wikidata.git makes it not run all tests and fail - https://phabricator.wikimedia.org/T95897#1203016 (JanZerebecki) NEW
[15:09:11] <Krinkle>	 Hm.. wikibugs not working?
[15:10:00] <Krinkle>	 Ah, it's just slow
[15:13:51] <greg-g>	 g'morn
[15:32:12] <wikibugs>	 Deployment-Systems, Services, operations: Evaluate Docker as a container deployment tool - https://phabricator.wikimedia.org/T93439#1203111 (GWicke)
[15:36:38] <wikibugs>	 Continuous-Integration, Release-Engineering, Mobile-Web: mwext-MobileFrontend-qunit-mobile issues again - https://phabricator.wikimedia.org/T95430#1203130 (Krinkle)
[15:36:40] <wikibugs>	 Continuous-Integration, Upstream: Zuul-cloner failing to acquire .git/config lock sometimes - https://phabricator.wikimedia.org/T86730#1203131 (Krinkle)
[15:37:02] <wikibugs>	 Continuous-Integration, Release-Engineering, Mobile-Web: mwext-MobileFrontend-qunit-mobile issues again - https://phabricator.wikimedia.org/T95430#1189699 (Krinkle) Intermittent snafu.
[15:40:38] <wikibugs>	 Continuous-Integration: Add a Gerrit check for file line endings - https://phabricator.wikimedia.org/T53754#1203160 (Krinkle) Open>declined a:Krinkle This should not be handled by a separate job entirely. That's overkill and maintenance overhead for #contint. Individual projects are free to use what...
[15:41:48] <wikibugs>	 Continuous-Integration, Jenkins, Upstream: /etc/init.d/jenkins script provided by Debian doesn't work properly - https://phabricator.wikimedia.org/T53817#1203180 (Krinkle)
[15:42:25] <wikibugs>	 Continuous-Integration: Zuul should not run jenkins-bot on changes for refs/meta/* - https://phabricator.wikimedia.org/T52389#1203192 (Krinkle)
[15:54:21] <shinken-wm>	 PROBLEM - Puppet failure on deployment-pdf02 is CRITICAL: CRITICAL: 44.44% of data above the critical threshold [0.0]  
[16:16:28] <wikibugs>	 Continuous-Integration, I18n, Patch-For-Review, Pywikibot-i18n: Jenkins job to validate JSON files submitted to Gerrit repo pywikibot/i18n - https://phabricator.wikimedia.org/T85335#1203335 (Krinkle) Open>declined >>! In T85335#1177020, @jayvdb wrote: > @legoktm, new JS files (e.g. https://ger...
[16:16:42] <wikibugs>	 Continuous-Integration, I18n, Pywikibot-i18n: Jenkins job to validate JSON files submitted to Gerrit repo pywikibot/i18n - https://phabricator.wikimedia.org/T85335#1203339 (Krinkle) declined>Resolved
[16:21:28] <wikibugs>	 Continuous-Integration, Labs: Purge graphite data for deleted integration instances and nonexistent metrics - https://phabricator.wikimedia.org/T95569#1203374 (Krinkle)
[16:21:58] <wikibugs>	 Continuous-Integration, Labs: Purge graphite data for deleted integration instances and nonexistent metrics - https://phabricator.wikimedia.org/T95569#1194719 (Krinkle) >>! In T95569#1194807, @yuvipanda wrote: > Should I just delete all the data under the integration project, and let it start again from s...
[16:24:17] <shinken-wm>	 RECOVERY - Puppet failure on deployment-pdf02 is OK: OK: Less than 1.00% above the threshold [0.0]  
[16:30:06] <shinken-wm>	 PROBLEM - Puppet failure on integration-dev is CRITICAL: CRITICAL: 50.00% of data above the critical threshold [0.0]  
[16:38:37] <grrrit-wm>	 (PS1) Krinkle: Clean up integration/* report message overrides [integration/config] - https://gerrit.wikimedia.org/r/203856
[16:45:04] <wikibugs>	 Continuous-Integration, Labs, Wikimedia-Labs-Infrastructure: Diamond metrics for cpu.system suddenly up 100% after a reboot - https://phabricator.wikimedia.org/T95912#1203449 (Krinkle) NEW
[16:51:44] <wikibugs>	 Continuous-Integration, Wikidata, Wikidata-Sprint-2015-04-07: the changed job configuration extension-unittests -> extension-unittests-generic for Wikidata.git makes it not run all tests and fail - https://phabricator.wikimedia.org/T95897#1203481 (JanZerebecki)
[16:52:27] <wikibugs>	 Continuous-Integration, Community-consensus-needed: Create a trigger to run extension tests on test coverage extension - https://phabricator.wikimedia.org/T89333#1203486 (Krinkle) Open>declined a:Krinkle I don't think that's desirable. Putting everything on the stack of changes to master of media...
[16:52:28] <wikibugs>	 Continuous-Integration, MediaWiki-extensions-MathSearch, Patch-For-Review: MathSearch tests fail - https://phabricator.wikimedia.org/T89237#1203489 (Krinkle)
[16:55:13] <shinken-wm>	 RECOVERY - Puppet failure on integration-dev is OK: OK: Less than 1.00% above the threshold [0.0]  
[16:55:34] <grrrit-wm>	 (03PS1) 10Legoktm: Use non-generic job for Wikidata extension [integration/config] - 10https://gerrit.wikimedia.org/r/203858 (https://phabricator.wikimedia.org/T95897) 
[16:56:34] <wikibugs>	 10Continuous-Integration, 10MediaWiki-extensions-SemanticForms: SemanticForms unit tests fail - https://phabricator.wikimedia.org/T68052#1203512 (10Krinkle)
[16:57:30] <grrrit-wm>	 (03CR) 10Legoktm: [C: 032] Use non-generic job for Wikidata extension [integration/config] - 10https://gerrit.wikimedia.org/r/203858 (https://phabricator.wikimedia.org/T95897) (owner: 10Legoktm)
[16:58:24] <wikibugs>	 10Beta-Cluster, 10Continuous-Integration, 10Math: beta-recompile-math-texvc-eqiad job fails with "/usr/local/bin/scap-recompile: No such file or directory" - https://phabricator.wikimedia.org/T91191#1203522 (10Krinkle) p:5Normal>3High
[16:59:27] <grrrit-wm>	 (03Merged) 10jenkins-bot: Use non-generic job for Wikidata extension [integration/config] - 10https://gerrit.wikimedia.org/r/203858 (https://phabricator.wikimedia.org/T95897) (owner: 10Legoktm)
[17:01:25] <legoktm>	 !log deploying https://gerrit.wikimedia.org/r/203858
[17:02:23] <qa-morebots>	 Logged the message, Master
[17:02:41] <wikibugs>	 10Continuous-Integration, 10Wikidata, 10Wikidata-Sprint-2015-04-07: the changed job configuration extension-unittests -> extension-unittests-generic for Wikidata.git makes it not run all tests and fail - https://phabricator.wikimedia.org/T95897#1203533 (10Legoktm)
[17:03:26] <wikibugs>	 10Continuous-Integration, 10Wikidata, 10Wikidata-Sprint-2015-04-07: the changed job configuration extension-unittests -> extension-unittests-generic for Wikidata.git makes it not run all tests and fail - https://phabricator.wikimedia.org/T95897#1203016 (10Legoktm) Reverted for now, the generic jobs are exper...
[18:00:47] <shinken-wm>	 PROBLEM - App Server Main HTTP Response on deployment-mediawiki02 is CRITICAL: CRITICAL - Socket timeout after 10 seconds  
[18:00:49] <shinken-wm>	 PROBLEM - App Server Main HTTP Response on deployment-mediawiki01 is CRITICAL: CRITICAL - Socket timeout after 10 seconds  
[18:00:50] <shinken-wm>	 PROBLEM - App Server bits response on deployment-mediawiki01 is CRITICAL: CRITICAL - Socket timeout after 10 seconds  
[18:00:50] <shinken-wm>	 PROBLEM - App Server bits response on deployment-mediawiki03 is CRITICAL: CRITICAL - Socket timeout after 10 seconds  
[18:00:53] <shinken-wm>	 PROBLEM - English Wikipedia Main page on beta-cluster is CRITICAL: CRITICAL - Socket timeout after 10 seconds  
[18:00:53] <shinken-wm>	 PROBLEM - English Wikipedia Mobile Main page on beta-cluster is CRITICAL: CRITICAL - Socket timeout after 10 seconds  
[18:00:54] <shinken-wm>	 PROBLEM - App Server Main HTTP Response on deployment-mediawiki03 is CRITICAL: CRITICAL - Socket timeout after 10 seconds  
[18:00:54] <shinken-wm>	 PROBLEM - App Server bits response on deployment-mediawiki02 is CRITICAL: CRITICAL - Socket timeout after 10 seconds  
[18:00:57] <wmf-insecte>	 Project beta-scap-eqiad build #48920: FAILURE in 3 min 12 sec: https://integration.wikimedia.org/ci/job/beta-scap-eqiad/48920/
[18:01:07] <shinken-wm>	 RECOVERY - English Wikipedia Mobile Main page on beta-cluster is OK: HTTP OK: HTTP/1.1 200 OK - 28677 bytes in 0.565 second response time  
[18:02:14] <shinken-wm>	 PROBLEM - Puppet failure on deployment-pdf02 is CRITICAL: CRITICAL: 22.22% of data above the critical threshold [0.0]  
[18:02:15] <wmf-insecte>	 Project browsertests-Math-en.wikipedia.beta.wmflabs.org-linux-firefox-sauce build #508: FAILURE in 2 min 39 sec: https://integration.wikimedia.org/ci/job/browsertests-Math-en.wikipedia.beta.wmflabs.org-linux-firefox-sauce/508/
[18:02:15] <shinken-wm>	 PROBLEM - Puppet failure on deployment-parsoid05 is CRITICAL: CRITICAL: 22.22% of data above the critical threshold [0.0]  
[18:08:18] <twentyafterfour>	 thcipriani, ^d, marxarelli:  I wrote up the homework assignments on http://etherpad.wikimedia.org/p/deployworkinggroup
[18:08:20] <^d>	 thx
[18:08:20] <shinken-wm>	 PROBLEM - Puppet failure on deployment-salt is CRITICAL: CRITICAL: 55.56% of data above the critical threshold [0.0]  
[18:08:20] <shinken-wm>	 PROBLEM - Puppet failure on integration-slave-trusty-1015 is CRITICAL: CRITICAL: 44.44% of data above the critical threshold [0.0]  
[18:08:21] <thcipriani>	 twentyafterfour: thanks.
[18:12:08] <thcipriani>	 poor NFS, taking a beating, seemingly.
[18:12:08] <legoktm>	 see -labs
[18:12:09] <shinken-wm>	 PROBLEM - English Wikipedia Mobile Main page on beta-cluster is CRITICAL: CRITICAL - Socket timeout after 10 seconds  
[18:19:22] <grrrit-wm>	 (03CR) 1020after4: "I guess this one can get merged without waiting for the config change to happen, right?" [tools/release] - 10https://gerrit.wikimedia.org/r/203642 (https://phabricator.wikimedia.org/T15712) (owner: 10devunt)
[18:24:27] <twentyafterfour>	 here's a concrete example of the problem with trebuchet being owned by ops, which makes something like ansible so appealing:   https://gerrit.wikimedia.org/r/#/c/201344/
[18:24:28] <shinken-wm>	 RECOVERY - App Server Main HTTP Response on deployment-mediawiki02 is OK: HTTP OK: HTTP/1.1 200 OK - 47862 bytes in 5.641 second response time  
[18:25:44] <shinken-wm>	 RECOVERY - Puppet failure on deployment-pdf02 is OK: OK: Less than 1.00% above the threshold [0.0]  
[18:27:42] <wmf-insecte>	 Yippee, build fixed!
[18:27:42] <wmf-insecte>	 Project beta-scap-eqiad build #48925: FIXED in 2 min 0 sec: https://integration.wikimedia.org/ci/job/beta-scap-eqiad/48925/
[18:28:01] <wikibugs>	 10Continuous-Integration, 10Ops-Access-Requests, 6operations: Add user wmde-fisch to LDAP group wmde - https://phabricator.wikimedia.org/T95546#1203802 (10Dzahn) a:3Dzahn
[18:49:54] <shinken-wm>	 PROBLEM - App Server Main HTTP Response on deployment-mediawiki02 is CRITICAL: CRITICAL - Socket timeout after 10 seconds  
[18:49:57] <shinken-wm>	 RECOVERY - Puppet failure on deployment-salt is OK: OK: Less than 1.00% above the threshold [0.0]  
[18:49:57] <shinken-wm>	 RECOVERY - Puppet failure on deployment-parsoid05 is OK: OK: Less than 1.00% above the threshold [0.0]  
[18:49:58] <shinken-wm>	 RECOVERY - Puppet failure on integration-slave-trusty-1015 is OK: OK: Less than 1.00% above the threshold [0.0]  
[18:49:59] <wmf-insecte>	 Project browsertests-CentralNotice-en.wikipedia.beta.wmflabs.org-linux-firefox-sauce build #285: FAILURE in 6 min 11 sec: https://integration.wikimedia.org/ci/job/browsertests-CentralNotice-en.wikipedia.beta.wmflabs.org-linux-firefox-sauce/285/
[18:50:00] <wmf-insecte>	 Project beta-scap-eqiad build #48926: FAILURE in 2 min 53 sec: https://integration.wikimedia.org/ci/job/beta-scap-eqiad/48926/
[18:50:01] <marxarelli>	 twentyafterfour: sweet. ty
[18:50:05] <greg-g>	 why's beta-scap failing?
[18:50:05] <greg-g>	 oh, nfs still?
[18:50:06] <YuviPanda>	 greg-g: yea
[18:50:07] <YuviPanda>	 greg-g: although, theoretically beta should be independent of NFS now
[18:50:07] <YuviPanda>	 not sure why it’s failing, actually
[18:50:07] <YuviPanda>	 keys are on localdisk as well
[18:50:10] <grrrit-wm>	 (03CR) 10Chad: [C: 032] Add Josa extension to make-wmf-branch/default.conf [tools/release] - 10https://gerrit.wikimedia.org/r/203642 (https://phabricator.wikimedia.org/T15712) (owner: 10devunt)
[18:55:24] <shinken-wm>	 RECOVERY - App Server Main HTTP Response on deployment-mediawiki02 is OK: HTTP OK: HTTP/1.1 200 OK - 47865 bytes in 0.610 second response time  
[18:55:30] <YuviPanda>	 Krinkle: I moved all the current integration data into archive.integration
[18:59:23] <wmf-insecte>	 Project browsertests-CentralNotice-en.wikipedia.beta.wmflabs.org-linux-chrome-sauce build #284: FAILURE in 6 min 23 sec: https://integration.wikimedia.org/ci/job/browsertests-CentralNotice-en.wikipedia.beta.wmflabs.org-linux-chrome-sauce/284/
[19:04:03] <shinken-wm>	 PROBLEM - App Server Main HTTP Response on deployment-mediawiki02 is CRITICAL: CRITICAL - Socket timeout after 10 seconds  
[19:04:06] <Krinkle>	 YuviPanda: thx
[19:05:07] <shinken-wm>	 RECOVERY - App Server Main HTTP Response on deployment-mediawiki01 is OK: HTTP OK: HTTP/1.1 200 OK - 47862 bytes in 0.597 second response time  
[19:05:26] <shinken-wm>	 RECOVERY - English Wikipedia Mobile Main page on beta-cluster is OK: HTTP OK: HTTP/1.1 200 OK - 28677 bytes in 0.627 second response time  
[19:06:05] <wmf-insecte>	 Yippee, build fixed!
[19:06:06] <wmf-insecte>	 Project beta-scap-eqiad build #48929: FIXED in 2 min 4 sec: https://integration.wikimedia.org/ci/job/beta-scap-eqiad/48929/
[19:06:17] <grrrit-wm>	 (03Merged) 10jenkins-bot: Add Josa extension to make-wmf-branch/default.conf [tools/release] - 10https://gerrit.wikimedia.org/r/203642 (https://phabricator.wikimedia.org/T15712) (owner: 10devunt)
[19:06:37] <shinken-wm>	 RECOVERY - App Server bits response on deployment-mediawiki01 is OK: HTTP OK: HTTP/1.1 200 OK - 3895 bytes in 5.396 second response time  
[19:06:37] <shinken-wm>	 RECOVERY - App Server bits response on deployment-mediawiki03 is OK: HTTP OK: HTTP/1.1 200 OK - 3895 bytes in 1.936 second response time  
[19:08:57] <shinken-wm>	 PROBLEM - Puppet failure on integration-labsvagrant is CRITICAL: CRITICAL: 33.33% of data above the critical threshold [0.0]  
[20:15:05] <wmf-insecte>	 Project beta-scap-eqiad build #48930: FAILURE in 2 min 40 sec: https://integration.wikimedia.org/ci/job/beta-scap-eqiad/48930/
[20:15:05] <shinken-wm>	 PROBLEM - English Wikipedia Mobile Main page on beta-cluster is CRITICAL: CRITICAL - Socket timeout after 10 seconds  
[20:15:05] <shinken-wm>	 PROBLEM - App Server Main HTTP Response on deployment-mediawiki01 is CRITICAL: CRITICAL - Socket timeout after 10 seconds  
[20:15:05] <shinken-wm>	 PROBLEM - App Server bits response on deployment-mediawiki01 is CRITICAL: CRITICAL - Socket timeout after 10 seconds  
[20:15:05] <shinken-wm>	 PROBLEM - App Server bits response on deployment-mediawiki03 is CRITICAL: CRITICAL - Socket timeout after 10 seconds  
[20:15:11] <Krinkle>	 !log Restarting Zuul, Jenkins and aborting all builds. Everything got stuck following the NFS outage in labs
[20:15:14] <icinga-wm>	 PROBLEM - zuul_gearman_service on gallium is CRITICAL: Connection refused
[20:15:15] <icinga-wm>	 PROBLEM - zuul_service_running on gallium is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/local/bin/zuul-server
[20:15:20] <shinken-wm>	 PROBLEM - Puppet failure on integration-slave-jessie-1001 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [0.0]  
[20:15:32] <grrrit-wm>	 (03PS1) 10Krinkle: zuul: Don't raise "abort" as error to the user (2) [integration/docroot] - 10https://gerrit.wikimedia.org/r/203890 
[20:15:35] <wikibugs>	 10Continuous-Integration, 6Labs: Continuous integration should not depend on labs NFS - https://phabricator.wikimedia.org/T90610#1204249 (10Krinkle) p:5Low>3High
[20:17:06] <shinken-wm>	 RECOVERY - App Server Main HTTP Response on deployment-mediawiki01 is OK: HTTP OK: HTTP/1.1 200 OK - 47862 bytes in 0.560 second response time  
[20:17:33] <shinken-wm>	 RECOVERY - App Server bits response on deployment-mediawiki01 is OK: HTTP OK: HTTP/1.1 200 OK - 3895 bytes in 2.319 second response time  
[20:17:35] <shinken-wm>	 RECOVERY - App Server bits response on deployment-mediawiki03 is OK: HTTP OK: HTTP/1.1 200 OK - 3895 bytes in 1.053 second response time  
[20:20:24] <shinken-wm>	 RECOVERY - English Wikipedia Main page on beta-cluster is OK: HTTP OK: HTTP/1.1 200 OK - 48054 bytes in 0.748 second response time  
[20:21:36] <wmf-insecte>	 Project beta-code-update-eqiad build #51673: FAILURE in 15 sec: https://integration.wikimedia.org/ci/job/beta-code-update-eqiad/51673/
[20:21:44] <wmf-insecte>	 Project beta-update-databases-eqiad build #8896: FAILURE in 24 sec: https://integration.wikimedia.org/ci/job/beta-update-databases-eqiad/8896/
[20:22:00] <Krinkle>	 ^ manual abort
[20:24:46] <icinga-wm>	 RECOVERY - zuul_service_running on gallium is OK: PROCS OK: 2 processes with regex args ^/usr/bin/python /usr/local/bin/zuul-server
[20:26:03] <icinga-wm>	 RECOVERY - zuul_gearman_service on gallium is OK: TCP OK - 0.000 second response time on port 4730
[20:26:20] <grrrit-wm>	 (03CR) 10Krinkle: [C: 032] zuul: Don't raise "abort" as error to the user (2) [integration/docroot] - 10https://gerrit.wikimedia.org/r/203890 (owner: 10Krinkle)
[20:26:25] <grrrit-wm>	 (03Merged) 10jenkins-bot: zuul: Don't raise "abort" as error to the user (2) [integration/docroot] - 10https://gerrit.wikimedia.org/r/203890 (owner: 10Krinkle)
[20:29:00] <shinken-wm>	 RECOVERY - Puppet failure on integration-labsvagrant is OK: OK: Less than 1.00% above the threshold [0.0]  
[20:32:46] <wikibugs>	 3Continuous-Integration-Isolation, 6operations: Remove hashar root access on to be installed labnodepool1001 - https://phabricator.wikimedia.org/T95303#1204502 (10chasemp) p:5Normal>3Low
[20:33:11] <wikibugs>	 3Continuous-Integration-Isolation, 6operations: Remove hashar root access on to be installed labnodepool1001 - https://phabricator.wikimedia.org/T95303#1186041 (10chasemp) 5Open>3stalled Please don't close this as it is to remind me to ensure this access is revoked as appropriate.
[20:33:13] <wikibugs>	 3Continuous-Integration-Isolation, 6operations: install/deploy labnodepool1001 - https://phabricator.wikimedia.org/T95045#1204509 (10chasemp)
[20:34:30] <wikibugs>	 3Continuous-Integration-Isolation, 6operations: Remove hashar root access on to be installed labnodepool1001 - https://phabricator.wikimedia.org/T95303#1204512 (10hashar) I can confirm the access works just fine. Thanks!
[20:35:32] <wikibugs>	 3Continuous-Integration-Isolation, 6operations: Remove hashar root access on to be installed labnodepool1001 - https://phabricator.wikimedia.org/T95303#1204538 (10Legoktm) >>! In T95303#1204130, @RobH wrote: > Ops meeting disucssion resulted in approval, with conditions that Chase is aware of.  As he will hand...
[20:36:51] <wikibugs>	 3Continuous-Integration-Isolation, 6operations: Remove hashar root access on to be installed labnodepool1001 - https://phabricator.wikimedia.org/T95303#1204540 (10chasemp) >>! In T95303#1204538, @Legoktm wrote: >>>! In T95303#1204130, @RobH wrote: >> Ops meeting disucssion resulted in approval, with conditions...
[20:37:41] <hashar>	 legoktm: the idea is to set up labnodepool more or less puppetized
[20:37:41] <hashar>	 legoktm: and eventually trash the box and rebuild it completely 
[20:37:46] <legoktm>	 ah
[20:37:47] <hashar>	 making it production grade
[20:37:48] <legoktm>	 gotcha
[20:37:50] <wikibugs>	 10Continuous-Integration, 3Continuous-Integration-Isolation, 10hardware-requests, 6operations: eqiad: (1) allocate server to migrate Zuul server to - https://phabricator.wikimedia.org/T95760#1204548 (10RobH) 'hardware is getting old' is not a valid reasoning.  So this cannot be easily upgraded in place, an...
[20:37:55] <hashar>	 and ideally I would not need root access on it
[20:38:09] <hashar>	 the only reason I get root on gallium is to be able to debug Jenkins :(
[20:38:16] <hashar>	 or fix basic stuff
[20:38:43] <hashar>	 legoktm: such root approvals are discussed privately between ops and extra care is taken
[20:38:55] <hashar>	 that is more or less supposed to remain private I guess so each op member can speak openly
[20:39:02] <hashar>	 without the fear of hurting the recipient (me)
[20:40:19] <hashar>	 legoktm: I have 0 chance to review your zuul patch this week. But you can try proposing it to upstream
[20:41:16] <hashar>	 their Gerrit is https://review.openstack.org/ , project openstack-infra/zuul . You need an account on Launchpad and might have to sign a CLA
[20:41:18] <legoktm>	 hashar: I spent about an hour going through openstack's CLA process on friday but then it didn't like my ssh key nor https password so I gave up for now
[20:41:23] <wikibugs>	 10Continuous-Integration, 3Continuous-Integration-Isolation, 10hardware-requests, 6operations: eqiad: (1) allocate server to migrate Zuul server to - https://phabricator.wikimedia.org/T95760#1204574 (10chasemp) >>! In T95760#1204548, @RobH wrote: > 'hardware is getting old' is not a valid reasoning. >  > S...
[20:41:30] <hashar>	 legoktm: oh that is a pity :(((
[20:43:01] <valhallasw`cloud>	 legoktm: note that the https pass is separate from the normal login password
[20:43:14] <valhallasw`cloud>	 legoktm: it's the one in settings » http pass  or something
[20:44:04] <wikibugs>	 10Continuous-Integration, 3Continuous-Integration-Isolation, 10hardware-requests, 6operations: eqiad: (1) allocate server to migrate Zuul server to - https://phabricator.wikimedia.org/T95760#1204592 (10RobH) Gallium is the following:  Single CPU: Intel(R) Xeon(R) CPU           X3450  @ 2.67GHz Dual 500GB S...
[20:49:24] <shinken-wm>	 PROBLEM - App Server Main HTTP Response on deployment-mediawiki01 is CRITICAL: CRITICAL - Socket timeout after 10 seconds  
[20:53:40] <YuviPanda>	 Krinkle: btw, let me know when I can permanently get rid of the archive
[20:53:40] <Krinkle>	 YuviPanda: You can do so now.
[20:53:43] <YuviPanda>	 Krinkle: cool.
[20:53:49] <Krinkle>	 :)
[20:54:07] <shinken-wm>	 RECOVERY - App Server Main HTTP Response on deployment-mediawiki01 is OK: HTTP OK: HTTP/1.1 200 OK - 47862 bytes in 0.543 second response time  
[20:54:56] <YuviPanda>	 Krinkle: all gone now
[20:56:06] <Krinkle>	 legoktm: Yeah, upstream is actively interested in these queue things. They're probably already working on it.
[20:56:06] <Krinkle>	 I think commits to our zuul should be restricted to backporting fixes from upstream.
[20:56:06] <Krinkle>	 it's sufficiently complex that we should not maintain our own patches without upstream review
[20:56:06] <legoktm>	 valhallasw`cloud: yeah, that's the one I used
[20:56:06] <valhallasw`cloud>	 legoktm: hm. odd :(
[20:56:06] <legoktm>	 Krinkle: ok, I'll try again tonight and see if I can get it to like my ssh key
[20:56:13] <wikibugs>	 10Continuous-Integration, 6Labs: Purge graphite data for deleted integration instances and nonexistent metrics - https://phabricator.wikimedia.org/T95569#1204631 (10yuvipanda) 5Open>3Resolved a:3yuvipanda All done now.   FTR, the way to do this is:  # Move the appropriate metrics (found in `/srv/carbon/w...
[20:58:25] <wikibugs>	 10Continuous-Integration, 3Continuous-Integration-Isolation, 10hardware-requests, 6operations: eqiad: (1) allocate server to migrate Zuul server to - https://phabricator.wikimedia.org/T95760#1204644 (10RobH) a:3Cmjohnson I'm thinking about allocating system cobalt for this, but I need to assign this task...
[20:59:00] <wikibugs>	 10Continuous-Integration, 3Continuous-Integration-Isolation, 10hardware-requests, 6operations: eqiad: (1) allocate server to migrate Zuul server to - https://phabricator.wikimedia.org/T95760#1204647 (10RobH) p:5High>3Normal @andrew: You set this to high priority, but it seems to be generally not any hi...
[20:59:06] <chasemp>	 hashar: are you close to https://gerrit.wikimedia.org/r/#/c/201728/ not being wip?
[21:00:55] <wikibugs>	 10Continuous-Integration, 3Continuous-Integration-Isolation, 10hardware-requests, 6operations: eqiad: (1) allocate server to migrate Zuul server to - https://phabricator.wikimedia.org/T95760#1204652 (10RobH) a:5Cmjohnson>3RobH
[21:01:21] <wikibugs>	 10Continuous-Integration, 3Continuous-Integration-Isolation, 10hardware-requests, 6operations: eqiad: (1) allocate server to migrate Zuul server to - https://phabricator.wikimedia.org/T95760#1204654 (10hashar) Gallium has some SSD disk but the process that makes use of it are moving to some other machines....
[21:04:57] <hashar>	 chasemp: need the nodepool debian package
[21:05:29] <chasemp>	 ok I'm going to resign from review there until then, and can you please re-add me? I'm weird and try to keep a clean review queue
[21:05:42] <hashar>	 ohh
[21:05:45] <hashar>	 you can -1 it!
[21:06:07] <hashar>	 and use as a Gerrit homepage the magic: https://gerrit.wikimedia.org/r/#/q/is:open+reviewer:self+label:Code-Review%253D0%252Cuser%253Dself,n,z
[21:06:08] <hashar>	 is:open reviewer:self label:Code-Review=0,user=self
[21:06:21] <hashar>	 that only lists changes for which you are a reviewer and have a CR vote of 0
[21:06:30] <hashar>	 so if you vote -1 or +1 ... the change disappears from that query
[21:15:01] <chasemp>	 I don't want to do that tho as I need to keep up on my own +1's and -1's
[21:15:03] <chasemp>	 :)
[21:26:34] <hashar>	 chasemp: so the puppet change is pending the nodepool debian package
[21:26:41] <hashar>	 which I have uploaded a sec ago :)
[21:26:50] <chasemp>	 k
[21:27:20] <chasemp>	 I'm not the best to review it, let's add godog
[21:27:26] <chasemp>	 I'll ping him about it tomorrow
[21:28:26] <wikibugs>	 10Continuous-Integration, 3Continuous-Integration-Isolation: Puppetize Nodepool configuration - https://phabricator.wikimedia.org/T89143#1204795 (10hashar)
[21:28:27] <wikibugs>	 10Continuous-Integration, 3Continuous-Integration-Isolation, 6operations, 5Patch-For-Review, 7Upstream: Create a Debian package for NodePool - https://phabricator.wikimedia.org/T89142#1028174 (10hashar)
[21:28:52] <wikibugs>	 10Continuous-Integration, 3Continuous-Integration-Isolation: Puppetize Nodepool configuration - https://phabricator.wikimedia.org/T89143#1028190 (10hashar) I have a first Gerrit draft at https://gerrit.wikimedia.org/r/#/c/201728/   It depends on {T89142}.
[21:29:15] <hashar>	 chasemp: not sure how much bandwidth he has for a review
[21:29:27] <hashar>	 I am pairing with him on thursday to get Zuul Debian package approved
[21:29:35] <hashar>	 the nodepool one is really a first draft :(
[21:34:30] <wikibugs>	 10Continuous-Integration, 3Continuous-Integration-Isolation, 6operations: install/setup/deploy cobalt as replacement for gallium - https://phabricator.wikimedia.org/T95959#1204862 (10RobH) 3NEW a:3RobH
[21:34:44] <wikibugs>	 10Continuous-Integration, 3Continuous-Integration-Isolation, 10hardware-requests, 6operations: eqiad: (1) allocate server to migrate Zuul server to - https://phabricator.wikimedia.org/T95760#1199645 (10RobH)
[21:34:46] <wikibugs>	 10Continuous-Integration, 3Continuous-Integration-Isolation, 6operations: install/setup/deploy cobalt as replacement for gallium - https://phabricator.wikimedia.org/T95959#1204877 (10RobH)
[21:35:18] <wikibugs>	 10Continuous-Integration, 3Continuous-Integration-Isolation, 6operations: install/setup/deploy cobalt as replacement for gallium - https://phabricator.wikimedia.org/T95959#1204862 (10RobH)
[21:35:20] <wikibugs>	 10Continuous-Integration: Phase out gallium.wikimedia.org - https://phabricator.wikimedia.org/T95757#1204884 (10RobH)
[21:35:22] <wikibugs>	 10Continuous-Integration, 3Continuous-Integration-Isolation, 10hardware-requests, 6operations: eqiad: (1) allocate server to migrate Zuul server to - https://phabricator.wikimedia.org/T95760#1204879 (10RobH) 5Open>3Resolved Cobalt is allocated for this task.  System setup will proceed on T95959.  Resol...
[21:36:01] <Krinkle>	 hashar: Hm... it seems tests run slower for some reason. -qunit often times out from Apache (30 seconds timeout)
[21:36:01] <Krinkle>	 I don't think it actually is trying for 30 seconds though. That seems like a lot of time for the simple apache on a slave.
[21:36:01] <Krinkle>	 Probably some other issue like network or disk
[21:36:01] <Krinkle>	 ?
[21:36:32] <hashar>	 Krinkle: labs died so it is probably degraded somehow
[21:36:33] <Krinkle>	 Compared to yesterday
[21:36:36] <hashar>	 disk would not surprise me
[21:36:55] <hashar>	 dpkg-source: info: local changes detected, the modified files are:
[21:36:55] <hashar>	 <list of all my files>
[21:37:04] <hashar>	 F*** YOU DEBIAN
[21:37:49] <chasemp>	 hashar: are you using git-buildpackage?
[21:41:15] <hashar>	 chasemp: yes
[21:41:30] <hashar>	 chasemp: turns out I forgot to update the changelog version
[21:41:31] <chasemp>	 I too have suffered that pain
[21:41:34] <chasemp>	 ah
[21:41:38] <hashar>	 so it used some old tarball in my parent directory
[21:41:49] <hashar>	 and complained the local source tree was not matching the parent tarball
[21:41:50] <hashar>	 I hate it
[21:42:12] <hashar>	 I have been working on making it easier to build package via Jenkins
[21:47:35] <wikibugs>	 10Continuous-Integration, 3Continuous-Integration-Isolation, 6operations: install/setup/deploy cobalt as replacement for gallium - https://phabricator.wikimedia.org/T95959#1204957 (10RobH) @chasemp will be chasing down the network requirements.  Cobalt needs to talk to labs hosts, which means it would curren...
[21:48:47] <wikibugs>	 10Continuous-Integration, 3Continuous-Integration-Isolation, 6operations: install/setup/deploy cobalt as replacement for gallium - https://phabricator.wikimedia.org/T95959#1204964 (10chasemp) a:5RobH>3chasemp
[21:57:59] <wikibugs>	 10Continuous-Integration, 3Continuous-Integration-Isolation, 6operations: install/setup/deploy cobalt as replacement for gallium - https://phabricator.wikimedia.org/T95959#1205022 (10RobH) for whoever does the install server update (I didn't do it yet, since we aren't yet certain of the fqdn.)  NIC1 Ethernet...
[21:58:40] <wikibugs>	 10Continuous-Integration, 3Continuous-Integration-Isolation, 6operations: install/setup/deploy cobalt as replacement for gallium - https://phabricator.wikimedia.org/T95959#1205026 (10RobH)
[22:04:41] <hashar>	 Krinkle: we will migrate zuul scheduler out of gallium
[22:04:50] <hashar>	 with the aim of phasing out gallium :D
[22:04:51] <hashar>	 more tomorrow!
[22:06:10] <wikibugs>	 7Blocked-on-RelEng, 10Browser-Tests, 10Continuous-Integration, 6Collaboration-Team, and 2 others: Pass MEDIAWIKI_CAPTCHA_BYPASS_PASSWORD in on Jenkins so GettingStarted browser tests pass - https://phabricator.wikimedia.org/T91220#1205043 (10Mattflaschen) >>! In T91220#1185271, @hashar wrote: > The DNS fai...
[22:07:09] <wikibugs>	 10Continuous-Integration, 3Continuous-Integration-Isolation, 6operations: install/setup/deploy cobalt as replacement for gallium - https://phabricator.wikimedia.org/T95959#1205044 (10hashar) Do we have any 30000 feet network diagrams of our vlan / zones / whatever? That would assist in figuring out how machi...
[22:07:11] <wikibugs>	 10Browser-Tests, 10Continuous-Integration, 7Tracking: Fix or delete browsertests* Jenkins jobs that are failing for more than a week (tracking) - https://phabricator.wikimedia.org/T94150#1205045 (10Mattflaschen)
[22:07:37] <wikibugs>	 10Continuous-Integration, 3Continuous-Integration-Isolation, 6operations: install/setup/deploy cobalt as replacement for gallium - https://phabricator.wikimedia.org/T95959#1205047 (10RobH)
[22:07:37] <hashar>	 nodepool_0.0.1-104-gddd6003_amd64.deb !!
[22:07:41] <wikibugs>	 7Blocked-on-RelEng, 10Browser-Tests, 10Continuous-Integration, 6Collaboration-Team, and 2 others: Pass MEDIAWIKI_CAPTCHA_BYPASS_PASSWORD in on Jenkins so GettingStarted browser tests pass - https://phabricator.wikimedia.org/T91220#1205048 (10Mattflaschen) Forgot to note, I verified it is now green: https:/...
[22:09:58] <Krinkle>	 hashar: Could you maybe revisit https://phabricator.wikimedia.org/T94138 tomorrow?
[22:10:04] <Krinkle>	 What is left to do there?
[22:11:41] <hashar>	 Krinkle: either disable core dumps
[22:11:55] <hashar>	 or figure out why dvips / dvipng segfault
[22:19:28] <wikibugs>	 10Continuous-Integration, 6Phabricator: Create a yellow project for 'nodepool' - https://phabricator.wikimedia.org/T95965#1205120 (10hashar) 3NEW
[22:19:43] <wikibugs>	 3Continuous-Integration-Isolation, 6Phabricator: Create a yellow project for 'nodepool' - https://phabricator.wikimedia.org/T95965#1205128 (10hashar)
[22:22:41] <wikibugs>	 10Continuous-Integration, 3Continuous-Integration-Isolation, 6operations, 5Patch-For-Review, 7Upstream: Create a Debian package for NodePool - https://phabricator.wikimedia.org/T89142#1205139 (10hashar) I have created the Gerrit repository [[ https://gerrit.wikimedia.org/r/#/admin/projects/operations/deb...
[22:24:57] <wikibugs>	 10Continuous-Integration, 3Continuous-Integration-Isolation, 6operations, 5Patch-For-Review, 7Upstream: Create a Debian package for NodePool on Debian Jessie - https://phabricator.wikimedia.org/T89142#1205146 (10hashar)
[22:26:27] <wikibugs>	 10Continuous-Integration, 3Continuous-Integration-Isolation, 6operations, 5Patch-For-Review, 7Upstream: Create a Debian package for NodePool on Debian Jessie - https://phabricator.wikimedia.org/T89142#1028174 (10hashar)
[22:26:51] <hashar>	 chasemp: I have more or less packaged nodepool https://phabricator.wikimedia.org/T89142#1205139  :D
[22:45:33] <wikibugs>	 10Continuous-Integration, 3Continuous-Integration-Isolation, 10Wikimedia-Labs-Infrastructure: Support dedicating a specific virt node to a specific nova project - https://phabricator.wikimedia.org/T84989#1205215 (10hashar) p:5Normal>3Low Lowering priority, at the start I guess we can afford having our sm...
[22:53:35] <wikibugs>	 10Continuous-Integration, 3Continuous-Integration-Isolation, 10Wikimedia-Labs-Infrastructure: Support dedicating a specific virt node to a specific nova project - https://phabricator.wikimedia.org/T84989#1205248 (10Andrew) Just now chase and I have confirmed that the proper mechanism to direct particular VMs...
[23:02:06] <hashar>	 ok enough
[23:02:09] <hashar>	 have a good day!
[23:33:44] <wikibugs>	 10Beta-Cluster, 6Collaboration-Team, 10incident-20150410-flowdataloss, 7database: Use External Store on Beta Cluster - https://phabricator.wikimedia.org/T95871#1205360 (10greg) p:5Triage>3Normal
[23:35:07] <wikibugs>	 10Beta-Cluster, 10Staging, 6Collaboration-Team, 10incident-20150410-flowdataloss, 7database: Use External Store on Beta Cluster - https://phabricator.wikimedia.org/T95871#1202118 (10greg)
[23:36:57] <wikibugs>	 10Continuous-Integration, 5Patch-For-Review: Status of Jouncebot and dropping the yamllint Jenkins job - https://phabricator.wikimedia.org/T95894#1205364 (10greg) p:5Triage>3Normal