[00:05:47] <wikibugs>	 10Continuous-Integration, 3Fundraising Sprint House of Pain, 10Fundraising Tech Backlog, 10Wikimedia-Fundraising-CiviCRM, and 2 others: Write Jenkins job builder definition for CiviCRM CI job - https://phabricator.wikimedia.org/T91895#1192656 (10awight) a:5awight>3None Unassigning: I am finished workin...
[00:15:28] <grrrit-wm>	 (03PS3) 10Legoktm: Convert extensions to use generic phpunit job (D-E) [integration/config] - 10https://gerrit.wikimedia.org/r/202279 
[00:17:16] <bd808>	 hey legoktm, do you work for CI now? ;)
[00:17:34] <YuviPanda>	 he works for everyone :D
[00:18:14] <bd808>	 so much energy! I have taken advantage of that before
[00:18:22] <YuviPanda>	 :D
[00:22:25] <grrrit-wm>	 (03CR) 10Legoktm: [C: 032] Convert extensions to use generic phpunit job (D-E) [integration/config] - 10https://gerrit.wikimedia.org/r/202279 (owner: 10Legoktm)
[00:30:12] <legoktm>	 bd808: it was your idea to deputize me! :P
[00:31:27] <grrrit-wm>	 (03Merged) 10jenkins-bot: Convert extensions to use generic phpunit job (D-E) [integration/config] - 10https://gerrit.wikimedia.org/r/202279 (owner: 10Legoktm)
[00:32:30] <legoktm>	 !log deploying https://gerrit.wikimedia.org/r/202279
[00:32:33] <qa-morebots>	 Logged the message, Master
[00:40:09] <YuviPanda>	 legoktm: are you deputy PM to bd808? :)
[00:40:38] <legoktm>	 noooooooooo
[00:44:24] <YuviPanda>	 YESSSS
[00:44:24] <YuviPanda>	 :P
[02:24:35] <shinken-wm>	 PROBLEM - SSH on deployment-bastion is CRITICAL: CRITICAL - Socket timeout after 10 seconds  
[04:33:19] <grrrit-wm>	 (03PS2) 10Legoktm: Add phplint job for mediawiki/vendor [integration/config] - 10https://gerrit.wikimedia.org/r/202938 
[04:34:12] <grrrit-wm>	 (03CR) 10Legoktm: [C: 032] Add phplint job for mediawiki/vendor [integration/config] - 10https://gerrit.wikimedia.org/r/202938 (owner: 10Legoktm)
[04:35:55] <grrrit-wm>	 (03Merged) 10jenkins-bot: Add phplint job for mediawiki/vendor [integration/config] - 10https://gerrit.wikimedia.org/r/202938 (owner: 10Legoktm)
[04:36:37] <legoktm>	 !log deploying https://gerrit.wikimedia.org/r/202938
[04:36:42] <qa-morebots>	 Logged the message, Master
[05:04:24] <grrrit-wm>	 (03PS1) 10Legoktm: Convert extensions to use generic phpunit job (F-G) [integration/config] - 10https://gerrit.wikimedia.org/r/202992 
[05:11:09] <legoktm>	 !log deleted core dumps from integration-slave1002, /var had filled up
[05:11:12] <qa-morebots>	 Logged the message, Master
[05:40:18] <YuviPanda>	 legoktm: /var? I thought there is no more separate /var
[05:40:56] <legoktm>	 YuviPanda: these slaves are old
[05:41:19] <legoktm>	 YuviPanda: https://phabricator.wikimedia.org/T94916 halp
[05:58:36] <grrrit-wm>	 (03CR) 10Legoktm: [C: 032] Convert extensions to use generic phpunit job (F-G) [integration/config] - 10https://gerrit.wikimedia.org/r/202992 (owner: 10Legoktm)
[06:01:56] <grrrit-wm>	 (03Merged) 10jenkins-bot: Convert extensions to use generic phpunit job (F-G) [integration/config] - 10https://gerrit.wikimedia.org/r/202992 (owner: 10Legoktm)
[06:02:16] <legoktm>	 !log deploying https://gerrit.wikimedia.org/r/202992
[06:02:19] <qa-morebots>	 Logged the message, Master
[06:03:47] <grrrit-wm>	 (03PS1) 10Legoktm: Add GlobalCssJs to shared extension job [integration/config] - 10https://gerrit.wikimedia.org/r/202998 
[06:12:06] <grrrit-wm>	 (03CR) 10Legoktm: [C: 032] Add GlobalCssJs to shared extension job [integration/config] - 10https://gerrit.wikimedia.org/r/202998 (owner: 10Legoktm)
[06:15:15] <grrrit-wm>	 (03Merged) 10jenkins-bot: Add GlobalCssJs to shared extension job [integration/config] - 10https://gerrit.wikimedia.org/r/202998 (owner: 10Legoktm)
[06:15:40] <legoktm>	 !log deploying https://gerrit.wikimedia.org/r/202998
[06:15:42] <qa-morebots>	 Logged the message, Master
[07:29:02] <wikibugs>	 6Release-Engineering: Investigate production and/or beta requirements for Sentry - https://phabricator.wikimedia.org/T89732#1193235 (10Tgr) This largely happened in other tasks, I think. See T93138 (initial hardware request), T84956 (packaging and puppetizing), T86677 (initial security review). Do you see the ne...
[07:34:16] <grrrit-wm>	 (03Abandoned) 10Giuseppe Lavagetto: proxies: allow filtering by datacenter [tools/scap] - 10https://gerrit.wikimedia.org/r/200130 (owner: 10Giuseppe Lavagetto)
[08:38:07] <wikibugs>	 6Release-Engineering, 10Wikimedia-Git-or-Gerrit, 7Documentation: Document how to tag extensions in git - https://phabricator.wikimedia.org/T94412#1193399 (10Mglaser) @mmodell, thanks for your support here! Tagging helps a lot when you want to do good extension versioning. I wonder what @demon thinks. Can thi...
[08:49:55] <hashar>	 !log https://integration.wikimedia.org/ci/job/beta-update-databases-eqiad/  job stalled for some reason
[08:50:01] <qa-morebots>	 Logged the message, Master
[08:50:15] <hashar>	 !log https://integration.wikimedia.org/ci/job/beta-code-update-eqiad/ timed out after 30 minutes while trying to  git pull
[08:50:18] <qa-morebots>	 Logged the message, Master
[08:51:47] <hashar>	 !log deployment-bastion is out of disk space on /var/  :(
[08:51:49] <qa-morebots>	 Logged the message, Master
[08:54:15] <Krinkle>	 hashar: Is there a local PPA I should register to make Package[zuul] work?
[08:54:19] <Krinkle>	 Or does the package not exist in any repo yet?
[08:54:26] <hashar>	 Krinkle: hey!
[08:54:28] <Krinkle>	 Can you document how to install that .deb?
[08:54:30] <hashar>	 sorry about the mess up yesterday
[08:54:33] <Krinkle>	 :-)
[08:54:44] <hashar>	 had to leave early in the middle of the afternoon due to the whole familly being sick :(
[08:54:54] <hashar>	 the .deb is only in  /home/hashar/ for now
[08:55:15] <hashar>	 I will get it added to apt.wikimedia.org for both Trusty and Precise whenever I am happy with the package
[08:55:20] <hashar>	 hopefully today :)
[08:55:31] <Krinkle>	 I don't know how to install that. There's commands for it, but there is different arguments and variations.
[08:55:49] <hashar>	 in theory we could set up a local repo under /data/project/  and inject some custom config in apt.conf
[08:55:57] <Krinkle>	 What commands should I exec exactly?
[08:55:58] <hashar>	 labs might well have support for that already, I havent looked though
[08:56:13] <Krinkle>	 I'm trying to make our patches just a simple bash script
[08:56:17] <Krinkle>	 I did it with the old patches already: https://phabricator.wikimedia.org/P466
[08:56:18] <hashar>	 dpkg -i /home/hashar/zuul_XXXXXXX.deb
[08:56:23] <hashar>	 apt-get install -f
[08:56:35] <wmf-insecte>	 Project beta-code-update-eqiad build #51048: FAILURE in 3 min 34 sec: https://integration.wikimedia.org/ci/job/beta-code-update-eqiad/51048/
[08:56:36] <hashar>	 where XXXX vary between 'precise' and 'trusty'
[08:56:49] <hashar>	 and apt-get install -f  is to install missing dependencies
[08:57:13] <Krinkle>	 hashar: does the apt-get take an argument related to zuul or the deb file or it fetches all known missing dependencies?
[08:57:56] <hashar>	 when you do dpkg -i
[08:58:01] <hashar>	 that tries to install the .deb file passed in parameter
[08:58:12] <hashar>	 that .deb file has a bunch of dependencies themselves which are added to list of packages to be installed
[08:58:19] <hashar>	 but dpkg is not smart enough to install them for you
[08:58:24] <hashar>	 so it just register the dependencies
[08:58:27] <Krinkle>	  Hm.. interesting, the home mount is gone on slave-trusty-1010
[08:58:31] <hashar>	 and bails out because they are not available on the system
[08:58:43] <hashar>	 apt-get is able to fetch the missing packages from some repo
[08:58:56] <hashar>	 so apt-get install will tell you that there are some broken/missing packages
[08:58:57] <Krinkle>	 hashar: Right, so apt-get knows about the state that dpkg-i left behind.
[08:59:01] <hashar>	 and -f  make it install them
[08:59:04] <hashar>	 sorry definitely a big mess :
[08:59:05] <hashar>	 (
[08:59:26] <shinken-wm>	 RECOVERY - SSH on deployment-bastion is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1.4 (protocol 2.0)  
[08:59:42] <Krinkle>	 hashar: Does installing manually resolve the puppet resource for Zuul package?
[08:59:51] <hashar>	 !log rebooted deployment-bastion and cleared some files under /var/
[08:59:51] <Krinkle>	 e.g. how does it interact with the rest of the manifest
[08:59:54] <qa-morebots>	 Logged the message, Master
[09:00:05] <hashar>	 Krinkle: yeah because puppet uses  apt-get install 'whatever package'
[09:00:20] <hashar>	 but since the package is missing from apt.wikimedia.org, it can't find it and bails out
[09:00:34] <hashar>	 that is the error messages you have seen yesterday with Package['zuul']  blatantly failling
[09:00:36] <Krinkle>	 hashar: Does apt-get install learn about 'zuul' via dpkg -i?
[09:00:42] <hashar>	 yup
[09:00:48] <hashar>	 though I don't know all the details
[09:00:56] <Krinkle>	 So puppet will continue after installing manually?
[09:01:08] <hashar>	 seems dpkg -i register in some state file that the 'zuul' package is provided by a file /home/hashar/zuul_XXX.deb
[09:01:19] <hashar>	 yeah puppet will be happy
[09:01:33] <hashar>	 because once installed manually the state file is updated to state that  'zuul' is installed
[09:01:47] <hashar>	 so when puppet verify whether the package is there (running:  apt-cache policy zuul)
[09:01:54] <hashar>	 it will get a positive
[09:02:07] <Krinkle>	 Hm.. any idea why the mount is gone?
[09:02:09] <hashar>	 iirc you can see what puppet is using as underlying command by running with debug
[09:02:15] <hashar>	 puppet agent -tv --debug
[09:02:21] <hashar>	 should dump all the shell commands bein gused
[09:02:43] <hashar>	 Krinkle: which mount? :)
[09:02:48] <Krinkle>	  home/
[09:03:19] <hashar>	 it is supposed to be a NFS mount yeah
[09:03:32] <hashar>	 which instance has the issue?  You might have to remount /home
[09:03:39] <hashar>	 or just reboot :D
[09:04:04] <Krinkle>	 I already reboooted twice
[09:04:09] <Krinkle>	 integration-slave-trusty-1010
[09:04:23] <Krinkle>	 The one I've been working on for a week. It happened again, it takes a week to re-create our instances :-(
[09:05:16] <wmf-insecte>	 Project browsertests-MultimediaViewer-en.wikipedia.beta.wmflabs.org-os_x_10.9-safari-sauce build #564: FAILURE in 55 min: https://integration.wikimedia.org/ci/job/browsertests-MultimediaViewer-en.wikipedia.beta.wmflabs.org-os_x_10.9-safari-sauce/564/
[09:05:36] <shinken-wm>	 PROBLEM - Puppet failure on deployment-mediawiki03 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [0.0]  
[09:05:48] <Krinkle>	 puppet isn't doing it because it fails on zuul
[09:06:00] <hashar>	 ahhh
[09:06:00] <Krinkle>	 did you test the zuul package on a new instance?
[09:06:02] <hashar>	 good puppet
[09:06:28] <hashar>	 yeah I have installed the deb package on all instances we have
[09:06:46] <hashar>	 wanna try the manual install ?
[09:06:46] <Krinkle>	 new instances *after* the puppet patch that broke it
[09:06:51] <Krinkle>	 I can't without a mount
[09:07:06] <hashar>	 well on integration-slave-trusty-1010  I get the /home/ mounted properly
[09:07:46] <Krinkle>	 Could not chdir to home directory /home/krinkle: Permission denied
[09:07:46] <Krinkle>	 -bash: /home/krinkle/.bash_profile: Permission denied
[09:08:11] <hashar>	 ls -ld krinkle/ hashar/
[09:08:11] <hashar>	 drwx------ 21 hashar  svn     4096 Apr  8 13:24 hashar//
[09:08:11] <hashar>	 drwxr--r-- 20 krinkle wikidev 4096 Apr  8 19:33 krinkle//
[09:08:20] <hashar>	 bah
[09:08:27] <Krinkle>	 fixed
[09:08:35] <Krinkle>	 weird
[09:08:42] <hashar>	 have you done something ?
[09:08:56] <Krinkle>	 nope
[09:08:57] <hashar>	 I just queried our groups using :   id krinkle
[09:09:25] <hashar>	 maybe some cache entry were stalled and the commands ended up querying LDAP  for fresh informations
[09:09:46] <hashar>	 your UID/ GID / Groups etc are in LDAP and iirc there is a local cache on the instance
[09:09:51] <hashar>	 might have been corrupted somehow
[09:10:09] <Krinkle>	 hashar: So this technique, is this good enough to do for our new pool of slaves?
[09:10:13] <Krinkle>	 Assuming this is the last error we see
[09:10:30] <Krinkle>	 I can go ahead and do this for all new instances and move on. We'll see about doing it via apt.wikimedia next month
[09:10:45] <Krinkle>	 I mean, you can get it in there now, but it won't apply until next month.
[09:10:48] <hashar>	 yup
[09:10:57] <hashar>	 gotta document it on the manual setup page
[09:11:07] <hashar>	 but I think I remembered we can have a local apt repo for labs project
[09:11:33] <Krinkle>	 Hm.. dpkg -i gives exit code 1
[09:11:35] <Krinkle>	 I assume that's normal?
[09:11:46] <hashar>	 yeah it fails to install the package because of some missing dep probably
[09:11:59] <Krinkle>	 Yeah
[09:11:59] <hashar>	 wanna hang out to share the screen?
[09:12:16] <Krinkle>	 can't do at themoent. multi tasking
[09:12:23] <hashar>	 :)
[09:13:46] <hashar>	 hmm
[09:13:57] <hashar>	 toollabs has local apt repos
[09:14:32] <hashar>	 made possible in puppet using 'labsdebrepo'
[09:14:39] <hashar>	 will look at setting up
[09:14:54] <hashar>	 that might prove to be useful eventually
[09:18:49] <Krinkle>	 hashar: https://phabricator.wikimedia.org/P466 looks good?
[09:18:51] <Krinkle>	 I added the zuul bt
[09:19:55] <hashar>	 Krinkle: should do
[09:19:55] <Krinkle>	 I might add it to integration-jenkins:/bin depending on how long we need to use it
[09:20:04] <Krinkle>	 e.g. bin/patch-slave-trusty.sh
[09:20:06] <hashar>	 hopefully no more than a week
[09:20:15] <Krinkle>	 I mean the entire script
[09:20:21] <hashar>	 we might well have that shell script added to puppet
[09:20:25] <hashar>	 :D
[09:20:31] <Krinkle>	 We'll need it for nodepool
[09:20:37] <Krinkle>	 and probably more hacks
[09:20:44] <hashar>	 nodepool will land in apt.wikimedia.org
[09:20:48] <hashar>	 zuul as well
[09:20:59] <Krinkle>	 I mean, if nodepool will create new slaves, it will need this
[09:21:02] <hashar>	 but I am not 100% happy with the package I came up with
[09:21:06] <hashar>	 ah yeah
[09:21:16] <Krinkle>	 hashar: I see the jessie instance is pooled in Jenkins
[09:21:19] <hashar>	 nodepool executes two scripts, one when creating the instance
[09:21:32] <hashar>	 and another one when booting it up in the pool and before adding the instance to the pool of slaves
[09:21:36] <Krinkle>	 how is it doing?
[09:21:47] <hashar>	 jessie that is to migrate the debian-glue jobs to it
[09:22:10] <hashar>	 alexandros as crafted some very nice build env to let us build deb package against all the distro we have and having  apt.wikimedia.org has a source
[09:22:12] <hashar>	 but
[09:22:18] <shinken-wm>	 RECOVERY - Puppet staleness on deployment-bastion is OK: OK: Less than 1.00% above the threshold [3600.0]  
[09:22:23] <hashar>	 I pooled it with the generic puppet class which installs all the mediawiki packages
[09:22:30] <hashar>	 and a lot of them are not available in jessie or have been renamed
[09:22:40] <hashar>	 I have filled a task about it, faidon already looked at it and commented
[09:22:53] <hashar>	 we need to adjust the mediawiki::  puppet definitions to vary some package names
[09:23:02] <hashar>	 and also figure out whether some packages are actually still needed
[09:23:14] <hashar>	 an example is libmemcached10
[09:23:26] <hashar>	 which we have on ubuntu but is no more on debian cause it provides a later version
[09:23:30] <hashar>	 maybe libmemcached42
[09:23:40] <hashar>	 so have to figure out whether the cluster can run with that newever version
[09:23:59] <Krinkle>	 hashar: See https://tools.wmflabs.org/nagf/?project=integration#h_integration-slave-trusty-1010_cpu
[09:24:02] <Krinkle>	 The memory graph
[09:24:14] <Krinkle>	 For some weird reason, the initial boot has either broken or very high memory usage
[09:24:18] <Krinkle>	 and then after reboot it's normal
[09:24:34] <Krinkle>	 The first 2 hours  were normal
[09:24:38] <hashar>	 the dark green 'cached' memory is linux cache
[09:24:46] <hashar>	 whenever you read files on the system, that ends up in that cache
[09:24:57] <Krinkle>	 I think it's just broken because it's a flat line
[09:25:01] <hashar>	 and when the file is written / deleted, the kernel updates discard the cache entry for you automatically
[09:25:21] <Krinkle>	 First boot is fine, then second reboot it's broken, and then third reboot (after applying slave and no more errors) it is fine again
[09:25:22] <hashar>	 so cached is not necessarly a big issue, specially when an instance is being provisionned since there are loooot of writes / reads being done
[09:25:26] <Krinkle>	 happens every single instance
[09:25:30] <shinken-wm>	 RECOVERY - Puppet failure on deployment-mediawiki03 is OK: OK: Less than 1.00% above the threshold [0.0]  
[09:25:31] <Krinkle>	 been that way for over a yeah
[09:25:34] <Krinkle>	 year*
[09:25:36] <hashar>	 the inactive I guess some process went wild
[09:25:48] <hashar>	 we have a daemon running 'atop'
[09:25:58] <hashar>	 which takes sample of cpu / mem  / io usage every 10 minutes or so
[09:25:59] <Krinkle>	 The actual usage is not that high
[09:26:02] <Krinkle>	 it's wrongly reported
[09:26:13] <hashar>	 that let you browse the history of what is running on the machine
[09:26:17] <Krinkle>	 No way an instance has continuous 12 hours exactly that amount of mem usage
[09:26:19] <hashar>	 weird doc at https://wikitech.wikimedia.org/wiki/Atop
[09:26:21] <Krinkle>	 it's flat
[09:26:35] <hashar>	 ohh
[09:26:54] <hashar>	 maybe the daemon sending metrics to graphite was broken/Stalled ?
[09:27:07] <hashar>	 and the flat graph would be caused by lack of new metrics points
[09:27:16] <Krinkle>	 Yeah, the first boot it work fine, second boot it goes high and flat, third boot it's fine again
[09:27:48] <Krinkle>	 https://phabricator.wikimedia.org/T91351
[09:29:21] <wikibugs>	 10Continuous-Integration, 6Labs, 10Wikimedia-Labs-Infrastructure: Diamond collected metrics about memory usage inaccurate until third reboot - https://phabricator.wikimedia.org/T91351#1193521 (10Krinkle) A better example from the new integration-slave-trusty-1010:  {F110397}  The first boot is fine. Then aft...
[09:29:53] <wikibugs>	 10Continuous-Integration, 6Labs, 10Wikimedia-Labs-Infrastructure: Diamond collected metrics about memory usage inaccurate until third reboot - https://phabricator.wikimedia.org/T91351#1193524 (10Krinkle)
[09:29:55] <wikibugs>	 10Continuous-Integration: Re-create ci slaves (April 2015) - https://phabricator.wikimedia.org/T94916#1193523 (10Krinkle)
[09:30:11] <Krinkle>	 hashar: btw, blockers for re-create tasks I use as a way to track issues we discovered or are bothered by. Not real blockers per se.
[09:31:14] <shinken-wm>	 RECOVERY - Puppet failure on integration-slave-trusty-1010 is OK: OK: Less than 1.00% above the threshold [0.0]  
[09:31:26] <hashar>	 I am looking at the metric on graphite.wmflabs.org
[09:31:55] <shinken-wm>	 PROBLEM - Puppet failure on deployment-bastion is CRITICAL: CRITICAL: 12.50% of data above the critical threshold [0.0]  
[09:35:08] <Krinkle>	 !log Pooled integration-slave-trusty-1010
[09:35:10] <qa-morebots>	 Logged the message, Master
[09:38:51] <wikibugs>	 10Deployment-Systems, 6Release-Engineering, 6Services, 6operations: Streamline our service development and deployment process - https://phabricator.wikimedia.org/T93428#1193558 (10mobrovac)
[09:41:40] <wikibugs>	 10Continuous-Integration, 6Labs, 10Wikimedia-Labs-Infrastructure: Diamond collected metrics about memory usage inaccurate until third reboot - https://phabricator.wikimedia.org/T91351#1193566 (10hashar) Looking at [[ https://wikitech.wikimedia.org/wiki/Atop | atop ]] history, there is nothing suspicious.  I...
[09:43:13] <Krinkle>	 hashar: btw, there is another fatal issue started 10 days ago that comes up very often.
[09:43:16] <Krinkle>	 https://wikitech.wikimedia.org/wiki/Release_Engineering/Argh
[09:43:22] <Krinkle>	 "Jenkins unable to reach Gearman"
[09:43:26] <Krinkle>	 I've aggregated it from SAL
[09:43:55] <Krinkle>	 Jenkins goes into a state where we can't relaunch gearman, it gives 503 error from /ci/configure
[09:44:58] <wikibugs>	 10Continuous-Integration: Setup a local apt repository for 'integration' labs project - https://phabricator.wikimedia.org/T95534#1193567 (10hashar) 3NEW
[09:45:22] <Krinkle>	 Curious if you're able to find out more about it. I gave it my best, but came up empty. Might be more your area :)
[09:45:29] <hashar>	 oh man
[09:45:36] <hashar>	 and I thought it was becoming more stable
[09:45:42] <Krinkle>	 Yeah :(
[09:46:00] <hashar>	 the Zuul deadlock is usually caused by a patch being force merged
[09:46:17] <Krinkle>	 the last 2 days with SWAT have been very frustrating
[09:46:24] <hashar>	 that one https://phabricator.wikimedia.org/T93812
[09:46:28] <Krinkle>	 took 2 hours longer Tuesdau and Wednesday
[09:46:32] <hashar>	 doh
[09:46:33] <Krinkle>	 because of our sucky queue
[09:46:41] <Krinkle>	 We really need to do something about it
[09:46:49] <hashar>	 split the queue again ? :D
[09:46:57] <Krinkle>	 Because I don't know Zuul very well, the only thing I know as a solution is to disable dependent pipeline for the time being
[09:47:26] <Krinkle>	 This can't keep going on like this
[09:47:48] <hashar>	 I really wish I have noticed the work on consolidating all the jobs :/
[09:48:12] <hashar>	 I would most probably have thought about the issue of having all repos sharing the same queue in gate-and-submit
[09:48:19] <Krinkle>	 is there a config flag to disable the queue or to make the queue manually (e.g. declare "mwext" -> mwcore, without the automaatic thing based on job overlap)
[09:48:34] <hashar>	 nop
[09:48:39] <hashar>	 or at least known I know of
[09:49:00] <hashar>	 the queues are generated when Zuul loads its configuration
[09:49:16] <hashar>	 the zuul diff job probably had huges console log when the changes been made
[09:49:32] <Krinkle>	 well, it already had 1300 extensions in the same queeu
[09:49:48] <Krinkle>	 the diff is likek 2 mega bytes whenever we change mwext, so it didn't seem important
[09:49:49] <hashar>	 yup
[09:50:01] <hashar>	 all extensions rely on mw/core
[09:50:27] <Krinkle>	 yeah, that's fine, but the problem is that unrelated projects also get caught. And we have master <> wmf/* also depending 
[09:50:36] <hashar>	 though two extensions changes should probably be ind ifferent queues if there is no mw/core change ahead
[09:51:01] <hashar>	 yeah there is no knowledge about branches :(
[09:51:12] <hashar>	 what upstream assume, is that you have no idea what branches a job is going to use
[09:51:24] <Krinkle>	 I feel like the dependant pipeline is nice in theory, but not ready yet. A beta feature we should not run in prod.
[09:51:27] <hashar>	 you could well have a job ending up always using master
[09:51:38] <hashar>	 well it is fine
[09:51:50] <Krinkle>	 upstream says they're removing it in zuul v3 in favour of explicit queue.
[09:51:52] <hashar>	 until you mess up the convention of having each repos having jobs named differently
[09:52:02] <hashar>	 then that trick zuul in thinking all those repos are tightly coupled together
[09:52:13] <hashar>	 but yeah explicit queue would be better
[09:52:26] <Krinkle>	 yeah, but we had to consolidate because of disk space and workspace scaling
[09:52:45] <Krinkle>	 even now that problem is not solved.
[09:52:52] <Krinkle>	 our labs slaves are much smaller than the prod slaves
[09:53:21] <hashar>	 I posted on some task a way to skip the whole clone entirely
[09:53:28] <hashar>	 using git clone --shared
[09:53:34] <hashar>	 but we talked about it early this week
[09:53:38] <Krinkle>	 but I mean, it being oblivious to branches seems like an obvious issue. I would never implement dependent pipeline without branches. Is it worth the trouble right now?
[09:53:45] <hashar>	 not sure why I thought about --shared instead of hardlinks though
[09:55:41] <hashar>	 !log restarted Zuul to clear out some stalled jobs
[09:55:44] <qa-morebots>	 Logged the message, Master
[09:55:57] <wmf-insecte>	 Yippee, build fixed!
[09:55:57] <wmf-insecte>	 Project browsertests-MultimediaViewer-en.wikipedia.beta.wmflabs.org-os_x_10.9-safari-sauce build #565: FIXED in 45 min: https://integration.wikimedia.org/ci/job/browsertests-MultimediaViewer-en.wikipedia.beta.wmflabs.org-os_x_10.9-safari-sauce/565/
[09:55:59] <Krinkle>	 yeah, and the force merge is also a regression
[09:56:08] <Krinkle>	 how come that wasn't a problem before.
[09:56:36] <Krinkle>	 would it be worth it for us to spend time patching that instead of waiting for upstream?
[09:56:43] <hashar>	 that always has been afaik
[09:56:45] <Krinkle>	 Might be more important than other projects we're doing at the moment.
[09:57:07] <hashar>	 at least the code causing the deadlock has been in zuul for aqges
[09:57:20] <hashar>	 so yeah 
[09:57:26] <hashar>	 +2 on patching ourselve
[09:57:31] <Krinkle>	 yeah, but it almost never caused a problem. Now it's causing problems everyday requiring manual intervention to fix.
[09:57:33] <hashar>	 and we have a bunch of patches pending for zuul-cloner
[09:57:37] <hashar>	 such as clean / submodule update
[09:58:01] <Krinkle>	 The main thing that bothers me is manual intervention. We can't operate CI in a scenario where it is normal to require manual intervention just to keep it running. 
[09:58:31] <Krinkle>	 unproductive
[09:58:36] <hashar>	 so the deadlock above need to be fixed
[09:58:44] <Krinkle>	 and also not scaling, because we are not online 24/7
[09:58:47] <Krinkle>	 Yeah
[09:59:09] <hashar>	 brb
[10:00:24] <Krinkle>	 me too ,moving desks
[10:01:41] <hashar>	 818     mediawiki-extensions-hhvm@4
[10:01:41] <hashar>	 887     mediawiki-extensions-hhvm@3
[10:01:41] <hashar>	 946     mediawiki-extensions-hhvm@2
[10:01:41] <hashar>	 955     mediawiki-extensions-hhvm
[10:01:41] <hashar>	 992     mediawiki-core-doxygen-publish
[10:01:41] <hashar>	 1467    mediawiki-core-npm@2
[10:01:43] <hashar>	 1550    mediawiki-core-npm
[10:01:45] <hashar>	 1807    browsertests-VisualEditor-language-screenshot-os_x_10.10-firefox
[10:01:50] <hashar>	 in MB
[10:01:54] <hashar>	 that is a lot :D
[10:17:20] <grrrit-wm>	 (03CR) 10Hashar: [C: 04-2] "I want to keep the DependentPipeline for a wild range of reasons I mentioned on T94322." [integration/config] - 10https://gerrit.wikimedia.org/r/202958 (https://phabricator.wikimedia.org/T94322) (owner: 10Legoktm)
[10:26:25] <grrrit-wm>	 (03Abandoned) 10Hashar: (WIP) Experiment zuul-cloner with extensions [integration/jenkins-job-builder-config] - 10https://gerrit.wikimedia.org/r/141846 (owner: 10Hashar)
[10:31:54] <hashar>	 !log deployment-bastion has a lock file remaining /mnt/srv/mediawiki-staging/php-master/extensions/.git/refs/remotes/origin/master.lock
[10:31:57] <qa-morebots>	 Logged the message, Master
[10:42:06] <zeljkof>	 hashar: Christoph can not create Jenkins jobs
[10:42:32] <zeljkof>	 Access Denied WMDE-Fisch is missing the Job/Create permission
[10:42:35] <zeljkof>	 https://integration.wikimedia.org/ci/newJob says
[10:43:56] <zeljkof>	 hashar: found him at https://integration.wikimedia.org/ci/user/wmde-fisch/
[10:43:57] <wmf-insecte>	 Yippee, build fixed!
[10:43:57] <wmf-insecte>	 Project beta-code-update-eqiad build #51059: FIXED in 56 sec: https://integration.wikimedia.org/ci/job/beta-code-update-eqiad/51059/
[10:44:32] <hashar>	 zeljkof: please have him fill a task / bug
[10:44:41] <zeljkof>	 hashar: sure, sending him mail right now
[10:44:53] <hashar>	 and post me the task #id will reply on it :
[10:44:54] <hashar>	 )
[10:46:09] <hashar>	 !log repacked extensions in deployment-bastion staging area: <tt>find /mnt/srv/mediawiki-staging/php-master/extensions -maxdepth 2  -type f -name .git  -exec bash  -c 'cd `dirname {}` && pwd && git repack -Ad && git gc' \;</tt>
[10:46:11] <qa-morebots>	 Logged the message, Master
[10:48:34] <wmf-insecte>	 Project beta-scap-eqiad build #48303: FAILURE in 4 min 36 sec: https://integration.wikimedia.org/ci/job/beta-scap-eqiad/48303/
[10:57:54] <wikibugs>	 10Continuous-Integration: Create CI slaves using Debian Jessie (tracking) - https://phabricator.wikimedia.org/T94836#1193694 (10hashar)
[10:57:55] <wikibugs>	 10Continuous-Integration, 5Patch-For-Review: Jessie has no install candidate for openjdk-6-jdk - https://phabricator.wikimedia.org/T94999#1193692 (10hashar) 5Open>3Resolved The contint puppet manifest no more attempts to install openjdk-6 on Jessie hosts.  Version 7 works just fine.
[11:07:21] <shinken-wm>	 PROBLEM - Content Translation Server on deployment-cxserver03 is CRITICAL: Connection refused  
[11:17:19] <shinken-wm>	 RECOVERY - Content Translation Server on deployment-cxserver03 is OK: HTTP OK: HTTP/1.1 200 OK - 1103 bytes in 0.018 second response time  
[11:32:52] <wikibugs>	 10Continuous-Integration, 5Patch-For-Review: Re-evaluate use of "Dependent Pipeline" in Zuul for gate-and-submit in the short term - https://phabricator.wikimedia.org/T94322#1193732 (10Krinkle) > Commit A that removes the deprecated function wfExample() is now breaking all extensions that still rely on it....
[11:36:17] <wikibugs>	 10Browser-Tests, 6Release-Engineering: Do not say "< wmf-insecte> Yippee, build fixed!" - https://phabricator.wikimedia.org/T95395#1188546 (10zeljkofilipin) @hashar might know.
[11:44:13] <wikibugs>	 10Beta-Cluster: deployment-prep (Beta)'s operation/puppet is outdated - https://phabricator.wikimedia.org/T95539#1193738 (10KartikMistry) 3NEW
[11:53:20] <shinken-wm>	 PROBLEM - Content Translation Server on deployment-cxserver03 is CRITICAL: Connection refused  
[11:54:02] <kart_>	 hashar: T95539 please.
[11:55:28] <wikibugs>	 10Browser-Tests, 6Release-Engineering: Do not say "< wmf-insecte> Yippee, build fixed!" - https://phabricator.wikimedia.org/T95395#1193763 (10hashar) `wmf-insecte` is the Jenkins IRC client provided by [[ https://wiki.jenkins-ci.org/display/JENKINS/Instant+Messaging+Plugin | Instant Messaging Plugin ]].  There...
[11:58:20] <shinken-wm>	 RECOVERY - Content Translation Server on deployment-cxserver03 is OK: HTTP OK: HTTP/1.1 200 OK - 1103 bytes in 0.023 second response time  
[12:29:08] <wikibugs>	 10Browser-Tests, 6Release-Engineering: Do not say "< wmf-insecte> Yippee, build fixed!" - https://phabricator.wikimedia.org/T95395#1193806 (10zeljkofilipin) As far as I know, @manybubbles speaks Java. :)
[12:30:59] <hashar>	 !log beta: reset hard of operations/puppet repo on the puppetmaster since it has been stalled for 9+days https://phabricator.wikimedia.org/T95539
[12:31:04] <qa-morebots>	 Logged the message, Master
[12:32:34] <wikibugs>	 10Beta-Cluster: deployment-prep (Beta)'s operation/puppet is outdated - https://phabricator.wikimedia.org/T95539#1193816 (10hashar) Beta puppetmaster is deployment-salt.eqiad.wmflabs the git repo under /var/lib/git/operations/puppet is magically auto rebased via a cronjob.  The working copy is detached and has t...
[12:32:55] <wikibugs>	 10Beta-Cluster: deployment-prep (Beta)'s operation/puppet is outdated - https://phabricator.wikimedia.org/T95539#1193818 (10hashar) 5Open>3Resolved p:5Triage>3Normal a:3hashar
[12:33:10] <hashar>	 kart_: solved :D
[12:33:18] <hashar>	 kart_: the local repo had some patch cherry picked on it
[12:33:30] <hashar>	 kart_: and the magic script did not magic to auto update the repo
[12:39:43] <hashar>	 !log https://integration.wikimedia.org/ci/job/beta-scap-eqiad/ is still broken :-(
[12:39:47] <qa-morebots>	 Logged the message, Master
[12:40:01] <hashar>	 !log spurts out <tt>Permission denied (publickey).</tt>
[12:40:03] <qa-morebots>	 Logged the message, Master
[12:40:46] <kart_>	 hashar: thanks
[12:49:59] <wikibugs>	 10Beta-Cluster, 6Labs, 6operations: GPG error: http://nova.clouds.archive.ubuntu.com precise Release BADSIG 40976EAF437D05B5 - https://phabricator.wikimedia.org/T95541#1193865 (10hashar) 3NEW
[12:50:02] <wikibugs>	 10Browser-Tests: Transfer the main Sauce Labs account to a generic WMF account - https://phabricator.wikimedia.org/T94191#1193872 (10zeljkofilipin) > zfilipin  > Wikimedia >  > Hi Renata, >  > I am waiting for our IT to create a new e-mail address. I will let you know as soon as I hear back from them. >  > Željk...
[12:50:45] <wikibugs>	 10Browser-Tests: Transfer the main Sauce Labs account to a generic WMF account - https://phabricator.wikimedia.org/T94191#1193874 (10zeljkofilipin) > Renata Santillan  > Sauce Labs >  > Hi Zeljko, >  > No problem! We're ready to help when you have more information. >  > Best, >  > Renata >  > April 8, 2015, 10:1...
[12:50:57] <werdna>	 Hey, how can I correctly run PHPUnit on vagrant? It’s a bit complicated because the extension is only being used on one of the wikis on my vagrant instance
[12:51:17] <wikibugs>	 10Browser-Tests: Transfer the main Sauce Labs account to a generic WMF account - https://phabricator.wikimedia.org/T94191#1193878 (10zeljkofilipin) > zfilipin  > Wikimedia >  > Hi Renata, >  > I have created a new account with username wikimedia. >  > Željko >  > April 9, 2015, 2:49 PM
[12:51:23] <wikibugs>	 10Beta-Cluster: deployment-prep (Beta)'s operation/puppet is outdated - https://phabricator.wikimedia.org/T95539#1193879 (10KartikMistry) Thanks @hashar
[12:51:25] <werdna>	 vagrant@mediawiki-vagrant:/vagrant/mediawiki$ php tests/phpunit/phpunit.php --wiki=livingstyleguidewiki /vagrant/mediawiki/extensions/OOUIPlayground/tests/phpunit/
[12:51:25] <werdna>	 Fatal error: Class undefined: OOUIPlayground\WidgetRepository in /vagrant/mediawiki/extensions/OOUIPlayground/tests/phpunit/CodeRendererTest.php on line 21
[12:51:35] <werdna>	 (because it’s not using the correct wiki)
[12:52:46] <manybubbles>	 getting a funky error: https://integration.wikimedia.org/ci/job/wikidata-query-rdf/100/console
[12:58:54] <shinken-wm>	 PROBLEM - Puppet failure on deployment-salt is CRITICAL: CRITICAL: 62.50% of data above the critical threshold [0.0]  
[12:59:42] <Krinkle>	 !log Creating integration-slave-trusty-1011 - integration-slave-trusty-1016
[12:59:44] <qa-morebots>	 Logged the message, Master
[13:00:11] <shinken-wm>	 RECOVERY - Puppet failure on deployment-sca01 is OK: OK: Less than 1.00% above the threshold [0.0]  
[13:01:27] <hashar>	 !log integration-zuul-packaged  applied zuul::merger and zuul::server
[13:01:31] <qa-morebots>	 Logged the message, Master
[13:04:45] <shinken-wm>	 PROBLEM - Puppet failure on integration-zuul-packaged is CRITICAL: CRITICAL: 25.00% of data above the critical threshold [0.0]  
[13:14:30] <hashar>	 !log integration-zuul-packaged applied role::labs::lvm::srv
[13:14:32] <qa-morebots>	 Logged the message, Master
[13:16:43] <grrrit-wm>	 (03CR) 10JanZerebecki: [C: 031] Merge mwext-Wikibase-* repo and repo-api jobs [integration/config] - 10https://gerrit.wikimedia.org/r/202932 (owner: 10Legoktm)
[13:23:23] <wikibugs>	 10Browser-Tests: IE Browser tests job have no test being run due to a mistake in cucumber tag - https://phabricator.wikimedia.org/T95398#1193965 (10zeljkofilipin) Looks like there are 3 IE jobs without explicit browser version:  # https://integration.wikimedia.org/ci/view/BrowserTests/job/browsertests-Flow-en.wi...
[13:24:51] <wikibugs>	 10Browser-Tests: IE Browser tests job have no test being run due to a mistake in cucumber tag - https://phabricator.wikimedia.org/T95398#1193966 (10zeljkofilipin) The same problem in all 3 jobs:   ``` 00:00:15.265 (...) bundle exec cucumber (...) --tags @internet_explorer_ (...) 00:00:18.455 0 scenarios 00:00:18...
[13:27:08] <shinken-wm>	 PROBLEM - Puppet failure on integration-dev is CRITICAL: CRITICAL: 12.50% of data above the critical threshold [0.0]  
[13:29:43] <shinken-wm>	 RECOVERY - Puppet failure on integration-zuul-packaged is OK: OK: Less than 1.00% above the threshold [0.0]  
[13:34:05] <shinken-wm>	 RECOVERY - Long lived cherry-picks on puppetmaster on deployment-salt is OK: OK: Less than 100.00% above the threshold [0.0]  
[13:37:17] <wikibugs>	 10Browser-Tests: IE Browser tests job have no test being run due to a mistake in cucumber tag - https://phabricator.wikimedia.org/T95398#1193984 (10zeljkofilipin) Given a simple Selenium script:   ``` lang=ruby require "selenium-webdriver"  saucelabs_username = "username" saucelabs_key = "key"  name = "internet_...
[13:40:22] <grrrit-wm>	 (03CR) 10Hashar: "Create a mediawiki/tools/phpmd repo ? :)" [tools/codesniffer] - 10https://gerrit.wikimedia.org/r/201956 (owner: 10MarkAHershberger)
[13:41:50] <wikibugs>	 10Browser-Tests: IE Browser tests job have no test being run due to a mistake in cucumber tag - https://phabricator.wikimedia.org/T95398#1194007 (10zeljkofilipin) Since there is no easy way to determine IE version if it is not set, I think the best way would be to insist that the version is always set explicitly...
[13:46:00] <wikibugs>	 10Browser-Tests: IE Browser tests job have no test being run due to a mistake in cucumber tag - https://phabricator.wikimedia.org/T95398#1194033 (10hashar) If you come to require IE to have version explicitly set, you probably want to update  jjb/macro-browsertests.yaml and have it exit early whenever the versio...
[13:48:23] <shinken-wm>	 PROBLEM - Puppet failure on integration-slave-trusty-1012 is CRITICAL: CRITICAL: 14.29% of data above the critical threshold [0.0]  
[13:48:35] <grrrit-wm>	 (03PS1) 10Zfilipin: Fix failed Internet Explorer browser test jobs [integration/config] - 10https://gerrit.wikimedia.org/r/203063 (https://phabricator.wikimedia.org/T95398) 
[13:49:42] <wikibugs>	 10Continuous-Integration: Create CI slaves using Debian Jessie (tracking) - https://phabricator.wikimedia.org/T94836#1194042 (10hashar)
[13:49:43] <wikibugs>	 10Continuous-Integration, 5Patch-For-Review: Update puppet for packages having different names in Jessie - https://phabricator.wikimedia.org/T95000#1194039 (10hashar) 5Open>3Resolved a:3hashar Solved! That also made firefox to be magically upgraded just like chromium.  Labs instance integration-slave-jes...
[13:53:29] <shinken-wm>	 RECOVERY - Puppet failure on integration-slave-trusty-1012 is OK: OK: Less than 1.00% above the threshold [0.0]  
[13:55:51] <grrrit-wm>	 (03PS2) 10Zfilipin: Fix failed Internet Explorer browser test jobs [integration/config] - 10https://gerrit.wikimedia.org/r/203063 (https://phabricator.wikimedia.org/T95398) 
[13:56:22] <grrrit-wm>	 (03CR) 10Zfilipin: "Patch set 2 adds created and deleted jobs to the commit message." [integration/config] - 10https://gerrit.wikimedia.org/r/203063 (https://phabricator.wikimedia.org/T95398) (owner: 10Zfilipin)
[14:02:05] <wikibugs>	 10Browser-Tests, 5Patch-For-Review: IE Browser tests job have no test being run due to a mistake in cucumber tag - https://phabricator.wikimedia.org/T95398#1194108 (10zeljkofilipin) The new jobs are running fine:  - https://integration.wikimedia.org/ci/view/BrowserTests/view/Echo+Flow/job/browsertests-Flow-en....
[14:05:09] <grrrit-wm>	 (03PS3) 10Zfilipin: Fix failed Internet Explorer browser test jobs [integration/config] - 10https://gerrit.wikimedia.org/r/203063 (https://phabricator.wikimedia.org/T95398) 
[14:05:48] <grrrit-wm>	 (03CR) 10Zfilipin: "Patch set 3 implements the suggestion from https://phabricator.wikimedia.org/T95398#1194033" [integration/config] - 10https://gerrit.wikimedia.org/r/203063 (https://phabricator.wikimedia.org/T95398) (owner: 10Zfilipin)
[14:09:17] <wmf-insecte>	 Project browsertests-UploadWizard-commons.wikimedia.beta.wmflabs.org-linux-firefox-sauce build #583: FAILURE in 38 min: https://integration.wikimedia.org/ci/job/browsertests-UploadWizard-commons.wikimedia.beta.wmflabs.org-linux-firefox-sauce/583/
[14:20:36] <wikibugs>	 10Continuous-Integration: Migrate all debian-glue jobs to Jessie slaves - https://phabricator.wikimedia.org/T95545#1194160 (10hashar) 3NEW
[14:21:10] <wikibugs>	 10Continuous-Integration, 6operations: Build Debian package jenkins-debian-glue for Jessie - https://phabricator.wikimedia.org/T95006#1194170 (10hashar)
[14:21:34] <wikibugs>	 10Continuous-Integration: Migrate all debian-glue jobs to Jessie slaves - https://phabricator.wikimedia.org/T95545#1194160 (10hashar)
[14:21:35] <wikibugs>	 10Continuous-Integration: Create CI slaves using Debian Jessie (tracking) - https://phabricator.wikimedia.org/T94836#1194172 (10hashar)
[14:24:43] <hashar>	 !log deleting integration-slave-jessie-1001 extended disk is too smal
[14:24:46] <hashar>	 !log deleting integration-slave-jessie-1001 extended disk is too small
[14:24:46] <qa-morebots>	 Logged the message, Master
[14:24:48] <qa-morebots>	 Logged the message, Master
[14:25:12] <wmf-insecte>	 Project browsertests-Flow-en.wikipedia.beta.wmflabs.org-windows_8-internet_explorer-10-sauce build #1: FAILURE in 25 min: https://integration.wikimedia.org/ci/job/browsertests-Flow-en.wikipedia.beta.wmflabs.org-windows_8-internet_explorer-10-sauce/1/
[14:27:10] <shinken-wm>	 PROBLEM - Host integration-slave-jessie-1001 is DOWN: CRITICAL - Host Unreachable (10.68.16.72)  
[14:31:55] <wikibugs>	 10Continuous-Integration: Replace project-specific "{name}-thing" jobs with generic "thing" ones - https://phabricator.wikimedia.org/T91997#1194202 (10Krinkle)
[14:33:28] <wikibugs>	 10Continuous-Integration: Replace project-specific "{name}-thing" jobs with generic "thing" ones - https://phabricator.wikimedia.org/T91997#1101137 (10Krinkle)
[14:34:13] <wmf-insecte>	 Project browsertests-MultimediaViewer-en.wikipedia.beta.wmflabs.org-windows_8-internet_explorer-10-sauce build #1: SUCCESS in 34 min: https://integration.wikimedia.org/ci/job/browsertests-MultimediaViewer-en.wikipedia.beta.wmflabs.org-windows_8-internet_explorer-10-sauce/1/
[14:35:11] <wmf-insecte>	 Yippee, build fixed!
[14:35:11] <wmf-insecte>	 Project browsertests-Echo-en.wikipedia.beta.wmflabs.org-linux-firefox-sauce build #467: FIXED in 9 min 3 sec: https://integration.wikimedia.org/ci/job/browsertests-Echo-en.wikipedia.beta.wmflabs.org-linux-firefox-sauce/467/
[14:37:45] <grrrit-wm>	 (03CR) 10BryanDavis: "Seems to be done now via some other patch: <https://doc.wikimedia.org/cdb/master/>" [integration/config] - 10https://gerrit.wikimedia.org/r/174417 (https://bugzilla.wikimedia.org/73530) (owner: 10Hashar)
[14:37:56] <shinken-wm>	 PROBLEM - Puppet failure on integration-slave-trusty-1016 is CRITICAL: CRITICAL: 57.14% of data above the critical threshold [0.0]  
[14:47:18] <shinken-wm>	 RECOVERY - Host integration-slave-jessie-1001 is UP: PING OK - Packet loss = 0%, RTA = 0.63 ms  
[14:49:02] <wikibugs>	 10Continuous-Integration: job creation permission on jenkins for WMDE-Fisch - https://phabricator.wikimedia.org/T95546#1194315 (10hashar)
[14:49:13] <wikibugs>	 10Continuous-Integration: job creation permission on jenkins for WMDE-Fisch - https://phabricator.wikimedia.org/T95546#1194317 (10hashar) p:5Triage>3Normal
[14:51:46] <shinken-wm>	 PROBLEM - Puppet failure on integration-slave-trusty-1011 is CRITICAL: CRITICAL: 62.50% of data above the critical threshold [0.0]  
[14:52:58] <wmf-insecte>	 Project browsertests-VisualEditor-en.wikipedia.beta.wmflabs.org-windows_8-internet_explorer-10-sauce build #1: FAILURE in 51 min: https://integration.wikimedia.org/ci/job/browsertests-VisualEditor-en.wikipedia.beta.wmflabs.org-windows_8-internet_explorer-10-sauce/1/
[14:55:22] <shinken-wm>	 PROBLEM - Puppet failure on integration-slave-trusty-1012 is CRITICAL: CRITICAL: 85.71% of data above the critical threshold [0.0]  
[15:03:13] <shinken-wm>	 PROBLEM - Host integration-slave-jessie-1001 is DOWN: CRITICAL - Host Unreachable (10.68.16.72)  
[15:05:33] <wikibugs>	 10Browser-Tests, 6Release-Engineering: Do not say "< wmf-insecte> Yippee, build fixed!" - https://phabricator.wikimedia.org/T95395#1194390 (10greg) p:5Low>3Lowest
[15:08:39] <shinken-wm>	 PROBLEM - Host integration-slave-trusty-1010 is DOWN: CRITICAL - Host Unreachable (10.68.17.210)  
[15:12:16] <wikibugs>	 10Beta-Cluster: deployment-prep (Beta)'s operation/puppet is outdated - https://phabricator.wikimedia.org/T95539#1194453 (10greg) >>! In T95539#1193816, @hashar wrote: > Beta puppetmaster is deployment-salt.eqiad.wmflabs the git repo under /var/lib/git/operations/puppet is magically auto rebased via a cronjob. >...
[15:13:21] <greg-g>	 werdna: file a bug, you asked at a mostly non-active time for the team :)
[15:13:38] <greg-g>	 manybubbles: the endpoint error?
[15:14:01] <manybubbles>	 greg-g: yeah - it doesn't seem to be causing trouble but its an error anyway
[15:14:26] <greg-g>	 step 1) file a bug :)
[15:14:32] <greg-g>	 I can :P
[15:15:06] <greg-g>	 step 0) search for string in phab and notice it's already reported: https://phabricator.wikimedia.org/T93321
[15:15:09] <greg-g>	 manybubbles: ^
[15:15:41] <manybubbles>	 greg-g: sorry, I was being lazy.
[15:15:45] <manybubbles>	 thanks for the search
[15:15:49] <greg-g>	 :)
[15:20:06] <hashar>	 greg-g: good morning :)
[15:20:41] <hashar>	 the beta cluster job that runs scap is borked with ssh auth failure and I cant figure it out :(
[15:22:15] <hashar>	 !sal
[15:22:16] <wm-bot>	 https://wikitech.wikimedia.org/wiki/Release_Engineering/SAL
[15:23:30] <thcipriani>	 hashar: is the ssh key in keyholder?
[15:24:26] <greg-g>	 hashar: is there a bug for... oh there he is
[15:24:37] <greg-g>	 g'morning thcipriani :)
[15:24:47] * greg-g passes torch to you
[15:24:55] * greg-g goes into meetings for next 1.5 hours
[15:25:16] <greg-g>	 btw: https://integration.wikimedia.org/ci/job/beta-scap-eqiad/48303/console
[15:26:06] <thcipriani>	 yup, so looking at: SSH_AUTH_SOCK=/run/keyholder/agent.sock ssh-add -l
[15:26:13] <thcipriani>	 there are no identities
[15:26:17] <wikibugs>	 10Beta-Cluster: beta-scap-eqiad no more run due to ssh Permission denied - https://phabricator.wikimedia.org/T95562#1194513 (10hashar) 3NEW
[15:26:17] <thcipriani>	 so we just have to add one
[15:26:30] <hashar>	 thcipriani: I filled a task above :)
[15:26:39] <thcipriani>	 kk
[15:26:48] <hashar>	 in short we had 9 days of puppet patches pending because the repo was stalled on the puppetmaster
[15:26:51] <hashar>	 might be the reason
[15:29:07] <wikibugs>	 10Browser-Tests, 6Release-Engineering: Browser tests running against beta all failing because of mw-api-siteinfo.py - https://phabricator.wikimedia.org/T95163#1194528 (10greg) Sorry about the radio silence here.    >>! In T95163#1182020, @Gilles wrote: > If someone who's an admin on labs for the "Integration"...
[15:30:37] <wikibugs>	 10Beta-Cluster: beta-scap-eqiad no more run due to ssh Permission denied - https://phabricator.wikimedia.org/T95562#1194539 (10greg) p:5Triage>3Unbreak!
[15:32:46] <thcipriani>	 !log added mwdeploy_rsa to keyholder agent.sock via chmod 400 /etc/keyholder.d/mwdeploy_rsa && SSH_AUTH_SOCK=/run/keyholder/agent.sock ssh-add /etc/keyholder.d/mwdeploy_rsa && chmod 440 /etc/keyholder.d/mwdeploy_rsa; permissions in puppet may be wrong?
[15:32:48] <qa-morebots>	 Logged the message, Master
[15:33:21] <thcipriani>	 hashar: the next build of that job _should_ work, but we'll see.
[15:37:12] <shinken-wm>	 RECOVERY - Host integration-slave-jessie-1001 is UP: PING OK - Packet loss = 0%, RTA = 1.27 ms  
[15:38:11] <wikibugs>	 10Beta-Cluster: /var/lib/l10nupdate fills up deployment-bastion /var partition - https://phabricator.wikimedia.org/T95564#1194574 (10hashar) 3NEW
[15:38:14] <hashar>	 thcipriani: you are a magician :)
[15:38:18] <shinken-wm>	 PROBLEM - Puppet failure on deployment-videoscaler01 is CRITICAL: CRITICAL: 71.43% of data above the critical threshold [0.0]  
[15:38:22] <hashar>	 also found out l10nupdate is most probably broken
[15:38:25] <hashar>	 it writes to the wrong place
[15:39:08] <hashar>	 I am trying to find the configuration 
[15:39:43] * hashar whistles
[15:40:11] <wmf-insecte>	 Yippee, build fixed!
[15:40:11] <wmf-insecte>	 Project beta-scap-eqiad build #48334: FIXED in 6 min 16 sec: https://integration.wikimedia.org/ci/job/beta-scap-eqiad/48334/
[15:42:06] <thcipriani>	 yay!
[15:43:09] <greg-g>	 yippee!
[15:45:12] <thcipriani>	 hashar: looks like l10n should be output to /srv/mediawiki-staging/php-[version]/cache/l10n is that not right? l10nupdate is somewhat opaque to me yet :\
[15:46:03] <hashar>	 !sal
[15:46:03] <wm-bot>	 https://wikitech.wikimedia.org/wiki/Release_Engineering/SAL
[15:46:11] <hashar>	 thcipriani: it is opaque to me as well :)
[15:48:02] <wikibugs>	 10Beta-Cluster: /var/lib/l10nupdate fills up deployment-bastion /var partition - https://phabricator.wikimedia.org/T95564#1194623 (10hashar) l10nupdate code is in puppet `modules/scap/files/l10nupdate-1` and it has `GITDIR=/var/lib/l10nupdate/mediawiki`   /var/lib/l10nupdate/ has been created on 2015-03-25 02:00...
[15:51:13] <hashar>	 thcipriani: congrats on fixing scap!
[15:53:20] <thcipriani>	 hashar: thanks, that key will need to be primed on reboot. Probably not super desirable :\
[15:58:56] <hashar>	 thcipriani: what does "primed" mean?
[15:59:00] <hashar>	 non native english here :D
[15:59:21] <chasemp>	 hashar: primed means prepared / ready to go
[15:59:22] <chasemp>	 usually
[16:00:05] <hashar>	 !log integration-slave-jessie-1001  recreated. Applying it role::ci::slave::labs which should also bring in the package builder role under /mnt/pbuilder
[16:00:08] <qa-morebots>	 Logged the message, Master
[16:00:33] <thcipriani>	 hashar: "primed" is not really a good term it's just the one that I've heard bd808 use :) it just means running the command I put into SAL, you'll also need the mwdeploy_rsa pass which is in labs/private
[16:00:38] <hashar>	 :)
[16:00:59] <hashar>	 ah you know about labs/private already
[16:01:00] <hashar>	 all good so
[16:01:07] <hashar>	 was going to suggest putting the keys there
[16:01:30] <hashar>	 more puppet madness. I am giving up for today
[16:01:35] <hashar>	 thanks again for the scap fix thcipriani !
[16:01:45] <hashar>	 will be back tomorrow
[16:01:51] <thcipriani>	 hashar: yw, have a good evening!
[16:01:59] <bd808>	 hashar: which l10nupdate is broken? prod or some testing thing in beta cluster?
[16:02:48] <thcipriani>	 bd808: I think this is the ticket: https://phabricator.wikimedia.org/T95564
[16:04:14] <bd808>	 oh. I thought that we made /var/lib/l10nupdate a symlink to /srv/l10nupdate
[16:04:27] <bd808>	 I wonder if puppet is undoing that for us
[16:08:33] <wikibugs>	 10Continuous-Integration, 6Labs: Purge graphite data for deleted integration instances. - https://phabricator.wikimedia.org/T95569#1194719 (10Krinkle) 3NEW
[16:18:52] <shinken-wm>	 RECOVERY - Free space - all mounts on deployment-bastion is OK: OK: All targets OK  
[16:22:24] <legoktm>	 werdna: use mwscript --wiki=blah phpunit...
[16:23:23] <wikibugs>	 10Beta-Cluster, 6Labs, 6operations: GPG error: http://nova.clouds.archive.ubuntu.com precise Release BADSIG 40976EAF437D05B5 - https://phabricator.wikimedia.org/T95541#1194794 (10Dzahn) http://ubuntuforums.org/showthread.php?t=802156  tldr: bad proxies   sudo aptitude -o Acquire::http::No-Cache=True -o Acqui...
[16:25:38] <wikibugs>	 10Continuous-Integration, 6Labs: Purge graphite data for deleted integration instances. - https://phabricator.wikimedia.org/T95569#1194807 (10yuvipanda) Should I just delete all the data under the integration project, and let it start again from scratch?
[16:34:51] <werdna>	 legoktm: I tried mwscript phpunit —wiki and that didn’t work
[16:36:36] <shinken-wm>	 RECOVERY - Puppet staleness on deployment-restbase02 is OK: OK: Less than 1.00% above the threshold [3600.0]  
[16:36:40] <shinken-wm>	 RECOVERY - Puppet failure on integration-slave-trusty-1011 is OK: OK: Less than 1.00% above the threshold [0.0]  
[16:36:42] <shinken-wm>	 RECOVERY - Puppet failure on integration-slave1002 is OK: OK: Less than 1.00% above the threshold [0.0]  
[16:37:56] <shinken-wm>	 RECOVERY - Puppet failure on integration-slave-trusty-1016 is OK: OK: Less than 1.00% above the threshold [0.0]  
[16:38:00] <shinken-wm>	 RECOVERY - HHVM Queue Size on deployment-mediawiki01 is OK: OK: Less than 30.00% above the threshold [10.0]  
[16:38:13] <shinken-wm>	 PROBLEM - Citoid on deployment-sca01 is CRITICAL: Connection refused  
[16:39:52] <shinken-wm>	 PROBLEM - Free space - all mounts on deployment-bastion is CRITICAL: CRITICAL: deployment-prep.deployment-bastion.diskspace._var.byte_percentfree (<100.00%)  
[16:47:35] <wikibugs>	 10Browser-Tests, 6Mobile-Web, 10MobileFrontend: add metadata to ChunkyPNG image - https://phabricator.wikimedia.org/T67274#1194885 (10greg) @jdlrobson: can you give some background here and/or let me know if this is still needed?
[16:52:10] <Krinkle>	 !log Pool integration-slave-trusty-1011...integration-slave-trusty-1016
[16:52:13] <qa-morebots>	 Logged the message, Master
[17:10:58] <wikibugs>	 10Beta-Cluster: /var/lib/l10nupdate fills up deployment-bastion /var partition - https://phabricator.wikimedia.org/T95564#1194974 (10greg)
[17:11:51] <Krinkle>	 !log Depool integration-slave1402...integration-slave1405
[17:11:54] <qa-morebots>	 Logged the message, Master
[17:17:43] <grrrit-wm>	 (03PS1) 10Krinkle: Remove bash -x from mw-install-* and mw-run-update [integration/jenkins] - 10https://gerrit.wikimedia.org/r/203111 
[17:17:52] <grrrit-wm>	 (03CR) 10Krinkle: [C: 032] Remove bash -x from mw-install-* and mw-run-update [integration/jenkins] - 10https://gerrit.wikimedia.org/r/203111 (owner: 10Krinkle)
[17:18:29] <greg-g>	 16:39 < shinken-w> PROBLEM - Free space - all mounts on deployment-bastion is CRITICAL: CRITICAL:  deployment-prep.deployment-bastion.diskspace._var.byte_percentfree (<100.00%)
[17:18:39] <grrrit-wm>	 (03Merged) 10jenkins-bot: Remove bash -x from mw-install-* and mw-run-update [integration/jenkins] - 10https://gerrit.wikimedia.org/r/203111 (owner: 10Krinkle)
[17:18:40] <greg-g>	 less than 100% free?
[17:18:52] <Krinkle>	 grrrit-wm: :D
[17:18:54] <Krinkle>	 greg-g: :D
[17:19:13] <YuviPanda>	 it is a terrible message
[17:19:18] <YuviPanda>	 I’ve been meaning to fix that as well
[17:19:26] <YuviPanda>	 BUT TOO MANY THINGS *EXPLODES*
[17:19:31] <greg-g>	 YuviPanda: file a task
[17:19:34] <chasemp>	 well techincally ...it's true :)
[17:19:47] <YuviPanda>	 Krinkle: should I just remove all metrics under integration.?
[17:19:53] <YuviPanda>	 greg-g: there’s already one I think. let me find
[17:19:59] <greg-g>	 chasemp: yeah, I'd kinda hope so, I'm just curious if it's actually a problem
[17:20:04] <Krinkle>	 !log Creating integration-slave-precise-1011
[17:20:06] <qa-morebots>	 Logged the message, Master
[17:20:13] * greg-g ssh's
[17:20:23] <chasemp>	 greg-g: I was poking fun at yuvi :)
[17:20:29] <Krinkle>	 YuviPanda: Preferably not..
[17:20:46] <Krinkle>	 YuviPanda: Though if it's easier, let's do that next monday after I recreated the instances.
[17:20:53] <YuviPanda>	 Krinkle: that’s definitely easier :)
[17:20:59] <Krinkle>	 I'll be deleting a few more isntances and then it'll be stable for the next month
[17:21:15] <greg-g>	 /dev/vda2                                                  1.9G  1.8G   63M  97% /var
[17:21:27] <greg-g>	 63M free :/
[17:21:50] <YuviPanda>	 Krinkle: cool
[17:22:02] <YuviPanda>	 Krinkle: can you note that on the bug and set a time so I’ll make sure I’m around?
[17:22:31] <YuviPanda>	 greg-g: that instance needs recreating but too many things on it and we don’t know what’ll break...
[17:22:39] <Krinkle>	 YuviPanda: Actually, while we'll delete a few more instances, not metrics. So it'd be cool to delete integration.* now.
[17:22:40] * YuviPanda sshs
[17:22:49] <YuviPanda>	 Krinkle: ok, moment
[17:23:00] <Krinkle>	 Then we'll delete the extra instances later, but at least indidivual metrics will be usable again
[17:23:08] <Krinkle>	 It would be helpful to have the during these two days  :)
[17:23:09] <Krinkle>	 Thanks :)
[17:24:04] <wikibugs>	 10Beta-Cluster: /var/lib/l10nupdate fills up deployment-bastion /var partition - https://phabricator.wikimedia.org/T95564#1195043 (10greg) Current df -h  ``` gjg@deployment-bastion:~$ df -h Filesystem                                                 Size  Used Avail Use% Mounted on /dev/vda1...
[17:25:04] <wikibugs>	 10Beta-Cluster: /var/lib/l10nupdate fills up deployment-bastion /var partition - https://phabricator.wikimedia.org/T95564#1195048 (10greg) p:5Triage>3High
[17:27:01] <wikibugs>	 10Beta-Cluster: /var/lib/l10nupdate fills up deployment-bastion /var partition - https://phabricator.wikimedia.org/T95564#1195060 (10yuvipanda) Alright, so 'real' solution is to recreate that instance. Since atm that's a bit of a yak shave, I'm just going to symlink things around
[17:29:37] <wikibugs>	 10Beta-Cluster: /var/lib/l10nupdate fills up deployment-bastion /var partition - https://phabricator.wikimedia.org/T95564#1195063 (10thcipriani) Salient comments from this morning:  <bd808> oh. I thought that we made /var/lib/l10nupdate a symlink to /srv/l10nupdate <bd808> I wonder if puppet is undoing that for...
[17:32:45] <wikibugs>	 10Beta-Cluster: /var/lib/l10nupdate fills up deployment-bastion /var partition - https://phabricator.wikimedia.org/T95564#1195097 (10yuvipanda) I just created the symlink again. Running puppet again.
[17:33:46] <YuviPanda>	 thcipriani: greg-g uhm, puppet seems hosed on all of deployment-prep
[17:34:34] <YuviPanda>	 like, totally
[17:34:38] <YuviPanda>	 certificate failure
[17:35:12] <thcipriani>	 YuviPanda: are you running this on deployment-bastion?
[17:35:24] <YuviPanda>	 thcipriani: it failed there and also failed on salt
[17:35:41] <YuviPanda>	 filing taks now
[17:35:59] <wikibugs>	 10Beta-Cluster, 6Labs: Puppet failing with certificate errors on deployment-prep - https://phabricator.wikimedia.org/T95586#1195122 (10yuvipanda) 3NEW
[17:36:07] <wikibugs>	 10Beta-Cluster: /var/lib/l10nupdate fills up deployment-bastion /var partition - https://phabricator.wikimedia.org/T95564#1195130 (10yuvipanda) Ok, puppet seems hosed on all of deployment-prep. Filed T95586
[17:37:36] <wikibugs>	 10Browser-Tests, 10Continuous-Integration, 7Tracking: Fix or delete browsertests* Jenkins jobs that are failing for more than a week (tracking) - https://phabricator.wikimedia.org/T94150#1195157 (10EBernhardson)
[17:37:37] <wikibugs>	 7Blocked-on-RelEng, 10Browser-Tests, 10Continuous-Integration, 6Collaboration-Team, and 2 others: Pass MEDIAWIKI_CAPTCHA_BYPASS_PASSWORD in on Jenkins so GettingStarted browser tests pass - https://phabricator.wikimedia.org/T91220#1195155 (10EBernhardson) 5Open>3Resolved
[17:39:52] <shinken-wm>	 RECOVERY - Free space - all mounts on deployment-bastion is OK: OK: All targets OK  
[17:40:07] <thcipriani>	 YuviPanda: huh. the cert deployment-salt was trying to use was for the agent was i-0000015c.deployment-prep.eqiad.wmflabs
[17:41:14] <thcipriani>	 fallout from the enc 'true' string, I'd guess. But it's weird, removing the environment causes it to generate a new cert :\
[17:41:32] <wikibugs>	 10Beta-Cluster, 6Labs, 7Puppet: Puppet failing with certificate errors on deployment-prep - https://phabricator.wikimedia.org/T95586#1195176 (10greg) p:5Triage>3Unbreak!
[17:42:46] <YuviPanda>	 thcipriani: did beta ever have ENC set?
[17:44:23] <thcipriani>	 I think so, they ran into the 'true' problem at the same time. wait, deployment-salt is the puppetmaster? No wonder everything is weird.
[17:44:43] <YuviPanda>	 yes
[17:44:46] <YuviPanda>	 deployment-salt is the puppetmaster
[17:44:54] <YuviPanda>	 thcipriani: last successful puppet run was 281 mins ago
[17:45:02] <thcipriani>	 look at the /etc/puppet/puppet.conf that seems wacky
[17:45:30] <thcipriani>	 at least vs what I've been seeing in staging
[17:45:35] <YuviPanda>	 thcipriani: in which host?
[17:45:42] <thcipriani>	 on deployment-salt
[17:46:11] <YuviPanda>	 really? which part seems whacky?
[17:46:17] <YuviPanda>	 I haven’t looked at things
[17:46:29] <YuviPanda>	 err
[17:46:34] <YuviPanda>	 things I meant /etc/puppet/puppet.conf
[17:47:49] <thcipriani>	 YuviPanda: well, I think the ssldir is incorrect, also I think that the master section should have more, stuff, at least it does in modules/puppetmaster/templates/20-master.conf.erb
[17:47:59] <YuviPanda>	 oh, hmm
[17:48:00] <YuviPanda>	 fair enough
[17:48:04] * YuviPanda isnt’ really sure what’s happening
[17:48:11] * YuviPanda runs facter
[17:50:18] <thcipriani>	 YuviPanda: I bet that briefly the dc changed to i-0000015c.deployment-prep.eqiad.wmflabs which overwrote the puppet.conf, so I bet if we just overwrite the puppet.conf with good values and rerun puppet it'll self correct.
[17:51:09] <thcipriani>	 since role::puppet::self checks the ::fqdn against the puppetmaster value in ldap and they didn't match
[17:51:35] <thcipriani>	 since the /etc/resolv.conf was updated to deployment-prep.eqiad.wmflabs
[17:52:19] <YuviPanda>	 uh oh
[17:53:50] <thcipriani>	 what's uh oh?
[17:54:30] <YuviPanda>	 uh oh as in ‘I have no idea how to do that’ :)
[17:54:45] <YuviPanda>	 how do we overrwrite puppet.conf with good values?
[17:58:55] <thcipriani>	 well, that is an excellent question.
[18:05:22] * YuviPanda has no answer, and has to go now
[18:07:17] <thcipriani>	 kk, I think I'm going to try overwriting the puppet.conf pieced together from /etc/puppet/modules/puppet/self
[18:07:48] <thcipriani>	 I really just think it needs to get to the right cert directory and it'll self-correct from there.
[18:12:18] <thcipriani>	 hashar: just in time, I was just looking into this: https://phabricator.wikimedia.org/T95586 which is happening on deployment-salt
[18:12:34] <hashar>	 oh my god
[18:12:42] <hashar>	 please no
[18:13:06] <thcipriani>	 I _think_ I know why it's happening: the /etc/puppet/puppet.conf is pointing to the wrong ssldir
[18:13:15] <hashar>	 thcipriani: so the puppet client on that instance establish a ssl connection with the master
[18:13:29] <hashar>	 when the client is setup for the first time it send its cert to the master
[18:13:33] <hashar>	 and on the master we have to sign it
[18:13:57] <hashar>	 the host in the cert is based on the instance ec2id and eqiad.wmflabs
[18:13:59] <thcipriani>	 right, and it has been signed, it's in the directory /var/lib/puppet/server/ssl
[18:14:14] <hashar>	 so on line 3 you see i-0000015c.eqiad.wmflabs
[18:14:15] <thcipriani>	 but right now it's pointing at /var/lib/puppet/client/ssl
[18:14:49] <hashar>	 last friday  an experimental DNS server has been introduced
[18:15:06] <hashar>	 which slightly change the fully qualified domain name (fqdn) for instances
[18:15:13] <hashar>	 so instead of:  <somename>.eqiad.wmflabs
[18:15:19] <hashar>	 you have the project inserted as a subdomain
[18:15:21] <thcipriani>	 exactly, and what happened, I think, was the fqdn changed, which removed the puppetmaster role
[18:15:28] <hashar>	 ie:  <somename>.deployment-prep.eqiad.wmflabs
[18:15:36] <hashar>	 and that is in turn used in the puppet conf
[18:15:40] <hashar>	 gotta look at puppet.conf 
[18:15:46] <hashar>	 so tldr
[18:15:59] <hashar>	 I spend a good half a day fixing it up on the integration project
[18:16:07] <hashar>	 I looked at beta and it was not impacted
[18:16:20] <hashar>	 and it was not impacted because the operations/puppet repo has been stall for the last 9 days
[18:16:32] <hashar>	 when I have unblocked operations/puppet that caused the faulty change to be deployed
[18:16:35] <hashar>	 damn
[18:16:44] <hashar>	 !sal
[18:16:45] <wm-bot>	 https://wikitech.wikimedia.org/wiki/Release_Engineering/SAL
[18:16:59] <thcipriani>	 I _think_ I can fix this, because the same thing happened in staging
[18:17:15] <hashar>	 yup
[18:17:25] <hashar>	 the dnsmasq server should be the default really
[18:17:28] <hashar>	 I took notes on https://phabricator.wikimedia.org/T95273
[18:17:40] <hashar>	 I eventually had a corrupted puppet.conf on the master
[18:17:49] <hashar>	 so I ended up having to rebuild a bunch of conf files manually
[18:17:57] <hashar>	 and I deleted all ssl certs and regenerated all of them
[18:18:03] <hashar>	 but there must be a smarter way to handle it
[18:18:12] <hashar>	 at first
[18:18:25] <hashar>	 I would set the hiera() conf to use dnsmasq  https://wikitech.wikimedia.org/w/index.php?title=Hiera:Integration&diff=152484&oldid=152033
[18:18:38] <hashar>	 though since puppet clients are not running, it is not going to be applied
[18:18:48] <hashar>	 so I have manually ran:
[18:18:51] <hashar>	 echo 'domain eqiad.wmflabs                                                                 
[18:18:52] <hashar>	 search eqiad.wmflabs                                                                   
[18:18:52] <hashar>	 nameserver 10.68.16.1' > /etc/resolv.conf
[18:18:52] <hashar>	 /etc/init.d/nscd restart
[18:19:32] <hashar>	 https://phabricator.wikimedia.org/T95273#1185320 even has the whole script
[18:19:34] <hashar>	 but that is scary
[18:19:46] <hashar>	 potentially just changing the resolv.conf  should be enough
[18:22:07] <hashar>	 thcipriani: ah and the ssldir are messed up as well
[18:22:25] <thcipriani>	 right, so they need to point at server rather than client
[18:22:30] <hashar>	 so right now
[18:22:40] <hashar>	 puppet.conf has a section [main] 
[18:22:44] <hashar>	 ssldir = /var/lib/puppet/client/ssl
[18:22:47] <thcipriani>	 wait!
[18:22:50] <hashar>	 so I suspect the master is using the client cert
[18:22:55] <hashar>	 when it should use ... the server cert
[18:23:05] <hashar>	 I remember having seen a diff once I fixed puppet
[18:23:12] <thcipriani>	 I got it, so here's what changed in your puppet.conf
[18:23:36] <thcipriani>	 https://phabricator.wikimedia.org/P500
[18:23:55] <thcipriani>	 so if you restore those settings + resov.conf you should be good to go
[18:24:03] <grrrit-wm>	 (03PS1) 10Awight: CiviCRM job can be run concurrently [integration/config] - 10https://gerrit.wikimedia.org/r/203187 (https://phabricator.wikimedia.org/T91895) 
[18:24:04] <hashar>	 here the puppet conf on integration puppet master : https://phabricator.wikimedia.org/P501
[18:24:04] <thcipriani>	 I grabbed that out of the /var/log/puppet.log
[18:24:14] <hashar>	 note how  [master] has:  ssldir = /var/lib/puppet/server/ssl/ 
[18:24:35] <thcipriani>	 yup
[18:24:36] <hashar>	 ah yeah P500 is the diff
[18:24:41] <hashar>	 accurately describe the issue
[18:25:06] <hashar>	 seems the hostname change cause some puppet manifest to no more recognize the instance has being the master
[18:25:10] <hashar>	 so the [master] section is dropped
[18:25:13] <hashar>	 but the master is still around
[18:25:18] <hashar>	 so yeah restore
[18:25:48] <thcipriani>	 right, the hostname is critical because role::puppet::self checks the ::fqdn against the puppetmaster set in ldap
[18:26:01] <thcipriani>	 and if they match, it gives it the puppetmaster role
[18:26:02] <hashar>	 patch  --reverse !!
[18:28:16] <wikibugs>	 10Continuous-Integration, 6Labs: integration labs project DNS resolver improperly switched to openstack-designate - https://phabricator.wikimedia.org/T95273#1195498 (10hashar) The puppet failure where due to the hostname of the puppetmaster changing. That causes puppetmaster self to no more recognize the maste...
[18:28:39] <wikibugs>	 10Beta-Cluster, 6Labs, 7Puppet: Puppet failing with certificate errors on deployment-prep - https://phabricator.wikimedia.org/T95586#1195122 (10hashar) The puppet failure where due to the hostname of the puppetmaster changing. That causes puppetmaster self to no more recognize the master as being the master...
[18:28:44] <hashar>	 thcipriani: so yeah should be much faster to fix
[18:29:06] <hashar>	 you would never believe how much I have screamed while fixing it for integration :(
[18:29:10] <thcipriani>	 hashar: I just restored /etc/puppet/puppet.conf
[18:29:23] <thcipriani>	 so if we want to try a puppet run, it _should_ work
[18:29:27] <hashar>	 is puppet agent happy now ?
[18:30:08] <thcipriani>	 hashar: seems to be running...
[18:30:32] <thcipriani>	 dang:  Could not find class role::labs::instance
[18:31:12] <hashar>	 try again ! :D
[18:31:33] <hashar>	 at least it seems to compile just fine
[18:31:50] <hashar>	 Apr  9 18:31:43 deployment-salt puppet-master[1404]: Compiled catalog for i-0000015c.eqiad.wmflabs in environment production in 10.80 seconds
[18:31:58] <hashar>	 but
[18:32:06] <hashar>	 Could not retrieve facts for i-0000083a.eqiad.wmflabs: SQLite3::BusyException: database is locked:
[18:32:26] <hashar>	 Apr  9 18:32:14 deployment-salt puppet-agent[6371]: (/Stage[main]/Role::Labs::Instance/File[/etc/mailname]/content) -deployment-salt.deployment-prep.eqiad.wmflabs
[18:32:26] <hashar>	 Apr  9 18:32:14 deployment-salt puppet-agent[6371]: (/Stage[main]/Role::Labs::Instance/File[/etc/mailname]/content) +deployment-salt.eqiad.wmflabs
[18:32:28] <hashar>	 seems to work
[18:32:43] <hashar>	 thcipriani: sqlite does not really handle concurrent connections :D
[18:32:59] <thcipriani>	 heh, sorry :)
[18:34:32] <hashar>	 so I ran puppet on deployment-bastio
[18:34:46] <hashar>	 it cant reach some metadata directory :-(
[18:34:59] <hashar>	  Connection refused puppet://deployment-salt.eqiad.wmflabs/plugins
[18:35:19] <hashar>	 on integration,  /etc/puppet/auth.conf ended up being corrupted
[18:35:52] <hashar>	 maybe restarrting puppetmaster would suffice
[18:36:22] <thcipriani>	 yeah, maybe. I noticed that in the puppet run on deployment-salt it did correct the auth.conf
[18:36:29] <hashar>	 great
[18:36:43] <thcipriani>	 so, maybe everything's magically fixed...
[18:36:52] <hashar>	 restarting puppetmaster
[18:37:10] <hashar>	 na :(
[18:37:11] <shinken-wm>	 PROBLEM - Puppet failure on deployment-logstash1 is CRITICAL: CRITICAL: 22.22% of data above the critical threshold [0.0]  
[18:37:26] <thcipriani>	 bummer
[18:37:31] <shinken-wm>	 PROBLEM - Puppet failure on deployment-kafka02 is CRITICAL: CRITICAL: 22.22% of data above the critical threshold [0.0]  
[18:37:31] <shinken-wm>	 PROBLEM - Puppet failure on deployment-mediawiki03 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [0.0]  
[18:37:51] <shinken-wm>	 PROBLEM - Puppet failure on deployment-db1 is CRITICAL: CRITICAL: 40.00% of data above the critical threshold [0.0]  
[18:38:21] <hashar>	 ah
[18:38:26] <hashar>	 puppetmaster refuses to start
[18:39:15] <shinken-wm>	 PROBLEM - Puppet failure on deployment-elastic05 is CRITICAL: CRITICAL: 55.56% of data above the critical threshold [0.0]  
[18:39:17] <hashar>	 and no idea how to look at logs
[18:39:32] <shinken-wm>	 PROBLEM - Puppet failure on deployment-db2 is CRITICAL: CRITICAL: 40.00% of data above the critical threshold [0.0]  
[18:40:02] <shinken-wm>	 PROBLEM - Puppet failure on deployment-redis02 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [0.0]  
[18:40:21] <thcipriani>	 hmm, not in syslog or dmesg
[18:40:28] <hashar>	 it is back up somehow
[18:40:46] <shinken-wm>	 PROBLEM - Puppet failure on deployment-elastic06 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [0.0]  
[18:41:16] <shinken-wm>	 PROBLEM - Puppet failure on deployment-cxserver03 is CRITICAL: CRITICAL: 55.56% of data above the critical threshold [0.0]  
[18:41:58] <hashar>	 there is some apache / ruby thing providing metadata
[18:42:28] <hashar>	 supposed to listen on port 8140
[18:43:45] <shinken-wm>	 PROBLEM - Puppet failure on deployment-elastic08 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [0.0]  
[18:45:11] <shinken-wm>	 PROBLEM - Puppet failure on deployment-mathoid is CRITICAL: CRITICAL: 22.22% of data above the critical threshold [0.0]  
[18:45:54] <thcipriani>	 hashar: hmm, I see puppetmaster::passenger role, but it doesn't even look like apache is installed on deployment-salt
[18:46:06] <hashar>	 gonna kill -9 it
[18:46:19] <shinken-wm>	 PROBLEM - Puppet failure on deployment-upload is CRITICAL: CRITICAL: 33.33% of data above the critical threshold [0.0]  
[18:46:31] <thcipriani>	 kk
[18:46:35] <shinken-wm>	 PROBLEM - Puppet failure on deployment-memc03 is CRITICAL: CRITICAL: 44.44% of data above the critical threshold [0.0]  
[18:46:43] <hashar>	 restarting again
[18:46:48] <hashar>	 looking at netstat -tlnp
[18:46:53] <hashar>	 to figure out whether a ruby process listen
[18:47:01] <shinken-wm>	 PROBLEM - Puppet failure on deployment-jobrunner01 is CRITICAL: CRITICAL: 30.00% of data above the critical threshold [0.0]  
[18:47:04] <hashar>	 tcp        0      0 10.68.16.99:8140        0.0.0.0:*               LISTEN      9697/ruby       
[18:47:09] <shinken-wm>	 PROBLEM - Puppet failure on deployment-zookeeper01 is CRITICAL: CRITICAL: 44.44% of data above the critical threshold [0.0]  
[18:47:15] <hashar>	 running puppet locally
[18:47:21] <hashar>	 works!!!
[18:47:32] <hashar>	 thcipriani: I think the puppet master having the bad conf was still running
[18:47:40] <hashar>	 and the init.d script was not killing it for some reason
[18:47:44] <hashar>	 had to kill -9
[18:47:47] <thcipriani>	 ah, that makes sense
[18:47:51] <shinken-wm>	 PROBLEM - Puppet failure on deployment-elastic07 is CRITICAL: CRITICAL: 30.00% of data above the critical threshold [0.0]  
[18:47:52] <hashar>	 then start the process again via the init script
[18:47:59] <hashar>	 and apparently we have a working puppetmaster again
[18:48:10] <hashar>	 that is tedious
[18:48:19] <thcipriani>	 yeah, that was kinda rough
[18:48:25] <hashar>	 all those sysadmins tasks remembers me  it is a job :)
[18:48:27] <shinken-wm>	 PROBLEM - Puppet failure on deployment-mediawiki01 is CRITICAL: CRITICAL: 22.22% of data above the critical threshold [0.0]  
[18:48:33] <shinken-wm>	 PROBLEM - Puppet failure on deployment-fluoride is CRITICAL: CRITICAL: 33.33% of data above the critical threshold [0.0]  
[18:48:38] <hashar>	 so tl:dr;  puppet / labs etc are awesome
[18:48:45] <hashar>	 but random crazy failures occurs often
[18:49:18] <hashar>	 root cause in the end is the hostname changed causing puppetmaster to be downgraded magically as a normal client
[18:49:27] <shinken-wm>	 PROBLEM - Puppet failure on deployment-redis01 is CRITICAL: CRITICAL: 66.67% of data above the critical threshold [0.0]  
[18:49:31] <marxarelli>	 hashar: had to do the same a while back https://wikitech.wikimedia.org/wiki/Release_Engineering/SAL#March_18
[18:49:43] <shinken-wm>	 PROBLEM - Puppet failure on deployment-test is CRITICAL: CRITICAL: 50.00% of data above the critical threshold [0.0]  
[18:50:02] <hashar>	 marxarelli: oh 
[18:50:08] <shinken-wm>	 PROBLEM - Puppet failure on deployment-videoscaler01 is CRITICAL: CRITICAL: 33.33% of data above the critical threshold [0.0]  
[18:50:17] <hashar>	 maybe it ended up being locked by too many catalog being compiled
[18:50:39] <wikibugs>	 10Continuous-Integration, 7Upstream: Fails npm build failure "File exists: ../esprima/bin/esparse.js" - https://phabricator.wikimedia.org/T90816#1195557 (10Krinkle) 5Resolved>3Open Happened again.  https://integration.wikimedia.org/ci/job/npm/2194/console  ``` 18:15:08 Building remotely on integration-slav...
[18:51:16] <shinken-wm>	 PROBLEM - Puppet failure on deployment-memc02 is CRITICAL: CRITICAL: 62.50% of data above the critical threshold [0.0]  
[18:51:40] <thcipriani>	 so now what's up with all these puppet failures
[18:52:01] <hashar>	 so
[18:52:09] <hashar>	 no clue :)
[18:52:12] <shinken-wm>	 PROBLEM - Puppet failure on deployment-sca01 is CRITICAL: CRITICAL: 62.50% of data above the critical threshold [0.0]  
[18:52:23] <hashar>	 but an interesting thing I have seen is that some packages for Precise have been updated 
[18:52:28] <marxarelli>	 it may take time for them to resolve
[18:52:37] <hashar>	 and the repo is signed with a GPG key we do not have on instance
[18:52:46] <hashar>	 might cause issues
[18:54:40] <hashar>	 top #1 reason I love puppet:  Duplicate declaration: Package[zip]
[18:55:25] <thcipriani>	 OK, just to doublecheck that these puppet runs will be fine, I ran puppet on deployment-mediawiki01 and it went fine, so hooray!
[18:55:47] <hashar>	 memc02 as well
[18:55:50] <hashar>	 so I guess they will recover
[18:55:52] <hashar>	 meanwhile
[18:56:01] <hashar>	 on deployment-bastion there is a nasty change going on
[18:56:05] <hashar>	 with syslog-ng and rsyslog
[18:56:11] <thcipriani>	 yeah, saw that
[18:56:19] <hashar>	 TECH DEBT OF DOOOM
[18:56:34] <hashar>	 so in short on prod we used syslog as a central aggregator
[18:56:49] <hashar>	 then we had all app servers to use rsyslog to relay their local log to that central aggregator
[18:56:57] <hashar>	 and we never bothered to move the central syslog to rsylog
[18:57:06] <hashar>	 - or at least we hadn't back 1 + year ago -
[18:57:20] <hashar>	 so on beta everything should have rsyslog to relay log
[18:57:27] <hashar>	 BUT deployment-bastion should only have syslog-ng
[18:57:40] <hashar>	 and of course rsyslog and syslog-ng packages conflict
[19:00:50] * hashar https://www.youtube.com/watch?v=GlYj0ogWNRA  *Funky Disco House Mix *
[19:03:58] <shinken-wm>	 RECOVERY - Puppet failure on deployment-salt is OK: OK: Less than 1.00% above the threshold [0.0]  
[19:04:42] <wikibugs>	 10Continuous-Integration: Deprecate global CodeSniffer rules repo and phpcs jobs - https://phabricator.wikimedia.org/T66371#1195612 (10Krinkle)
[19:05:27] <wikibugs>	 10Continuous-Integration: Deprecate global CodeSniffer rules repo and phpcs jobs - https://phabricator.wikimedia.org/T66371#697969 (10Krinkle) a:5Krinkle>3None
[19:06:16] <shinken-wm>	 RECOVERY - Puppet failure on deployment-cxserver03 is OK: OK: Less than 1.00% above the threshold [0.0]  
[19:06:18] <shinken-wm>	 RECOVERY - Puppet failure on deployment-memc02 is OK: OK: Less than 1.00% above the threshold [0.0]  
[19:07:12] <shinken-wm>	 RECOVERY - Puppet failure on deployment-logstash1 is OK: OK: Less than 1.00% above the threshold [0.0]  
[19:07:28] <shinken-wm>	 RECOVERY - Puppet failure on deployment-kafka02 is OK: OK: Less than 1.00% above the threshold [0.0]  
[19:07:32] <shinken-wm>	 RECOVERY - Puppet failure on deployment-mediawiki03 is OK: OK: Less than 1.00% above the threshold [0.0]  
[19:07:50] <shinken-wm>	 RECOVERY - Puppet failure on deployment-db1 is OK: OK: Less than 1.00% above the threshold [0.0]  
[19:08:29] <shinken-wm>	 RECOVERY - Puppet failure on deployment-mediawiki01 is OK: OK: Less than 1.00% above the threshold [0.0]  
[19:09:05] <shinken-wm>	 PROBLEM - Puppet failure on integration-slave-jessie-1001 is CRITICAL: CRITICAL: 44.44% of data above the critical threshold [0.0]  
[19:09:35] <shinken-wm>	 RECOVERY - Puppet failure on deployment-db2 is OK: OK: Less than 1.00% above the threshold [0.0]  
[19:10:01] <shinken-wm>	 RECOVERY - Puppet failure on deployment-redis02 is OK: OK: Less than 1.00% above the threshold [0.0]  
[19:10:45] <shinken-wm>	 RECOVERY - Puppet failure on deployment-elastic06 is OK: OK: Less than 1.00% above the threshold [0.0]  
[19:13:47] <shinken-wm>	 RECOVERY - Puppet failure on deployment-elastic08 is OK: OK: Less than 1.00% above the threshold [0.0]  
[19:13:56] <wikibugs>	 10Continuous-Integration, 10Librarization: Jenkins: Create job for verifying committed "vendor" directory from composer - https://phabricator.wikimedia.org/T74952#1195643 (10Krinkle)
[19:14:29] <shinken-wm>	 RECOVERY - Puppet failure on deployment-redis01 is OK: OK: Less than 1.00% above the threshold [0.0]  
[19:14:33] <greg-g>	 whew
[19:15:13] <shinken-wm>	 RECOVERY - Puppet failure on deployment-mathoid is OK: OK: Less than 1.00% above the threshold [0.0]  
[19:16:16] <shinken-wm>	 RECOVERY - Puppet failure on deployment-upload is OK: OK: Less than 1.00% above the threshold [0.0]  
[19:16:34] <shinken-wm>	 RECOVERY - Puppet failure on deployment-memc03 is OK: OK: Less than 1.00% above the threshold [0.0]  
[19:17:08] <shinken-wm>	 RECOVERY - Puppet failure on deployment-zookeeper01 is OK: OK: Less than 1.00% above the threshold [0.0]  
[19:17:10] <shinken-wm>	 RECOVERY - Puppet failure on deployment-sca01 is OK: OK: Less than 1.00% above the threshold [0.0]  
[19:17:25] <hashar>	 thcipriani: seems all good
[19:17:50] <shinken-wm>	 RECOVERY - Puppet failure on deployment-elastic07 is OK: OK: Less than 1.00% above the threshold [0.0]  
[19:18:34] <shinken-wm>	 RECOVERY - Puppet failure on deployment-fluoride is OK: OK: Less than 1.00% above the threshold [0.0]  
[19:18:50] <thcipriani>	 hashar: nice—now we can address the /var partition, which is what we were trying to do in the first place :P
[19:19:02] <hashar>	 that is not going to be easy :(
[19:19:06] <shinken-wm>	 RECOVERY - Puppet failure on integration-slave-jessie-1001 is OK: OK: Less than 1.00% above the threshold [0.0]  
[19:19:07] <hashar>	 probably easier to build some new instance
[19:19:14] <shinken-wm>	 RECOVERY - Puppet failure on deployment-elastic05 is OK: OK: Less than 1.00% above the threshold [0.0]  
[19:19:26] <hashar>	 and migrate services hosted on deployment-bastion to different and fresh instances
[19:19:44] <shinken-wm>	 RECOVERY - Puppet failure on deployment-test is OK: OK: Less than 1.00% above the threshold [0.0]  
[19:20:21] <thcipriani>	 well, I posted some comments here: https://phabricator.wikimedia.org/T95564
[19:22:00] <shinken-wm>	 RECOVERY - Puppet failure on deployment-jobrunner01 is OK: OK: Less than 1.00% above the threshold [0.0]  
[19:23:00] <wikibugs>	 10Beta-Cluster: /var/lib/l10nupdate fills up deployment-bastion /var partition - https://phabricator.wikimedia.org/T95564#1195659 (10hashar) @yuvipanda the puppet manifests ensure /var/lib/l10nupdate is a directory, so you cant really symlink.  Up until March 24th the l10nupdate working directory was in /srv/l10...
[19:24:48] <wikibugs>	 10Beta-Cluster, 6Labs, 7Puppet: Puppet failing with certificate errors on deployment-prep - https://phabricator.wikimedia.org/T95586#1195675 (10hashar) 5Open>3Resolved a:3hashar Ok solved! That was the exact same issue as on integration and staging project.  Changing the hostname cause the puppetmaster...
[19:25:09] <shinken-wm>	 RECOVERY - Puppet failure on deployment-videoscaler01 is OK: OK: Less than 1.00% above the threshold [0.0]  
[19:35:05] <shinken-wm>	 PROBLEM - Puppet failure on integration-slave-jessie-1001 is CRITICAL: CRITICAL: 22.22% of data above the critical threshold [0.0]  
[19:43:52] <grrrit-wm>	 (03PS2) 10Legoktm: Merge mwext-Wikibase-* repo and repo-api jobs [integration/config] - 10https://gerrit.wikimedia.org/r/202932 
[19:45:47] <grrrit-wm>	 (03CR) 10Legoktm: [C: 032] Merge mwext-Wikibase-* repo and repo-api jobs [integration/config] - 10https://gerrit.wikimedia.org/r/202932 (owner: 10Legoktm)
[19:49:11] <grrrit-wm>	 (03Merged) 10jenkins-bot: Merge mwext-Wikibase-* repo and repo-api jobs [integration/config] - 10https://gerrit.wikimedia.org/r/202932 (owner: 10Legoktm)
[19:50:03] <legoktm>	 !log deployed https://gerrit.wikimedia.org/r/202932
[19:50:06] <qa-morebots>	 Logged the message, Master
[19:53:46] <shinken-wm>	 PROBLEM - Puppet failure on integration-slave-precise-1011 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [0.0]  
[20:10:24] <wikibugs>	 10Beta-Cluster, 6Labs, 6operations: GPG error: http://nova.clouds.archive.ubuntu.com precise Release BADSIG 40976EAF437D05B5 - https://phabricator.wikimedia.org/T95541#1195824 (10Dzahn) 5Open>3Resolved a:3Dzahn fixed with method 2:  ``` # apt-get clean # cd /var/lib/apt # mv lists lists.old # mkdir -p...
[20:16:18] <wikibugs>	 10Beta-Cluster, 6Labs, 6operations: GPG error: http://nova.clouds.archive.ubuntu.com precise Release BADSIG 40976EAF437D05B5 - https://phabricator.wikimedia.org/T95541#1195841 (10Dzahn) root@deployment-bastion:~# apt-key list | grep -B1 ftpmaster pub   1024D/437D05B5 2004-09-12 uid                  Ubuntu Ar...
[20:20:52] <mutante>	 fixed deployment-bastion's APT sources
[20:21:11] <wikibugs>	 10Beta-Cluster, 6Labs, 6operations: GPG error: http://nova.clouds.archive.ubuntu.com precise Release BADSIG 40976EAF437D05B5 - https://phabricator.wikimedia.org/T95541#1195853 (10hashar) Thanks a ton @dzahn for the fix, the reference and the detailed step by step instructions!
[20:21:16] <mutante>	 wondered if there were pending package upgrades since apt-get update was fixed
[20:21:29] <mutante>	 saw that it would upgrade both libc6 and php5, so a bunch
[20:21:41] <mutante>	 also saw it would _down_grade salt-minion (?)
[20:21:57] <mutante>	 didnt execute it
[20:25:22] <wikibugs>	 10Browser-Tests, 10Continuous-Integration, 10Wikimedia-Fundraising: Create unit and integration tests for Fundraising extensions to identify breaking MediaWiki changes - https://phabricator.wikimedia.org/T89404#1195872 (10awight)
[20:39:26] <wikibugs>	 10Continuous-Integration, 6operations: Provide Jessie package to fullfil Mediawiki::Packages requirement - https://phabricator.wikimedia.org/T95002#1195915 (10hashar) Thanks @faidon for the preliminary investigation. Should I fill subtasks for the 5 points you mentioned? It seems that each will reach out to di...
[20:41:17] <wikibugs>	 10Continuous-Integration, 6operations: Provide Jessie package to fullfil Mediawiki::Packages requirement - https://phabricator.wikimedia.org/T95002#1195919 (10hashar)
[20:41:52] <wikibugs>	 10Continuous-Integration, 6operations: Provide Jessie package to fullfil Mediawiki::Packages requirement - https://phabricator.wikimedia.org/T95002#1177707 (10hashar)
[20:41:55] <shinken-wm>	 RECOVERY - Puppet failure on deployment-bastion is OK: OK: Less than 1.00% above the threshold [0.0]  
[20:44:34] <hashar>	 ah syslog is all happy
[20:45:04] <hashar>	 thcipriani: marxarelli: tip for beta cluster, syslog should be centraly collected by deployment-bastion and are written to /data/project/syslog
[20:46:01] <thcipriani>	 hashar: neat.
[20:46:15] <hashar>	 and it is spammed with:  init: citoid main process ended, respawning
[20:48:28] <greg-g>	 heh
[20:48:44] <wikibugs>	 10Beta-Cluster, 10Citoid: Citoid Syntaxerror on beta cluster - https://phabricator.wikimedia.org/T95616#1195947 (10hashar) 3NEW
[20:48:52] <hashar>	 I have no idea how many tasks I have created this week
[20:49:04] <hashar>	 I feel like my job title should now be "task filler"
[20:49:43] <greg-g>	 21
[20:49:46] <greg-g>	 https://phabricator.wikimedia.org/maniphest/query/uQyw3ZctIBhm/#R
[20:50:10] <wikibugs>	 10Beta-Cluster, 10Citoid: Citoid Syntaxerror on beta cluster - https://phabricator.wikimedia.org/T95616#1195960 (10hashar) /etc/citoid/config.yaml is definitely a YAML file but somehow it is being loaded as a javascript file :/
[20:50:29] <hashar>	 nice
[20:50:42] <hashar>	 greg-g: OpenStack infra is considering Phabricator
[20:50:49] <hashar>	 instead of their home made bug system
[20:51:25] <greg-g>	 saw that :) I've been ignoring the commentary though
[20:52:22] <chasemp>	 best quote is "if only it was written in python"
[20:52:24] <chasemp>	 :)
[20:59:19] <wikibugs>	 3Continuous-Integration-Isolation: Figure out how Jenkins conf is maintained by OpenStack - https://phabricator.wikimedia.org/T95049#1195996 (10hashar) OpenStack has a fully puppetized Jenkins. They have split their puppet modules as independent repositories so that people from the OpenStack community can benefi...
[21:15:13] <wikibugs>	 10Beta-Cluster, 10Citoid: Citoid Syntaxerror on beta cluster - https://phabricator.wikimedia.org/T95616#1196074 (10mobrovac) Merci beaucoup for noticing and letting us know @hashar !
[21:18:51] <wikibugs>	 10Beta-Cluster: beta-scap-eqiad no more run due to ssh Permission denied - https://phabricator.wikimedia.org/T95562#1196076 (10hashar) a:3thcipriani Fixed by Tyler
[21:24:39] <hashar>	 chasemp: they are a python shop so yeah :(
[21:24:49] <hashar>	 that is one of the reason we migrated out of perl Bugzilla
[21:25:44] <hashar>	 anyway I am out
[23:34:56] <grrrit-wm>	 (03PS1) 10Krinkle: zuul: Don't raise "abort" as error to the user [integration/docroot] - 10https://gerrit.wikimedia.org/r/203251 
[23:35:10] <grrrit-wm>	 (03CR) 10Krinkle: [C: 032] zuul: Don't raise "abort" as error to the user [integration/docroot] - 10https://gerrit.wikimedia.org/r/203251 (owner: 10Krinkle)
[23:35:47] <grrrit-wm>	 (03Merged) 10jenkins-bot: zuul: Don't raise "abort" as error to the user [integration/docroot] - 10https://gerrit.wikimedia.org/r/203251 (owner: 10Krinkle)
[23:43:14] <wikibugs>	 10Continuous-Integration, 7Upstream: Fails npm build failure "File exists: ../esprima/bin/esparse.js" - https://phabricator.wikimedia.org/T90816#1196461 (10Krinkle)