[00:05:47] 10Continuous-Integration, 3Fundraising Sprint House of Pain, 10Fundraising Tech Backlog, 10Wikimedia-Fundraising-CiviCRM, and 2 others: Write Jenkins job builder definition for CiviCRM CI job - https://phabricator.wikimedia.org/T91895#1192656 (10awight) a:5awight>3None Unassigning: I am finished workin... [00:15:28] (03PS3) 10Legoktm: Convert extensions to use generic phpunit job (D-E) [integration/config] - 10https://gerrit.wikimedia.org/r/202279 [00:17:16] hey legoktm, do you work for CI now? ;) [00:17:34] he works for everyone :D [00:18:14] so much energy! I have taken advantage of that before [00:18:22] :D [00:22:25] (03CR) 10Legoktm: [C: 032] Convert extensions to use generic phpunit job (D-E) [integration/config] - 10https://gerrit.wikimedia.org/r/202279 (owner: 10Legoktm) [00:30:12] bd808: it was your idea to deputize me! :P [00:31:27] (03Merged) 10jenkins-bot: Convert extensions to use generic phpunit job (D-E) [integration/config] - 10https://gerrit.wikimedia.org/r/202279 (owner: 10Legoktm) [00:32:30] !log deploying https://gerrit.wikimedia.org/r/202279 [00:32:33] Logged the message, Master [00:40:09] legoktm: are you deputy PM to bd808? :) [00:40:38] noooooooooo [00:44:24] YESSSS [00:44:24] :P [02:24:35] PROBLEM - SSH on deployment-bastion is CRITICAL: CRITICAL - Socket timeout after 10 seconds [04:33:19] (03PS2) 10Legoktm: Add phplint job for mediawiki/vendor [integration/config] - 10https://gerrit.wikimedia.org/r/202938 [04:34:12] (03CR) 10Legoktm: [C: 032] Add phplint job for mediawiki/vendor [integration/config] - 10https://gerrit.wikimedia.org/r/202938 (owner: 10Legoktm) [04:35:55] (03Merged) 10jenkins-bot: Add phplint job for mediawiki/vendor [integration/config] - 10https://gerrit.wikimedia.org/r/202938 (owner: 10Legoktm) [04:36:37] !log deploying https://gerrit.wikimedia.org/r/202938 [04:36:42] Logged the message, Master [05:04:24] (03PS1) 10Legoktm: Convert extensions to use generic phpunit job (F-G) [integration/config] - 10https://gerrit.wikimedia.org/r/202992 [05:11:09] !log deleted core dumps from integration-slave1002, /var had filled up [05:11:12] Logged the message, Master [05:40:18] legoktm: /var? I thought there is no more separate /var [05:40:56] YuviPanda: these slaves are old [05:41:19] YuviPanda: https://phabricator.wikimedia.org/T94916 halp [05:58:36] (03CR) 10Legoktm: [C: 032] Convert extensions to use generic phpunit job (F-G) [integration/config] - 10https://gerrit.wikimedia.org/r/202992 (owner: 10Legoktm) [06:01:56] (03Merged) 10jenkins-bot: Convert extensions to use generic phpunit job (F-G) [integration/config] - 10https://gerrit.wikimedia.org/r/202992 (owner: 10Legoktm) [06:02:16] !log deploying https://gerrit.wikimedia.org/r/202992 [06:02:19] Logged the message, Master [06:03:47] (03PS1) 10Legoktm: Add GlobalCssJs to shared extension job [integration/config] - 10https://gerrit.wikimedia.org/r/202998 [06:12:06] (03CR) 10Legoktm: [C: 032] Add GlobalCssJs to shared extension job [integration/config] - 10https://gerrit.wikimedia.org/r/202998 (owner: 10Legoktm) [06:15:15] (03Merged) 10jenkins-bot: Add GlobalCssJs to shared extension job [integration/config] - 10https://gerrit.wikimedia.org/r/202998 (owner: 10Legoktm) [06:15:40] !log deploying https://gerrit.wikimedia.org/r/202998 [06:15:42] Logged the message, Master [07:29:02] 6Release-Engineering: Investigate production and/or beta requirements for Sentry - https://phabricator.wikimedia.org/T89732#1193235 (10Tgr) This largely happened in other tasks, I think. 
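A rough sketch of the kind of cleanup behind the 05:11 !log entry (core dumps filling /var on integration-slave1002); the du invocation and the assumption that the dumps sit directly under /var are illustrative, not what was actually run:

```
# Find what is eating /var, then clear old core dumps (illustrative only):
du -xh /var --max-depth=2 | sort -rh | head -n 15
find /var -xdev -type f -name 'core*' -ls      # review before deleting
find /var -xdev -type f -name 'core*' -delete
df -h /var
```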
See T93138 (initial hardware request), T84956 (packaging and puppetizing), T86677 (initial security review). Do you see the ne... [07:34:16] (03Abandoned) 10Giuseppe Lavagetto: proxies: allow filtering by datacenter [tools/scap] - 10https://gerrit.wikimedia.org/r/200130 (owner: 10Giuseppe Lavagetto) [08:38:07] 6Release-Engineering, 10Wikimedia-Git-or-Gerrit, 7Documentation: Document how to tag extensions in git - https://phabricator.wikimedia.org/T94412#1193399 (10Mglaser) @mmodell, thanks for your support here! Tagging helps a lot when you want to do good extension versioning. I wonder what @demon thinks. Can thi... [08:49:55] !log https://integration.wikimedia.org/ci/job/beta-update-databases-eqiad/ job stalled for some reason [08:50:01] Logged the message, Master [08:50:15] !log https://integration.wikimedia.org/ci/job/beta-code-update-eqiad/ timed out after 30 minutes while trying to git pull [08:50:18] Logged the message, Master [08:51:47] !log deployment-bastion is out of disk space on /var/ :( [08:51:49] Logged the message, Master [08:54:15] hashar: Is there a local PPA I should register to make Package[zuul] work? [08:54:19] Or does the package not exist in any repo yet? [08:54:26] Krinkle: hey! [08:54:28] Can you document how to install that .deb? [08:54:30] sorry about the mess up yesterday [08:54:33] :-) [08:54:44] had to leave early in the middle of the afternoon due to the whole familly being sick :( [08:54:54] the .deb is only in /home/hashar/ for now [08:55:15] I will get it added to apt.wikimedia.org for both Trusty and Precise whenever I am happy with the package [08:55:20] hopefully today :) [08:55:31] I don't know how to install that. There's commands for it, but there is different arguments and variations. [08:55:49] in theory we could set up a local repo under /data/project/ and inject some custom config in apt.conf [08:55:57] What commands should I exec exactly? [08:55:58] labs might well have support for that already, I havent looked though [08:56:13] I'm trying to make our patches just a simple bash script [08:56:17] I did it with the old patches already: https://phabricator.wikimedia.org/P466 [08:56:18] dpkg -i /home/hashar/zuul_XXXXXXX.deb [08:56:23] apt-get install -f [08:56:35] Project beta-code-update-eqiad build #51048: FAILURE in 3 min 34 sec: https://integration.wikimedia.org/ci/job/beta-code-update-eqiad/51048/ [08:56:36] where XXXX vary between 'precise' and 'trusty' [08:56:49] and apt-get install -f is to install missing dependencies [08:57:13] hashar: does the apt-get take an argument related to zuul or the deb file or it fetches all known missing dependencies? [08:57:56] when you do dpkg -i [08:58:01] that tries to install the .deb file passed in parameter [08:58:12] that .deb file has a bunch of dependencies themselves which are added to list of packages to be installed [08:58:19] but dpkg is not smart enough to install them for you [08:58:24] so it just register the dependencies [08:58:27] Hm.. interesting, the home mount is gone on slave-trusty-1010 [08:58:31] and bails out because they are not available on the system [08:58:43] apt-get is able to fetch the missing packages from some repo [08:58:56] so apt-get install will tell you that there are some broken/missing packages [08:58:57] hashar: Right, so apt-get knows about the state that dpkg-i left behind. 
[08:59:01] and -f make it install them [08:59:04] sorry definitely a big mess : [08:59:05] ( [08:59:26] RECOVERY - SSH on deployment-bastion is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1.4 (protocol 2.0) [08:59:42] hashar: Does installing manually resolve the puppet resource for Zuul package? [08:59:51] !log rebooted deployment-bastion and cleared some files under /var/ [08:59:51] e.g. how does it interact with the rest of the manifest [08:59:54] Logged the message, Master [09:00:05] Krinkle: yeah because puppet uses apt-get install 'whatever package' [09:00:20] but since the package is missing from apt.wikimedia.org, it can't find it and bails out [09:00:34] that is the error messages you have seen yesterday with Package['zuul'] blatantly failling [09:00:36] hashar: Does apt-get install learn about 'zuul' via dpkg -i? [09:00:42] yup [09:00:48] though I don't know all the details [09:00:56] So puppet will continue after installing manually? [09:01:08] seems dpkg -i register in some state file that the 'zuul' package is provided by a file /home/hashar/zuul_XXX.deb [09:01:19] yeah puppet will be happy [09:01:33] because once installed manually the state file is updated to state that 'zuul' is installed [09:01:47] so when puppet verify whether the package is there (running: apt-cache policy zuul) [09:01:54] it will get a positive [09:02:07] Hm.. any idea why the mount is gone? [09:02:09] iirc you can see what puppet is using as underlying command by running with debug [09:02:15] puppet agent -tv --debug [09:02:21] should dump all the shell commands bein gused [09:02:43] Krinkle: which mount? :) [09:02:48] home/ [09:03:19] it is supposed to be a NFS mount yeah [09:03:32] which instance has the issue? You might have to remount /home [09:03:39] or just reboot :D [09:04:04] I already reboooted twice [09:04:09] integration-slave-trusty-1010 [09:04:23] The one I've been working on for a week. It happened again, it takes a week to re-create our instances :-( [09:05:16] Project browsertests-MultimediaViewer-en.wikipedia.beta.wmflabs.org-os_x_10.9-safari-sauce build #564: FAILURE in 55 min: https://integration.wikimedia.org/ci/job/browsertests-MultimediaViewer-en.wikipedia.beta.wmflabs.org-os_x_10.9-safari-sauce/564/ [09:05:36] PROBLEM - Puppet failure on deployment-mediawiki03 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [0.0] [09:05:48] puppet isn't doing it because it fails on zuul [09:06:00] ahhh [09:06:00] did you test the zuul package on a new instance? [09:06:02] good puppet [09:06:28] yeah I have installed the deb package on all instances we have [09:06:46] wanna try the manual install ? [09:06:46] new instances *after* the puppet patch that broke it [09:06:51] I can't without a mount [09:07:06] well on integration-slave-trusty-1010 I get the /home/ mounted properly [09:07:46] Could not chdir to home directory /home/krinkle: Permission denied [09:07:46] -bash: /home/krinkle/.bash_profile: Permission denied [09:08:11] ls -ld krinkle/ hashar/ [09:08:11] drwx------ 21 hashar svn 4096 Apr 8 13:24 hashar// [09:08:11] drwxr--r-- 20 krinkle wikidev 4096 Apr 8 19:33 krinkle// [09:08:20] bah [09:08:27] fixed [09:08:35] weird [09:08:42] have you done something ? 
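Putting hashar's explanation together, the manual install of the locally built zuul package would look roughly like this; the .deb filename is a placeholder, since the log only gives /home/hashar/zuul_XXXXXXX.deb with separate precise/trusty builds:

```
# Register the locally built package; dpkg cannot fetch dependencies, so it
# exits non-zero and leaves the package unconfigured.
dpkg -i /home/hashar/zuul_VERSION_DISTRO.deb   # placeholder filename

# "fix broken": apt-get reads the dependency state dpkg left behind and
# installs the missing packages from the configured repositories.
apt-get install -f

# Puppet's Package['zuul'] resource only checks installed state, roughly via:
apt-cache policy zuul

# To see the exact commands puppet runs while converging:
puppet agent -tv --debug
```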
[09:08:56] nope [09:08:57] I just queried our groups using : id krinkle [09:09:25] maybe some cache entry were stalled and the commands ended up querying LDAP for fresh informations [09:09:46] your UID/ GID / Groups etc are in LDAP and iirc there is a local cache on the instance [09:09:51] might have been corrupted somehow [09:10:09] hashar: So this technique, is this good enough to do for our new pool of slaves? [09:10:13] Assuming this is the last error we see [09:10:30] I can go ahead and do this for all new instances and move on. We'll see about doing it via apt.wikimedia next month [09:10:45] I mean, you can get it in there now, but it won't apply until next month. [09:10:48] yup [09:10:57] gotta document it on the manual setup page [09:11:07] but I think I remembered we can have a local apt repo for labs project [09:11:33] Hm.. dpkg -i gives exit code 1 [09:11:35] I assume that's normal? [09:11:46] yeah it fails to install the package because of some missing dep probably [09:11:59] Yeah [09:11:59] wanna hang out to share the screen? [09:12:16] can't do at themoent. multi tasking [09:12:23] :) [09:13:46] hmm [09:13:57] toollabs has local apt repos [09:14:32] made possible in puppet using 'labsdebrepo' [09:14:39] will look at setting up [09:14:54] that might prove to be useful eventually [09:18:49] hashar: https://phabricator.wikimedia.org/P466 looks good? [09:18:51] I added the zuul bt [09:19:55] Krinkle: should do [09:19:55] I might add it to integration-jenkins:/bin depending on how long we need to use it [09:20:04] e.g. bin/patch-slave-trusty.sh [09:20:06] hopefully no more than a week [09:20:15] I mean the entire script [09:20:21] we might well have that shell script added to puppet [09:20:25] :D [09:20:31] We'll need it for nodepool [09:20:37] and probably more hacks [09:20:44] nodepool will land in apt.wikimedia.org [09:20:48] zuul as well [09:20:59] I mean, if nodepool will create new slaves, it will need this [09:21:02] but I am not 100% happy with the package I came up with [09:21:06] ah yeah [09:21:16] hashar: I see the jessie instance is pooled in Jenkins [09:21:19] nodepool executes two scripts, one when creating the instance [09:21:32] and another one when booting it up in the pool and before adding the instance to the pool of slaves [09:21:36] how is it doing? 
[09:21:47] jessie that is to migrate the debian-glue jobs to it [09:22:10] alexandros as crafted some very nice build env to let us build deb package against all the distro we have and having apt.wikimedia.org has a source [09:22:12] but [09:22:18] RECOVERY - Puppet staleness on deployment-bastion is OK: OK: Less than 1.00% above the threshold [3600.0] [09:22:23] I pooled it with the generic puppet class which installs all the mediawiki packages [09:22:30] and a lot of them are not available in jessie or have been renamed [09:22:40] I have filled a task about it, faidon already looked at it and commented [09:22:53] we need to adjust the mediawiki:: puppet definitions to vary some package names [09:23:02] and also figure out whether some packages are actually still needed [09:23:14] an example is libmemcached10 [09:23:26] which we have on ubuntu but is no more on debian cause it provides a later version [09:23:30] maybe libmemcached42 [09:23:40] so have to figure out whether the cluster can run with that newever version [09:23:59] hashar: See https://tools.wmflabs.org/nagf/?project=integration#h_integration-slave-trusty-1010_cpu [09:24:02] The memory graph [09:24:14] For some weird reason, the initial boot has either broken or very high memory usage [09:24:18] and then after reboot it's normal [09:24:34] The first 2 hours were normal [09:24:38] the dark green 'cached' memory is linux cache [09:24:46] whenever you read files on the system, that ends up in that cache [09:24:57] I think it's just broken because it's a flat line [09:25:01] and when the file is written / deleted, the kernel updates discard the cache entry for you automatically [09:25:21] First boot is fine, then second reboot it's broken, and then third reboot (after applying slave and no more errors) it is fine again [09:25:22] so cached is not necessarly a big issue, specially when an instance is being provisionned since there are loooot of writes / reads being done [09:25:26] happens every single instance [09:25:30] RECOVERY - Puppet failure on deployment-mediawiki03 is OK: OK: Less than 1.00% above the threshold [0.0] [09:25:31] been that way for over a yeah [09:25:34] year* [09:25:36] the inactive I guess some process went wild [09:25:48] we have a daemon running 'atop' [09:25:58] which takes sample of cpu / mem / io usage every 10 minutes or so [09:25:59] The actual usage is not that high [09:26:02] it's wrongly reported [09:26:13] that let you browse the history of what is running on the machine [09:26:17] No way an instance has continuous 12 hours exactly that amount of mem usage [09:26:19] weird doc at https://wikitech.wikimedia.org/wiki/Atop [09:26:21] it's flat [09:26:35] ohh [09:26:54] maybe the daemon sending metrics to graphite was broken/Stalled ? [09:27:07] and the flat graph would be caused by lack of new metrics points [09:27:16] Yeah, the first boot it work fine, second boot it goes high and flat, third boot it's fine again [09:27:48] https://phabricator.wikimedia.org/T91351 [09:29:21] 10Continuous-Integration, 6Labs, 10Wikimedia-Labs-Infrastructure: Diamond collected metrics about memory usage inaccurate until third reboot - https://phabricator.wikimedia.org/T91351#1193521 (10Krinkle) A better example from the new integration-slave-trusty-1010: {F110397} The first boot is fine. Then aft... 
[09:29:53] 10Continuous-Integration, 6Labs, 10Wikimedia-Labs-Infrastructure: Diamond collected metrics about memory usage inaccurate until third reboot - https://phabricator.wikimedia.org/T91351#1193524 (10Krinkle) [09:29:55] 10Continuous-Integration: Re-create ci slaves (April 2015) - https://phabricator.wikimedia.org/T94916#1193523 (10Krinkle) [09:30:11] hashar: btw, blockers for re-create tasks I use as a way to track issues we discovered or are bothered by. Not real blockers per se. [09:31:14] RECOVERY - Puppet failure on integration-slave-trusty-1010 is OK: OK: Less than 1.00% above the threshold [0.0] [09:31:26] I am looking at the metric on graphite.wmflabs.org [09:31:55] PROBLEM - Puppet failure on deployment-bastion is CRITICAL: CRITICAL: 12.50% of data above the critical threshold [0.0] [09:35:08] !log Pooled integration-slave-trusty-1010 [09:35:10] Logged the message, Master [09:38:51] 10Deployment-Systems, 6Release-Engineering, 6Services, 6operations: Streamline our service development and deployment process - https://phabricator.wikimedia.org/T93428#1193558 (10mobrovac) [09:41:40] 10Continuous-Integration, 6Labs, 10Wikimedia-Labs-Infrastructure: Diamond collected metrics about memory usage inaccurate until third reboot - https://phabricator.wikimedia.org/T91351#1193566 (10hashar) Looking at [[ https://wikitech.wikimedia.org/wiki/Atop | atop ]] history, there is nothing suspicious. I... [09:43:13] hashar: btw, there is another fatal issue started 10 days ago that comes up very often. [09:43:16] https://wikitech.wikimedia.org/wiki/Release_Engineering/Argh [09:43:22] "Jenkins unable to reach Gearman" [09:43:26] I've aggregated it from SAL [09:43:55] Jenkins goes into a state where we can't relaunch gearman, it gives 503 error from /ci/configure [09:44:58] 10Continuous-Integration: Setup a local apt repository for 'integration' labs project - https://phabricator.wikimedia.org/T95534#1193567 (10hashar) 3NEW [09:45:22] Curious if you're able to find out more about it. I gave it my best, but came up empty. Might be more your area :) [09:45:29] oh man [09:45:36] and I thought it was becoming more stable [09:45:42] Yeah :( [09:46:00] the Zuul deadlock is usually caused by a patch being force merged [09:46:17] the last 2 days with SWAT have been very frustrating [09:46:24] that one https://phabricator.wikimedia.org/T93812 [09:46:28] took 2 hours longer Tuesdau and Wednesday [09:46:32] doh [09:46:33] because of our sucky queue [09:46:41] We really need to do something about it [09:46:49] split the queue again ? :D [09:46:57] Because I don't know Zuul very well, the only thing I know as a solution is to disable dependent pipeline for the time being [09:47:26] This can't keep going on like this [09:47:48] I really wish I have noticed the work on consolidating all the jobs :/ [09:48:12] I would most probably have thought about the issue of having all repos sharing the same queue in gate-and-submit [09:48:19] is there a config flag to disable the queue or to make the queue manually (e.g. 
declare "mwext" -> mwcore, without the automaatic thing based on job overlap) [09:48:34] nop [09:48:39] or at least known I know of [09:49:00] the queues are generated when Zuul loads its configuration [09:49:16] the zuul diff job probably had huges console log when the changes been made [09:49:32] well, it already had 1300 extensions in the same queeu [09:49:48] the diff is likek 2 mega bytes whenever we change mwext, so it didn't seem important [09:49:49] yup [09:50:01] all extensions rely on mw/core [09:50:27] yeah, that's fine, but the problem is that unrelated projects also get caught. And we have master <> wmf/* also depending [09:50:36] though two extensions changes should probably be ind ifferent queues if there is no mw/core change ahead [09:51:01] yeah there is no knowledge about branches :( [09:51:12] what upstream assume, is that you have no idea what branches a job is going to use [09:51:24] I feel like the dependant pipeline is nice in theory, but not ready yet. A beta feature we should not run in prod. [09:51:27] you could well have a job ending up always using master [09:51:38] well it is fine [09:51:50] upstream says they're removing it in zuul v3 in favour of explicit queue. [09:51:52] until you mess up the convention of having each repos having jobs named differently [09:52:02] then that trick zuul in thinking all those repos are tightly coupled together [09:52:13] but yeah explicit queue would be better [09:52:26] yeah, but we had to consolidate because of disk space and workspace scaling [09:52:45] even now that problem is not solved. [09:52:52] our labs slaves are much smaller than the prod slaves [09:53:21] I posted on some task a way to skip the whole clone entirely [09:53:28] using git clone --shared [09:53:34] but we talked about it early this week [09:53:38] but I mean, it being oblivious to branches seems like an obvious issue. I would never implement dependent pipeline without branches. Is it worth the trouble right now? [09:53:45] not sure why I thought about --shared instead of hardlinks though [09:55:41] !log restarted Zuul to clear out some stalled jobs [09:55:44] Logged the message, Master [09:55:57] Yippee, build fixed! [09:55:57] Project browsertests-MultimediaViewer-en.wikipedia.beta.wmflabs.org-os_x_10.9-safari-sauce build #565: FIXED in 45 min: https://integration.wikimedia.org/ci/job/browsertests-MultimediaViewer-en.wikipedia.beta.wmflabs.org-os_x_10.9-safari-sauce/565/ [09:55:59] yeah, and the force merge is also a regression [09:56:08] how come that wasn't a problem before. [09:56:36] would it be worth it for us to spend time patching that instead of waiting for upstream? [09:56:43] that always has been afaik [09:56:45] Might be more important than other projects we're doing at the moment. [09:57:07] at least the code causing the deadlock has been in zuul for aqges [09:57:20] so yeah [09:57:26] +2 on patching ourselve [09:57:31] yeah, but it almost never caused a problem. Now it's causing problems everyday requiring manual intervention to fix. [09:57:33] and we have a bunch of patches pending for zuul-cloner [09:57:37] such as clean / submodule update [09:58:01] The main thing that bothers me is manual intervention. We can't operate CI in a scenario where it is normal to require manual intervention just to keep it running. 
[09:58:31] unproductive [09:58:36] so the deadlock above need to be fixed [09:58:44] and also not scaling, because we are not online 24/7 [09:58:47] Yeah [09:59:09] brb [10:00:24] me too ,moving desks [10:01:41] 818 mediawiki-extensions-hhvm@4 [10:01:41] 887 mediawiki-extensions-hhvm@3 [10:01:41] 946 mediawiki-extensions-hhvm@2 [10:01:41] 955 mediawiki-extensions-hhvm [10:01:41] 992 mediawiki-core-doxygen-publish [10:01:41] 1467 mediawiki-core-npm@2 [10:01:43] 1550 mediawiki-core-npm [10:01:45] 1807 browsertests-VisualEditor-language-screenshot-os_x_10.10-firefox [10:01:50] in MB [10:01:54] that is a lot :D [10:17:20] (03CR) 10Hashar: [C: 04-2] "I want to keep the DependentPipeline for a wild range of reasons I mentioned on T94322." [integration/config] - 10https://gerrit.wikimedia.org/r/202958 (https://phabricator.wikimedia.org/T94322) (owner: 10Legoktm) [10:26:25] (03Abandoned) 10Hashar: (WIP) Experiment zuul-cloner with extensions [integration/jenkins-job-builder-config] - 10https://gerrit.wikimedia.org/r/141846 (owner: 10Hashar) [10:31:54] !log deployment-bastion has a lock file remaining /mnt/srv/mediawiki-staging/php-master/extensions/.git/refs/remotes/origin/master.lock [10:31:57] Logged the message, Master [10:42:06] hashar: Christoph can not create Jenkins jobs [10:42:32] Access Denied WMDE-Fisch is missing the Job/Create permission [10:42:35] https://integration.wikimedia.org/ci/newJob says [10:43:56] hashar: found him at https://integration.wikimedia.org/ci/user/wmde-fisch/ [10:43:57] Yippee, build fixed! [10:43:57] Project beta-code-update-eqiad build #51059: FIXED in 56 sec: https://integration.wikimedia.org/ci/job/beta-code-update-eqiad/51059/ [10:44:32] zeljkof: please have him fill a task / bug [10:44:41] hashar: sure, sending him mail right now [10:44:53] and post me the task #id will reply on it : [10:44:54] ) [10:46:09] !log repacked extensions in deployment-bastion staging area: find /mnt/srv/mediawiki-staging/php-master/extensions -maxdepth 2 -type f -name .git -exec bash -c 'cd `dirname {}` && pwd && git repack -Ad && git gc' \; [10:46:11] Logged the message, Master [10:48:34] Project beta-scap-eqiad build #48303: FAILURE in 4 min 36 sec: https://integration.wikimedia.org/ci/job/beta-scap-eqiad/48303/ [10:57:54] 10Continuous-Integration: Create CI slaves using Debian Jessie (tracking) - https://phabricator.wikimedia.org/T94836#1193694 (10hashar) [10:57:55] 10Continuous-Integration, 5Patch-For-Review: Jessie has no install candidate for openjdk-6-jdk - https://phabricator.wikimedia.org/T94999#1193692 (10hashar) 5Open>3Resolved The contint puppet manifest no more attempts to install openjdk-6 on Jessie hosts. Version 7 works just fine. [11:07:21] PROBLEM - Content Translation Server on deployment-cxserver03 is CRITICAL: Connection refused [11:17:19] RECOVERY - Content Translation Server on deployment-cxserver03 is OK: HTTP OK: HTTP/1.1 200 OK - 1103 bytes in 0.018 second response time [11:32:52] 10Continuous-Integration, 5Patch-For-Review: Re-evaluate use of "Dependent Pipeline" in Zuul for gate-and-submit in the short term - https://phabricator.wikimedia.org/T94322#1193732 (10Krinkle) > Commit A that removes the deprecated function wfExample() is now breaking all extensions that still rely on it.... [11:36:17] 10Browser-Tests, 6Release-Engineering: Do not say "< wmf-insecte> Yippee, build fixed!" - https://phabricator.wikimedia.org/T95395#1188546 (10zeljkofilipin) @hashar might know. 
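The two deployment-bastion !log entries above (stale lock file at 10:31, repack at 10:46), consolidated into one sketch; removing the .lock file is only safe once no git process is still touching the checkout:

```
# Only safe once nothing is still running git against the staging checkout:
rm /mnt/srv/mediawiki-staging/php-master/extensions/.git/refs/remotes/origin/master.lock

# The 10:46 repack/gc command, wrapped for readability:
find /mnt/srv/mediawiki-staging/php-master/extensions -maxdepth 2 -type f -name .git \
    -exec bash -c 'cd `dirname {}` && pwd && git repack -Ad && git gc' \;
```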
[11:44:13] 10Beta-Cluster: deployment-prep (Beta)'s operation/puppet is outdated - https://phabricator.wikimedia.org/T95539#1193738 (10KartikMistry) 3NEW [11:53:20] PROBLEM - Content Translation Server on deployment-cxserver03 is CRITICAL: Connection refused [11:54:02] hashar: T95539 please. [11:55:28] 10Browser-Tests, 6Release-Engineering: Do not say "< wmf-insecte> Yippee, build fixed!" - https://phabricator.wikimedia.org/T95395#1193763 (10hashar) `wmf-insecte` is the Jenkins IRC client provided by [[ https://wiki.jenkins-ci.org/display/JENKINS/Instant+Messaging+Plugin | Instant Messaging Plugin ]]. There... [11:58:20] RECOVERY - Content Translation Server on deployment-cxserver03 is OK: HTTP OK: HTTP/1.1 200 OK - 1103 bytes in 0.023 second response time [12:29:08] 10Browser-Tests, 6Release-Engineering: Do not say "< wmf-insecte> Yippee, build fixed!" - https://phabricator.wikimedia.org/T95395#1193806 (10zeljkofilipin) As far as I know, @manybubbles speaks Java. :) [12:30:59] !log beta: reset hard of operations/puppet repo on the puppetmaster since it has been stalled for 9+days https://phabricator.wikimedia.org/T95539 [12:31:04] Logged the message, Master [12:32:34] 10Beta-Cluster: deployment-prep (Beta)'s operation/puppet is outdated - https://phabricator.wikimedia.org/T95539#1193816 (10hashar) Beta puppetmaster is deployment-salt.eqiad.wmflabs the git repo under /var/lib/git/operations/puppet is magically auto rebased via a cronjob. The working copy is detached and has t... [12:32:55] 10Beta-Cluster: deployment-prep (Beta)'s operation/puppet is outdated - https://phabricator.wikimedia.org/T95539#1193818 (10hashar) 5Open>3Resolved p:5Triage>3Normal a:3hashar [12:33:10] kart_: solved :D [12:33:18] kart_: the local repo had some patch cherry picked on it [12:33:30] kart_: and the magic script did not magic to auto update the repo [12:39:43] !log https://integration.wikimedia.org/ci/job/beta-scap-eqiad/ is still broken :-( [12:39:47] Logged the message, Master [12:40:01] !log spurts out Permission denied (publickey). [12:40:03] Logged the message, Master [12:40:46] hashar: thanks [12:49:59] 10Beta-Cluster, 6Labs, 6operations: GPG error: http://nova.clouds.archive.ubuntu.com precise Release BADSIG 40976EAF437D05B5 - https://phabricator.wikimedia.org/T95541#1193865 (10hashar) 3NEW [12:50:02] 10Browser-Tests: Transfer the main Sauce Labs account to a generic WMF account - https://phabricator.wikimedia.org/T94191#1193872 (10zeljkofilipin) > zfilipin > Wikimedia > > Hi Renata, > > I am waiting for our IT to create a new e-mail address. I will let you know as soon as I hear back from them. > > Željk... [12:50:45] 10Browser-Tests: Transfer the main Sauce Labs account to a generic WMF account - https://phabricator.wikimedia.org/T94191#1193874 (10zeljkofilipin) > Renata Santillan > Sauce Labs > > Hi Zeljko, > > No problem! We're ready to help when you have more information. > > Best, > > Renata > > April 8, 2015, 10:1... [12:50:57] Hey, how can I correctly run PHPUnit on vagrant? It’s a bit complicated because the extension is only being used on one of the wikis on my vagrant instance [12:51:17] 10Browser-Tests: Transfer the main Sauce Labs account to a generic WMF account - https://phabricator.wikimedia.org/T94191#1193878 (10zeljkofilipin) > zfilipin > Wikimedia > > Hi Renata, > > I have created a new account with username wikimedia. 
> > Željko > > April 9, 2015, 2:49 PM [12:51:23] 10Beta-Cluster: deployment-prep (Beta)'s operation/puppet is outdated - https://phabricator.wikimedia.org/T95539#1193879 (10KartikMistry) Thanks @hashar [12:51:25] vagrant@mediawiki-vagrant:/vagrant/mediawiki$ php tests/phpunit/phpunit.php --wiki=livingstyleguidewiki /vagrant/mediawiki/extensions/OOUIPlayground/tests/phpunit/ [12:51:25] Fatal error: Class undefined: OOUIPlayground\WidgetRepository in /vagrant/mediawiki/extensions/OOUIPlayground/tests/phpunit/CodeRendererTest.php on line 21 [12:51:35] (because it’s not using the correct wiki) [12:52:46] getting a funky error: https://integration.wikimedia.org/ci/job/wikidata-query-rdf/100/console [12:58:54] PROBLEM - Puppet failure on deployment-salt is CRITICAL: CRITICAL: 62.50% of data above the critical threshold [0.0] [12:59:42] !log Creating integration-slave-trusty-1011 - integration-slave-trusty-1016 [12:59:44] Logged the message, Master [13:00:11] RECOVERY - Puppet failure on deployment-sca01 is OK: OK: Less than 1.00% above the threshold [0.0] [13:01:27] !log integration-zuul-packaged applied zuul::merger and zuul::server [13:01:31] Logged the message, Master [13:04:45] PROBLEM - Puppet failure on integration-zuul-packaged is CRITICAL: CRITICAL: 25.00% of data above the critical threshold [0.0] [13:14:30] !log integration-zuul-packaged applied role::labs::lvm::srv [13:14:32] Logged the message, Master [13:16:43] (03CR) 10JanZerebecki: [C: 031] Merge mwext-Wikibase-* repo and repo-api jobs [integration/config] - 10https://gerrit.wikimedia.org/r/202932 (owner: 10Legoktm) [13:23:23] 10Browser-Tests: IE Browser tests job have no test being run due to a mistake in cucumber tag - https://phabricator.wikimedia.org/T95398#1193965 (10zeljkofilipin) Looks like there are 3 IE jobs without explicit browser version: # https://integration.wikimedia.org/ci/view/BrowserTests/job/browsertests-Flow-en.wi... [13:24:51] 10Browser-Tests: IE Browser tests job have no test being run due to a mistake in cucumber tag - https://phabricator.wikimedia.org/T95398#1193966 (10zeljkofilipin) The same problem in all 3 jobs: ``` 00:00:15.265 (...) bundle exec cucumber (...) --tags @internet_explorer_ (...) 00:00:18.455 0 scenarios 00:00:18... [13:27:08] PROBLEM - Puppet failure on integration-dev is CRITICAL: CRITICAL: 12.50% of data above the critical threshold [0.0] [13:29:43] RECOVERY - Puppet failure on integration-zuul-packaged is OK: OK: Less than 1.00% above the threshold [0.0] [13:34:05] RECOVERY - Long lived cherry-picks on puppetmaster on deployment-salt is OK: OK: Less than 100.00% above the threshold [0.0] [13:37:17] 10Browser-Tests: IE Browser tests job have no test being run due to a mistake in cucumber tag - https://phabricator.wikimedia.org/T95398#1193984 (10zeljkofilipin) Given a simple Selenium script: ``` lang=ruby require "selenium-webdriver" saucelabs_username = "username" saucelabs_key = "key" name = "internet_... [13:40:22] (03CR) 10Hashar: "Create a mediawiki/tools/phpmd repo ? :)" [tools/codesniffer] - 10https://gerrit.wikimedia.org/r/201956 (owner: 10MarkAHershberger) [13:41:50] 10Browser-Tests: IE Browser tests job have no test being run due to a mistake in cucumber tag - https://phabricator.wikimedia.org/T95398#1194007 (10zeljkofilipin) Since there is no easy way to determine IE version if it is not set, I think the best way would be to insist that the version is always set explicitly... 
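For the stalled operations/puppet checkout mentioned around 12:30 (T95539), a manual recovery on the beta puppetmaster would be roughly the following; the assumption that the checkout tracks origin/production is mine:

```
# On deployment-salt the checkout lives here and is normally rebased by a
# cron job; stuck local cherry-picks can stall it for days.
cd /var/lib/git/operations/puppet
git fetch origin
git log --oneline origin/production..HEAD   # local patches that would be lost
git rebase origin/production                # try to keep them
# or, as was done for T95539, drop the local state entirely:
git reset --hard origin/production
```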
[13:46:00] 10Browser-Tests: IE Browser tests job have no test being run due to a mistake in cucumber tag - https://phabricator.wikimedia.org/T95398#1194033 (10hashar) If you come to require IE to have version explicitly set, you probably want to update jjb/macro-browsertests.yaml and have it exit early whenever the versio... [13:48:23] PROBLEM - Puppet failure on integration-slave-trusty-1012 is CRITICAL: CRITICAL: 14.29% of data above the critical threshold [0.0] [13:48:35] (03PS1) 10Zfilipin: Fix failed Internet Explorer browser test jobs [integration/config] - 10https://gerrit.wikimedia.org/r/203063 (https://phabricator.wikimedia.org/T95398) [13:49:42] 10Continuous-Integration: Create CI slaves using Debian Jessie (tracking) - https://phabricator.wikimedia.org/T94836#1194042 (10hashar) [13:49:43] 10Continuous-Integration, 5Patch-For-Review: Update puppet for packages having different names in Jessie - https://phabricator.wikimedia.org/T95000#1194039 (10hashar) 5Open>3Resolved a:3hashar Solved! That also made firefox to be magically upgraded just like chromium. Labs instance integration-slave-jes... [13:53:29] RECOVERY - Puppet failure on integration-slave-trusty-1012 is OK: OK: Less than 1.00% above the threshold [0.0] [13:55:51] (03PS2) 10Zfilipin: Fix failed Internet Explorer browser test jobs [integration/config] - 10https://gerrit.wikimedia.org/r/203063 (https://phabricator.wikimedia.org/T95398) [13:56:22] (03CR) 10Zfilipin: "Patch set 2 adds created and deleted jobs to the commit message." [integration/config] - 10https://gerrit.wikimedia.org/r/203063 (https://phabricator.wikimedia.org/T95398) (owner: 10Zfilipin) [14:02:05] 10Browser-Tests, 5Patch-For-Review: IE Browser tests job have no test being run due to a mistake in cucumber tag - https://phabricator.wikimedia.org/T95398#1194108 (10zeljkofilipin) The new jobs are running fine: - https://integration.wikimedia.org/ci/view/BrowserTests/view/Echo+Flow/job/browsertests-Flow-en.... 
[14:05:09] (03PS3) 10Zfilipin: Fix failed Internet Explorer browser test jobs [integration/config] - 10https://gerrit.wikimedia.org/r/203063 (https://phabricator.wikimedia.org/T95398) [14:05:48] (03CR) 10Zfilipin: "Patch set 3 implements the suggestion from https://phabricator.wikimedia.org/T95398#1194033" [integration/config] - 10https://gerrit.wikimedia.org/r/203063 (https://phabricator.wikimedia.org/T95398) (owner: 10Zfilipin) [14:09:17] Project browsertests-UploadWizard-commons.wikimedia.beta.wmflabs.org-linux-firefox-sauce build #583: FAILURE in 38 min: https://integration.wikimedia.org/ci/job/browsertests-UploadWizard-commons.wikimedia.beta.wmflabs.org-linux-firefox-sauce/583/ [14:20:36] 10Continuous-Integration: Migrate all debian-glue jobs to Jessie slaves - https://phabricator.wikimedia.org/T95545#1194160 (10hashar) 3NEW [14:21:10] 10Continuous-Integration, 6operations: Build Debian package jenkins-debian-glue for Jessie - https://phabricator.wikimedia.org/T95006#1194170 (10hashar) [14:21:34] 10Continuous-Integration: Migrate all debian-glue jobs to Jessie slaves - https://phabricator.wikimedia.org/T95545#1194160 (10hashar) [14:21:35] 10Continuous-Integration: Create CI slaves using Debian Jessie (tracking) - https://phabricator.wikimedia.org/T94836#1194172 (10hashar) [14:24:43] !log deleting integration-slave-jessie-1001 extended disk is too smal [14:24:46] !log deleting integration-slave-jessie-1001 extended disk is too small [14:24:46] Logged the message, Master [14:24:48] Logged the message, Master [14:25:12] Project browsertests-Flow-en.wikipedia.beta.wmflabs.org-windows_8-internet_explorer-10-sauce build #1: FAILURE in 25 min: https://integration.wikimedia.org/ci/job/browsertests-Flow-en.wikipedia.beta.wmflabs.org-windows_8-internet_explorer-10-sauce/1/ [14:27:10] PROBLEM - Host integration-slave-jessie-1001 is DOWN: CRITICAL - Host Unreachable (10.68.16.72) [14:31:55] 10Continuous-Integration: Replace project-specific "{name}-thing" jobs with generic "thing" ones - https://phabricator.wikimedia.org/T91997#1194202 (10Krinkle) [14:33:28] 10Continuous-Integration: Replace project-specific "{name}-thing" jobs with generic "thing" ones - https://phabricator.wikimedia.org/T91997#1101137 (10Krinkle) [14:34:13] Project browsertests-MultimediaViewer-en.wikipedia.beta.wmflabs.org-windows_8-internet_explorer-10-sauce build #1: SUCCESS in 34 min: https://integration.wikimedia.org/ci/job/browsertests-MultimediaViewer-en.wikipedia.beta.wmflabs.org-windows_8-internet_explorer-10-sauce/1/ [14:35:11] Yippee, build fixed! 
[14:35:11] Project browsertests-Echo-en.wikipedia.beta.wmflabs.org-linux-firefox-sauce build #467: FIXED in 9 min 3 sec: https://integration.wikimedia.org/ci/job/browsertests-Echo-en.wikipedia.beta.wmflabs.org-linux-firefox-sauce/467/ [14:37:45] (03CR) 10BryanDavis: "Seems to be done now via some other patch: " [integration/config] - 10https://gerrit.wikimedia.org/r/174417 (https://bugzilla.wikimedia.org/73530) (owner: 10Hashar) [14:37:56] PROBLEM - Puppet failure on integration-slave-trusty-1016 is CRITICAL: CRITICAL: 57.14% of data above the critical threshold [0.0] [14:47:18] RECOVERY - Host integration-slave-jessie-1001 is UP: PING OK - Packet loss = 0%, RTA = 0.63 ms [14:49:02] 10Continuous-Integration: job creation permission on jenkins for WMDE-Fisch - https://phabricator.wikimedia.org/T95546#1194315 (10hashar) [14:49:13] 10Continuous-Integration: job creation permission on jenkins for WMDE-Fisch - https://phabricator.wikimedia.org/T95546#1194317 (10hashar) p:5Triage>3Normal [14:51:46] PROBLEM - Puppet failure on integration-slave-trusty-1011 is CRITICAL: CRITICAL: 62.50% of data above the critical threshold [0.0] [14:52:58] Project browsertests-VisualEditor-en.wikipedia.beta.wmflabs.org-windows_8-internet_explorer-10-sauce build #1: FAILURE in 51 min: https://integration.wikimedia.org/ci/job/browsertests-VisualEditor-en.wikipedia.beta.wmflabs.org-windows_8-internet_explorer-10-sauce/1/ [14:55:22] PROBLEM - Puppet failure on integration-slave-trusty-1012 is CRITICAL: CRITICAL: 85.71% of data above the critical threshold [0.0] [15:03:13] PROBLEM - Host integration-slave-jessie-1001 is DOWN: CRITICAL - Host Unreachable (10.68.16.72) [15:05:33] 10Browser-Tests, 6Release-Engineering: Do not say "< wmf-insecte> Yippee, build fixed!" - https://phabricator.wikimedia.org/T95395#1194390 (10greg) p:5Low>3Lowest [15:08:39] PROBLEM - Host integration-slave-trusty-1010 is DOWN: CRITICAL - Host Unreachable (10.68.17.210) [15:12:16] 10Beta-Cluster: deployment-prep (Beta)'s operation/puppet is outdated - https://phabricator.wikimedia.org/T95539#1194453 (10greg) >>! In T95539#1193816, @hashar wrote: > Beta puppetmaster is deployment-salt.eqiad.wmflabs the git repo under /var/lib/git/operations/puppet is magically auto rebased via a cronjob. >... [15:13:21] werdna: file a bug, you asked at a mostly non-active time for the team :) [15:13:38] manybubbles: the endpoint error? [15:14:01] greg-g: yeah - it doesn't seem to be causing trouble but its an error anyway [15:14:26] step 1) file a bug :) [15:14:32] I can :P [15:15:06] step 0) search for string in phab and notice it's already reported: https://phabricator.wikimedia.org/T93321 [15:15:09] manybubbles: ^ [15:15:41] greg-g: sorry, I was being lazy. [15:15:45] thanks for the search [15:15:49] :) [15:20:06] greg-g: good morning :) [15:20:41] the beta cluster job that runs scap is borked with ssh auth failure and I cant figure it out :( [15:22:15] !sal [15:22:16] https://wikitech.wikimedia.org/wiki/Release_Engineering/SAL [15:23:30] hashar: is the ssh key in keyholder? [15:24:26] hashar: is there a bug for... 
oh there he is [15:24:37] g'morning thcipriani :) [15:24:47] * greg-g passes torch to you [15:24:55] * greg-g goes into meetings for next 1.5 hours [15:25:16] btw: https://integration.wikimedia.org/ci/job/beta-scap-eqiad/48303/console [15:26:06] yup, so looking at: SSH_AUTH_SOCK=/run/keyholder/agent.sock ssh-add -l [15:26:13] there are no identities [15:26:17] 10Beta-Cluster: beta-scap-eqiad no more run due to ssh Permission denied - https://phabricator.wikimedia.org/T95562#1194513 (10hashar) 3NEW [15:26:17] so we just have to add one [15:26:30] thcipriani: I filled a task above :) [15:26:39] kk [15:26:48] in short we had 9 days of puppet patches pending because the repo was stalled on the puppetmaster [15:26:51] might be the reason [15:29:07] 10Browser-Tests, 6Release-Engineering: Browser tests running against beta all failing because of mw-api-siteinfo.py - https://phabricator.wikimedia.org/T95163#1194528 (10greg) Sorry about the radio silence here. >>! In T95163#1182020, @Gilles wrote: > If someone who's an admin on labs for the "Integration"... [15:30:37] 10Beta-Cluster: beta-scap-eqiad no more run due to ssh Permission denied - https://phabricator.wikimedia.org/T95562#1194539 (10greg) p:5Triage>3Unbreak! [15:32:46] !log added mwdeploy_rsa to keyholder agent.sock via chmod 400 /etc/keyholder.d/mwdeploy_rsa && SSH_AUTH_SOCK=/run/keyholder/agent.sock ssh-add /etc/keyholder.d/mwdeploy_rsa && chmod 440 /etc/keyholder.d/mwdeploy_rsa; permissions in puppet may be wrong? [15:32:48] Logged the message, Master [15:33:21] hashar: the next build of that job _should_ work, but we'll see. [15:37:12] RECOVERY - Host integration-slave-jessie-1001 is UP: PING OK - Packet loss = 0%, RTA = 1.27 ms [15:38:11] 10Beta-Cluster: /var/lib/l10nupdate fills up deployment-bastion /var partition - https://phabricator.wikimedia.org/T95564#1194574 (10hashar) 3NEW [15:38:14] thcipriani: you are a magician :) [15:38:18] PROBLEM - Puppet failure on deployment-videoscaler01 is CRITICAL: CRITICAL: 71.43% of data above the critical threshold [0.0] [15:38:22] also found out l10nupdate is most probably broken [15:38:25] it writes to the wrong place [15:39:08] I am trying to find the configuration [15:39:43] * hashar whistles [15:40:11] Yippee, build fixed! [15:40:11] Project beta-scap-eqiad build #48334: FIXED in 6 min 16 sec: https://integration.wikimedia.org/ci/job/beta-scap-eqiad/48334/ [15:42:06] yay! [15:43:09] yippee! [15:45:12] hashar: looks like l10n should be output to /srv/mediawiki-staging/php-[version]/cache/l10n is that not right? l10nupdate is somewhat opaque to me yet :\ [15:46:03] !sal [15:46:03] https://wikitech.wikimedia.org/wiki/Release_Engineering/SAL [15:46:11] thcipriani: it is opaque to me as well :) [15:48:02] 10Beta-Cluster: /var/lib/l10nupdate fills up deployment-bastion /var partition - https://phabricator.wikimedia.org/T95564#1194623 (10hashar) l10nupdate code is in puppet `modules/scap/files/l10nupdate-1` and it has `GITDIR=/var/lib/l10nupdate/mediawiki` /var/lib/l10nupdate/ has been created on 2015-03-25 02:00... [15:51:13] thcipriani: congrats on fixing scap! [15:53:20] hashar: thanks, that key will need to be primed on reboot. Probably not super desirable :\ [15:58:56] thcipriani: what does "primed" mean? [15:59:00] non native english here :D [15:59:21] hashar: primed means prepared / ready to go [15:59:22] usually [16:00:05] !log integration-slave-jessie-1001 recreated. 
Applying it role::ci::slave::labs which should also bring in the package builder role under /mnt/pbuilder [16:00:08] Logged the message, Master [16:00:33] hashar: "primed" is not really a good term it's just the one that I've heard bd808 use :) it just means running the command I put into SAL, you'll also need the mwdeploy_rsa pass which is in labs/private [16:00:38] :) [16:00:59] ah you know about labs/private already [16:01:00] all good so [16:01:07] was going to suggest putting the keys there [16:01:30] more puppet madness. I am giving up for today [16:01:35] thanks again for the scap fix thcipriani ! [16:01:45] will be back tomorrow [16:01:51] hashar: yw, have a good evening! [16:01:59] hashar: which l10nupdate is broken? prod or some testing thing in beta cluster? [16:02:48] bd808: I think this is the ticket: https://phabricator.wikimedia.org/T95564 [16:04:14] oh. I thought that we made /var/lib/l10nupdate a symlink to /srv/l10nupdate [16:04:27] I wonder if puppet is undoing that for us [16:08:33] 10Continuous-Integration, 6Labs: Purge graphite data for deleted integration instances. - https://phabricator.wikimedia.org/T95569#1194719 (10Krinkle) 3NEW [16:18:52] RECOVERY - Free space - all mounts on deployment-bastion is OK: OK: All targets OK [16:22:24] werdna: use mwscript --wiki=blah phpunit... [16:23:23] 10Beta-Cluster, 6Labs, 6operations: GPG error: http://nova.clouds.archive.ubuntu.com precise Release BADSIG 40976EAF437D05B5 - https://phabricator.wikimedia.org/T95541#1194794 (10Dzahn) http://ubuntuforums.org/showthread.php?t=802156 tldr: bad proxies sudo aptitude -o Acquire::http::No-Cache=True -o Acqui... [16:25:38] 10Continuous-Integration, 6Labs: Purge graphite data for deleted integration instances. - https://phabricator.wikimedia.org/T95569#1194807 (10yuvipanda) Should I just delete all the data under the integration project, and let it start again from scratch? [16:34:51] legoktm: I tried mwscript phpunit —wiki and that didn’t work [16:36:36] RECOVERY - Puppet staleness on deployment-restbase02 is OK: OK: Less than 1.00% above the threshold [3600.0] [16:36:40] RECOVERY - Puppet failure on integration-slave-trusty-1011 is OK: OK: Less than 1.00% above the threshold [0.0] [16:36:42] RECOVERY - Puppet failure on integration-slave1002 is OK: OK: Less than 1.00% above the threshold [0.0] [16:37:56] RECOVERY - Puppet failure on integration-slave-trusty-1016 is OK: OK: Less than 1.00% above the threshold [0.0] [16:38:00] RECOVERY - HHVM Queue Size on deployment-mediawiki01 is OK: OK: Less than 30.00% above the threshold [10.0] [16:38:13] PROBLEM - Citoid on deployment-sca01 is CRITICAL: Connection refused [16:39:52] PROBLEM - Free space - all mounts on deployment-bastion is CRITICAL: CRITICAL: deployment-prep.deployment-bastion.diskspace._var.byte_percentfree (<100.00%) [16:47:35] 10Browser-Tests, 6Mobile-Web, 10MobileFrontend: add metadata to ChunkyPNG image - https://phabricator.wikimedia.org/T67274#1194885 (10greg) @jdlrobson: can you give some background here and/or let me know if this is still needed? 
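The keyholder re-arm steps from 15:26-15:32, consolidated; the passphrase ssh-add prompts for is the mwdeploy_rsa pass kept in labs/private, as thcipriani notes:

```
# Is the keyholder agent armed?
SSH_AUTH_SOCK=/run/keyholder/agent.sock ssh-add -l   # "no identities" = not armed

# Re-add the deploy key (ssh-add refuses group-readable keys, hence the
# chmod dance); restore the puppet-managed mode afterwards.
chmod 400 /etc/keyholder.d/mwdeploy_rsa
SSH_AUTH_SOCK=/run/keyholder/agent.sock ssh-add /etc/keyholder.d/mwdeploy_rsa
chmod 440 /etc/keyholder.d/mwdeploy_rsa
```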
[16:52:10] !log Pool integration-slave-trusty-1011...integration-slave-trusty-1016 [16:52:13] Logged the message, Master [17:10:58] 10Beta-Cluster: /var/lib/l10nupdate fills up deployment-bastion /var partition - https://phabricator.wikimedia.org/T95564#1194974 (10greg) [17:11:51] !log Depool integration-slave1402...integration-slave1405 [17:11:54] Logged the message, Master [17:17:43] (03PS1) 10Krinkle: Remove bash -x from mw-install-* and mw-run-update [integration/jenkins] - 10https://gerrit.wikimedia.org/r/203111 [17:17:52] (03CR) 10Krinkle: [C: 032] Remove bash -x from mw-install-* and mw-run-update [integration/jenkins] - 10https://gerrit.wikimedia.org/r/203111 (owner: 10Krinkle) [17:18:29] 16:39 < shinken-w> PROBLEM - Free space - all mounts on deployment-bastion is CRITICAL: CRITICAL: deployment-prep.deployment-bastion.diskspace._var.byte_percentfree (<100.00%) [17:18:39] (03Merged) 10jenkins-bot: Remove bash -x from mw-install-* and mw-run-update [integration/jenkins] - 10https://gerrit.wikimedia.org/r/203111 (owner: 10Krinkle) [17:18:40] less than 100% free? [17:18:52] grrrit-wm: :D [17:18:54] greg-g: :D [17:19:13] it is a terrible message [17:19:18] I’ve been meaning to fix that as well [17:19:26] BUT TOO MANY THINGS *EXPLODES* [17:19:31] YuviPanda: file a task [17:19:34] well techincally ...it's true :) [17:19:47] Krinkle: should I just remove all metrics under integration.? [17:19:53] greg-g: there’s already one I think. let me find [17:19:59] chasemp: yeah, I'd kinda hope so, I'm just curious if it's actually a problem [17:20:04] !log Creating integration-slave-precise-1011 [17:20:06] Logged the message, Master [17:20:13] * greg-g ssh's [17:20:23] greg-g: I was poking fun at yuvi :) [17:20:29] YuviPanda: Preferably not.. [17:20:46] YuviPanda: Though if it's easier, let's do that next monday after I recreated the instances. [17:20:53] Krinkle: that’s definitely easier :) [17:20:59] I'll be deleting a few more isntances and then it'll be stable for the next month [17:21:15] /dev/vda2 1.9G 1.8G 63M 97% /var [17:21:27] 63M free :/ [17:21:50] Krinkle: cool [17:22:02] Krinkle: can you note that on the bug and set a time so I’ll make sure I’m around? [17:22:31] greg-g: that instance needs recreating but too many things on it and we don’t know what’ll break... [17:22:39] YuviPanda: Actually, while we'll delete a few more instances, not metrics. So it'd be cool to delete integration.* now. [17:22:40] * YuviPanda sshs [17:22:49] Krinkle: ok, moment [17:23:00] Then we'll delete the extra instances later, but at least indidivual metrics will be usable again [17:23:08] It would be helpful to have the during these two days :) [17:23:09] Thanks :) [17:24:04] 10Beta-Cluster: /var/lib/l10nupdate fills up deployment-bastion /var partition - https://phabricator.wikimedia.org/T95564#1195043 (10greg) Current df -h ``` gjg@deployment-bastion:~$ df -h Filesystem Size Used Avail Use% Mounted on /dev/vda1... [17:25:04] 10Beta-Cluster: /var/lib/l10nupdate fills up deployment-bastion /var partition - https://phabricator.wikimedia.org/T95564#1195048 (10greg) p:5Triage>3High [17:27:01] 10Beta-Cluster: /var/lib/l10nupdate fills up deployment-bastion /var partition - https://phabricator.wikimedia.org/T95564#1195060 (10yuvipanda) Alright, so 'real' solution is to recreate that instance. 
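Spelling out legoktm's 16:22 hint for werdna's earlier PHPUnit-on-vagrant question; the argument order for mwscript is an assumption and may need adjusting:

```
# Run the extension's tests against the wiki that actually has it loaded;
# mwscript wraps the maintenance script so --wiki selects the right settings.
cd /vagrant/mediawiki
mwscript tests/phpunit/phpunit.php --wiki=livingstyleguidewiki \
    /vagrant/mediawiki/extensions/OOUIPlayground/tests/phpunit/
```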
Since atm that's a bit of a yak shave, I'm just going to symlink things around [17:29:37] 10Beta-Cluster: /var/lib/l10nupdate fills up deployment-bastion /var partition - https://phabricator.wikimedia.org/T95564#1195063 (10thcipriani) Salient comments from this morning: oh. I thought that we made /var/lib/l10nupdate a symlink to /srv/l10nupdate I wonder if puppet is undoing that for... [17:32:45] 10Beta-Cluster: /var/lib/l10nupdate fills up deployment-bastion /var partition - https://phabricator.wikimedia.org/T95564#1195097 (10yuvipanda) I just created the symlink again. Running puppet again. [17:33:46] thcipriani: greg-g uhm, puppet seems hosed on all of deployment-prep [17:34:34] like, totally [17:34:38] certificate failure [17:35:12] YuviPanda: are you running this on deployment-bastion? [17:35:24] thcipriani: it failed there and also failed on salt [17:35:41] filing taks now [17:35:59] 10Beta-Cluster, 6Labs: Puppet failing with certificate errors on deployment-prep - https://phabricator.wikimedia.org/T95586#1195122 (10yuvipanda) 3NEW [17:36:07] 10Beta-Cluster: /var/lib/l10nupdate fills up deployment-bastion /var partition - https://phabricator.wikimedia.org/T95564#1195130 (10yuvipanda) Ok, puppet seems hosed on all of deployment-prep. Filed T95586 [17:37:36] 10Browser-Tests, 10Continuous-Integration, 7Tracking: Fix or delete browsertests* Jenkins jobs that are failing for more than a week (tracking) - https://phabricator.wikimedia.org/T94150#1195157 (10EBernhardson) [17:37:37] 7Blocked-on-RelEng, 10Browser-Tests, 10Continuous-Integration, 6Collaboration-Team, and 2 others: Pass MEDIAWIKI_CAPTCHA_BYPASS_PASSWORD in on Jenkins so GettingStarted browser tests pass - https://phabricator.wikimedia.org/T91220#1195155 (10EBernhardson) 5Open>3Resolved [17:39:52] RECOVERY - Free space - all mounts on deployment-bastion is OK: OK: All targets OK [17:40:07] YuviPanda: huh. the cert deployment-salt was trying to use was for the agent was i-0000015c.deployment-prep.eqiad.wmflabs [17:41:14] fallout from the enc 'true' string, I'd guess. But it's weird, removing the environment causes it to generate a new cert :\ [17:41:32] 10Beta-Cluster, 6Labs, 7Puppet: Puppet failing with certificate errors on deployment-prep - https://phabricator.wikimedia.org/T95586#1195176 (10greg) p:5Triage>3Unbreak! [17:42:46] thcipriani: did beta ever have ENC set? [17:44:23] I think so, they ran into the 'true' problem at the same time. wait, deployment-salt is the puppetmaster? No wonder everything is weird. [17:44:43] yes [17:44:46] deployment-salt is the puppetmaster [17:44:54] thcipriani: last successful puppet run was 281 mins ago [17:45:02] look at the /etc/puppet/puppet.conf that seems wacky [17:45:30] at least vs what I've been seeing in staging [17:45:35] thcipriani: in which host? [17:45:42] on deployment-salt [17:46:11] really? which part seems whacky? [17:46:17] I haven’t looked at things [17:46:29] err [17:46:34] things I meant /etc/puppet/puppet.conf [17:47:49] YuviPanda: well, I think the ssldir is incorrect, also I think that the master section should have more, stuff, at least it does in modules/puppetmaster/templates/20-master.conf.erb [17:47:59] oh, hmm [17:48:00] fair enough [17:48:04] * YuviPanda isnt’ really sure what’s happening [17:48:11] * YuviPanda runs facter [17:50:18] YuviPanda: I bet that briefly the dc changed to i-0000015c.deployment-prep.eqiad.wmflabs which overwrote the puppet.conf, so I bet if we just overwrite the puppet.conf with good values and rerun puppet it'll self correct. 
[17:51:09] since role::puppet::self checks the ::fqdn against the puppetmaster value in ldap and they didn't match [17:51:35] since the /etc/resolv.conf was updated to deployment-prep.eqiad.wmflabs [17:52:19] uh oh [17:53:50] what's uh oh? [17:54:30] uh oh as in ‘I have no idea how to do that’ :) [17:54:45] how do we overrwrite puppet.conf with good values? [17:58:55] well, that is an excellent question. [18:05:22] * YuviPanda has no answer, and has to go now [18:07:17] kk, I think I'm going to try overwriting the puppet.conf pieced together from /etc/puppet/modules/puppet/self [18:07:48] I really just think it needs to get to the right cert directory and it'll self-correct from there. [18:12:18] hashar: just in time, I was just looking into this: https://phabricator.wikimedia.org/T95586 which is happening on deployment-salt [18:12:34] oh my god [18:12:42] please no [18:13:06] I _think_ I know why it's happening: the /etc/puppet/puppet.conf is pointing to the wrong ssldir [18:13:15] thcipriani: so the puppet client on that instance establish a ssl connection with the master [18:13:29] when the client is setup for the first time it send its cert to the master [18:13:33] and on the master we have to sign it [18:13:57] the host in the cert is based on the instance ec2id and eqiad.wmflabs [18:13:59] right, and it has been signed, it's in the directory /var/lib/puppet/server/ssl [18:14:14] so on line 3 you see i-0000015c.eqiad.wmflabs [18:14:15] but right now it's pointing at /var/lib/puppet/client/ssl [18:14:49] last friday an experimental DNS server has been introduced [18:15:06] which slightly change the fully qualified domain name (fqdn) for instances [18:15:13] so instead of: .eqiad.wmflabs [18:15:19] you have the project inserted as a subdomain [18:15:21] exactly, and what happened, I think, was the fqdn changed, which removed the puppetmaster role [18:15:28] ie: .deployment-prep.eqiad.wmflabs [18:15:36] and that is in turn used in the puppet conf [18:15:40] gotta look at puppet.conf [18:15:46] so tldr [18:15:59] I spend a good half a day fixing it up on the integration project [18:16:07] I looked at beta and it was not impacted [18:16:20] and it was not impacted because the operations/puppet repo has been stall for the last 9 days [18:16:32] when I have unblocked operations/puppet that caused the faulty change to be deployed [18:16:35] damn [18:16:44] !sal [18:16:45] https://wikitech.wikimedia.org/wiki/Release_Engineering/SAL [18:16:59] I _think_ I can fix this, because the same thing happened in staging [18:17:15] yup [18:17:25] the dnsmasq server should be the default really [18:17:28] I took notes on https://phabricator.wikimedia.org/T95273 [18:17:40] I eventually had a corrupted puppet.conf on the master [18:17:49] so I ended up having to rebuild a bunch of conf files manually [18:17:57] and I deleted all ssl certs and regenerated all of them [18:18:03] but there must be a smarter way to handle it [18:18:12] at first [18:18:25] I would set the hiera() conf to use dnsmasq https://wikitech.wikimedia.org/w/index.php?title=Hiera:Integration&diff=152484&oldid=152033 [18:18:38] though since puppet clients are not running, it is not going to be applied [18:18:48] so I have manually ran: [18:18:51] echo 'domain eqiad.wmflabs [18:18:52] search eqiad.wmflabs [18:18:52] nameserver 10.68.16.1' > /etc/resolv.conf [18:18:52] /etc/init.d/nscd restart [18:19:32] https://phabricator.wikimedia.org/T95273#1185320 even has the whole script [18:19:34] but that is scary [18:19:46] potentially just 
changing the resolv.conf should be enough [18:22:07] thcipriani: ah and the ssldir are messed up as well [18:22:25] right, so they need to point at server rather than client [18:22:30] so right now [18:22:40] puppet.conf has a section [main] [18:22:44] ssldir = /var/lib/puppet/client/ssl [18:22:47] wait! [18:22:50] so I suspect the master is using the client cert [18:22:55] when it should use ... the server cert [18:23:05] I remember having seen a diff once I fixed puppet [18:23:12] I got it, so here's what changed in your puppet.conf [18:23:36] https://phabricator.wikimedia.org/P500 [18:23:55] so if you restore those settings + resov.conf you should be good to go [18:24:03] (03PS1) 10Awight: CiviCRM job can be run concurrently [integration/config] - 10https://gerrit.wikimedia.org/r/203187 (https://phabricator.wikimedia.org/T91895) [18:24:04] here the puppet conf on integration puppet master : https://phabricator.wikimedia.org/P501 [18:24:04] I grabbed that out of the /var/log/puppet.log [18:24:14] note how [master] has: ssldir = /var/lib/puppet/server/ssl/ [18:24:35] yup [18:24:36] ah yeah P500 is the diff [18:24:41] accurately describe the issue [18:25:06] seems the hostname change cause some puppet manifest to no more recognize the instance has being the master [18:25:10] so the [master] section is dropped [18:25:13] but the master is still around [18:25:18] so yeah restore [18:25:48] right, the hostname is critical because role::puppet::self checks the ::fqdn against the puppetmaster set in ldap [18:26:01] and if they match, it gives it the puppetmaster role [18:26:02] patch --reverse !! [18:28:16] 10Continuous-Integration, 6Labs: integration labs project DNS resolver improperly switched to openstack-designate - https://phabricator.wikimedia.org/T95273#1195498 (10hashar) The puppet failure where due to the hostname of the puppetmaster changing. That causes puppetmaster self to no more recognize the maste... [18:28:39] 10Beta-Cluster, 6Labs, 7Puppet: Puppet failing with certificate errors on deployment-prep - https://phabricator.wikimedia.org/T95586#1195122 (10hashar) The puppet failure where due to the hostname of the puppetmaster changing. That causes puppetmaster self to no more recognize the master as being the master... [18:28:44] thcipriani: so yeah should be much faster to fix [18:29:06] you would never believe how much I have screamed while fixing it for integration :( [18:29:10] hashar: I just restored /etc/puppet/puppet.conf [18:29:23] so if we want to try a puppet run, it _should_ work [18:29:27] is puppet agent happy now ? [18:30:08] hashar: seems to be running... [18:30:32] dang: Could not find class role::labs::instance [18:31:12] try again ! 
:D [18:31:33] at least it seems to compile just fine [18:31:50] Apr 9 18:31:43 deployment-salt puppet-master[1404]: Compiled catalog for i-0000015c.eqiad.wmflabs in environment production in 10.80 seconds [18:31:58] but [18:32:06] Could not retrieve facts for i-0000083a.eqiad.wmflabs: SQLite3::BusyException: database is locked: [18:32:26] Apr 9 18:32:14 deployment-salt puppet-agent[6371]: (/Stage[main]/Role::Labs::Instance/File[/etc/mailname]/content) -deployment-salt.deployment-prep.eqiad.wmflabs [18:32:26] Apr 9 18:32:14 deployment-salt puppet-agent[6371]: (/Stage[main]/Role::Labs::Instance/File[/etc/mailname]/content) +deployment-salt.eqiad.wmflabs [18:32:28] seems to work [18:32:43] thcipriani: sqlite does not really handle concurrent connections :D [18:32:59] heh, sorry :) [18:34:32] so I ran puppet on deployment-bastio [18:34:46] it cant reach some metadata directory :-( [18:34:59] Connection refused puppet://deployment-salt.eqiad.wmflabs/plugins [18:35:19] on integration, /etc/puppet/auth.conf ended up being corrupted [18:35:52] maybe restarrting puppetmaster would suffice [18:36:22] yeah, maybe. I noticed that in the puppet run on deployment-salt it did correct the auth.conf [18:36:29] great [18:36:43] so, maybe everything's magically fixed... [18:36:52] restarting puppetmaster [18:37:10] na :( [18:37:11] PROBLEM - Puppet failure on deployment-logstash1 is CRITICAL: CRITICAL: 22.22% of data above the critical threshold [0.0] [18:37:26] bummer [18:37:31] PROBLEM - Puppet failure on deployment-kafka02 is CRITICAL: CRITICAL: 22.22% of data above the critical threshold [0.0] [18:37:31] PROBLEM - Puppet failure on deployment-mediawiki03 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [0.0] [18:37:51] PROBLEM - Puppet failure on deployment-db1 is CRITICAL: CRITICAL: 40.00% of data above the critical threshold [0.0] [18:38:21] ah [18:38:26] puppetmaster refuses to start [18:39:15] PROBLEM - Puppet failure on deployment-elastic05 is CRITICAL: CRITICAL: 55.56% of data above the critical threshold [0.0] [18:39:17] and no idea how to look at logs [18:39:32] PROBLEM - Puppet failure on deployment-db2 is CRITICAL: CRITICAL: 40.00% of data above the critical threshold [0.0] [18:40:02] PROBLEM - Puppet failure on deployment-redis02 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [0.0] [18:40:21] hmm, not in syslog or dmesg [18:40:28] it is back up somehow [18:40:46] PROBLEM - Puppet failure on deployment-elastic06 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [0.0] [18:41:16] PROBLEM - Puppet failure on deployment-cxserver03 is CRITICAL: CRITICAL: 55.56% of data above the critical threshold [0.0] [18:41:58] there is some apache / ruby thing providing metadata [18:42:28] supposed to listen on port 8140 [18:43:45] PROBLEM - Puppet failure on deployment-elastic08 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [0.0] [18:45:11] PROBLEM - Puppet failure on deployment-mathoid is CRITICAL: CRITICAL: 22.22% of data above the critical threshold [0.0] [18:45:54] hashar: hmm, I see puppetmaster::passenger role, but it doesn't even look like apache is installed on deployment-salt [18:46:06] gonna kill -9 it [18:46:19] PROBLEM - Puppet failure on deployment-upload is CRITICAL: CRITICAL: 33.33% of data above the critical threshold [0.0] [18:46:31] kk [18:46:35] PROBLEM - Puppet failure on deployment-memc03 is CRITICAL: CRITICAL: 44.44% of data above the critical threshold [0.0] [18:46:43] restarting again [18:46:48] looking at 
netstat -tlnp [18:46:53] to figure out whether a ruby process listen [18:47:01] PROBLEM - Puppet failure on deployment-jobrunner01 is CRITICAL: CRITICAL: 30.00% of data above the critical threshold [0.0] [18:47:04] tcp 0 0 10.68.16.99:8140 0.0.0.0:* LISTEN 9697/ruby [18:47:09] PROBLEM - Puppet failure on deployment-zookeeper01 is CRITICAL: CRITICAL: 44.44% of data above the critical threshold [0.0] [18:47:15] running puppet locally [18:47:21] works!!! [18:47:32] thcipriani: I think the puppet master having the bad conf was still running [18:47:40] and the init.d script was not killing it for some reason [18:47:44] had to kill -9 [18:47:47] ah, that makes sense [18:47:51] PROBLEM - Puppet failure on deployment-elastic07 is CRITICAL: CRITICAL: 30.00% of data above the critical threshold [0.0] [18:47:52] then start the process again via the init script [18:47:59] and apparently we have a working puppetmaster again [18:48:10] that is tedious [18:48:19] yeah, that was kinda rough [18:48:25] all those sysadmins tasks remembers me it is a job :) [18:48:27] PROBLEM - Puppet failure on deployment-mediawiki01 is CRITICAL: CRITICAL: 22.22% of data above the critical threshold [0.0] [18:48:33] PROBLEM - Puppet failure on deployment-fluoride is CRITICAL: CRITICAL: 33.33% of data above the critical threshold [0.0] [18:48:38] so tl:dr; puppet / labs etc are awesome [18:48:45] but random crazy failures occurs often [18:49:18] root cause in the end is the hostname changed causing puppetmaster to be downgraded magically as a normal client [18:49:27] PROBLEM - Puppet failure on deployment-redis01 is CRITICAL: CRITICAL: 66.67% of data above the critical threshold [0.0] [18:49:31] hashar: had to do the same a while back https://wikitech.wikimedia.org/wiki/Release_Engineering/SAL#March_18 [18:49:43] PROBLEM - Puppet failure on deployment-test is CRITICAL: CRITICAL: 50.00% of data above the critical threshold [0.0] [18:50:02] marxarelli: oh [18:50:08] PROBLEM - Puppet failure on deployment-videoscaler01 is CRITICAL: CRITICAL: 33.33% of data above the critical threshold [0.0] [18:50:17] maybe it ended up being locked by too many catalog being compiled [18:50:39] 10Continuous-Integration, 7Upstream: Fails npm build failure "File exists: ../esprima/bin/esparse.js" - https://phabricator.wikimedia.org/T90816#1195557 (10Krinkle) 5Resolved>3Open Happened again. https://integration.wikimedia.org/ci/job/npm/2194/console ``` 18:15:08 Building remotely on integration-slav... [18:51:16] PROBLEM - Puppet failure on deployment-memc02 is CRITICAL: CRITICAL: 62.50% of data above the critical threshold [0.0] [18:51:40] so now what's up with all these puppet failures [18:52:01] so [18:52:09] no clue :) [18:52:12] PROBLEM - Puppet failure on deployment-sca01 is CRITICAL: CRITICAL: 62.50% of data above the critical threshold [0.0] [18:52:23] but an interesting thing I have seen is that some packages for Precise have been updated [18:52:28] it may take time for them to resolve [18:52:37] and the repo is signed with a GPG key we do not have on instance [18:52:46] might cause issues [18:54:40] top #1 reason I love puppet: Duplicate declaration: Package[zip] [18:55:25] OK, just to doublecheck that these puppet runs will be fine, I ran puppet on deployment-mediawiki01 and it went fine, so hooray! 
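A condensed sketch of the recovery sequence hashar describes above for the self-hosted puppetmaster on deployment-salt, assuming the service is named puppetmaster and listens on port 8140 (as the netstat output shows); the stale, badly configured master has to be killed by hand because the init script leaves it running.
```
# Restart a wedged self-hosted puppetmaster (sketch; run as root, service and
# port names are assumptions based on the discussion above).
service puppetmaster stop || true
# the init script may leave the old master running; find the ruby process on 8140
PID=$(netstat -tlnp | awk '$4 ~ /:8140$/ {split($7, a, "/"); print a[1]}')
[ -n "$PID" ] && kill -9 "$PID"
service puppetmaster start
netstat -tlnp | grep ':8140'   # confirm a fresh process is listening
puppet agent --test            # local run against the restarted master
```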
[18:55:47] memc02 as well [18:55:50] so I guess they will recover [18:55:52] meanwhile [18:56:01] on deployment-bastion there is a nasty change going on [18:56:05] with syslog-ng and rsyslog [18:56:11] yeah, saw that [18:56:19] TECH DEBT OF DOOOM [18:56:34] so in short on prod we used syslog as a central aggregator [18:56:49] then we had all app servers to use rsyslog to relay their local log to that central aggregator [18:56:57] and we never bothered to move the central syslog to rsylog [18:57:06] - or at least we hadn't back 1 + year ago - [18:57:20] so on beta everything should have rsyslog to relay log [18:57:27] BUT deployment-bastion should only have syslog-ng [18:57:40] and of course rsyslog and syslog-ng packages conflict [19:00:50] * hashar https://www.youtube.com/watch?v=GlYj0ogWNRA *Funky Disco House Mix * [19:03:58] RECOVERY - Puppet failure on deployment-salt is OK: OK: Less than 1.00% above the threshold [0.0] [19:04:42] 10Continuous-Integration: Deprecate global CodeSniffer rules repo and phpcs jobs - https://phabricator.wikimedia.org/T66371#1195612 (10Krinkle) [19:05:27] 10Continuous-Integration: Deprecate global CodeSniffer rules repo and phpcs jobs - https://phabricator.wikimedia.org/T66371#697969 (10Krinkle) a:5Krinkle>3None [19:06:16] RECOVERY - Puppet failure on deployment-cxserver03 is OK: OK: Less than 1.00% above the threshold [0.0] [19:06:18] RECOVERY - Puppet failure on deployment-memc02 is OK: OK: Less than 1.00% above the threshold [0.0] [19:07:12] RECOVERY - Puppet failure on deployment-logstash1 is OK: OK: Less than 1.00% above the threshold [0.0] [19:07:28] RECOVERY - Puppet failure on deployment-kafka02 is OK: OK: Less than 1.00% above the threshold [0.0] [19:07:32] RECOVERY - Puppet failure on deployment-mediawiki03 is OK: OK: Less than 1.00% above the threshold [0.0] [19:07:50] RECOVERY - Puppet failure on deployment-db1 is OK: OK: Less than 1.00% above the threshold [0.0] [19:08:29] RECOVERY - Puppet failure on deployment-mediawiki01 is OK: OK: Less than 1.00% above the threshold [0.0] [19:09:05] PROBLEM - Puppet failure on integration-slave-jessie-1001 is CRITICAL: CRITICAL: 44.44% of data above the critical threshold [0.0] [19:09:35] RECOVERY - Puppet failure on deployment-db2 is OK: OK: Less than 1.00% above the threshold [0.0] [19:10:01] RECOVERY - Puppet failure on deployment-redis02 is OK: OK: Less than 1.00% above the threshold [0.0] [19:10:45] RECOVERY - Puppet failure on deployment-elastic06 is OK: OK: Less than 1.00% above the threshold [0.0] [19:13:47] RECOVERY - Puppet failure on deployment-elastic08 is OK: OK: Less than 1.00% above the threshold [0.0] [19:13:56] 10Continuous-Integration, 10Librarization: Jenkins: Create job for verifying committed "vendor" directory from composer - https://phabricator.wikimedia.org/T74952#1195643 (10Krinkle) [19:14:29] RECOVERY - Puppet failure on deployment-redis01 is OK: OK: Less than 1.00% above the threshold [0.0] [19:14:33] whew [19:15:13] RECOVERY - Puppet failure on deployment-mathoid is OK: OK: Less than 1.00% above the threshold [0.0] [19:16:16] RECOVERY - Puppet failure on deployment-upload is OK: OK: Less than 1.00% above the threshold [0.0] [19:16:34] RECOVERY - Puppet failure on deployment-memc03 is OK: OK: Less than 1.00% above the threshold [0.0] [19:17:08] RECOVERY - Puppet failure on deployment-zookeeper01 is OK: OK: Less than 1.00% above the threshold [0.0] [19:17:10] RECOVERY - Puppet failure on deployment-sca01 is OK: OK: Less than 1.00% above the threshold [0.0] [19:17:25] thcipriani: 
seems all good [19:17:50] RECOVERY - Puppet failure on deployment-elastic07 is OK: OK: Less than 1.00% above the threshold [0.0] [19:18:34] RECOVERY - Puppet failure on deployment-fluoride is OK: OK: Less than 1.00% above the threshold [0.0] [19:18:50] hashar: nice—now we can address the /var partition, which is what we were trying to do in the first place :P [19:19:02] that is not going to be easy :( [19:19:06] RECOVERY - Puppet failure on integration-slave-jessie-1001 is OK: OK: Less than 1.00% above the threshold [0.0] [19:19:07] probably easier to build some new instance [19:19:14] RECOVERY - Puppet failure on deployment-elastic05 is OK: OK: Less than 1.00% above the threshold [0.0] [19:19:26] and migrate services hosted on deployment-bastion to different and fresh instances [19:19:44] RECOVERY - Puppet failure on deployment-test is OK: OK: Less than 1.00% above the threshold [0.0] [19:20:21] well, I posted some comments here: https://phabricator.wikimedia.org/T95564 [19:22:00] RECOVERY - Puppet failure on deployment-jobrunner01 is OK: OK: Less than 1.00% above the threshold [0.0] [19:23:00] 10Beta-Cluster: /var/lib/l10nupdate fills up deployment-bastion /var partition - https://phabricator.wikimedia.org/T95564#1195659 (10hashar) @yuvipanda the puppet manifests ensure /var/lib/l10nupdate is a directory, so you cant really symlink. Up until March 24th the l10nupdate working directory was in /srv/l10... [19:24:48] 10Beta-Cluster, 6Labs, 7Puppet: Puppet failing with certificate errors on deployment-prep - https://phabricator.wikimedia.org/T95586#1195675 (10hashar) 5Open>3Resolved a:3hashar Ok solved! That was the exact same issue as on integration and staging project. Changing the hostname cause the puppetmaster... [19:25:09] RECOVERY - Puppet failure on deployment-videoscaler01 is OK: OK: Less than 1.00% above the threshold [0.0] [19:35:05] PROBLEM - Puppet failure on integration-slave-jessie-1001 is CRITICAL: CRITICAL: 22.22% of data above the critical threshold [0.0] [19:43:52] (03PS2) 10Legoktm: Merge mwext-Wikibase-* repo and repo-api jobs [integration/config] - 10https://gerrit.wikimedia.org/r/202932 [19:45:47] (03CR) 10Legoktm: [C: 032] Merge mwext-Wikibase-* repo and repo-api jobs [integration/config] - 10https://gerrit.wikimedia.org/r/202932 (owner: 10Legoktm) [19:49:11] (03Merged) 10jenkins-bot: Merge mwext-Wikibase-* repo and repo-api jobs [integration/config] - 10https://gerrit.wikimedia.org/r/202932 (owner: 10Legoktm) [19:50:03] !log deployed https://gerrit.wikimedia.org/r/202932 [19:50:06] Logged the message, Master [19:53:46] PROBLEM - Puppet failure on integration-slave-precise-1011 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [0.0] [20:10:24] 10Beta-Cluster, 6Labs, 6operations: GPG error: http://nova.clouds.archive.ubuntu.com precise Release BADSIG 40976EAF437D05B5 - https://phabricator.wikimedia.org/T95541#1195824 (10Dzahn) 5Open>3Resolved a:3Dzahn fixed with method 2: ``` # apt-get clean # cd /var/lib/apt # mv lists lists.old # mkdir -p... [20:16:18] 10Beta-Cluster, 6Labs, 6operations: GPG error: http://nova.clouds.archive.ubuntu.com precise Release BADSIG 40976EAF437D05B5 - https://phabricator.wikimedia.org/T95541#1195841 (10Dzahn) root@deployment-bastion:~# apt-key list | grep -B1 ftpmaster pub 1024D/437D05B5 2004-09-12 uid Ubuntu Ar... 
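The "method 2" fix quoted from T95541 above is cut off by the logger; the usual full recipe for a BADSIG apt error looks like the following. This is a hedged reconstruction: the first steps match the quoted comment, while the remaining ones are the standard procedure rather than a verbatim copy of the task.
```
# Recover from "BADSIG 40976EAF437D05B5" on the Ubuntu mirror (run as root).
# Steps after the truncated "mkdir -p..." are assumed, not quoted.
apt-get clean
cd /var/lib/apt
mv lists lists.old
mkdir -p lists/partial
apt-get clean
apt-get update   # re-download the package lists and re-check signatures
```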
[20:20:52] fixed deployment-bastion's APT sources [20:21:11] 10Beta-Cluster, 6Labs, 6operations: GPG error: http://nova.clouds.archive.ubuntu.com precise Release BADSIG 40976EAF437D05B5 - https://phabricator.wikimedia.org/T95541#1195853 (10hashar) Thanks a ton @dzahn for the fix, the reference and the detailed step by step instructions! [20:21:16] wondered if there were pending package upgrades since apt-get update was fixed [20:21:29] saw that it would upgrade both libc6 and php5, so a bunch [20:21:41] also saw it would _down_grade salt-minion (?) [20:21:57] didnt execute it [20:25:22] 10Browser-Tests, 10Continuous-Integration, 10Wikimedia-Fundraising: Create unit and integration tests for Fundraising extensions to identify breaking MediaWiki changes - https://phabricator.wikimedia.org/T89404#1195872 (10awight) [20:39:26] 10Continuous-Integration, 6operations: Provide Jessie package to fullfil Mediawiki::Packages requirement - https://phabricator.wikimedia.org/T95002#1195915 (10hashar) Thanks @faidon for the preliminary investigation. Should I fill subtasks for the 5 points you mentioned? It seems that each will reach out to di... [20:41:17] 10Continuous-Integration, 6operations: Provide Jessie package to fullfil Mediawiki::Packages requirement - https://phabricator.wikimedia.org/T95002#1195919 (10hashar) [20:41:52] 10Continuous-Integration, 6operations: Provide Jessie package to fullfil Mediawiki::Packages requirement - https://phabricator.wikimedia.org/T95002#1177707 (10hashar) [20:41:55] RECOVERY - Puppet failure on deployment-bastion is OK: OK: Less than 1.00% above the threshold [0.0] [20:44:34] ah syslog is all happy [20:45:04] thcipriani: marxarelli: tip for beta cluster, syslog should be centraly collected by deployment-bastion and are written to /data/project/syslog [20:46:01] hashar: neat. [20:46:15] and it is spammed with: init: citoid main process ended, respawning [20:48:28] heh [20:48:44] 10Beta-Cluster, 10Citoid: Citoid Syntaxerror on beta cluster - https://phabricator.wikimedia.org/T95616#1195947 (10hashar) 3NEW [20:48:52] I have no idea how many tasks I have created this week [20:49:04] I feel like my job title should now be "task filler" [20:49:43] 21 [20:49:46] https://phabricator.wikimedia.org/maniphest/query/uQyw3ZctIBhm/#R [20:50:10] 10Beta-Cluster, 10Citoid: Citoid Syntaxerror on beta cluster - https://phabricator.wikimedia.org/T95616#1195960 (10hashar) /etc/citoid/config.yaml is definitely a YAML file but somehow it is being loaded as a javascript file :/ [20:50:29] nice [20:50:42] greg-g: OpenStack infra is considering Phabricator [20:50:49] instead of their home made bug system [20:51:25] saw that :) I've been ignoring the commentary though [20:52:22] best quote is "if only it was written in python" [20:52:24] :) [20:59:19] 3Continuous-Integration-Isolation: Figure out how Jenkins conf is maintained by OpenStack - https://phabricator.wikimedia.org/T95049#1195996 (10hashar) OpenStack has a fully puppetized Jenkins. They have split their puppet modules as independent repositories so that people from the OpenStack community can benefi... [21:15:13] 10Beta-Cluster, 10Citoid: Citoid Syntaxerror on beta cluster - https://phabricator.wikimedia.org/T95616#1196074 (10mobrovac) Merci beaucoup for noticing and letting us know @hashar ! 
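For the pending-upgrade check mentioned above (inspected but deliberately not executed), a dry run is enough to see what apt would do, including the odd salt-minion downgrade. A sketch:
```
# Inspect pending package changes on deployment-bastion without applying them.
apt-get update
apt-get -s dist-upgrade | grep -E '^(Inst|Conf|Remv)'   # -s only simulates
apt-cache policy salt-minion   # compare installed vs. candidate version
```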
[21:18:51] 10Beta-Cluster: beta-scap-eqiad no more run due to ssh Permission denied - https://phabricator.wikimedia.org/T95562#1196076 (10hashar) a:3thcipriani Fixed by Tyler [21:24:39] chasemp: they are a python shop so yeah :( [21:24:49] that is one of the reasons we migrated out of perl Bugzilla [21:25:44] anyway I am out [23:34:56] (03PS1) 10Krinkle: zuul: Don't raise "abort" as error to the user [integration/docroot] - 10https://gerrit.wikimedia.org/r/203251 [23:35:10] (03CR) 10Krinkle: [C: 032] zuul: Don't raise "abort" as error to the user [integration/docroot] - 10https://gerrit.wikimedia.org/r/203251 (owner: 10Krinkle) [23:35:47] (03Merged) 10jenkins-bot: zuul: Don't raise "abort" as error to the user [integration/docroot] - 10https://gerrit.wikimedia.org/r/203251 (owner: 10Krinkle) [23:43:14] 10Continuous-Integration, 7Upstream: Fails npm build failure "File exists: ../esprima/bin/esparse.js" - https://phabricator.wikimedia.org/T90816#1196461 (10Krinkle)
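T90816, reopened in the last entry, is the recurring npm failure "File exists: ../esprima/bin/esparse.js". A common mitigation for this class of error on a Jenkins slave is to clear the stale workspace state before reinstalling; this is only a hedged sketch, not the resolution recorded on the task.
```
# Clean up a stale npm workspace on the failing slave before retrying the job.
# $WORKSPACE is the Jenkins job workspace (assumption about the job layout).
cd "$WORKSPACE"
rm -rf node_modules
npm cache clean    # npm 2.x-era syntax
npm install
```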