[12:58:07] CI weekly meeting is starting in a couple of minutes
[12:59:52] #startmeeting CI weekly meeting
[12:59:53] Meeting started Tue Apr 7 12:59:53 2015 UTC and is due to finish in 60 minutes. The chair is hashar. Information about MeetBot at http://wiki.debian.org/MeetBot.
[12:59:53] Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
[12:59:53] The meeting name has been set to 'ci_weekly_meeting'
[13:00:01] good (eu) afternoon
[13:00:03] o/
[13:00:08] jzerebecki: Krinkle :)
[13:00:22] \o
[13:00:24] #link Previous Agenda https://www.mediawiki.org/wiki/Continuous_integration/meetings/2015-03-30
[13:00:31] #link Agenda https://www.mediawiki.org/wiki/Continuous_integration/meetings/2015-04-07
[13:00:39] #topic actions retrospective
[13:00:45] i'm split between two meetings, but trying my best
[13:00:46] Let's look at the previous week's meeting actions
[13:01:06] https://www.mediawiki.org/wiki/Continuous_integration/meetings/2015-03-30#Topics
[13:01:23] iirc I triaged all five items in there
[13:01:55] https://phabricator.wikimedia.org/T62143 Provide a way to have a demo directory alongside the documentation on doc.wikimedia.org
[13:02:23] Krinkle: from your last comment and your work over the past week, it seems we are close to solving this one
[13:03:02] since publishing the /demo/ directory is primarily for VE / oojs, I guess you can assign the bug to yourself?
[13:03:07] hashar: Yeah, the directory navigation I created for int.wm.o/cover/ is generic and can be reused there.
[13:03:30] excellent!
[13:04:06] #info Work by Timo to publish /cover/ directories is generic and can be reused to publish /demos/ directory. Would solve https://phabricator.wikimedia.org/T62143
[13:04:27] hashar: Which list are we working from?
[13:04:36] Ah, I see
[13:04:51] retrospective of past week topics https://www.mediawiki.org/wiki/Continuous_integration/meetings/2015-03-30#Topics
[13:05:19] #info Fixed: Publish QUnit coverage on integration.wikimedia.org https://phabricator.wikimedia.org/T87490
[13:05:44] https://phabricator.wikimedia.org/T91707 l10n-bot self-force-merging sometimes breaks mediawiki/core master
[13:06:11] that is blocked by l10n-bot triggering hundreds of testextensions jobs
[13:06:19] which in turn clone mediawiki/core and fill the disk
[13:06:35] I noticed some commit that would unify the testextensions jobs under a single job
[13:06:48] hashar: Except for the ones with dependencies.
[13:07:03] hashar: Yeah, the idea is to use reusable workspaces, like for the other generic jobs.
[13:07:08] I guess the workspace is cleared between builds?
[13:07:16] Will be a bit slower (re-clone), but the extension repos are much smaller on average
[13:07:25] mediawiki-core would be preserved I expect
[13:07:38] still work in progress
[13:07:39] by legoktm
[13:07:55] I proposed a rather crazy way to save us from cloning mw/core on https://phabricator.wikimedia.org/T93703#1144542
[13:08:06] which is to have a mirror of mw/core and use git clone --shared
[13:09:23] we also have repositories running the shared extension job
[13:09:26] hashar: is that the same as cloning with hard links to a local ref?
[13:09:29] which clones several of them together
[13:09:30] Or is shared different?
[13:09:35] the hard link seems safer.
[13:09:38] shared is slightly different
[13:09:44] it grabs the objects directly from the mirror repo
[13:10:13] so if the mirror repo is ever written to, the working copy can potentially become corrupt
[13:10:25] writing to the mirror repo might invoke git gc
[13:10:29] which would delete objects
[13:10:44] and if the working clone references those no-longer-existing objects, it becomes corrupt :(
[13:10:46] what's the advantage of --shared over re-cloning each time with hard links (like we used to do)?
[13:11:19] oh hardlink
[13:11:47] I was always impressed by the way that worked (your idea :) )
[13:12:02] I can't remember offhand :(
[13:12:14] I think I hit a wall when using hardlinks on labs instances
[13:12:40] let's state that the bug needs further investigation
[13:13:11] but I think the point was to avoid creating all the hardlinks under /.git/
[13:13:19] and let us repack the source repo
[13:13:35] will update the task stating it needs more info
[13:14:00] If we wipe the workspace and re-create the hardlink clone each time, there should be no need for that
[13:14:23] yeah with zuul-cloner that might be what we end up doing
[13:14:31] #info https://phabricator.wikimedia.org/T93703 reduce copies of mediawiki/core in workspaces: needs more investigation for the difference between git clone --shared and just hardlinks.
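For reference, a minimal sketch of the two cloning approaches being weighed above, assuming a local bare mirror at /srv/git/mediawiki-core.git (the paths are illustrative, not the actual CI layout):

```sh
# Hard-link based clone: cloning from a local path hard-links the objects by
# default (--local makes it explicit). The clone owns its links, so repacking
# or garbage-collecting the mirror cannot corrupt it.
git clone --local /srv/git/mediawiki-core.git src-hardlink

# Shared clone: nothing is copied; .git/objects/info/alternates points at the
# mirror's object store. Much cheaper on disk, but if the mirror is repacked
# or git gc removes objects the clone still references, the clone is corrupted.
git clone --shared /srv/git/mediawiki-core.git src-shared
```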
[13:15:01] let's move on
[13:15:11] from past week: https://phabricator.wikimedia.org/T91766 Gerritbot shouldn't post "Change merged by jenkins-bot:" messages any more
[13:15:15] that is not CI
[13:15:35] it is the Gerrit bot interface between Phabricator and Gerrit. I have removed the task from our scope
[13:15:50] OK
[13:15:55] from past week: https://phabricator.wikimedia.org/T92909 All new extensions should be set up automatically with Zuul
[13:16:04] that one is quite interesting
[13:16:10] since we now have generic jobs
[13:16:20] we could have Zuul understand project wildcards
[13:16:23] something like:
[13:16:35] project: mediawiki/extensions/* --> trigger 'npm' 'composer'
[13:16:53] so as soon as the repo is created in Gerrit it will get jobs triggered for it
[13:17:04] hashar: Hm.. yeah. 'npm' should probably be opt-in though
[13:17:14] I stated last week and updated the task to say that the feature needs to be added to Zuul
[13:17:26] made it a blocker of https://phabricator.wikimedia.org/T94409 (Upgrade Zuul server to latest upstream)
[13:17:34] which in turn is blocked by my Debian packaging work
[13:18:09] hashar: is wildcard supported in latest upstream?
[13:18:15] Krinkle: nope
[13:18:23] someone will have to implement it :)
[13:18:48] Right
[13:18:56] but I definitely love the idea
[13:19:17] I guess we covered past week's topics
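To illustrate the wildcard idea from T92909: in the Zuul 2.x layout.yaml every repository currently needs its own stanza. If wildcard project names were implemented upstream, a single entry might look roughly like this (the template name is hypothetical; 'npm' and 'composer' are the generic jobs mentioned above):

```yaml
# Hypothetical sketch only: Zuul does not support wildcard project names yet;
# someone would have to implement this upstream first.
project-templates:
  - name: extension-defaults
    test:
      - 'npm'        # probably needs to be opt-in rather than default
      - 'composer'

projects:
  # one wildcard entry instead of one stanza per extension repository
  - name: ^mediawiki/extensions/.*$
    template:
      - name: extension-defaults
```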
[13:19:29] #topic CI isolation status
[13:19:35] hashar: I think auto-setup should be deprioritised. It doesn't happen very often, and our job config isn't ready for it yet. For example, phpunit isn't part of composer for extensions (cannot be).
[13:20:12] Krinkle: it is low already, feel free to move it down to lowest ( https://phabricator.wikimedia.org/T92909 )
[13:20:24] #link CI isolation status board https://phabricator.wikimedia.org/tag/continuous-integration-isolation/board/
[13:20:26] Oh, I was looking at the workboard column
[13:20:58] #info Antoine had a meeting with Chase / Andrew B and Greg last Friday. We are going to set up a server in the labs subnet to support nodepool
[13:21:26] #link https://phabricator.wikimedia.org/T95045 install/deploy labnodepool1001
[13:21:54] #info talked about puppetizing the Jenkins configuration and even setting up an entirely new Zuul etc. architecture in parallel
[13:22:20] #action Antoine to update CI isolation architecture and reply to Chase/Andrew B questions from last meeting.
[13:22:41] #info Next meeting Friday April 10th 7pm UTC.
[13:22:49] Krinkle: so we made some good progress
[13:23:03] hashar: I'd like to rename Backlog to Untriaged and create a Backlog column
[13:23:15] Asked James_F to verify and I think we made a mistake on our workboard :P
[13:23:15] yeah
[13:23:21] any questions about CI isolation?
[13:23:33] hashar: Yeah, nothing specific but let me think..
[13:23:38] #info Labs precise instances are now using the Debian zuul package. Not puppetized though.
[13:23:42] hashar: I spoke with openstack-infra a bit about how they do things
[13:23:55] They've abandoned the concept of using snapshots
[13:24:08] oh
[13:24:20] instead they're using an image builder to build new instances from scratch (not using a base image) and then upload to glance directly.
[13:24:27] so they spawn a VM out of a base image then run the whole puppet provisioning script?
[13:24:33] They had lots of bugs with it, that's why they abandoned it
[13:24:55] they create a new disk basically, using chroot to some extent.
[13:25:02] And then install linux+puppet.
[13:25:10] which applies the base role and CI slave role
[13:25:54] #action Antoine to puppetize the zuul Debian package
[13:26:13] #action Antoine to migrate gallium Zuul install to the Debian package
[13:26:18] hashar: I also spoke to them about our queue issues. I can't find the task but I left a comment somewhere from James E. Blair
[13:26:25] #action Antoine to build and deploy a Trusty Zuul Debian package
[13:26:36] they're working on zuul v3 which will have explicit queues
[13:26:52] (because of generic jobs)
[13:26:54] yeah I noticed a lengthy mail by James about rethinking zuul
[13:27:37] have you had a chance to talk with James about all our projects ending up sharing the same Dependent queue?
[13:28:38] hashar: Ah, you're on their mailing list?
[13:28:40] * Krinkle should subscribe
[13:28:42] yeah
[13:28:50] openstack-infra@lists.openstack.org iirc
[13:28:54] yeah
[13:29:21] #link http://lists.openstack.org/pipermail/openstack-infra/ OpenStack infrastructure team mailing list (they maintain Zuul/Nodepool)
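As background on the image-building approach described above: openstack-infra builds slave images offline with diskimage-builder and uploads them to Glance, rather than snapshotting a booted instance. A rough sketch with illustrative names, not a statement of how our nodepool setup will be configured:

```sh
# Build a Trusty image from scratch in a chroot (no base instance involved).
# 'ubuntu' and 'vm' are standard diskimage-builder elements; a site-specific
# element would be the place to bake in puppet and the CI slave role.
DIB_RELEASE=trusty disk-image-create ubuntu vm -o ci-slave-trusty

# Upload the resulting qcow2 straight to Glance so nodepool can boot from it.
glance image-create --name ci-slave-trusty --disk-format qcow2 \
  --container-format bare --file ci-slave-trusty.qcow2
```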
[13:30:00] OK. Backlog created.
[13:30:06] #topic Phabricator column names
[13:30:21] #link https://phabricator.wikimedia.org/project/board/401/ CI workboard
[13:30:22] Is there anything in the Next column that is not actionable right now?
[13:30:55] so the idea
[13:31:05] is that when you click on the project link you end up on the workboard
[13:31:20] on the left sidebar there is an anchor that points you to a search query
[13:31:32] which lists open/stalled tasks grouped by priority
[13:31:41] Hm..
[13:31:43] #link https://phabricator.wikimedia.org/maniphest/?statuses=open%2Cstalled&allProjects=PHID-PROJ-6wswany56ulk4z7k33jb#R CI project default search
[13:31:50] that lists the unbreak-now tasks at the top
[13:31:55] followed by Needs Triage
[13:32:09] that used to be the place you ended up when clicking a project link
[13:32:14] but it is now the workboard :(
[13:32:16] Yeah, but the Priority field in CI tasks is often not our priority but downstream's priority
[13:32:24] yeah which is a mess :(
[13:32:30] You can sort the workboard by priority though
[13:32:49] https://phabricator.wikimedia.org/project/sprint/board/401/query/open/?order=priority
[13:33:03] so to me Untriaged / Backlog columns are a bit redundant
[13:33:07] what we could have
[13:33:10] https://phabricator.wikimedia.org/T94138
[13:33:32] is have all new tasks enter the default Untriaged column
[13:33:32] Anything we have discussed here that is not our problem (downstream) or is triaged can go to Backlog
[13:33:39] that way each week we can go through Untriaged.
[13:33:46] and not go over the same thing multiple times
[13:33:50] then have a Herald script to move them to the Backlog column whenever they are no longer "Needs Triage"
[13:34:07] hashar: If they have a priority downstream but we have not yet triaged them
[13:34:13] it should stay untriaged in that case
[13:34:19] oh
[13:34:23] :)
[13:34:55] I must say I have no idea about a good workflow with Phabricator
[13:34:56] :(
[13:35:06] Maybe we can treat Untriaged as Inbox/Unread
[13:35:21] that is what I did with Backlog
[13:35:32] hashar: But then where did you move it to when it is handled?
[13:35:34] then I would pick items from Backlog to Next once the task depends on us
[13:35:42] and move it to In-Progress whenever working on it
[13:36:00] or if it depends on someone other than us, move it to Externally Blocked
[13:36:01] James created Backlog/Next/In-progress for VisualEditor
[13:36:09] and VE doesn't even have the downstream priority problem
[13:36:22] The advantage is that many tasks are not actionable (have dependencies or need discussion)
[13:36:27] so should not be in Next
[13:36:30] but remain in Backlog
[13:36:59] We also have too much in progress. We're not actively working on all of that.
[13:37:15] What is the status of https://phabricator.wikimedia.org/T91396 by the way?
[13:37:24] hold on
[13:37:43] would you mind writing down your thoughts about the usage of the different columns?
[13:37:58] maybe as a wiki page under https://www.mediawiki.org/wiki/Continuous_integration
[13:38:08] OK, will do later today :)
[13:38:14] though ideally Phabricator would let us add a description to each column :)
[13:38:16] Just a proposal, I want your thoughts
[13:38:24] Backlog: Inbox unread (for you and me to handle)
[13:38:31] Untriaged*
[13:38:36] #action Timo to write his thoughts about the workboard columns and potential usage
[13:38:49] I will probably be fine with whatever you come up with
[13:38:50] Backlog: We've read it, considered it. Either we don't need to work on it, or we'll work on it sometime in the future.
[13:38:54] just need to know how to use them hehe
[13:38:58] Next: Once you complete a task, you can pick one here to work on
[13:39:08] yeah write that down on some wiki page :)
[13:39:20] would be interesting to exchange about it on the teampractice list
[13:39:29] 20 minutes left
[13:39:39] #topic Have jenkins jobs logrotate their build history https://phabricator.wikimedia.org/T91396
[13:39:42] grr
[13:39:46] #link https://phabricator.wikimedia.org/T91396
[13:40:35] #info Antoine did most of the grunt work. There are still jobs not logrotated to be investigated. Most of the grunt work has been done, so lowering priority.
[13:40:45] hashar: Hm.. I don't get it.
[13:40:45] Krinkle: almost all jobs are logrotated now
[13:40:52] hashar: We add logrotate to the default.
[13:40:54] Then everything has it.
[13:40:55] done.
[13:40:55] yeah
[13:40:58] how can some jobs not have it?
[13:41:00] but some jobs do not have the default
[13:41:07] or are still configured in Jenkins but no longer maintained by JJB
[13:41:24] Does everything in JJB have default?
[13:41:27] I think the maven job templates do not use the default maybe
[13:41:38] Does everything in JJB have logrotate*
[13:41:55] almost
[13:41:57] to be double-checked
[13:42:19] hashar: OK
[13:42:25] I've got two more tasks to update for now
[13:42:26] #link https://phabricator.wikimedia.org/T91396#1090060 A list of jobs not logrotated as of March 19th (updated from time to time)
[13:42:34] Ah, nice
[13:42:46] some of the jobs were left over after they got unified as generic ones
[13:43:00] such as the *phpcs-HEAD ones which probably no longer exist
[13:43:05] but stayed in the Jenkins config
[13:43:09] yeah, but those don't grow anymore.
[13:43:15] We'll delete those when we purge.
[13:43:15] yup
[13:43:19] so definitely on my radar
[13:43:20] shouldn't be part of the logrotate job
[13:43:24] but not the highest prio :)
[13:43:42] https://phabricator.wikimedia.org/T91410
[13:44:08] I have lowered the prio
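For reference, the "logrotate in the defaults" approach discussed above would look roughly like this in Jenkins Job Builder YAML; the retention numbers here are illustrative, and job templates that declare their own defaults (e.g. some maven ones) would still need the setting added separately:

```yaml
- defaults:
    name: global
    # A defaults section named 'global' is picked up by every job template
    # that does not name its own defaults, so their build history is pruned.
    logrotate:
      daysToKeep: 15
      numToKeep: 300
      artifactDaysToKeep: -1
      artifactNumToKeep: -1
```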
[13:44:14] hold on
[13:44:18] got a few more for you :)
[13:44:26] #topic labs instances creation
[13:44:36] I noticed you deleted the integration-slave-preciseXXX instances
[13:44:38] OK :)
[13:44:39] any reason?
[13:44:50] hashar: I deleted them the same way I'll delete the trusty ones again today
[13:44:53] because they're broken
[13:44:58] ARHARHHHHHHHHHHHHHH
[13:44:58] o/ good morning
[13:45:05] legoktm: hello :)
[13:45:08] and our puppet is too poorly tested/reviewed to be reliable
[13:45:14] if they are broken, they must be deleted and re-created
[13:45:19] which only takes one minute
[13:45:20] not really
[13:45:29] if they don't work, we fix what is broken and re-create
[13:45:33] then we see if it's really fixed
[13:45:35] I spent a good afternoon looking at the Precise instances and the failures were unrelated to our manifests
[13:45:43] they were either weird puppetmaster issues or some
[13:45:48] Yes, I know
[13:45:51] I filed the tasks
[13:45:54] incorrect issue in the ops/puppet manifests
[13:46:05] So we'll re-create them now that the bug is fixed
[13:46:08] but once the appropriate puppet patch got cherry-picked they worked just fine
[13:46:16] We have existing instances that work fine
[13:46:26] the new ones I created for the sole purpose of testing puppet
[13:46:33] it was overkill to delete them since I got them fixed and that took me a while :/
[13:46:33] that's why I re-create them every month
[13:46:49] hashar: what did you do on the individual instances? nothing right?
[13:47:02] beside running puppet? yeah nothing more
[13:47:04] anyway
[13:47:11] I would recommend building a single instance first
[13:47:17] instead of 4 / 5 of them
[13:47:23] then file tasks
[13:47:25] Yeah, so there is no work wasted.
[13:47:31] and once solved, delete that single instance then create another one
[13:47:38] There were issues, I filed the tasks, you fixed them, and now we can re-create
[13:47:44] I thought the precise instances were about to be pooled so I worked on polishing them
[13:47:46] Yeah, that's why I created only one slave-trusty
[13:47:48] last week
[13:47:49] instead of 5
[13:47:57] I created 5 at first because I thought there were no issues
[13:47:59] though we have 5 now :)
[13:48:15] but EVERY month when we re-create things it turns out, yet again (surprise!), our puppet is broken for some reason
[13:48:21] #info integration-slave-precise* instances created / deleted because of puppet errors. Should be all fixed now
[13:48:35] I've done this 4 times and not a single time did we have fewer than 5 critical issues that are completely unrelated.
[13:48:45] #idea only create a single instance to verify puppet passes instead of spawning five we will end up deleting anyway.
[13:49:08] yeah puppet keeps breaking
[13:49:10] mostly due to ops/puppet and labs stability
[13:49:15] Yeah
[13:49:20] but it does not necessarily have a huge impact on the end result
[13:49:27] OK. re-creation is low prio, I do it mostly to test our puppet
[13:49:36] which is a good idea
[13:49:40] I'll try again tomorrow from scratch
[13:49:42] just test it with a single instance instead of five :)
[13:49:43] with precise and trusty
[13:49:47] Yeah, will do :)
[13:49:51] so for Trusty
[13:50:09] the DNS resolver got badly changed ~24 hours ago, I will mail the QA and labs lists about it
[13:50:26] TL;DR: the puppetmaster had to be reinstalled and all labs instances' puppet certs regenerated
[13:50:31] which I wasted my time on this morning
[13:50:40] Argh
[13:50:44] integration-slave-trusty-1001 to 1004 are fully operational now
[13:51:00] Not yet pooled right?
[13:51:10] I have deleted integration-slave-trusty-1005 because the wrong puppet class was applied to it ( ci::website instead of ci::labs::slave )
[13:51:13] none are pooled
[13:51:14] Because I'm quite sure we did not do all the steps from wikitech:Integration/Setup yet
[13:51:37] so trusty-1005 is a full reinstallation as of 2 hours or so ago. It is still working :(
[13:52:19] to me 1001 - 1004 are ready. Need to add to them the manual steps from wikitech:Integration/Setup
[13:52:55] hashar: I'll re-create them later today or tomorrow and pool them if they pass, or file more tasks :)
[13:53:07] Krinkle: no need to recreate them :)
[13:53:23] It only takes a minute, and I simply cannot trust our infrastructure.
[13:53:24] I have spent the beginning of the afternoon making sure they conform to what puppet expects
[13:53:47] If everything works it shouldn't matter; they'll re-create exactly the same.
[13:54:00] still a waste of time since it takes roughly 2 hours to build :)
[13:54:21] it only takes me 2 minutes of blocking time
[13:54:25] the 2 hours I do other stuff.
[13:54:41] Right now we've postponed re-creating for 8 days. 2 more hours won't hurt.
[13:55:04] I'm just being thorough :)
[13:55:17] well I have made them pristine
[13:55:37] OK
[13:55:50] delete them if you want, but that sounds to me like you are throwing away the work I did today to have them 100% puppet clean
[13:56:06] #action Antoine to mail QA/labs about the DNS resolver and puppetmaster corruption on the integration labs project
[13:56:13] The main problem I see with puppet failures is that many of our manifests do not recover. If they fail the first time, the second time they will not re-try everything. Some things will appear to be applied, so it leaves them behind.
[13:56:38] Thus causing a bug one or two days later when something we don't use very often fails.
[13:56:40] yeah potentially
[13:56:49] legoktm: around?
[13:56:53] yep
[13:56:57] before we end
[13:57:05] #topic next meeting date and time
[13:57:08] So to me spending 2 minutes re-creating it is 100% worth the effort to eliminate that risk
[13:57:23] and save me potentially an hour of debugging
[13:57:27] I have set the meeting at 1pm UTC (3pm CET) which is definitely too early for SF
[13:57:47] what about starting next week's meeting 1 hour later?
[13:57:52] legoktm: are you in the SF timezone?
[13:58:14] yeah I'm in SF's timezone. 2pm UTC works for me
[13:58:45] well, as long as it's not on Monday; I'm going to be on a train then. Tuesday works great
[13:58:49] legoktm: isn't it like 7am in SF?
[13:59:30] Krinkle: what do you think about moving the meeting to Tuesdays at 2pm UTC (4pm CET)?
[13:59:38] 7am is better than 6 :P
[13:59:41] Works for me
[13:59:55] #info 2pm UTC is a bit too early for SF. Monday is rush hour for some.
[14:00:10] legoktm: we might move it a bit further later on :)
[14:00:53] #agreed Next meetings moved from Mondays 13:00 UTC to Tuesdays 14:00 UTC (16:00 CET / 7:00 PST)
[14:01:05] legoktm: the whole thing is very experimental :)
[14:01:16] #topic Rejoice
[14:01:19] :)
[14:01:35] #info Congratulations to Krinkle and legoktm for all the hard work in March!
[14:02:01] I will push the minutes to the wiki
[14:02:17] craft a summary mail and spam a few lists
[14:02:22] #endmeeting
[14:02:23] Meeting ended Tue Apr 7 14:02:22 2015 UTC. Information about MeetBot at http://wiki.debian.org/MeetBot . (v 0.1.4)
[14:02:23] Minutes: https://tools.wmflabs.org/meetbot/wikimedia-office/2015/wikimedia-office.2015-04-07-12.59.html
[14:02:23] Minutes (text): https://tools.wmflabs.org/meetbot/wikimedia-office/2015/wikimedia-office.2015-04-07-12.59.txt
[14:02:23] Minutes (wiki): https://tools.wmflabs.org/meetbot/wikimedia-office/2015/wikimedia-office.2015-04-07-12.59.wiki
[14:02:23] Log: https://tools.wmflabs.org/meetbot/wikimedia-office/2015/wikimedia-office.2015-04-07-12.59.log.html