[14:00:47] hello
[14:00:50] ci meeting starting now
[14:01:04] hi
[14:01:08] Krinkle jzerebecki legoktm addshore :D
[14:01:12] #startmeeting CI weekly meeting
[14:01:13] Meeting started Tue Apr 14 14:01:13 2015 UTC and is due to finish in 60 minutes. The chair is hashar. Information about MeetBot at http://wiki.debian.org/MeetBot.
[14:01:13] Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
[14:01:13] The meeting name has been set to 'ci_weekly_meeting'
[14:01:22] :o
[14:01:38] I had no idea this was happening :P I am just always here!
[14:01:41] #link Agenda https://www.mediawiki.org/wiki/Continuous_integration/Meetings/2015-04-14
[14:01:54] addshore: that is merely a cabal meeting for now :d
[14:02:08] I haven't widely announced it to wikitech / wikidata / eng list etc
[14:02:21] #topic actions retrospective
[14:02:22] Hi
[14:02:27] so
[14:02:41] last week I listed a bunch of actions on my side; here is the update
[14:03:15] #info CI isolation architecture document not updated. Still pondering with Chase / Andrew B. where to place servers
[14:03:59] #info Zuul package deployed on all labs instances. Switch of the Zuul scheduler on gallium should occur on Thursday https://gerrit.wikimedia.org/r/#/c/202714/
[14:04:18] Antoine to mail QA/labs about the DNS resolver and puppet master corruption on the integration labs project
[14:04:19] bah
[14:04:37] pm corruption?
[14:05:02] yeah we had a crazy disruption last week
[14:05:12] the DNS resolver has been changed
[14:05:32] which resolved labs instance names with the project name inserted in their fully qualified domain name
[14:05:51] so integration-puppetmaster.eqiad.wmflabs became integration-puppetmaster.INTEGRATION.eqiad.wmflabs
[14:06:12] since the puppet cert names are based on the DNS name, that caused the SSL cert to no longer be recognized
[14:06:25] it also made the puppet master revert itself back to a single puppet client
[14:06:39] OK
[14:06:41] beta was not impacted originally because ops/puppet was stalled
[14:06:52] So which one are we using now
[14:07:02] as soon as I unblocked it, the faulty puppet change spread to beta cluster instances and caused the same issue
[14:07:06] anyway all solved!
[14:07:09] Cool
[14:07:19] the status is that we are still using the old dnsmasq server
[14:07:50] OK
[14:08:00] another side effect of the new one is that *.beta.wmflabs.org DNS queries done from labs instances no longer yielded the IP of the instance but the public IP
[14:08:06] and the public IP is not reachable by labs instances
[14:08:20] that made any jobs depending on beta fail - such as browser tests
[14:08:26] that was rather crazy
[14:09:00] Krinkle: have you written anything about the CI board columns?
[14:09:04] Not yet
[14:09:12] Wanted to discuss with you first
[14:09:40] #link https://phabricator.wikimedia.org/tag/continuous-integration/ CI workboard and its columns
[14:09:50] so
[14:10:15] filed tasks come up with a priority of "Needs Triage"
[14:10:17] (open tasks: https://phabricator.wikimedia.org/project/sprint/board/401/)
[14:10:35] Yeah, but in cases of tasks for other projects, we do not own the priority
[14:10:42] CI is supplemental
[14:10:45] but anyone can set the priority for us and they don't show up nicely in the workboard by default. So I love our manual Needs Triage column
[14:11:02] Okay :)
[14:11:16] the order Untriaged -> Backlog -> Next -> In-Progress -> Done -> Externally blocked
[14:11:27] that looks very linear and self-explanatory :D
[14:11:44] Great!
[14:11:58] hashar: So how to deal with them
[14:11:58] I usually look at the open tasks and filter them by priority
[14:12:04] i.e. https://phabricator.wikimedia.org/project/sprint/board/401/query/open/?order=priority
[14:12:05] e.g. when there is a new task, when do you move it to Backlog
[14:12:15] Yeah, me too
[14:12:15] so at the top of each column are tasks with an untriaged priority
[14:12:42] I wish we could set it so that a task with untriaged priority could not be moved out of the untriaged column
[14:12:52] No, that's wrong
[14:13:06] Well, kind of
[14:13:12] the other way around is wrong
[14:13:16] it's not mutually exclusive
[14:13:41] If it's an external task, they should triage it themselves before we act
[14:13:48] so we'd move it to Backlog and leave Needs Triage for them.
[14:13:59] If it's an internal task, we should triage before moving to Backlog of course
[14:14:10] ahhh
[14:14:25] So it's up to us. Not sure we need a technical restriction in place.
[14:14:40] I was merely ranting :D
[14:14:58] in my career I have been used to ticket systems that are very restrictive and enforce a specific workflow
[14:14:59] Wanna walk through the 30 ones quickly? Maybe without MeetBot
[14:15:07] Phabricator is way more liberal, which is nice
[14:15:32] yeah, Bugzilla had a status matrix restriction
[14:15:34] let's create a MeetBot topic
[14:15:40] k
[14:15:55] #topic Tasks triage
[14:16:04] wanna take the lead Krinkle ?
[14:16:08] Sure
[14:16:30] https://phabricator.wikimedia.org/T95912: Diamond metrics for cpu.system suddenly up 100% after a reboot
[14:16:36] Maybe #link the tasks :D
[14:16:44] #link https://phabricator.wikimedia.org/T95912: Diamond metrics for cpu.system suddenly up 100% after a reboot
[14:16:50] Well, I was only going to link them if there's something interesting to be said.
[14:16:56] ok
[14:17:09] Most of this is just you and I catching up on each other's task reports
[14:17:20] so is that 100% CPU actually happening on the instance?
[14:17:25] No
[14:17:41] It's Diamond/Graphite/something being silly
[14:17:43] the graphite / statsd whatever system has been changed recently
[14:17:45] might be related
[14:17:50] This is from after that
[14:18:01] Or rather before and after
[14:18:05] you can see a small gap
[14:18:14] https://phab.wmfusercontent.org/file/data/v6rouywdgjfrmntbbpbv/PHID-FILE-5s7vf2arnz6fjkspmaam/capture.png
[14:18:20] 04/11 a tiny slice is missing
[14:18:22] that's the switch
[14:18:37] no difference. all other instances are fine
[14:18:53] this is a report for upstream labs affecting us. Just wanted to make sure you're aware before I move it out of triage.
[14:19:02] (should we have a policy like no self-triage?)
[14:19:27] I have no clue what is going on to be honest
[14:19:49] might be worth querying ops. I think it is godog that is responsible for statsd/graphite
[14:20:14] I don't always report it, but I see this shit almost every week. As far as I am concerned we do not currently have reliable monitoring. Which makes things hard.
[14:20:26] the pings we get are useless, and the pings we need we don't get.
[14:20:40] It always goes crazy.
[14:20:54] I wish we just had simple ganglia. CPU, memory, network, disk. Basic stuff. Nothing fancy.
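(As an aside, one way to sanity-check a report like T95912 is to pull the raw series from Graphite and compare it across the 04/11 switch. A minimal sketch in Python follows; the render endpoint host and the metric path are assumptions for illustration, not values taken from this log.)

    import requests

    # Assumed Graphite render endpoint for labs metrics and an assumed metric
    # path -- both are placeholders, adjust to the real names.
    GRAPHITE = "https://graphite-labs.wikimedia.org/render"
    TARGET = "integration.integration-slave1402.cpu.total.system"

    resp = requests.get(GRAPHITE, params={
        "target": TARGET,
        "from": "-14d",      # two weeks back, covering before/after the switch
        "format": "json",
    })
    resp.raise_for_status()

    # Graphite returns a list of series; datapoints are [value, timestamp] pairs.
    for series in resp.json():
        values = [v for v, _ts in series["datapoints"] if v is not None]
        if values:
            print(series["target"], "min:", min(values), "max:", max(values))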
[14:21:02] well on the graph you see it magically recovers, but I guess that is after the reboot
[14:21:07] Yes
[14:21:11] maybe diamond has an issue there
[14:21:12] The daemon was probably broken
[14:21:24] Which happens often
[14:21:29] Like https://phabricator.wikimedia.org/T91351
[14:21:35] But in those cases they are all crazy
[14:21:37] not just one property
[14:21:43] Here it was just cpu.system somehow
[14:21:45] but then it should just send raw metrics. Maybe that is an issue with statsd, which more or less aggregates metrics before sending them to graphite
[14:22:07] ah
[14:22:17] imho both bugs are the same
[14:22:45] Reminding myself not to drag and drop tasks when sorting=priority
[14:22:47] I would update the oldest one https://phabricator.wikimedia.org/T91351 with data from https://phabricator.wikimedia.org/T95912
[14:22:48] Phabricator is shit
[14:23:15] maybe we can enable some debug logs in diamond to assist
[14:23:37] They are separate tasks. One is about the data being bogus and constant until the third reboot. The other is an instance that had been running for weeks suddenly having one cpu property go bogus
[14:23:51] Let's move on.
[14:24:02] I've triaged as normal and moved to externally blocked. It's not a priority for us I think?
[14:24:08] nop
[14:24:16] mostly reported it in case it happens again
[14:24:17] just have to poke ops about it
[14:24:21] fair
[14:24:25] next!
[14:24:55] #agree diamond/statsd issue is external. Should poke ops about it.
[14:25:00] #agreed diamond/statsd issue is external. Should poke ops about it.
[14:25:13] #link https://phabricator.wikimedia.org/T86544: Write test to ensure all mw.hooks are documented
[14:25:19] (going to oldest first, to get the queue empty)
[14:25:44] I left a comment there last month so I guess it's "triaged"
[14:25:57] so it is to make sure that all introduced hooks are properly documented, right?
[14:26:05] Yeah
[14:26:15] sounds like a lint check
[14:26:19] Probably needs someone to write a jscs plugin to scan for it
[14:26:41] unless we have someone willing to write the code, I think we should decline it
[14:27:08] Hm.. good point
[14:27:12] or set priority lowest with no assignee. Which means the task will bit rot for a couple of years and end up being closed
[14:27:17] I tend to close early
[14:27:28] I liked "needs volunteer"
[14:27:32] Lowest is useless
[14:27:37] yup
[14:28:09] there is a tag for needs volunteer
[14:28:17] Hm.. instead of declining, maybe we can remove the #contint project
[14:28:28] jzerebecki: is that at all effective?
[14:28:32] the task itself is valid for MediaWiki
[14:28:52] maybe it can be moved to MediaWiki-Unit-Test
[14:29:06] Well, it's not a unit test
[14:29:15] not sure whether there is a MediaWiki-Code-Quality project :D
[14:29:41] hashar: for querying it is; I usually OR it with the "easy" tag. As in magically appearing people solving all the tickets? there are never enough :)
[14:29:48] so: figure out a new component, remove #contint from it and tell the opener to find some JavaScript developers to pursue the hints you gave
[14:30:09] jzerebecki: sounds good! I should use those tags a bit more
[14:30:32] Krinkle: just #agreed with whatever you want to do. Moving it out of CI sounds good to me
[14:30:47] hashar: What are the MeetBot commands?
[14:30:52] Krinkle: https://wiki.debian.org/MeetBot
[14:30:56] Krinkle: should be accurate
[14:31:24] usually you start a #topic, then post #info and #link entries, discuss it, and end with an #agreed or an #action
[14:31:26] more or less
[14:31:45] Hm.. OK. Was just curious about the tree structure between #link and #agreed
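(Going back to T86544, the mw.hook documentation check: the check itself is easy to sketch. The snippet below only illustrates the idea — a real implementation would presumably be the jscs plugin suggested above, and the documentation file path and format here are made up for the example.)

    import re
    from pathlib import Path

    # Hypothetical list of documented hook names, one per line; the real
    # location and format of the mw.hook documentation is not specified here.
    documented = set(Path("docs/js-hooks.txt").read_text().split())

    # Find every mw.hook( 'name' ).fire( ... ) call in the JavaScript sources.
    fire_re = re.compile(r"""mw\.hook\(\s*['"]([^'"]+)['"]\s*\)\.fire\(""")

    undocumented = set()
    for js_file in Path("resources").rglob("*.js"):
        for name in fire_re.findall(js_file.read_text(errors="ignore")):
            if name not in documented:
                undocumented.add(name)

    if undocumented:
        print("Undocumented mw.hook events:", ", ".join(sorted(undocumented)))
        raise SystemExit(1)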
[14:32:02] #link https://phabricator.wikimedia.org/T62720: Jenkins should run tests for the Wikipedia app before merge
[14:32:08] !!
[14:32:35] I think #agreed items end up at the bottom of the summary together with #action items
[14:32:55] So it loses context of the nearest link?
[14:33:00] so that task is blocked on https://phabricator.wikimedia.org/T88494 Android app build: Gradle checkstyle + app build
[14:33:17] Ah, right.
[14:33:20] #info https://phabricator.wikimedia.org/T88494 "Android app build: Gradle checkstyle + app build" is a blocker
[14:33:20] I see you already handled it
[14:33:22] cool
[14:33:30] I exchanged a few emails with mobile
[14:33:31] but nothing more
[14:33:39] bah
[14:33:45] Yeah..
[14:33:46] next?
[14:33:50] I guess we can delegate the gradle setup to them
[14:34:05] oh, the blocker is in #contint too
[14:34:07] I didn't see that
[14:34:10] #action Antoine to reach out to the mobile team and let them set up the gradle / Android SDK stuff on CI labs slaves
[14:34:25] Yeah, they've been asking for a while
[14:34:27] cool
[14:34:39] hashar: So what about the "Add Jenkins jobs for x" type of tasks
[14:34:53] I have added the task to my todo list
[14:35:10] for people wanting to add jobs, I point them to the tutorial
[14:35:21] should we move them to Next for a CI volunteer (you, lego, me) or to Backlog for them to write a patch?
[14:35:21] well if they are from wmf / wmde
[14:35:26] there's like 20 of those tasks
[14:35:32] Okay
[14:35:50] does Phabricator have any template system to easily send bulk answers to similar tasks?
[14:35:58] Nope
[14:36:12] https://www.mediawiki.org/wiki/Continuous_integration/Tutorials
[14:36:25] I haven't reviewed those tutorials since the jobs renaming
[14:36:38] * Krinkle switched back to https://phabricator.wikimedia.org/project/sprint/board/401/query/open/?order=natural because order=priority makes it hard to find the oldest and Phabricator is very fragile with changing priority when dragging
[14:36:51] k
[14:36:54] https://phabricator.wikimedia.org/T69722
[14:36:59] (jobs for the SwiftMailer extension)
[14:37:28] #info for tasks "Add Jenkins jobs for X", point folks to https://www.mediawiki.org/wiki/CI/JJB and https://www.mediawiki.org/wiki/Continuous_integration/Tutorials
[14:37:28] Ah, yeah. Would be good to review those tutorials
[14:37:40] let's file a task :)
[14:37:48] doing so
[14:37:59] hashar: OK. I'll triage them in the meanwhile
[14:39:00] #link https://phabricator.wikimedia.org/T96024 Review CI tutorials on mediawiki.org
[14:39:10] here is the new task
[14:39:12] hashar: https://phabricator.wikimedia.org/T69722
[14:39:19] Is that a good comment?
[14:39:31] here's the next one https://phabricator.wikimedia.org/T93274
[14:39:37] Krinkle: that looks excelent to me
[14:39:44] (with two L yeah sorry)
[14:39:44] OK. I'll copy then :)
[14:40:25] https://phabricator.wikimedia.org/T93274
[14:40:34] I am wondering whether there is a Phabricator component for that extension
[14:41:33] Nope :-(
[14:41:37] hashar: https://phabricator.wikimedia.org/tag/mediawiki-extensions-oauth/ ?
[14:41:45] That's a different one
[14:41:46] https://www.mediawiki.org/wiki/Extension:OAuthAuthentication
[14:41:48] jzerebecki: I am not sure that is the same :(
[14:42:06] there is also OATHAuth
[14:42:20] Woah
[14:42:22] #link https://phabricator.wikimedia.org/T70113: Alert when time to merge exceeds a known limit
[14:42:41] hashar: you are right, different one
[14:42:57] for the oauth something, maybe figure out who the authors are and add the related project team. I suspect it would be mediawiki-core
[14:43:15] the time to merge is an attempt to detect that CI is stalled somehow
[14:43:24] we have a bunch of data in metrics we can potentially alert on
[14:44:15] greg filed a couple or more tickets and I pasted nice graphs showing potential anomalies. Not sure on what, and at which threshold, to send an alarm
[14:44:15] Time to actual merge in Gerrit is probably too much to measure and prone to error and false positives.
[14:44:20] yup
[14:44:22] Measuring queue time in Zuul should be more direct
[14:44:27] especially with the dependent queue
[14:45:01] surely if we have too many tasks waiting in the gearman queue, that is an issue
[14:45:07] Yeah, but we can differentiate between 'queued' and 'running' or "ready but waiting for dependency"
[14:45:08] the link in https://phabricator.wikimedia.org/T93274 does not explain what Jenkins job is missing, but actually shows a Jenkins job as running
[14:45:11] the last one is fine
[14:45:27] so in short: no bandwidth to think about it
[14:45:29] if it's queued for long or stuck running, that's something we could maybe extract
[14:46:16] jzerebecki: oh correct
[14:46:33] jzerebecki: so raymond created the task, I probably noticed it and created the job. Made it non-voting because it was failing
[14:46:38] jzerebecki: and I forgot to close the bug
[14:47:15] that would explain it
[14:47:19] hashar: So if we do figure out the exact metric, how would we get alerts on IRC and email?
[14:47:25] shinken?
[14:48:14] I have closed https://phabricator.wikimedia.org/T93274 thanks jzerebecki !
[14:48:22] yeah shinken
[14:48:49] there is a nagios/icinga/shinken plugin that is able to query some graphite metric and report an alarm based on a threshold of X value over Y time range
[14:48:51] or something along those lines
[14:49:05] so if you have more than 10 gearman functions waiting for 10 minutes, you get an alarm
[14:49:18] Yeah
[14:49:23] or if you get more than 50 gearman functions waiting in 2 minutes you raise a critical alarm
[14:49:26] something like that
[14:49:26] Like we have for the diamond metrics for our instances
[14:49:31] exactly
[14:49:38] I guess they both end up in graphite
[14:49:38] cool
[14:50:00] so I think the idea for production is to eventually get rid of ganglia and of custom check scripts
[14:50:06] and rely more on diamond / graphite checks
[14:50:27] anyway for us, that needs someone to look at what we want to monitor
[14:50:35] figure out the metrics that could be monitored
[14:50:40] and then define thresholds and alarm levels
[14:50:58] #agreed We need to figure out what metric and threshold, then we can use Shinken to monitor the Graphite query
[14:50:58] it is nowhere near rocket science but needs a good amount of time to think about it
[14:51:03] yeah
[14:51:13] 25 left
[14:51:16] hehe
[14:51:25] maybe we should triage more often :D
[14:51:43] I have a meeting with the Team Practices Group in 9 minutes so I cannot extend
[14:51:44] We just have a backlog since this is the first
[14:51:49] yeah
[14:51:51] We'll catch up next week
[14:51:56] Let's do one or two more
[14:52:04] Wanna pick?
[14:52:26] #link https://phabricator.wikimedia.org/T91697 Launching Jenkins slave agent fails with "java.io.IOException: Unexpected termination of the channel"
[14:52:30] I have picked a random one :D
[14:52:42] okay
[14:52:48] so
[14:53:02] the good thing with Java is that the stack traces are often very detailed
[14:53:14] the drawback is that I don't know Java nor the Jenkins code base
[14:53:28] maybe it is just the SSH connection dying somehow
[14:53:33] yeah
[14:53:34] or a timeout
[14:53:40] it happens quite often when relaunching the slave agent
[14:53:44] After like 1 or 2 seconds
[14:53:46] does it have any impact besides having to relaunch the connection?
[14:53:53] just try again
[14:54:00] but it can take many tries sometimes
[14:54:06] annoying but survivable I guess
[14:54:07] Worrying about what happens if one doesn't come back up
[14:54:16] your best bet would be to file the ticket in the upstream Jira :(
[14:54:27] https://issues.jenkins-ci.org/
[14:54:28] They would close it as can't reproduce
[14:54:36] yeah, sometimes
[14:54:39] but at least you tried :)
[14:55:00] there is a slight chance to get a "more information" request or even a patch
[14:55:07] oh java.io.EOFException
[14:55:18] that is the communication channel receiving 'end of file' I guess
[14:55:24] well
[14:55:43] so I would file a ticket upstream and close ours
[14:55:48] or just close it :/
[14:56:01] OK. I'll file upstream at some point. Just tagged as Upstream and externally blocked for now
[14:56:06] it's survivable, as you say
[14:56:20] #link https://phabricator.wikimedia.org/T94273 dvipng spurts coredump on Precise instance
[14:56:51] I have no clue what dvipng is
[14:56:55] #info core dump generation has been enabled on CI labs slaves during summer 2014 while HHVM was still segfaulting a lot. No longer used nowadays though.
[14:56:57] so
[14:56:59] how does it relate to mysql
[14:57:09] dvipng is used to render math expressions as png files
[14:57:13] used by MathSearch
[14:57:21] and something weird happens in it that causes it to segfault
[14:57:27] our slaves are using Precise
[14:57:27] ok
[14:57:31] prod is most probably using Trusty
[14:57:45] there is no package providing the debugging symbols so the core is more or less useless
[14:57:54] and I am not willing to spend time investigating it
[14:58:01] the only reason it was filed was /var/ being filled up
[14:58:02] Let's disable core dumps?
[14:58:04] yeah
[14:58:13] glad to see we are in the same mood
[14:58:17] is it just the line in bin/ ?
[14:58:21] How do you disable it?
[14:59:02] https://github.com/wikimedia/integration-jenkins/search?utf8=%E2%9C%93&q=core+dump&type=Code
[14:59:11] $wgDjvuDump = '/usr/bin/djvudump';
[14:59:12] -vDebug.CoreDumpReportDirectory="$LOG_DIR" \
[14:59:13] interesting
[14:59:45] https://github.com/wikimedia/integration-jenkins/blob/master/bin/mw-set-env.sh#L9
[15:00:35] #agreed disable core dump generation on CI slaves https://phabricator.wikimedia.org/T96025
[15:00:41] there is an ulimit somewhere
[15:01:00] OK
[15:01:10] Cool.
[15:01:26] This has been good
[15:01:33] end ?
[15:01:35] I am in a conf call
[15:01:38] :D
[15:01:39] #end?
[15:01:43] #endmeeting
[15:01:43] Meeting ended Tue Apr 14 15:01:43 2015 UTC. Information about MeetBot at http://wiki.debian.org/MeetBot . (v 0.1.4)
[15:01:43] Minutes: https://tools.wmflabs.org/meetbot/wikimedia-office/2015/wikimedia-office.2015-04-14-14.01.html
[15:01:43] Minutes (text): https://tools.wmflabs.org/meetbot/wikimedia-office/2015/wikimedia-office.2015-04-14-14.01.txt
[15:01:43] Minutes (wiki): https://tools.wmflabs.org/meetbot/wikimedia-office/2015/wikimedia-office.2015-04-14-14.01.wiki
[15:01:44] Log: https://tools.wmflabs.org/meetbot/wikimedia-office/2015/wikimedia-office.2015-04-14-14.01.log.html
[15:01:44] thanks !
[15:01:49] will write the summary tonight I guess
[15:01:55] Okay!
[15:02:21] Krinkle: jzerebecki thanks!
[18:51:25] We will be starting the tech talk "The state of Team Health across Wikimedia Engineering" by the TPG in 10 min
[19:00:23] starting in a moment :)
[19:03:52] live in 1 min
[19:05:16] starting now.....
[19:05:43] questions will be at the end, but please ping me if you have anything you would like me to ask
[19:05:57] rfarrand: I'm in the hangout, but nobody else is there, should I just go to YT?
[19:06:20] bgerstle: yes!
[19:06:22] sorry about that
[19:06:25] k
[19:06:35] audio comments appreciated, thanks!
[19:07:08] Can someone post the YT link? thanks
[19:07:14] FYI there's a really cool "interactive" G+ page for the livestream: https://plus.google.com/u/0/hangouts/onair/watch?hid=hoaevent/crrengrbtbca8ncndpv3ic4gdrc&ytl=-YfkxpJTuY4&wpsrc=yta
[19:07:14] it is in the topic
[19:07:17] awight: ^
[19:07:25] thanks!
[19:07:35] http://www.youtube.com/watch?v=-YfkxpJTuY4
[19:08:45] meh, the G+ thing isn't really anything special, go to YT ^
[19:16:16] 17 remote people watching :)
[19:20:16] o/
[19:20:24] (has taken HC survey)
[19:31:36] questions
[19:32:06] start asking now if you have them :)
[19:32:46] where can we read about these plans to resolve technical debt?
[19:33:03] will ask
[19:34:41] :D
[19:34:45] last chance!
[19:35:25] rfarrand: any blog posts on results ?
[19:35:33] will ask :)
[19:37:28] hi KatyLove
[19:37:37] Hi matanya !
[19:37:44] :)
[19:38:06] thanks for joining! :)
[19:38:11] 18 people remote!
[19:38:37] yay kristen!
[19:38:39] about 20 people in SF
[19:38:51] 38 people = big tech talk!