[01:56:01] Hi milimetric--still in work mode? just wondering if you might have any thoughts about tests hanging on async_result.get() in test_cohorts...
[01:56:36] Here's the last bit of the stack trace I get when I hit ctrl-C:
[01:56:36] File "/vagrant/wikimetrics/tests/test_controllers/test_cohorts.py", line 87, in test_detail_by_name_after_async_validate
[01:56:36] async_result.get()
[01:56:36] File "/usr/local/lib/python2.7/dist-packages/celery/result.py", line 169, in get
[01:56:36] no_ack=no_ack,
[01:56:37] File "/usr/local/lib/python2.7/dist-packages/celery/backends/base.py", line 220, in wait_for
[01:56:39] time.sleep(interval)
[09:08:30] AndyRussG: that sounds like a deadlock on wikimetrics tests
[09:09:06] take a look at this: https://www.mediawiki.org/wiki/Analytics/Wikimetrics/FAQ#Tests_just_hang_or_fail_due_to_queue_issues.2C_what_do_I_do.3F
[09:32:06] (PS15) Nuria: Add ability to tag a cohort [analytics/wikimetrics] - https://gerrit.wikimedia.org/r/133091 (owner: Terrrydactyl)
[09:34:46] (CR) Nuria: [C: 2] Add ability to tag a cohort [analytics/wikimetrics] - https://gerrit.wikimedia.org/r/133091 (owner: Terrrydactyl)
[11:28:41] nuria: around the icinga alert.
[11:28:47] Can you restart the consumer?
[11:28:54] What does the log say?
[11:29:39] Can it be related to ottomata's reshuffling of kafka brokers?
[11:29:52] mmmm...
[11:30:00] Is it just the kafka consumer or are other consumers stopped as well?
[11:30:06] Driving tcp://127.0.0.1:8600?socket_id=kafka -> kafka://eqiad?brokers=analytics1021.eqiad.wmnet,analytics1022.eqiad.wmnet&topic=eventlogging-00..
[11:30:06] No handlers could be found for logger "kafka"
[11:30:44] No handlers found ... :-)
[11:30:48] Interesting.
[11:31:06] Let me check EventLogging ... I only saw that mongodb went away ... maybe kafka with it?
[11:31:14] but boy these logs w/o timestamps ...
[11:31:42] no wait this is an old log
[11:31:51] I think we can live with the kafka/consumer being down.
[11:31:58] Are other consumers working as expected?
[11:32:01] i need to leave but i shall be back and will look for a recent log
[11:32:38] never mind, error is the same in most recent log:
[11:32:39] Driving tcp://127.0.0.1:8600?socket_id=kafka -> kafka://eqiad?brokers=analytics1021.eqiad.wmnet,analytics1022.eqiad.wmnet&topic=eventlogging-00..
[11:32:39] Driving tcp://127.0.0.1:8600?socket_id=kafka -> kafka://eqiad?brokers=analytics1021.eqiad.wmnet,analytics1022.eqiad.wmnet&topic=eventlogging-00..
[11:32:39] No handlers could be found for logger "kafka"
[11:32:39] Driving tcp://127.0.0.1:8600?socket_id=kafka -> kafka://eqiad?brokers=analytics1021.eqiad.wmnet,analytics1022.eqiad.wmnet&topic=eventlogging-00.
[11:33:57] Are the other consumers working?
[11:34:29] :w(
[11:34:36] s/w/-/
[11:37:57] Just to have this logged somewhere:
[11:38:11] Events are still coming through to vanadium.
[11:38:17] Graphite reports usual rates
[11:38:27] Events make it to the database in usual rates
[11:40:17] So the statsd reporter seems to be up
[11:40:32] the all-events multiplexer seems to be up
[11:40:46] the mysql-db10* consumer seems to be up
[11:41:17] the graphite consumer seems to be up
[11:42:13] mongodb consumer has been turned off. That's ok.
[11:42:49] So what could be possible is that writing logs could be affected in addition to the kafka consumer
[11:46:23] As there is no HDFS import from kafka, we can live with the kafka consumer being down
[11:46:55] Asking for help around verifying that logs get written in ops channel. _joe_ is on rt duty.
[11:56:29] hi qchris, I caught up on the above
[11:56:50] Hi milimetric
[11:56:57] we should try to claim this, either in icinga or with an email to ops
[11:57:05] nuria received an alert, but it seems she had to run, so I took over.
[11:57:10] yep, i saw
[11:57:18] apergos is helping in ops channel.
[11:57:24] k
[12:01:47] milimetric: Tried to claim the alert in icinga.
[12:01:52] But it did not let me.
[12:01:58] I'll send an email to ops list.
[12:02:04] And claim that way.
[12:02:15] ok, we should figure out how to make icinga let us claim things
[12:02:18] ori can claim
[12:03:08] Meh. I need not get claim permission or nothing. Those who get alarms should be able to do it.
[12:03:27] I refuse to try to get alarms until it is settled what the expectations from us are.
[12:29:16] morning all :)
[12:34:33] So my Wikimetrics tests are running on Vagrant, found a silly mistake I was making after checking out the celery log (thanks to Nuria for the link)
[12:34:56] At the end of the test run there is an encouraging "OK" message
[12:35:42] Still get lots of error messages popping up during the test run, though, so I'm not sure it's really totally "OK" 8p
[12:35:46] http://etherpad.wikimedia.org/p/Wj4HcOWoHk
[12:36:16] AndyRussG: Ja. Sadly enough, those are ok.
[12:36:24] :)
[12:36:33] I welcome any patches that remove those messages
[12:36:41] Hi qchris and milimetric
[12:36:48] Ah OK thanks a lot :)
[12:36:49] Hi AndyRussG :-)
[12:36:49] basically, logging and nose get along like pirates and the english
[12:37:30] Ah heh
[12:37:44] I'm OK with a pirate definition of OK
[12:37:55] milimetric, nose kills them, except when they're victimising the spanish, at which point it becomes okay and patriotic?
[12:38:02] that seems unfair and kind of mean to nuria ;p
[12:38:35] :)
[12:39:42] Like old pirates or modern-day pirates?
[12:40:09] I hear Canada is a haven for digital pirates
[12:40:13] Logging is digital, right?
[12:40:27] no this is definitely old pirates
[12:40:43] like nose runs, and then logging says something about how nose is a scurvy dog
[12:41:45] Ah... then nose continues illegally ripping the DRM'd, copyrighted logs?
[12:41:47] and nose is all 'arr, tis a keel-hauling for yer, matey!' and then they fight
[12:41:53] I mean, more.
[12:46:54] I could add a bit to the FAQ about it
[12:53:11] ottomata: eventlogging's kafka part caused issue today (yesterday?)
[12:53:19] Are there any plans for using this soonish,
[12:53:24] or can we just turn it off?
[12:53:37] hm, we can just turn it off, i know of no plans right now
[12:53:39] but, it should work...
[12:53:53] buuuut, i do plan on doing some kafka failover tests soon
[12:53:58] so maybe better to turn it off, dunno
[12:54:13] By what nuria said, it tried to connect to "kafka://eqiad?brokers=analytics1021.eqiad.wmnet,analytics1022.eqiad.wmnet&topic=eventlogging-00"
[12:54:17] (yes only two brokers)
[12:54:24] and failed.
[12:54:32] that's fine, it should fail on 21 and succeed with 22
[12:54:43] if it was a nice kafka client :)
[12:54:47] Ha!
[12:54:57] That's good to hear.
[12:55:07] gonna try to start it
[12:55:12] if I know how...
[12:55:29] There should be an upstart config for it.
[12:55:34] on vanadium.
[12:55:58] hm, ohmahgoodness you can do nested upstart configs?
[12:56:02] i see eventlogging dir there
[12:56:27] It could be called "consumer-kafka" or something.
[12:56:45] something having kafka in its name.
[12:56:48] nope
[12:56:49] ls -R /etc/init | grep -i kafka
[12:57:11] maybe all consumers are just controlled by single consumer script?
[12:57:49] ha, no idea how to use a nested upstart dir :p
[12:58:04] Mhmm... rereading puppet files.
[12:58:43] eventlogging::service::consumer "notify => Service['eventlogging/init']"
[12:58:51] ha
[12:58:54] yeah that doesn't work on cli
[12:58:58] that service name like that anyway
[12:59:08] ok
[12:59:10] eventloggingctl
[12:59:11] maybe
[12:59:36] don't see kafka listed in status
[12:59:38] iuuunnoooo
[13:00:11] * qchris has no clue either.
[13:01:18] hmm, let's just remove it for now
[13:01:18] eh?
[13:01:24] k.
[13:01:35] I'll prepare a patch.
[13:03:01] https://www.mediawiki.org/w/index.php?title=Analytics%2FWikimetrics%2FFAQ&diff=1034540&oldid=955022
[13:03:19] oh
[13:03:22] qchris i'm already on it!
[13:03:33] ottomata: Even better :-) Thanks.
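The failover behaviour described in this exchange (a well-behaved client should fail on analytics1021 and succeed with analytics1022) can be sketched in Python roughly as follows. This is only an illustration, not EventLogging's actual consumer code: `parse_brokers`, `connect_first_available` and `fake_connect` are hypothetical names, and `fake_connect` merely simulates one broker being down where a real client would open a network connection.

```python
from urllib.parse import parse_qs, urlsplit


def parse_brokers(uri):
    """Return the broker host names listed in the URI's `brokers` parameter."""
    query = parse_qs(urlsplit(uri).query)
    return query["brokers"][0].split(",")


def connect_first_available(brokers, connect):
    """Try each broker in order; return (broker, connection) for the first success."""
    failed = []
    for broker in brokers:
        try:
            return broker, connect(broker)
        except OSError:
            failed.append(broker)  # remember the failure and try the next broker
    raise ConnectionError("no broker reachable: " + ", ".join(failed))


def fake_connect(broker):
    """Hypothetical stand-in for a real connection: pretend analytics1021 is down."""
    if broker.startswith("analytics1021"):
        raise OSError("connection refused")
    return "connected to " + broker


# URI as quoted in the log above.
URI = ("kafka://eqiad?brokers=analytics1021.eqiad.wmnet,"
       "analytics1022.eqiad.wmnet&topic=eventlogging-00")
```

Calling `connect_first_available(parse_brokers(URI), fake_connect)` skips the first broker and returns the second one together with its connection; only when every broker fails does the caller see an error.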
[13:03:38] https://gerrit.wikimedia.org/r/#/c/139101/
[13:05:01] qchris: yes, the other consumers are working
[13:05:27] nuria: Thanks :-)
[13:06:04] so i guess we wait for otto, right?
[13:07:09] ottomata is already here.
[13:07:14] He is just about to turn the consumer off.
[13:07:24] nuria: https://gerrit.wikimedia.org/r/#/c/139101/
[13:07:42] nuria, not that it shouldn't work
[13:07:44] it should work right now
[13:07:49] but, i am going to do things that will make it not work in the future
[13:07:52] failover tests, etc.
[13:07:57] so it'll be nice to just not have it flap for now
[13:08:24] let me read the gerrit change and your doc about the service issues
[13:08:48] ok i see , disabling for now
[13:09:00] yeah
[13:10:05] icinga reports as OK again.
[13:10:09] Thanks ottomata \o/
[13:10:53] AndyRussG, haha
[13:11:09] :)
[13:11:55] yup
[13:13:30] stress + no breakfast + cold coffee = indiscriminate subjecting of innocent fellow human beings to untold bad jokes
[13:16:15] ottomata: we would need to disable the alarm too, right?
[13:16:42] AndyRussG, sorry about the stress! Go get breakfast :)
[13:16:45] not sure how that works, but i think its ok?
[13:16:59] ori's script looks through the configured consumers and creates alerts based on them
[13:17:08] rather than each consumer alert being defined explicitly in puppet
[13:17:24] not sure how it will tell icinga that things are ok if a config file goes missing, maybe it will just stop reporting it? not sure.
[13:17:29] anyway, it says things are ok now :)
[13:18:24] Ironholds: thanks!
[13:21:11] The script creates alarms on the fly?
[13:21:16] No.
[13:21:25] Well ... alarms: yes
[13:21:29] It has to :-)
[13:21:34] while(qchris_needs_sleep){
[13:21:38] distract_with_alarms()
[13:21:39] }
[13:21:40] ;p
[13:21:54] But there is only one service definition.
[13:22:03] is that script also in puppet?
[13:22:12] And this single service checks all eventlogging consumers.
[13:22:17] nuria: yes.
[13:22:38] nuria: oh wait. puppet or the extension.
[13:22:42] Let me find it.
[13:23:41] modules/eventlogging/files/check_eventlogging_jobs
[13:24:31] https://git.wikimedia.org/blob/operations%2Fpuppet/f85b1dbcd61bbb58684ff93704c1804e808a5d6e/modules%2Feventlogging%2Ffiles%2Fcheck_eventlogging_jobs
[13:24:34] ^ nuria
[13:24:35] ya i saw it but
[13:24:45] what i do not understand is how monitor.pp
[13:24:50] gets executed
[13:25:02] ahhh
[13:25:05] Which monitor.pp?
[13:25:11] it must be inclusion of the class
[13:25:29] qchris: monitor.pp is the one that actually executes that script
[13:26:04] with this syntax ... source => 'puppet:///modules/eventlogging/check_eventlogging_jobs',
[13:26:11] nuria: There are 5 monitor.pp, but none is for eventlogging.
[13:26:18] You mean nrpe::monitor_service ?
[13:26:30] noo..
[13:26:31] eventlogging::monitor
[13:26:54] on EL module
[13:27:00] anifests/monitor
[13:27:09] manifests/monitor
[13:27:13] That's monitoring.pp
[13:27:17] not monitor.pp
[13:27:23] ah sorry, yes
[13:28:21] so how does this: 'puppet:///modules/eventlogging/check_eventlogging_jobs',
[13:28:25] But that file is not responsible for this icinga alert.
[13:28:32] get executed periodically?
[13:28:54] Look in manifests/role/eventlogging.pp
[13:29:05] There you'll find the class role::eventlogging
[13:29:17] This class contains nrpe::monitor_service { 'eventlogging':
[13:29:28] which is the icinga check.
[13:29:35] And the role::eventlogging class gets included in
[13:29:57] ah i see
[13:30:01] manifests/site.pp
[13:30:07] for vanadium.eqiad.wmnet.
[13:30:27] so the monitoring.pp definition is not used?
[13:31:12] There are 9 monitoring.pp :-/
[13:31:39] modules/eventlogging/manifests/monitoring.pp does the graphite monitoring
[13:31:41] that you set up.
[13:32:02] and the eventlogging::monitoring class only brings files in place.
[13:32:37] ah so this "puppet:///modules/eventlogging/check_eventlogging_job" is just 'create the file there', i see thanks
[13:33:08] Cool :-)
[13:33:38] Ja, the wrapping "file { ...}" just makes sure the file is there
[13:33:44] (ensure => present)
[13:35:32] ok, so no need to turn off alarms then
[13:35:39] Right.
[13:37:23] Thanks for your patient explanations.
[13:38:52] I guess I was too harsh.
[13:38:54] Sorry nuria.
[13:39:06] I'll stay away from the computer till standup.
[13:39:12] i was saying for real ....
[13:40:17] I am slowly learning that puppet is the "mother of everything"...
[13:40:23] good to know
[13:56:23] (PS1) Milimetric: Take DOM manipulation out of tag code [analytics/wikimetrics] - https://gerrit.wikimedia.org/r/139104
[14:15:56] (PS2) AndyRussG: Improve server-side cohort upload form validation [analytics/wikimetrics] - https://gerrit.wikimedia.org/r/138151
[14:16:59] (CR) AndyRussG: "Fixed a test" [analytics/wikimetrics] - https://gerrit.wikimedia.org/r/138151 (owner: AndyRussG)
[14:18:49] (PS7) AndyRussG: WIP Create a cohort from campaign membership [analytics/wikimetrics] - https://gerrit.wikimedia.org/r/126927 (owner: Awight)
[14:18:51] (CR) jenkins-bot: [V: -1] WIP Create a cohort from campaign membership [analytics/wikimetrics] - https://gerrit.wikimedia.org/r/126927 (owner: Awight)
[14:58:25] heading to cafe, back in a bit
[17:05:55] [travis-ci] wikimedia/mediawiki-extensions-EventLogging#215 (wmf/1.24wmf9 - 6fc4c34 : Reedy): The build passed.
[17:05:55] [travis-ci] Change view : https://github.com/wikimedia/mediawiki-extensions-EventLogging/commit/6fc4c3437054
[17:05:55] [travis-ci] Build details : http://travis-ci.org/wikimedia/mediawiki-extensions-EventLogging/builds/27421559
[18:16:02] pshshhhh, i think that now that these brokers have 12 disks, they aren't blinking as much at all the traffic
[18:16:39] i'm going to add upload!
[18:16:44] gimme da trafiiiiiic
[18:21:12] ottomata, noooo ;p
[18:21:51] i'm doing failure tests
[18:21:55] so I want it to die :)
[18:21:56] sorta...
[18:22:05] it does that on its own! :P
[18:22:07] * Ironholds runs
[18:22:52] psshH!
[18:22:56] only if a disk dies!
[18:22:59] get outta herrreee
[18:28:45] yankee ;p
[18:30:39] who you callin yankee!?
[18:30:43] i hail from south of the mason dixon!
[18:31:11] 'get outta here'?
[18:31:15] (also: ooh, where?)
[18:32:05] ah ha
[18:45:31] ottomata!
[18:45:46] the constantly-dying-reducers query outputted an actual error message this time :D
[18:45:55] whatdya get?!
[18:47:26] I shall attach to the bug, but it claims to be "error while doing final merge"
[18:47:27] oh, bollocks
[18:47:44] I think this might /actually/ be "I spent tuesday kind of accidentally occupying all the space on stat2"
[18:48:02] * Ironholds sighs. I'll throw it in the bug anyway
[18:48:20] could be useful
[18:57:35] ha, that would be a lot of space
[18:58:43] yeah, tail-eating awk script. Long story ;p
[20:34:39] Ironholds: do you know the status of DarTar
[20:35:41] YuviPanda, he is an Italian gentleman in his mid-30s
[20:35:46] oh! sorry.
[20:35:47] married.
[20:35:51] ah
[20:35:52] damn
[20:35:58] Ironholds: is he in the office today?
[20:36:00] or around / available?
[20:36:04] He is, although I can't see him
[20:54:48] (PS1) Milimetric: Make alembic skip test migrations in prod [analytics/wikimetrics] - https://gerrit.wikimedia.org/r/139202 (https://bugzilla.wikimedia.org/65893)
[20:59:16] (CR) Milimetric: "Tested and found working in dev." [analytics/wikimetrics] - https://gerrit.wikimedia.org/r/139202 (https://bugzilla.wikimedia.org/65893) (owner: Milimetric)
[22:02:48] ori: yt?
[22:02:58] DarTar: hey
[22:03:04] are you in the office?
[22:03:08] yes, are you?
[22:03:30] yes, Ironholds and I are scribbling on a whiteboard to figure out how to use the moduleStorage log for something
[22:03:38] can we interrupt you for a moment?
[22:04:03] gimme 15 mins?
[22:04:10] totally
[22:04:11] :)
[22:29:28] Ironholds: allllmost there
[22:44:17] ori, cool!
[23:11:57] (CR) Awight: [C: 1] "Looks like an improvement." (4 comments) [analytics/wikimetrics] - https://gerrit.wikimedia.org/r/138151 (owner: AndyRussG)
[23:43:02] (PS8) AndyRussG: WIP Create a cohort from campaign membership [analytics/wikimetrics] - https://gerrit.wikimedia.org/r/126927 (owner: Awight)
[23:43:04] (CR) jenkins-bot: [V: -1] WIP Create a cohort from campaign membership [analytics/wikimetrics] - https://gerrit.wikimedia.org/r/126927 (owner: Awight)
[23:45:23] (CR) AndyRussG: "Still kinda messy... :)" [analytics/wikimetrics] - https://gerrit.wikimedia.org/r/126927 (owner: AndyRussG)
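Returning to the hang at the top of this log, where a test blocked indefinitely in async_result.get(): one common guard is to bound the wait, so that a stuck queue produces a quick failure instead of a hung test run. Celery's AsyncResult.get() accepts a timeout argument for this purpose. The sketch below shows the same pattern with the standard library's concurrent.futures so that it runs without a broker; `wait_with_timeout` and `demo` are hypothetical helper names, and none of this is taken from the wikimetrics tests themselves.

```python
import concurrent.futures
import threading


def wait_with_timeout(future, seconds):
    """Return (True, value) if the future finishes in time, else (False, None)."""
    try:
        return True, future.result(timeout=seconds)
    except concurrent.futures.TimeoutError:
        return False, None


def demo():
    """Submit a job that never finishes on its own and wait for it with a bound."""
    release = threading.Event()
    with concurrent.futures.ThreadPoolExecutor(max_workers=1) as pool:
        # release.wait blocks until the event is set, like a task no worker picks up.
        stuck = pool.submit(release.wait)
        outcome = wait_with_timeout(stuck, 0.1)
        release.set()  # unblock the worker so the pool can shut down cleanly
    return outcome
```

A bounded wait does not cure the underlying deadlock, but it turns a silent hang into a visible timeout that points at the queue.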