[08:50:10] (CR) Nuria: [C: 2] "Change works fine." [analytics/wikimetrics] - https://gerrit.wikimedia.org/r/126020 (https://bugzilla.wikimedia.org/64443) (owner: Milimetric)
[08:50:11] (Merged) jenkins-bot: Fix bad initial recurrent report [analytics/wikimetrics] - https://gerrit.wikimedia.org/r/126020 (https://bugzilla.wikimedia.org/64443) (owner: Milimetric)
[10:25:22] nuria: You wanted to talk about the eventlogging card.
[10:25:32] Sooooo ...
[10:25:35] yess
[10:26:12] is there any guide on how to set up monitoring, or is the code the documentation?
[10:26:25] I do not know of such a guide.
[10:27:15] As you said before: I guess we can (ab)use o-ri's script.
[10:27:44] And hook that up with Icinga.
[10:28:30] the script is this one
[10:28:32] https://gist.github.com/atdt/8deed4bc2d311ba0122f#file-el-status-py
[10:29:06] it has hosts hard-coded
[10:29:21] can monitoring retrieve the host it is running on dynamically?
[10:30:08] An alternative would be ganglia/python_modules/eventlogging_mon.py of the EventLogging extension.
[10:30:25] About dynamic configuration ... I would do the dynamic part through puppet.
[10:30:40] Add the monitoring script as template/file to puppet,
[10:31:01] And when using puppet to instantiate the template/file, provide the needed parameters.
[10:31:45] (CR) Nuria: Update column types for logging table (1 comment) [analytics/wikimetrics] - https://gerrit.wikimedia.org/r/136328 (https://bugzilla.wikimedia.org/65944) (owner: Milimetric)
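The parameterization idea sketched above — letting puppet supply the host rather than hard-coding it the way o-ri's gist does — might look like the following minimal skeleton. The flag names, defaults and the check body are hypothetical, not taken from the gist or from the EventLogging code; puppet would fill in the real values when it instantiates the template/file or defines the check command.

```python
#!/usr/bin/env python
"""Sketch: take the endpoint as parameters instead of hard-coding hosts."""
import argparse
import sys


def parse_args():
    parser = argparse.ArgumentParser(description='EventLogging status check (sketch)')
    # Puppet would pass these per host (via a template or the command line in
    # the service definition) instead of hard-coding them in the script.
    parser.add_argument('--host', default='localhost', help='host to check')
    parser.add_argument('--port', type=int, default=8600, help='port to check (hypothetical default)')
    return parser.parse_args()


def main():
    args = parse_args()
    # ... perform the actual check against args.host:args.port here ...
    print('checking %s:%d' % (args.host, args.port))
    return 0


if __name__ == '__main__':
    sys.exit(main())
```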
[10:34:59] Which repo is this in? "ganglia/python_modules/eventlogging_mon.py"
[10:35:27] As written above: the EventLogging extension.
[10:35:47] That is: https://gerrit.wikimedia.org/r/#/admin/projects/mediawiki/extensions/EventLogging
[10:43:31] and that script is the one used to make a graph then? because it reports global counts
[10:44:24] The graphs are generated by separate services (ganglia, ...), but it feeds the data for those graphs IIRC.
[10:48:19] ok, let me see if i can execute that script on vanadium
[10:56:36] qchris: i do not understand..... the EL extension code is deployed to hosts that have mediawiki running
[10:56:47] and also to hosts like vanadium?
[10:56:56] so the ganglia code is executed from there?
[10:58:03] Let me find an overview file ... https://wikitech.wikimedia.org/wiki/File:Eventlogging-backend.svg
[10:58:49] nuria: ^ does that help show which part runs where?
[10:59:36] The graph is not totally straightforward to read ... but it shows the ingredients.
[11:05:05] Oh ... and about deploying the EventLogging extension ... I have no access to vanadium, but if I am not mistaken, there should be a '/srv/deployment/eventlogging/EventLogging' directory there
[11:05:15] And this directory should hold eventlogging code from
[11:05:32] the remote https://gerrit.wikimedia.org/r/mediawiki/extensions/EventLogging
[11:06:03] ^ nuria
[11:06:18] puf, tons of info
[11:06:42] Ha :-D But you have the benefit of being able to look at the machine.
[11:06:51] I am stuck with reading puppet.
[11:06:55] Hahaha :-) :-(
[11:07:30] Well, if you put that together from puppet... you are a superhero, really
[11:08:11] More like it is "business as usual" :-)
[11:15:57] still looking
[11:19:22] ok, the reporter publishes to statsd, i still do not get who executes the python script that sends stuff to ganglia
[11:23:18] or how would we test any of this outside of vanadium
[11:23:28] ^qchris
[11:24:24] or where does the statsd data end up
[11:26:38] (Sorry. I missed the ping)
[11:27:59] Doesn't the statsd<->ganglia magic happen outside of our control in a general setup?
[11:28:40] But I am not sure I understand your question.
[11:29:09] Would we run the script outside of vanadium?
[11:30:56] nuria ^
[11:31:17] well, by the looks of it we can have an alarm set on the thresholds of the data we already have, the advantage of the script is that it splits the big pipeline
[11:31:42] into events and reports the % of the pipeline that corresponds to every event
[11:32:04] So where would you want to run the script?
[11:33:12] well, i really have no idea, *seems* to me that the data we need to alarm on is already in statsd and graphite
[11:33:49] but i really do not know enough about the production infrastructure to know what needs to be done here.
[11:34:02] Hopefully otto or ori can help
[11:34:07] Do we want such fine-grained monitoring?
[11:34:36] You mean fine-grained as in "per event"?
[11:34:44] Yes.
[11:34:51] no, not to start with at least
[11:35:10] seems to me that with overall thresholds we will be ok
[11:35:12] Then there is no need to instrument statsd etc.
[11:35:39] A plain script that fires if we cross a total threshold would do.
[11:36:44] yes, that works. But i still do not get where such a script runs (vanadium?) and to whom does it publish the data
[11:37:04] (Man I cannot even find the card for that ... do you know which card has the latest specs?)
[11:37:33] I'd let Icinga run the script on vanadium.
[11:37:47] (Until that becomes too heavy ... then we could move it somewhere else)
[11:37:56] So ... as simple as possible.
[11:39:28] (Here it is https://www.mediawiki.org/wiki/Analytics/EventLogging and lines 79 onwards on https://etherpad.wikimedia.org/p/analytics-tasking )
[11:39:59] So neither the card nor the tasking calls out per-event monitoring, so I'd keep it simple and not implement it.
[11:40:11] (although per-event logs would be nice)
[11:40:32] but is icinga even running on vanadium?
[11:41:05] there are per-event logs in graphite
[11:41:06] Icinga can connect to vanadium as needed.
[11:41:21] There is a sample Icinga alert ... let me find it.
[11:42:17] "Check status of defined EventLogging jobs" for vanadium on icinga.wikimedia.org
[11:43:56] https://git.wikimedia.org/blob/operations%2Fpuppet/d37109dc98954d2377ed25fce66b2361b8cd190e/manifests%2Frole%2Feventlogging.pp#L157
[11:44:42] https://git.wikimedia.org/blob/operations%2Fpuppet/d37109dc98954d2377ed25fce66b2361b8cd190e/modules%2Feventlogging%2Ffiles%2Fcheck_eventlogging_jobs
[11:44:59] ^ those are the instantiation in puppet, and the corresponding file.
[11:46:08] Icinga runs this script on vanadium.
[11:46:20] I see the code, but i do not see anything on http://icinga.wikimedia.org/
[11:46:32] nor do i see an icinga process on vanadium
[11:46:46] Did you log in on http://icinga.wikimedia.org/ ?
[11:47:03] yes
[11:47:17] In the search box, enter "vanadium"
[11:47:35] Execute the search
[11:47:58] That should take you to a page showing a table of ~10 rows.
[11:48:12] A column in the middle being all green.
[11:48:19] ok, i see
[11:48:42] Ok. The first non-labeled row says "Check status of defined EventLogging jobs" in the service column.
[11:48:49] but on vanadium, ps auxfw | grep -i ici
[11:48:54] returns nothing
[11:48:59] That's ok.
[11:49:09] As written above, Icinga can connect to vanadium.
[11:49:18] It needn't run as a service there.
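The "plain script that fires if we cross a total threshold" discussed above could be a small check in the Nagios/Icinga plugin style, reading the overall rate from graphite (where the per-event data already lives) and exiting 0/1/2/3 for OK/WARNING/CRITICAL/UNKNOWN. This is only a sketch: the graphite host, the metric name and the threshold numbers are made up, not taken from the real setup.

```python
#!/usr/bin/env python
"""Minimal sketch of a 'total throughput' Icinga check (assumed metric/thresholds)."""
import json
import sys
try:
    from urllib.request import urlopen  # Python 3
except ImportError:
    from urllib2 import urlopen         # Python 2

# Hypothetical graphite endpoint and metric name; puppet could template these.
GRAPHITE_URL = ('http://graphite.example.org/render'
                '?target=eventlogging.overall.inserted.rate'
                '&from=-15min&format=json')
WARNING_THRESHOLD = 100.0   # events/s, made-up numbers
CRITICAL_THRESHOLD = 10.0


def latest_value(render_json):
    """Return the most recent non-null datapoint from a graphite render response."""
    datapoints = render_json[0]['datapoints']
    values = [v for v, _timestamp in datapoints if v is not None]
    return values[-1] if values else None


def main():
    try:
        rate = latest_value(json.load(urlopen(GRAPHITE_URL, timeout=10)))
    except Exception as e:
        print('UNKNOWN: could not fetch metric: %s' % e)
        return 3
    if rate is None or rate < CRITICAL_THRESHOLD:
        print('CRITICAL: overall EventLogging rate is %s' % rate)
        return 2
    if rate < WARNING_THRESHOLD:
        print('WARNING: overall EventLogging rate is %s' % rate)
        return 1
    print('OK: overall EventLogging rate is %s' % rate)
    return 0


if __name__ == '__main__':
    sys.exit(main())
```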
[11:52:55] and where is it configured that Icinga runs check_eventlogging_jobs?
[11:53:23] As written above ...
[11:53:25] https://git.wikimedia.org/blob/operations%2Fpuppet/d37109dc98954d2377ed25fce66b2361b8cd190e/manifests%2Frole%2Feventlogging.pp#L157
[11:53:47] that is part of role::eventlogging
[11:54:31] And that role is associated with vanadium in:
[11:54:33] https://git.wikimedia.org/blob/operations%2Fpuppet/d37109dc98954d2377ed25fce66b2361b8cd190e/manifests%2Fsite.pp#L2790
[11:55:21] right, but it is run by "nagios", right? and ... somehow that ends up in icinga later?
[11:56:03] Icinga is a Nagios fork.
[11:56:48] i just read that ....
[11:56:54] I am not sure how/when wmf switched from nagios to icinga, but generally,
[11:57:28] reading "nagios" where you'd expect "icinga" should mostly be fine.
[11:58:17] But if you are curious, you can dig in there as well:
[11:58:51] https://git.wikimedia.org/blob/operations%2Fpuppet/d37109dc98954d2377ed25fce66b2361b8cd190e/modules%2Fnrpe%2Fmanifests%2Fmonitor_service.pp
[11:59:19] But that part is used all over our puppet files, so I just trust it to do "the right thing" :-)
[12:00:34] ok, the fact that the script that reports throughput to graphs is deployed with the actual feature code, but the monitoring of the system is deployed in the production puppet repo,
[12:00:47] is ... really ... confusing ...
[12:01:37] :-) Mixed setups are common for the wmf.
[12:01:42] I do not mind it too much.
[12:01:45] and where are the actions for alarms configured?
[12:01:54] so exit 2 triggers what?
[12:02:08] "exit 2" is Icinga magic.
[12:02:21] Or are those fixed for every icinga script?
[12:02:26] Yes.
[12:02:34] ok, that actually makes sense
[12:02:46] http://docs.icinga.org/latest/en/pluginapi.html#returncode
[12:02:51] ^ docs for the exit codes.
[12:04:29] ok, and as far as I can see all this needs to be tested on vanadium
[12:04:58] It need not be vanadium, but vanadium sounds like the natural place for a first shot to me as well.
[12:05:32] how would the alarm know to send e-mail to our list? is there a way to configure alarm receivers?
[12:05:37] But be careful to not page all ops when testing ... so double check your code beforehand :-)
[12:05:49] https://git.wikimedia.org/blob/operations%2Fpuppet/d37109dc98954d2377ed25fce66b2361b8cd190e/files%2Ficinga%2Fcontactgroups.cfg
[12:06:01] ^ contact groups (=receivers) for alarms.
[12:07:43] and the link between the contact group and the alarm itself?
[12:08:11] Again in the monitoring service definitions
[12:08:13] https://git.wikimedia.org/blob/operations%2Fpuppet/d37109dc98954d2377ed25fce66b2361b8cd190e/manifests%2Frole%2Feventlogging.pp#L157
[12:08:21] there is a contact_group key
[12:09:03] But these are all parts of our setup that I do not know at all.
[12:09:13] So be defensive :-)
[12:09:23] And if in doubt ... assume that I am wrong.
[12:10:20] ok, many thanks, i think i kind of get all the pieces; the last question is: who deploys to vanadium?
[12:10:43] puppet is supposed to do this automatically.
[12:11:03] So merging into operations/puppet should get it deployed within 30 minutes automatically.
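One way to "double check your code beforehand", as suggested above, is to run the check by hand and map its return code to the plugin states from the Icinga docs before puppet ever wires it up. A tiny sketch; the script path is hypothetical, the state names follow the linked return-code convention.

```python
#!/usr/bin/env python
"""Sketch: smoke-test a check script locally before Icinga runs it."""
import subprocess

STATES = {0: 'OK', 1: 'WARNING', 2: 'CRITICAL', 3: 'UNKNOWN'}

# Hypothetical path to the check being developed.
proc = subprocess.Popen(['/usr/local/bin/check_eventlogging_throughput'],
                        stdout=subprocess.PIPE)
output, _ = proc.communicate()
print('%s: %s' % (STATES.get(proc.returncode, 'UNKNOWN'),
                  output.decode('utf-8', 'replace').strip()))
```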
[13:05:58] (CR) QChris: Add code to auto-drop old hive partitions and remove partition directories (3 comments) [analytics/refinery] - https://gerrit.wikimedia.org/r/135128 (owner: Ottomata)
[13:32:41] (CR) QChris: [C: 2 V: 2] "Since ottomata needs to do this stuff by hand up to now," [analytics/refinery] - https://gerrit.wikimedia.org/r/135128 (owner: Ottomata)
[13:36:59] :)
[13:37:12] qchris, quick comment back atcha about the partition glob
[13:37:23] that's why that part was in the bin/ script, rather than in the util lib
[13:37:23] Nonono. I am away :-)
[13:37:35] it is specific to just the webrequest table
[13:37:35] haha
[13:37:49] I think it's fine.
[13:37:56] ok cool ;)
[13:37:56] ohmyyyyy so many reviews to do today
[13:38:00] you and ori, woooo
[13:38:09] It's just that we might miss partitions of the table that are not under table_location.
[13:38:32] But we should not have them here, so it should be safe.
[13:38:43] aye
[13:38:49] But at some point we might add partitions that live in different places.
[13:39:01] We'll find out then :-)
[13:39:34] So is there anything else I have to merge to make your "Keep Icinga calm" workflow more automatic?
[13:42:03] naw, there will be some puppet for a cron job in a bit
[13:42:05] thank you!
[13:42:19] actually i probably have to talk to oliver first
[13:42:23] since he is the only one using this data
[13:42:33] we don't have a lot of room right now, so we can only keep so much of it
[14:21:04] Sir ottomata ... could you help me debug https://bugzilla.wikimedia.org/show_bug.cgi?id=66005 ?
[14:21:41] I cannot log in to dataset1001, so I cannot check things there.
[14:22:10] (CR) Milimetric: Update column types for logging table (1 comment) [analytics/wikimetrics] - https://gerrit.wikimedia.org/r/136328 (https://bugzilla.wikimedia.org/65944) (owner: Milimetric)
[14:22:23] (I suspect that file ownership on dataset1001.wikimedia.org::pagecounts-ez/merged/2014/2014-05 is strange)
[14:30:27] checking
[14:32:31] Darn. Overlooked a meeting :-(
[15:01:18] ottomata: Sorry for pinging you before and then disappearing into a meeting :-(
[15:01:34] Did you get a chance to look at dataset1001?
[15:07:27] yes, sorry, fixed and updated the
[15:07:28] bug
[15:07:35] Whoa. Thanks :-D
[15:10:54] (CR) Nuria: [V: 2] Update column types for logging table [analytics/wikimetrics] - https://gerrit.wikimedia.org/r/136328 (https://bugzilla.wikimedia.org/65944) (owner: Milimetric)
[15:17:53] (CR) Nuria: Fix datetime parsing problem (1 comment) [analytics/wikimetrics] - https://gerrit.wikimedia.org/r/136429 (owner: Milimetric)
[15:18:39] (CR) Nuria: [C: 2] Clean up errant print [analytics/wikimetrics] - https://gerrit.wikimedia.org/r/136430 (owner: Milimetric)
[15:18:57] (CR) Nuria: [C: 2] Update column types for logging table [analytics/wikimetrics] - https://gerrit.wikimedia.org/r/136328 (https://bugzilla.wikimedia.org/65944) (owner: Milimetric)
[15:20:42] ottomata: i'm permitted to self-merge changes that get a sanity-check +1 from ops
[15:20:53] so you're not totally on the hook for verifying every letter ;)
[15:23:25] haha, ok, i'm going to start looking at them in just a few minutes
[15:23:29] i'm happy to do them!
[15:23:40] thank you!
:)
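The "auto-drop old hive partitions and remove partition directories" change above boils down to dropping partition metadata and then cleaning up the directory on HDFS. The following is only an illustration of that idea, not the refinery implementation: the table name, the year/month/day partition layout, the HDFS base path and the 31-day retention are assumptions.

```python
#!/usr/bin/env python
"""Sketch of the 'drop partitions older than N days' idea (assumed schema/paths)."""
import datetime

RETENTION_DAYS = 31
TABLE = 'wmf_raw.webrequest'             # assumed table name
BASE_PATH = '/wmf/data/raw/webrequest'   # assumed table_location

# Run daily from cron: dropping the day that just aged out keeps a rolling
# 31-day window, assuming the job never skips a day.
cutoff = datetime.date.today() - datetime.timedelta(days=RETENTION_DAYS)

# Hive only drops the partition metadata; the directory is removed separately,
# mirroring the "remove partition directories" part of the change.
drop_stmt = (
    "ALTER TABLE %s DROP IF EXISTS PARTITION (year=%d, month=%d, day=%d);"
    % (TABLE, cutoff.year, cutoff.month, cutoff.day))
rm_cmd = 'hdfs dfs -rm -r -f %s/year=%d/month=%d/day=%d' % (
    BASE_PATH, cutoff.year, cutoff.month, cutoff.day)

print(drop_stmt)
print(rm_cmd)
```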
[15:45:19] (CR) Milimetric: Fix datetime parsing problem (1 comment) [analytics/wikimetrics] - https://gerrit.wikimedia.org/r/136429 (owner: Milimetric)
[15:47:52] (CR) Nuria: "I might have misunderstood the metric definition but I think this metric cannot use a cohort as it *finds* users so it cannot start from a" (3 comments) [analytics/wikimetrics] - https://gerrit.wikimedia.org/r/136431 (https://bugzilla.wikimedia.org/65944) (owner: Milimetric)
[15:53:21] (PS2) Nuria: Fix invalid users display for invalid cohort [analytics/wikimetrics] - https://gerrit.wikimedia.org/r/136434 (owner: Milimetric)
[15:53:23] (CR) jenkins-bot: [V: -1] Fix invalid users display for invalid cohort [analytics/wikimetrics] - https://gerrit.wikimedia.org/r/136434 (owner: Milimetric)
[16:04:18] (CR) Ottomata: [C: 2 V: 2] Lint scripts/restore_from_files.py [analytics/geowiki] - https://gerrit.wikimedia.org/r/136297 (owner: QChris)
[16:05:03] (CR) Ottomata: [C: 2 V: 2] Lint scripts/make_limn_files.py [analytics/geowiki] - https://gerrit.wikimedia.org/r/136298 (owner: QChris)
[16:41:05] (CR) Nuria: "Could you please provide a bit more info as to the bug fixed by this changeset?" [analytics/wikimetrics] - https://gerrit.wikimedia.org/r/136434 (owner: Milimetric)
[17:39:11] (PS2) Milimetric: Fix datetime parsing problem [analytics/wikimetrics] - https://gerrit.wikimedia.org/r/136429
[17:52:25] (PS3) Milimetric: Fix datetime parsing problem [analytics/wikimetrics] - https://gerrit.wikimedia.org/r/136429
[17:52:31] (PS2) Milimetric: Clean up errant print [analytics/wikimetrics] - https://gerrit.wikimedia.org/r/136430
[17:52:42] (PS3) Milimetric: Add Newly Registered Users metric [analytics/wikimetrics] - https://gerrit.wikimedia.org/r/136431 (https://bugzilla.wikimedia.org/65944)
[17:55:20] jgonera, data generated and attached to the trello card.
[17:55:25] (tablet browser choices)
[17:55:35] thanks Ironholds
[17:56:48] np :)
[18:00:03] (PS4) Milimetric: Add Newly Registered Users metric [analytics/wikimetrics] - https://gerrit.wikimedia.org/r/136431 (https://bugzilla.wikimedia.org/65944)
[18:02:35] (CR) Milimetric: Add Newly Registered Users metric (3 comments) [analytics/wikimetrics] - https://gerrit.wikimedia.org/r/136431 (https://bugzilla.wikimedia.org/65944) (owner: Milimetric)
[18:35:04] (CR) Ottomata: [C: 2 V: 2] Lint geowiki/wikipedia_projects.py [analytics/geowiki] - https://gerrit.wikimedia.org/r/136299 (owner: QChris)
[18:35:36] (CR) Ottomata: [C: 2 V: 2] Lint geowiki/mysql_config.py [analytics/geowiki] - https://gerrit.wikimedia.org/r/136300 (owner: QChris)
[18:35:56] (CR) Ottomata: [C: 2 V: 2] Lint geowiki/geo_coding.py [analytics/geowiki] - https://gerrit.wikimedia.org/r/136301 (owner: QChris)
[18:36:15] (CR) Ottomata: [C: 2 V: 2] Remove unused geowiki/format_output.py [analytics/geowiki] - https://gerrit.wikimedia.org/r/136302 (owner: QChris)
[18:37:43] Ironholds: hiyaaa
[18:37:51] (i'm in a meeting but have a q for you)
[18:37:51] hey ottomata :)
[18:37:57] sure!
[18:38:06] so, i'm having a bit of trouble keeping space free in HDFS for now
[18:38:15] kill April
[18:38:16] christian just merged my code to auto-delete stuff
[18:38:18] kill it with fire.
[18:38:26] can I keep a rolling 31-day history?
[18:38:43] as in, you'd only have data for 31 days back at any time
[18:38:45] That works too.
[18:38:46] i can probably do a few more days in there
[18:38:48] ok awesome
[18:38:53] hrm
[18:38:56] we're getting more space... sometime soon :)
[18:38:58] minor problem there, though
[18:39:00] ja?
[18:39:02] wait, no, nevermind.
[18:39:14] So, historically that would've been a problem because you have to partition on *something* for the queries to run
[18:39:26] and having the partitions come into existence or vanish as the wheel turns is a pain
[18:39:40] you can always do where < than
[18:39:42] but unless I suddenly find some pressing need to consume bits traffic (not going to happen)
[18:39:43] i mean
[18:39:44] > than
[18:39:50] I can partition still
[18:39:51] that too!
[18:39:54] or year = 2014
[18:39:54] yeah, this seems fine :)
[18:40:00] ok great, danke
[18:40:02] until 2015 rolls around ;p
[18:40:06] where year > 2000
[18:40:07] :)
[18:40:08] just don't do it in the next couple of days?
[18:40:14] I have this big query running over May
[18:40:15] hmm, ok, i can wait, was going to do it today
[18:40:16] ok
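On the partition-predicate point above: with a rolling 31-day dataset, a hard-coded "year = 2014" breaks once 2015 rolls around, while a generated date-bounded predicate keeps working and still lets Hive prune to existing partitions. A small sketch of that idea; the chat only confirms a year partition column, so the month/day columns here are an assumption.

```python
#!/usr/bin/env python
"""Sketch: build a partition predicate covering a rolling 31-day window."""
import datetime


def last_n_days_predicate(n=31, today=None):
    """Return a HiveQL WHERE fragment selecting the last n days of partitions."""
    today = today or datetime.date.today()
    days = [today - datetime.timedelta(days=i) for i in range(n)]
    parts = ['(year=%d AND month=%d AND day=%d)' % (d.year, d.month, d.day)
             for d in days]
    return '(' + ' OR '.join(parts) + ')'


# Unlike "year = 2014", this keeps working across the year boundary while
# still restricting the query to partitions that should exist under the
# 31-day retention discussed above.
print(last_n_days_predicate(31))
```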
[18:42:07] ottomata: any bandwidth to spare for reviews?
[18:42:30] yeah, soon, sorry, got caught up in a kafka thing, and now the ops meeting
[18:42:37] i will start them today
[18:42:51] ottomata, thanks; sorry about that :(
[18:42:59] if you'd caught me an hour ago it'd be fine, but ... the query is running.
[18:43:03] ha, k, s'ok
[18:43:16] yeah, dunno what happens with hive if data is removed out from under it...
[18:43:18] probably nothing good
[18:43:34] yeah
[18:43:38] it'd be fun to find out! :D
[18:43:47] ottomata: if there's a gap in the conversation in the ops meeting could you mention that i have a big backlog of patches that could use a sanity check? :)
[18:44:05] sure, all rcstream?
[18:44:30] you have like 30+ patches here !!! :)
[18:44:49] naw, more like 20 ops/puppet ones :)
[18:46:41] ottomata: mediawiki and rcstream
[18:46:48] k
[18:47:16] oh, faidon is mentioning it right now
[18:47:21] your mediawiki ones
[18:47:29] saying giuseppe is helping?
[19:16:31] ottomata: <3 thanks!
[19:17:42] :)
[19:20:38] ottomata: that one needs a +2
[19:20:42] the rcstream python change i mean
[19:21:03] i tested it and the service isn't public yet
[19:22:16] you can't merge it?
[19:23:04] i can, but the conventions differ
[19:23:11] it's not an operations/ thing
[19:23:15] well, wth
[19:23:19] i'll merge it
[19:30:07] ha, ok
[19:52:15] ottomata: https://gerrit.wikimedia.org/r/#/c/136817/ is a simple fixup for I796119477 if you have a sec
[19:53:28] thar you go :)
[19:53:36] ottomata: thanks very much
[23:45:10] Get Ironholds.
[23:45:17] I made that table I was talking about.
[23:45:39] I have a script that will update it once per 24h.
[23:46:34] halfak, you made a db table? where?
[23:46:45] analytics-store
[23:46:47] you glorious sod!
[23:47:04] that's recompense for the free accommodation?
[23:47:07] also, would you like anything from SF?
[23:47:19] so... Should I post to analytics@ or analytics-internal@?
[23:47:26] internal probably makes sense
[23:47:37] I don't know who we have with access to the dbs who doesn't work here.
[23:47:43] (it worries me that I literally /do not know that/)
[23:47:52] I think that analytics-internal is only our team though.
[23:48:37] hmm. true. Use your judgment, then :)
[23:48:57] OK.
[23:51:46] can you think of anything the GroupLens people could stand to find out about/hear about/ask questions about that I know, btw?
[23:51:57] I feel like I should make a vague stab at making this visit productive for both sides