[09:35:31] hola qchris
[09:36:19] Hola nuria
[09:41:18] We can talk when you are 'back'
[09:41:30] no lack of stuff for me to do
[09:43:13] What's up?
[09:43:43] nuria, 'talk' as in IRC, or 'talk' as in Hangout?
[09:43:59] Take a look at the graphite change: https://gerrit.wikimedia.org/r/#/c/137280/
[09:44:19] i 'mimic' how the threshold check is set up for MediaWiki
[09:45:01] I have no clue about those :-)
[09:45:11] but i am not sure if that makes sense given the EL code that publishes data to graphite is only deployed to vanadium
[09:45:24] ok i will ask ops
[09:45:33] but - this you might know-
[09:46:00] how do we deploy this code to vanadium to test it?
[09:46:12] if you do not know i will ask ori when he comes in
[09:46:35] I do not have experience with testing in production.
[09:46:49] I test in labs/local
[09:47:00] And then ask ops opinions
[09:47:16] If they merge, I try to be around and see if things break.
[09:47:55] no, it's WIP, they will not merge
[09:48:31] But ... will your code get merged to vanadium?
[09:48:31] i *think* this cannot be tested in labs as there is no EL there i believe
[09:48:59] Mock EventLogging and test with that.
[09:49:34] it's graphite and EL that's needed
[09:49:40] Or rather graphite in this case ...
[09:49:41] actually EL doesn't matter
[09:49:46] Yes.
[09:49:51] as you just need to publish a metric
[09:50:56] Reading puppet ... I think vanadium is not the place this will have effect, but tungsten is the machine.
[09:50:58] but does even graphite work on labs and is there an instance where we could test this?
[09:52:02] right, yes, as it only has effect on the graphite side
[09:52:02] Let me check for you if Labs offers graphite out of the box.
[09:53:32] I could not find a "graphite" checkbox. So you'd have to provide the mock on your own.
[09:55:47] the 'mock'?
[09:58:09] the mock for graphite
[09:58:21] Like instantiating role::graphite on the labs instance.
[09:58:51] Either by hand, through puppet, or by some other means.
[09:59:47] But I am not sure if that testing is worth it.
[10:02:34] i think i am going to need access to hafnium
[10:02:49] or tungsten
[10:03:12] whichever one that has the process running that is checking thresholds
[10:03:27] ?
[10:03:42] What for would you need access?
[10:03:50] You can typically do without access.
[10:03:56] (And access would require 3 days)
[10:04:35] cause otherwise I need to submit a changeset with lower thresholds to test alarms are firing if i cannot get to modify those locally
[10:05:30] Well ... as you wish ... but 3 days to get access ... that is Monday.
[10:06:26] Labs is easier and does not take that long.
[10:06:55] ok, but is labs connected to icinga? does it have graphite?
[10:07:15] No. You have to set those up on your own. Puppet will help you with it.
[10:07:27] Testing in production does not strike me as a good idea.
[10:08:07] i am not adbocating testing in production, probably you wouldn't find anyone less fond of that than me
[10:09:28] but having to set up icinga to test this change seems quite inefficient, now, if you can tell me how to start i can give it a try
[10:09:31] *advocating
[10:10:28] Identify the code that is run by monitor_graphite_threshold
[10:10:32] (CR) Nuria: [C: 2] Fix datetime parsing problem [analytics/wikimetrics] - https://gerrit.wikimedia.org/r/136429 (owner: Milimetric)
[10:10:43] And simulate that in labs.
[10:10:49] (Merged) jenkins-bot: Clean up errant print [analytics/wikimetrics] - https://gerrit.wikimedia.org/r/136430 (owner: Milimetric)
[10:11:15] It does not need to be a full blown icinga + graphite with all bells and whistles.
[10:12:39] Probably you just need a few services. Go to wikitech, create a test instance in the analytics component, and install those needed services there.
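Editor's sketch: the "simulate that" advice above could be tried locally along these lines, without Icinga. This is only an illustration, not the script that monitor_graphite_threshold actually runs. It assumes the JSON shape Graphite's render API returns with format=json, and the min_fraction rule and the function name are hypothetical.

```python
import json

# Standard Nagios/Icinga plugin exit codes.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3


def check_threshold(render_json, warning, critical, min_fraction=0.5):
    """Classify the first series in a Graphite render response.

    render_json is text shaped like Graphite's format=json output:
    [{"target": "...", "datapoints": [[value, timestamp], ...]}]
    min_fraction is a hypothetical rule: the share of datapoints that
    must exceed a threshold before that threshold counts as breached.
    """
    series = json.loads(render_json)
    if not series:
        return UNKNOWN, "no series returned"

    # Graphite reports gaps as null values; ignore them.
    values = [v for v, _ts in series[0]["datapoints"] if v is not None]
    if not values:
        return UNKNOWN, "no datapoints"

    def fraction_above(limit):
        return sum(1 for v in values if v > limit) / len(values)

    if fraction_above(critical) >= min_fraction:
        return CRITICAL, "above critical threshold %s" % critical
    if fraction_above(warning) >= min_fraction:
        return WARNING, "above warning threshold %s" % warning
    return OK, "within thresholds"
```

Feeding it hand-written JSON with lowered thresholds shows whether the alarm logic fires, which is the part that can be checked without access to production hosts. Where the alarm is delivered is a separate question that this does not cover.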
[10:13:13] the code run by puppet is this check_ganglia script
[10:13:15] https://github.com/wikimedia/operations-puppet/blob/production/manifests/nagios.pp#L369
[10:13:40] sorry this one: https://github.com/wikimedia/operations-puppet/blob/production/manifests/nagios.pp#L473
[10:15:02] Right. That looks better.
[10:28:43] so i build the instance and later configure what to run with puppet?
[10:29:53] Yes. For example.
[10:30:12] But _joe_ identified the code that gets run for you.
[10:30:59] As that part is done, you can probably do without the labs instance and run it locally.
[10:38:07] although that way i would be testing the output of 1 script
[10:38:21] that connects to graphite and reports
[10:38:39] Yes.
[10:39:28] Against what should your test protect?
[10:44:16] the group alarms go to for exmaple
[10:44:16] *example
[10:44:17] right?
[10:44:26] as it should come to analytics not ops
[10:45:08] and that is configured in puppet
[10:47:22] If you want to test these kind of things,
[10:47:42] You'd need a test-icinga.
[10:47:54] I am not aware of wmf having such a thing.
[10:48:03] So you'd have to set it up on your own.
[10:48:46] You would not be able to use production for this, as (if the test goes wrong) ops get paged ... which is what one wants to guard against.
[10:51:34] ok, understood, so testing where alarms go to is not possible on labs
[10:51:43] That's not what I said.
[10:51:53] Labs does allow you to set up your test-icinga.
[10:52:04] It's just that it's not a 1-click-and-you're-done solution.
[10:52:47] The icinga monitor is puppetized, so most work is done already.
[10:53:29] (see icinga::monitor)
[10:57:09] would you advise we do that for testing thi sitem?
[10:57:15] *this item
[10:58:55] Not sure.
[10:59:07] I'd start by making sure the script is alarming only when needed.
[10:59:29] Then I'd read more puppet around contact_groups.
[10:59:42] If that is straightforward and perfectly understandable,
[11:00:05] I might skip it (and flag it on the change for ops to review it)
[11:00:34] If the contact_groups part does not clarify from the puppet files, I'd test it.
[11:00:50] contact groups are easy enough, they are here
[11:01:22] https://git.wikimedia.org/blob/operations%2Fpuppet/d37109dc98954d2377ed25fce66b2361b8cd190e/files%2Ficinga%2Fcontactgroups.cfg
[11:01:27] Ja, sure.
[11:01:33] But how does that work in detail?
[11:01:49] Do default groups get picked at some point?
[11:01:59] Do some alarms get relayed to other groups?
[11:02:25] Puppet is meta-configuration. That's typically easy to read.
[11:02:34] icinga by default goes to ops, yes, at least that is what i just learned
[11:03:11] But meta-configuration is worth nothing without understanding of the final configuration at the end of the day.
[11:03:46] Yes. I assumed that alerts go to ops people by default.
[11:04:01] But does the graphite_monitoring reliably overrule that?
[11:04:22] These are the kind of questions I would make sure I can answer, if I decide to not test it.
[11:16:00] qchris: that's very well-said
[11:17:05] Hey ori :-)
[11:17:11] Nice to see others read along.
[11:17:35] And nuria ... if in doubt ... ask people like ori who know what they are talking about :-)
[11:17:35] didn't mean to spook, just sleepless as usual
[11:17:55] Well ... my puppet/graphite foo is really bad.
[11:18:12] So please chime in if I say something wrong.
[11:18:57] nothing so far, but that is one of the hazards of the puppet
[11:19:19] ending up with abstractions that obscure the mapping between the puppet manifest and system state
[11:19:32] :-)
[11:21:46] i'm off, *wave*
[11:21:53] ...some crazy sleeping schedule.....
[11:21:55] ciao
[11:22:17] Sleep well :-)
[11:41:46] (CR) Nuria: "In this changeset the newly registered metric does not appear on the UI." [analytics/wikimetrics] - https://gerrit.wikimedia.org/r/136431 (https://bugzilla.wikimedia.org/65944) (owner: Milimetric)
[11:48:43] qchris: seems to me that critical alarms are always received by ops
[11:48:43] https://github.com/wikimedia/operations-puppet/blob/production/manifests/nagios.pp#L119
[11:49:54] 'admins and sms'
[11:50:02] Yup. Makes sense.
[11:52:33] so how things are set up, we shall receive 'warnings' but not critical alarms ( well, andrew will receive those) unless that is changed on puppet
[11:52:48] i will talk with _joe_ tomorrow
[11:53:14] If we want to, we can change that in puppet.
[11:53:17] about perhaps doing that change to add the 'default'
[11:53:27] to critical alarms too
[11:53:39] that makes sense
[11:53:53] We just need to implement in a convincing way, so that the new version works for both ops and us.
[11:54:14] 'implement'....ahem .. me no comprendo?
[11:54:18] isn't it
[11:54:27] like 'admin. sms'+$default
[11:54:33] For example.
[11:54:34] like 'admin, sms'+$default
[11:54:47] I am not sure if that is the best choice in all cases.
[11:54:54] That needs checking.
[11:55:03] that $default is defined yes
[11:56:10] Well ... it might be that adding $contact_group does not make sense for some use cases.
[11:56:34] So one needs to go over the existing uses of the class, and have a look
[11:56:46] (or rely on Ops to detect such issues)
[11:57:11] Checking beforehand will for sure benefit when discussing the change.
[11:57:55] will ask _joe_ either later on today or tomorrow what he prefers
[12:00:27] Wait ... I now read the class ... I think the above does not match the puppet code.
[12:00:45] there are two intermediate classes though
[12:00:50] the critical flag of the monitor_service is not the critical of monitor_graphite_threshold
[12:01:14] Both warning and critical (in Icinga sense) will go to the same contacts.
[12:01:27] nagios_critical just overrides the contacts.
[12:01:36] ... wait ...
[12:01:54] and nagios_critical is set to false anyways in your change.
[12:03:58] ah ya, you are right. it is critical in sense of tier1
[12:06:46] * qchris goes off to lunch. I'll read up on IRC later
[13:45:40] (PS1) Milimetric: Reorganize report creation around recurrence [analytics/wikimetrics] - https://gerrit.wikimedia.org/r/137306 (https://bugzilla.wikimedia.org/66017)
[13:56:09] hey qchris, you wanna put some points on this task: http://sb.wmflabs.org/b/66005/
[13:56:20] oh, or does that count as production
[13:56:24] yeah... you're right, prod
[13:57:02] It was not scheduled ... so no points ... But I am not sure.
[15:32:29] (PS1) Milimetric: Fixes from Erik [analytics/reportcard/data] - https://gerrit.wikimedia.org/r/137338
[15:33:22] (CR) Milimetric: [C: 2] Fixes from Erik [analytics/reportcard/data] - https://gerrit.wikimedia.org/r/137338 (owner: Milimetric)
[15:33:27] (CR) Milimetric: [V: 2] Fixes from Erik [analytics/reportcard/data] - https://gerrit.wikimedia.org/r/137338 (owner: Milimetric)
[17:54:30] ottomata1, can you unabandon the slow-parse change (https://gerrit.wikimedia.org/r/#/c/49678/) if that seems reasonable to you?
[17:54:41] It's no longer blocked, and there's interest in making performance data more visible when possible.
[18:03:56] superm401: unabandoned
[18:04:05] Thanks, ottomata
[18:05:07] does anyone know why stat1002 has our maxmind files at both /usr/local/share and /usr/share/?
[18:05:38] one of them is a symlink, i think
[18:05:43] aha
[18:05:48] for historical reasons, they used to be in one place, but not the other
[18:05:52] I'm gonna assume that's the /local/ since 1003 has it at /usr/chare/
[18:05:54] and we kept the link to the old place around
[18:05:55] *share
[18:06:00] coolio.
[18:06:05] ls -ld /usr/local/share/GeoIP
[18:06:16] also, there is an incredible python library for MaxMind's GeoIP stuff.
[18:06:27] oh ja?
[18:06:33] it has, as well as ultra-fast city and country lookup that caches the .dat file in memory.
[18:06:36] tz lookup.
[18:06:44] It takes the city data and works out the nearest tzdata-recognised locale.
[18:06:48] It's so beautiful I could cry.
[18:07:13] (this is what I have spent my morning integrating into my code. I am totally making a list of these things one of these days so we can just puppetise them.)
[18:07:35] aye cool
[18:45:30] (PS7) Terrrydactyl: [WIP] Add ability to tag a cohort [analytics/wikimetrics] - https://gerrit.wikimedia.org/r/133091
[18:45:38] (CR) jenkins-bot: [V: -1] [WIP] Add ability to tag a cohort [analytics/wikimetrics] - https://gerrit.wikimedia.org/r/133091 (owner: Terrrydactyl)
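Editor's sketch: the contact-group exchange between 11:48 and 12:03 can be summarised as two small functions. This models only what was said in the channel (nagios_critical overrides the contacts with 'admins and sms', and the proposal was to append the service's own group). It is not taken from the puppet code. The function names, the exact 'admins,sms' string and the 'analytics' group name are assumptions for illustration.

```python
# Assumed spelling of the override groups mentioned in the chat.
CRITICAL_GROUPS = ["admins", "sms"]


def current_contact_groups(contact_group, nagios_critical):
    """Behaviour as described in the chat: the flag replaces the contacts."""
    if nagios_critical:
        return ",".join(CRITICAL_GROUPS)
    return contact_group


def proposed_contact_groups(contact_group, nagios_critical):
    """Proposed variant: keep the service's own group alongside the override."""
    if not nagios_critical:
        return contact_group
    groups = list(CRITICAL_GROUPS)
    if contact_group not in groups:
        groups.append(contact_group)  # avoid listing a group twice
    return ",".join(groups)


if __name__ == "__main__":
    # 'analytics' is a hypothetical group name.
    print(current_contact_groups("analytics", True))
    print(proposed_contact_groups("analytics", True))
```

With nagios_critical set to false, as in the change under discussion, both variants return the service's own group, which matches the conclusion that warning and critical states reach the same contacts.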