[09:28:42] (Abandoned) Hashar: Lint: pass pep8 checks [analytics/wp-zero] - https://gerrit.wikimedia.org/r/134321 (owner: Hashar) [09:28:46] (Abandoned) Hashar: Lint: remove unused imports [analytics/wp-zero] - https://gerrit.wikimedia.org/r/134322 (owner: Hashar) [12:00:17] qchris [12:00:22] Heya. [12:00:33] question about your graph for EL [12:00:41] https://wikitech.wikimedia.org/w/images/d/d6/Eventlogging-backend.svg [12:01:30] where do you see the direct publishing to graphite? (hafnium) on the code? i see the connection in vanadium [12:01:37] so we are conected to hafnium [12:01:48] but i do not see where is that instantiated [12:02:15] You mean the line from "<> graphite" to "stat[s]d". Right? [12:04:23] no, actually to hafnium [12:04:24] see: [12:04:31] https://git.wikimedia.org/blob/operations%2Fpuppet.git/e030c07fd8b1db3bb66065797cb0a14b2bbbb31a/manifests%2Frole%2Feventlogging.pp#L180 [12:04:40] ^ that is the line to hafnium [12:04:40] tcp 0 0 vanadium.eqiad.wmn:8600 hafnium.wikimedia:36624 ESTABLISHED - [12:05:14] The relevant line above is: input => 'tcp://vanadium.eqiad.wmnet:8600', [12:06:38] wait .... [12:07:26] stats and hafnium are not the same thing right? [12:07:45] as in the statsd domain is not an alias of hafnium..right? [12:08:19] I think they are different. Right. Let me check. [12:09:14] statsd is tungsten. [12:09:28] hafnium is ... well ... hafnium. [12:09:37] So they are not the same machine. [12:10:03] me no comprendo [12:10:45] statsd.eqiad.wmnet is an alias for the machine with the name tungsten.eqiad.wmnet. [12:11:00] ya, what i do not get is the hafnium connection [12:11:18] ? [12:11:32] So hafnium consumes from vanadium and publishes to statsd. [12:11:51] The connection for "hafnium consumes from vanadium" is what you posted above. [12:12:46] Not sure ... which part looks wrong to you? [12:12:50] but who is instantiating that connection? the monitoring classes of puppet? [12:13:07] https://git.wikimedia.org/blob/operations%2Fpuppet.git/e030c07fd8b1db3bb66065797cb0a14b2bbbb31a/manifests%2Fsite.pp#L2727 [12:13:22] ^ That line causes instantiation of role::eventlogging::graphite on hafnium. [12:13:37] This role starts an EventLogging consumer. [12:13:46] (on hafnium). [12:13:59] This EventLogging consumer consumes (on hafnium) from vanadium. [12:14:08] That seems to be the connection you posted above. [12:17:29] nuria: Does the above make sense to you? [12:17:44] puf... [12:18:04] I take that as a no :-) [12:18:07] i do not see how that could be harvesting all metrics cause otherwise i does not make sense that [12:18:16] some would work but others wouldn't [12:18:50] Remember that I said that I am not sure that "per schema reporting" might not be in the code. [12:18:50] there is also a direct connection to statsd [12:19:15] You mean from vanadium to statsd? [12:19:20] Yes, that's the reporter service. [12:19:50] It gets instantiated here: [12:19:54] https://git.wikimedia.org/blob/operations%2Fpuppet.git/e030c07fd8b1db3bb66065797cb0a14b2bbbb31a/manifests%2Frole%2Feventlogging.pp#L156 [12:19:57] ya that one is clear [12:20:49] but why do we need 2? a statsd connection (directly) and a one proxied by hafnium? [12:21:13] I do not know. I haven't looked at the code. [12:21:34] Probably, they are reporting different things. [12:22:44] Looks like the reporter node is consuming from the processors directly, while the consumer on hafnium is consuming from the multiplexer. [12:22:53] might be that hafnium reports directly to graphite? [12:23:35] I am not sure, but I do not think so [12:23:37] output => 'statsd://statsd.eqiad.wmnet:8125', [12:23:44] sounds more like statsd to me. [12:23:50] ya, that does not add up [12:24:03] But dive into the code. That should remove abiguity and doubt. [12:24:10] ahhh [12:25:17] No need to be afraid of the code. The code is pretty short and easy. [12:25:18] the puppet code you mean, right? because El code (other than statsd publisher) is not doing much on this regard [12:25:37] Puppet + EventLogging code. [12:25:49] (Maybe statsd configuration. Not sure.) [12:25:55] i did looked at EL code, upstart code and logs on upstart [12:42:51] ^qchris, from me looking at the code what I got was that there are two ways to report to statsd [12:43:06] vi reporter (which seems to me reports global counts) [12:43:28] Yup. those two ways are also what is exhibited in the diagram you referenced above. [12:43:48] and via hafnium [12:44:18] which (as you said) is mostly managed by puppet [12:46:30] Does it sound right that the one that reports global counts is the "reporter" [12:46:43] eventlogging-reporter, that is [12:46:58] No clue. I do not know the code. [12:48:19] Let me have a look there. [12:50:18] what would be the best way to inspect the upd traffic sent to statsd on 8125 in the absence of tcpflow (which is not installed in vanadium) [12:50:47] I would not work on the production nodes. [12:51:03] I'd rather use labs instead. [12:52:15] Mhmm... I'd guess production machines would have tcpdump installed (but I am not sure). [12:52:24] You could use that to capture traffic. [12:53:01] wait .. I am not sure how can we use labs to troubleshoot this problem though [12:53:23] Also ... udp ... are you sure ...? [12:53:52] Well might be. [12:53:58] I do not know statsd. [12:54:35] In labs, I'd just setup a new instance and instantiate the needed classes there to mock EventLogging infrastructure. [12:54:36] it is udp [12:54:42] Ok. Udp it is. [12:54:44] statsd is as simple [12:54:49] as something can be [12:55:03] but also you can report directly to graphite which i did not know [12:55:19] Then to mock statsd, just open the statsd port for listening on udp and log what comes in there. [12:55:42] nc -l 8125, you mean? [12:55:48] sorry [12:56:18] nc -u -l 8125 [12:56:21] Wait ... wouldn't that be tcp. [12:56:31] Yes, the -u one looks better. [12:56:57] But I do not think that the other arguments are correct. [12:57:08] Do double them when trying it out. [12:57:32] s/Do double/Do double-check/ [12:59:11] ya no wonder cause i did from memory, but still [13:09:43] qchris i am not sure what you mean as if i look at the reporter it listens to several tcp streams and publishes to a udp one like [13:09:45] tcp 0 0 localhost:45644 localhost:8522 ESTABLISHED 24387/python [13:09:45] tcp 0 0 localhost:41611 localhost:8521 ESTABLISHED 24387/python [13:09:45] tcp 0 0 localhost:52757 localhost:8422 ESTABLISHED 24387/python [13:09:45] tcp 0 0 localhost:39069 localhost:8421 ESTABLISHED 24387/python [13:09:45] udp 0 0 *:60036 *:* 24387/python [13:09:57] 24387 is the reporter [13:10:23] usr/local/bin/eventlogging-reporter @/etc/eventlogging.d/reporters/statsd [13:10:37] Sorry. I do not understand the question. [13:10:41] so i think this process *:60036 [13:11:03] is the one publishing to statsd 8125 [13:12:14] but unless i ssh to statsd i cannot log on my end the incoming traffic [13:12:38] Sorry. I still do not understand the question. [13:12:58] Sorry, will try to re-explain [13:13:06] you said" mock statsd, just open the statsd port for listening on udp and log what comes in there." [13:13:16] Yes. [13:13:33] the statsd port for listening on udp is 8125 [13:13:58] That was meant for the labs instance. [13:14:47] The relevant part started in "In labs, [...]" [13:15:34] So /in labs/ I'd not setup a full statsd. But instead, I'd open the udp port on the instance itself and listen there, while I [13:15:54] reroute the statsd reporter to the local open "fake statsd" port. [13:16:20] I would not mess with production machines. [13:16:36] ah ok, sorry i am not sure how does labs helps here as we need to troubleshoot the stads connection in prod [13:16:46] which is working for some counts but not others [13:17:03] *statsd [13:17:12] Did you identify the code that is responsible for sending the per schema counts? [13:17:31] (Recall [12:18:50] Remember that I said that I am not sure that "per schema reporting" might not be in the code. ) [13:17:49] That identification can happen completely outside of production instances. [13:18:13] I think so [13:18:32] Ah. Ok. [13:18:34] Great. [13:18:47] So you know that it's the reporter node? [13:19:18] let me send you the line [13:19:26] No need to. [13:19:37] If you found something it, it's ok. [13:19:40] I trust you. [13:19:50] Then I'd verify that the code works in labs. [13:20:04] Like instantiating the reporter node in labs. [13:20:19] pipe in some data and see what comes out of the reporter node. [13:20:29] That can be done fully outside of production. [13:20:39] In this case ... you probably do not even need labs, [13:20:48] but can do this on you own machine right away. [13:20:53] It's just a python script. [13:21:21] https://git.wikimedia.org/blob/mediawiki%2Fextensions%2FEventLogging.git/992b4f71c53bfd962a64b9a2eccf708e48623e31/server%2Feventlogging%2Fhandlers.py#L121 [13:22:03] Once you verified that the script is doing what it's supposed to do, [13:22:27] then I'd reread puppet to make sure that there is no obvious blocker. [13:22:38] Only afterwards, I'd think about going to the production machines. [13:26:05] But also there is this plugin: [13:26:06] nuria@vanadium:/usr/local/lib/eventlogging$ more exp.py [13:26:07] # -*- coding: utf-8 -*- [13:26:07] from eventlogging.factory import writes [13:26:07] @writes('measure') [13:26:07] def graphite_writer(path): [13:26:08] """Increments StatsD SCID counters for each event.""" [13:26:08] print path [13:26:08] while 1: [13:26:09] event = (yield) [13:26:10] print "heh! %s" % event [13:27:42] nuria, you said that you found the relevant code. I trust you. [13:28:12] ahem .. part of the relevant code as i still have no clue where thsi second file comes from [13:29:03] I'd divide and conquer. First (outside of production) verify that the code (from the EventLogging repository) does what it is supposed to do. [13:29:29] Then I'd worry about why this code is not effective in production, and where the exp.py is coming from. [13:47:16] * Ironholds is trying to make R and Hive play nicely directly [13:47:26] I have a different error from the error I started with, so we'll call that progress [13:47:45] :-P [13:54:14] qchris, nevertheless i would like to make sure there are no firewall issues as ori mentioned those had ocurred before [13:54:30] will try to see if ottomata can help as i have no permits [13:54:50] nuria: Sure. Do check them. And that might well be the issue. [13:55:14] nuria: But what you posted before did not look like you were fighting firewall issues :-) [13:55:35] right, not at all [13:55:49] i was trying to see why there are two different ways to report to statsd [13:56:36] also i had no clue where did the hafnium connection came from [14:01:20] ping milimetric [14:19:06] hrm; where does the hadoop-core jar live? [14:20:17] Ironholds: On analytics1010: [14:20:22] /usr/lib/hadoop-0.20-mapreduce/hadoop-core.jar [14:20:26] /usr/lib/hadoop-0.20-mapreduce/hadoop-core-2.0.0-mr1-cdh4.3.1.jar [14:20:31] /usr/lib/hadoop/client-0.20/hadoop-core-2.0.0-mr1-cdh4.3.1.jar [14:21:50] qchris_meeting, ta! [14:21:57] ahh, it's in client. Cool :) [14:25:12] NoClassDefFound error. huh. [14:30:39] huh! [14:30:44] they've started maintaining RHive again! [14:30:49] This may make my job comparatively trivial! [14:45:34] ottomata, yt? [14:49:16] (CR) Nuria: "I think this change should probably be decoupled from the admin scripts which are about done and, technically, it is not really needed for" (3 comments) [analytics/wikimetrics] - https://gerrit.wikimedia.org/r/142514 (owner: Milimetric) [14:50:13] okay, it can't create a file. Maybe not so trivial ;p [14:52:01] Ironholds: not sure ... is this something a non-ops can help you with, or better wait for ottomata? [14:52:24] it's a long-term thing; we're not blocked by it :) [14:52:37] k. [14:52:42] I'll either work it out in my spare time or just wait until Ottomata is proximate enough that I can bribe him with alcoholic beverages :D [14:53:35] darn ... the benefits of being ops. ... you're getting bribed :-) [14:54:28] (CR) Milimetric: "Regarding de-coupling from admin scripts, I'm ok with that except we should work on those first." [analytics/wikimetrics] - https://gerrit.wikimedia.org/r/142514 (owner: Milimetric) [14:58:32] hi nuria, back [14:58:49] thnks for your help andrew [14:59:00] waht i wanted to do (if this makes sense) [14:59:20] is to check that from vanadium we can publish without issues to statsd [14:59:23] via udp [14:59:37] i undertstand that statsd [14:59:52] is tunsten.eqiad.wmnet [14:59:56] or tugnsten [15:00:25] k [15:00:32] do you know statsd port? [15:00:46] yes 8125 [15:00:51] but we can try with nc [15:00:56] aye [15:00:58] just seting up a listener [15:01:20] publish config is to statsd.eqiad.wmnet 8125 [15:01:41] i tried ssh-ing to tungsten but couldn't and have since requested permits [15:01:42] on rt [15:02:32] i'm having trouble logging into tungsten [15:03:43] with super-ops permits? [15:03:53] you need super-super-ops [15:04:20] there in, no it was just haning for a really long tim [15:05:44] ok, can we set up an udp listener there say on 10.000 [15:05:55] and try to send traffic from vanadium? [15:06:03] i can access to vanadium ok [15:06:26] i did one on port 8126 [15:06:28] cannot send stuff through! [15:06:36] aooohhhhh [15:07:46] now .. man .. what i do not get ... some traffic IS getting through to statsd [15:07:57] via hafnium? [15:08:11] ok, so 1 thing at a atime [15:09:47] i think you should be able to get through to statsd directly from vanadium - at least that is what i get from puppet - [15:13:58] where do you see that in puppet? [15:15:07] let me see, as this is configured in several places [15:15:34] this is is EL code that runs in vanadium: https://git.wikimedia.org/blob/mediawiki%2Fextensions%2FEventLogging.git/992b4f71c53bfd962a64b9a2eccf708e48623e31/server%2Feventlogging%2Fhandlers.py#L121 [15:15:54] hm, link doesn't work [15:16:12] https://git.wikimedia.org/blob/mediawiki%2Fextensions%2FEventLogging.git/992b4f71c53bfd962a64b9a2eccf708e48623e31/server%2Feventlogging%2Fhandlers.py#L121 [15:16:27] mm...try again.? [15:16:43] this? [15:16:43] https://github.com/wikimedia/mediawiki-extensions-EventLogging/blob/master/server/eventlogging/handlers.py#L121 [15:16:44] :) [15:18:25] ok, that works too [15:18:34] there is also [15:19:08] the configuration in vanadium here: [15:19:10] nuria@vanadium:~$ more /etc/eventlogging.d/reporters/statsd [15:19:10] --host=statsd.eqiad.wmnet [15:19:10] --port=8125 [15:19:19] which is filled in by puppet [15:19:37] I think those are the two direct points from vanadium-> statsd [15:19:53] but also el data is published from hafnium in to statsd [15:20:08] and that is here: https://git.wikimedia.org/blob/operations%2Fpuppet.git/e030c07fd8b1db3bb66065797cb0a14b2bbbb31a/manifests%2Fsite.pp#L2723 [15:20:16] halfnium? [15:20:52] hafnium.wikimedia.org [15:22:09] so really, we need to connect from hafnium to tunsten, right? [15:22:44] No, not really, from redaing the code [15:22:45] that wokrks [15:22:51] *reading the code [15:23:00] ? doesn't halfnium set up an eventlogging consumer [15:23:03] we connect to statsd directly from vanadium [15:23:09] and sends out to statsd? [15:23:13] but also from tungsten [15:23:17] on hafium: [15:23:17] input => 'tcp://vanadium.eqiad.wmnet:8600', [15:23:17] output => 'statsd://statsd.eqiad.wmnet:8125', [15:23:32] ya but there are more than 1 stads d puvblisher [15:23:38] PREPARE yourself: [15:23:46] https://wikitech.wikimedia.org/w/images/d/d6/Eventlogging-backend.svg [15:23:52] by qchris of course [15:23:55] haha [15:24:18] well, do the eventlogging schema counts work in graphite? [15:24:31] because i can netcat on 8126 hafnium -> tungsten [15:24:33] that works fine [15:24:50] only the "global" counts work, the "per schema" counts [15:24:54] stop working may 16th [15:25:38] but the hafnium thing is supposed to be the per schema counts metric, right? [15:25:47] Keeps a running count of incoming events by schema in Graphite [15:27:33] acording to docs yes, [15:28:09] so, are you sure that this part is broken because of a connection issue? [15:29:05] no, i am not sure the connection is what made the "per schema" counts not work [15:29:15] but the fact [15:29:28] that we cannot talk to statsd from vanadium [15:29:33] seems fishy [15:30:29] ja [15:32:48] nuria, sorry, moving convo back here [15:32:57] so neitehr vanadium or hafnium are on the analytics vlan [15:33:02] so that should not be relevant [15:33:08] but ja, that does seem fishy [15:33:20] what metrics are supposed to go directly from vanadium to tungsten? [15:34:29] i believe these ones for example: https://git.wikimedia.org/blob/mediawiki%2Fextensions%2FEventLogging.git/992b4f71c53bfd962a64b9a2eccf708e48623e31/server%2Feventlogging%2Fhandlers.py#L121 [15:34:37] unless i am totally missing something out here [15:34:47] sorry: https://github.com/wikimedia/mediawiki-extensions-EventLogging/blob/master/server/eventlogging/handlers.py#L121 [15:36:03] so, whatever uses that method [15:36:15] and are those working? [15:37:58] per schema counts are not [15:39:37] but per schema counts are not from vanadium [15:39:42] ok so [15:39:47] there may be multiple issues here [15:39:51] we shoudl separate them if we can [15:40:15] so, i can't do udp netcat from vanadium -> tungsten [15:40:15] so [15:40:15] question one [15:40:40] if there are metrics sent directly from vanadium, what are they, and are they showing up in graphite? [15:40:51] but i see a conection from tungsten into vanadium [15:40:58] see [15:41:01] (PS1) Milimetric: Add pretty symlink for WikimetricsBot [analytics/wikimetrics] - https://gerrit.wikimedia.org/r/143040 (https://bugzilla.wikimedia.org/66087) [15:41:33] sorry , hafnium [15:41:48] tcp 0 0 vanadium.eqiad.wmn:8600 hafnium.wikimedia:36625 ESTABLISHED - [15:41:48] tcp 0 0 vanadium.eqiad.wmn:8600 hafnium.wikimedia:36627 ESTABLISHED - [15:41:48] tcp 0 0 vanadium.eqiad.wmn:8600 hafnium.wikimedia:36624 ESTABLISHED - [15:42:02] so, [15:42:10] those are tcp, dunno what those are [15:42:11] but [15:42:16] let's anser the first question first [15:42:23] what's up with vandium -> tungsten? [15:42:25] and do we care? [15:42:34] I think so [15:42:51] ok, so what metrics are generated by vandium -> tungsten? [15:42:56] let's see if they are working in graphite [15:42:56] as looking into the el config for teh consumer [15:43:19] on /etc/eventlogging.d/reporters [15:43:37] it displays: [15:43:38] statsd.eqiad.wmnet 8125 [15:44:11] ok, do we know the name of a metric that is sent via that reporter? [15:44:51] yes, all things working i believe this would send stuff like [15:44:53] nuria I submitted the symlink patch as a separate commit, I'm gonna go grab lunch now [15:45:10] k milimetrics thank youuu [15:46:13] eventlogging.schema.NavigationTiming.count [15:46:38] ottomata: eventlogging.schema.NavigationTiming.count [15:47:34] nuria, that's a vanadium -> tungsten one? [15:47:41] i thought the per schema counts were through hafnium? [15:48:56] how do I link to somethign in graphite...? [15:49:26] just save teh image [15:49:37] mouse over and say "get url for this img" [15:50:22] ok so this one [15:50:23] https://graphite.wikimedia.org/render/?width=586&height=308&_salt=1404143379.367&target=eventlogging.overall.raw.rate [15:50:25] for example [15:50:35] i assume is a vanadium -> tungsten metric [15:50:36] right? [15:53:45] these are the overall metrics which are working, and according to comments yes, that is published from vanadium [15:54:11] https://git.wikimedia.org/blob/mediawiki%2Fextensions%2FEventLogging.git/992b4f71c53bfd962a64b9a2eccf708e48623e31/server%2Fbin%2Feventlogging-reporter#L7 [15:56:20] ok so, then we should focus on hafnium metrics then, right? [15:56:38] if per schema metrics are supposed to go vanadium -> hafnium -> tungsten [15:56:45] then maybe somethign is wrong there [15:56:46] not vanadium -> tungsten [15:56:46] ja? [15:57:16] yes... although it is strange you cannot connect from vadium to statsd right? [15:57:25] well, i don't know about statsd [15:57:31] i'd have to shut downt the service to try its port [15:57:36] and i don't know how to debug statsd [15:57:37] vanadium->tungsten [15:57:42] and its harder to tell if that connection is working because its udp [15:57:46] without actually setting up a listener [15:57:57] or knowing how to look at incoming statsd metrics or something [15:58:03] you think it might be that only port 8125 is open? [15:58:12] possibly, i don't see any iptables rules, but i dunno [15:58:12] but! [15:58:15] this is more interesting [15:58:26] i cannot send from vanadium -> haflnium on udp 8126 [15:58:51] and the oposite? hafnium connecting to vanadium [15:59:17] that works fine [15:59:40] ah, iptables ruleson hafniumm [15:59:40] hmm [16:00:14] hafnium has base::firewall on it [16:01:05] iptables -L -n -v :) [16:01:21] nuria: what date did you say these metrics stopped working? [16:01:44] https://gerrit.wikimedia.org/r/#/c/134304/ [16:01:44] May 16th I believe, let me triple check [16:02:18] hm, this was merged june 2 [16:02:18] hm [16:02:23] also, i think the comment there is right [16:02:36] hafnium consumes from vanadium [16:02:40] so the connection should be outgoing, as we saw [16:02:44] hafnium -> vanadium is fine [16:03:16] so, metrics stopped working: http://graphite.wikimedia.org/render/?width=588&height=311&_salt=1404144160.959&target=eventlogging.schema.NavigationTiming.count&from=00%3A00_20140501&until=23%3A59_20140630 [16:03:24] ~may 16th [16:04:59] ok, as far as I can tell, connections for those metrics and those nodes should work fine [16:06:02] i alredy checked that metrics are not being dropped with one of your fellow -ops people [16:06:25] whatcha mean? [16:06:57] the other option is that while connections are ongoing stream of interest is not published to 8600 on vanadium [16:08:19] but there are tons of events on that port, i see them doing: zsub vanadium.eqiad.wmnet:860 [16:09:14] sorry zsub vanadium.eqiad.wmnet:8600 [16:10:22] hm, q [16:10:32] how is the consumer on hafnium supposed to work? [16:10:44] it looks like it just forwards directly to statd, right? [16:11:17] oh does it infer by the fact that the protocol int he statsd url is statsd:// [16:11:17] ? [16:11:24] somehow the code knows to send that statsd message? [16:11:27] that you linked to? [16:11:56] that i do not know [16:13:24] the EL code is deployed to halfnium right? [16:15:45] yes [16:16:00] can you do ps auxfw | grep consumer [16:16:20] just fyi, i see udp packets coming into tungsten from both vanadium and hafnium on 8125 [16:16:30] networking stuff all checks out 100% i think [16:16:52] ha, nuria, no procs [16:16:55] on hafnium match [16:17:02] what? [16:17:18] so, must not be running? [16:17:22] we should see something like /usr/local/bin/eventlogging-consumer @/etc/eventlogging.d/consumers/all-events-log [16:17:32] nope [16:17:33] do ps auxfw to see what things there are? [16:17:37] there's nothing [16:17:41] just the grep process matches [16:17:46] ? [16:17:52] i mean [16:17:56] no consumer match [16:18:19] there are some diamond things [16:18:20] ok, according to puppet that process should be instantiated i think [16:18:23]