[08:27:04] Analytics-EventLogging, Wikimania-Hackathon-2016: Extract data from EventLogging for Romanian diacritics poll - https://phabricator.wikimedia.org/T138558#2403890 (Strainu)
[08:50:49] preliminary results on cp3009.esams (cache misc server) show 3 VSL timeouts registered by varnishlog with max VSL timeout set to 600, vs 19 registered with VSL timeout set to 120 (default)
[09:09:25] elukey: here?
[09:09:33] yep!
[09:09:45] I need some ops pliz :)
[09:09:52] surez
[09:10:32] On analytics1030, I'd like to see what there is in /var/lib/hadoop/data/b/yarn/local/usercache/joal/appcache/application_1465403073998_42981
[09:10:47] This folder is owned by user yarn, and I only have rights with hdfs :(
[09:12:24] joal: sure it is 1030? /var/lib/hadoop/data/b/yarn/local/usercache/joal/appcache is empty :(
[09:12:35] mwarf :(
[09:12:38] thanks
[09:12:57] will probably ask for more soon
[09:13:06] sure
[09:13:58] wow elukey, could actually be any of /var/lib/hadoop/data/[b|c|d|e|f|g ..|m]
[09:14:02] man ...
[09:14:12] Don't bother, I have the feeling this is not ok
[09:14:36] Will try differently elukey
[09:15:44] nothing for: for el in b c d e f g h i j k l m; do ls /var/lib/hadoop/data/$el/yarn/local/usercache/joal/appcache/; done
[09:15:48] on 1030
[09:15:58] ok, too fast for me :)
[09:16:01] Thanks elukey !
[09:16:13] :)
[09:55:42] joal: going afk for a couple of hours, is it ok or do you need moar help with logs?
[09:57:35] elukey: no need currently, thanks :)
[09:58:08] super, ttl!
[09:58:16] later !
[11:00:49] Morning A Team!
[11:01:00] Hi addshore
[11:01:29] I have a question regarding this line https://github.com/wikimedia/analytics-refinery-source/blob/master/refinery-job/src/main/scala/org/wikimedia/analytics/refinery/job/RESTBaseMetrics.scala#L90
[11:01:47] sure
[11:01:47] mainly why is it webrequest_source ?
[11:02:07] that would result in hdfs://analytics-hadoop/wmf/data/wmf/webrequest/webrequest_source ?
[11:02:35] addshore: we call it source but it's actually the varnish-cache types/groups
[11:03:01] addshore: web
[11:03:04] Would simply changing that to -> val parquetDataPath = "%s/pageview_hourly_source=text/year=%d/month=%d/day=%d/hour=%d" work?
[11:03:05] oops, again
[11:03:16] Ah, I get it :)
[11:03:38] addshore: webrequest table is partitioned by webrequest_source, year, month, day, hour
[11:03:47] ahhh
[11:03:57] addshore: pageview_hourly table is partitioned by year, month, day, hour
[11:03:58] okay, now it makes sense :D
[11:04:08] So... val parquetDataPath = "%s/year=%d/month=%d/day=%d/hour=%d"
[11:04:22] addshore: Partitions in hive are no more than folders on disk :)
[11:04:31] You're all set :)
[11:04:42] awesome, I may have something for you to review in a second then!
[11:04:58] I just also need to change the filter!
[11:06:30] oh actually, no, I need to split the data per wiki still..
[11:06:44] addshore: no rush ;)
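A quick aside on the partition-path point above, as a minimal Scala sketch: Hive partitions are just directories on HDFS, so the path template has to match the table's partition columns. The base path and date below are illustrative example values, not production config.

```scala
// Minimal sketch: how the two path templates from the conversation resolve.
// webrequest is partitioned by (webrequest_source, year, month, day, hour);
// pageview_hourly only by (year, month, day, hour), so no source= segment.
object PartitionPaths {
  val basePath = "hdfs://analytics-hadoop/wmf/data/wmf" // example value

  val webrequestPath =
    "%s/webrequest/webrequest_source=%s/year=%d/month=%d/day=%d/hour=%d"
      .format(basePath, "text", 2016, 6, 24, 0)

  val pageviewHourlyPath =
    "%s/pageview_hourly/year=%d/month=%d/day=%d/hour=%d"
      .format(basePath, 2016, 6, 24, 0)

  def main(args: Array[String]): Unit = {
    println(webrequestPath)     // .../webrequest/webrequest_source=text/year=2016/...
    println(pageviewHourlyPath) // .../pageview_hourly/year=2016/...
  }
}
```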
[11:21:24] joal: do you know how to iterate over a GroupedData ?
[11:21:37] addshore: depends what you want to do with it
[11:21:59] so essentially I do some stuff and the last line here is "val groupedData = parquetData.groupBy("project")"
[11:22:33] then I want to iterate over the grouped data and thus for each project count the number of 'rows'
[11:22:43] addshore: When you say "do some stuff", you mean filtering I guess
[11:22:51] yup
[11:23:01] so there is one filter, and then this groupBy
[11:23:15] addshore: I didn't realize what you were after
[11:23:36] :D
[11:24:25] addshore: Easiest is probably to use sqlContext, see here: https://gist.github.com/jobar/63252ad3327df4d7a3b4a67aff3667c0
[11:24:49] Also addshore, you can test your code using spark-shell from stat1002
[11:25:26] But, if you are willing to continue with RDDs, you need to actually count (or sum, or reduce) after the grouping
[11:25:42] Grouping just builds the grouped structure
[11:26:06] ahh, so parquetData = parquetData.groupBy("project").count()
[11:26:14] which leaves me with a DataFrame again!
[11:27:03] addshore: that's why I'm speaking of sqlContext: everything you describe can be easily done with SQL, then RDD extraction for sending to graphite
[11:28:05] Also, RDD? ;)
[11:29:30] In spark, 2 main structures for handling data: DataFrame (sql oriented, with schema), RDD (no schema, more raw stuff)
[11:29:56] You can get an RDD[Row] from a dataframe, then extract values from the row
[11:30:44] Let's imagine you go for: val df = sqlContext.sql("SELECT uri_host, COUNT(1) FROM pageview_hourly where year = 2016 and month = 6 and day = 6 and hour = 0")
[11:31:30] then you can: val rdd: RDD[(String, Long)] = df.rdd.map(row => (row.getString(0), row.getLong(1)))
[11:35:12] hmm, okay!
[11:39:54] joal: sticking with the previous method does this look correct? https://gerrit.wikimedia.org/r/#/c/295896/
[11:41:53] addshore: some correct, some not :)
[11:42:05] :D
[11:42:45] hehe, and I left a rogue println(myString); in there...
[11:45:19] error: reassignment to val, ahh, so apparently you can't do that...
[11:45:42] addshore: I'll rewrite L86->L95 differently
[11:45:45] currently testing
[11:45:50] okay! :)
[11:46:12] One comment also: vals in scala are immutable
[11:46:33] yup, this is my first scala :)
[11:48:41] So reassignment of parquetData would fail :)
[11:59:01] Actually I rewrote most of the end: https://gist.github.com/jobar/32ee891370f51e8af8dd18d625b66c43
[11:59:05] addshore: --^
[11:59:36] Using the SQL syntax makes it way easier to grasp what data you have (a dataframe is used in any case when reading parquet)
[11:59:59] Then you need to convert the row: you can't access values by name in a row
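The two one-liners above, consolidated into a sketch that can be pasted into spark-shell on stat1002 (Spark 1.x API, as used at the time). Joal's gists are not visible here, so this is a reconstruction from the conversation: it groups by project as in addshore's job rather than by uri_host, adds the GROUP BY that the quoted query implies, and stubs out the graphite-sending step.

```scala
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.SQLContext

// Count rows per project via SQL, then drop down to an RDD for the send.
def countPerProject(sqlContext: SQLContext): RDD[(String, Long)] = {
  val df = sqlContext.sql(
    """SELECT project, COUNT(1) AS views
      |FROM pageview_hourly
      |WHERE year = 2016 AND month = 6 AND day = 6 AND hour = 0
      |GROUP BY project""".stripMargin)

  // Rows expose values by position, not by name: getString(0) / getLong(1)
  df.rdd.map(row => (row.getString(0), row.getLong(1)))
}

// Stub standing in for whatever graphite client the real job would use.
def sendToGraphite(counts: RDD[(String, Long)]): Unit =
  counts.collect().foreach { case (project, views) =>
    println(s"$project -> $views")
  }
```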
[12:03:16] joal: I'd like to reboot aqs1001 for kernel upgrades (depooled from AQS, nodetool drain and then reboot). Anything against it?
[12:03:25] nope
[12:03:33] good, proceeding :)
[12:18:03] elukey: HEEEEEELP
[12:18:05] :)
[12:18:27] joal: here I am :)
[12:18:36] aqs1001 up and running, nodetool status is fine
[12:18:40] great :)
[12:18:51] elukey: I have a firewall issue I think
[12:19:10] elukey: bulk loading uses the RPC (Thrift) port: 9160
[12:19:22] And I think this one is not open
[12:19:42] elukey: Do you know anyone that could help?
[12:19:57] we can check ferm rules
[12:20:18] 9160 should be opened on AQS right?
[12:20:22] to make everything work
[12:20:22] elukey: I'd say: *you* can check ferm rules since I don't know where
[12:20:43] I think 9160 is currently closed, and I'd like it to be opened from analytics
[12:22:04] yeah but what I wanted to ask is if hadoop tries to connect to 9160 on AQS* (bit ignorant about the details of the bulk loading)
[12:22:34] elukey: correct, I currently try to connect from hadoop to AQS (new)
[12:22:37] on 9160
[12:23:53] elukey: Tried manually with telnet: no answer
[12:29:55] joal: thanks! *takes a look*
[12:35:31] joal: will check in a min sorry, was double checking some memcached metrics :(
[12:45:15] (PS2) Addshore: Add WikidataArticlePlaceholderMetrics [analytics/refinery/source] - https://gerrit.wikimedia.org/r/295896 (https://phabricator.wikimedia.org/T138500)
[12:45:28] so joal what else needs to be done to configure the spark job?
[12:48:02] (CR) jenkins-bot: [V: -1] Add WikidataArticlePlaceholderMetrics [analytics/refinery/source] - https://gerrit.wikimedia.org/r/295896 (https://phabricator.wikimedia.org/T138500) (owner: Addshore)
[12:48:55] joal: yes it is blocked
[12:48:59] we have rules like
[12:49:01] # Cassandra CQL query interface
[12:49:02] ferm::service { 'cassandra-analytics-cql':
[12:49:02]     proto  => 'tcp',
[12:49:02]     port   => '9042',
[12:49:02]     srange => "(@resolve((${cassandra_hosts_ferm})) @resolve((${aqs_hosts_ferm})) ${analytics_networks})",
[12:49:05] }
[12:49:10] for the AQS role
[12:49:19] we can add one for 9160
[12:49:36] I think that we'd need only to allow traffic from analytics right?
[12:53:08] joal: https://gerrit.wikimedia.org/r/#/c/295907/
[12:53:54] I would prefer to wait a bit for moritzm's review before merging.. is it super urgent?
[13:03:55] I'll have a look
[13:07:37] moritzm: just gave -1 to myself :)
[13:07:41] does 9160 need to be accessible from the entire analytics network?
[13:07:54] elukey: yeah, you'll need a unique identifier for the ferm service
[13:07:59] fixed :)
[13:08:19] theoretically we'd need to be able to upload from each hadoop host
[13:09:01] so from most of the analytics* to AQS
[13:12:01] the current rules mostly allow the entire analytics network, an alternative would be to define a list of Hadoop hosts in Hiera, but given that the analytics network is mostly separated and we mostly only use the entire analytics network in the other rules we can keep it at that
[13:14:13] what level of connections are we expecting here?
[13:14:36] when merging keep an eye on the connection count
[13:14:54] one remark, but looks good in general
[13:16:59] (PS3) Addshore: Add WikidataArticlePlaceholderMetrics [analytics/refinery/source] - https://gerrit.wikimedia.org/r/295896 (https://phabricator.wikimedia.org/T138500)
[13:17:39] Thanks elukey and moritzm (had to disappear for a bit)
[13:17:46] o/
[13:20:42] (CR) Joal: "1 nit, but looks correct to me." (1 comment) [analytics/refinery/source] - https://gerrit.wikimedia.org/r/295896 (https://phabricator.wikimedia.org/T138500) (owner: Addshore)
[13:20:44] Hey :)
[13:20:55] elukey: how are we on firewalling?
[13:21:11] (PS4) Addshore: Add WikidataArticlePlaceholderMetrics [analytics/refinery/source] - https://gerrit.wikimedia.org/r/295896 (https://phabricator.wikimedia.org/T138500)
[13:21:59] joal: good, the only question was about the level of connections that will flow through Thrift
[13:22:12] elukey: not so many I think
[13:22:22] well we'll keep them monitored
[13:22:30] k
[13:22:41] elukey: Is it opened or not yet? (I wanna test !!!)
[13:22:58] joal: nope give me 10 mins :)
[13:23:03] sure elukey :)
[13:23:24] sorry for the pressure, I'm like a little boy with a new toy (you know me)
[13:23:29] I've just +1ed Luca's patch
[13:23:37] Thanks moritzm :)
[13:25:52] addshore: Have you tested your patch on the cluster
[13:25:53] ?
[13:25:54] thankssss.. salt puppeting now
[13:26:21] No! I'll need a walkthrough as to how to do that / a link to some docs!
[13:26:43] hm .. I think we have no doc about that
[13:26:49] :D
[13:27:06] joal: good to go
[13:27:12] awesome elukey !
[13:27:16] Will try that now
[13:28:10] tell me when you start so I'll monitor the conntrack count
[13:28:23] addshore: The way to go: build the jars (on stat1002 for instance) using the mvn clean package command, then run a spark-submit with the built jar
[13:28:30] elukey: started :)
[13:29:19] joal: do you have a specific target?
[13:29:23] like aqs1004
[13:29:27] or all of them?
[13:29:39] elukey: should go to all the 6 of them (using -a -b)
[13:29:59] 6 == instances right?
[13:30:07] not aqs1001
[13:30:07] correct
[13:30:08] :P
[13:30:12] :D
[13:34:24] joal: okay! (I'm just checking to see if the data also needs to be split up in terms of users vs bots & spiders..!
[13:34:35] addshore: good call !
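If the counts do need splitting by agent type, the same pattern extends naturally. A hypothetical sketch only, assuming pageview_hourly's agent_type column (values such as 'user' and 'spider'); this is not the actual patch:

```scala
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.SQLContext

// Same approach as above, keyed by (project, agent_type) instead of project.
def countPerProjectAndAgent(sqlContext: SQLContext): RDD[((String, String), Long)] = {
  val df = sqlContext.sql(
    """SELECT project, agent_type, COUNT(1) AS views
      |FROM pageview_hourly
      |WHERE year = 2016 AND month = 6 AND day = 6 AND hour = 0
      |GROUP BY project, agent_type""".stripMargin)

  // e.g. (("ro.wikipedia", "user"), 1234L)
  df.rdd.map(row => ((row.getString(0), row.getString(1)), row.getLong(2)))
}
```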
[13:42:05] joal: is it working?
[13:42:16] elukey: difficult to say
[13:42:27] elukey: I don't see errors, but seems not to :(
[13:42:52] mmmmmmmm
[13:43:12] (let's see if bringing ottomata's mmmmm something starts working)
[13:43:30] NOPE !
[13:43:33] errors elukey
[13:44:10] elukey: Since I'm connecting to instances, maybe IPs are specific?
[13:44:27] what do you mean?
[13:44:52] (brb in 2 mins)
[13:46:35] elukey: I'm trying and failing to connect to 10.64.0.12[6|7], 10.64.48.14[8|9], and 10.64.32.1[89|90]
[13:53:00] but for normal loading jobs, hadoop -> cassandra works for those ips..
[13:53:14] so it shouldn't be a network ACL rule
[13:53:20] elukey: different loading, different ports ;)
[13:53:42] yes yes I meant at the router/switch level
[13:54:06] let me try a couple of things
[13:54:49] elukey: sure ! thanks
[13:55:47] so elukey@analytics1030:~$ telnet aqs1006-a.eqiad.wmnet 9042 works
[13:55:48] but
[13:55:57] telnet aqs1006-a.eqiad.wmnet 9160 does not
[13:56:02] mmmm
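For reference, the reachability probe those telnet commands perform, written as a small JVM-side sketch; it is relevant because the bulk loader itself runs on the JVM and sees the same connect timeouts. Hosts and ports are the ones from this session, and the timeout value is an arbitrary choice.

```scala
import java.net.{InetSocketAddress, Socket}
import scala.util.Try

object PortCheck {
  // True if a TCP connection to host:port succeeds within timeoutMs.
  def reachable(host: String, port: Int, timeoutMs: Int = 5000): Boolean =
    Try {
      val socket = new Socket()
      try socket.connect(new InetSocketAddress(host, port), timeoutMs)
      finally socket.close()
    }.isSuccess

  def main(args: Array[String]): Unit =
    for {
      host <- Seq("aqs1006-a.eqiad.wmnet", "aqs1006-b.eqiad.wmnet")
      port <- Seq(9042, 9160) // CQL vs Thrift RPC
    } println(s"$host:$port -> " + (if (reachable(host, port)) "open" else "unreachable"))
}
```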
[13:58:45] checking iptables -L
[14:00:09] Analytics, Fundraising-Backlog, Blocked-on-Analytics, Fundraising Sprint Licking Cookies, Patch-For-Review: Clicktracking data not matching up with donation totals - https://phabricator.wikimedia.org/T132500#2200676 (Nuria) @awight Let us know what help you are looking from analytics team. Th...
[14:00:11] so I checked analytics1030.eqiad.wmnet's IP against the /24 subnets allowed to connect to port 9160 on aqs1004 and it seems ok
[14:01:28] joal: can you past the exact error message?
[14:01:32] *paste
[14:01:52] elukey: It's java, connection timeout
[14:02:38] elukey: Might be a cassandra setting as well ! Just remembered that, maybe cassandra is not listening on port 9160?
[14:02:56] * elukey checks
[14:03:29] bingo, nothing listening on 9160 from netstat -nlpt
[14:03:41] hmmm
[14:03:44] MWARF !
[14:04:46] I found https://docs.datastax.com/en/cassandra/2.1/cassandra/tools/toolsEnableThrift.html
[14:04:59] Analytics, Fundraising-Backlog, Blocked-on-Analytics, Fundraising Sprint Licking Cookies, Patch-For-Review: Clicktracking data not matching up with donation totals - https://phabricator.wikimedia.org/T132500#2404800 (awight) @Nuria I don't think anyone has both the domain knowledge about Kaf...
[14:05:29] elukey: would that do the trick?
[14:05:57] probably, but there might be a puppet setting.. checking
[14:07:45] $start_rpc = true,
[14:07:45] $rpc_address = $::ipaddress,
[14:07:46] $rpc_port = 9160,
[14:07:55] this is the thrift cassandra default config
[14:08:29] aaaand in the aqs role
[14:08:30] cassandra::start_rpc: 'false'
[14:08:36] joal --^
[14:08:40] AHHHH !
[14:08:42] You have it
[14:09:02] I recall having talked with Alex about us not needing thrift
[14:16:52] joal: I'd say to enable it only on aqs100[456] for the moment
[14:19:07] elukey: sounds good to me
[14:22:19] joal: currently testing the change with the puppet compiler
[14:22:26] should be ready to merge in ~10 min
[14:22:31] You rock elukey :)
[14:22:45] let's wait to see if it works first :P
[14:23:00] :)
[14:30:07] jenkins was stuck and verifications are a bit delayed
[14:38:17] Analytics, Fundraising-Backlog, Blocked-on-Analytics, Fundraising Sprint Licking Cookies, Patch-For-Review: Clicktracking data not matching up with donation totals - https://phabricator.wikimedia.org/T132500#2404933 (Nuria) >Confirmation that my webrequest queries make sense. You are counting...
[14:39:58] INFO [main] 2016-06-24 14:39:42,177 ThriftServer.java:119 - Binding thrift service to /10.64.0.126:9160
[14:40:02] \o/
[14:40:10] we need to restart all the instances though
[14:41:51] elukey: nothing uses them, we can do
[14:42:00] elukey: don't you have an upgrade to do as well ?
[14:42:56] awight:
[14:42:58] yt?
[14:43:29] joal: yeah.. but I'd prefer to reboot them next week :P
[14:43:51] actually aqs1004 has linux 4.4 already
[14:44:00] Ah, ok :)
[14:44:28] all right, restarting all the instances one at a time.. I'll let you know when it is finished :)
[14:44:48] Thanks a mil elukey !!
[14:47:33] nuria_: hi!
[14:47:50] awight: regarding your dataloss issues
[14:47:53] Are you at wikimania by any chance? I could show you the confidential config, if that's helpful.
[14:48:00] awight: no
[14:48:07] k
[14:48:07] awight: no wikimania
[14:48:14] awight: but let me ask you
[14:48:32] awight: have you tried to consume from your endpoint directly rather than looking at your db?
[14:48:59] awight: to verify that data is indeed not in your kafka feed?
[14:49:16] awight: you can consume with kafkacat
[14:49:38] The counts I quoted come from two places--the webrequest db on stat1002, and the plaintext file that kafkatee creates on our fundraising server.
[14:49:43] That's a great idea, to consume directly.
[14:50:22] Is it possible to consume directly from stat1002 as well, so I have data to cross-check against?
[14:51:20] Analytics, Fundraising-Backlog, Blocked-on-Analytics, Fundraising Sprint Licking Cookies, Patch-For-Review: Clicktracking data not matching up with donation totals - https://phabricator.wikimedia.org/T132500#2404958 (awight) That sounds right, I see lots of image mime-types, so that's not bei...
[14:54:37] aha, looks like I can! https://wikitech.wikimedia.org/wiki/Analytics/Cluster/Kafka#Produce.2FConsume_to_kafka
[14:54:47] awight: 1002 data is all in the webrequest table
[14:55:07] awight: so while you could i do not think you need to
[14:55:11] but it's so slow to query :)
[14:55:16] fair enough, though
[14:55:29] awight: well, the stream is 150,000 per sec
[14:55:40] awight: so you might find it not so efficient either
[14:56:23] awight: do consume directly from your endpoint and see whether what you see matches the stream
[14:56:38] awight: we will be here for a while so you can let us know
[14:56:46] wonderful, I'll do that now
[15:05:46] joal: done!
[15:06:02] Yay ! Teeeeeeeesting again :)
[15:06:06] nuria_: Too bad--it turns out I don't have access to do this test: cannot log into the machine with kafkatee, kafkacat probably isn't installed, I don't know what our endpoints are, and the firewalls prevent me from doing this anywhere else in the fundraising cluster. I'll work on it and get back to you.
[15:06:49] This was helpful though, I'm out of my dead-end for the moment. Thank you!
[15:06:56] awight: do you guys have any ops?
[15:07:59] nuria_: We do, Jeff_Green. He's at Wikimania so we can probably turn this around by tomorrow.
[15:08:12] awight: ok, let us know, it is unlikely that even our ops can access the fr cluster.
[15:11:03] I think that's true. Maybe a handful of ops can get in there, but I probably shouldn't enumerate them here. Security through obscurity!
[15:14:57] elukey: seems not to work either :(
[15:15:16] sigh
[15:15:19] same error?
[15:15:26] not sure yet
[15:15:57] ah snap telnet aqs1006-a.eqiad.wmnet 9160 doesn't work from an1030
[15:15:58] grrrr
[15:16:11] ok, so maybe same error
[15:28:05] moritzm: there seems to be a problem in the settings. I checked iptables -L on aqs100[456] and the rules allowing incoming analytics network traffic for ports 9042 and 9160 are the same, but from analytics1030 (an IP contained in one of the subnets) I can telnet only to 9042 and not 9160
[15:28:29] the thrift service is up and running, tested it using telnet on localhost
[15:28:42] Analytics, Fundraising-Backlog, Blocked-on-Analytics, Fundraising Sprint Licking Cookies, Patch-For-Review: Clicktracking data not matching up with donation totals - https://phabricator.wikimedia.org/T132500#2405124 (awight) On Nuria's suggestion, I'm going to coordinate with jgreen and attem...
[15:29:19] I don't know if I am missing something or maybe there are some network ACLs related to port 9042 and not 9061
[15:29:25] *9160
[15:34:59] hm
[15:39:32] I don't think so but it might be an option
[15:40:34] I tried to stop ferm on aqs1006 and then telnet from analytics1030
[15:40:38] same issue
[15:40:54] elukey: have you tried on aqs1006-a?
[15:41:00] yes
[15:41:06] k
[15:41:09] makes no sense :(
[15:41:27] Maybe config for thrift and multi-instance is not done?
[15:42:16] elukey@aqs1006:~$ telnet aqs1006-a.eqiad.wmnet 9160
[15:42:16] Trying 10.64.48.148...
[15:42:16] Connected to aqs1006-a.eqiad.wmnet.
[15:42:17] Escape character is '^]'.
[15:42:50] hm - Looks like a real firewall thing then
[15:45:36] there are network level ACLs for ports
[15:45:47] checking if we can add an exception
[15:46:11] Woaw
[15:47:53] need to file a task for netops!
[15:48:18] hm, looks like my tests are gonna wait next week :)
[15:48:22] (CR) Addshore: [C: -1] "-1 per question on ticket" [analytics/refinery/source] - https://gerrit.wikimedia.org/r/295896 (https://phabricator.wikimedia.org/T138500) (owner: Addshore)
[15:59:06] joal: https://phabricator.wikimedia.org/T138609
[16:00:09] elukey: Need to correct the port number on the ticket :)
[16:00:36] 2 different versions :)
[16:01:06] ah snap thanks
[16:01:07] done
[16:01:52] elukey: standduppp
[16:08:06] Analytics-Kanban: Ratention metric research - https://phabricator.wikimedia.org/T138611#2405288 (Nuria)
[16:09:02] Analytics-Kanban: Retention metric research - https://phabricator.wikimedia.org/T138611#2405300 (Nuria)
[16:15:05] Analytics-Kanban: Test cassandra compactions on new AQS nodes - https://phabricator.wikimedia.org/T135145#2289668 (JAllemandou)
[16:15:35] actually nuria_ I had forgotten about the existing task: https://phabricator.wikimedia.org/T126243
[16:15:55] joal: k