[06:53:17] Analytics, EventBus, Services, Availability (Multiple-active-datacenters), User-mobrovac: WANObjectCache relay daemon or mcrouter support - https://phabricator.wikimedia.org/T97562#3977687 (Joe) Resolved>Open
[07:03:57] Analytics, EventBus, Services, Availability (Multiple-active-datacenters), User-mobrovac: WANObjectCache relay daemon or mcrouter support - https://phabricator.wikimedia.org/T97562#3977706 (Joe) I'm reopening this since the status of the FLOSS mcrouter project in the last year has been dire:...
[08:55:16] (CR) Joal: "2 small comments" (2 comments) [analytics/refinery/source] (jsonrefine) - https://gerrit.wikimedia.org/r/411090 (owner: Ottomata)
[08:56:19] Analytics, TCB-Team, Two-Column-Edit-Conflict-Merge, WMDE-Analytics-Engineering, and 5 others: How often are new editors involved in edit conflicts - https://phabricator.wikimedia.org/T182008#3977850 (Lea_WMDE) @addshore Can you think of a reason why we have 5109 data points for a one-month time...
[09:01:19] (CR) Joal: "1 mini-nit" (1 comment) [analytics/refinery/source] (jsonrefine) - https://gerrit.wikimedia.org/r/410241 (owner: Ottomata)
[09:11:06] (CR) Joal: "Ideas for being even more generic - I think we're close to final merge :) This is AWESOME!" (2 comments) [analytics/refinery/source] (jsonrefine) - https://gerrit.wikimedia.org/r/410942 (owner: Ottomata)
[10:46:00] just created https://grafana.wikimedia.org/dashboard/db/kafka-by-topic-prometheus
[10:46:08] it seems that netflow puts a lot of data :D
[10:46:16] maybe it is way too much for a single partition
[10:55:22] !log increased topic partitions for netflow to 3
[10:55:23] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[10:56:55] is pmacct so smart to use all partitions ?
[11:00:03] in theory, iirc, librdkafka should by default to the round robin macig
[11:00:06] *magic
[11:17:05] ahhh there you go, on our version of pmacct kafka_partition is not defaulting to -1
[11:17:08] sigh
[11:28:18] yesss now it works \o/
[11:33:36] also increased Camus mapred jobs from 1 to 3
[11:33:51] great elukey :)
[11:34:10] elukey: I'm assuming netflow topic now receives data from all routers?
[11:34:44] joal: still not all, but more than 1 :)
[11:34:45] https://grafana.wikimedia.org/dashboard/db/kafka-by-topic-prometheus?refresh=5m&orgId=1&from=now-1h&to=now&var-datasource=eqiad%20prometheus%2Fops&var-cluster=kafka_jumbo&var-kafka_broker=All&var-topic=netflow
[11:34:55] elukey: :)
[11:34:58] YAY !
[11:35:23] \o/
[11:38:35] also joal I fixed two issues that prevented oozie/hive-metastore to get prometheus metrics
[11:38:50] <3
[11:38:55] will wait for the next restart to apply jmx monitoring to them
[11:39:14] sadly hive server is still pending :(
[11:43:49] joal: yesterday Andrew gave me a tl;dr about the work that you guys are doing, really great :)
[11:51:48] elukey: the SparkRefine is a cool project indeed !
[11:58:03] Analytics-Kanban, Operations, monitoring, netops, and 2 others: Pull netflow data in realtime from Kafka via Tranquillity/Spark - https://phabricator.wikimedia.org/T181036#3978245 (elukey) After the last round of patches nfacctd/pmacct are sending events to Kafka using three topic partitions rath...
[12:01:07] elukey: quick question for you: do the analytics worker nodes have access to conf100X?
[12:03:08] joal: nope
[12:03:13] only master nodes
[12:03:38] Ah - This explains why my slider trial fails :)
[12:03:56] Do you think it'd be feasible to have access from the workers?
[12:04:05] Or should I create a separate cluster?
[12:04:35] it is a bit scary since in there we have all metadata about hadoop HA and the Kafka Clusters :D
[12:05:06] elukey: I knew you'd be reasonable and paranoid :)
[12:05:42] elukey: Shall I use a test zk I'd start on stat1004 for instance?
[12:05:44] hahahaah
[12:06:39] should be fine yes
[12:07:18] I'm gonna try that
[12:07:22] Thanks :)
[12:08:04] joal: I am also extra paranoid since we don't have a good way (yet) to restrict zk usage on conf100X based on credentials
[12:08:14] and from now on in there we'll store the Kafka ACLs
[12:08:51] so if we open the doors to the worker nodes it might be problematic
[12:09:08] makes total sense elukey
[12:09:25] this is my feeling though, we can test/revise/amend/etc.. after some testing
[12:09:37] if slider will be a super cool thing to use it makes sense
[12:10:11] elukey: I'm trying it first, so no rush :)
[12:39:14] * elukey lunch!
[13:06:44] Analytics, TCB-Team, Two-Column-Edit-Conflict-Merge, WMDE-Analytics-Engineering, and 5 others: How often are new editors involved in edit conflicts - https://phabricator.wikimedia.org/T182008#3978460 (Addshore) >>! In T182008#3977850, @Lea_WMDE wrote: > @addshore Can you think of a reason why we...
[13:14:44] Analytics, TCB-Team, Two-Column-Edit-Conflict-Merge, WMDE-Analytics-Engineering, and 5 others: How often are new editors involved in edit conflicts - https://phabricator.wikimedia.org/T182008#3978498 (Lea_WMDE) I should have written 30 * 1500, but what I mean is: We know we have way more edit con...
[14:23:56] Analytics, TCB-Team, Two-Column-Edit-Conflict-Merge, WMDE-Analytics-Engineering, and 5 others: How often are new editors involved in edit conflicts - https://phabricator.wikimedia.org/T182008#3978694 (Addshore) Interesting, so there is https://meta.wikimedia.org/wiki/Schema:EditConflict and the l...
[14:40:00] Analytics, Discovery, Wikidata, Wikidata-Query-Service, Wikimedia-Stream: Increase kafka event retention to 14 or 21 days - https://phabricator.wikimedia.org/T187296#3978746 (Pchelolo) Currently on kafka-main machines the disk utilization is really low, so I think we can easily do it without k...
[14:40:05] Analytics, Discovery, Wikidata, Wikidata-Query-Service, Wikimedia-Stream: Increase kafka event retention to 14 or 21 days - https://phabricator.wikimedia.org/T187296#3978748 (Pchelolo) Currently on kafka-main machines the disk utilization is really low, so I think we can easily do it without k...
[14:42:20] Analytics, TCB-Team, Two-Column-Edit-Conflict-Merge, WMDE-Analytics-Engineering, and 5 others: How often are new editors involved in edit conflicts - https://phabricator.wikimedia.org/T182008#3978750 (Addshore) So, it looks like we cannot record the text of the edit conflict in event logging, at...
[14:47:06] Analytics, EventBus, Services (later): Investigate why disk usage on Kafka nodes is 2 times lower in codfw - https://phabricator.wikimedia.org/T187554#3978767 (Pchelolo)
[15:21:17] elukey: I didn't submit the EL change, just +2-ed it
[15:21:44] about that constant, I agree it's only useful for that function, but it's good to bubble out constants like that, which might need to be tuned to affect performance, in case we need to pull them out into external config
[15:23:14] Analytics-Kanban, Operations, ops-eqiad: Broken disk on analytics1057 - https://phabricator.wikimedia.org/T187162#3978840 (Cmjohnson)
[15:23:49] milimetric: ack, thanks :)
[15:24:40] elukey: let me know if you want me to merge and deploy it, I can do it next week
[15:24:57] or today if you're all feeling lucky about Friday :)
[15:25:50] milimetric: I am eager to see it running but it would go against my do-not-merge-on-fridays mantra :D
[15:26:09] I think I can manage the deployment on Monday, it should be a scap deploy + el restart right?
[15:26:42] yeah, normal EL deploy, but I meant 'cause it's my ops week
[15:26:46] I won't be around monday for the holiday
[15:26:54] but I'll be here Tuesday, and happy to do it
[15:27:45] ah snap not on Monday right, don't want to break anything on a holiday either :)
[15:27:52] let's do it together on tue ok?
[15:28:41] k, cool, I'll try to be on early
[15:29:04] milimetric: nah whenever you want, there is no rush :)
[15:29:31] nice! Broken disk on an1057 swapped!
[15:41:42] !log add analytics1057 back in the Hadoop worker pool after disk swap
[15:41:44] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[15:41:58] RECOVERY - Hadoop DataNode on analytics1057 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.hdfs.server.datanode.DataNode
[15:42:46] hello an1057, welcome back
[16:14:34] Analytics, EventBus, Services (done): ChangeProp workers die if they can't connect to redis - https://phabricator.wikimedia.org/T179684#3979072 (Pchelolo) p:Triage>High This happened again, so reopening. A couple of notes: 1. The situation is better after deploying the fix, only on `KafkaCon...
[16:14:43] Analytics, EventBus, Services (done): ChangeProp workers die if they can't connect to redis - https://phabricator.wikimedia.org/T179684#3979075 (Pchelolo) Resolved>Open
[17:30:26] oh milimetric I figured out the thing so nvm
[17:30:38] k, cool fdans
[17:51:02] Analytics, EventBus, Services (watching): Enable multiple topics in EventStreams URL - https://phabricator.wikimedia.org/T187418#3979316 (Pchelolo)
[17:53:14] * elukey off!
[18:02:08] joal, shouldn't this dashboard have data? https://grafana.wikimedia.org/dashboard/db/analytics-hive
[18:02:40] mforns: I think this is in rework by elukey
[18:02:41] nuria_: are we oneononeing?
[18:02:48] Analytics-Kanban, Operations, monitoring, netops, and 2 others: Pull netflow data in realtime from Kafka via Tranquillity/Spark - https://phabricator.wikimedia.org/T181036#3979339 (Nuria) Are we planning to use tranquility to move the data into druid or rather just kafka-> camus-> hive?
[18:02:49] ok ok
[18:02:49] yes, on hangout
[18:06:22] gone for dinner team
[18:56:25] (PS6) Milimetric: [WIP] Saving in case laptop catches on fire [analytics/refinery] - https://gerrit.wikimedia.org/r/408848 (https://phabricator.wikimedia.org/T184759)
[19:13:13] hey folks! Let's say I wanted to get pageview analytics for wikitech. Where would you suggest I look first?
[19:13:29] Specifically, I want to know who is using ToolForge documentation.
[19:18:30] Maybe something like piwik
[19:20:28] Analytics, Cloud-VPS, EventBus, Patch-For-Review, Services (watching): Add page-related topics to EventStreams - https://phabricator.wikimedia.org/T187241#3979609 (Pchelolo)
[19:26:11] halfak: no clue :(
[19:29:18] Analytics, EventBus, Services (doing): ChangeProp workers die if they can't connect to redis - https://phabricator.wikimedia.org/T179684#3979625 (Pchelolo) Right now it's problematic to dig the logs because it's hard to filter all the logs by worker id to understand the sequence of events happening w...
[19:31:28] halfak: wikitech is not behind varnish
[19:31:39] Gotcha. I worried that that was the case.
[19:31:42] halfak: so no pageview analytics exists
[19:31:56] halfak: best you could do is apache logs
[19:31:59] piwik?
[19:32:11] not sure, depends on traffic
[19:32:55] piwik is good for say 1 million req per month of reqs per day but not much beyond that
[19:33:10] halfak: so maybe yes?
[19:33:26] 1 million per month say
[19:33:29] Gotcha. Thanks. I'll take that back to the DOCS SIG )
[19:33:39] :)
[19:33:58] halfak: a dedicated piwik might be able to handle it, let me see highest site on piwik now, i think is ios app
[19:35:11] halfak: ya, those are different but looks like there are ~ 100.000 uniques per day
[19:35:26] halfak: so probably can be done
[19:35:34] halfak: with a dedicated piwik
[19:42:09] Makes sense.
[19:42:41] halfak: let us know what you find
[19:48:07] (PS4) Nuria: Fix issues with numer formatting [analytics/wikistats2] - https://gerrit.wikimedia.org/r/409714 (https://phabricator.wikimedia.org/T187010)
[19:48:13] (CR) jerkins-bot: [V: -1] Fix issues with numer formatting [analytics/wikistats2] - https://gerrit.wikimedia.org/r/409714 (https://phabricator.wikimedia.org/T187010) (owner: Nuria)
[20:05:31] Gone for tonight a-team :)
[20:05:37] byeeeeeeee
[20:05:40] :]
[22:48:58] mforns[m]: still there? cc milimetric
[22:51:36] DarTar: can you give me an opinion?
[23:08:22] oh sorry, what’s up nuria_
[23:36:54] milimetric: take a look at this and tell me what you think
[23:37:17] https://usercontent.irccloud-cdn.com/file/zIPyfO4d/Screen%20Shot%202018-02-16%20at%202.41.22%20PM.png
[23:37:30] https://usercontent.irccloud-cdn.com/file/tR9nR4RW/Screen%20Shot%202018-02-16%20at%202.41.41%20PM.png
[23:45:08] (PS5) Nuria: [WIP] Fix issues with numer formatting [analytics/wikistats2] - https://gerrit.wikimedia.org/r/409714 (https://phabricator.wikimedia.org/T187010)
[23:45:15] (CR) jerkins-bot: [V: -1] [WIP] Fix issues with numer formatting [analytics/wikistats2] - https://gerrit.wikimedia.org/r/409714 (https://phabricator.wikimedia.org/T187010) (owner: Nuria)