[00:02:12] Analytics: Vital Signs: Please make the data for enwiki and other big wikis less sad, and not just be missing for most days - https://phabricator.wikimedia.org/T120036#2895150 (Nuria) Preliminary data is temporarily at: https://analytics.wikimedia.org/dashboards/standard-metrics/#projects=eswiki,itwiki,enwik...
[00:08:57] Analytics-Kanban, Operations, Reading-Web-Backlog, Traffic: mobile-safari has very few internally-referred pageviews - https://phabricator.wikimedia.org/T148780#2895170 (Nuria) @JKatzWMF Besides documenting this fact as one on the dataset (super thanks for reporting!) I do not think there is any...
[00:09:15] Analytics, Operations, Reading-Web-Backlog, Traffic: mobile-safari has very few internally-referred pageviews - https://phabricator.wikimedia.org/T148780#2895171 (Nuria) a:mforns>None
[00:18:23] Analytics, Operations, Traffic: A/B Testing solid framework - https://phabricator.wikimedia.org/T135762#2895208 (Nuria) >Is that testing framework also planned to work with central notice/banners, or is that a separate infrastructure? couldn't say w/o knowing how central banner infrastructure works....
[00:29:12] Analytics-Kanban, User-Elukey: Valgrind tutorial for periodical mem usage reviews - https://phabricator.wikimedia.org/T147438#2895224 (Nuria) Valgrind added to README of tests: https://gerrit.wikimedia.org/r/#/c/328381/1/README.md
[00:30:10] Analytics-Kanban: Varnishkafka testing framework - https://phabricator.wikimedia.org/T147432#2895226 (Nuria)
[00:30:12] Analytics-Kanban, User-Elukey: Valgrind tutorial for periodical mem usage reviews - https://phabricator.wikimedia.org/T147438#2895225 (Nuria) Open>Resolved
[00:30:34] Analytics-Kanban: decommission outdated instances on labs - https://phabricator.wikimedia.org/T153193#2895231 (Nuria) Open>Resolved
[00:31:29] Analytics-Kanban, User-Elukey: Webrequest dataloss registered during the last kafka restarts - https://phabricator.wikimedia.org/T152674#2895233 (Nuria) Open>Resolved
[00:31:44] Analytics-Kanban: Varnishkafka testing framework - https://phabricator.wikimedia.org/T147432#2692790 (Nuria) Open>Resolved
[00:40:51] (Draft2) XXN: More options for the number of shown rows of resultset. [analytics/quarry/web] - https://gerrit.wikimedia.org/r/328602
[00:51:10] Analytics, EventBus, Patch-For-Review, Services (watching): Check eventbus Kafka cluster settings for reliability - https://phabricator.wikimedia.org/T144637#2895309 (Pchelolo)
[00:59:22] (CR) Nuria: "Couple small nits but looks like it is ready to go." (4 comments) [analytics/refinery/source] - https://gerrit.wikimedia.org/r/327237 (https://phabricator.wikimedia.org/T120131) (owner: Fdans)
[02:53:26] Analytics, EventBus, Patch-For-Review, Services (watching): Check eventbus Kafka cluster settings for reliability - https://phabricator.wikimedia.org/T144637#2895400 (Pchelolo) @Ottomata I've found an excellent article that answers your first question. Highly recommended: http://126kr.com/article...
[07:42:02] Analytics-Kanban, LDAP-Access-Requests, User-Elukey: Add Francisco Dans to the wmf LDAP group - https://phabricator.wikimedia.org/T153847#2895593 (elukey) Open>Resolved a:elukey Added `fdans` to `wmf`.
[09:02:22] hi Analytics people. :) if you're around, can you let me know where I can see the code or the definition used to compute agent_type in webrequest logs?
[09:03:18] leila: o/
[09:05:06] hi elukey.
[09:06:31] leila: maybe https://github.com/wikimedia/analytics-refinery-source/blob/release/refinery-core/src/main/java/org/wikimedia/analytics/refinery/core/Webrequest.java#L51 could be a starting point ?
[09:07:08] got you. thanks, elukey.
[09:07:36] I am farly ignorant about the refinery source, but I would bet on refinery source java files for what you need
[09:07:41] *fairly
[09:26:19] elukey: o/
[09:45:04] joal: o/
[10:08:44] * elukey afk for a couple of hours
[10:43:44] (PS11) Joal: Add mediawiki history spark jobs to refinery-job [analytics/refinery/source] - https://gerrit.wikimedia.org/r/325312 (https://phabricator.wikimedia.org/T141548)
[10:51:57] hello team :]
[11:14:07] Hello mforns !
[11:14:34] mforns: If you don't mind, could you have a look at the overview.md file in the patch above ?
[11:14:53] mforns: I'd like to have your opinion on the clarity of my explanations ;)
[11:15:06] joal, hi!
[11:15:17] sure, looking
[11:15:25] mforns: Thanks mate
[11:16:54] joal, give me some minutes, though, I'm in the middle of something
[11:17:18] mforns: no rush, take your time (it's also long and complex reading ;)
[11:44:14] Hi fdans :)
[11:44:25] hello joal!
[11:44:45] I have the same request for you as the one I asked mforns : could you have a look at the overview.md file in the patch above ?
[11:44:55] this one: https://gerrit.wikimedia.org/r/325312
[11:45:06] of course
[11:45:38] fdans: I'm interested to know if my writing makes sense :)
[11:57:45] fdans: whenever you have time, can you try to access yarn.wikimedia.org and pivot.wikimedia.org?
[12:04:13] (CR) Fdans: "Just a couple of tiny things, but everything makes sense to me and it does a great job at illustrating the process :)" (2 comments) [analytics/refinery/source] - https://gerrit.wikimedia.org/r/325312 (https://phabricator.wikimedia.org/T141548) (owner: Joal)
[12:05:22] elukey both are working, thank you Luca!!!
[12:05:33] \o/
[12:05:56] fdans: check https://wikitech.wikimedia.org/wiki/LDAP_Groups when you have time (you are in 'wmf')
[12:31:47] joal: o/
[12:31:50] do you have a minute?
[12:31:57] Hi elukey
[12:32:06] elukey: I always have time for you :)
[12:32:31] \o/
[12:32:53] I was reviewing https://grafana.wikimedia.org/dashboard/db/analytics-hadoop?from=now-30d&to=now to add 90% and 100% heap utilization thresholds
[12:33:06] (you can see the yellow and red bars/backgrounds)
[12:33:20] and then I noticed the pattern from the 17th/18th onwards
[12:33:35] for the node managers
[12:33:36] Analytics, Analytics-Wikistats, Tamil-Sites: Categories created using Tamil words not recognised in stats - https://phabricator.wikimedia.org/T6537#2895947 (MarcoAurelio)
[12:33:53] increase in heap usage and Old Generation GC count
[12:34:29] the node managers are hitting 90% JVM heap usage
[12:34:44] and I think that the frequent old GC count is the thing that prevents OOMs :D
[12:34:58] Analytics, Analytics-Wikistats, Tamil-Sites: Categories created using Tamil words not recognised in stats - https://phabricator.wikimedia.org/T6537#2896012 (MarcoAurelio)
[12:37:00] elukey: I think the cluster has been busy this last week :)
[12:37:19] oh yes but I just want to make sure that it is not due to something weird ongoing
[12:37:39] elukey: I don't think so, I have not seen anything problematic
[12:37:42] https://grafana.wikimedia.org/dashboard/db/analytics-hadoop?from=now-90d&to=now
[12:37:58] we haven't seen this pattern during the past three months
[12:38:23] right elukey
[12:38:42] (CR) Mforns: Add mediawiki history spark jobs to refinery-job (3 comments) [analytics/refinery/source] - https://gerrit.wikimedia.org/r/325312 (https://phabricator.wikimedia.org/T141548) (owner: Joal)
[12:39:15] joal, wrote some comments on the overview.md, great docs, thanks a lot!!!!
[12:39:21] elukey: thanks, checked that doc out and added ldap info to the onboarding article
[12:39:25] mforns: Thanks a lot for reading !
[12:39:46] elukey: still thinking on the nodemanager thing
[12:39:58] elukey: batcave for braindump?
[12:41:02] joal: 2 min! Coffee and I'll join :)
[12:41:07] sure elukey
[13:02:19] (PS5) Fdans: Standardized UDF naming [analytics/refinery/source] - https://gerrit.wikimedia.org/r/327237 (https://phabricator.wikimedia.org/T120131)
[13:16:49] elukey: could be related to that bug https://issues.apache.org/jira/browse/PARQUET-353, and the fact that we use spark external shuffle service
[13:18:00] elukey: spark external shuffle service is a yarn nodemanager auxiliary service, so included in its heap
[13:18:18] and this external service's job is to write files
[13:22:25] hmmm - cdh 5.5.2 uses parquet 1.5, not referenced in the mentioned bug
[13:25:54] yeah I suspect it is something simialr
[13:25:57] *similar
[13:26:06] hm
[13:35:32] ok, so, it's weird
[13:36:42] cd
[13:36:44] oops
[13:57:30] elukey: I *might* have a track
[13:57:58] elukey: and it seems correlated with the parquet thing
[13:59:23] still trying to figure out how to check the heap, might need to generate a heap dump and then use eclipse mat
[13:59:38] hm
[14:00:49] elukey: using kmap?
[14:00:52] jmap sorry
[14:01:44] elukey: let's try together (I'd need to sudo yarn, which I can't)
[14:04:21] I am trying on an1034 but I get 2422: Unable to open socket file: target process not responding or HotSpot VM not loaded
[14:04:45] batcave?
[14:05:13] let me try a bit more
[14:07:19] ahhh I need sudo -u yarn
[14:08:07] started!
[14:08:15] /tmp/an1034.hprof is the target
[14:08:30] ok cool
[14:09:00] Let's analyse it with jhat after :)
[14:09:53] all right done
[14:10:02] you should be able to read it
[14:10:06] hello!
[14:10:15] o/
[14:10:21] If you need additional eyes on that heap dump, I'm here...
[14:10:26] thanks a lot :)
[14:11:00] PROBLEM - Hadoop NodeManager on analytics1032 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager
[14:11:17] mmmm
[14:11:28] I worked on 1034 :D
[14:11:48] that looks like magic...
[14:12:56] ouch java.lang.OutOfMemoryError: GC overhead limit exceeded
[14:13:00] joal --^
[14:13:05] yeah, just saw that lzia
[14:13:07] elukey:
[14:13:09] sorry
[14:14:00] RECOVERY - Hadoop NodeManager on analytics1032 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager
[14:14:29] just restarted it
[14:19:39] elukey, gehel: jhat currently analysing heap dump
[14:19:46] will let you know when news arise
[14:20:26] super
[14:21:04] parsing it here as well...
[14:21:09] \o/
[14:21:21] distributed jvm heap dump analysis
[14:21:21] looks like >500MB taken by org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainerMetrics
[14:21:29] sounds suspicious...
[14:22:18] seems that the GC root of that is a timer...
[14:23:00] gehel: what are you using?
[14:23:05] MAT
[14:25:16] * elukey executing MAT as well..
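
The heap dump step above (the "Unable to open socket file" error at 14:04, fixed at 14:07 by running as the process owner) can be sketched as follows. This is a minimal illustration, not the exact command that was typed in the channel: the helper name `build_heap_dump_command` is hypothetical, the PID 2422 is only inferred from the error message, and the script just prints the command rather than running it. The `-dump:format=b,file=...` option is the standard JDK `jmap` form for a binary heap dump.

```python
import shlex
from typing import List


def build_heap_dump_command(pid: int, dump_path: str, run_as: str = "yarn") -> List[str]:
    """Build the argv for a binary heap dump of a JVM owned by another user.

    jmap has to run as the user that owns the target JVM, otherwise it
    cannot attach to the process (hence `sudo -u yarn` in the log).
    """
    return [
        "sudo", "-u", run_as,
        "jmap", f"-dump:format=b,file={dump_path}", str(pid),
    ]


if __name__ == "__main__":
    # PID and path follow the log; both are illustrative here.
    cmd = build_heap_dump_command(2422, "/tmp/an1034.hprof")
    # Print only; pass `cmd` to subprocess.run(cmd, check=True) to execute it.
    print(shlex.join(cmd))
```

The resulting `.hprof` file can then be opened with jhat or Eclipse MAT, as done in the messages that follow.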
[14:25:34] 280MB of io.netty.buffer.PoolChunk (seems like a lot, but not sure what hadoop is doing)
[14:25:56] I am more concerned about the 500MB of metrics
[14:26:58] yeah, the PoolChunks are probably preallocated and don't increase. That's only 36 instances...
[14:27:03] elukey: no parquet thingy here ....
[14:27:34] (brb in a bit)
[14:29:16] might look a bit like https://issues.apache.org/jira/browse/YARN-5296
[14:29:52] true gehel
[14:38:09] thanks gehel !
[14:38:22] np
[14:38:22] mat just died for OOM
[14:38:24] haahahah
[14:38:31] probably I'll need to increase its xmx
[14:38:40] gehel: we use hadoop 2.6, so not sure, but clearly possible
[14:38:46] yep, its default is really low
[14:39:27] joal: I'd be happier to do a rolling restart
[14:39:34] now that we have a heap dump
[14:39:37] agreed elukey
[14:39:50] joal: at least, both look like something is started from a timer, collects metrics and is not released fast enough...
[14:39:59] * gehel knows nothing about hadoop...
[14:40:01] elukey: gehel yup
[14:40:50] * gehel is going back to other things...
[14:40:55] ping me if I can help!
[14:40:59] gehel: thanks !
[14:50:14] elukey: can't find any info on our hadoop version :(
[14:50:36] elukey: will take a break, need to light the fire
[14:50:51] joal: going to restart in the meantime
[14:53:21] gehel: mat is really nice! (now that I can use it..)
[14:53:24] thanks a lot for the help!
[14:53:34] no problem!
[14:53:55] that's a good thing about Java... there are a ton of nice tools around it...
[15:03:32] gehel: only in this channel could you get away with saying something positive about Java
[15:03:50] try that in -operations and you'll lose your geek license :)
[15:03:54] * gehel nods sadly ...
[15:04:55] is there any language about which you are allowed to say nice things in -operations?
[15:05:06] gehel: it's a shame you're unable to make our meetings, this coming one elukey and I were planning to serve truffles and big red, at the last one we handed out pinarello road bikes
[15:05:23] gehel: python seems mostly uncontroversial
[15:05:59] which meetings?
[15:06:12] the cassandra-standup
[15:06:49] Oh yeah... the few I joined were good! I learned a ton!
[15:07:05] I'll try to join again, but not for the next few weeks
[15:07:45] gehel: i'm just joking :)
[15:08:00] poorly, once again, but joking
[15:08:26] too bad, the pinarello road bike sounded like a good way to get me to join...
[15:08:28] not that it wouldn't be great to have you there, but i understand if you can't
[15:08:32] :P
[15:08:36] oh that part is totally true
[15:08:46] which is why we're serving big red this time around
[15:08:51] we kind of blew our budget
[15:09:09] big red ? (http://www.red.com/)
[15:09:29] ^ that might work to make me join as well!
[15:09:32] http://www.bigred.com/
[15:09:58] much less interesting :(
[15:10:51] LOL
[15:11:10] it's a lot of sugar, but if you are someone who regularly attends the meetings, you can always hop on your road bike to work it off
[15:15:44] Analytics, Analytics-Wikistats, Editing-Analysis: Update active editor metrics to use consensus definition - https://phabricator.wikimedia.org/T153702#2896228 (ezachte) The one update I could apply is allowing redirects be included in edit counts. I was wondering today, does it change anything that...
[15:16:35] elukey: taking care of the oozie stuff while you restart
[15:17:25] joal: ah snap sorry I was trying to avoid these :(
[15:17:27] lunch!
[15:17:47] Analytics, Analytics-Wikistats, Editing-Analysis: Update active editor metrics to use consensus definition - https://phabricator.wikimedia.org/T153702#2896230 (ezachte) @Millimetric I'm not sure but I believe we said any change in the definition will always be reapplied to earlier months if feasible,...
[15:28:12] !log changed firewall rules to allow only $ANALYTICS_NETWORKS (rather than the broader $INTERNAL) for the Yarn UI http service (an1001) and the hive metastore (an1003)
[15:28:13] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[15:28:19] joal: --^
[15:28:28] no ops
[15:28:34] k elukey
[15:28:35] but let me know if you see weirdness
[15:28:38] sure
[15:37:57] restarts done
[15:38:04] the heap size dropped considerably
[15:38:20] I am going to open a phab task to track the leak
[15:38:23] elukey: yeah, no wonder
[16:00:52] Analytics-Kanban, Operations, User-Elukey: Yarn node manager JVM memory leaks - https://phabricator.wikimedia.org/T153951#2896288 (elukey)
[16:00:57] mforns_away: standduppp?
[16:01:31] Analytics-Kanban, Operations, User-Elukey: Yarn node manager JVM memory leaks - https://phabricator.wikimedia.org/T153951#2896303 (elukey)
[16:03:15] Analytics, Analytics-Cluster, User-Elukey: Monitor cluster running out of HEAP space with Icinga - https://phabricator.wikimedia.org/T88640#2896317 (elukey) While reviewing this task, I opened https://phabricator.wikimedia.org/T153951 :D
[16:03:34] Analytics, Analytics-Cluster, User-Elukey: Monitor cluster running out of HEAP space with Icinga - https://phabricator.wikimedia.org/T88640#2896321 (elukey) a:elukey
[16:10:09] Analytics-Cluster, Analytics-Kanban, User-Elukey: Monitor cluster running out of HEAP space with Icinga - https://phabricator.wikimedia.org/T88640#2896334 (Nuria)
[16:15:07] Analytics-Kanban, Operations, User-Elukey: Yarn node manager JVM memory leaks - https://phabricator.wikimedia.org/T153951#2896357 (elukey) p:Triage>Normal
[16:15:29] Analytics-Kanban, Operations, User-Elukey: Yarn node manager JVM memory leaks - https://phabricator.wikimedia.org/T153951#2896288 (elukey) `15:51 !log restarting the yarn node manager java daemons on all the Hadoop worker nodes due to suspect memory leak`
[16:35:12] elukey: |th
[16:35:25] * elukey is confused
[16:35:25] elukey: Thanks again for the nice intuition :)
[16:35:46] joal: thanks for the support!
[16:36:02] I believe that we are getting better and better at managing our cluster \o/
[16:37:10] elukey: You definitely are ;)
[16:37:34] learning from the masters :)
[16:37:46] elukey: you know I don't manage, I just use as much as I can without being noticed ;)
[16:44:09] elukey: are you familiar with the process of creating the key material for Cassandra?
[16:50:49] (CR) Nuria: [C: 2] Standardized UDF naming [analytics/refinery/source] - https://gerrit.wikimedia.org/r/327237 (https://phabricator.wikimedia.org/T120131) (owner: Fdans)
[16:50:55] urandom: nope :(
[16:51:51] elukey: OK, we need to do that as a part of T153880, and I mentioned to godog that you might be interested in such opportunities
[16:51:51] T153880: Cassandra/RESTBase test environment (MVP) - https://phabricator.wikimedia.org/T153880
[16:52:23] he's got some info on that on wikitech, but TTBMK, he's the only one that has actually done it so far
[16:52:40] i can't do it because it requires access to private.git
[16:52:51] joal: for the loading of data into druid then your recommendation for the PoC for discovery is to use a file + python loader, correct?
[16:53:16] elukey: so TL;DR, do you want in on this?
[16:53:52] joal: the data we want to load is described here
[16:53:54] elukey: and this is that doc, i think: https://wikitech.wikimedia.org/wiki/Cassandra#Installing_and_generating_certificates
[16:54:07] nuria: I even think I'd go for a manual update of the json file instead of going through python
[16:54:32] nuria: where?
[16:55:39] joal: sorry, here: https://gerrit.wikimedia.org/r/#/c/327845/
[16:56:26] nuria: first issue: in here https://gerrit.wikimedia.org/r/#/c/327845/3/oozie/maps/druid/tiles_table.hql, you should use JSON, not sequence file
[16:57:03] nuria: Because in here https://gerrit.wikimedia.org/r/#/c/327845/3/oozie/maps/druid/load_map_tiles.template.json, you tell it to use a JSON parser
[16:57:51] mforns thoughts? https://www.dropbox.com/s/zuawj419mpfd5mp/Screenshot%202016-12-22%2011.57.37.png?dl=0
[16:58:05] https://www.irccloud.com/pastebin/FQ2bAJke/
[16:58:17] joal: is this sufficient?
[16:58:19] ROW FORMAT SERDE 'org.apache.hive.hcatalog.data.JsonSerDe'
[16:59:11] Nope nuria, there's a bit more than that - follow example here: https://github.com/wikimedia/analytics-refinery/blob/master/oozie/pageview/druid/daily/generate_daily_druid_pageviews.hql
[16:59:13] joal: and also "STORED AS TEXTFILE"
[16:59:22] urandom: it doesn't seem super difficult
[16:59:55] nuria: things not to forget from that file: What's at the top (compression param (2 lines) and jar added (1 line))
[16:59:59] nope, it should be as simple as editing that yaml manifest, running the script, and then checking in the results
[17:00:50] joal: ok, i will make another table from bearloga's following that example
[17:01:23] also nuria, another question, on druid metrics definition side this time: in metrics spec, there is 'tiles', being a 'count' over field name 'tiles'
[17:01:36] joal: yes
[17:01:53] nuria: is the original field precomputed in the table?
[17:02:00] If so, then the def should be different
[17:02:19] elukey: again, godog is planning to do this, i just thought i'd mention it to you, in case you wanted the opportunity to see what was involved
[17:02:28] joal: it is precomputed yes " tiles BIGINT COMMENT 'Number of tiles successfully requested"
[17:02:45] nuria: right, sorry for not picking it there
[17:04:26] So, definition should use 'longSum' instead of 'count' in type
[17:04:50] joal: k, updating patch in both counts
[17:06:13] (PS4) Joal: Add oozie job loading MW history in druid [analytics/refinery] - https://gerrit.wikimedia.org/r/328154 (https://phabricator.wikimedia.org/T141473)
[17:06:16] nuria: patching my thing as well :)
[17:06:36] But actually nuria, there is only one count, no?
[17:11:58] nuria: as an example: https://phabricator.wikimedia.org/P4671
[17:12:21] joal: I cannot view that patch
[17:12:36] joal: it says "members of analytics can view" . yaya!
[17:18:28] sorry nuria, updated
[17:19:11] nuria: I'm gone for a few minutes and see my son, will be back for our 1-1 and after
[17:19:30] joal: sounds good
[17:32:26] nuria did you want to meet today or tomorrow for 1:1
[17:32:32] got an invite change showing today?
[17:33:06] ashgrigas: sorry, did not receive a reminder, joining!
[17:52:54] * elukey afk!
[18:18:34] Analytics, ChangeProp, EventBus, Reading-Web-Backlog, and 4 others: Subscription: Trending service should be able to subscribe to edits in real time - https://phabricator.wikimedia.org/T145553#2896623 (phuedx)
[18:35:20] Analytics, Operations, Reading-Web-Backlog, Traffic: mobile-safari has very few internally-referred pageviews - https://phabricator.wikimedia.org/T148780#2896712 (JKatzWMF) @Nuria @mforns I think having an alternative with the typo'd version makes a lot of sense. These metrics are used as a pro...
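
On the 'longSum' versus 'count' point made at 17:04: a small sketch of why the aggregator type matters when the `tiles` column is already precomputed per row. The sample rows and the `zoom` field are hypothetical and the two helper functions only imitate what the aggregators compute; the `METRICS_SPEC` entry follows Druid's documented `longSum` aggregator shape (`type`, `name`, `fieldName`) and is an assumption about what the patch should contain, not a copy of it.

```python
import json

# Hypothetical pre-aggregated input rows: each row already carries a
# precomputed number of tiles, like the BIGINT `tiles` column in the table.
ROWS = [
    {"zoom": 10, "tiles": 5},
    {"zoom": 11, "tiles": 3},
    {"zoom": 12, "tiles": 2},
]


def count_aggregator(rows):
    """Imitates a 'count' metric: the number of ingested rows."""
    return len(rows)


def long_sum_aggregator(rows, field_name):
    """Imitates a 'longSum' metric: the sum of an integer column."""
    return sum(int(row[field_name]) for row in rows)


# Metric definition in the shape Druid expects for a summed long column.
METRICS_SPEC = [{"type": "longSum", "name": "tiles", "fieldName": "tiles"}]

if __name__ == "__main__":
    print("count   :", count_aggregator(ROWS))
    print("longSum :", long_sum_aggregator(ROWS, "tiles"))
    print(json.dumps(METRICS_SPEC, indent=2))
```

With precomputed values, counting rows would under-report the number of tiles, which is why summing the column is the definition suggested in the discussion.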
[18:36:54] Analytics, Operations, Reading-Web-Backlog, Traffic: mobile-safari has very few internally-referred pageviews - https://phabricator.wikimedia.org/T148780#2896722 (mforns) a:BBlack
[18:37:22] Analytics, Operations, Reading-Web-Backlog, Traffic: mobile-safari has very few internally-referred pageviews - https://phabricator.wikimedia.org/T148780#2732531 (mforns) Assigned the task to @BBlack , so that he can give his opinion on this.
[19:30:27] Analytics, Operations, Reading-Web-Backlog, Traffic: mobile-safari has very few internally-referred pageviews - https://phabricator.wikimedia.org/T148780#2896906 (Nuria) @JKatzWMF Do ping @BBlack about the impact of the change in your metrics, on our end there are no code changes needed to proces...
[19:43:02] (PS4) Nuria: [WIP] POC of loading tile data into pivot [analytics/refinery] - https://gerrit.wikimedia.org/r/327845 (https://phabricator.wikimedia.org/T151832)
[19:45:44] (PS5) Nuria: [WIP] POC of loading tile data into pivot [analytics/refinery] - https://gerrit.wikimedia.org/r/327845 (https://phabricator.wikimedia.org/T151832)
[19:46:48] joal: let me know if this looks better to load data into druid: https://gerrit.wikimedia.org/r/#/c/327845/5/oozie/maps/druid/generate_druid_map_tiles_dataset.hql
[20:01:30] Analytics, Operations, Reading-Web-Backlog, Traffic: mobile-safari has very few internally-referred pageviews - https://phabricator.wikimedia.org/T148780#2896977 (JKatzWMF) Ok thanks!
[20:12:40] Hey nuria, I didn't check the fields, but the overall structure for daily generation is looking good :)
[20:13:00] nuria: Do you need me to stay and help?
[20:23:10] nuria: no news, I'm going to leave :)
[20:23:15] See you tomorrow a-team !