[07:46:15] happy 2019 people :)
[08:04:16] Happy new year y'all :)
[08:04:41] joal: bonjour!
[08:04:55] Bonjour elukey!
[08:05:00] How are you/
[08:05:02] ?
[08:05:11] really good, I loved these days of vacation
[08:05:16] and you?? all good?
[08:05:33] All good - Vacations were indeed very needed and enjoyed :)
[08:06:29] helloooo team
[08:06:52] Good morning fdans :) Happy happy !
[08:10:21] o/
[08:27:56] 10Analytics, 10ORES, 10Patch-For-Review, 10Scoring-platform-team (Current): Wire ORES scoring events into Hadoop - https://phabricator.wikimedia.org/T209732 (10JAllemandou) > @JAllemandou I didn't have time to chase down the responsible code, but wanted to let you know that the user redactions look good em...
[09:07:38] joal: do you think that we could test https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Administration#Decommissioning with analytics1028?
[09:08:15] I'd love to have those nodes out of the analytics cluster asap to deploy the testing cluster
[09:08:26] (03CR) 10Joal: [C: 04-1] "See comment inline - Easy change :)" (031 comment) [analytics/refinery] - 10https://gerrit.wikimedia.org/r/481223 (https://phabricator.wikimedia.org/T153821) (owner: 10BryanDavis)
[09:08:28] 10Analytics, 10Analytics-Wikistats: Wikistats New Feature - DB size - https://phabricator.wikimedia.org/T212763 (10TheSandDoctor)
[09:08:46] elukey: For sure :0
[09:09:52] elukey: Having bumped the cluster computation power by ~40% has made the heavy uniques jobs flow like a breeze :)
[09:10:18] So no overload now, therefore let's work on it :)
[09:10:52] super
[09:11:14] so IIRC our puppet code only ensures that the hosts exclude file is present
[09:11:22] elukey: let me know if there is something I can help with
[09:11:31] sure
[09:13:09] elukey: I have no experience in decommissioning nodes - I'll however follow your process with greedy learning attention
[09:14:03] elukey@an-master1001:~$ sudo -u hdfs /usr/bin/hdfs dfsadmin -refreshNodes
[09:14:06] Refresh nodes successful for an-master1001.eqiad.wmnet/10.64.5.26:8020
[09:14:09] Refresh nodes successful for an-master1002.eqiad.wmnet/10.64.21.110:8020
[09:14:45] and http://localhost:50070/dfsnodelist.jsp?whatNodes=DECOMMISSIONING (via tunnel) is goood
[09:15:15] elukey: hdfs UI tells me no decoms yet
[09:15:25] Oh sorry - not true
[09:15:37] decommissioning - not yet decommissioned
[09:15:40] Sounds good :)
[09:16:13] Man, this word - decommissioning - is prone to me throwing my keyboard out of the window before the end of the process
[09:16:21] ahahahah
[09:16:31] I am going to do the same with yarn and then leave it running
[09:16:38] Awesome :)
[09:16:46] !log decom analytics1028 from hdfs/yarn
[09:16:47] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[09:20:27] PROBLEM - Hadoop NodeManager on analytics1028 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager
[09:20:56] ah!
[09:21:01] interesting
[09:21:31] 2019-01-02 09:16:58,916 INFO org.apache.hadoop.yarn.server.nodemanager.NodeManager: Removing state store due to decommission
[09:21:37] nice!
[09:22:35] we should also get alarms for under replicated blocks etc.. probably
[09:28:55] I wonder if the namenode would be smart enough to replicate without alarming, since the process is intended
[09:33:33] I don't think so, we alarm on under replicated blocks and the namenode is rightfully showing that some activity is ongoing
[09:33:37] our fault :)
[09:34:19] ok :)
[09:46:32] joal: interesting! We alarm on missing / corrupt blocks
[09:46:41] not under replicated
[09:46:45] :)
[09:46:48] not sure if it was intended or not
[09:46:55] in theory we should also alarm for under replicated blocks
[09:47:25] mmmmm
[09:47:34] elukey: hm - I'm trying to think of a good reason to do so
[09:48:17] maybe for a very long time - worth checking if something stays like that for so long
[09:48:41] but in theory if a worker is kaput we know it from other alarms
[09:48:51] yup
[09:49:41] but they might not trigger for any number of reasons (downtime that we forgot/misconfigured, etc..)
[09:49:57] anyway, doesn't seem super urgent to add but let's keep thinking about it
[09:50:02] ack
[09:50:07] paranoia is always a good friend :P
[09:50:32] I'm not sure it is elukey - Shall I be paranoid about my paranoia?
[09:52:03] joal: metaparanoia might be too dangerous due to the risk of recursion :P
[09:52:25] As most metas :D
[10:11:46] elukey: one thought about block replication - Should we stop the balancer when we do some ops on HDFS?
[10:21:47] joal: I think that it is safe to leave it running, we don't have huge activity anymore and we are going to remove one/two nodes maximum at a time
[10:22:03] works for me elukey :)
[10:22:20] I don't expect explosions but if you are worried I'll follow your instinct and disable :)
[10:22:39] elukey: I was wondering about the cost of reorganizing blocks while also trying to have them back to correct rep-factor
[10:23:33] makes sense yes
[10:24:17] one thing is very interesting
[10:24:18] hdfs 70775 2.3 3.6 3798792 1177184 ? Sl 2018 935:06
[10:24:27] this is the balancer's process on an-coord
[10:24:32] it has been running for ages
[10:25:04] hm
[10:25:10] very interesting indeed !
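The decommission flow discussed above (add the host to the exclude file, refresh the namenodes, watch replication) could be sketched roughly as below. The exclude-file path is an assumption, not the checked puppet config, and the under-replicated count parser simply reads the `Under replicated blocks` line that `hdfs dfsadmin -report` prints.

```shell
#!/bin/bash
# Hedged sketch of decommissioning a Hadoop worker node.
# EXCLUDE_FILE is an assumed path; the real location comes from puppet.
EXCLUDE_FILE=/etc/hadoop/conf/hosts.exclude

# Parse the under-replicated block count out of `hdfs dfsadmin -report` output,
# useful for watching re-replication progress after a decom.
under_replicated_count() {
    awk -F': ' '/Under replicated blocks/ {print $2; exit}'
}

# Add a node to the exclude file (idempotently) and ask both HDFS and YARN
# to re-read their node lists, which starts the graceful decommission.
decom_node() {
    local node="$1"
    grep -qx "$node" "$EXCLUDE_FILE" || echo "$node" >> "$EXCLUDE_FILE"
    sudo -u hdfs hdfs dfsadmin -refreshNodes
    sudo -u yarn yarn rmadmin -refreshNodes
}
```

Usage would be something like `decom_node analytics1028.eqiad.wmnet`, then polling `hdfs dfsadmin -report | under_replicated_count` until it drops back to 0.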
[10:27:43] it doesn't seem healthy to me
[10:28:10] I'd kill it now and wait for a new run tomorrow
[10:28:39] ok elukey - Let's keep an eye on the new run tomorrow
[10:31:14] !log killed all hdfs-balancer processes (one running since ages ago in 2018)
[10:31:15] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[10:32:09] another thing to follow up - how long is the balancer supposed to run?
[10:32:28] maybe we should get an alert if say it has been running for a day
[10:32:52] could be an option for the systemd timer stuff
[10:33:00] add an icinga check for process time
[10:33:16] (tunable of course)
[10:39:07] elukey: on a regular period, I assume less than 1 day should be good - On a hardware-moving period, could be very different: when we add or remove nodes, blocks move I think
[10:39:42] yep I agree, but those are "special" operations
[10:43:16] yes
[11:36:15] * elukey lunch + errand!
[13:19:44] 10Analytics: Add is_pageview as a dimension to the 'webrequest_sampled_128' Druid dataset - https://phabricator.wikimedia.org/T212778 (10Tbayer)
[13:21:48] 10Analytics: Add is_pageview as a dimension to the 'webrequest_sampled_128' Druid dataset - https://phabricator.wikimedia.org/T212778 (10Tbayer) This is not super high priority, but per [[https://bots.wmflabs.org/~wm-bot/logs/%23wikimedia-analytics/20181114.txt |a brief discussion]] with @nuria some weeks ago it...
[14:37:06] Hi all, happy new year!
[14:39:49] (03CR) 10Ottomata: [WIP] Import ORES scores into an archival table (032 comments) [analytics/refinery] - 10https://gerrit.wikimedia.org/r/481025 (https://phabricator.wikimedia.org/T209732) (owner: 10Awight)
[14:40:27] Hi dsaez - Happy new year to you as well :)
[14:47:14] 10Analytics, 10Research: Generate article recommendations in Hadoop for use in production - https://phabricator.wikimedia.org/T210844 (10bmansurov) a:03bmansurov
[14:48:05] helloooo! :]
[14:48:32] (03CR) 10Ottomata: Allow for custom transforms in DataFrameToDruid (032 comments) [analytics/refinery/source] - 10https://gerrit.wikimedia.org/r/477295 (https://phabricator.wikimedia.org/T210099) (owner: 10Mforns)
[14:51:40] 10Analytics, 10Readers-Web-Backlog (Tracking): [Bug] Many JSON decode ReadingDepth schema errors from wikiyy - https://phabricator.wikimedia.org/T212330 (10Ottomata) Whitelist what URI?
[14:54:26] ottomata, mforns o/
[14:54:41] heya luca :]
[14:55:37] o/////
[15:00:15] ehyyy
[15:13:42] (03CR) 10Mforns: Allow for custom transforms in DataFrameToDruid (031 comment) [analytics/refinery/source] - 10https://gerrit.wikimedia.org/r/477295 (https://phabricator.wikimedia.org/T210099) (owner: 10Mforns)
[15:17:18] 10Analytics, 10Serbian-Sites: Serbian Wikipedia edits spike 2016 - https://phabricator.wikimedia.org/T158310 (10Liuxinyu970226)
[15:18:30] (03CR) 10Ottomata: Allow for custom transforms in DataFrameToDruid (031 comment) [analytics/refinery/source] - 10https://gerrit.wikimedia.org/r/477295 (https://phabricator.wikimedia.org/T210099) (owner: 10Mforns)
[15:30:08] 10Analytics, 10Readers-Web-Backlog (Tracking): [Bug] Many JSON decode ReadingDepth schema errors from wikiyy - https://phabricator.wikimedia.org/T212330 (10Jdlrobson) http://ru.m.wikiyy.com
[15:40:15] 10Analytics, 10Readers-Web-Backlog (Tracking): [Bug] Many JSON decode ReadingDepth schema errors from wikiyy - https://phabricator.wikimedia.org/T212330 (10Ottomata) Hm, do you mean blacklist? We don't want to collect this data at all, right?
[15:42:40] 10Analytics, 10Contributors-Analysis, 10Product-Analytics, 10Epic: Support all Product Analytics data needs in the Data Lake - https://phabricator.wikimedia.org/T212172 (10nettrom_WMF) >>! In T212172#4842701, @Neil_P._Quinn_WMF wrote: >>>! In T212172#4840377, @Milimetric wrote: >> First, I agree with @Nuri...
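The balancer-runtime alert elukey and joal discuss earlier (warn if the balancer has been up for more than about a day) could look roughly like this. The one-day threshold and the process match pattern are assumptions for illustration; `ps -o etimes=` gives a process's elapsed time in seconds.

```shell
#!/bin/bash
# Hedged sketch of an icinga-style check: warn if the HDFS balancer process
# has been running longer than a (tunable) threshold, here one day.
MAX_SECS=$((24 * 3600))

# Return 0 (OK) if the elapsed seconds are within the threshold, 1 otherwise.
balancer_runtime_ok() {
    local elapsed="$1"
    [ "$elapsed" -le "$MAX_SECS" ]
}

check_balancer() {
    local pid elapsed
    # The Balancer main class name is stable in Hadoop; the pgrep pattern is
    # still an assumption about how the process appears on this host.
    pid=$(pgrep -f 'org.apache.hadoop.hdfs.server.balancer.Balancer' | head -n1)
    if [ -z "$pid" ]; then
        echo "OK: no balancer running"
        return 0
    fi
    elapsed=$(ps -o etimes= -p "$pid" | tr -d ' ')
    if balancer_runtime_ok "$elapsed"; then
        echo "OK: balancer up ${elapsed}s"
    else
        echo "WARN: balancer up ${elapsed}s (threshold ${MAX_SECS}s)"
        return 1
    fi
}
```

Wired into the systemd timer setup, the threshold could be passed as a flag instead of being hardcoded.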
[16:05:45] (03PS3) 10Ottomata: HiveExtensions normalize should convert all bad chars to underscores [analytics/refinery/source] - 10https://gerrit.wikimedia.org/r/477614
[16:06:18] hi joal when you have a sec could you look over ^ real quick?
[16:06:19] simple change
[16:26:14] elukey: ops sync y/n ? i got nothing new :)
[16:26:34] 10Analytics, 10Contributors-Analysis, 10Product-Analytics, 10Epic: Support all Product Analytics data needs in the Data Lake - https://phabricator.wikimedia.org/T212172 (10Milimetric) >> The ultimate purpose of collecting this data is to personalize new users' experiences based on their background and inte...
[16:26:47] ottomata: I am in the cave but we can skip, the only thing worth mentioning is that I started the decom of the first hadoop node today with joal
[16:26:52] nothing more :)
[16:27:29] +1 sounds great !
[16:27:30] ok!
[16:30:54] (03PS2) 10BryanDavis: Add wikitech to whitelist [analytics/refinery] - 10https://gerrit.wikimedia.org/r/481223 (https://phabricator.wikimedia.org/T153821)
[16:39:35] 10Analytics, 10Analytics-Kanban, 10Patch-For-Review: Eventlogging's processors stopped working - https://phabricator.wikimedia.org/T200630 (10Milimetric) I took a look at the pull request, https://github.com/ua-parser/uap-core/pull/368, it looks like it fixes the bots regexes by replacing unlimited * or + re...
[16:43:25] (03CR) 10Joal: [C: 03+1] "Looks good - Could be worth a simple unit-test, just in case, but can be merged as is :)" [analytics/refinery/source] - 10https://gerrit.wikimedia.org/r/477614 (owner: 10Ottomata)
[16:48:30] 10Analytics, 10Readers-Web-Backlog (Tracking): [Bug] Many JSON decode ReadingDepth schema errors from wikiyy - https://phabricator.wikimedia.org/T212330 (10phuedx) >>! In T212330#4849302, @Ottomata wrote: > We don't want to collect this data at all, right? In the case of the ReadingDepth instrumentation, no....
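The change under review above ("normalize should convert all bad chars to underscores") is Scala code in refinery-source; as an illustration only, the same mapping can be sketched in shell: any character that is not a letter, digit, or underscore becomes an underscore, so the result is a valid Hive field name.

```shell
#!/bin/bash
# Illustrative shell equivalent of the field-name normalization being merged
# above (the real implementation lives in HiveExtensions, in Scala).
normalize_field() {
    # Replace every character outside [A-Za-z0-9_] with an underscore.
    printf '%s' "$1" | sed -e 's/[^A-Za-z0-9_]/_/g'
}
```

For example, `normalize_field 'user-agent.map'` would yield `user_agent_map`.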
[16:48:57] (03CR) 10Ottomata: "I modified the normalize test so it would test for other bad chars too." [analytics/refinery/source] - 10https://gerrit.wikimedia.org/r/477614 (owner: 10Ottomata)
[16:50:20] 10Analytics, 10Readers-Web-Backlog (Tracking): [Bug] Many JSON decode ReadingDepth schema errors from wikiyy - https://phabricator.wikimedia.org/T212330 (10Ottomata) It is probably a good idea in general to have a whitelist of domains that we control from which we accept events. I'll add this to the Modern Ev...
[16:50:28] (03CR) 10Joal: [C: 03+2] "Wow sorry didn't get that - Merging" [analytics/refinery/source] - 10https://gerrit.wikimedia.org/r/477614 (owner: 10Ottomata)
[16:50:42] 10Analytics, 10Readers-Web-Backlog (Tracking): [Bug] Many JSON decode ReadingDepth schema errors from wikiyy - https://phabricator.wikimedia.org/T212330 (10Ottomata) Actually on quick second thought...I think that's not possible? How would apps send events?
[16:55:42] (03Merged) 10jenkins-bot: HiveExtensions normalize should convert all bad chars to underscores [analytics/refinery/source] - 10https://gerrit.wikimedia.org/r/477614 (owner: 10Ottomata)
[16:57:31] (03CR) 10Joal: [V: 03+2 C: 03+2] "LGTM - Merging." [analytics/refinery] - 10https://gerrit.wikimedia.org/r/481223 (https://phabricator.wikimedia.org/T153821) (owner: 10BryanDavis)
[17:01:24] milimetric: yoohoo
[17:04:37] 10Analytics, 10Analytics-EventLogging, 10EventBus, 10Core Platform Team Backlog (Watching / External), and 2 others: RFC: Modern Event Platform: Stream Intake Service - https://phabricator.wikimedia.org/T201963 (10Ottomata)
[17:04:42] 10Analytics, 10Analytics-EventLogging, 10EventBus, 10Core Platform Team Backlog (Watching / External), and 2 others: RFC: Modern Event Platform: Schema Registry - https://phabricator.wikimedia.org/T201643 (10Ottomata)
[17:41:03] joal: if you are ok with it, when analytics1028's decom is done I'd start with an1029/30
[17:41:10] to leave it running for the night
[17:56:40] milimetric: i solved my problem probably in a safer way, but the cause is very strange and unknown
[18:02:47] ottomata: ok from your side to decom two more worker nodes?
[18:02:55] (replication finished after 1028's decom)
[18:03:07] elukey: +1
[18:03:12] super
[18:03:40] !log decom analytics10(29|30) from HDFS/Yarn
[18:03:41] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[18:47:29] * elukey off!
[18:47:40] will keep checking the decom status but looks good :)
[18:49:33] 10Analytics, 10Readers-Web-Backlog (Tracking): [Bug] Many JSON decode ReadingDepth schema errors from wikiyy - https://phabricator.wikimedia.org/T212330 (10Jdlrobson) >>! In T212330#4849302, @Ottomata wrote: > Hm, do you mean blacklist? We don't want to collect this data at all, right? yep sorry for confusio...
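The "whitelist of domains that we control" idea from the wikiyy discussion above could be sketched as a simple suffix match against an allowlist. The domain list here is a made-up illustration, not the real EventLogging configuration, and (as ottomata notes) it would not cover events sent by apps.

```shell
#!/bin/bash
# Hedged sketch: accept events only from hostnames under domains we control.
# ALLOWED_DOMAINS is an illustrative list, not real config.
ALLOWED_DOMAINS='wikipedia.org wikimedia.org mediawiki.org wikidata.org'

# Return 0 if the host equals an allowed domain or is a subdomain of one.
domain_allowed() {
    local host="$1" d
    for d in $ALLOWED_DOMAINS; do
        case "$host" in
            "$d"|*."$d") return 0 ;;
        esac
    done
    return 1
}
```

Under this sketch `ru.m.wikipedia.org` would pass while a mirror like `ru.m.wikiyy.com` would be rejected.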
[19:04:59] ok ottomata, I got a few minutes if you still want to talk about the solution or strangeness
[19:07:06] ya let's milimetric
[19:07:07] bc
[19:07:11] omw
[19:44:08] 10Analytics, 10Analytics-Cluster, 10DBA, 10Operations: Cleanup or remove mysql puppet module; repurpose mariadb module to cover misc use cases - https://phabricator.wikimedia.org/T162070 (10Dzahn)
[19:54:49] (03PS14) 10Mforns: Allow for custom transforms in DataFrameToDruid [analytics/refinery/source] - 10https://gerrit.wikimedia.org/r/477295 (https://phabricator.wikimedia.org/T210099)
[20:03:09] milimetric: when you come back I have some ideas but am still very confused
[20:10:35] ottomata: that's my whole life man
[20:10:45] haaha
[20:13:50] 10Analytics, 10Readers-Web-Backlog (Tracking): [Bug] Many JSON decode ReadingDepth schema errors from wikiyy - https://phabricator.wikimedia.org/T212330 (10Tbayer) See also {T197971} for a similar issue (as well as the somewhat explanations at T188804). Agree that it would be great to have a general solution...
[20:45:01] (03CR) 10Ottomata: "Nice, I like this ListMap here. Please add a ton of comments and function docs explaining how this works. I think I mostly get it, but i" (031 comment) [analytics/refinery/source] - 10https://gerrit.wikimedia.org/r/477295 (https://phabricator.wikimedia.org/T210099) (owner: 10Mforns)
[21:02:07] (03CR) 10Mforns: "@ottomata" [analytics/refinery/source] - 10https://gerrit.wikimedia.org/r/477295 (https://phabricator.wikimedia.org/T210099) (owner: 10Mforns)
[21:08:23] joal: hiiiii
[21:08:33] if you are still around i'd love some help with intellij and spark again...i am failing...
[21:11:37] Hi ottomata
[21:11:40] What's up?
[21:11:52] joal hang on i might have something...maybe an antlr generated sources thing...
[21:12:01] wow
[21:12:17] ottomata: In that case, I'm afraid I'll be of no use :)
[21:25:50] ottomata: if not now, will be tomorrow :)
[21:26:18] yay joal
[21:26:20] it's ok i got it i think
[21:26:21] it was that
[21:26:27] luckily stackoverflow to the rescue
[21:26:29] i thought it was just me
[21:26:34] maaaan - Where do we use ant?
[21:26:37] spark
[21:26:38] does
[21:26:42] not ant
[21:26:43] antlr
[21:26:45] Ah - compiling spark
[21:26:49] :S
[21:26:49] which is a parse tree source generator?
[21:26:50] i guess?
[21:26:55] i had to manually run antlr for a submodule
[21:27:01] and then add the generated folder as a source directory
[21:27:09] k
[21:27:41] indeed ottomata: https://en.wikipedia.org/wiki/ANTLR
[21:28:57] ottomata: Good luck, see you tomorrow
[21:42:03] thanks joal byyye!
[22:28:29] ottomata: I'm back, phew traffic
[22:28:52] I'll be working the rest of the night till my 1am meeting
[22:33:27] 1am meeting! crazy!
[22:33:37] ok cool, i have to run soon...but want to BC real quick?
[22:34:46] milimetric: ^
[22:35:00] yes, omw
[23:41:37] PROBLEM - Check if the Hadoop HDFS Fuse mountpoint is readable on notebook1004 is CRITICAL: CRITICAL
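The IntelliJ/ANTLR workaround ottomata describes above (manually generate the parser sources, then mark the output folder as a source directory in the IDE) could be sketched as below. The module name and paths are assumptions based on Spark's layout, where the catalyst module generates its SQL parser from an ANTLR grammar during `generate-sources`.

```shell
#!/bin/bash
# Hedged sketch of regenerating Spark's ANTLR parser sources so IntelliJ can
# resolve them. Run from the root of a Spark checkout; paths are assumptions.
regen_antlr_sources() {
    if [ ! -x build/mvn ]; then
        echo "run from the root of a Spark checkout" >&2
        return 1
    fi
    # Produces sql/catalyst/target/generated-sources (assumed location), which
    # IntelliJ does not always pick up automatically: mark it manually as a
    # generated-sources root in the module settings afterwards.
    build/mvn -pl sql/catalyst generate-sources
}
```

After running it, the remaining step is purely in the IDE: right-click the generated folder and mark it as "Generated Sources Root".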