[02:05:41] Analytics, TCB-Team, Two-Column-Edit-Conflict-Merge, WMDE-Analytics-Engineering, and 5 others: How often are new editors involved in edit conflicts - https://phabricator.wikimedia.org/T182008#3809775 (Milimetric) @GoranSMilovanovic you can query either wmf_raw.mediawiki_user or wmf.mediawiki_user...
[02:53:24] (PS7) Milimetric: [WIP] Saving in case laptop catches on fire [analytics/refinery] - https://gerrit.wikimedia.org/r/408848 (https://phabricator.wikimedia.org/T184759)
[03:09:58] (PS8) Milimetric: [WIP] Saving in case laptop catches on fire [analytics/refinery] - https://gerrit.wikimedia.org/r/408848 (https://phabricator.wikimedia.org/T184759)
[06:58:23] bearloga: cleaned!
[06:58:57] elukey: thank you very much!!!
[07:21:40] people, we are rebooting db1108 (analytics-slave.eqiad.wmnet) to update mariadb and the linux kernel
[07:21:56] !log reboot db1108 (analytics-slave.eqiad.wmnet) for mariadb+kernel updates
[07:22:02] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[07:33:49] we'll do db1107 (el master) in a couple of hours
[08:29:37] Analytics-Kanban, User-Elukey: Reduction of stat1005's disk space usage - https://phabricator.wikimedia.org/T186776#3988334 (elukey) We are trying to back up everything to HDFS before deleting the /srv/stat1002-a dir, but it would be great to delete data if possible to avoid wasting space: ``` elukey@st...
[08:45:53] Analytics-EventLogging, Analytics-Kanban, Beta-Cluster-Infrastructure, User-Elukey: EventLogging broken in beta - https://phabricator.wikimedia.org/T185952#3988342 (Krinkle)
[08:56:53] Analytics-Kanban, Operations, Patch-For-Review, User-Elukey, User-Joe: rack/setup/install conf1004-conf1006 - https://phabricator.wikimedia.org/T166081#3988353 (elukey) Before proceeding, https://phabricator.wikimedia.org/T187022 needs to be closed.
[09:06:12] Analytics, Analytics-Wikistats: Data mismatch for number of Wikidata editors - https://phabricator.wikimedia.org/T187806#3988363 (ChrisPins) If I am right, this points to a major difference between WikiStats and WikiStats v2: - All sums in the different filters/splits of WikiStats v2 seem to includ...
[09:40:03] Analytics, TCB-Team, Two-Column-Edit-Conflict-Merge, WMDE-Analytics-Engineering, and 5 others: How often are new editors involved in edit conflicts - https://phabricator.wikimedia.org/T182008#3988380 (GoranSMilovanovic) @Milimetric Thank you, Dan. The JsonRefine is what is [[ https://lists.wikime...
[09:50:20] so I am currently trying to copy ellery's stat1002-a user dir to /wmf/data/archive/backup/misc/stat1002-a/user_dirs_from_stat1002/ellery to free space on stat1005
[09:50:33] that is a bit short on space
[09:50:46] using hdfs dfs -copyFromLocal, but it takes ages
[09:55:32] files are big (except for some small .git etc. related ones) so the hdfs block overhead shouldn't be that heavy
[10:18:11] Analytics-Kanban, User-Elukey: Reduction of stat1005's disk space usage - https://phabricator.wikimedia.org/T186776#3988461 (elukey) To free some space, ellery's stat1002 user dir is now available in /mnt/hdfs/wmf/data/archive/backup/misc/stat1002-a/user_dirs_from_stat1002/ellery
[10:50:35] all right, stat1005's srv dir is ~91% used now, much better
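(A rough Scala sketch of the copy elukey describes above, using the Hadoop FileSystem API instead of the hdfs CLI. The local source path is hypothetical and this is an illustration, not the exact command that was run.)

```scala
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

// Sketch of `hdfs dfs -copyFromLocal <src> <dst>` via the Hadoop FileSystem API.
// Assumes the standard Hadoop client config (core-site.xml / hdfs-site.xml) is on the classpath.
object BackupToHdfs {
  def main(args: Array[String]): Unit = {
    val fs = FileSystem.get(new Configuration()) // resolves fs.defaultFS, i.e. the HDFS namenode

    // Hypothetical local path for ellery's stat1002-a user dir; the HDFS destination is the
    // one mentioned in the channel.
    val src = new Path("file:///srv/stat1002-a/home/ellery")
    val dst = new Path("/wmf/data/archive/backup/misc/stat1002-a/user_dirs_from_stat1002/ellery")

    // delSrc = false keeps the local copy, overwrite = false refuses to clobber existing files.
    fs.copyFromLocalFile(false, false, src, dst)
  }
}
```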
[11:22:04] Hi A Team!
[11:22:20] I have an EventLogging question...
[11:23:03] .. Some EventLogging that we introduced for edit conflicts and our new conflict resolution tool tried to record the text submitted by the user at the point of edit conflict; this text blob / string can be quite big, and when it's too big it doesn't seem to make it into the system
[11:23:25] it is easy to see why it doesn't land in the mysql tables as they are VARCHAR(1024) or something similar
[11:23:34] but they also don't seem to land in hadoop :(
[11:23:39] Any thoughts? :D
[11:24:02] addshore: o/
[11:24:10] what is the schema name?
[11:24:52] *looks*
[11:24:56] https://meta.wikimedia.org/wiki/Schema:TwoColConflictConflict
[11:25:04] also, what do you mean that it doesn't land in hadoop? You don't see it in Hive?
[11:25:14] PROBLEM - HDFS active Namenode JVM Heap usage on analytics1001 is CRITICAL: CRITICAL - hadoop-hdfs-namenode-heap-usaage is 5914120765 https://grafana.wikimedia.org/dashboard/db/analytics-hadoop?panelId=4&fullscreen&orgId=1
[11:25:30] yeah nothing to worry about --^
[11:25:32] well, the schema is there, but the records where I assume the text is long are not in hadoop
[11:25:33] alarms too sensitive
[11:26:25] addshore: yep, but how do you verify it? I am asking to tracking down where is the issue.. we pull raw data from kafka via Camus and then a tool called JsonRefine puts it into Hive
[11:26:36] *to track down
[11:26:41] So, https://phabricator.wikimedia.org/T182008#3978750 is the comment that I first noticed this in
[11:27:01] I was wondering if the message is too big for kafka?
[11:27:25] So the only way to verify that data is missing is to check that we have a record in https://meta.wikimedia.org/wiki/Schema:TwoColConflictConflict for every record in https://meta.wikimedia.org/wiki/Schema:EditConflict
[11:27:29] but we do not :)
[11:28:03] hadoop seems to contain roughly the same number of rows as mysql does
[11:28:30] I was kind of thinking maybe the mysql insert fails first and then it never attempts to do the hadoop insert? but as I don't know the system I have no idea :D
[11:28:45] I can file a ticket if needed :) Dashing to a meeting now
[11:30:02] addshore: it is possible that eventlogging is not validating the event correctly and discards it
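(addshore's proposed check, a TwoColConflictConflict event for every EditConflict event, can be approximated with two counts in Hive. A hedged Spark sketch follows, assuming the refined EventLogging data is exposed as event.editconflict and event.twocolconflictconflict tables partitioned by year/month/day; the table and partition names are assumptions, not something verified in the channel.)

```scala
import org.apache.spark.sql.SparkSession

// Rough sketch: compare event volumes for the two schemas over the same window.
// Table names and year/month partitions are assumed, not confirmed above.
val spark = SparkSession.builder()
  .appName("edit-conflict-count-check")
  .enableHiveSupport()
  .getOrCreate()

val editConflicts = spark.sql(
  "SELECT COUNT(*) FROM event.editconflict WHERE year = 2018 AND month = 2"
).first().getLong(0)

val twoColConflicts = spark.sql(
  "SELECT COUNT(*) FROM event.twocolconflictconflict WHERE year = 2018 AND month = 2"
).first().getLong(0)

// If events are being dropped before refinement, twoColConflicts will fall short of editConflicts.
println(s"EditConflict: $editConflicts, TwoColConflictConflict: $twoColConflicts")
```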
[11:30:14] PROBLEM - HDFS active Namenode JVM Heap usage on analytics1001 is CRITICAL: CRITICAL - hadoop-hdfs-namenode-heap-usaage is 5937156214 https://grafana.wikimedia.org/dashboard/db/analytics-hadoop?panelId=4&fullscreen&orgId=1
[11:31:05] another one? weird
[11:32:45] addshore: yeah I can see some logs in the eventlogging processors rejecting some events for that schema
[11:33:07] (Invalid \uXXXX escape: line 1 column 1615 (char 1614))
[11:33:34] not a huge amount though
[11:35:38] now, alarms
[11:38:14] PROBLEM - HDFS active Namenode JVM Heap usage on analytics1001 is CRITICAL: CRITICAL - hadoop-hdfs-namenode-heap-usaage is 5873131070 https://grafana.wikimedia.org/dashboard/db/analytics-hadoop?panelId=4&fullscreen&orgId=1
[11:41:14] PROBLEM - HDFS active Namenode JVM Heap usage on analytics1001 is CRITICAL: CRITICAL - hadoop-hdfs-namenode-heap-usaage is 5873832638 https://grafana.wikimedia.org/dashboard/db/analytics-hadoop?panelId=4&fullscreen&orgId=1
[11:41:34] ufff
[11:42:53] merging a fix
[11:43:15] PROBLEM - HDFS active Namenode JVM Heap usage on analytics1001 is CRITICAL: CRITICAL - hadoop-hdfs-namenode-heap-usaage is 5873832638 https://grafana.wikimedia.org/dashboard/db/analytics-hadoop?panelId=4&fullscreen&orgId=1
[11:44:03] weird that this alarm keeps sending alerts
[11:44:06] it should send only one
[11:45:21] all right, icinga should behave now
[11:50:05] mmm maybe not
[12:07:23] PROBLEM - HDFS active Namenode JVM Heap usage on analytics1001 is CRITICAL: CRITICAL - hadoop-hdfs-namenode-heap-usage is 6030265576 https://grafana.wikimedia.org/dashboard/db/analytics-hadoop?panelId=4&fullscreen&orgId=1
[12:10:04] is icinga still out of sync?
[12:10:40] lemme disable notifications
[12:10:41] mah
[12:30:25] helloooo
[12:30:58] joal, yt? I have a problem with scala DataFrames... :[
[12:58:00] mforns[m]: wednesday is kids day for me, will be on later tonight
[12:58:21] elukey: wanted to double check with you that the alerts were fake - looks ok :)
[13:01:59] sadly yes, sorry for the spam
[13:07:32] elukey: (Invalid \uXXXX escape: line 1 column 1615 (char 1614)) interesting
[13:11:53] so in fact it is not an issue with length but instead an issue with this character?
[13:12:56] addshore: this is what I've found, but it might be a red herring, how many events do you think you are losing?
[13:13:12] hundreds or thousands probably
[13:13:32] I can probably go and dig up a pretty exact number later
[13:13:51] are these logs in logstash?
[13:14:29] not sure, never checked there!
[13:14:49] I had a hunt around in logstash before but couldn't find anything, where do you find these magical logs? :D
[13:15:00] on the eventlog1001 host :)
[13:15:16] addshore: do you also have a timeframe for those events?
[13:16:24] so, they will have been happening since roughly halfway through January up until now
[13:21:28] ah ok ok!
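(The error elukey quotes is a plain JSON parsing failure: a \u in the event payload that is not followed by four hex digits makes the whole event unparseable, so the processor rejects it before it reaches MySQL or Hive. The EventLogging processors are Python; the Scala sketch below, using Jackson as a stand-in parser, just illustrates the shape of the failure.)

```scala
import com.fasterxml.jackson.databind.ObjectMapper
import scala.util.Try

object InvalidEscapeDemo {
  def main(args: Array[String]): Unit = {
    val mapper = new ObjectMapper()

    // \u followed by four hex digits is a legal JSON escape...
    val ok = "{\"comment\": \"fine \\u00e9\"}"
    // ...but \u followed by anything else makes the whole document unparseable.
    val bad = "{\"comment\": \"broken \\uZZZZ\"}"

    println(Try(mapper.readTree(ok)).isSuccess)  // true
    println(Try(mapper.readTree(bad)).isSuccess) // false: this is the kind of event that gets rejected
  }
}
```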
[13:49:13] Analytics, TCB-Team, Two-Column-Edit-Conflict-Merge, WMDE-Analytics-Engineering, and 5 others: How often are new editors involved in edit conflicts - https://phabricator.wikimedia.org/T182008#3989072 (Milimetric) yes, @GoranSMilovanovic, that's it. But my main point was about the mediawiki table...
[14:18:42] elukey: I wrote a post-it note for me to write a ticket
[14:21:41] o/
[14:22:46] yar elukey kafkatee
[14:22:56] what should we do? i can take some time right now to try to make it multi instance in puppet
[14:23:04] ottomata: morning :)
[14:23:05] we could run a manual kafkatee on oxygen that consumes from jumbo for a while
[14:23:08] hiiii
[14:23:18] until all topics are switched over
[14:23:19] I didn't have a lot of time to check since hhvm didn't behave well today
[14:23:23] oo nasty
[14:23:28] yeah :(
[14:23:45] well, tldr: kafkatee puppet needs a refactor if we are going to make it work with 2 kafka clusters at the same time
[14:23:51] which, i started working on
[14:24:00] but will probably keep us from doing webrequest misc today :(
[14:26:53] * elukey nods
[14:27:03] okok.. I saw the code review, it seems fine!
[14:27:30] ottomata: I mean https://gerrit.wikimedia.org/r/#/c/413056/
[14:28:21] yeah, that just gets rid of the submodule
[14:28:34] can't see any reason to keep it as a submodule, no one is ever going to use our kafkatee puppet module
[14:28:41] and I wanted to use base::service_unit
[14:28:58] I would argue that all our modules should become part of operations/puppet :D
[14:29:07] so let's start with kafkatee
[14:29:15] \o/
[14:29:19] shhhhh
[14:29:25] not cdh
[14:29:25] I'd be in favor of the multi instance solution rather than the manual one
[14:29:27] people use that one!
[14:29:30] or at least they used to
[14:29:40] well even all but cdh would be a win :D
[14:29:41] ok elukey on it...
[14:59:16] Analytics-Cluster, Analytics-Kanban: Refactor kafkatee module to support multi instance - https://phabricator.wikimedia.org/T187890#3989250 (Ottomata) p:Triage>Normal
[15:16:19] ottomata, hiiiii
[15:16:43] do you have 5 mins to look at an x-files scala/spark problem?
[15:20:29] ooohhhhh... I think I got that..
[15:23:14] Analytics-Tech-community-metrics, Developer-Relations: For new authors on C_Gerrit_Demo, provide a way to access the list of Gerrit patches of each new author - https://phabricator.wikimedia.org/T187895#3989396 (Aklapper) p:Triage>Low
[15:24:39] Analytics-Tech-community-metrics, Developer-Relations: For new authors on C_Gerrit_Demo, provide a way to access the list of Gerrit patches of each new author - https://phabricator.wikimedia.org/T187895#3989396 (Aklapper)
[15:26:26] mforns: hiiyaaa
[15:26:30] heyyyy
[15:26:40] ottomata, I found the thing
[15:27:02] I was getting a very weird error when creating a dataframe with sqlContext.createDataFrame
[15:27:09] I learned that
[15:27:42] you cannot reference a dataFrame inside the distributed code that processes its elements
[15:27:47] i.e:
[15:28:12] sqlContext.createDataFrame(
[15:28:13] dataFrame.rdd.map { row =>
[15:28:13] sanitizeRow(dataFrame.schema, whitelist, row)
[15:28:13] },
[15:28:13] dataFrame.schema
[15:28:13] )
[15:28:19] this fails
[15:28:25] oh huhh interesting
[15:28:32] mforns: i guess you can put the schema in a val?
[15:28:37] yeeees
[15:28:51] that was the solution... that took me like 2 hours to figure out
[15:28:53] aye
[15:28:59] yeah good oollll spark
[15:29:07] only joseph knows its secrets
[15:29:43] makes sense, but the error message was just a nullPointerException
[15:30:12] thanks anyway
[16:05:01] someone wanna brainbounce about this sqoop import
[16:05:04] I'm spinning in circles
[16:06:12] joal / mforns ^
[16:06:33] milimetric, yep! bc?
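(For context on the fix ottomata suggests above, "put the schema in a val": referencing dataFrame inside the map closure pulls the DataFrame and its driver-side internals into code that runs on the executors, which tends to surface as a bare NullPointerException rather than a clear serialization error. Below is a sketch of the working shape, reusing the hypothetical sanitizeRow and whitelist names from mforns' snippet; it is not the actual refinery code.)

```scala
import org.apache.spark.sql.{DataFrame, Row, SQLContext}
import org.apache.spark.sql.types.StructType

// Stub standing in for the real sanitization logic; it only needs to be serializable.
def sanitizeRow(schema: StructType, whitelist: Map[String, String], row: Row): Row = row

def sanitizeDataFrame(sqlContext: SQLContext, dataFrame: DataFrame,
                      whitelist: Map[String, String]): DataFrame = {
  // Capture the schema in a local val on the driver, so the closure below only serializes
  // a StructType and the whitelist, not the DataFrame itself.
  val schema = dataFrame.schema
  sqlContext.createDataFrame(
    dataFrame.rdd.map { row => sanitizeRow(schema, whitelist, row) },
    schema
  )
}
```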
[16:06:37] omw
[16:06:51] ok, be there in 1
[16:17:16] Analytics-Data-Quality, WMDE-Analytics-Engineering, User-GoranSMilovanovic: Please review: new WMDE public data set on stat1005 - https://phabricator.wikimedia.org/T187606#3989618 (Lea_WMDE) @GoranSMilovanovic could you provide me with a direct link to that file? Thanks!
[16:40:40] (PS9) Milimetric: Refactor sqoop jobs and add from_timestamp [analytics/refinery] - https://gerrit.wikimedia.org/r/408848 (https://phabricator.wikimedia.org/T184759)
[17:02:11] urandom: another clash with our standup, just realized, sorry :(
[17:03:16] elukey: yeah; i noticed today my note from last time that we needed to find a new time
[17:03:59] urandom: totally my fault, didn't follow up :(
[17:05:47] (PS2) Mforns: [WIP] Add EL and whitelist sanitization [analytics/refinery/source] - https://gerrit.wikimedia.org/r/412939 (https://phabricator.wikimedia.org/T181064)
[17:22:52] (CR) Milimetric: "poke @joal, this is ready for review then" [analytics/refinery] - https://gerrit.wikimedia.org/r/408848 (https://phabricator.wikimedia.org/T184759) (owner: Milimetric)
[17:45:09] ottomata: https://gist.github.com/jobar/83845555a9ad0c2dddb531fbbea15c84
[17:57:46] * elukey off!
[17:57:53] elukey: !
[17:57:56] kafkatee patch ok with you?
[17:58:18] argh sorry forgot to finish it
[18:01:23] ottomata: jumbo-analytics? is it right?
[18:02:56] oh
[18:02:58] that was for testing
[18:03:02] no not right good catch
[18:03:04] will change it back
[18:03:11] next patch after will set up a different instance
[18:04:17] ottomata: the only thing that I found a bit odd is the rsyslog/logrotate config in the module
[18:04:35] but it might be ok, the rest looks super fine
[18:04:52] so if you have tested it and it works, please go ahead :)
[18:07:47] gtg now for a couple of hours, will check later on! sorry :(
[18:08:10] ok thanks!
[18:17:35] Analytics-Kanban, User-Elukey: Reduction of stat1005's disk space usage - https://phabricator.wikimedia.org/T186776#3990211 (Halfak) I've dropped my usage to 44GB
[18:18:35] Analytics-Kanban, User-Elukey: Reduction of stat1005's disk space usage - https://phabricator.wikimedia.org/T186776#3990218 (Halfak) Woops. Caught something else I don't need. Down to 12GB
[18:54:36] Heya milimetric - We got pinged on T187374
[18:54:36] T187374: Measure how many unique people visit the Timeline - https://phabricator.wikimedia.org/T187374
[18:54:50] milimetric: Do you have the right to configure Piwik?
[18:55:16] yes joal I'll do that
[18:55:22] Many thanks :)
[18:55:32] milimetric: --^ :)
[18:55:42] milimetric: another question for you
[18:55:49] shoot
[18:56:13] milimetric: we have quarterly goals deps on us - I had no clue about where we stand on that - Do you have anything for me?
[18:56:28] The two things are:
[18:56:29] Improve, adjust, or create features geared at the needs identified in New Editors research project.
New Editors Experience:Analytics
[18:56:57] Side-talk to ottomata in the meantime
[18:57:09] ottomata: I think we're good to go for spark 2.2.1
[18:57:30] greaaat
[18:57:46] ottomata: every wmf table has been "inferred and saved", and if people have issues I know what it is, so I'm happy for us to move
[18:57:55] joal: the team that was doing the New Editors project put it on hold, I thought, but we met with them and the ball was in their court to determine what they needed help with
[18:57:59] awesome, thanks joal
[18:58:28] ottomata: I'll actually put a note on a wiki page about this thing :)
[18:59:20] milimetric: Ok - Thanks for the update :)
[19:09:40] ottomata: https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Spark
[19:10:01] ottomata: I let you merge your jsonrefine branch?
[19:14:24] joal k ya will do
[19:14:26] want me to do it now?
[19:14:43] ottomata: if you do it before tomorrow I'll work on it tomorrow morning :)
[19:14:53] No rush for today though
[19:14:54] ok will do
[19:14:55] great
[19:25:39] Analytics, InteractionTimeline, Anti-Harassment (AHT Sprint 15): Measure how many unique people visit the Timeline - https://phabricator.wikimedia.org/T187374#3973443 (Milimetric) @dbarratt Nuria's on vacation this week, here's your piwik snippet, you're site 14! :) ```