[09:28:44] Analytics, User-Elukey: Investigate Hadoop HDFS ACLs - https://phabricator.wikimedia.org/T246755 (elukey) Did a little test today: ` hdfs dfs -setfacl -m default:group:analytics-privatedata-users:r-x /user/elukey/test elukey@an-tool1006:~$ hdfs dfs -getfacl /user/elukey/test Picked up JAVA_TOOL_OPTIONS...
[09:28:53] first acl tests --^
[09:33:55] wow they work :)
[09:36:22] Analytics, User-Elukey: Investigate Hadoop HDFS ACLs - https://phabricator.wikimedia.org/T246755 (elukey) Another example that looks very good: ` elukey@an-tool1006:~$ sudo -u analytics kerberos-run-command analytics hdfs dfs -ls /user/elukey/test-no-acls ls: Permission denied: user=analytics, access=RE...
[09:47:14] Are Graphite metrics available from Spark? I unfortunately have a use for joining EventLogging with StatsD numbers.
[09:48:00] awight: not that I know, but in theory graphite can be queried
[09:48:39] +1 it wouldn't be so bad to build separate CSVs if I have to.
[09:49:01] * addshore reads up
[09:49:12] sounds like your accessing some sort of wmde technical wish meteics? ;P
[09:49:15] awight: one qs - is the eventlogging schema fixed?
[09:49:16] *metrics
[09:49:28] or not ....
[09:49:53] elukey: Yes, thanks! Actually, the data I want is in that schema but unfortunately hadn't fixed in time to cover the window I need.
[09:49:56] awight: I am asking since we stopped refine to avoid alarms etc.., please tell us when to re-enable
[09:49:59] ah
[09:50:27] so should we re-enable refine?
[09:50:41] addshore: hehe, you got it. StatsD makes me sad, especially since we don't have it set up to do multidimensional columns.
[09:50:49] elukey: Yes, I think it should be safe now.
[09:50:54] all right doing it
[09:51:04] At least, the schema passes validation for me.
[09:51:05] awight: indeed, lots of the things we have there should move over to eventlogging when the need arises
[09:51:38] addshore: It's too bad that we have to be so reactive about the metrics though, they would have been useful months ago.
[09:53:45] elukey: Is there any chance of backfilling those metrics since last c. Wednesday?
[09:54:30] awight: my understanding is that they refer to a schema that doesn't match so it will likely fail
[09:54:37] but something might be done, I'll ask
[09:56:01] elukey: Not worth any acrobatics, but JFYI the code using the new schema went out with last week's train, and we enabled the TwoColConflict feature as default on a few wikis on Wednesday, so the timing is awkward.
[09:56:17] If it happens for free that's great, but don't worry about writing backfill scripts or anything.
[09:57:46] !log remove TwoColConflictExit from eventlogging's refine blacklist
[09:57:47] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[09:58:40] awight: ah ok so from Wednesday onward the data has the right schema?
[09:58:47] if so yes we can definitely refine
[10:00:41] awight: can you give me an exact UTC start timeframe?
[10:00:51] I mean right after deploy of the fix
[10:01:01] I'll run refine from the very next hour onward
[10:06:11] elukey: Apologies, I was wrong about all of the deployment dates I just mentioned. The fix was only deployed as of *yesterday*, at 20:00+0
[10:15:40] awight: ack, if you need the data up to now I can force refine, otherwise it starts from now on
[10:16:34] elukey: Starting now is perfect, thanks again!
[10:17:18] np! :)
[10:17:21] * elukey early lunch!
[10:19:53] enjoy!
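The ACL test pasted above is truncated by the log bot; a minimal sketch of the same idea, using an illustrative path. A `default:` entry on a directory is inherited by files and subdirectories created under it afterwards, which is what makes the second Phabricator example (analytics denied on a directory without ACLs) a useful counterpart:

  # Create a test directory and grant the group read/execute on its future children
  hdfs dfs -mkdir -p /user/elukey/test
  hdfs dfs -setfacl -m default:group:analytics-privatedata-users:r-x /user/elukey/test
  # Show all ACL entries; the "default:" lines apply to new children
  hdfs dfs -getfacl /user/elukey/test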
[10:28:42] Analytics, VPS-project-codesearch: Add analytics/* gerrit repos to code search - https://phabricator.wikimedia.org/T249318 (Addshore)
[10:31:41] Analytics, Wikidata, Wikidata-Campsite (Wikidata-Campsite-Iteration-∞): Remove wb_terms from sqoop - https://phabricator.wikimedia.org/T249319 (Addshore)
[10:32:40] (PS1) Addshore: Remove wb_terms from sqoop [analytics/refinery] - https://gerrit.wikimedia.org/r/585720 (https://phabricator.wikimedia.org/T249319)
[10:33:13] Analytics, Wikidata, Patch-For-Review, Wikidata-Campsite (Wikidata-Campsite-Iteration-∞): Remove wb_terms from sqoop - https://phabricator.wikimedia.org/T249319 (Addshore)
[10:34:02] Analytics, Wikidata, Patch-For-Review, Wikidata-Campsite (Wikidata-Campsite-Iteration-∞): Remove wb_terms from sqoop - https://phabricator.wikimedia.org/T249319 (Addshore)
[11:19:32] (PS1) Fdans: Replace numeral with numbro and fix bytes formatting [analytics/wikistats2] - https://gerrit.wikimedia.org/r/585725 (https://phabricator.wikimedia.org/T199386)
[11:20:07] (CR) jerkins-bot: [V: -1] Replace numeral with numbro and fix bytes formatting [analytics/wikistats2] - https://gerrit.wikimedia.org/r/585725 (https://phabricator.wikimedia.org/T199386) (owner: Fdans)
[11:41:00] new challenge - trying to build superset 0.36.0rc3 via upstream tarball
[11:41:35] there is a 72h window to comment the soon-to-be release, so if I can test in staging and report back it could be easier to ask upstream to fix bugs
[11:47:22] awight: very weird, refine is still failing
[11:47:23] `selections` array schema must specify the items type field
[11:50:06] (PS1) Elukey: Test superset 0.36.0rc3 (release candidate) [analytics/superset/deploy] - https://gerrit.wikimedia.org/r/585733
[11:54:37] elukey: Maybe there are lingering old schema IDs in the data?
[11:55:19] The new schema definitely includes "items" in selections.
[11:55:32] Is it possible to blacklist just a specific revision ID?
[11:56:27] I think there's often a long tail with JavaScript... Does this error crash the job, or is it recoverable and just drops events?
[11:57:03] (I'll verify that the correct schema is live...)
[11:58:28] awight: there could be also the option that oneOf is not correctly parsed by the validator
[12:00:13] (I verified that a conflict gets the new schema on production, fwiw)
[12:00:53] elukey: I can imagine that, but the error seems very specific.
[12:02:08] Can I peek at the raw data to be refined somewhere?
[12:02:24] Kafka, on stat*?
[12:02:29] awight: yes this is why I am asking.. the first thing in the array spec is oneOf
[12:03:31] lemme check the topic
[12:04:05] Theoretically, I could collapse the schema to just a JSON string for "selections", but if the problem is a long tail of this bad revision, then it won't help I guess.
[12:06:09] awight: from statxxx something like
[12:06:09] kafkacat -b kafka-jumbo1001.eqiad.wmnet:9092 -t eventlogging_TwoColConflictConflict -C -c -10
[12:06:15] +1 awesome
[12:06:41] eventlogging_TwoColConflictExit perhaps
[12:07:02] yes sorry
[12:07:19] Is it possible that the job is trying to backfill?
[12:08:34] last error is for hour 2020/04/03/08
[12:08:52] basically all from 2020/04/02/08 onward
[12:09:02] so yes it started with yesterday's data
[12:09:16] but it has not being able to refine anything
[12:13:09] elukey: The last bad data has "dt": "2020-04-02T20:11:43Z" and offset 2906. I donno if that helps?
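A sketch of the raw-data peek being done here, assembled from the kafkacat invocations in this log (broker and topic as above; the offset and counts are illustrative, not authoritative):

  # Consume the 10 most recent messages from the topic, then exit at end of partition
  kafkacat -b kafka-jumbo1001.eqiad.wmnet:9092 -t eventlogging_TwoColConflictExit -C -o -10 -e
  # Print offset and payload for each message starting from a known offset,
  # e.g. to check which schema revision IDs appear after the bad data
  kafkacat -b kafka-jumbo1001.eqiad.wmnet:9092 -t eventlogging_TwoColConflictExit -C -o 2907 -f '%o %s\n' -e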
[12:13:26] Everything past that is showing the new schema ID
[12:14:29] I used, kafkacat -L -b kafka-jumbo1001.eqiad.wmnet:9092 -t eventlogging_TwoColConflictExit -f '%o %s\n' -C -c 132 -o 2907 | grep -c 19905420
[12:16:10] awight: ack it is helpful.. so part of the alerts are explained
[12:16:42] but last error seems 2020/04/03/08
[12:16:45] I'm happy that it was such a short "long tail" of cached JavaScript, really.
[12:16:53] oooh noes
[12:17:04] 8am UTC today?
[12:17:21] yes
[12:17:24] I feel silly for needing a "oneOf"
[12:17:42] so I'll keep refine errors monitored, weird
[12:18:05] I think that if some data inside one specific hour doesn't parse, then refine declares defeat
[12:18:37] so it might be as you were saying that some data still coming from js clients with caches or similar are still sending bad data
[12:18:51] I have the refinery-source repo locally--would you mind pointing me to a page describing how to run the job?
[12:19:16] I doubt it's bad data in the sense we've been looking for, since the old schema ID doesn't show up any more.
[12:19:41] Maybe there's an easy way to purge those 2906 stale events in Kafka?
[12:19:43] awight: how much data did you check?
[12:19:53] Everything from offset 2907-now
[12:20:44] refine for eventlogging is run via systemd timer on an-coord1001, but you don't have perms to read eventlogging's raw data
[12:20:51] it is pulled from kafka to hdfs via camus
[12:21:19] Oh for sure I would just be playing around locally, to tell if the new "oneOf" schema is fatal
[12:21:30] ahh sorry
[12:21:37] locally = on my laptop
[12:22:29] Just as a professional skeptic, I'm slightly suspicious that the refine job might be trying to process all the stale events, even if it has some kind of timestamp filter.
[12:22:46] Like, maybe it processes the event and then checks the timestamp...
[12:23:18] awight: well no it reports specific hours that are failing
[12:23:29] okay :-) ty
[12:23:48] ah yes this is why: since = 26
[12:23:57] we process now - 26 hours by default
[12:24:07] this is why it checked prev data
[12:24:20] cool, sounds like an explanation!
[12:24:34] I'm totally fine waiting until next Monday to restart processing, whatever works for you
[12:25:57] awight: I'll open a task if more errors will pop up, adding you in Cc so we'll debug.
[12:26:06] what do you think?
[12:26:13] in case I'll stop refine again for the weekend to avoid noise
[12:26:51] I'll also add steps to repro etc..
[12:27:29] So, no processing over the weekend? That's fine for me. I can just ping you on Monday if I don't see anything in the events.* table.
[12:27:47] super
[12:28:09] sorry again about all the inconvenience--I'll try to add that edit-time validation for the Schema namespace!
[12:30:48] (PS2) Elukey: Test superset 0.36.0rc3 (release candidate) [analytics/superset/deploy] - https://gerrit.wikimedia.org/r/585733
[12:31:11] awight: absolutely no problem!
[12:31:43] Analytics, Analytics-EventLogging: Validate JSON-schema before allowing saves in the Schema namespace - https://phabricator.wikimedia.org/T249333 (awight)
[12:31:49] I am also not a great expert in EL so somebody else might pop up later on explaining the problem and why I am a n00b
[12:32:22] Likewise :-), I could have know that a fancy schema was a bad idea...
[12:35:05] especially if I was going to write bad JSON-schema anyway!
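For context on the refine error above: the validator wants every array property to declare its item type directly. A hedged sketch of the shape it accepts; the field name matches the error message, but this is not the actual TwoColConflictExit schema:

  "selections": {
      "type": "array",
      "items": { "type": "string" }
  }

Per the discussion above, the failing revision instead started the items definition with a oneOf, so the validator found no direct type field under items and declared the whole hour of data unrefinable.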
[12:35:12] :D
[12:37:07] (PS2) Fdans: Replace numeral with numbro and fix bytes formatting [analytics/wikistats2] - https://gerrit.wikimedia.org/r/585725 (https://phabricator.wikimedia.org/T199386)
[12:42:31] Analytics, Analytics-EventLogging: Validate JSON-schema before allowing saves in the Schema namespace - https://phabricator.wikimedia.org/T249333 (awight) I was wrong about the premise of this task. There *is* JSON-schema validation happening, but it seems to have some gaps. Will post a minimal test ca...
[12:58:14] Analytics, Analytics-EventLogging: Validate JSON-schema before allowing saves in the Schema namespace - https://phabricator.wikimedia.org/T249333 (awight) We're using a custom validation library, and a custom json-schema schema. However, the EventLogging server uses a vanilla draft 3 validator, so IMO t...
[13:29:38] I am able to build superset from upstream rc tarball, and I am hitting https://github.com/apache/incubator-superset/issues/8808
[13:29:41] sigh
[13:30:42] need to go out a bit to buy some stuff for my parents, will be back in 30/60 mins max hopefully
[13:33:02] (CR) Joal: [V: +2 C: +2] "Great :) Merging" [analytics/refinery] - https://gerrit.wikimedia.org/r/585720 (https://phabricator.wikimedia.org/T249319) (owner: Addshore)
[13:33:40] Analytics, Analytics-EventLogging: efSchemaValidate permits unrecognized additional fields - https://phabricator.wikimedia.org/T46454 (awight) In distant hindsight, it might have been nicer to include additional validation for EventLogging schemas, to enforce that the `additionalProperties` key is presen...
[13:47:04] ty joal !
[13:47:15] np addshore - thanks for cleaning up :)
[13:47:26] IMO considering we should be the only people to have ever used that data there, feel free to drop it whenever
[13:48:03] ack addshore :)
[14:09:16] Analytics, Analytics-EventLogging, Patch-For-Review: Validate JSON-schema before allowing saves in the Schema namespace - https://phabricator.wikimedia.org/T249333 (awight)
[14:39:05] joal: bonsoir
[14:56:19] Hi elukey :)
[14:56:48] joining batcave now elukey if you wsant to talk :)
[14:58:15] elukey: Thanks a lot for the blacklist
[15:39:52] sorry foer missing standup! marcel and I were in an interview
[16:22:44] * addshore is failing to query mediawiki_revision_score
[16:24:59] Analytics, Better Use Of Data, Event-Platform, Product-Infrastructure-Team-Backlog, Epic: Vertical: Virtualpageview datastream on MEP - https://phabricator.wikimedia.org/T238138 (Nuria) Sounds good, added the searchsatisfaction ticket to goals.
[16:38:31] (CR) Nuria: [C: +1] "Looks good, rememeber to add to train etherpad as job needs to be restarted." [analytics/refinery] - https://gerrit.wikimedia.org/r/585432 (https://phabricator.wikimedia.org/T246186) (owner: Elukey)
[16:39:52] addshore: what is the issue?
[16:40:10] my query writing knowledge :D I ended up figuring it out :)
[16:40:15] now looking at clickstream instead
[16:40:27] addshore: k, click stream data?
[16:40:45] Yus, or planing on it, is that in hadoop?
im currently reading https://meta.wikimedia.org/wiki/Research:Wikipedia_clickstream
[16:41:13] addshore: click stream is a dataset , not a table
[16:41:22] aaaah, wahaha, okay :P
[16:41:44] addshore: let me make sure that there is no intermediary table to query , i think there is not
[16:44:52] addshore: no, there is no table, it is just a dataset creation job: https://github.com/wikimedia/analytics-refinery-source/blob/master/refinery-job/src/main/scala/org/wikimedia/analytics/refinery/job/ClickstreamBuilder.scala
[16:45:00] okay! thanks!
[16:46:08] joal: this is what i am about to execute in an-acoord
[16:46:12] https://www.irccloud.com/pastebin/tVr4sXNr/
[16:46:22] reading nuria
[16:46:31] joal: with sudo -u analytics
[16:46:50] nuria: --^ + kerberos-run-command analytics
[16:47:38] joal: oohhhhh
[16:47:39] nuria: quite some folders will be created, I suggest using a different folder than your home one, /user/nuria/20200403_sqoop
[16:47:47] joal: k chnaging
[16:48:05] nuria: I also suggest changing the job name to make it explicit it is yours
[16:48:36] joal: k will do as well
[16:50:12] nuria: last - Please change the log file to one explicit about 1-off /var/log/refinery/sqoop-mediawiki-imagelinks-20200403.log
[16:50:15] thanks a lot for that :)
[16:50:19] nuria: --^
[16:50:27] The rest looks ok :0
[16:50:40] joal: i did changed logfile butr will add dates
[16:50:54] Thanks )
[16:51:26] addshore: With spark you can read the clickstream data if you need (value separated)
[16:51:31] it is present in hadoop
[16:51:45] if its only a dump I dont think it will give me the value I desired ;)
[16:51:46] joal: and after i need to repair tables with msck
[16:51:57] joal: and is there anything else needed so i can read the tables?
[16:51:59] addshore: /wmf/data/archive/clickstream
[16:52:18] nuria: I don't think so :)
[16:52:44] nuria: check the folder before table creation (to be sure), and then indeed repair table, and that hsould be it
[16:52:50] joal: k, starting
[16:53:03] nuria: you'll know that repair table works since it tells you partitions added
[16:53:22] Leaving for diner with family - will reconnect during weekend
[16:53:26] Bye team
[16:54:33] joal: one more question if i may
[16:54:38] joal: argh, nvm
[16:57:23] ottomata: any ideas why this would return a permits error?
[16:57:25] ottomata:
[16:57:28] https://www.irccloud.com/pastebin/KXlOV8gK/
[16:59:35] (CR) Nuria: "I should have caught this, my apologies" [analytics/refinery] - https://gerrit.wikimedia.org/r/585582 (owner: Joal)
[17:10:34] hahhahahahahaaa
[17:11:10] So, my covid related page view work had a bug in it, I was leaving spaces in some titles instead of transforming them to _s
[17:11:35] peak related pageviews for covid from my queries now goes from 1.9 million in a day to 9.9 million
[17:11:47] From this https://usercontent.irccloud-cdn.com/file/R4qx10MQ/image.png
[17:12:07] To this https://usercontent.irccloud-cdn.com/file/NXu7l3O2/image.png
[17:14:02] * elukey off o/
[17:37:55] ottomata: ping ping
[17:46:42] * i think i figured it out thnaks to helpful docs*
[17:46:48] *thanks
[17:48:59] cool nuria - The shebang was it ?
[17:58:49] joal: it was teh fact that i need to run sudo -u analytics kerberos-run-command analytics hdfs dfs -ls before anything to get credentials
[17:59:05] joal: also need to modify my homedir so it is writable by analytics, doing that now
[18:06:50] joal: i killed on teh ongoing command with ctrl c
[18:07:02] joal: maybe not the best, is there anything else i should do?
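A sketch of the kerberized one-off flow discussed above, using the folder joal suggested. The database and table names are hypothetical, and invoking hive through kerberos-run-command in one shot like this is an assumption, not a confirmed recipe:

  # Acquire credentials as the analytics user and sanity-check the sqoop output folder
  sudo -u analytics kerberos-run-command analytics hdfs dfs -ls /user/nuria/20200403_sqoop
  # Register the new partitions; MSCK prints the partitions it added, which confirms the repair worked
  sudo -u analytics kerberos-run-command analytics hive -e 'MSCK REPAIR TABLE nuria.imagelinks'  # hypothetical table name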
[18:50:36] hi nuria
[18:50:46] on interview, sorry
[18:50:50] nuria: np
[19:20:29] nuria: ah hi missed your ping!
[19:20:29] sorry!
[19:20:31] am here
[19:20:37] also I need a brain bounce about some eventlogging stuff
[19:27:59] ottomata: yesssir
[19:28:08] nuria: bc?
[19:28:11] ottomata: k