[00:29:22] Amir1: what do you want? i think so [00:29:35] not all jobs, since the data is not really schemaed well [00:29:54] ottomata: I want to get the parameters of the jobs that were run [00:30:03] query them basically [00:30:12] see the event.mediawiki_job_* tables in hive [00:30:23] not all jobs are refineable, but many are! [00:32:32] oh thanks [00:47:05] (03PS8) 10Nuria: [WIP] Table and workflow for features computations per session [analytics/refinery] - 10https://gerrit.wikimedia.org/r/552943 (https://phabricator.wikimedia.org/T238360) [02:27:05] 10Analytics, 10Research-Backlog, 10Wikidata: Copy Wikidata dumps to HDFS - https://phabricator.wikimedia.org/T209655 (10GoranSMilovanovic) @JAllemandou Do you think it would be possible to produce a new version of this data set? The latest update seems to be: `2019-10-03 09:29 /user/joal/wmf/data/wmf/mediaw... [04:19:04] 10Analytics, 10Datasets-Archiving, 10Research-Backlog: Make HTML dumps available - https://phabricator.wikimedia.org/T182351 (10leila) a:05leila→03None [04:19:11] 10Analytics, 10Datasets-Archiving, 10Research-Backlog: Make HTML dumps available - https://phabricator.wikimedia.org/T182351 (10leila) >>! In T182351#5710173, @ArielGlenn wrote: >>>! In T182351#5709123, @leila wrote: >> @tizianopiccardi thanks for the update and great to see that you're there. :) Please make... [07:57:43] joal: bonjour! 
[07:58:00] so spark code logs as DEBUG hive token stuff -.- [07:58:02] 19/12/04 07:55:52 DEBUG HiveDelegationTokenProvider: Getting Hive delegation token for analytics/analytics1030.eqiad.wmnet@WIKIMEDIA against hive/_HOST@WIKIMEDIA at thrift://analytics1030.eqiad.wmnet:9083 [07:58:13] this is for the metaas [07:58:17] *metastore [07:58:47] 19/12/04 07:55:52 DEBUG HiveDelegationTokenProvider: Get Token from hive metastore: Kind: HIVE_DELEGATION_TOKEN, Service: , Ident: 00 2d 61 6e 61 6c 79 74 6 [08:05:02] and I don't see hive server 2 in https://github.com/eBay/Spark/blob/master/yarn/src/main/scala/org/apache/spark/deploy/yarn/YarnSparkHadoopUtil.scala [08:10:22] but it must be somewhere [08:14:25] https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/deploy/security/README.md [08:14:31] https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/deploy/security/README.md [08:32:02] in fact, https://github.com/apache/spark/blob/master/sql/hive/src/main/scala/org/apache/spark/sql/hive/security/HiveDelegationTokenProvider.scala mentions only metastore [09:06:49] so to summarize [09:07:26] 1) the DataFrameToHive code is triggered only when create/alter statements need to be executed, otherwise it isn't [09:08:33] 2) it seems that the JDBC connection would need special handling for credentials, the code doesn't retrieve any delegation token so it may be our responsibility to do so, or just provide a valid keytab [09:57:23] ok I just merged https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/554473/ [09:57:36] and I am backfilling refine navtiming in the testing cluster [11:13:24] ok all good, works like a charm [11:36:44] !log restart mariadb on analytics1030 (hadoop test coordinator) to test explicit_defaults_for_timestamp - T236180 [11:36:46] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [11:36:47] T236180: Deploy search platform airflow service - 
https://phabricator.wikimedia.org/T236180 [11:36:49] * elukey lunch! [11:48:11] 10Analytics, 10Event-Platform, 10serviceops: conntrack -L - https://phabricator.wikimedia.org/T239795 (10Aklapper) [11:55:49] 10Analytics, 10Event-Platform, 10serviceops: Connection tracking on kubernetes hosts alerts - https://phabricator.wikimedia.org/T239795 (10akosiaris) 05Open→03Resolved p:05Triage→03Normal [12:01:16] 10Analytics, 10Analytics-Kanban: Request for a large request data set for caching research and tuning - https://phabricator.wikimedia.org/T225538 (10Danielsberger) Hi @lexnasser , Thank you for setting up the detailed description on wikitech. This is really great! One thing that I'm wondering about is that... [12:02:00] 10Analytics, 10Event-Platform, 10serviceops, 10Patch-For-Review: Connection tracking on kubernetes hosts alerts - https://phabricator.wikimedia.org/T239795 (10akosiaris) [12:59:38] 10Analytics, 10Product-Analytics, 10Growth-Team (Current Sprint): Growth: implement wider data purge window - https://phabricator.wikimedia.org/T237124 (10mforns) @Nuria, I believe, for now, that would be OK for them. @nettrom_WMF explained that they are aiming to make short term analyses of 270 days, and th... [14:49:49] 10Analytics, 10Event-Platform, 10Operations, 10Wikimedia-Logstash, 10observability: Move eventgate logs to new logging infrastructure - https://phabricator.wikimedia.org/T225129 (10Ottomata) 05Open→03Resolved [15:04:36] (03PS1) 10Mforns: Correct minor details in wmcs queries [analytics/reportupdater-queries] - 10https://gerrit.wikimedia.org/r/554527 (https://phabricator.wikimedia.org/T232671) [15:12:57] (03CR) 10Mforns: "@srishakatux I was troubleshooting the queries, and I realized I had overlooked a couple details in our previous code review. 
Sorry for th" (0311 comments) [analytics/reportupdater-queries] - 10https://gerrit.wikimedia.org/r/554527 (https://phabricator.wikimedia.org/T232671) (owner: 10Mforns) [15:13:26] (03CR) 10Mforns: [V: 03+2 C: 03+2] "Self-merging to unbreak production." [analytics/reportupdater-queries] - 10https://gerrit.wikimedia.org/r/554527 (https://phabricator.wikimedia.org/T232671) (owner: 10Mforns) [15:24:50] Hi team [15:26:36] hey joal :] [15:27:06] elukey: I have found some info on HiveServer2 JDBC connection yesterday, but didn't manage to make it work [15:27:34] elukey: You were right in thinking JDBC-Kerb needs some settings we don't provide as of now in code [15:27:41] Hi mforns - all good? [15:28:28] joal: o/ - I think it is easier to just pass the keytab, it works in test [15:29:45] all goooood :] [15:35:25] elukey: awesome :) [15:42:51] 10Analytics, 10Analytics-Kanban: Request for a large request data set for caching research and tuning - https://phabricator.wikimedia.org/T225538 (10Nuria) @Danielsberger It is the other way around, the volume of requests to upload is a lot higher. Think that a web page is one made by one text document and m... [15:46:40] Hi milimetric [15:47:05] hi joal [15:47:11] arg! sorry forgot to say hello today [15:48:18] milimetric: I'm answering the email about alerts [15:48:26] 10Analytics, 10Analytics-Kanban, 10Better Use Of Data, 10Product-Analytics: Upgrade to Superset 0.35 - https://phabricator.wikimedia.org/T236690 (10elukey) The fix for https://github.com/apache/incubator-superset/issues/8676 has been merged to upstream, so it will likely be included in the next release. I... [15:48:53] milimetric: we should rerun the failed jobs not in sync - The problem is due to overloading cassandra when loading the 2 big jobs at the same time. [15:49:35] milimetric: Then a patch to move mediarequest one hour later is probably enough and not too disruptive in terms of data availability? 
[15:50:42] joal: I'm not totally up to speed with the other cassandra loading, is it temporary or is it going to keep happening for the next few months? [15:50:59] if it's semi-permanent, yeah, let's move it one hour later [15:51:35] and I'm not sure when to rerun those two failed jobs then, when is cassandra not overloaded? [15:51:37] milimetric: the problem is unrelated to backfilling - errors are from prod-jobs of yesterday :) [15:52:43] oh I get it now [15:52:55] sending patch / rerunning, thanks! [15:53:12] milimetric: errors occur when refining and extracting the last hour of the day finishes at a similar time for "upload+mediarequest" and "text+pageview" [15:53:21] your email was clear, I got confused with your IRC even though it's almost exactly the same message - interesting [15:53:50] :D [15:54:02] I mentioned midnight in the email - could help :) [15:56:27] milimetric: I look at these graphs to check on cassandra backfilling state: [15:56:30] https://grafana.wikimedia.org/d/000000418/cassandra?orgId=1&var-datasource=eqiad%20prometheus%2Fanalytics&var-cluster=aqs&var-keyspace=local_group_default_T_mediarequest_per_file&var-table=data&var-quantile=99p [15:57:14] Noticeably, you can see backfilling load jobs in the "write rate" graph, and we are now interested in "pending compactions" [15:58:16] As you can see, pending compactions are almost back to 0 after having gone high while loading was happening - This means the system is back to an almost normal state: you can restart one of the two jobs (and the other one when the first is done) [15:58:24] milimetric: --^ [15:58:24] (03PS1) 10Elukey: build_wheels.sh: use the system pip instead of the virtualenv's one [analytics/superset/deploy] - 10https://gerrit.wikimedia.org/r/554541 (https://phabricator.wikimedia.org/T236690) [15:58:30] milimetric: makes sense? 
[15:59:28] joal: yeah, definitely, I just have to groom the tech com backlog now and will restart after [16:01:57] Thanks a lot milimetric :) [16:05:07] (03CR) 10Ottomata: [C: 03+1] build_wheels.sh: use the system pip instead of the virtualenv's one [analytics/superset/deploy] - 10https://gerrit.wikimedia.org/r/554541 (https://phabricator.wikimedia.org/T236690) (owner: 10Elukey) [16:07:07] (03CR) 10Elukey: [V: 03+2 C: 03+2] build_wheels.sh: use the system pip instead of the virtualenv's one [analytics/superset/deploy] - 10https://gerrit.wikimedia.org/r/554541 (https://phabricator.wikimedia.org/T236690) (owner: 10Elukey) [16:07:52] (03Abandoned) 10Elukey: WIP - Superset 0.35.1 [analytics/superset/deploy] - 10https://gerrit.wikimedia.org/r/552238 (owner: 10Elukey) [16:08:37] 10Analytics, 10Analytics-Kanban, 10Better Use Of Data, 10Product-Analytics, 10Patch-For-Review: Upgrade to Superset 0.35 - https://phabricator.wikimedia.org/T236690 (10elukey) [16:09:04] 10Analytics-Kanban, 10Better Use Of Data, 10Product-Analytics: Superset Updates - https://phabricator.wikimedia.org/T211706 (10elukey) [16:13:22] 10Analytics-Kanban, 10Better Use Of Data, 10Product-Analytics: Superset Updates - https://phabricator.wikimedia.org/T211706 (10elukey) As FYI to everybody interested, we have tested 0.35.1 (latest upstream) in T236690, but we ended up in a bug that broke a dashboard: https://github.com/apache/incubator-super... [16:13:46] 10Analytics, 10Analytics-Kanban, 10Product-Analytics, 10SDC General, 10Wikidata: Create reportupdater reports that execute SDC requests - https://phabricator.wikimedia.org/T239565 (10mpopov) >>! In T239565#5706854, @Milimetric wrote: > Yay, I get to work with @mpopov :) Aw, I feel likewise! :D > * how... 
[16:33:29] 10Analytics, 10Analytics-Kanban, 10Product-Analytics, 10SDC General, 10Wikidata: Create reportupdater reports that execute SDC requests - https://phabricator.wikimedia.org/T239565 (10Nuria) Some alternatives: superset can source data from other places than druid and we have couple dashboards on top of so... [16:43:34] (03PS1) 10Mforns: Escape dollar sign in hive script for wmcs [analytics/reportupdater-queries] - 10https://gerrit.wikimedia.org/r/554564 (https://phabricator.wikimedia.org/T232671) [16:46:27] (03CR) 10Mforns: [V: 03+2 C: 03+2] "Self-merging to unbreak production." [analytics/reportupdater-queries] - 10https://gerrit.wikimedia.org/r/554564 (https://phabricator.wikimedia.org/T232671) (owner: 10Mforns) [17:02:11] 10Analytics, 10Analytics-Kanban: Create kerberos principals for users - https://phabricator.wikimedia.org/T237605 (10AndyRussG) Hi! Here's my request for the new creds for stat100* and notebook100*, please. Username: andyrussg. Thanks so much for working on this!!!!! :) [17:15:28] 10Analytics, 10Analytics-Kanban: Create kerberos principals for users - https://phabricator.wikimedia.org/T237605 (10Ejegg) Hi! here's my request for Kerberos credentials for Hadoop access on stat100X and notebook100X. My username is ejegg. [17:32:11] milimetric: today we need to deploy , ping us when you are back from meeting cc joal [17:32:39] 10Analytics, 10Analytics-Kanban: Create kerberos principals for users - https://phabricator.wikimedia.org/T237605 (10elukey) >>! In T237605#5712992, @Ejegg wrote: > Hi! here's my request for Kerberos credentials for Hadoop access on stat100X and notebook100X. My username is ejegg. ` elukey@krb1001:~$ sudo man... 
[17:35:05] nuria: yep, was planning on deploying in 1.5 hours, at 14:00 EST [17:35:29] I'll do everything in the etherpad, let me know if there are any exceptions/additions [17:35:43] milimetric: see: https://etherpad.wikimedia.org/p/analytics-weekly-train [17:36:13] joal: i think moving the cassandra job and restart also needs to be added to train etherpad [17:39:59] yup, will do nuria [17:41:51] 10Analytics, 10Analytics-Kanban: Create kerberos principals for users - https://phabricator.wikimedia.org/T237605 (10elukey) >>! In T237605#5712943, @AndyRussG wrote: > Hi! Here's my request for the new creds for stat100* and notebook100*, please. Username: andyrussg. Thanks so much for working on this!!!!! :)... [17:43:25] 10Analytics, 10Analytics-Kanban: Delay cassandra mediarequest-per-file daily job one hour so that it doesn't colide with pageview-per-article - https://phabricator.wikimedia.org/T239848 (10JAllemandou) [17:43:39] 10Analytics, 10Analytics-Kanban: Delay cassandra mediarequest-per-file daily job one hour so that it doesn't colide with pageview-per-article - https://phabricator.wikimedia.org/T239848 (10JAllemandou) a:03JAllemandou [17:43:51] nuria: lmk if there’s anything I could do to get started with my next project. I’m thinking of playing around with Gerrit, but if there’s something more particular you think would be helpful, I’m all ears [17:44:17] lexnasser: did you need some help today copying some stuff somewhere? [17:46:21] ottomata: Yes, the privacy manager is writing up a risk assessment right now, and I think I’ll need your help to move the files sometime today. nuria: should I wait on James’ risk assessment completion before release? 
[17:46:42] (we are both in a meeting atm, she'll respond more in a bit i think) [17:48:11] 10Analytics, 10Analytics-Kanban, 10Product-Analytics, 10SDC General, 10Wikidata: Create reportupdater reports that execute SDC requests - https://phabricator.wikimedia.org/T239565 (10Abit) Bless y'all's hearts for setting this up for us ♥♥♥ [17:50:31] lexnasser: i was thinking this one for your next task: [17:50:38] lexnasser: ta-ta-channnnn [17:51:19] lexnasser: https://phabricator.wikimedia.org/T239625 [17:51:45] lexnasser: take a look and let me know, involves programming in java (little) and adding tests/running queries [17:54:43] 10Analytics, 10Product-Analytics, 10Growth-Team (Current Sprint): Growth: implement wider data purge window - https://phabricator.wikimedia.org/T237124 (10nettrom_WMF) @Nuria : I can confirm what @mforns mentions. During my conversations with him yesterday, it became clear to me that how the Growth team is u... [17:57:58] 10Analytics, 10Product-Analytics, 10Growth-Team (Current Sprint): Growth: implement wider data purge window - https://phabricator.wikimedia.org/T237124 (10Nuria) @nettrom_WMF which are the schemas subjected to this 270 window? [18:00:41] 10Analytics, 10Product-Analytics, 10Growth-Team (Current Sprint): Growth: implement wider data purge window - https://phabricator.wikimedia.org/T237124 (10nettrom_WMF) @Nuria : [[ https://meta.wikimedia.org/wiki/Schema:HelpPanel | HelpPanel ]], [[ https://meta.wikimedia.org/wiki/Schema:HomepageVisit | Homepa... [18:02:07] nuria: Tentatively, it sounds fitting. I'll read more into it and the referenced tickets [18:04:49] 10Analytics, 10Performance-Team, 10Research, 10Security-Team, and 2 others: A Large-scale Study of Wikipedia Users' Quality of Experience: data release - https://phabricator.wikimedia.org/T217318 (10JFishback_WMF) Due to the low impact of harm, and low opportunity and probability of malicious use of this d... 
[18:05:22] 10Analytics, 10Performance-Team, 10Research, 10Security-Team, and 2 others: A Large-scale Study of Wikipedia Users' Quality of Experience: data release - https://phabricator.wikimedia.org/T217318 (10JFishback_WMF) #wmf-legal can we please get someone to sign off on this? [18:20:14] * elukey off! [18:23:04] (03PS1) 10Mforns: Add funnel parameter to wmcs queries that return multiple rows [analytics/reportupdater-queries] - 10https://gerrit.wikimedia.org/r/554585 (https://phabricator.wikimedia.org/T232671) [18:24:36] (03CR) 10Mforns: [V: 03+2 C: 03+2] "Self-merging to unbreak production." (031 comment) [analytics/reportupdater-queries] - 10https://gerrit.wikimedia.org/r/554585 (https://phabricator.wikimedia.org/T232671) (owner: 10Mforns) [18:58:24] 10Analytics: Add pertinent wdqs_external_sparql_query metrics and wdqs_internal_sparql_query to a superset dashboard - https://phabricator.wikimedia.org/T239852 (10Nuria) [19:13:20] 10Analytics, 10Research-Backlog, 10Wikidata: Copy Wikidata dumps to HDFS - https://phabricator.wikimedia.org/T209655 (10JAllemandou) New dataset available @GoranSMilovanovic. Pinging @Groceryheist as I also generated the items per page. ` hdfs dfs -ls /user/joal/wmf/data/wmf/mediawiki/wikidata_parquet | tai... [19:17:09] a-team: deployment etherpad for today says "patch is here:" but doesn't link to a patch (https://etherpad.wikimedia.org/p/analytics-weekly-train), was that supposed to be the patch that delays one of the cassandra daily workflows? If so, I can do that, but just wanted to check [19:17:46] milimetric: indeed! 
currently writing the commit message [19:18:16] k, cool [19:18:26] I'll review/merge and deploy after then [19:19:20] (03PS1) 10Joal: Add delay to cassandra oozie loading [analytics/refinery] - 10https://gerrit.wikimedia.org/r/554593 (https://phabricator.wikimedia.org/T239848) [19:19:29] got it [19:20:24] milimetric: I'm unhappy with my commit message and have not tested yet [19:27:39] (03PS2) 10Joal: Add delay to cassandra oozie loading [analytics/refinery] - 10https://gerrit.wikimedia.org/r/554593 (https://phabricator.wikimedia.org/T239848) [19:30:38] milimetric: tests are successful for the patch - would you mind reviewing for my poor english (and other less important code related stuff ;) [19:30:56] (03CR) 10Joal: [V: 03+2] "Validates on cluster" [analytics/refinery] - 10https://gerrit.wikimedia.org/r/554593 (https://phabricator.wikimedia.org/T239848) (owner: 10Joal) [19:31:00] hm, joal what about the typo here: https://gerrit.wikimedia.org/r/#/c/analytics/refinery/+/554593/1/oozie/cassandra/daily/coordinator.xml [19:31:09] ... [19:31:55] milimetric: yeah, corrected that in patch 2 :) [19:32:00] ah ok [19:32:05] only that in the 3 coords [19:32:20] milimetric: --^ you can keep your ongoing review and comments :) [19:33:36] my only comment is that the comments make it seem like you can use monthly/daily/hourly granularity and the delay will apply at the right granularity, but it's only implemented for day/hour for now [19:33:56] I'd update the comments to mention day/hour joal, otherwise looks good [19:34:30] oh never mind, I missed the third coord [19:34:33] it's good! 
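For readers following along, the effect of the patch under review can be sketched as a toy calculation. This is an illustration only: the delay value, granularity names, and function are assumptions for the sketch, not read from the actual coordinator.xml files.

```python
from datetime import datetime, timedelta

# Toy model of the reviewed change: a cassandra loading job for a period
# also waits on data shifted forward by a configurable delay, so that two
# heavy loading jobs can be staggered. Only day/hour granularities are
# covered, matching the review comment above. Values are illustrative.
DELAY = {"daily": timedelta(hours=1), "hourly": timedelta(hours=1)}

def earliest_start(granularity: str, period_end: datetime) -> datetime:
    """Earliest time the load can start once its period's data exists."""
    if granularity not in DELAY:
        raise ValueError(f"delay not implemented for granularity {granularity!r}")
    return period_end + DELAY[granularity]

# A daily job whose day of data is complete at 2019-12-05 00:00
# now waits until 01:00 before loading cassandra.
print(earliest_start("daily", datetime(2019, 12, 5)))  # 2019-12-05 01:00:00
```

The base data computation is untouched; only the job's start time shifts, which is exactly the "not too disruptive" property discussed earlier.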
[19:34:34] nvm [19:34:56] (03CR) 10Milimetric: [C: 03+2] Add delay to cassandra oozie loading [analytics/refinery] - 10https://gerrit.wikimedia.org/r/554593 (https://phabricator.wikimedia.org/T239848) (owner: 10Joal) [19:35:05] also milimetric, about kill/restart, I suggest killing only the mediarequest-per-file coord (not the bundle), and restart this one only, to prevent [19:35:06] (03CR) 10Milimetric: [V: 03+2 C: 03+2] Add delay to cassandra oozie loading [analytics/refinery] - 10https://gerrit.wikimedia.org/r/554593 (https://phabricator.wikimedia.org/T239848) (owner: 10Joal) [19:35:33] reloading 3 days in cassandra [19:35:34] joal: I just reran the specific instance of that coord only, didn't kill anything [19:35:51] and then I'll kill and restart after deploy [19:36:17] milimetric: let's not forget to rerun the failed pageview-per-article as well [19:36:17] but yeah, I was just going to kill the coord not more [19:36:31] cool milimetric [19:36:39] joal: of course, when mediarequests finishes, it's on load-cassandra now [19:36:41] the bundle is nice but super cumbersome [19:37:00] wonder how all this would look like in flyte/airflow [19:37:04] Thanks a lot milimetric - [19:37:08]  [19:37:18] huhu [19:37:25] Gone for diner team, back once done [19:39:02] duh, of course forgot refinery-source [19:43:04] mforns: did you do this on purpose (update the 106 notes instead of adding 107 notes) or should I split the 107 notes out? [19:43:05] https://gerrit.wikimedia.org/r/#/c/analytics/refinery/source/+/552256/1/changelog.md [19:44:06] milimetric, hmmm.. wasn't my last deployment 106? [19:44:27] it says 107 in the git log [19:44:31] milimetric, oh... yea [19:44:34] strange [19:45:07] maybe we didn't update the changelog for 106, and I just added the docs by incrementing without looking at the number... [19:45:09] sorry [19:45:15] was not on purpose [19:45:21] can you correct please? 
[19:45:25] ok, np, ofc [19:45:39] (03PS1) 10Milimetric: Update changelog.md for v0.0.108 [analytics/refinery/source] - 10https://gerrit.wikimedia.org/r/554601 [19:45:48] thanks :] [19:46:08] (03CR) 10Milimetric: [V: 03+2 C: 03+2] Update changelog.md for v0.0.108 [analytics/refinery/source] - 10https://gerrit.wikimedia.org/r/554601 (owner: 10Milimetric) [19:56:30] joal: i have a question if you may for your patch for cassandra [19:56:49] joal data_input_delay here: https://gerrit.wikimedia.org/r/#/c/analytics/refinery/+/554593/2/oozie/cassandra/hourly/coordinator.xml [19:56:54] joal: is used where? [20:08:12] !log deployed refinery source [20:08:18] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [20:14:37] ottomata: getting those same errors scap deploying refinery [20:14:43] shall I ignore and retry as before? [20:15:27] a list of things like (failed: No such file or directory (2)\nrsync: link_stat "/git-fat/72853206967f23d1980d53a9c90fc6139092c878" ) [20:15:50] rolled back [20:16:34] hmm [20:16:43] how come it always happens to you dan? [20:16:44] heheh [20:17:07] milimetric: so the first thign I do here i checkt o see if that sha is actually a real jar [20:18:23] do you know which file that is trying to get? [20:18:45] ok it is artifacts//org/wikimedia/analytics/refinery/refinery-job-0.0.108.jar [20:19:08] uh... I'm like a snail but yes, that and I'm guessing all the other 108 jars I just built [20:20:06] http://127.0.0.1:8080/repository/releases/org/wikimedia/analytics/refinery/job/refinery-job/0.0.108/refinery-job-0.0.108.jar [20:20:08] oops [20:20:09] sorry [20:20:52] ok it is, and the sha is correct [20:20:57] so [20:21:00] ok so things are ok [20:21:06] i think maybe try again dan? [20:21:10] ... 
[20:21:15] :) [20:21:23] I hate whatever is happening here [20:21:29] might have needed more time between your deploy and the jenkins job finishing [20:21:39] there is a cron that runs on archiva host to create the git-fat symlink [20:21:53] is everyone else like cooking dinner in between these things? I'm not going particularly fast or anything... [20:22:04] hmm let me make sure the symlink exists [20:22:24] yup it does [20:22:33] like last time, it finished in like 10 seconds this time [20:22:37] (for the canary) [20:22:49] it was created 12 minutes ago [20:22:56] makes sense, when I built [20:23:06] so at :10 past the hour [20:23:18] hm, no, the jar was uploaded 15 minutes agho [20:23:21] at :05 past the hour [20:23:34] the git-fat sha symlink was created 5 minutes later [20:23:50] ok, so I'll add "wait 5 minutes" in the deploy instructions :) [20:24:05] ya the cron runs on every 5 minutes [20:25:31] 10Analytics, 10Product-Analytics, 10Growth-Team (Current Sprint): Growth: implement wider data purge window - https://phabricator.wikimedia.org/T237124 (10Nuria) I see, +1 to @mforns idea [20:34:51] 10Analytics, 10Product-Analytics, 10Growth-Team (Current Sprint): Growth: implement wider data purge window - https://phabricator.wikimedia.org/T237124 (10mforns) @nettrom_WMF OK, then! I will implement a deletion timer specific to those 3 schemas, that will delete all their data from the event_sanitized d... 
[20:45:24] nuria: data_input_delay is not used, but it needs to be named (therefore that name) - It is needed to enforce some (or no) delay in jobs [20:46:02] joal: the part about "it needs to be named" is what i do understand [20:46:09] *do not understand [20:46:44] lexnasser: for that work you will need java 1.8 and mvn, ping me if you need help with that [20:48:59] nuria: input-event sections need a name in oozie (https://github.com/apache/oozie/blob/master/client/src/main/resources/oozie-coordinator-0.4.xsd#L105) [20:51:07] ottomata: now it's complaining about git fat on stat1007: [20:51:08] https://www.irccloud.com/pastebin/df9VgjAi/ [20:51:16] when I do the hdfs sync [20:51:23] (sudo -u hdfs /srv/deployment/analytics/refinery/bin/refinery-deploy-to-hdfs --verbose --no-dry-run) [20:52:11] did it on 1004 and it worked fine [20:52:19] so I guess something's wrong on 1007 [20:54:53] yeah, there's something really weird on 1007, git status doesn't even work, it throws permission denied errors [20:56:36] looks like that failed deploy left some bad status and references to temp files that are no longer there... [20:57:25] !log finished refinery-deploy-to-hdfs from stat1004 but something's broken on stat1007 in the /srv/deployment/analytics/refinery repo [20:57:27] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [21:00:28] 10Analytics, 10Analytics-Kanban: Create kerberos principals for users - https://phabricator.wikimedia.org/T237605 (10Halfak) I need creds for stat100*. My username is `halfak` [21:01:58] 10Analytics, 10Analytics-Kanban: Create kerberos principals for users - https://phabricator.wikimedia.org/T237605 (10Halfak) I'd like to request creds for the engineers on my team as we'll (hopefully) be using hadoop a lot more soon. Usernames: `accraze` and `kevinbazira` Ping @kevinbazira and @ACraze [21:27:04] milimetric: we need to fix the issues at /srv/deployment/analytics/refinery in stat1007 right? 
[21:27:11] cc ottomata [21:27:46] ottomata: there is an error in permissions [21:27:49] https://www.irccloud.com/pastebin/Uy8MhBCE/ [21:33:45] hm [21:34:24] oh sorry milimetric missed your ping! [21:39:55] checked out master and git fat pulled [21:40:47] oh need the scap sync branch [21:40:47] doing that [21:41:20] fixed [21:42:18] ottomata: shouldn't branch be master? [21:42:22] ottomata: cause now [21:42:23] https://www.irccloud.com/pastebin/21oa5hGt/ [21:42:30] that's right [21:42:32] not master [21:42:38] scap creates a branch at deploy [21:42:40] right [21:42:43] ok, thanks ottomata [21:43:02] so the bad deploy caused this, right? [21:43:08] ottomata: did you need sudo to fix it? [21:43:10] it's weird that it doesn't fix it when it says the deploy succeeds [21:43:26] i did sudo -u analytics-deploy, but i think i could have fixed from the deploy server with a scap redeploy [21:43:31] i didn't do any chmod stuff [21:44:04] ottomata: k, i saw there was a permission error, but afterwards i looked and all dirs were analytics-deploy owned [21:45:35] (added to docs the importance of waiting 5 minutes and the potential consequences of not) [21:51:43] milimetric: for everything in life, really [21:51:56] lol [21:52:04] ok, adding to docs that we should wait 5 minutes in general [22:01:32] 10Analytics, 10Operations, 10ops-eqiad: Degraded RAID on an-worker1089 - https://phabricator.wikimedia.org/T239365 (10Jclark-ctr) Disk arrived [22:12:09] milimetric, joal : delaying the cassandra job is fine but we should not delay the period we rae computing [22:12:11] *are [22:12:27] milimetric: so we should be computing 0..23 hours just like we were before [22:12:36] nuria: this is the way it's done [22:12:51] nuria: base data computation is not affected [22:12:59] I see, just the input events [22:13:15] it's confusing though, until now all jobs had the same input events as actual input [22:13:17] the cassandra job was waiting for 0..23 hours to be available before starting [22:13:41] 
maybe we should wait for 0..0(next) instead [22:14:01] milimetric: that would not stagger jobs though [22:14:03] anyway, it's late, we can talk tomorrow, nothing that can't be undone [22:14:04] now it'll wait for 0..23 hours, AND hour 23+1 (meaning a delay of 1 hour) [22:14:19] it would stagger, 'cause it would still wait for 0(next) [22:14:50] joal: but the period year/month/day calculated is the same correct? [22:15:00] milimetric: actually not all jobs have the same events as input - some have hourly, others daily [22:15:18] nuria: I don't understand your question [22:15:50] joal: sorry, that this did not affect the boundaries of data we are loading [22:16:09] joal: like the job starts one hour later but it is still loading the same data intervals [22:16:13] nuria: yeah, this was my misunderstanding - joal go to sleep I got it :) [22:16:14] nope - The hive query doesn't change [22:17:02] the boundaries of the hive queries don't change - However the available-data needed for the job to start is more than the data it actually uses (1 hour more in our example) [22:17:10] nuria: here's where my misunderstanding was, looking at the change makes sense: https://github.com/wikimedia/analytics-refinery/commit/c8de2abf5d70c6a93d737217bfc2595df5ae6f88#diff-eea685b5ab8d58b971f597afcb26e58cR117 [22:17:24] nuria: so there are two input-events [22:17:28] one is data_input [22:17:32] and one is data_input_delay [22:17:56] so oozie will wait for the union of those two [22:18:01] milimetric: ya, if you see that i asked this same question to joal earlier and i guess i totally missed his answer [22:18:04] data_input is 0..23 [22:18:11] data_input_delay is 1..0(next) [22:18:49] so when you look at the coordinator instance in hue, you'll see 1..0(next) because most likely 0(today) is already there so you won't see it listed in the "missing" [22:19:02] so it gives a slightly misleading impression that computation is happening one hour staggered [22:19:37] but really computation is 
happening based on data_input, and data_input_delay is merely a blocker to force the job to wait some arbitrary amount [22:20:24] clever but slightly confusing if (like me) you're not thinking of the details when looking at the job in hue [22:21:44] milimetric, nuria - We are very much used to having oozie only rely on data being present because the job needs it (which makes sense) - But nothing requires input-events to match what the job consumes, and we take advantage of that in this case [22:22:10] go to sleep, don't make me come to France :) [22:35:19] I could actually not go to sleep only for the pleasure of having you in France milimetric - But you're right, it's late :) [22:35:27] See you tomorrow team [22:36:46] yeah, it would've been a win/win for me too :) [22:40:43] (03PS1) 10Milimetric: Add better usage commands [analytics/refinery] - 10https://gerrit.wikimedia.org/r/554647 [22:41:30] (03CR) 10Milimetric: [V: 03+2 C: 03+2] "also a bump to the correct version for the wikidata spark jar" [analytics/refinery] - 10https://gerrit.wikimedia.org/r/554647 (owner: 10Milimetric) [22:44:14] (03PS2) 10Milimetric: Fix spark jar version, add better usage commands [analytics/refinery] - 10https://gerrit.wikimedia.org/r/554647 [22:44:24] (03CR) 10Milimetric: [V: 03+2 C: 03+2] Fix spark jar version, add better usage commands [analytics/refinery] - 10https://gerrit.wikimedia.org/r/554647 (owner: 10Milimetric) [22:56:25] (03CR) 10Nuria: "Nice, thanks for doing these" [analytics/refinery] - 10https://gerrit.wikimedia.org/r/554647 (owner: 10Milimetric) [23:41:52] (03CR) 10Nuria: "Changes look good but please add a ticket number" [analytics/refinery] - 10https://gerrit.wikimedia.org/r/554255 (owner: 10Joal)
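To make the data_input / data_input_delay trick from the discussion above concrete, here is a toy model in Python. It mimics only the described behavior (oozie waiting for the union of hours 0..23 and 1..0(next)); it is not actual oozie semantics, and the function name is an invention for this sketch.

```python
from datetime import datetime, timedelta

def required_hours(day_start: datetime):
    """Toy model: which hourly datasets must exist before a daily
    cassandra load may start, per the data_input / data_input_delay trick."""
    # data_input: hours 0..23 of the day -- the data the job actually reads.
    data_input = [day_start + timedelta(hours=h) for h in range(24)]
    # data_input_delay: hours 1..24 (i.e. 1..0 of the next day) -- never
    # read, only there to hold the job back by one extra hour.
    data_input_delay = [day_start + timedelta(hours=h) for h in range(1, 25)]
    # oozie starts the job once the union of both input-events is available.
    return data_input, sorted(set(data_input) | set(data_input_delay))

day = datetime(2019, 12, 4)
reads, waits_for = required_hours(day)
print(len(reads), len(waits_for))  # 24 25
print(waits_for[-1])               # 2019-12-05 00:00:00
```

This also explains the hue confusion mentioned above: hue lists 1..0(next) as missing because hour 0 of the current day is usually already present, even though the query itself still reads exactly hours 0..23.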