[00:18:12] Quarry: Create a beta host - https://phabricator.wikimedia.org/T209119 (zhuyifei1999) I think the current problem is that the results are stored on NFS, and you can't possibly "namespace" NFS with beta and production.
[00:36:25] HaeB: see support for HSTS: https://caniuse.com/#feat=stricttransportsecurity
[00:38:28] yeah, it's my understanding that HSTS is for instructing browsers to never (again) request the HTTP version. but even if the browser requests HTTP, we always redirect them to HTTPS (i.e. we never serve content via HTTP)
[00:39:04] and would that 301 redirect be logged in the referer?
[00:42:46] HaeB: we do not but browser might request page under http
[00:46:58] HaeB: "we do not serve http" that is
[00:48:12] yes, we went through this in detail earlier this year at https://phabricator.wikimedia.org/T188807
[00:51:44] HaeB: right, so http referrers will keep on existing until HSTS is supported everywhere
[00:52:40] HaeB: rather, http referrers will exist until everything is https
[00:52:53] HaeB: hopefully that makes sense
[01:10:16] nuria: wait, so are we certain that a referrer is logged for HTTP --> HTTPS 301 redirects?
[01:12:00] that would be useful to know, because these are counted as internal referrals (i assume), and we normally always operate under the assumption that all internal referrals are caused by users clicking on internal links
[01:13:12] in particular, it would also mean that there are pageviews counted as internally referred that have no corresponding source pageview (because the initial HTTP request is not classified as a pageview in refinery)
[03:08:32] my sqoop finished a few hours ago, but I was having dinner. I'll test my query that generates the logging table with comments, and run it if it works well
[03:08:41] I'll leave comments here either way
[04:09:21] (PS5) Milimetric: [HOTFIX] [do not merge] add logging_with_comment [analytics/refinery] - https://gerrit.wikimedia.org/r/472508
[04:09:35] ok, latest status in commit message of this patch ^
[04:10:14] basically, milimetric.mediawiki_logging looks good to go, but it's missing ajtwiki so I figured I'd wait for the team tomorrow to decide what to do.
[04:30:15] Analytics: Print schema is whitelisting both session ids and page ids - https://phabricator.wikimedia.org/T209050 (Tbayer) a: fdans>Tbayer I'll look into this next week with @ovasileva.
[06:28:15] There are some R packages I want to use that are incompatible with the version of R I have on notebook1004
[06:28:28] the version on notebook1004 is 3.3.3
[06:28:49] which is fairly old. I would hope for version 3.5.
[06:56:44] hello :)
[07:12:02] groceryheist: hi! So in theory we use the version of R that is shipped by stretch
[07:12:24] but IIRC other people experimented with solutions with R packages
[07:14:00] ah
[07:14:10] well I worked around my recent issue
[07:14:19] ah nice, can I ask how?
[07:14:19] but R is moving pretty fast these days :)
[07:14:40] https://stackoverflow.com/questions/51982174/using-r-package-effects-in-r-version-3-4-4
[07:15:20] basically I used the package in the answer to find an old version of the packages I wanted
[07:15:44] ah okok :)
[07:59:50] Morning elukey
[08:00:52] o/
[08:01:20] elukey: I'd need some help if possible :)
[08:02:23] elukey: batcave?
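For the HTTP-to-HTTPS redirect discussion above (00:38-01:13), a quick way to confirm the behaviour being described is to request a page over plain HTTP and inspect the response without following redirects. This is only an illustrative Python sketch (the URL is an example), not part of any refinery code:

    import requests

    # Wikimedia answers plain-HTTP requests with a 301 to the HTTPS URL
    # rather than serving content over HTTP (example URL, illustration only).
    resp = requests.get("http://en.wikipedia.org/wiki/Main_Page", allow_redirects=False)
    print(resp.status_code)                # expected: 301
    print(resp.headers.get("Location"))    # expected: the https:// version of the same URL

Whether a browser then re-sends the Referer header on the follow-up HTTPS request is up to the browser, which is the open question in the conversation.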
[08:03:31] joal: sure
[09:34:08] elukey: hive job finished, checking data correctness before moving and relaunching oozie
[09:34:15] ack
[09:46:16] elukey: data correctness confirmed: from snapshot 2018-09 to 2018-10--corrected, number of logging rows grows on wikis, and number of rows with comment vs without comment is also correct (a bit more in both, very similar)
[09:46:24] elukey: asking permission to move data :)
[09:47:48] elukey: hdfs dfs -mv /user/milimetric/wmf/data/raw/mediawiki/tables/logging/snapshot=2018-10 /wmf/data/raw/mediawiki/tables/logging/snapshot=2018-10
[09:48:26] +2
[09:48:43] let's also check permissions after the move
[09:50:27] elukey: currently updating ownership, will check permissions
[09:52:46] elukey: permissions look good
[09:52:54] elukey: restarting oozie jobs?
[09:53:41] joal: sure!
[09:53:42] elukey: also, I will manually add a _SUCCESS file in /wmf/data/raw/mediawiki/tables/change_tag/snapshot=2018-10 so that the mediawiki_load job gets unstuck
[09:53:48] yep
[09:54:39] ok, starting with adding file
[09:57:34] And restarting oozie jobs
[09:59:34] Jobs restart confirmed !
[10:00:30] Now we can monitor the reconstruction here: https://yarn.wikimedia.org/proxy/application_1540803787856_40187/
[10:01:06] hum - there is an issue :(
[10:04:40] spark seems to have a problem reading the avro files generated by hive :(
[10:05:24] indeed
[10:05:39] No Avro files found. Hadoop option "avro.mapred.ignore.inputs.without.extension" is set to true. Do all input files have ".avro" extension
[10:06:39] yeah I was reading that as well
[10:07:50] should /wmf/data/raw/mediawiki/tables/change_tag/snapshot=2018-10 contain only the flag files?
[10:08:31] joal: --^
[10:08:38] yes elukey
[10:08:58] We have not sqooped that data but the load job was still expecting it to be launched
[10:09:30] elukey: I'm gonna patch the mediawiki-history oozie job and launch it from my folder for this run to work
[10:09:42] okok
[10:15:16] elukey: still failing :(
[10:15:34] elukey: need to drop - Will start again when back
[10:16:35] ack
[10:33:35] Quarry: Provide a more intuitive way to design DB queries, as Quarry is not ideal for complex ones... - https://phabricator.wikimedia.org/T208839 (ShakespeareFan00) The provision of some basic notes on SQL syntax, linked from the Quarry pages, might also be helpful. There is a Wikibook (in English) here ht...
[12:05:10] elukey: \o/! mediawiki-history job successfully restarted with the parameter prefix
[12:05:38] Thing to know for the future: to pass hadoop parameters to spark, prefix them with spark.hadoop
[12:05:57] For instance: --conf spark.hadoop.avro.mapred.ignore.inputs.without.extension=false
[12:06:22] job tracking: https://yarn.wikimedia.org/proxy/application_1540803787856_40425/
[12:06:47] nice!
[12:23:39] joal: everything ok?
[12:23:50] Hi milimetric - so far so good
[12:23:59] milimetric: some issues on the way, but things are moving
[12:24:01] you're using the thing I joined last night?
[12:24:12] milimetric: your join failed I think
[12:24:23] milimetric: I had to rejoin, but the raw data was here
[12:24:38] I also manually sqooped atjwiki ;)
[12:24:46] oh! nice
[12:24:51] I was too tired last night to do that
[12:25:38] ok, great, I have a doctor's appointment for baby, but I can stay home if you think you need me
[12:26:10] milimetric: thanks mate :)
[12:26:18] Babysitting mediawiki-history
[12:27:57] ok, lemme know if you need anything like brainbouncing
[12:32:09] \o/ ! No errors in generated user data :)
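A minimal sketch of the tip above about passing hadoop parameters to Spark by prefixing them with spark.hadoop. The app name is made up, and this is just the PySpark equivalent of the --conf flag quoted at 12:05:57, not the actual mediawiki-history job configuration:

    from pyspark.sql import SparkSession

    # Equivalent to:
    #   --conf spark.hadoop.avro.mapred.ignore.inputs.without.extension=false
    # Any key prefixed with "spark.hadoop." is copied into the job's Hadoop
    # Configuration, so the avro reader stops ignoring part files that lack an
    # ".avro" extension.
    spark = (
        SparkSession.builder
        .appName("example-avro-reading-job")   # hypothetical app name
        .config("spark.hadoop.avro.mapred.ignore.inputs.without.extension", "false")
        .getOrCreate()
    )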
[12:32:19] ok, that's better :)
[12:41:59] :) ok, cool, gonna go to doctor with baby then, but I can still jump on a call if needed
[13:02:35] * elukey lunch!
[14:03:48] team: mediawiki-history job failed
[14:03:56] I'm investigating
[14:47:46] Ok I think I have a working version (so far)
[14:54:41] Analytics, Readers-Web-Backlog: Print schema is whitelisting both session ids and page ids - https://phabricator.wikimedia.org/T209050 (ovasileva)
[14:59:46] heya team
[14:59:53] o/
[15:53:53] joal: wanna talk about what went wrong?
[15:54:13] just got back from the doctor
[15:55:09] Heya milimetric - I can explain :)
[15:58:00] milimetric: there is a type mismatch between our avro definition for hive tables and the sqooped data
[15:58:33] oh no, I'm sorry :(
[15:58:43] I used the same create table definition for that reason, how'd it sneak by?
[15:58:58] milimetric: The mismatch is on the 'right' side when data is read (int --> long), but since we used the schema to generate the data, we generated long instead of int :)
[16:00:17] not sure I follow, but that's ok, we can talk more later
[16:00:27] I'm gonna think about how we can do this long term for next month
[16:00:55] milimetric: That's a good idea - The sqooping from labs is really not a possible solution (you were absolutely right, far too long)
[16:01:11] yeah, I had a bad feeling, was hoping I was wrong :)
[16:01:28] Nope you were right
[16:06:45] milimetric: thanks for putting some thoughts onto that
[16:41:26] nuria, re differential privacy, I think in this case it could be effective even with small data, because for the data set to be disclosing (with 100% certainty) we need to exactly match the number of editors per wiki/country/month with the number of editors in the wiki databases, thus if the number varies just 1 unit, we are introducing quite an uncertainty factor...
[16:45:38] also, if we decide to go for partial publication, one criterion that we can follow is not publicizing countries that are smaller than X sqKM
[16:45:51] or any size measure
[16:46:39] mforns: mmmm if you have 10-20 editors in spain of bambaran wikipedia, and 10 on mali *i think* a differential privacy guarantee will tell you that if you remove 1 editor record you will still be able to conclude that there are 10-20 editors in spain of bambaran wikipedia and 10 on mali. if you match this with the number of active editors for bambaran and there are 15 I know all those are located in either
[16:46:40] spain or mali so not sure (but i might be missing something) how the guarantee will help here with such small data
[16:48:29] I thought differential privacy would give you concrete (not necessarily accurate) numbers, rather than intervals
[16:50:16] we can discuss this after standup maybe
[16:53:18] joal: about the puppet change - do you want me to merge it now?
[16:53:28] (but then tomorrow it might fail)
[16:53:40] (not a big deal of course, just saying :)
[16:54:20] Analytics, Product-Analytics, Reading-analysis: [EventLogging Sanitization] Update EL sanitization whit-elist for field renames in EL schemas - https://phabricator.wikimedia.org/T209087 (Tbayer) a: Tbayer>None @mforns: I assume "you" in the task description refers to me (since you assigned the...
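As a rough illustration of the differential-privacy idea mforns and nuria are debating above: the usual construction (the Laplace mechanism) publishes a concrete but noised count rather than an interval, so matching the published numbers against the wiki databases no longer identifies editor locations with certainty. The counts, keys and epsilon below are invented for the sketch and say nothing about how the dataset would actually be released:

    import numpy as np

    # Hypothetical per-wiki/country/month editor counts (made-up numbers).
    true_counts = {
        ("bmwiki", "ES", "2018-10"): 14,
        ("bmwiki", "ML", "2018-10"): 10,
    }

    epsilon = 1.0      # privacy budget: smaller epsilon means more noise
    sensitivity = 1    # adding or removing one editor changes any count by at most 1

    rng = np.random.default_rng(42)
    noisy_counts = {
        key: max(0, int(round(n + rng.laplace(0, sensitivity / epsilon))))
        for key, n in true_counts.items()
    }
    print(noisy_counts)

Because the noise scale is comparable to one editor, a published value for a tiny wiki can no longer be matched exactly against the known number of active editors, which is the uncertainty mforns describes; whether the noised series still tracks real trends (milimetric's question further down) depends on how small epsilon has to be.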
[16:55:01] Analytics, Product-Analytics, Reading-analysis: [EventLogging Sanitization] Update EL sanitization whit-elist for field renames in EL schemas - https://phabricator.wikimedia.org/T209087 (Tbayer) PS: and (in the name of the team) thanks for catching this!
[16:59:44] mforns: yeah, I think the approach is great, because it's the relationship between editor populations that matters in the use of this data
[16:59:52] Analytics, Product-Analytics, Reading-analysis: [EventLogging Sanitization] Update EL sanitization whit-elist for field renames in EL schemas - https://phabricator.wikimedia.org/T209087 (mforns) @Tbayer I should have looked who that schema's maintainer was, sorry for that. I intended it just as a he...
[17:00:28] but it's also important to keep the numbers tracking the same patterns as the real numbers, does differential privacy allow for that? I should read up more
[17:04:14] elukey: let's wait tuesday - monday we do the upgrade, and tuesday we merge and monitor that
[17:28:56] Analytics, DBA, Data-Services: Not able to scoop coment table in labs for mediawiki reconstruction process. - https://phabricator.wikimedia.org/T209031 (Nuria)
[17:43:23] Analytics, DBA, Data-Services: Not able to scoop coment table in labs for mediawiki reconstruction process. - https://phabricator.wikimedia.org/T209031 (Nuria)
[18:10:59] Analytics, Anti-Harassment (AHT Sprint 33): 👩‍👧 Track how often blocked user attempt to edit - https://phabricator.wikimedia.org/T189724 (Nuria)
[18:33:39] (PS1) Ladsgroup: ClickstreamBuilder: Decode refferer url to utf-8 [analytics/refinery/source] - https://gerrit.wikimedia.org/r/472700 (https://phabricator.wikimedia.org/T191964)
[18:35:27] Analytics, Analytics-Kanban, Patch-For-Review: Clickstream dataset for Persian Wikipedia only includes external values - https://phabricator.wikimedia.org/T191964 (Ladsgroup) ^ I made this patch but I basically had no way to test things. Please double check and if possible run it for a short period o...
[18:37:49] (CR) Nuria: "Thanks for doing this, let's run this code before calling fix good." (1 comment) [analytics/refinery/source] - https://gerrit.wikimedia.org/r/472700 (https://phabricator.wikimedia.org/T191964) (owner: Ladsgroup)
[18:38:17] (CR) jerkins-bot: [V: -1] ClickstreamBuilder: Decode refferer url to utf-8 [analytics/refinery/source] - https://gerrit.wikimedia.org/r/472700 (https://phabricator.wikimedia.org/T191964) (owner: Ladsgroup)
[18:46:57] (PS2) Ladsgroup: ClickstreamBuilder: Decode refferer url to utf-8 [analytics/refinery/source] - https://gerrit.wikimedia.org/r/472700 (https://phabricator.wikimedia.org/T191964)
[18:47:01] Analytics, Product-Analytics, Reading-analysis: [EventLogging Sanitization] Update EL sanitization whit-elist for field renames in EL schemas - https://phabricator.wikimedia.org/T209087 (Nuria) Reassiging to @bearloga who is working with android team.
[18:47:24] Analytics, Product-Analytics, Reading-analysis: [EventLogging Sanitization] Update EL sanitization whit-elist for field renames in EL schemas - https://phabricator.wikimedia.org/T209087 (Nuria) a: mpopov
[18:51:01] * elukey off!
[18:51:35] milimetric: once the last change gets deployed for EL module are there any issues pending or we are done with refactor of schemas versus el code?
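For context on the ClickstreamBuilder patch above (T191964): external referrers pointing at Persian Wikipedia arrive percent-encoded, so the article title has to be decoded back to UTF-8 before it can be matched against internal page titles. The actual change lives in the Scala refinery-source code; the lines below are only a hedged Python illustration of the decoding step, with a made-up referrer URL:

    from urllib.parse import unquote, urlparse

    referer = "https://fa.wikipedia.org/wiki/%D8%A7%DB%8C%D8%B1%D8%A7%D9%86"  # example referrer
    path = urlparse(referer).path            # "/wiki/%D8%A7%DB%8C%D8%B1%D8%A7%D9%86"
    title = unquote(path[len("/wiki/"):])    # percent-decoding yields UTF-8 text
    print(title)                             # "ایران"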
[18:52:15] Analytics, Community-Tech, Event Tools, Grant-Metrics: Review category queries - https://phabricator.wikimedia.org/T206783 (jmatazzoni)
[18:53:09] (CR) Nuria: [C: 2] "Looks good, think idea of checksum is a good one. I cannot think of more paths to add to "undeletable" databases but let's run it by team" [analytics/refinery] - https://gerrit.wikimedia.org/r/471279 (https://phabricator.wikimedia.org/T199836) (owner: Mforns)
[18:53:55] nuria: ideally all the clients would switch to use the subscriber module
[18:54:37] and if that happens, we could at a point in the future disable the ResourceLoader modules completely (right now we publish one module per schema)
[18:54:58] so, this last change emptied out those modules, so they take up almost no size anymore
[18:55:15] instead of having the schemas themselves, they just contain a couple of pieces of info
[19:08:28] (CR) Mforns: "Thanks! Let's not merge yet though please, I'm adding unit tests." [analytics/refinery] - https://gerrit.wikimedia.org/r/471279 (https://phabricator.wikimedia.org/T199836) (owner: Mforns)
[19:31:41] Analytics, DBA, Data-Services: Not able to scoop comment table in labs for mediawiki reconstruction process - https://phabricator.wikimedia.org/T209031 (Reedy)
[19:47:40] Analytics, DBA, Data-Services: Not able to scoop comment table in labs for mediawiki reconstruction process - https://phabricator.wikimedia.org/T209031 (Milimetric) @Krenair: there were a handful of discussions, but the gist of the reasoning is that we want to pull data *after* it's sanitized. If we...
[20:08:31] Analytics, DBA, Data-Services: Not able to scoop comment table in labs for mediawiki reconstruction process - https://phabricator.wikimedia.org/T209031 (Milimetric) @Marostegui and @jcrespo, I want to touch base with one or both of you on this, it turns out to be a tricky problem from an analytics po...
[20:09:21] nuria: just wrote the comment, couldn't find any other fancy solution, ultimately comment sanitizing needs to happen on each and every row in those tables, so the best we can do I think is a view with just logging and revision comments, and sqooping that whole thing.
[20:09:33] I'll make a task with that as the proposed solution
[20:13:14] Analytics: Long term solution for sqooping comments - https://phabricator.wikimedia.org/T209178 (Milimetric)
[20:19:59] Analytics: Update log_namespace, page_namespace from bigint to int - https://phabricator.wikimedia.org/T209179 (Milimetric)
[20:23:18] (PS1) Milimetric: Add ajtwiki to the prod sqoop list [analytics/refinery] - https://gerrit.wikimedia.org/r/472717
[20:23:24] (CR) Milimetric: [V: 2 C: 2] Add ajtwiki to the prod sqoop list [analytics/refinery] - https://gerrit.wikimedia.org/r/472717 (owner: Milimetric)
[20:54:31] Analytics, DBA, Data-Services: Not able to scoop comment table in labs for mediawiki reconstruction process - https://phabricator.wikimedia.org/T209031 (Nuria) pinging @JAllemandou and @elukey Let's please follow up with dbas early next week on this ticket.
[21:06:41] Analytics, EventBus, Core Platform Team (Security, stability, performance and scalability (TEC1)), Core Platform Team Backlog (Later), Services (later): revision-create events are sometimes emitted in a secondary DC - https://phabricator.wikimedia.org/T207994 (Smalyshev) So, any new data abou...
[21:57:18] what's the delay for data to show up in wmf.webrequest?
[22:22:48] SMalyshev: couple hours
[22:22:59] SMalyshev: anything i can help you with?
[22:36:29] nuria: I'm just trying to figure out some things related to wdqs... not sure whether there's a way to see requests as they come in
[22:37:26] HaeB: regarding referrers, you can see http referers and 301s here: https://bit.ly/2RLddvT, as you can see amounts are tiny (tens of thousands of requests)
[22:38:08] SMalyshev: you can consume from kafka directly but for webrequest it's probably not the best idea given volume
[22:38:27] SMalyshev: also
[22:38:37] nuria: yeah I need a very specific tiny subset so getting it from the kafka firehose probably won't work
[22:42:41] HaeB: whether referrers are carried on 301s is browser-dependent but there are no internal requests we are missing due to 301s cause our intra links are not under 'http'
[22:43:29] SMalyshev: there is also: https://turnilo.wikimedia.org/#webrequest_sampled_128
[22:43:58] nuria: yeah I know but this one may be too coarse... I don't get much data there
[22:44:10] SMalyshev: so, at this time, we are about 1:30 mins behind from realtime in webrequest
[22:44:26] nuria: ok, thanks
[23:19:05] nuria: I've gotta run for dinner with the research folks, but the job is done successfully
[23:19:06] yay
[23:19:11] now I guess it's the verify and load
[23:19:17] milimetric: all right!
[23:19:23] milimetric: say hi
[23:19:27] :) will do
[23:49:45] Analytics, Analytics-Kanban, Patch-For-Review: Clickstream dataset for Persian Wikipedia only includes external values - https://phabricator.wikimedia.org/T191964 (Nuria) Tested job but failed, looking: https://yarn.wikimedia.org/cluster/app/application_1540803787856_42410
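Related to the wdqs/webrequest exchange above: a hedged Spark SQL sketch of pulling a recent hour of query.wikidata.org traffic out of wmf.webrequest. The partition values, the webrequest_source value and the uri_path filter are illustrative and would need adjusting, and per nuria the table only catches up to real time with roughly an hour and a half of delay:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("wdqs-webrequest-peek").getOrCreate()  # hypothetical app name

    # Illustrative only: pick the latest fully-refined hour partition; the cache
    # cluster serving query.wikidata.org determines the right webrequest_source.
    recent = spark.sql("""
        SELECT dt, uri_host, uri_path, uri_query, http_status
        FROM wmf.webrequest
        WHERE webrequest_source = 'text'
          AND year = 2018 AND month = 11 AND day = 9 AND hour = 21
          AND uri_host = 'query.wikidata.org'
          AND uri_path LIKE '/sparql%'
        LIMIT 100
    """)
    recent.show(truncate=False)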