[00:09:31] (03PS1) 10Milimetric: Add atj.wikipedia to whitelist [analytics/refinery] - 10https://gerrit.wikimedia.org/r/359081 [00:09:49] (03CR) 10Milimetric: [V: 032 C: 032] Add atj.wikipedia to whitelist [analytics/refinery] - 10https://gerrit.wikimedia.org/r/359081 (owner: 10Milimetric) [00:47:57] 10Analytics, 10Discovery, 10Wikidata, 10Wikidata-Query-Service: Data request for logs from SparQL interface at query.wikidata.org - https://phabricator.wikimedia.org/T143819#3350439 (10AndrewSu) > We could, however (with some work) capture usage of certain property, or item, or property-item combination, i... [01:02:39] 10Analytics, 10Analytics-EventLogging, 10Contributors-Analysis, 10EventBus, and 5 others: Record an EventLogging event every time a new content namespace page is created - https://phabricator.wikimedia.org/T150369#3350469 (10kaldari) @Nuria: I have created https://meta.wikimedia.org/wiki/Research:Wikipedia... [01:14:33] 10Analytics, 10Analytics-EventLogging, 10Contributors-Analysis, 10EventBus, and 5 others: Record an EventLogging event every time a new content namespace page is created - https://phabricator.wikimedia.org/T150369#3350481 (10Ottomata) Page creation is just revision create with rev_parent_id = 0, no? [02:17:52] 10Analytics, 10Analytics-EventLogging, 10Contributors-Analysis, 10EventBus, and 5 others: Record an EventLogging event every time a new content namespace page is created - https://phabricator.wikimedia.org/T150369#3350519 (10kaldari) @Ottomata: Nevermind, I see we can use revision-create where rev_parent_i... [04:47:46] (03CR) 10Nuria: "Alaready done in https://gerrit.wikimedia.org/r/#/c/359081/. Thanks!" [analytics/refinery] - 10https://gerrit.wikimedia.org/r/359062 (https://phabricator.wikimedia.org/T167720) (owner: 10Reedy) [04:51:26] 10Analytics, 10Discovery, 10Wikidata, 10Wikidata-Query-Service: Data request for logs from SparQL interface at query.wikidata.org - https://phabricator.wikimedia.org/T143819#3350566 (10Nuria) >To incentivize them to contribute, we have to give them even better metrics of community usage/impact that they ca... [04:55:26] 10Analytics, 10Discovery, 10Wikidata, 10Wikidata-Query-Service: Data request for logs from SparQL interface at query.wikidata.org - https://phabricator.wikimedia.org/T143819#3350567 (10AndrewSu) >>! In T143819#3350566, @Nuria wrote: >>To incentivize them to contribute, we have to give them even better metr... [10:36:19] 10Analytics, 10Operations, 10Traffic: Increase request limits for GETs to /api/rest_v1/ - https://phabricator.wikimedia.org/T118365#3350950 (10ema) >>! In T118365#3349563, @Nuria wrote: > mmm... looking at pageview API dashboard I can see some of lawful traffic (spikes we could have handled) seems to have b... [10:48:13] 10Analytics, 10Operations, 10Ops-Access-Requests: analytics-privatedata-users access for ema - https://phabricator.wikimedia.org/T167952#3350991 (10ema) [10:48:21] 10Analytics, 10Operations, 10Ops-Access-Requests: analytics-privatedata-users access for ema - https://phabricator.wikimedia.org/T167952#3351003 (10ema) p:05Triage>03Normal [12:30:32] 10Analytics, 10Operations, 10Traffic: Increase request limits for GETs to /api/rest_v1/ - https://phabricator.wikimedia.org/T118365#3351181 (10BBlack) That top client appears to be CrossRefEventDataBot from https://www.crossref.org/services/event-data/ , running on a hosted server at Hetzner in DE. 
[12:35:27] 10Analytics, 10Operations, 10Traffic: Increase request limits for GETs to /api/rest_v1/ - https://phabricator.wikimedia.org/T118365#3351184 (10ema) >>! In T118365#3351181, @BBlack wrote: > That top client appears to be CrossRefEventDataBot from https://www.crossref.org/services/event-data/ , running on a hos... [13:13:48] everyone's so quiet today! [13:20:14] Question from WMDE: we've used Pivot to track the TWL 2017 Campaign banner impressions, and were able to get the data from May 29 to June 01 only. We wonder whether the data were ever complete; in the meantime, the data for this banner are not available from Pivot anymore. Can anyone advise on this? Thank you very much. [13:37:07] 10Analytics, 10Analytics-Wikistats: Remove links in wikistats to minnan.wikipedia.org - https://phabricator.wikimedia.org/T107250#3351353 (10Liuxinyu970226) [13:37:38] fdans: I'm still working on the breakdown, I think crossfilter is cool but I wonder if it wouldn't be easier to just work with a plain recordset, it doesn't really get us much yet. [13:38:38] but I think we should stick with it for now, it's easily replaced [13:38:53] and maybe I'm wrong, maybe we'll need all the weird filter intersections and caching stuff it does [13:39:20] milimetric: that last thought is the one I keep whenever I'm doing these crazy reducers [13:39:49] yeah, you kind of have to box with the way it does .group on dimensions [13:39:57] I don't really think it makes sense... [13:40:10] like... "there's surely a layer of complexity here that I'm not touching and it probably makes all this worth it" [13:40:40] actually... I'm a little lost... something's not right [13:40:42] wanna hangout? [13:40:56] I'm picking up something I was working on last night and it doesn't make sense [13:41:06] sure, i'm on my way [13:47:01] (hi!) [13:47:29] mforns: yt? [14:38:37] ok fdans, I pushed [14:38:47] take a look and let's talk before standup if you want [14:38:52] ottomata, back from lunch [14:38:59] nice, looking milimetric [14:40:41] mforns: so, in order to do purging, your code relies on a top level timestamp field, correct? [14:40:51] does it rely on this being in a particular format in mysql? [14:41:07] ottomata, it assumes it's mediawiki format [14:41:13] and also it has an index [14:41:41] without index, it will be a lot slower [14:42:11] ok. grr. it just sucks that jrm.py is mw-el-analytics specific [14:42:21] ottomata, aha [14:42:25] milimetric: batcave! [14:42:38] mforns: , what about dateutil.parser.parse [14:42:39] ? [14:42:44] if you used that, would it have to be mw format? [14:42:47] i think parse could figure it out... [14:43:08] ottomata, aha, yea the code can be altered to support different formats I guess [14:43:10] hmm, you know, sigh, this data is just the mysql data, [14:43:19] i could just add a timestamp [14:43:20] hmm [14:43:29] if it doesn't exist [14:43:36] ottomata, doesn't it have a timestamp? [14:43:38] i was thinking about setting it to the value of meta.dt [14:43:44] omw fdans [14:43:46] it does, but its not called 'timestamp' [14:43:56] it is meta_dt, and in 8601 format [14:43:56] how is it called? [14:44:00] I see [14:44:15] ottomata, well, it could be a parameter of the script [14:44:16] eventbus schemas don't have the top levle capsule, they have the subobject meta schema [14:44:34] I see [14:44:46] mforns: maybe it could be a list of parameters to look for timestamps, in order [14:44:53] if timestamp, use that, if meta_dt, use that, etc. [14:44:54] ? 
[14:44:59] we could do like: --timestamp-fields=timestamp,meta_dt [14:45:07] yea, and then use them in that order? [14:45:12] if they exist? [14:45:16] yes, for example [14:45:23] meta_dt doesn't have an index though...i could fix that [14:45:28] aha [14:45:48] but, to do it generically have to add indexexes to all date-time fields in a schema [14:46:06] i guess i could add config too [14:46:15] --index-fields=timestamp,meta_dt [14:46:15] ottomata, how are these tables? are they big? [14:46:15] :/ [14:46:27] mforns: i think the biggest is revision-create, it'll be as big as edit i guess [14:46:31] about 20 events / sec [14:46:47] aha [14:47:47] ottomata, I didn't get the --index-fields=timestamp,meta_dt ? [14:48:07] thank you ottomata! [14:48:39] ema :) [14:48:51] you can wait 30 minutes, ooorrr run puppet on analytics1001 and stat1004 and/or stat1002 :) [14:49:04] mforns: that would be foir the mysql consumer [14:49:05] * ema can't wait and runs puppet [14:49:21] to tell it which fields it should add indexes on when it creates tables, if those fields exist in the schema [14:49:41] ottomata, I see [14:49:44] right now, timestamp gets an index because its jsonschema format is utc-millisec [14:49:52] aha [14:49:53] and that gets mapped to {'type_': MediaWikiTimestamp, 'index': True} [14:50:13] i could do the same for date-time formats [14:50:34] but then all date-time fields (there are 2 or 3) would ahve indexes [14:50:37] ottomata, would this be a change that blocks you, or could you wait until start of next quarter? [14:50:48] I se [14:50:50] see [14:51:09] mforns: i'm doing this work on the side for https://phabricator.wikimedia.org/T150369 [14:51:24] soooo, i think it shouldn't hold block you from proceeding with purging [14:51:34] i'm just trying to make it easy for kaldari to get to some eventbus data in mysql [14:51:51] mmh, spark-shell still doesn't work properly even though I'm now a member of analytics-privatedata-users [14:51:59] spark-shell --master yarn --executor-memory 4G --driver-memory 4G --executor-cores 1 [14:52:03] [...] [14:52:08] :16: error: not found: value sqlContext import sqlContext.implicits._ [14:52:59] ? ema where are you running that? [14:53:03] stat1004 [14:53:37] ottomata, what I've been doing since yesterday is to change the EL purging script to not use uuids, but I had to do major changes, and still need to rewrite the tests [14:53:44] ema did you run puppet on analytics1001? [14:53:48] too? [14:53:54] ottomata: nope, doing that now [14:53:56] k [14:54:11] aye [14:54:24] sigh, this purging things sucks :/ [14:54:56] on the other hand, the version Luca wrote works fine, and I think the concern of the dbas (limit offset) will not have a bit impact [14:55:16] we can talk is PS [14:55:50] ottomata: still no luck [14:55:50] :16: error: not found: value sqlContext [14:55:54] import sqlContext.sql [14:58:41] looking.. [14:59:05] you don't have an hdfs home dir, but puppet on analytics1001 should have created it [15:00:00] you did not run puppet on analytics1001! [15:00:04] Notice: /Stage[main]/Cdh::Hadoop::Users/Exec[create_hdfs_user_directories]/returns: 2017-06-15T14:59:33 hdfs dfs -mkdir /user/ema && hdfs dfs -chown ema:ema /user/ema [15:00:04] haha [15:00:08] ema try again! :) [15:00:32] ottomata: I did! [15:00:57] hmm, i guess you do the proper offering dance before you did [15:02:03] ottomata: uhuh, it works! 
thank you :) [15:02:29] 10Analytics, 10Operations, 10Ops-Access-Requests, 10Patch-For-Review: analytics-privatedata-users access for ema - https://phabricator.wikimedia.org/T167952#3351654 (10ema) 05Open>03Resolved a:03ema Done! [15:03:02] gr8 :) [15:04:02] ema: have fun ;) [15:07:55] joal: after this morning's workshop I'm gonna be super proficient! [15:08:00] :D [15:13:13] milimetric: https://wikitech.wikimedia.org/wiki/Analytics/Systems/EventLogging/Schema_Guidelines [15:13:20] • Time should always be stored in a field called timestamp, in ISO [15:13:26] why 'timestamp' in ISO 8601? [15:13:34] our convention is to refer to 8601 fields as dt [15:14:14] 10Analytics, 10Analytics-EventLogging, 10Contributors-Analysis, 10EventBus, and 5 others: Record an EventLogging event every time a new content namespace page is created - https://phabricator.wikimedia.org/T150369#3351679 (10Milimetric) Just FYI, there are a good amount of rev_parent_id = 0 that do not rep... [15:15:59] ottomata: that's fine, but I was thinking if we remove the capsule, should we allow people to use "timestamp"? [15:16:06] I can change it to dt [15:17:40] done [15:18:45] ping ottomata can you come back to standup? [15:44:42] 10Analytics, 10Operations, 10Traffic: Increase request limits for GETs to /api/rest_v1/ - https://phabricator.wikimedia.org/T118365#3351748 (10Nuria) Thanks for the prompt response, when the number of changes I did not see when these took effect, it is true that we do not see on our end 429s at all times, bu... [15:46:01] 10Analytics, 10Operations, 10Traffic: Increase request limits for GETs to /api/rest_v1/ - https://phabricator.wikimedia.org/T118365#3351751 (10Nuria) If you look at 404s however, looks like the throttling had a positive effect on removing "garbaage-y" traffic. [15:46:37] ping fdans : groskinnn [15:46:48] omg sorry [15:50:49] 10Analytics-Kanban, 10Operations, 10User-Elukey: New analytic hosts with BBU learning cycle enabled - https://phabricator.wikimedia.org/T167809#3351771 (10Nuria) [15:51:00] 10Analytics-Kanban, 10Operations, 10User-Elukey: New analytic hosts with BBU learning cycle enabled - https://phabricator.wikimedia.org/T167809#3345083 (10Nuria) Puting on kanban for @elukey to look at [15:53:50] 10Analytics: Refactor puppet code for the Hadoop Analytics cluster to roles/profiles - https://phabricator.wikimedia.org/T167790#3351778 (10Nuria) p:05Normal>03Low [15:55:59] 10Analytics, 10Analytics-Wikistats: Remove links in wikistats to minnan.wikipedia.org - https://phabricator.wikimedia.org/T107250#3351788 (10Nuria) p:05Normal>03Low [15:58:44] 10Analytics, 10Analytics-Wikistats: Remove links in wikistats to minnan.wikipedia.org - https://phabricator.wikimedia.org/T107250#1490654 (10Nuria) let's wait until we do kafka upgrade. [15:58:59] 10Analytics: Send burrow lag statistics to statsd/graphite {hawk} - https://phabricator.wikimedia.org/T120852#1862859 (10Nuria) p:05Normal>03Low [15:59:30] 10Analytics-Kanban: Measure portal and hovercard pageviews - https://phabricator.wikimedia.org/T162618#3351819 (10Nuria) [16:05:42] 10Analytics, 10MediaWiki-API: Copy cached API requests from raw webrequests table to ApiAction - https://phabricator.wikimedia.org/T155478#3351862 (10Nuria) @tgr, this will benefit from changes happening on tagging of requests. We can tag requests that need to be "copied" easily and i think it will be trivil... 
[16:06:32] 10Analytics, 10MediaWiki-API: Copy cached API requests from raw webrequests table to ApiAction - https://phabricator.wikimedia.org/T155478#3351863 (10Nuria) We think this work can happen next quarter. [16:09:40] 10Analytics, 10MediaWiki-API, 10RESTBase-API, 10Services: Top API user agents stats - https://phabricator.wikimedia.org/T142139#3351902 (10Nuria) Ping @tgr what is the status of this? [16:11:36] 10Analytics, 10MediaWiki-API, 10RESTBase-API, 10Services: Top API user agents stats - https://phabricator.wikimedia.org/T142139#3351906 (10Nuria) [16:11:40] 10Analytics, 10MediaWiki-API: Copy cached API requests from raw webrequests table to ApiAction - https://phabricator.wikimedia.org/T155478#3351905 (10Nuria) [16:11:54] 10Analytics, 10MediaWiki-API: Copy cached API requests from raw webrequests table to ApiAction - https://phabricator.wikimedia.org/T155478#2944303 (10Nuria) Linking to task T142139 cause i think is realted, @tgr let us know otherwise [16:15:16] 10Analytics, 10Easy: Investigate requests flagged as pageview in analytics header coming from bots - https://phabricator.wikimedia.org/T135251#3351942 (10Nuria) This sounds like googlebot crawling the app and sending traffic as a user would, I do not see anything that we can do on our end to prevent that, fixe... [16:15:39] 10Analytics, 10Research-and-Data: [Open question] Improve bot identification at scale - https://phabricator.wikimedia.org/T138207#3351950 (10Nuria) [16:15:41] 10Analytics, 10Easy: Investigate requests flagged as pageview in analytics header coming from bots - https://phabricator.wikimedia.org/T135251#3351949 (10Nuria) [16:16:16] 10Analytics, 10Easy: Investigate requests flagged as pageview in analytics header coming from bots - https://phabricator.wikimedia.org/T135251#2293216 (10Nuria) Action item for analytics is to verify that indeed all this requests are coming from apps. [16:16:41] 10Analytics-Kanban, 10Easy: Investigate requests flagged as pageview in analytics header coming from bots - https://phabricator.wikimedia.org/T135251#3351952 (10Nuria) [16:18:19] 10Analytics: Quantify false positives when filtering for number of distinct user agents per page in top pages computation - https://phabricator.wikimedia.org/T146911#3351965 (10Nuria) [16:18:21] 10Analytics, 10Research-and-Data: [Open question] Improve bot identification at scale - https://phabricator.wikimedia.org/T138207#3351964 (10Nuria) [16:22:14] 10Analytics, 10Analytics-Wikistats, 10Wikimedia-Site-requests: Add li: Wikibooks to Wikistats - https://phabricator.wikimedia.org/T165634#3351993 (10Ooswesthoesbes) Alright, that is promising. [16:25:41] 10Analytics: Put data needed for edits metrics through Event Bus into HDFS - https://phabricator.wikimedia.org/T131782#2178434 (10Nuria) p:05Normal>03Low [16:27:41] 10Analytics: Meta-statistics on MediaWiki history reconstruction process - https://phabricator.wikimedia.org/T155507#3352011 (10Nuria) p:05Normal>03High [16:34:51] 10Analytics, 10Analytics-Cluster: Hadoop: Add a lower priority queue: nice queue - https://phabricator.wikimedia.org/T156841#3352022 (10Nuria) [16:36:00] 10Analytics-Cluster, 10Analytics-Kanban: Hadoop: Add a lower priority queue: nice queue - https://phabricator.wikimedia.org/T156841#2987445 (10Nuria) [16:36:39] 10Analytics-Cluster, 10Analytics-Kanban: Hadoop: Add a lower priority queue: nice queue - https://phabricator.wikimedia.org/T156841#2987445 (10Nuria) Looks like this is couple hours of work and its benefit is clear. 
[16:39:27] 10Analytics, 10Analytics-Cluster, 10Patch-For-Review: Kafka mirror maker failures when kafka brokers are restarted - https://phabricator.wikimedia.org/T157705#3013836 (10Nuria) As part of kafka upgrade mirrormaker will get a revamp [16:40:28] 10Analytics-Cluster, 10Analytics-Kanban, 10Patch-For-Review: Make oozie work with spark jobs that use HiveContext - https://phabricator.wikimedia.org/T94596#3352052 (10Ottomata) [16:40:30] 10Analytics: Unlock Spark with Oozie - https://phabricator.wikimedia.org/T159961#3352054 (10Ottomata) [16:43:05] a-team, hangouts not responding for me [16:43:21] 10Analytics, 10Analytics-EventLogging, 10Contributors-Analysis, 10EventBus, and 5 others: Record an EventLogging event every time a new content namespace page is created - https://phabricator.wikimedia.org/T150369#3352065 (10Ottomata) Are these just historical artifacts, or is it possible for newly created... [16:44:47] 10Analytics, 10Patch-For-Review: Sort inconsistency in AQS timestamp behavior - https://phabricator.wikimedia.org/T160311#3352066 (10Nuria) Do we need to version api for such a change? (it will be a breaking change) [16:45:49] 10Analytics: Serbian Wikipedia edits spike 2016 - https://phabricator.wikimedia.org/T158310#3352069 (10Nuria) 05Open>03Resolved [16:46:55] a-team, no way I can connect to the batcave... [16:47:06] 10Analytics-Kanban: Update undocumented EventLogging mediawiki hooks - https://phabricator.wikimedia.org/T158331#3352074 (10Nuria) a:03Ottomata [16:47:17] are you guys still in da cave? [16:47:31] we're done mforns [16:47:37] ok [16:48:05] fdans: I've gotta get lunch and sort out my computer, but let's aim at finishing 30-ish points this week. So far we got 13 [16:48:39] I think I can do the AQS API one, but it's a lot simpler than originally thought so maybe I'll move it down to 5 (still counts as "finishing" 8 as far as our plan is concerned) [16:49:12] haha sure [16:49:22] so then we'd need another 8 pointer or so. Depending on how you do with Detail, I can grab that after you're done tomorrow or do something else [16:49:24] milimetric: I think we can do 30 [16:49:36] k, let's sync up tomorrow morning again [16:49:37] detail's going goood [16:49:45] good good, then maybe I'll grab something else [16:49:59] I'm at this beautiful stage of starting to become "one with the js framework" [16:50:17] I like vue [16:51:24] 10Analytics, 10Analytics-Wikistats, 10Wikimedia-Site-requests: Add li: Wikibooks to Wikistats - https://phabricator.wikimedia.org/T165634#3352079 (10Nuria) FYI that this is in deprioritized because any work on wikistats old ui is deprioritized while work continues on new UI. 
[16:53:14] honestly I think vue 2.0 is a lot closer to react than I initially realized [16:53:15] 10Analytics-Kanban: Measure portal and hovercard pageviews - https://phabricator.wikimedia.org/T162618#3168814 (10Nuria) [16:53:35] because they got rid of computeds bubbling out of the children and now it really just works like react with a little light reactivity sprinkled on [16:53:39] 10Analytics-Kanban: Measure portal pageviews - https://phabricator.wikimedia.org/T162618#3168814 (10Nuria) [16:53:49] but that's fine, works for me [16:53:58] 10Analytics: Measure portal pageviews - https://phabricator.wikimedia.org/T162618#3168814 (10Nuria) [16:55:35] 10Analytics, 10Analytics-EventLogging, 10Contributors-Analysis, 10EventBus, and 5 others: Record an EventLogging event every time a new content namespace page is created - https://phabricator.wikimedia.org/T150369#3352095 (10Niharika) I could be wrong but here's the queries I ran on recentchanges on enwiki... [17:00:45] ottomata: sorry, i logged myself out!!! [17:00:49] ottomata: duh [17:17:50] 10Analytics, 10Analytics-EventLogging, 10Contributors-Analysis, 10EventBus, and 5 others: Record an EventLogging event every time a new content namespace page is created - https://phabricator.wikimedia.org/T150369#3352160 (10kaldari) Hmm, not sure what to make of the results from recentchanges. That's real... [17:21:59] 10Analytics, 10Analytics-EventLogging, 10Contributors-Analysis, 10EventBus, and 5 others: Record an EventLogging event every time a new content namespace page is created - https://phabricator.wikimedia.org/T150369#3352175 (10MusikAnimal) I'm not sure about `recentchanges` but going by `rev_parent_id = 0` i... [17:35:40] 10Analytics, 10Analytics-EventLogging, 10Contributors-Analysis, 10EventBus, and 5 others: Record an EventLogging event every time a new content namespace page is created - https://phabricator.wikimedia.org/T150369#3352258 (10Ottomata) Ya @Niharika it might be worth checking the revision table instead of re... [17:37:35] 10Analytics, 10Analytics-EventLogging, 10Contributors-Analysis, 10EventBus, and 5 others: Record an EventLogging event every time a new content namespace page is created - https://phabricator.wikimedia.org/T150369#3352264 (10kaldari) Here's a page that has 9 revisions (out of 12) with `rev_parent_id = 0`:... [17:55:43] 10Analytics, 10Analytics-Cluster, 10Operations, 10ops-eqiad: rack/setup/install kafka100[4-9].eqiad.wmnet - https://phabricator.wikimedia.org/T167992#3352337 (10RobH) [17:59:07] heya mforns still around? [17:59:13] ottomata, yea [17:59:17] wazzup [17:59:43] just talked to nuria a bit about eventbus in mysql purging stuff [17:59:47] i think we don't need to worry about it [17:59:51] the data there is 'public' ish anyway [17:59:56] in that we'd expose it in eventstreams anyway [17:59:57] so [18:00:00] aha [18:00:06] the tables will be created in the same database [18:00:09] but will have different schemas [18:00:21] can you put together your list of schemas to purge from the mysql meta info db? [18:00:22] e.g. [18:00:36] select tables where database = log and table has field timestamp and id (or uuid) [18:00:37] ? [18:00:49] ottomata, actually, after doing some performance tests of the purging script I'd say we'll need to change the code not to use uuids anyway [18:00:56] thats good! [18:00:56] :) [18:02:02] ottomata, don't understand your question about list of schemas? 
[18:02:42] 10Analytics, 10Analytics-Cluster, 10Operations, 10ops-eqiad: rack/setup/install kafka100[4-9].eqiad.wmnet - https://phabricator.wikimedia.org/T167992#3352404 (10Ottomata) This will be a totally different cluster than the nodes in 1001-1003, or the 1012-1022ish nodes in the analytics cluster. Can we someho... [18:03:06] mforns: [18:03:07] ok [18:03:13] if tables exist in log db [18:03:20] aha [18:03:21] that do not have a timestamp or an id fileld [18:03:25] how will your script handle that? [18:03:32] break [18:03:34] :] [18:03:45] right so we need a way to tell it whihc tables to consider, or which ones to ignore [18:03:51] I see [18:03:52] we could provide a blacklist of tables to ignore in config [18:03:52] OR [18:04:02] we could do a little db reflection and examine the schemas of the tables [18:04:09] so, if table does not have the fields you need to do purging [18:04:10] skip it [18:04:21] do those tables have a schema_revision name structure? [18:06:22] meta_schema_uri [18:06:27] which looks like [18:06:39] 'mediawiki/revision/create/1' [18:08:05] ottomata, the table names are quite different, we could use their format to distinguish them [18:08:09] with a regexp or so [18:08:11] true [18:08:40] if you wanna do a quick and dirty, that's fine with me :) [18:08:46] we have to remember to check if we add new tables in the future... [18:08:46] the table will look like that: [18:08:50] mediawiki_revision_create_1 [18:09:02] oh! with underscores [18:09:04] ok [18:09:09] all the ones i'll be importing (for now) will start with mediawiki_ [18:09:14] I see [18:09:48] mmmm, no they are too similar, no? [18:10:09] those and EL tables? [18:10:10] if an EL user creates a new schema named Mediawiki... [18:10:12] yes [18:10:17] well, lower case? [18:10:21] yea [18:10:24] maybe since wikipages are upper case it'll be ok? [18:10:25] dunno [18:10:34] i mean, yeah, it'll probably work, but there might be cases where it breaks [18:10:39] the list of tables i will be importing for now is small [18:10:42] 3 or 4 [18:11:09] aha [18:12:51] ottomata, yea, I think it would be better to do introspection, as you mentioned [18:13:42] table must have a field named timestamp, and at least a field prefixed with 'event_' [18:14:26] mmm but still there can be problems... [18:15:00] yeah, but that will probably cover it [18:15:27] aha [18:16:50] Hey nuria_ [18:17:11] mforns: and you can get the tables from information_schema db [18:17:12] select COLUMN_NAME from COLUMNS where TABLE_SCHEMA='log' and TABLE_NAME = 'NavigationTiming_10076863'; [18:17:14] e.g. [18:18:24] or even [18:19:06] select TABLE_NAME from COLUMNS where TABLE_SCHEMA='log' and COLUMN_NAME in('id','timestamp'); [18:19:20] select DISTINCT TABLE_NAME from COLUMNS where TABLE_SCHEMA='log' and COLUMN_NAME in('id','timestamp'); [18:19:55] oh no, that's all tables with at least one of those fields [18:19:57] something like that ^ [18:20:40] joal: yessir [18:20:59] nuria_: Wanna double check new per-domain values and send email? [18:21:13] joal: sure. batcave? 
[18:21:25] nuria_: very noticeable bump on offset only, noticeable only on estimate [18:21:28] sure [18:22:26] a-team - So that we don't forget - http://www.commitstrip.com/en/2017/06/15/party-time/?setLocale=1 [18:24:01] haha [18:25:24] 10Analytics, 10Analytics-EventLogging, 10Contributors-Analysis, 10EventBus, and 5 others: Record an EventLogging event every time a new content namespace page is created - https://phabricator.wikimedia.org/T150369#3352537 (10Nuria) Also, FYI To @kaldari that edit count is being added to data lake, when yo... [18:27:32] 10Analytics, 10Analytics-EventLogging, 10Contributors-Analysis, 10EventBus, and 5 others: Record an EventLogging event every time a new content namespace page is created - https://phabricator.wikimedia.org/T150369#3352570 (10brion) I have the impression rev_parent_id isn't reliable but don't offhand recall... [18:28:29] ottomata: i dislike the just incremeenting the number for a difference service cluster [18:28:40] ive been sitting here since your update trying to think if a better hostname [18:28:54] so its no different than the other kafka servers, except its a different service shard? [18:29:01] or is it inherently a different kind of cluster? [18:29:29] we tend to try to increment hostname numbers sequentially [18:29:42] you guys just bucked the trend in kafka and we should avoid adding to its differences [18:29:48] (imo) [18:32:33] robh heheh, we KINDA bucked the trend on that one, there was an excuse! [18:32:37] butya, one sec... [18:32:57] yeah i know the excuse i jsut dont think it was worth the difference, but its done so no reason to worry about the past [18:33:08] i just rather not continue to have you deviate on the standards if we can help it [18:33:19] im updating the task with a slightly more elequently phrased version [18:34:55] 10Analytics, 10Analytics-Cluster, 10Operations, 10ops-eqiad: rack/setup/install kafka100[4-9].eqiad.wmnet - https://phabricator.wikimedia.org/T167992#3352624 (10RobH) @Ottomata that is not how we denote different clusters for any other hostnames on the cluster, so it seems bad to have kafka/analytics diffe... [18:35:13] im not exactly sure how your kafka systems differ [18:35:34] are they like our DB systems in that they are all mostly identical software stacks but just service different service shards? [18:35:35] 10Analytics, 10Operations, 10WMF-Legal, 10Privacy: Honor DNT header for access logs & varnish logs - https://phabricator.wikimedia.org/T98831#3352626 (10Nuria) FYI, Our privacy policy does mention we do not honor DNT. [18:39:50] ottomata: also if you still like your idea of incremetnin the number and faidon or mark says its ok, then it overrules the fact it deosnt match anything else in our standards. i dont have a personal stake in this, but i just try to keep things consistent unless my manager tells me otherwise ;] [18:40:21] so please please dont think im personally upset or invested in this, i most certainly am not ;] (people tend to think i care about this far more than i actually do) [18:40:34] i just end up having to talk about it a lot since i make the racking tasks =P [18:40:59] so you can disagree with me, get them to approve it, and im not going to be mad at you! [18:41:09] (hey sorry, with you in a min...) 
[18:41:12] no worries [18:41:22] actually, ok so [18:41:27] chris is out of the datacenter this afternoon so he wont be getting to racking this until tomorrow [18:41:30] the kafka clusters are logically different [18:41:38] 10Analytics, 10Operations, 10WMF-Legal, 10Privacy: Honor DNT header for access logs & varnish logs - https://phabricator.wikimedia.org/T98831#3352638 (10Nuria) https://wikimediafoundation.org/wiki/Privacy_policy/FAQ#DNTFAQ , if we were to do it i just found recently about the w3 api on this regard: https:/... [18:41:42] they are sorta used for different purposes [18:42:08] so not like our db systems, so i can see why you wouldnt like just incremeenting numbers in a close range [18:42:14] Done nuria_ [18:42:15] Yay ! [18:42:27] i'd advise we differ the hostname then [18:42:36] perhaps with the kafka-sc[ab] or something else like it? [18:42:41] the nodes in a cluster can't be joined with another cluster [18:42:42] hmmm [18:42:44] not a bad idea [18:42:46] and [18:42:51] we do have to name the cluster logically [18:42:53] in config, etc. [18:42:57] this one we haven't picked the name yet [18:43:00] yeah but would likely make more sense [18:43:02] but it may be 'aggregate' [18:43:03] for human readable [18:43:10] joal: "Analytically yours," jajajaja [18:43:11] https://etherpad.wikimedia.org/p/analytics-ops-kafka [18:43:14] maybe 'jumbo' [18:43:14] :) [18:43:16] maybe 'mothership' [18:43:16] :) [18:43:18] haha [18:43:44] im just boring [18:43:48] i like kafka-sc[ab] hehe [18:43:59] well, SC is service cluster? [18:44:02] not really what these are [18:44:08] (SC is service cluster, right?) [18:44:12] yeah [18:44:16] but, yeah, maybe we need a prefix that denotes they are kafka nodes [18:44:34] but a suffix that names the cluster succinctly [18:44:57] let's use 'main' as the example [18:44:58] well, service cluster just means a cluster dediacateed to a spefific service [18:45:00] since that cluster exists [18:45:02] or even analytics too [18:45:02] ottomata: greg? [18:45:07] so [18:45:10] haha [18:45:17] kafka-greg [18:45:22] if we had named the main or analytics clusters this way [18:45:25] what would we have called them [18:45:28] ka-main1001? [18:45:31] so i'd rename the main kafka cluster to kafla-sca1XXX, the next on e [18:45:32] kafka-main1001 [18:45:33] ? [18:45:37] since service cluster doesnt mean serices team [18:45:42] if kafka preceeds it [18:45:46] but anthing like that seems fine [18:45:58] riiiight, but i thought the SC just meant that those clusters ran multiple services [18:46:03] kafka-main1XXX, kakfa-whatever this is [18:46:11] kafka-analytics1001 [18:46:11] ? [18:46:12] true, yeah i have no issue iwth kafka-main [18:46:24] it'd be nicer to type something shorter...buuuut hm [18:46:34] kafka-an1XXX [18:46:35] should we do like we do with cache and db hosts? 
[18:46:46] makes sense, an can mean analytics [18:46:46] ka-main1001 [18:46:51] kamain [18:46:55] kaanalytics [18:46:56] yuck [18:46:57] i think its better to shorten the cluster of kafka [18:46:59] than remove kafka [18:47:02] kafka-agg [18:47:05] kafka-ana [18:47:07] kafka-abbreviation [18:47:08] yeah [18:47:08] kafka-anal [18:47:09] haha [18:47:13] do you remember [18:47:16] i was avoiding anal inteiotnal [18:47:18] that we tried to get yall to name the analytics nodes that [18:47:19] analinterns [18:47:20] hahah [18:47:20] yeah [18:48:12] so the primary cluster leverages kafla to do X, and this will leverage for Y, so i think kafka-x and kafka-y, and abbreviate the x and y as best as possible [18:48:22] kafka-main1XXX, kafka-an1XXX? [18:48:39] and all remanenet kafka1XXX are for kafka-main right? [18:48:50] yes [18:48:58] milimetric: your schema guidelines are awesome :) [18:49:09] that makes a lot more sense to me, but if you wanna discuss in team and update the task later that is also fine =] [18:49:17] 10Analytics, 10Analytics-EventLogging, 10Contributors-Analysis, 10EventBus, and 5 others: Record an EventLogging event every time a new content namespace page is created - https://phabricator.wikimedia.org/T150369#3352668 (10Niharika) I ran @Milimetric's query on enwiki with an additional rev_timestamp whe... [18:49:17] ok, we may just go with long name [18:49:25] i'm usually a fan of them, just not so much in node names for some reason [18:49:33] if you do, just know the physical label wil be kafka-an [18:49:33] maybe we'll call this new cluster jumbo [18:49:36] not kafka-analytics [18:49:40] too long for the front label [18:49:40] :) good [18:49:41] kafka-jumbo1001 [18:50:06] thats fine by me, it meets the rest of the cluster standards afaikt [18:50:11] ok [18:50:15] erggghh [18:50:16] remove the rogue t at the end of that [18:50:16] ok well [18:50:16] heh [18:50:30] that means that we need to decide on the cluster name soon [18:50:32] before they are provisioned [18:50:36] luca is out til next week [18:50:40] i def need his input on it [18:52:04] haha, joal, greg [18:52:05] oh man [18:52:16] 10Analytics, 10Analytics-Cluster, 10Operations, 10ops-eqiad: rack/setup/install kafka100[4-9].eqiad.wmnet - https://phabricator.wikimedia.org/T167992#3352673 (10RobH) IRC Update: Otto is going to chat with the rest of the folks involved in analytics, but we're leaning towards the following: kafka1XXX => d... [18:52:17] kafka-ggreg [18:52:18] kafka-ag [18:52:22] maybe ag is good enough [18:52:23] ottomata: well, at worst case [18:52:26] we can rack them with asset tags only [18:52:40] but then no setup can happen other than the bare onsite minimum [18:52:47] but thats good enough for us to get them remotely accessible [18:53:03] just need to confirm my racking proposal is fine, and what vlans these may need [18:53:12] ok cool [18:53:14] thanks robh [18:53:15] i said put them all in different racks, and spread across all 4 rows [18:53:25] i'll read the ticket more and respond [18:54:14] cool [18:54:16] 10Analytics, 10Analytics-Cluster, 10Operations, 10ops-eqiad: rack/setup/install kafka100[4-9].eqiad.wmnet - https://phabricator.wikimedia.org/T167992#3352682 (10RobH) If we cannot settle on hostnames before Chris goes to rack, we can set these up with asset tag mgmt dns entries only, and not put the hostna... 
[18:54:37] 10Analytics, 10Analytics-Cluster, 10Operations, 10ops-eqiad: rack/setup/install kafka100[4-9].eqiad.wmnet - https://phabricator.wikimedia.org/T167992#3352683 (10Ottomata) > Need input from @Ottomata on which vlans these 6 new hosts will use, as it will help determine row. Not in analytics vlan. These shou... [18:57:49] 10Analytics, 10Analytics-EventLogging, 10Contributors-Analysis, 10EventBus, and 5 others: Record an EventLogging event every time a new content namespace page is created - https://phabricator.wikimedia.org/T150369#3352687 (10aaron) I'm not sure what makes it diverge, but maybe the population scripts could... [18:59:15] 10Analytics-Kanban, 10Patch-For-Review, 10User-Elukey: Improve purging for analytics-slave data on Eventlogging - https://phabricator.wikimedia.org/T156933#3352689 (10mforns) I did some performance tests. I executed (by hand) the mentioned SELECT/LIMIT/OFFSET query on analytics-store.eqiad.wmnet the exact s... [19:02:34] 10Analytics, 10Analytics-Cluster, 10Operations, 10ops-eqiad: rack/setup/install kafka100[4-9].eqiad.wmnet - https://phabricator.wikimedia.org/T167992#3352700 (10Ottomata) I'll have to check with @elukey to finalize a cluster name. That'll have to wait til next week, sorry. FYI, our brainbounce of names i... [19:04:36] joal: oh, i left a comment somewhere about unique device naming [19:04:41] don't remember where or if there was a response [19:04:48] somethign about per domain vs per project ambiguity [19:05:25] ottomata: I have seen it but decided that, given the same discussion had happened three times, I was not going to have it again :-P [19:06:44] haah ok, sorry, i missed the discussion! [19:06:48] is it somewhere i can read about it? [19:06:49] ottomata: issue with per-domain, per-project, project-class, host and so is that they all mean the same and different things depending on who says it [19:07:31] ottomata: I think we had it at graking or post-standup, and I thought you were there (my bad) [19:07:47] • unique_devices_project_wide_daily stores unique devices counts per project split by country per day [19:08:00] (it is possible i just wasn't listening :o ) [19:08:03] so my fault completely [19:08:07] huhuhu [19:08:09] so project wide is per project? [19:08:23] ottomata: project-wide is what I like to call project-class: *.wikipedia [19:08:26] and per_domain is per domain [19:08:26] ? [19:08:45] and per_domain is per detailed domain: en.m.wikipedia.org [19:08:54] shoudl the project table be called [19:09:00] unique_devices_per_project_daily [19:09:00] ? [19:09:03] for consistency? [19:09:10] (if I am too late, then I am too late...sorry) [19:09:15] ottomata: I hear you [19:09:19] just don't understand project_wide [19:09:27] ottomata: it's not too late [19:09:59] ottomata: per_domain is done now, and project_wide was trying to make sure that there were a difference between domain and project [19:10:22] because, for instance, in the AQS api, we use project for what is called domain here [19:10:32] hmmm [19:10:44] right, project is more like mediawiki database, right? [19:10:50] or, more mapped to [19:10:54] oh [19:10:55] no [19:10:59] wikipeida [19:10:59] OHHH [19:11:00] i get it [19:11:01] ok [19:11:02] so [19:11:03] ottomata: originally with pageview, yes, with uniques, not anymore [19:11:05] wikipedia is project wide [19:11:09] correct [19:11:10] en.wikipedia is a project [19:11:17] en.m.wikipedia.org is a domain [19:11:18] ? [19:11:30] well, not exactly [19:11:43] values not exact, but ideas? 
[19:11:55] in unique world, a project is a top domain (wikipedia, wiktionnary etc) [19:12:20] and a domain is a detailed domain (en.wikipedia.org, en.m.wikipedia.org) [19:12:26] ok [19:12:37] and 'english wikipedia' is not directly a project then? [19:12:47] But in pageview world, we call project the mediawiki_db entity [19:13:06] right, e.g. 'english wikipedia', enwiki [19:13:08] ok [19:13:14] so yeah, there is confusdion to be made here - we spent at least two sessions discussing around that [19:13:19] i'm sorry joal [19:13:26] you can ignore me if you like [19:13:34] yeah, so that's what I had always thought was a project, since that's how we defined it a while ago, i thought [19:13:37] I wanted to bring a new term for the project-wide notion (project_class was suggested) [19:13:40] like uhhh, projectview, right? [19:14:23] yeah rats [19:14:26] there project == aa.wikipedia [19:14:29] oof [19:14:31] 10Analytics, 10Analytics-Cluster, 10Operations, 10ops-eqiad: rack/setup/install new kafka nodes - https://phabricator.wikimedia.org/T167992#3352759 (10RobH) [19:14:34] so, we have a conflated definition of project? [19:14:50] don't worry ottomata, I now know the names and their context, but I aggree there is room for confusion - Idea was to try to have name that makes sense for people outside the analytics world, and project_class was not one of those [19:15:07] correct ottomata - project means different things in different contexts [19:15:10] project_class == project_wide == wikipedia [19:15:11] ? [19:15:18] yessir [19:15:27] 10Analytics, 10Analytics-Cluster, 10Operations, 10ops-eqiad: rack/setup/install new kafka nodes - https://phabricator.wikimedia.org/T167992#3352337 (10RobH) [19:15:35] and we name its dimension "project" in the table [19:15:41] ok, i see. who usually refers to 'wikipedia' as a project? vs. 'en wikipeida'? [19:16:29] milimetric and nuria_ said the communities and people less technicall do so [19:16:33] I trusted them : [19:16:36] :) [19:17:32] ottomata: If you want we can revive this naming thread for project_wide :) [19:17:36] 10Analytics, 10Analytics-Cluster, 10Operations, 10ops-eqiad: rack/setup/install new kafka nodes - https://phabricator.wikimedia.org/T167992#3352801 (10RobH) [19:18:20] ottomata: yeah, even people at wmf say "Wikipedia and its sister projects" [19:18:48] joal: i like the term project wide to describe what you got [19:19:02] for example: https://wikimediafoundation.org/wiki/Our_projects [19:19:03] i think maybe i don't like that the field 'project' in the table is inconsistent with our uses of project elsewhere [19:19:12] Then we rename the pageviews :) [19:19:52] boy the naming.. thsi was a hard one [19:19:57] :D [19:20:09] sorryyyyyy for being a late namecomer [19:20:25] for getting amnesia you mean? [19:20:33] :) [19:20:40] cause you were there on that discussion [19:21:08] i have the memory of a goldfish [19:21:11] cc ottomata [19:21:15] ya, i can relate [19:21:28] Oh ! a new sofa !@ [19:21:37] jajak [19:21:46] mine is more like memento [19:22:09] mine's worse - i only think I remember everything but i can never be sure [19:22:23] so anyway, to sum up, we have a project field on this new table that is not the same as project on other tables [19:22:29] milimetric: I have that one on directions in cities [19:22:31] because, otherwise folks outside of analytics will be confused? [19:22:49] q, is there a right answer/definition as to what project 'should' be? [19:23:02] did we get it wrong years ago when we picked 'en.wikipedia' to be a project? 
[19:23:18] ottomata: from milimetric's link, the correct definition is the one of uniques [19:23:30] oh missed the link, looking [19:23:31] ottomata: looks like a project as in pageview is incorrect [19:23:40] hm yeah [19:23:49] oof ok. [19:24:05] I think you're incorrectly assuming that there's consistency [19:24:13] :D [19:24:13] project is used to refer to both en.wiki and wiki [19:24:31] yeah [19:24:32] "Wiktionary is a project to create a multilingual free content dictionary in every language. This means each project seeks to use a particular language to define all words in all languages." [19:24:39] even on that page it does [19:24:41] yep [19:24:59] Well at least if there's no consistency, may we have availability and partition tolerance [19:25:07] hahaha [19:25:10] no! [19:25:11] haha [19:25:13] :d [19:26:03] ok, so all suggestions/ideas i'm making now should be taken with the fact that I am aware that we are far along in this project (har har), so i'm not saying we should do something [19:26:08] just want to at least know what the ideal is [19:26:17] so, if project is not consistently defined externally [19:26:20] of the analytics team [19:26:26] it seems like we had defined it years ago [19:26:31] and had been using it in a certain way [19:26:35] why would we change now? [19:26:43] project_class sounds pretty good to me........... [19:27:07] ottomata: I liked it as well, but it's obscure to external hears [19:27:28] external ears such as? ones that cannot be educated with docuemntation? [19:27:33] maybe our definition will be come the defacto ones [19:28:34] "As of November 2016, there are over 183,000 entries in 89 Wikiquote language projects" [19:29:33] q: IF we were to use the higher level definition of project, e.g. wikipedia, what would we (like) to change the pageview field name to? [19:29:36] project_variant? [19:29:57] pageview_info['project'], [19:29:57] pageview_info['language_variant'], [19:30:01] hmm language_variant [19:30:07] another q: [19:30:23] are there types of project variants other than 'language'? (this i'm sure we must have talked about before) [19:31:05] oof [19:31:07] wait [19:31:07] yeah, there are language variants [19:31:09] in pageview_info [19:31:11] in webrequest [19:31:11] map containing project, language_variant and page_title [19:31:15] ottomata: language_variants is not a project [19:31:15] like zh-[variant] [19:31:19] ahhhh [19:31:20] ok phew [19:31:21] correct [19:31:25] ok, so our name is concistent [19:31:26] got it [19:31:27] sorry [19:31:29] np [19:31:33] it's tricky ! [19:32:00] ok, yeah, so if we had to design these schemas now brand knew knowing everything that we do [19:32:03] what would we choose? [19:32:08] new* [19:32:30] hm [19:32:58] project = wikipeida, project_variant = zh, language_variant = 'xx' [19:32:58] OR [19:32:59] project_class = wikipedia, project = zh.wikipedia, language_variant = 'xx' [19:32:59] ? [19:33:05] A) or B) or something else? [19:33:12] I think I'd use wiki as what a pageview call project, and project for the top level one, but wiki sounds weird [19:33:31] it maps to a media wiki 'instance' database [19:33:36] so kinda makes sense [19:33:38] but yeah is weird [19:33:47] also, who knows, maybe we'll have projects that aren't wikis! :p [19:33:53] :) [19:34:15] so joal, if you had to choose, you'd pick project as high level [19:34:18] e.g. 
wikipedia [19:34:18] I'd go for B, but I'm biased toward the existing names [19:34:24] i'm also biased that way [19:34:36] especially since it isn't well defined elsewhere [19:34:40] but we have it well defined :) [19:34:53] But from the examples milimetric gave, I'd go for A with a different (better) name for wiki/project_variant [19:35:02] the examples milimetric gave? [19:35:09] on that page? [19:35:09] the link [19:35:09] yup [19:35:12] yeah, but that page isn't consistent either [19:35:16] true [19:35:44] We, the a-team, will bring naming consistency to this world of infamous inconsistent namers ! :-P [19:35:50] :D [19:35:55] it refers to projects top level and language level [19:36:00] it also refers to language level as 'edition' [19:36:03] which is kinda nice [19:36:03] hehe [19:36:07] yes I know, I have heard it this way many times [19:36:13] C) project = wikipedia, edition = en.wikipedia [19:36:15] edition is nice [19:36:35] agree, but it does not match my biases [19:36:41] project_edition [19:36:42] mine either [19:36:49] I liked project_class [19:36:50] project_class is pretty good [19:36:53] yes [19:37:00] and is backwards compatibile :) [19:37:11] we've done too much mathematics and group theroy [19:37:24] or OO programming? [19:37:29] hehe [19:37:39] class WikipediaProject [19:37:44] new WikipeidaProject('en') [19:38:22] ok, joal, so......>.>>>>>>. why not project_class then? because someone else will be confused (again, tell me if I am too late for this) [19:38:26] en.wikipedia % project == wikipedia [19:38:27] (i can drop it) [19:39:12] ottomata: project_wide, data is computed but not public, so we can still change - but I lost that same battle for the sake of external clarity [19:39:48] ottomata: I'm the one having the same bias you have, so you'd need to convince nuria_ and milimetric :) [19:39:48] with nuria? heheh [19:39:59] i'm fine with the term project wide too= [19:40:04] as a dataset name [19:40:06] but not a field name [19:40:22] so, internally our tables could aybe be [19:40:34] per_project_class with a field name project_class [19:40:39] but API can say project wide [19:40:40] ? [19:40:55] milimetric: ? [19:41:06] i thougth milimetric was on our side from comments above... :) [19:41:46] hey what [19:41:49] what's going on [19:42:09] we wanna use "project_class" for "wikipedia"? [19:42:17] everywhere? or just in once place ottomata ? [19:43:00] milimetric: AFAIK we only have that place so far [19:43:20] milimetric: i just dont' want to use project to mean different things in different places [19:43:33] everywhere else, project is 'en.wikipedia' [19:45:29] ok, cool, so it's project_class? not project_type or something else? [19:46:15] I support, either way, that's fine but we should pick one and use it always [19:46:28] we should also look through our code and maybe refactor anything there [19:46:53] agree, project_class sounds nicer [19:46:56] looking at some webrequest fields [19:47:08] it would be nice if we had a little bit of a loose consistency between what 'class' vs 'type means' [19:47:11] agent_type [19:47:13] referer_class [19:47:24] class to me refers to a larger classification of the thing [19:47:29] which makes sense here [19:47:43] what's the project class of the en.wikipedia project? wikipedia [19:48:23] ottomata: agreed about agent_type --< agent_class [19:48:52] haha [19:48:52] https://en.wikipedia.org/wiki/Class_(biology) [19:49:44] milimetric: I am for project_class, if you and joal are. also, you can tell me to stop bothering yall about this! 
I shoudl have spoke up sooner [19:49:53] i just did a review somewhere and realized something was inconsistent [19:49:58] and i got out my bike shed building tools [19:50:02] maybe that's just want i'm doing today thoguh [19:50:06] :D [19:50:07] i've bike shed 2 things before this already [19:50:14] hehe [19:50:22] good use of tools out! [19:50:23] oh man, we're gonna need more bikes :) [19:50:25] haha, i never got a chance to start coding today, so my head wasn't down and i noticed things [19:50:48] ottomata: we cannot use project_class wth external world [19:50:58] nuria_: why? and. that's fine. :) [19:51:13] the external world needs a differencitation between these two things [19:51:17] because it does not mean anything to anyone outside this group [19:51:20] they already have pageview and projectview [19:51:21] it's a bit abstract, that's why I was thinking project_type [19:51:25] where those things mean 'en.wikipedia' [19:51:42] wouldn't it be worse for hte external world if we conflated the meaning of project in our APIs? [19:52:35] Our apis as in column database names ? or pageview api fields? [19:52:46] i was referring to pageview api fields, and datasets [19:53:38] nuria_: https://wikimedia.org/api/rest_v1/ [19:53:55] https://wikimedia.org/api/rest_v1/#!/Pageviews_data/get_metrics_pageviews_top_project_access_year_month_day https://wikimedia.org/api/rest_v1/#!/Pageviews_data/get_metrics_pageviews_top_project_access_year_month_day [19:53:56] oops [19:54:03] GET /metrics/pageviews/top/{project}/{access}/{year}/{month}/{day} [19:54:05] there [19:54:17] project is en.wikipedia [19:54:18] no? [19:54:30] ok, so project as in "en.wikipedia.org" you think is fitting [19:54:38] GET http://wikimedia.org/api/rest_v1/metrics/pageviews/per-article/en.wikipedia/all-access/all-agents/Albert_Einstein/daily/2015100100/2015103100 [19:54:42] but not as a column name for project-wide unqiues [19:54:48] but not as a column name for project-wide uniques [19:54:55] i think its what we have already, and shouldn't change it [19:54:59] so yeah [19:55:34] ok, then the name "project_wide" for *.wikipedia Uniques is fine [19:55:46] i think its on ok dataset name [19:55:47] although [19:55:48] unique_devices_per_project_class [19:55:51] sounds better and more consistent [19:56:02] but [19:56:04] unique_devices_project_wide_daily is ok [19:56:09] as long as the field in that table (and in the API) [19:56:12] is not project [19:56:22] but 'project_class' (or something else if you don't like it) [19:57:23] if you can live with project_wide for teh dataset name let's just change the column name [19:57:28] joal: just so i am sure, en.m.wikipeida.org is a possible domain, right? in this per_domain table? [19:57:47] nuria_: ok, here i am not so opinionated, but talk with me about that for just a bit more [19:58:10] why is project_wide better than per_project_class? espcially given that we have per_domain as a dataset name already? [19:58:50] correct ottomata - en.m.wikipedia.org is a domain in the per_domain table [19:59:02] ok great [19:59:17] in the pageview table, it would be translated to: project = en.wikipedia, access_method: mobile web [19:59:18] class > project > domain [19:59:36] indeed ottomata [20:02:05] ottomata: I like the idea of consistency even for table/datasets names: unique_devices_per_domain and unique_devices_per_project_class [20:02:11] ottomata: jajaja [20:02:13] ayayay [20:02:32] i like that better too, but nuria_ might not, just wondering why... 
:) [20:02:53] ottomata: because "project_class" is an empty term , what does class mean on that context? we never refer to wikipedia as a class of anything [20:03:01] ottomata: however project_wide is english [20:03:31] ottomata: not perfect but 'wide' carries a lot more meaning than "class" in thsi case [20:03:32] we never refer to it that way because the term project is conflated [20:03:47] perhaps we will refert to it that way...if we name it [20:03:49] if we name it, they will say it [20:04:06] byeeee team, cya [20:04:09] byyeyee [20:04:12] Bye mforns [20:04:17] :] [20:04:36] ottomata: no, i disagree, we do not need to invent a new concept: search "wikipedia project class" on the web [20:04:44] ottomata: but to settle it we can talk to our users [20:04:56] we do need to invent a new concept for sure. since we need to differentiate the two uses of project somehow [20:05:14] so, 'project wide' is a new concept, except it doesn't have a good instance name [20:05:22] so sure, you can say, these are project wide metrics [20:06:02] but what do you refer to the 'per' dimension part of that metric as? [20:06:07] we are saying project_class [20:06:19] but if you don't name it, then the external folks will keep conflating the terms [20:06:24] which will only lead to more confusion later [20:06:46] so, i'm suggesting to nudge the terminology towards consistency and dis-ambiguity [20:06:49] we'll have to document [20:06:54] what project_class is anyway [20:08:21] and in that docuemntation the different terms (project_class > project > domain) will be clearly outlined [20:08:37] but then we'll have datasets and apis that are named inconsistently, if we don't use the terms in the dataset names [20:09:10] again though, i'm not thaaat opinionated about this. dataset names often have to be fairly english-y and descriptive, since we can't just put all the dimensions in the name [20:13:36] ottomata: "class" sounds like developer lingo, but if iam the only one in disent please do change it [20:13:52] clearly not, its in biology! [20:13:54] and math! [20:13:59] classification [20:14:09] https://www.google.com/search?q=define+class&oq=define+class&aqs=chrome..69i57j69i60l3j69i65j69i60.1250j0j9&sourceid=chrome&ie=UTF-8 [20:17:01] ottomata: on our ecosytem a wikipedia project class is this: https://en.wikipedia.org/wiki/Category:Project-Class_physics_articles [20:18:31] ottomata: perhaps related to what you are saying as "wikipedia project class" already has a meaning related to classification [20:18:53] ottomata: but again, I might be the only one in disagreement [20:21:09] nuria_: not sure what that link is [20:22:12] nuria_: i think that the external world would appreciate consistency in dataset naming, as much as we can give it, more than they would a dumbed down name. so i thiiiiiiink unless there are strong objects, and unless you have folks in the external world with strong objections, maybe we should go with per_project_class [20:22:13] ? [20:24:01] ottomata: if joal and milimetric agree please go ahead, next time it will be great to bring this up on CR so joal doesn't need to redo a ton of work , these are changes that were needed to do this renaming: https://phabricator.wikimedia.org/T167043 [20:25:46] wait, I'm lost, I thought we were renaming a column, what is happening, what are we renaming? [20:26:30] nuria_: i did in one code review a couple of days ago, but you are right, i am very late here. 
[20:26:34] milimetric: we are renaming a column [20:26:40] it's kind of hard to pay attention to a conversation like this over text while coding, sorry [20:26:46] haha, i think that's why i missed this [20:26:51] i have been coding hard the last week or two [20:26:56] milimetric: so [20:27:05] in the uniques dataset [20:27:09] so we're changing last_access_uniques to per_domain_uniques, right? [20:27:10] project - project_class [20:27:13] ottomata, milimetric : agreed to change, not contesting that if joal and ottomata feel better about naming [20:27:15] yes that's fine [20:27:23] but, what I was just saying [20:27:24] is that [20:27:29] the 'project wide' dataset [20:27:30] should be called [20:27:35] per_project_class [20:27:36] not [20:27:38] "project class" [20:27:38] project_wide [20:27:41] for consistency [20:27:52] unique_devices_per_domain_daily [20:27:52] and [20:28:02] unique_devices_per_project_class_daily [20:28:30] milimetric: thoughts ^? [20:28:39] eh... so but the problem there is that "domain" and "project" now mean the same thing [20:28:46] they don't though [20:28:47] ? [20:28:54] project is "en.wikipedia.org" [20:28:56] domain == en.m.wikipedia.org [20:29:04] oh ok [20:29:06] right [20:29:07] class > project > domain [20:29:53] if someone wanted, we could also have a unique_devices_per_project_daily [20:30:10] which with this class > project > domain dillieneation, its clear what that is [20:30:19] dillineation [20:30:38] ok, that's fine with me [20:30:54] I incorrectly equated domain and project [20:31:47] you ok with 'unique_devices_per_project_class_daily' as dataset name? [20:32:57] yeah [20:36:32] ok, les do it, thanks yall. again [20:36:39] many many apologies for not having joined in this discussion sooner [20:38:55] sorry, missed some exchanged here [20:39:16] Recap on what we ahve what to change: [20:39:31] We have unique_devices_per_domain -- This doesn't change [20:40:32] We not have unique_devices_project_wide (but not available oustside)--> This gets renamed to unique_devices_per_project_class, with inner collumn named project_class instead of project [20:40:44] ottomata, nuria_, milimetric --^ [20:41:35] I'll implement that tomorrow, and will go to bed for now :) [20:42:14] Thanks ottomata for the bikeshidding [20:42:35] If you wnat some more, have a looke at that one: https://gerrit.wikimedia.org/r/#/c/359019/ [20:42:38] :-P [20:42:52] +1, thanks yall [20:43:15] joal: i'll look at that later for sure, :) [20:43:54] Bye a-team [20:44:06] nite [21:54:17] 10Analytics, 10Analytics-EventLogging, 10Contributors-Analysis, 10EventBus, and 5 others: Record an EventLogging event every time a new content namespace page is created - https://phabricator.wikimedia.org/T150369#3353236 (10kaldari) @Ottomata: From brion and Niharika's comments above, it looks like `rev_p... [22:31:46] nuria_: hi! do you have a minute to talk about T143819? [22:31:46] T143819: Data request for logs from SparQL interface at query.wikidata.org - https://phabricator.wikimedia.org/T143819 [22:32:04] SMalyshev: let me see which one that is [22:32:32] SMalyshev: i have a few mins yes, whassup? [22:33:02] nuria_: I wanted to understand what we can do with current analytics setup in general. So, we have this query log [22:33:26] SMalyshev: aham [22:33:32] nuria_: is there some place/setup where we could make another dataset from it that would be publicly consumable? [22:33:56] SMalyshev: the query log you are referring to is on hdfs? part of webrequest? [22:33:59] e.g. 
[21:54:17] 10Analytics, 10Analytics-EventLogging, 10Contributors-Analysis, 10EventBus, and 5 others: Record an EventLogging event every time a new content namespace page is created - https://phabricator.wikimedia.org/T150369#3353236 (10kaldari) @Ottomata: From brion and Niharika's comments above, it looks like `rev_p...
[22:31:46] nuria_: hi! do you have a minute to talk about T143819?
[22:31:46] T143819: Data request for logs from SparQL interface at query.wikidata.org - https://phabricator.wikimedia.org/T143819
[22:32:04] SMalyshev: let me see which one that is
[22:32:32] SMalyshev: i have a few mins yes, whassup?
[22:33:02] nuria_: I wanted to understand what we can do with the current analytics setup in general. So, we have this query log
[22:33:26] SMalyshev: aham
[22:33:32] nuria_: is there some place/setup where we could make another dataset from it that would be publicly consumable?
[22:33:56] SMalyshev: the query log you are referring to is on hdfs? part of webrequest?
[22:33:59] e.g. P17 was accessed X times on June 14, etc.
[22:34:06] nuria_: yes
[22:34:08] SMalyshev: ya, the output i get
[22:34:41] SMalyshev: ok, so your records are distinct records of webrequest that you can identify and tag
[22:35:15] nuria_: yeah, basically say we develop some code that says: this request has props P17, P31 and P2048 and items Q123 and Q456
[22:35:22] what can we do with this data?
[22:35:46] SMalyshev: say we mark all records that you are interested in with "wikidata" and they get split into a wikidata "partition" of webrequest; you can read those records and compute your stats and create a dataset. does that make sense?
[22:36:16] nuria_: ok, but that dataset would still be on the analytics cluster and thus private, right?
[22:36:47] SMalyshev: no, we have tons of public data generated on the cluster, actually the majority of it is generated that way
[22:37:02] nuria_: aha, ok, that's what I'm trying to figure out
[22:37:20] SMalyshev: see: https://dumps.wikimedia.org/other/analytics/
[22:37:35] nuria_: so there's a way to produce public data sets... that's good
[22:37:42] SMalyshev: from day 1
[22:37:59] SMalyshev: that we started using the cluster
[22:38:20] SMalyshev: what we want to avoid is you combing webrequest (all data) just for this, so i think your changes
[22:38:49] SMalyshev: will benefit from the work we are doing about tagging and splitting those requests in one swoop
[22:39:18] yeah that's what I am thinking
[22:39:39] making some kind of process that generates this log with used props/items
[22:39:44] SMalyshev: code is in the works for our tagging and splitting, but will share some code and docs
[22:39:47] so that then people could just take it and work on it
[22:40:26] SMalyshev: ya, very doable, for the kind of stuff we do that sounds like bread and butter really
[22:40:50] nuria_: then a more specific question. I see things like pageviews are basically fixed structure. But if we want queries with properties/items/other tags, one query can have many props, etc.
[22:40:54] how is this handled?
[22:41:39] SMalyshev: you will tag the data you are using, see for example the tagging of portal pageviews (this is wip): https://gerrit.wikimedia.org/r/#/c/353287/15/refinery-core/src/main/java/org/wikimedia/analytics/refinery/core/webrequest/tag/PortalTagger.java
[22:42:08] SMalyshev: that data will get split into its own bucket given a tag: https://gerrit.wikimedia.org/r/#/c/357814/
[22:42:19] SMalyshev: and you operate on your data via sql statements
[22:42:45] nuria_: ok, say I assign tags, but what would be the output?
[22:42:57] SMalyshev: this is our latest work, so doing jobs like the one you want to do doesn't require combing webrequest wholly again (inefficient)
[22:43:06] SMalyshev: all records tagged in a table
[22:43:18] (note I have very shallow knowledge about how this works now so please excuse stupid questions :)
[22:43:44] nuria_: ok so tags can be basically anything in any amount, right?
[22:43:49] SMalyshev: a mini-webrequest (same schema, less data: https://wikitech.wikimedia.org/wiki/Analytics/Data_Lake/Traffic/Webrequest#Current_Schema)
[22:44:01] and those are still stored in e.g. Parquet tables or something like that?
[22:44:08] SMalyshev: yes
[22:44:28] SMalyshev: the tagging work is WIP, so will be stored, but yes.
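As a rough illustration of the per-request extraction SMalyshev describes above (a production version would be a Java tagger in refinery-core, like the PortalTagger linked earlier), here is a minimal Python sketch; the regexes and names are assumptions, not the actual tagging code:

    import re

    # Hypothetical sketch: pull Wikidata property (P...) and item (Q...)
    # IDs out of the SPARQL text of a single webrequest.
    PROP_RE = re.compile(r'\bP\d+\b')
    ITEM_RE = re.compile(r'\bQ\d+\b')

    def extract_entities(sparql):
        props = sorted(set(PROP_RE.findall(sparql)))
        items = sorted(set(ITEM_RE.findall(sparql)))
        return props, items

    query = 'SELECT ?x WHERE { ?x wdt:P31 wd:Q5 . ?x wdt:P17 wd:Q123 }'
    print(extract_entities(query))
    # (['P17', 'P31'], ['Q123', 'Q5'])

In the pipeline being described, requests matching such a tagger would land in their own "wikidata" bucket of webrequest, ready to be queried with SQL.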
[22:44:30] got it
[22:44:42] but then we'd need some other process to export these, I imagine
[22:45:18] SMalyshev: ya, you would need SQL that computes your counts and exports, but that again is what we do for most everything
[22:46:18] SMalyshev: see for example unique devices: https://github.com/wikimedia/analytics-refinery/tree/master/oozie/unique_devices/per_domain/daily
[22:46:36] SMalyshev: this workflow creates the data (metrics) and files for external use
[22:47:19] SMalyshev: maybe an easier example, for mediacounts: https://github.com/wikimedia/analytics-refinery/blob/master/oozie/mediacounts/archive/archive_mediacounts.hql
[22:47:36] nuria_: ok, cool. but those are already aggregated, as I see... I wonder if it makes sense to have something like per-request data too, because people may want to do all kinds of aggregations. should we implement them on our side?
[22:48:12] SMalyshev: normally datasets retained longer have to be aggregated for privacy
[22:48:20] SMalyshev: but that might not be true for your data
[22:48:57] yeah, that's what I am wondering... maybe we don't need non-aggregated data, but what if somebody wants to correlate e.g. items with properties?
[22:48:59] SMalyshev: do talk to your users, super detailed data might not be very useful in your case; for the system it doesn't matter
[22:49:31] yeah, surely. I am just trying to figure out how it works, but I think I get the idea now
[22:49:44] SMalyshev: ok, good, take a look and let us know
[22:50:01] nuria_: do you perchance have the ticket for the tagging work, so I could watch it and know when it's ready?
[22:52:01] ah I think I found it: T164021
[22:52:01] T164021: Create tagging udf - https://phabricator.wikimedia.org/T164021
[22:52:45] thanks!
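To show the shape of the aggregation/export step discussed above (in production it would be an HQL job like the mediacounts and unique_devices examples linked by nuria_), here is a toy Python sketch with hypothetical input structure:

    from collections import Counter

    # Toy sketch of the daily aggregation step: given tagged requests as
    # (day, [properties]) pairs, produce "property P was used N times on
    # day D" counts -- the shape of a publishable dataset. The input
    # structure is hypothetical, standing in for a Hive table.
    def daily_property_counts(tagged_requests):
        counts = Counter()
        for day, props in tagged_requests:
            for prop in props:
                counts[(day, prop)] += 1
        return counts

    requests = [
        ('2017-06-14', ['P17', 'P31']),
        ('2017-06-14', ['P17', 'P2048']),
        ('2017-06-15', ['P31']),
    ]
    for (day, prop), n in sorted(daily_property_counts(requests).items()):
        print(day, prop, n)
    # 2017-06-14 P17 2
    # 2017-06-14 P2048 1
    # 2017-06-14 P31 1
    # 2017-06-15 P31 1

Aggregates of this shape could be published alongside the datasets under https://dumps.wikimedia.org/other/analytics/, while per-request data would remain subject to the privacy and retention constraints mentioned above.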