[00:09:31] (03PS1) 10Milimetric: Add atj.wikipedia to whitelist [analytics/refinery] - 10https://gerrit.wikimedia.org/r/359081 [00:09:49] (03CR) 10Milimetric: [V: 032 C: 032] Add atj.wikipedia to whitelist [analytics/refinery] - 10https://gerrit.wikimedia.org/r/359081 (owner: 10Milimetric) [00:47:57] 10Analytics, 10Discovery, 10Wikidata, 10Wikidata-Query-Service: Data request for logs from SparQL interface at query.wikidata.org - https://phabricator.wikimedia.org/T143819#3350439 (10AndrewSu) > We could, however (with some work) capture usage of certain property, or item, or property-item combination, i... [01:02:39] 10Analytics, 10Analytics-EventLogging, 10Contributors-Analysis, 10EventBus, and 5 others: Record an EventLogging event every time a new content namespace page is created - https://phabricator.wikimedia.org/T150369#3350469 (10kaldari) @Nuria: I have created https://meta.wikimedia.org/wiki/Research:Wikipedia... [01:14:33] 10Analytics, 10Analytics-EventLogging, 10Contributors-Analysis, 10EventBus, and 5 others: Record an EventLogging event every time a new content namespace page is created - https://phabricator.wikimedia.org/T150369#3350481 (10Ottomata) Page creation is just revision create with rev_parent_id = 0, no? [02:17:52] 10Analytics, 10Analytics-EventLogging, 10Contributors-Analysis, 10EventBus, and 5 others: Record an EventLogging event every time a new content namespace page is created - https://phabricator.wikimedia.org/T150369#3350519 (10kaldari) @Ottomata: Nevermind, I see we can use revision-create where rev_parent_i... [04:47:46] (03CR) 10Nuria: "Alaready done in https://gerrit.wikimedia.org/r/#/c/359081/. Thanks!" [analytics/refinery] - 10https://gerrit.wikimedia.org/r/359062 (https://phabricator.wikimedia.org/T167720) (owner: 10Reedy) [04:51:26] 10Analytics, 10Discovery, 10Wikidata, 10Wikidata-Query-Service: Data request for logs from SparQL interface at query.wikidata.org - https://phabricator.wikimedia.org/T143819#3350566 (10Nuria) >To incentivize them to contribute, we have to give them even better metrics of community usage/impact that they ca... [04:55:26] 10Analytics, 10Discovery, 10Wikidata, 10Wikidata-Query-Service: Data request for logs from SparQL interface at query.wikidata.org - https://phabricator.wikimedia.org/T143819#3350567 (10AndrewSu) >>! In T143819#3350566, @Nuria wrote: >>To incentivize them to contribute, we have to give them even better metr... [10:36:19] 10Analytics, 10Operations, 10Traffic: Increase request limits for GETs to /api/rest_v1/ - https://phabricator.wikimedia.org/T118365#3350950 (10ema) >>! In T118365#3349563, @Nuria wrote: > mmm... looking at pageview API dashboard I can see some of lawful traffic (spikes we could have handled) seems to have b... [10:48:13] 10Analytics, 10Operations, 10Ops-Access-Requests: analytics-privatedata-users access for ema - https://phabricator.wikimedia.org/T167952#3350991 (10ema) [10:48:21] 10Analytics, 10Operations, 10Ops-Access-Requests: analytics-privatedata-users access for ema - https://phabricator.wikimedia.org/T167952#3351003 (10ema) p:05Triage>03Normal [12:30:32] 10Analytics, 10Operations, 10Traffic: Increase request limits for GETs to /api/rest_v1/ - https://phabricator.wikimedia.org/T118365#3351181 (10BBlack) That top client appears to be CrossRefEventDataBot from https://www.crossref.org/services/event-data/ , running on a hosted server at Hetzner in DE. 
[12:35:27] 10Analytics, 10Operations, 10Traffic: Increase request limits for GETs to /api/rest_v1/ - https://phabricator.wikimedia.org/T118365#3351184 (10ema) >>! In T118365#3351181, @BBlack wrote: > That top client appears to be CrossRefEventDataBot from https://www.crossref.org/services/event-data/ , running on a hos... [13:13:48] everyone's so quiet today! [13:20:14] Question from WMDE: we've used Pivot to track the TWL 2017 Campaign banner impressions, and were able to get the data from May 29 to June 01 only. We wonder whether the data were ever complete; in the meantime, the data for this banner are not available from Pivot anymore. Can anyone advise on this? Thank you very much. [13:37:07] 10Analytics, 10Analytics-Wikistats: Remove links in wikistats to minnan.wikipedia.org - https://phabricator.wikimedia.org/T107250#3351353 (10Liuxinyu970226) [13:37:38] fdans: I'm still working on the breakdown, I think crossfilter is cool but I wonder if it wouldn't be easier to just work with a plain recordset, it doesn't really get us much yet. [13:38:38] but I think we should stick with it for now, it's easily replaced [13:38:53] and maybe I'm wrong, maybe we'll need all the weird filter intersections and caching stuff it does [13:39:20] milimetric: that last thought is the one I keep whenever I'm doing these crazy reducers [13:39:49] yeah, you kind of have to box with the way it does .group on dimensions [13:39:57] I don't really think it makes sense... [13:40:10] like... "there's surely a layer of complexity here that I'm not touching and it probably makes all this worth it" [13:40:40] actually... I'm a little lost... something's not right [13:40:42] wanna hangout? [13:40:56] I'm picking up something I was working on last night and it doesn't make sense [13:41:06] sure, i'm on my way [13:47:01] (hi!) [13:47:29] mforns: yt? [14:38:37] ok fdans, I pushed [14:38:47] take a look and let's talk before standup if you want [14:38:52] ottomata, back from lunch [14:38:59] nice, looking milimetric [14:40:41] mforns: so, in order to do purging, your code relies on a top level timestamp field, correct? [14:40:51] does it rely on this being in a particular format in mysql? [14:41:07] ottomata, it assumes it's mediawiki format [14:41:13] and also it has an index [14:41:41] without index, it will be a lot slower [14:42:11] ok. grr. it just sucks that jrm.py is mw-el-analytics specific [14:42:21] ottomata, aha [14:42:25] milimetric: batcave! [14:42:38] mforns: , what about dateutil.parser.parse [14:42:39] ? [14:42:44] if you used that, would it have to be mw format? [14:42:47] i think parse could figure it out... [14:43:08] ottomata, aha, yea the code can be altered to support different formats I guess [14:43:10] hmm, you know, sigh, this data is just the mysql data, [14:43:19] i could just add a timestamp [14:43:20] hmm [14:43:29] if it doesn't exist [14:43:36] ottomata, doesn't it have a timestamp? [14:43:38] i was thinking about setting it to the value of meta.dt [14:43:44] omw fdans [14:43:46] it does, but its not called 'timestamp' [14:43:56] it is meta_dt, and in 8601 format [14:43:56] how is it called? [14:44:00] I see [14:44:15] ottomata, well, it could be a parameter of the script [14:44:16] eventbus schemas don't have the top levle capsule, they have the subobject meta schema [14:44:34] I see [14:44:46] mforns: maybe it could be a list of parameters to look for timestamps, in order [14:44:53] if timestamp, use that, if meta_dt, use that, etc. [14:44:54] ? 
[14:44:59] we could do like: --timestamp-fields=timestamp,meta_dt [14:45:07] yea, and then use them in that order? [14:45:12] if they exist? [14:45:16] yes, for example [14:45:23] meta_dt doesn't have an index though...i could fix that [14:45:28] aha [14:45:48] but, to do it generically have to add indexexes to all date-time fields in a schema [14:46:06] i guess i could add config too [14:46:15] --index-fields=timestamp,meta_dt [14:46:15] ottomata, how are these tables? are they big? [14:46:15] :/ [14:46:27] mforns: i think the biggest is revision-create, it'll be as big as edit i guess [14:46:31] about 20 events / sec [14:46:47] aha [14:47:47] ottomata, I didn't get the --index-fields=timestamp,meta_dt ? [14:48:07] thank you ottomata! [14:48:39] ema :) [14:48:51] you can wait 30 minutes, ooorrr run puppet on analytics1001 and stat1004 and/or stat1002 :) [14:49:04] mforns: that would be foir the mysql consumer [14:49:05] * ema can't wait and runs puppet [14:49:21] to tell it which fields it should add indexes on when it creates tables, if those fields exist in the schema [14:49:41] ottomata, I see [14:49:44] right now, timestamp gets an index because its jsonschema format is utc-millisec [14:49:52] aha [14:49:53] and that gets mapped to {'type_': MediaWikiTimestamp, 'index': True} [14:50:13] i could do the same for date-time formats [14:50:34] but then all date-time fields (there are 2 or 3) would ahve indexes [14:50:37] ottomata, would this be a change that blocks you, or could you wait until start of next quarter? [14:50:48] I se [14:50:50] see [14:51:09] mforns: i'm doing this work on the side for https://phabricator.wikimedia.org/T150369 [14:51:24] soooo, i think it shouldn't hold block you from proceeding with purging [14:51:34] i'm just trying to make it easy for kaldari to get to some eventbus data in mysql [14:51:51] mmh, spark-shell still doesn't work properly even though I'm now a member of analytics-privatedata-users [14:51:59] spark-shell --master yarn --executor-memory 4G --driver-memory 4G --executor-cores 1 [14:52:03] [...] [14:52:08] :16: error: not found: value sqlContext import sqlContext.implicits._ [14:52:59] ? ema where are you running that? [14:53:03] stat1004 [14:53:37] ottomata, what I've been doing since yesterday is to change the EL purging script to not use uuids, but I had to do major changes, and still need to rewrite the tests [14:53:44] ema did you run puppet on analytics1001? [14:53:48] too? [14:53:54] ottomata: nope, doing that now [14:53:56] k [14:54:11] aye [14:54:24] sigh, this purging things sucks :/ [14:54:56] on the other hand, the version Luca wrote works fine, and I think the concern of the dbas (limit offset) will not have a bit impact [14:55:16] we can talk is PS [14:55:50] ottomata: still no luck [14:55:50] :16: error: not found: value sqlContext [14:55:54] import sqlContext.sql [14:58:41] looking.. [14:59:05] you don't have an hdfs home dir, but puppet on analytics1001 should have created it [15:00:00] you did not run puppet on analytics1001! [15:00:04] Notice: /Stage[main]/Cdh::Hadoop::Users/Exec[create_hdfs_user_directories]/returns: 2017-06-15T14:59:33 hdfs dfs -mkdir /user/ema && hdfs dfs -chown ema:ema /user/ema [15:00:04] haha [15:00:08] ema try again! :) [15:00:32] ottomata: I did! [15:00:57] hmm, i guess you do the proper offering dance before you did [15:02:03] ottomata: uhuh, it works! 
thank you :) [15:02:29] 10Analytics, 10Operations, 10Ops-Access-Requests, 10Patch-For-Review: analytics-privatedata-users access for ema - https://phabricator.wikimedia.org/T167952#3351654 (10ema) 05Open>03Resolved a:03ema Done! [15:03:02] gr8 :) [15:04:02] ema: have fun ;) [15:07:55] joal: after this morning's workshop I'm gonna be super proficient! [15:08:00] :D [15:13:13] milimetric: https://wikitech.wikimedia.org/wiki/Analytics/Systems/EventLogging/Schema_Guidelines [15:13:20] • Time should always be stored in a field called timestamp, in ISO [15:13:26] why 'timestamp' in ISO 8601? [15:13:34] our convention is to refer to 8601 fields as dt [15:14:14] 10Analytics, 10Analytics-EventLogging, 10Contributors-Analysis, 10EventBus, and 5 others: Record an EventLogging event every time a new content namespace page is created - https://phabricator.wikimedia.org/T150369#3351679 (10Milimetric) Just FYI, there are a good amount of rev_parent_id = 0 that do not rep... [15:15:59] ottomata: that's fine, but I was thinking if we remove the capsule, should we allow people to use "timestamp"? [15:16:06] I can change it to dt [15:17:40] done [15:18:45] ping ottomata can you come back to standup? [15:44:42] 10Analytics, 10Operations, 10Traffic: Increase request limits for GETs to /api/rest_v1/ - https://phabricator.wikimedia.org/T118365#3351748 (10Nuria) Thanks for the prompt response, when the number of changes I did not see when these took effect, it is true that we do not see on our end 429s at all times, bu... [15:46:01] 10Analytics, 10Operations, 10Traffic: Increase request limits for GETs to /api/rest_v1/ - https://phabricator.wikimedia.org/T118365#3351751 (10Nuria) If you look at 404s however, looks like the throttling had a positive effect on removing "garbaage-y" traffic. [15:46:37] ping fdans : groskinnn [15:46:48] omg sorry [15:50:49] 10Analytics-Kanban, 10Operations, 10User-Elukey: New analytic hosts with BBU learning cycle enabled - https://phabricator.wikimedia.org/T167809#3351771 (10Nuria) [15:51:00] 10Analytics-Kanban, 10Operations, 10User-Elukey: New analytic hosts with BBU learning cycle enabled - https://phabricator.wikimedia.org/T167809#3345083 (10Nuria) Puting on kanban for @elukey to look at [15:53:50] 10Analytics: Refactor puppet code for the Hadoop Analytics cluster to roles/profiles - https://phabricator.wikimedia.org/T167790#3351778 (10Nuria) p:05Normal>03Low [15:55:59] 10Analytics, 10Analytics-Wikistats: Remove links in wikistats to minnan.wikipedia.org - https://phabricator.wikimedia.org/T107250#3351788 (10Nuria) p:05Normal>03Low [15:58:44] 10Analytics, 10Analytics-Wikistats: Remove links in wikistats to minnan.wikipedia.org - https://phabricator.wikimedia.org/T107250#1490654 (10Nuria) let's wait until we do kafka upgrade. [15:58:59] 10Analytics: Send burrow lag statistics to statsd/graphite {hawk} - https://phabricator.wikimedia.org/T120852#1862859 (10Nuria) p:05Normal>03Low [15:59:30] 10Analytics-Kanban: Measure portal and hovercard pageviews - https://phabricator.wikimedia.org/T162618#3351819 (10Nuria) [16:05:42] 10Analytics, 10MediaWiki-API: Copy cached API requests from raw webrequests table to ApiAction - https://phabricator.wikimedia.org/T155478#3351862 (10Nuria) @tgr, this will benefit from changes happening on tagging of requests. We can tag requests that need to be "copied" easily and i think it will be trivil... 
[16:06:32] 10Analytics, 10MediaWiki-API: Copy cached API requests from raw webrequests table to ApiAction - https://phabricator.wikimedia.org/T155478#3351863 (10Nuria) We think this work can happen next quarter. [16:09:40] 10Analytics, 10MediaWiki-API, 10RESTBase-API, 10Services: Top API user agents stats - https://phabricator.wikimedia.org/T142139#3351902 (10Nuria) Ping @tgr what is the status of this? [16:11:36] 10Analytics, 10MediaWiki-API, 10RESTBase-API, 10Services: Top API user agents stats - https://phabricator.wikimedia.org/T142139#3351906 (10Nuria) [16:11:40] 10Analytics, 10MediaWiki-API: Copy cached API requests from raw webrequests table to ApiAction - https://phabricator.wikimedia.org/T155478#3351905 (10Nuria) [16:11:54] 10Analytics, 10MediaWiki-API: Copy cached API requests from raw webrequests table to ApiAction - https://phabricator.wikimedia.org/T155478#2944303 (10Nuria) Linking to task T142139 cause i think is realted, @tgr let us know otherwise [16:15:16] 10Analytics, 10Easy: Investigate requests flagged as pageview in analytics header coming from bots - https://phabricator.wikimedia.org/T135251#3351942 (10Nuria) This sounds like googlebot crawling the app and sending traffic as a user would, I do not see anything that we can do on our end to prevent that, fixe... [16:15:39] 10Analytics, 10Research-and-Data: [Open question] Improve bot identification at scale - https://phabricator.wikimedia.org/T138207#3351950 (10Nuria) [16:15:41] 10Analytics, 10Easy: Investigate requests flagged as pageview in analytics header coming from bots - https://phabricator.wikimedia.org/T135251#3351949 (10Nuria) [16:16:16] 10Analytics, 10Easy: Investigate requests flagged as pageview in analytics header coming from bots - https://phabricator.wikimedia.org/T135251#2293216 (10Nuria) Action item for analytics is to verify that indeed all this requests are coming from apps. [16:16:41] 10Analytics-Kanban, 10Easy: Investigate requests flagged as pageview in analytics header coming from bots - https://phabricator.wikimedia.org/T135251#3351952 (10Nuria) [16:18:19] 10Analytics: Quantify false positives when filtering for number of distinct user agents per page in top pages computation - https://phabricator.wikimedia.org/T146911#3351965 (10Nuria) [16:18:21] 10Analytics, 10Research-and-Data: [Open question] Improve bot identification at scale - https://phabricator.wikimedia.org/T138207#3351964 (10Nuria) [16:22:14] 10Analytics, 10Analytics-Wikistats, 10Wikimedia-Site-requests: Add li: Wikibooks to Wikistats - https://phabricator.wikimedia.org/T165634#3351993 (10Ooswesthoesbes) Alright, that is promising. [16:25:41] 10Analytics: Put data needed for edits metrics through Event Bus into HDFS - https://phabricator.wikimedia.org/T131782#2178434 (10Nuria) p:05Normal>03Low [16:27:41] 10Analytics: Meta-statistics on MediaWiki history reconstruction process - https://phabricator.wikimedia.org/T155507#3352011 (10Nuria) p:05Normal>03High [16:34:51] 10Analytics, 10Analytics-Cluster: Hadoop: Add a lower priority queue: nice queue - https://phabricator.wikimedia.org/T156841#3352022 (10Nuria) [16:36:00] 10Analytics-Cluster, 10Analytics-Kanban: Hadoop: Add a lower priority queue: nice queue - https://phabricator.wikimedia.org/T156841#2987445 (10Nuria) [16:36:39] 10Analytics-Cluster, 10Analytics-Kanban: Hadoop: Add a lower priority queue: nice queue - https://phabricator.wikimedia.org/T156841#2987445 (10Nuria) Looks like this is couple hours of work and its benefit is clear. 
[16:39:27] 10Analytics, 10Analytics-Cluster, 10Patch-For-Review: Kafka mirror maker failures when kafka brokers are restarted - https://phabricator.wikimedia.org/T157705#3013836 (10Nuria) As part of kafka upgrade mirrormaker will get a revamp [16:40:28] 10Analytics-Cluster, 10Analytics-Kanban, 10Patch-For-Review: Make oozie work with spark jobs that use HiveContext - https://phabricator.wikimedia.org/T94596#3352052 (10Ottomata) [16:40:30] 10Analytics: Unlock Spark with Oozie - https://phabricator.wikimedia.org/T159961#3352054 (10Ottomata) [16:43:05] a-team, hangouts not responding for me [16:43:21] 10Analytics, 10Analytics-EventLogging, 10Contributors-Analysis, 10EventBus, and 5 others: Record an EventLogging event every time a new content namespace page is created - https://phabricator.wikimedia.org/T150369#3352065 (10Ottomata) Are these just historical artifacts, or is it possible for newly created... [16:44:47] 10Analytics, 10Patch-For-Review: Sort inconsistency in AQS timestamp behavior - https://phabricator.wikimedia.org/T160311#3352066 (10Nuria) Do we need to version api for such a change? (it will be a breaking change) [16:45:49] 10Analytics: Serbian Wikipedia edits spike 2016 - https://phabricator.wikimedia.org/T158310#3352069 (10Nuria) 05Open>03Resolved [16:46:55] a-team, no way I can connect to the batcave... [16:47:06] 10Analytics-Kanban: Update undocumented EventLogging mediawiki hooks - https://phabricator.wikimedia.org/T158331#3352074 (10Nuria) a:03Ottomata [16:47:17] are you guys still in da cave? [16:47:31] we're done mforns [16:47:37] ok [16:48:05] fdans: I've gotta get lunch and sort out my computer, but let's aim at finishing 30-ish points this week. So far we got 13 [16:48:39] I think I can do the AQS API one, but it's a lot simpler than originally thought so maybe I'll move it down to 5 (still counts as "finishing" 8 as far as our plan is concerned) [16:49:12] haha sure [16:49:22] so then we'd need another 8 pointer or so. Depending on how you do with Detail, I can grab that after you're done tomorrow or do something else [16:49:24] milimetric: I think we can do 30 [16:49:36] k, let's sync up tomorrow morning again [16:49:37] detail's going goood [16:49:45] good good, then maybe I'll grab something else [16:49:59] I'm at this beautiful stage of starting to become "one with the js framework" [16:50:17] I like vue [16:51:24] 10Analytics, 10Analytics-Wikistats, 10Wikimedia-Site-requests: Add li: Wikibooks to Wikistats - https://phabricator.wikimedia.org/T165634#3352079 (10Nuria) FYI that this is in deprioritized because any work on wikistats old ui is deprioritized while work continues on new UI. 
[16:53:14] honestly I think vue 2.0 is a lot closer to react than I initially realized [16:53:15] 10Analytics-Kanban: Measure portal and hovercard pageviews - https://phabricator.wikimedia.org/T162618#3168814 (10Nuria) [16:53:35] because they got rid of computeds bubbling out of the children and now it really just works like react with a little light reactivity sprinkled on [16:53:39] 10Analytics-Kanban: Measure portal pageviews - https://phabricator.wikimedia.org/T162618#3168814 (10Nuria) [16:53:49] but that's fine, works for me [16:53:58] 10Analytics: Measure portal pageviews - https://phabricator.wikimedia.org/T162618#3168814 (10Nuria) [16:55:35] 10Analytics, 10Analytics-EventLogging, 10Contributors-Analysis, 10EventBus, and 5 others: Record an EventLogging event every time a new content namespace page is created - https://phabricator.wikimedia.org/T150369#3352095 (10Niharika) I could be wrong but here's the queries I ran on recentchanges on enwiki... [17:00:45] ottomata: sorry, i logged myself out!!! [17:00:49] ottomata: duh [17:17:50] 10Analytics, 10Analytics-EventLogging, 10Contributors-Analysis, 10EventBus, and 5 others: Record an EventLogging event every time a new content namespace page is created - https://phabricator.wikimedia.org/T150369#3352160 (10kaldari) Hmm, not sure what to make of the results from recentchanges. That's real... [17:21:59] 10Analytics, 10Analytics-EventLogging, 10Contributors-Analysis, 10EventBus, and 5 others: Record an EventLogging event every time a new content namespace page is created - https://phabricator.wikimedia.org/T150369#3352175 (10MusikAnimal) I'm not sure about `recentchanges` but going by `rev_parent_id = 0` i... [17:35:40] 10Analytics, 10Analytics-EventLogging, 10Contributors-Analysis, 10EventBus, and 5 others: Record an EventLogging event every time a new content namespace page is created - https://phabricator.wikimedia.org/T150369#3352258 (10Ottomata) Ya @Niharika it might be worth checking the revision table instead of re... [17:37:35] 10Analytics, 10Analytics-EventLogging, 10Contributors-Analysis, 10EventBus, and 5 others: Record an EventLogging event every time a new content namespace page is created - https://phabricator.wikimedia.org/T150369#3352264 (10kaldari) Here's a page that has 9 revisions (out of 12) with `rev_parent_id = 0`:... [17:55:43] 10Analytics, 10Analytics-Cluster, 10Operations, 10ops-eqiad: rack/setup/install kafka100[4-9].eqiad.wmnet - https://phabricator.wikimedia.org/T167992#3352337 (10RobH) [17:59:07] heya mforns still around? [17:59:13] ottomata, yea [17:59:17] wazzup [17:59:43] just talked to nuria a bit about eventbus in mysql purging stuff [17:59:47] i think we don't need to worry about it [17:59:51] the data there is 'public' ish anyway [17:59:56] in that we'd expose it in eventstreams anyway [17:59:57] so [18:00:00] aha [18:00:06] the tables will be created in the same database [18:00:09] but will have different schemas [18:00:21] can you put together your list of schemas to purge from the mysql meta info db? [18:00:22] e.g. [18:00:36] select tables where database = log and table has field timestamp and id (or uuid) [18:00:37] ? [18:00:49] ottomata, actually, after doing some performance tests of the purging script I'd say we'll need to change the code not to use uuids anyway [18:00:56] thats good! [18:00:56] :) [18:02:02] ottomata, don't understand your question about list of schemas? 
[18:02:42] 10Analytics, 10Analytics-Cluster, 10Operations, 10ops-eqiad: rack/setup/install kafka100[4-9].eqiad.wmnet - https://phabricator.wikimedia.org/T167992#3352404 (10Ottomata) This will be a totally different cluster than the nodes in 1001-1003, or the 1012-1022ish nodes in the analytics cluster. Can we someho... [18:03:06] mforns: [18:03:07] ok [18:03:13] if tables exist in log db [18:03:20] aha [18:03:21] that do not have a timestamp or an id fileld [18:03:25] how will your script handle that? [18:03:32] break [18:03:34] :] [18:03:45] right so we need a way to tell it whihc tables to consider, or which ones to ignore [18:03:51] I see [18:03:52] we could provide a blacklist of tables to ignore in config [18:03:52] OR [18:04:02] we could do a little db reflection and examine the schemas of the tables [18:04:09] so, if table does not have the fields you need to do purging [18:04:10] skip it [18:04:21] do those tables have a schema_revision name structure? [18:06:22] meta_schema_uri [18:06:27] which looks like [18:06:39] 'mediawiki/revision/create/1' [18:08:05] ottomata, the table names are quite different, we could use their format to distinguish them [18:08:09] with a regexp or so [18:08:11] true [18:08:40] if you wanna do a quick and dirty, that's fine with me :) [18:08:46] we have to remember to check if we add new tables in the future... [18:08:46] the table will look like that: [18:08:50] mediawiki_revision_create_1 [18:09:02] oh! with underscores [18:09:04] ok [18:09:09] all the ones i'll be importing (for now) will start with mediawiki_ [18:09:14] I see [18:09:48] mmmm, no they are too similar, no? [18:10:09] those and EL tables? [18:10:10] if an EL user creates a new schema named Mediawiki... [18:10:12] yes [18:10:17] well, lower case? [18:10:21] yea [18:10:24] maybe since wikipages are upper case it'll be ok? [18:10:25] dunno [18:10:34] i mean, yeah, it'll probably work, but there might be cases where it breaks [18:10:39] the list of tables i will be importing for now is small [18:10:42] 3 or 4 [18:11:09] aha [18:12:51] ottomata, yea, I think it would be better to do introspection, as you mentioned [18:13:42] table must have a field named timestamp, and at least a field prefixed with 'event_' [18:14:26] mmm but still there can be problems... [18:15:00] yeah, but that will probably cover it [18:15:27] aha [18:16:50] Hey nuria_ [18:17:11] mforns: and you can get the tables from information_schema db [18:17:12] select COLUMN_NAME from COLUMNS where TABLE_SCHEMA='log' and TABLE_NAME = 'NavigationTiming_10076863'; [18:17:14] e.g. [18:18:24] or even [18:19:06] select TABLE_NAME from COLUMNS where TABLE_SCHEMA='log' and COLUMN_NAME in('id','timestamp'); [18:19:20] select DISTINCT TABLE_NAME from COLUMNS where TABLE_SCHEMA='log' and COLUMN_NAME in('id','timestamp'); [18:19:55] oh no, that's all tables with at least one of those fields [18:19:57] something like that ^ [18:20:40] joal: yessir [18:20:59] nuria_: Wanna double check new per-domain values and send email? [18:21:13] joal: sure. batcave? 
[18:21:25] nuria_: very noticeable bump on offset only, noticeable only on estimate [18:21:28] sure [18:22:26] a-team - So that we don't forget - http://www.commitstrip.com/en/2017/06/15/party-time/?setLocale=1 [18:24:01] haha [18:25:24] 10Analytics, 10Analytics-EventLogging, 10Contributors-Analysis, 10EventBus, and 5 others: Record an EventLogging event every time a new content namespace page is created - https://phabricator.wikimedia.org/T150369#3352537 (10Nuria) Also, FYI To @kaldari that edit count is being added to data lake, when yo... [18:27:32] 10Analytics, 10Analytics-EventLogging, 10Contributors-Analysis, 10EventBus, and 5 others: Record an EventLogging event every time a new content namespace page is created - https://phabricator.wikimedia.org/T150369#3352570 (10brion) I have the impression rev_parent_id isn't reliable but don't offhand recall... [18:28:29] ottomata: i dislike the just incremeenting the number for a difference service cluster [18:28:40] ive been sitting here since your update trying to think if a better hostname [18:28:54] so its no different than the other kafka servers, except its a different service shard? [18:29:01] or is it inherently a different kind of cluster? [18:29:29] we tend to try to increment hostname numbers sequentially [18:29:42] you guys just bucked the trend in kafka and we should avoid adding to its differences [18:29:48] (imo) [18:32:33] robh heheh, we KINDA bucked the trend on that one, there was an excuse! [18:32:37] butya, one sec... [18:32:57] yeah i know the excuse i jsut dont think it was worth the difference, but its done so no reason to worry about the past [18:33:08] i just rather not continue to have you deviate on the standards if we can help it [18:33:19] im updating the task with a slightly more elequently phrased version [18:34:55] 10Analytics, 10Analytics-Cluster, 10Operations, 10ops-eqiad: rack/setup/install kafka100[4-9].eqiad.wmnet - https://phabricator.wikimedia.org/T167992#3352624 (10RobH) @Ottomata that is not how we denote different clusters for any other hostnames on the cluster, so it seems bad to have kafka/analytics diffe... [18:35:13] im not exactly sure how your kafka systems differ [18:35:34] are they like our DB systems in that they are all mostly identical software stacks but just service different service shards? [18:35:35] 10Analytics, 10Operations, 10WMF-Legal, 10Privacy: Honor DNT header for access logs & varnish logs - https://phabricator.wikimedia.org/T98831#3352626 (10Nuria) FYI, Our privacy policy does mention we do not honor DNT. [18:39:50] ottomata: also if you still like your idea of incremetnin the number and faidon or mark says its ok, then it overrules the fact it deosnt match anything else in our standards. i dont have a personal stake in this, but i just try to keep things consistent unless my manager tells me otherwise ;] [18:40:21] so please please dont think im personally upset or invested in this, i most certainly am not ;] (people tend to think i care about this far more than i actually do) [18:40:34] i just end up having to talk about it a lot since i make the racking tasks =P [18:40:59] so you can disagree with me, get them to approve it, and im not going to be mad at you! [18:41:09] (hey sorry, with you in a min...) 
[18:41:12] no worries [18:41:22] actually, ok so [18:41:27] chris is out of the datacenter this afternoon so he wont be getting to racking this until tomorrow [18:41:30] the kafka clusters are logically different [18:41:38] 10Analytics, 10Operations, 10WMF-Legal, 10Privacy: Honor DNT header for access logs & varnish logs - https://phabricator.wikimedia.org/T98831#3352638 (10Nuria) https://wikimediafoundation.org/wiki/Privacy_policy/FAQ#DNTFAQ , if we were to do it i just found recently about the w3 api on this regard: https:/... [18:41:42] they are sorta used for different purposes [18:42:08] so not like our db systems, so i can see why you wouldnt like just incremeenting numbers in a close range [18:42:14] Done nuria_ [18:42:15] Yay ! [18:42:27] i'd advise we differ the hostname then [18:42:36] perhaps with the kafka-sc[ab] or something else like it? [18:42:41] the nodes in a cluster can't be joined with another cluster [18:42:42] hmmm [18:42:44] not a bad idea [18:42:46] and [18:42:51] we do have to name the cluster logically [18:42:53] in config, etc. [18:42:57] this one we haven't picked the name yet [18:43:00] yeah but would likely make more sense [18:43:02] but it may be 'aggregate' [18:43:03] for human readable [18:43:10] joal: "Analytically yours," jajajaja [18:43:11] https://etherpad.wikimedia.org/p/analytics-ops-kafka [18:43:14] maybe 'jumbo' [18:43:14] :) [18:43:16] maybe 'mothership' [18:43:16] :) [18:43:18] haha [18:43:44] im just boring [18:43:48] i like kafka-sc[ab] hehe [18:43:59] well, SC is service cluster? [18:44:02] not really what these are [18:44:08] (SC is service cluster, right?) [18:44:12] yeah [18:44:16] but, yeah, maybe we need a prefix that denotes they are kafka nodes [18:44:34] but a suffix that names the cluster succinctly [18:44:57] let's use 'main' as the example [18:44:58] well, service cluster just means a cluster dediacateed to a spefific service [18:45:00] since that cluster exists [18:45:02] or even analytics too [18:45:02] ottomata: greg? [18:45:07] so [18:45:10] haha [18:45:17] kafka-greg [18:45:22] if we had named the main or analytics clusters this way [18:45:25] what would we have called them [18:45:28] ka-main1001? [18:45:31] so i'd rename the main kafka cluster to kafla-sca1XXX, the next on e [18:45:32] kafka-main1001 [18:45:33] ? [18:45:37] since service cluster doesnt mean serices team [18:45:42] if kafka preceeds it [18:45:46] but anthing like that seems fine [18:45:58] riiiight, but i thought the SC just meant that those clusters ran multiple services [18:46:03] kafka-main1XXX, kakfa-whatever this is [18:46:11] kafka-analytics1001 [18:46:11] ? [18:46:12] true, yeah i have no issue iwth kafka-main [18:46:24] it'd be nicer to type something shorter...buuuut hm [18:46:34] kafka-an1XXX [18:46:35] should we do like we do with cache and db hosts? 
[18:46:46] makes sense, an can mean analytics [18:46:46] ka-main1001 [18:46:51] kamain [18:46:55] kaanalytics [18:46:56] yuck [18:46:57] i think its better to shorten the cluster of kafka [18:46:59] than remove kafka [18:47:02] kafka-agg [18:47:05] kafka-ana [18:47:07] kafka-abbreviation [18:47:08] yeah [18:47:08] kafka-anal [18:47:09] haha [18:47:13] do you remember [18:47:16] i was avoiding anal inteiotnal [18:47:18] that we tried to get yall to name the analytics nodes that [18:47:19] analinterns [18:47:20] hahah [18:47:20] yeah [18:48:12] so the primary cluster leverages kafla to do X, and this will leverage for Y, so i think kafka-x and kafka-y, and abbreviate the x and y as best as possible [18:48:22] kafka-main1XXX, kafka-an1XXX? [18:48:39] and all remanenet kafka1XXX are for kafka-main right? [18:48:50] yes [18:48:58] milimetric: your schema guidelines are awesome :) [18:49:09] that makes a lot more sense to me, but if you wanna discuss in team and update the task later that is also fine =] [18:49:17] 10Analytics, 10Analytics-EventLogging, 10Contributors-Analysis, 10EventBus, and 5 others: Record an EventLogging event every time a new content namespace page is created - https://phabricator.wikimedia.org/T150369#3352668 (10Niharika) I ran @Milimetric's query on enwiki with an additional rev_timestamp whe... [18:49:17] ok, we may just go with long name [18:49:25] i'm usually a fan of them, just not so much in node names for some reason [18:49:33] if you do, just know the physical label wil be kafka-an [18:49:33] maybe we'll call this new cluster jumbo [18:49:36] not kafka-analytics [18:49:40] too long for the front label [18:49:40] :) good [18:49:41] kafka-jumbo1001 [18:50:06] thats fine by me, it meets the rest of the cluster standards afaikt [18:50:11] ok [18:50:15] erggghh [18:50:16] remove the rogue t at the end of that [18:50:16] ok well [18:50:16] heh [18:50:30] that means that we need to decide on the cluster name soon [18:50:32] before they are provisioned [18:50:36] luca is out til next week [18:50:40] i def need his input on it [18:52:04] haha, joal, greg [18:52:05] oh man [18:52:16] 10Analytics, 10Analytics-Cluster, 10Operations, 10ops-eqiad: rack/setup/install kafka100[4-9].eqiad.wmnet - https://phabricator.wikimedia.org/T167992#3352673 (10RobH) IRC Update: Otto is going to chat with the rest of the folks involved in analytics, but we're leaning towards the following: kafka1XXX => d... [18:52:17] kafka-ggreg [18:52:18] kafka-ag [18:52:22] maybe ag is good enough [18:52:23] ottomata: well, at worst case [18:52:26] we can rack them with asset tags only [18:52:40] but then no setup can happen other than the bare onsite minimum [18:52:47] but thats good enough for us to get them remotely accessible [18:53:03] just need to confirm my racking proposal is fine, and what vlans these may need [18:53:12] ok cool [18:53:14] thanks robh [18:53:15] i said put them all in different racks, and spread across all 4 rows [18:53:25] i'll read the ticket more and respond [18:54:14] cool [18:54:16] 10Analytics, 10Analytics-Cluster, 10Operations, 10ops-eqiad: rack/setup/install kafka100[4-9].eqiad.wmnet - https://phabricator.wikimedia.org/T167992#3352682 (10RobH) If we cannot settle on hostnames before Chris goes to rack, we can set these up with asset tag mgmt dns entries only, and not put the hostna... 
[18:54:37] 10Analytics, 10Analytics-Cluster, 10Operations, 10ops-eqiad: rack/setup/install kafka100[4-9].eqiad.wmnet - https://phabricator.wikimedia.org/T167992#3352683 (10Ottomata) > Need input from @Ottomata on which vlans these 6 new hosts will use, as it will help determine row. Not in analytics vlan. These shou... [18:57:49] 10Analytics, 10Analytics-EventLogging, 10Contributors-Analysis, 10EventBus, and 5 others: Record an EventLogging event every time a new content namespace page is created - https://phabricator.wikimedia.org/T150369#3352687 (10aaron) I'm not sure what makes it diverge, but maybe the population scripts could... [18:59:15] 10Analytics-Kanban, 10Patch-For-Review, 10User-Elukey: Improve purging for analytics-slave data on Eventlogging - https://phabricator.wikimedia.org/T156933#3352689 (10mforns) I did some performance tests. I executed (by hand) the mentioned SELECT/LIMIT/OFFSET query on analytics-store.eqiad.wmnet the exact s... [19:02:34] 10Analytics, 10Analytics-Cluster, 10Operations, 10ops-eqiad: rack/setup/install kafka100[4-9].eqiad.wmnet - https://phabricator.wikimedia.org/T167992#3352700 (10Ottomata) I'll have to check with @elukey to finalize a cluster name. That'll have to wait til next week, sorry. FYI, our brainbounce of names i... [19:04:36] joal: oh, i left a comment somewhere about unique device naming [19:04:41] don't remember where or if there was a response [19:04:48] somethign about per domain vs per project ambiguity [19:05:25] ottomata: I have seen it but decided that, given the same discussion had happened three times, I was not going to have it again :-P [19:06:44] haah ok, sorry, i missed the discussion! [19:06:48] is it somewhere i can read about it? [19:06:49] ottomata: issue with per-domain, per-project, project-class, host and so is that they all mean the same and different things depending on who says it [19:07:31] ottomata: I think we had it at graking or post-standup, and I thought you were there (my bad) [19:07:47] • unique_devices_project_wide_daily stores unique devices counts per project split by country per day [19:08:00] (it is possible i just wasn't listening :o ) [19:08:03] so my fault completely [19:08:07] huhuhu [19:08:09] so project wide is per project? [19:08:23] ottomata: project-wide is what I like to call project-class: *.wikipedia [19:08:26] and per_domain is per domain [19:08:26] ? [19:08:45] and per_domain is per detailed domain: en.m.wikipedia.org [19:08:54] shoudl the project table be called [19:09:00] unique_devices_per_project_daily [19:09:00] ? [19:09:03] for consistency? [19:09:10] (if I am too late, then I am too late...sorry) [19:09:15] ottomata: I hear you [19:09:19] just don't understand project_wide [19:09:27] ottomata: it's not too late [19:09:59] ottomata: per_domain is done now, and project_wide was trying to make sure that there were a difference between domain and project [19:10:22] because, for instance, in the AQS api, we use project for what is called domain here [19:10:32] hmmm [19:10:44] right, project is more like mediawiki database, right? [19:10:50] or, more mapped to [19:10:54] oh [19:10:55] no [19:10:59] wikipeida [19:10:59] OHHH [19:11:00] i get it [19:11:01] ok [19:11:02] so [19:11:03] ottomata: originally with pageview, yes, with uniques, not anymore [19:11:05] wikipedia is project wide [19:11:09] correct [19:11:10] en.wikipedia is a project [19:11:17] en.m.wikipedia.org is a domain [19:11:18] ? [19:11:30] well, not exactly [19:11:43] values not exact, but ideas? 
[19:11:55] in unique world, a project is a top domain (wikipedia, wiktionnary etc) [19:12:20] and a domain is a detailed domain (en.wikipedia.org, en.m.wikipedia.org) [19:12:26] ok [19:12:37] and 'english wikipedia' is not directly a project then? [19:12:47] But in pageview world, we call project the mediawiki_db entity [19:13:06] right, e.g. 'english wikipedia', enwiki [19:13:08] ok [19:13:14] so yeah, there is confusdion to be made here - we spent at least two sessions discussing around that [19:13:19] i'm sorry joal [19:13:26] you can ignore me if you like [19:13:34] yeah, so that's what I had always thought was a project, since that's how we defined it a while ago, i thought [19:13:37] I wanted to bring a new term for the project-wide notion (project_class was suggested) [19:13:40] like uhhh, projectview, right? [19:14:23] yeah rats [19:14:26] there project == aa.wikipedia [19:14:29] oof [19:14:31] 10Analytics, 10Analytics-Cluster, 10Operations, 10ops-eqiad: rack/setup/install new kafka nodes - https://phabricator.wikimedia.org/T167992#3352759 (10RobH) [19:14:34] so, we have a conflated definition of project? [19:14:50] don't worry ottomata, I now know the names and their context, but I aggree there is room for confusion - Idea was to try to have name that makes sense for people outside the analytics world, and project_class was not one of those [19:15:07] correct ottomata - project means different things in different contexts [19:15:10] project_class == project_wide == wikipedia [19:15:11] ? [19:15:18] yessir [19:15:27] 10Analytics, 10Analytics-Cluster, 10Operations, 10ops-eqiad: rack/setup/install new kafka nodes - https://phabricator.wikimedia.org/T167992#3352337 (10RobH) [19:15:35] and we name its dimension "project" in the table [19:15:41] ok, i see. who usually refers to 'wikipedia' as a project? vs. 'en wikipeida'? [19:16:29] milimetric and nuria_ said the communities and people less technicall do so [19:16:33] I trusted them : [19:16:36] :) [19:17:32] ottomata: If you want we can revive this naming thread for project_wide :) [19:17:36] 10Analytics, 10Analytics-Cluster, 10Operations, 10ops-eqiad: rack/setup/install new kafka nodes - https://phabricator.wikimedia.org/T167992#3352801 (10RobH) [19:18:20] ottomata: yeah, even people at wmf say "Wikipedia and its sister projects" [19:18:48] joal: i like the term project wide to describe what you got [19:19:02] for example: https://wikimediafoundation.org/wiki/Our_projects [19:19:03] i think maybe i don't like that the field 'project' in the table is inconsistent with our uses of project elsewhere [19:19:12] Then we rename the pageviews :) [19:19:52] boy the naming.. thsi was a hard one [19:19:57] :D [19:20:09] sorryyyyyy for being a late namecomer [19:20:25] for getting amnesia you mean? [19:20:33] :) [19:20:40] cause you were there on that discussion [19:21:08] i have the memory of a goldfish [19:21:11] cc ottomata [19:21:15] ya, i can relate [19:21:28] Oh ! a new sofa !@ [19:21:37] jajak [19:21:46] mine is more like memento [19:22:09] mine's worse - i only think I remember everything but i can never be sure [19:22:23] so anyway, to sum up, we have a project field on this new table that is not the same as project on other tables [19:22:29] milimetric: I have that one on directions in cities [19:22:31] because, otherwise folks outside of analytics will be confused? [19:22:49] q, is there a right answer/definition as to what project 'should' be? [19:23:02] did we get it wrong years ago when we picked 'en.wikipedia' to be a project? 
[19:23:18] ottomata: from milimetric's link, the correct definition is the one of uniques [19:23:30] oh missed the link, looking [19:23:31] ottomata: looks like a project as in pageview is incorrect [19:23:40] hm yeah [19:23:49] oof ok. [19:24:05] I think you're incorrectly assuming that there's consistency [19:24:13] :D [19:24:13] project is used to refer to both en.wiki and wiki [19:24:31] yeah [19:24:32] "Wiktionary is a project to create a multilingual free content dictionary in every language. This means each project seeks to use a particular language to define all words in all languages." [19:24:39] even on that page it does [19:24:41] yep [19:24:59] Well at least if there's no consistency, may we have availability and partition tolerance [19:25:07] hahaha [19:25:10] no! [19:25:11] haha [19:25:13] :d [19:26:03] ok, so all suggestions/ideas i'm making now should be taken with the fact that I am aware that we are far along in this project (har har), so i'm not saying we should do something [19:26:08] just want to at least know what the ideal is [19:26:17] so, if project is not consistently defined externally [19:26:20] of the analytics team [19:26:26] it seems like we had defined it years ago [19:26:31] and had been using it in a certain way [19:26:35] why would we change now? [19:26:43] project_class sounds pretty good to me........... [19:27:07] ottomata: I liked it as well, but it's obscure to external hears [19:27:28] external ears such as? ones that cannot be educated with docuemntation? [19:27:33] maybe our definition will be come the defacto ones [19:28:34] "As of November 2016, there are over 183,000 entries in 89 Wikiquote language projects" [19:29:33] q: IF we were to use the higher level definition of project, e.g. wikipedia, what would we (like) to change the pageview field name to? [19:29:36] project_variant? [19:29:57] pageview_info['project'], [19:29:57] pageview_info['language_variant'], [19:30:01] hmm language_variant [19:30:07] another q: [19:30:23] are there types of project variants other than 'language'? (this i'm sure we must have talked about before) [19:31:05] oof [19:31:07] wait [19:31:07] yeah, there are language variants [19:31:09] in pageview_info [19:31:11] in webrequest [19:31:11] map containing project, language_variant and page_title [19:31:15] ottomata: language_variants is not a project [19:31:15] like zh-[variant] [19:31:19] ahhhh [19:31:20] ok phew [19:31:21] correct [19:31:25] ok, so our name is concistent [19:31:26] got it [19:31:27] sorry [19:31:29] np [19:31:33] it's tricky ! [19:32:00] ok, yeah, so if we had to design these schemas now brand knew knowing everything that we do [19:32:03] what would we choose? [19:32:08] new* [19:32:30] hm [19:32:58] project = wikipeida, project_variant = zh, language_variant = 'xx' [19:32:58] OR [19:32:59] project_class = wikipedia, project = zh.wikipedia, language_variant = 'xx' [19:32:59] ? [19:33:05] A) or B) or something else? [19:33:12] I think I'd use wiki as what a pageview call project, and project for the top level one, but wiki sounds weird [19:33:31] it maps to a media wiki 'instance' database [19:33:36] so kinda makes sense [19:33:38] but yeah is weird [19:33:47] also, who knows, maybe we'll have projects that aren't wikis! :p [19:33:53] :) [19:34:15] so joal, if you had to choose, you'd pick project as high level [19:34:18] e.g. 
wikipedia [19:34:18] I'd go for B, but I'm biased toward the existing names [19:34:24] i'm also biased that way [19:34:36] especially since it isn't well defined elsewhere [19:34:40] but we have it well defined :) [19:34:53] But from the examples milimetric gave, I'd go for A with a different (better) name for wiki/project_variant [19:35:02] the examples milimetric gave? [19:35:09] on that page? [19:35:09] the link [19:35:09] yup [19:35:12] yeah, but that page isn't consistent either [19:35:16] true [19:35:44] We, the a-team, will bring naming consistency to this world of infamous inconsistent namers ! :-P [19:35:50] :D [19:35:55] it refers to projects top level and language level [19:36:00] it also refers to language level as 'edition' [19:36:03] which is kinda nice [19:36:03] hehe [19:36:07] yes I know, I have heard it this way many times [19:36:13] C) project = wikipedia, edition = en.wikipedia [19:36:15] edition is nice [19:36:35] agree, but it does not match my biases [19:36:41] project_edition [19:36:42] mine either [19:36:49] I liked project_class [19:36:50] project_class is pretty good [19:36:53] yes [19:37:00] and is backwards compatibile :) [19:37:11] we've done too much mathematics and group theroy [19:37:24] or OO programming? [19:37:29] hehe [19:37:39] class WikipediaProject [19:37:44] new WikipeidaProject('en') [19:38:22] ok, joal, so......>.>>>>>>. why not project_class then? because someone else will be confused (again, tell me if I am too late for this) [19:38:26] en.wikipedia % project == wikipedia [19:38:27] (i can drop it) [19:39:12] ottomata: project_wide, data is computed but not public, so we can still change - but I lost that same battle for the sake of external clarity [19:39:48] ottomata: I'm the one having the same bias you have, so you'd need to convince nuria_ and milimetric :) [19:39:48] with nuria? heheh [19:39:59] i'm fine with the term project wide too= [19:40:04] as a dataset name [19:40:06] but not a field name [19:40:22] so, internally our tables could aybe be [19:40:34] per_project_class with a field name project_class [19:40:39] but API can say project wide [19:40:40] ? [19:40:55] milimetric: ? [19:41:06] i thougth milimetric was on our side from comments above... :) [19:41:46] hey what [19:41:49] what's going on [19:42:09] we wanna use "project_class" for "wikipedia"? [19:42:17] everywhere? or just in once place ottomata ? [19:43:00] milimetric: AFAIK we only have that place so far [19:43:20] milimetric: i just dont' want to use project to mean different things in different places [19:43:33] everywhere else, project is 'en.wikipedia' [19:45:29] ok, cool, so it's project_class? not project_type or something else? [19:46:15] I support, either way, that's fine but we should pick one and use it always [19:46:28] we should also look through our code and maybe refactor anything there [19:46:53] agree, project_class sounds nicer [19:46:56] looking at some webrequest fields [19:47:08] it would be nice if we had a little bit of a loose consistency between what 'class' vs 'type means' [19:47:11] agent_type [19:47:13] referer_class [19:47:24] class to me refers to a larger classification of the thing [19:47:29] which makes sense here [19:47:43] what's the project class of the en.wikipedia project? wikipedia [19:48:23] ottomata: agreed about agent_type --< agent_class [19:48:52] haha [19:48:52] https://en.wikipedia.org/wiki/Class_(biology) [19:49:44] milimetric: I am for project_class, if you and joal are. also, you can tell me to stop bothering yall about this! 
I shoudl have spoke up sooner [19:49:53] i just did a review somewhere and realized something was inconsistent [19:49:58] and i got out my bike shed building tools [19:50:02] maybe that's just want i'm doing today thoguh [19:50:06] :D [19:50:07] i've bike shed 2 things before this already [19:50:14] hehe [19:50:22] good use of tools out! [19:50:23] oh man, we're gonna need more bikes :) [19:50:25] haha, i never got a chance to start coding today, so my head wasn't down and i noticed things [19:50:48] ottomata: we cannot use project_class wth external world [19:50:58] nuria_: why? and. that's fine. :) [19:51:13] the external world needs a differencitation between these two things [19:51:17] because it does not mean anything to anyone outside this group [19:51:20] they already have pageview and projectview [19:51:21] it's a bit abstract, that's why I was thinking project_type [19:51:25] where those things mean 'en.wikipedia' [19:51:42] wouldn't it be worse for hte external world if we conflated the meaning of project in our APIs? [19:52:35] Our apis as in column database names ? or pageview api fields? [19:52:46] i was referring to pageview api fields, and datasets [19:53:38] nuria_: https://wikimedia.org/api/rest_v1/ [19:53:55] https://wikimedia.org/api/rest_v1/#!/Pageviews_data/get_metrics_pageviews_top_project_access_year_month_day https://wikimedia.org/api/rest_v1/#!/Pageviews_data/get_metrics_pageviews_top_project_access_year_month_day [19:53:56] oops [19:54:03] GET /metrics/pageviews/top/{project}/{access}/{year}/{month}/{day} [19:54:05] there [19:54:17] project is en.wikipedia [19:54:18] no? [19:54:30] ok, so project as in "en.wikipedia.org" you think is fitting [19:54:38] GET http://wikimedia.org/api/rest_v1/metrics/pageviews/per-article/en.wikipedia/all-access/all-agents/Albert_Einstein/daily/2015100100/2015103100 [19:54:42] but not as a column name for project-wide unqiues [19:54:48] but not as a column name for project-wide uniques [19:54:55] i think its what we have already, and shouldn't change it [19:54:59] so yeah [19:55:34] ok, then the name "project_wide" for *.wikipedia Uniques is fine [19:55:46] i think its on ok dataset name [19:55:47] although [19:55:48] unique_devices_per_project_class [19:55:51] sounds better and more consistent [19:56:02] but [19:56:04] unique_devices_project_wide_daily is ok [19:56:09] as long as the field in that table (and in the API) [19:56:12] is not project [19:56:22] but 'project_class' (or something else if you don't like it) [19:57:23] if you can live with project_wide for teh dataset name let's just change the column name [19:57:28] joal: just so i am sure, en.m.wikipeida.org is a possible domain, right? in this per_domain table? [19:57:47] nuria_: ok, here i am not so opinionated, but talk with me about that for just a bit more [19:58:10] why is project_wide better than per_project_class? espcially given that we have per_domain as a dataset name already? [19:58:50] correct ottomata - en.m.wikipedia.org is a domain in the per_domain table [19:59:02] ok great [19:59:17] in the pageview table, it would be translated to: project = en.wikipedia, access_method: mobile web [19:59:18] class > project > domain [19:59:36] indeed ottomata [20:02:05] ottomata: I like the idea of consistency even for table/datasets names: unique_devices_per_domain and unique_devices_per_project_class [20:02:11] ottomata: jajaja [20:02:13] ayayay [20:02:32] i like that better too, but nuria_ might not, just wondering why... 
:) [20:02:53] ottomata: because "project_class" is an empty term , what does class mean on that context? we never refer to wikipedia as a class of anything [20:03:01] ottomata: however project_wide is english [20:03:31] ottomata: not perfect but 'wide' carries a lot more meaning than "class" in thsi case [20:03:32] we never refer to it that way because the term project is conflated [20:03:47] perhaps we will refert to it that way...if we name it [20:03:49] if we name it, they will say it [20:04:06] byeeee team, cya [20:04:09] byyeyee [20:04:12] Bye mforns [20:04:17] :] [20:04:36] ottomata: no, i disagree, we do not need to invent a new concept: search "wikipedia project class" on the web [20:04:44] ottomata: but to settle it we can talk to our users [20:04:56] we do need to invent a new concept for sure. since we need to differentiate the two uses of project somehow [20:05:14] so, 'project wide' is a new concept, except it doesn't have a good instance name [20:05:22] so sure, you can say, these are project wide metrics [20:06:02] but what do you refer to the 'per' dimension part of that metric as? [20:06:07] we are saying project_class [20:06:19] but if you don't name it, then the external folks will keep conflating the terms [20:06:24] which will only lead to more confusion later [20:06:46] so, i'm suggesting to nudge the terminology towards consistency and dis-ambiguity [20:06:49] we'll have to document [20:06:54] what project_class is anyway [20:08:21] and in that docuemntation the different terms (project_class > project > domain) will be clearly outlined [20:08:37] but then we'll have datasets and apis that are named inconsistently, if we don't use the terms in the dataset names [20:09:10] again though, i'm not thaaat opinionated about this. dataset names often have to be fairly english-y and descriptive, since we can't just put all the dimensions in the name [20:13:36] ottomata: "class" sounds like developer lingo, but if iam the only one in disent please do change it [20:13:52] clearly not, its in biology! [20:13:54] and math! [20:13:59] classification [20:14:09] https://www.google.com/search?q=define+class&oq=define+class&aqs=chrome..69i57j69i60l3j69i65j69i60.1250j0j9&sourceid=chrome&ie=UTF-8 [20:17:01] ottomata: on our ecosytem a wikipedia project class is this: https://en.wikipedia.org/wiki/Category:Project-Class_physics_articles [20:18:31] ottomata: perhaps related to what you are saying as "wikipedia project class" already has a meaning related to classification [20:18:53] ottomata: but again, I might be the only one in disagreement [20:21:09] nuria_: not sure what that link is [20:22:12] nuria_: i think that the external world would appreciate consistency in dataset naming, as much as we can give it, more than they would a dumbed down name. so i thiiiiiiink unless there are strong objects, and unless you have folks in the external world with strong objections, maybe we should go with per_project_class [20:22:13] ? [20:24:01] ottomata: if joal and milimetric agree please go ahead, next time it will be great to bring this up on CR so joal doesn't need to redo a ton of work , these are changes that were needed to do this renaming: https://phabricator.wikimedia.org/T167043 [20:25:46] wait, I'm lost, I thought we were renaming a column, what is happening, what are we renaming? [20:26:30] nuria_: i did in one code review a couple of days ago, but you are right, i am very late here. 
[20:26:34] milimetric: we are renaming a column [20:26:40] it's kind of hard to pay attention to a conversation like this over text while coding, sorry [20:26:46] haha, i think that's why i missed this [20:26:51] i have been coding hard the last week or two [20:26:56] milimetric: so [20:27:05] in the uniques dataset [20:27:09] so we're changing last_access_uniques to per_domain_uniques, right? [20:27:10] project - project_class [20:27:13] ottomata, milimetric : agreed to change, not contesting that if joal and ottomata feel better about naming [20:27:15] yes that's fine [20:27:23] but, what I was just saying [20:27:24] is that [20:27:29] the 'project wide' dataset [20:27:30] should be called [20:27:35] per_project_class [20:27:36] not [20:27:38] "project class" [20:27:38] project_wide [20:27:41] for consistency [20:27:52] unique_devices_per_domain_daily [20:27:52] and [20:28:02] unique_devices_per_project_class_daily [20:28:30] milimetric: thoughts ^? [20:28:39] eh... so but the problem there is that "domain" and "project" now mean the same thing [20:28:46] they don't though [20:28:47] ? [20:28:54] project is "en.wikipedia.org" [20:28:56] domain == en.m.wikipedia.org [20:29:04] oh ok [20:29:06] right [20:29:07] class > project > domain [20:29:53] if someone wanted, we could also have a unique_devices_per_project_daily [20:30:10] which with this class > project > domain dillieneation, its clear what that is [20:30:19] dillineation [20:30:38] ok, that's fine with me [20:30:54] I incorrectly equated domain and project [20:31:47] you ok with 'unique_devices_per_project_class_daily' as dataset name? [20:32:57] yeah [20:36:32] ok, les do it, thanks yall. again [20:36:39] many many apologies for not having joined in this discussion sooner [20:38:55] sorry, missed some exchanged here [20:39:16] Recap on what we ahve what to change: [20:39:31] We have unique_devices_per_domain -- This doesn't change [20:40:32] We not have unique_devices_project_wide (but not available oustside)--> This gets renamed to unique_devices_per_project_class, with inner collumn named project_class instead of project [20:40:44] ottomata, nuria_, milimetric --^ [20:41:35] I'll implement that tomorrow, and will go to bed for now :) [20:42:14] Thanks ottomata for the bikeshidding [20:42:35] If you wnat some more, have a looke at that one: https://gerrit.wikimedia.org/r/#/c/359019/ [20:42:38] :-P [20:42:52] +1, thanks yall [20:43:15] joal: i'll look at that later for sure, :) [20:43:54] Bye a-team [20:44:06] nite [21:54:17] 10Analytics, 10Analytics-EventLogging, 10Contributors-Analysis, 10EventBus, and 5 others: Record an EventLogging event every time a new content namespace page is created - https://phabricator.wikimedia.org/T150369#3353236 (10kaldari) @Ottomata: From brion and Niharika's comments above, it looks like `rev_p... [22:31:46] nuria_: hi! do you have a minute to talk about T143819? [22:31:46] T143819: Data request for logs from SparQL interface at query.wikidata.org - https://phabricator.wikimedia.org/T143819 [22:32:04] SMalyshev: let me see which one that is [22:32:32] SMalyshev: i have a few mins yes, whassup? [22:33:02] nuria_: I wanted to understand what we can do with current analytics setup in general. So, we have this query log [22:33:26] SMalyshev: aham [22:33:32] nuria_: is there some place/setup where we could make another dataset from it that would be publicly consumable? [22:33:56] SMalyshev: the query log you are referring to is on hdfs? part of webrequest? [22:33:59] e.g. 
[21:54:17] 10Analytics, 10Analytics-EventLogging, 10Contributors-Analysis, 10EventBus, and 5 others: Record an EventLogging event every time a new content namespace page is created - https://phabricator.wikimedia.org/T150369#3353236 (10kaldari) @Ottomata: From brion and Niharika's comments above, it looks like `rev_p...
[22:31:46] nuria_: hi! do you have a minute to talk about T143819?
[22:31:46] T143819: Data request for logs from SparQL interface at query.wikidata.org - https://phabricator.wikimedia.org/T143819
[22:32:04] SMalyshev: let me see which one that is
[22:32:32] SMalyshev: i have a few mins yes, whassup?
[22:33:02] nuria_: I wanted to understand what we can do with the current analytics setup in general. So, we have this query log
[22:33:26] SMalyshev: aham
[22:33:32] nuria_: is there some place/setup where we could make another dataset from it that would be publicly consumable?
[22:33:56] SMalyshev: the query log you are referring to is on hdfs? part of webrequest?
[22:33:59] e.g. P17 was accessed X times on June 14, etc.
[22:34:06] nuria_: yes
[22:34:08] SMalyshev: ya, the output i get
[22:34:41] SMalyshev: ok, so your records are distinct records of webrequest that you can identify and tag
[22:35:15] nuria_: yeah, basically say we develop some code that says: this request has props P17, P31 and P2048 and items Q123 and Q456
[22:35:22] what can we do with this data?
[22:35:46] SMalyshev: say we mark all records that you are interested in with "wikidata" and they get split into a wikidata "partition" of webrequest; you can read those records and compute your stats and create a dataset. does that make sense?
[22:36:16] nuria_: ok, but that dataset would still be on the analytics cluster and thus private, right?
[22:36:47] SMalyshev: no, we have tons of public data generated on the cluster, actually the majority of it is generated that way
[22:37:02] nuria_: aha, ok, that's what I'm trying to figure out
[22:37:20] SMalyshev: see: https://dumps.wikimedia.org/other/analytics/
[22:37:35] nuria_: so there's a way to produce public data sets... that's good
[22:37:42] SMalyshev: from day 1
[22:37:59] SMalyshev: that we started using the cluster
[22:38:20] SMalyshev: what we want to avoid is you combing webrequest (all data) just for this, so i think your changes
[22:38:49] SMalyshev: will benefit from the work we are doing about tagging and splitting those requests in one swoop
[22:39:18] yeah that's what I am thinking
[22:39:39] making some kind of process that generates this log with used props/items
[22:39:44] SMalyshev: code is in the works for our tagging and splitting, but will share some code and docs
[22:39:47] so that then people could just take it and work on it
[22:40:26] SMalyshev: ya, very doable, for the kind of stuff we do that sounds like bread and butter really
[22:40:50] nuria_: then a more specific question. I see things like pageviews are basically fixed structure. But if we want queries with properties/items/other tags, one query can have many props, etc.
[22:40:54] how is this handled?
[22:41:39] SMalyshev: you will tag the data you are using, see for example the tagging of portal pageviews (this is wip): https://gerrit.wikimedia.org/r/#/c/353287/15/refinery-core/src/main/java/org/wikimedia/analytics/refinery/core/webrequest/tag/PortalTagger.java
[22:42:08] SMalyshev: that data will get split into its own bucket given a tag: https://gerrit.wikimedia.org/r/#/c/357814/
[22:42:19] SMalyshev: and you operate on your data via sql statements
[22:42:45] nuria_: ok, say I assign tags, but what would be the output?
[22:42:57] SMalyshev: this is our latest work, so doing jobs like the one you want to do doesn't require combing webrequest wholly again (inefficient)
[22:43:06] SMalyshev: all records tagged in a table
[22:43:18] (note I have very shallow knowledge about how this works now so please excuse stupid questions :)
[22:43:44] nuria_: ok so tags can be basically anything in any amount, right?
[22:43:49] SMalyshev: a mini-webrequest (same schema, less data: https://wikitech.wikimedia.org/wiki/Analytics/Data_Lake/Traffic/Webrequest#Current_Schema)
[22:44:01] and those are still stored in e.g. Parquet tables or something like that?
[22:44:08] SMalyshev: yes
[22:44:28] SMalyshev: the tagging work is WIP, so will be stored, but yes.
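As a rough illustration of the per-request extraction SMalyshev describes above (a production version would be a Java tagger in refinery-core, like the PortalTagger linked earlier), here is a minimal Python sketch; the regexes and names are assumptions, not the actual tagging code:

    import re

    # Hypothetical sketch: pull Wikidata property (P...) and item (Q...)
    # IDs out of the SPARQL text of a single webrequest.
    PROP_RE = re.compile(r'\bP\d+\b')
    ITEM_RE = re.compile(r'\bQ\d+\b')

    def extract_entities(sparql):
        props = sorted(set(PROP_RE.findall(sparql)))
        items = sorted(set(ITEM_RE.findall(sparql)))
        return props, items

    query = 'SELECT ?x WHERE { ?x wdt:P31 wd:Q5 . ?x wdt:P17 wd:Q123 }'
    print(extract_entities(query))
    # (['P17', 'P31'], ['Q123', 'Q5'])

In the pipeline being described, requests matching such a tagger would land in their own "wikidata" bucket of webrequest, ready to be queried with SQL.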
[22:44:30] got it
[22:44:42] but then we'd need some other process to export these, I imagine
[22:45:18] SMalyshev: ya, you would need SQL that computes your counts and exports, but that again is what we do for most everything
[22:46:18] SMalyshev: see for example unique devices: https://github.com/wikimedia/analytics-refinery/tree/master/oozie/unique_devices/per_domain/daily
[22:46:36] SMalyshev: this workflow creates the data (metrics) and files for external use
[22:47:19] SMalyshev: maybe an easier example, for mediacounts: https://github.com/wikimedia/analytics-refinery/blob/master/oozie/mediacounts/archive/archive_mediacounts.hql
[22:47:36] nuria_: ok, cool. but those are already aggregated, as I see... I wonder if it makes sense to have something like per-request data too, because people may want to do all kinds of aggregations. should we implement them on our side?
[22:48:12] SMalyshev: normally datasets retained longer have to be aggregated for privacy
[22:48:20] SMalyshev: but that might not be true for your data
[22:48:57] yeah, that's what I am wondering... maybe we don't need non-aggregated data, but what if somebody wants to correlate e.g. items with properties?
[22:48:59] SMalyshev: do talk to your users, super detailed data might not be very useful in your case; for the system it doesn't matter
[22:49:31] yeah, surely. I am just trying to figure out how it works, but I think I get the idea now
[22:49:44] SMalyshev: ok, good, take a look and let us know
[22:50:01] nuria_: do you perchance have the ticket for the tagging work, so I could watch it and know when it's ready?
[22:52:01] ah I think I found it: T164021
[22:52:01] T164021: Create tagging udf - https://phabricator.wikimedia.org/T164021
[22:52:45] thanks!
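To show the shape of the aggregation/export step discussed above (in production it would be an HQL job like the mediacounts and unique_devices examples linked by nuria_), here is a toy Python sketch with hypothetical input structure:

    from collections import Counter

    # Toy sketch of the daily aggregation step: given tagged requests as
    # (day, [properties]) pairs, produce "property P was used N times on
    # day D" counts -- the shape of a publishable dataset. The input
    # structure is hypothetical, standing in for a Hive table.
    def daily_property_counts(tagged_requests):
        counts = Counter()
        for day, props in tagged_requests:
            for prop in props:
                counts[(day, prop)] += 1
        return counts

    requests = [
        ('2017-06-14', ['P17', 'P31']),
        ('2017-06-14', ['P17', 'P2048']),
        ('2017-06-15', ['P31']),
    ]
    for (day, prop), n in sorted(daily_property_counts(requests).items()):
        print(day, prop, n)
    # 2017-06-14 P17 2
    # 2017-06-14 P2048 1
    # 2017-06-14 P31 1
    # 2017-06-15 P31 1

Aggregates of this shape could be published alongside the datasets under https://dumps.wikimedia.org/other/analytics/, while per-request data would remain subject to the privacy and retention constraints mentioned above.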