[01:03:59] Analytics, MediaWiki-extensions-Gadgets: Track the GadgetUsage statistics over time - https://phabricator.wikimedia.org/T121049#1868082 (Quiddity) NEW [01:05:01] milimetric, re: https://phabricator.wikimedia.org/T21288#1859304 I've created ^ (https://phabricator.wikimedia.org/T121049) and triaged as low priority based on kaldari's advice. Hope that's all ok. :) [02:19:16] makes sense quiddity, I added it to our backlog but please as always push us for it if it becomes important. We currently have over 300 different asks like this, we're trying to get a handle on that and all our operational overhead :) [09:56:37] Quarry: Login to somebody's account - https://phabricator.wikimedia.org/T120988#1868826 (IKhitron) I am happy, @Edgars2007, but: 1: Maybe I'm a troll. ;-) 2: Maybe somebody else was in your account, and he is a troll. [10:27:53] Analytics-Kanban, DBA, Patch-For-Review: Pending maintenance on the eventlogging databases (db1046, db1047, dbstore1002, other dbstores) - https://phabricator.wikimedia.org/T120187#1868848 (jcrespo) I've done a quick test on db1046 and I can create TokuDB tables with no problem: ``` $ mysql -h db1046... [10:29:27] Analytics-Kanban, DBA, Patch-For-Review: Pending maintenance on the eventlogging databases (db1046, db1047, dbstore1002, other dbstores) - https://phabricator.wikimedia.org/T120187#1868857 (jcrespo) I can also see the table Test_12174936 as created: ``` mysql> SHOW TABLES like 'Test\_%'; +-----------... [11:07:37] (CR) Addshore: "5.75 hours for a full dump on stat1002..." [analytics/limn-wikidata-data] - https://gerrit.wikimedia.org/r/257945 (owner: Addshore) [13:39:13] Analytics-Tech-community-metrics, DevRel-December-2015: OwlBot seems to merge random user accounts in korma user data - https://phabricator.wikimedia.org/T119755#1869145 (Aklapper) >>! In T119755#1842088, @Dicortazar wrote: > I'm having a look at the data. If there's an id with loads of merges, this is us... [13:54:43] Analytics-Tech-community-metrics, Easy: Entered text in Typeahead search field nearly not visible in Firefox 42: Fix the CSS - https://phabricator.wikimedia.org/T121101#1869170 (Aklapper) NEW [13:56:47] Analytics-Tech-community-metrics, DevRel-January-2016: "Unavailable section name" displayed on repository.html - https://phabricator.wikimedia.org/T121102#1869176 (Aklapper) NEW [14:16:55] (CR) Mforns: [C: -1] "LGTM, but there's an inconsistency with the sumAggregateByUser patch. Should we address that here or there?" (5 comments) [analytics/wikimetrics] - https://gerrit.wikimedia.org/r/174773 (https://bugzilla.wikimedia.org/73072) (owner: Mforns) [14:50:28] joal, question about the webrequest table? [15:00:25] joal: hi [15:34:02] milimetric: you can comment out the send error email workflow for now when you test I think :) [15:34:13] no! you all get to feel my pain! :) [15:34:17] jk, ok, I'll do that [15:35:54] :D you can also just send it to yourself by editing the list while testing [15:36:31] (PS1) Milimetric: Fix case sensitivity issue with country_name [analytics/refinery/source] - https://gerrit.wikimedia.org/r/258153 [15:43:31] there must be an opposite brain disorder to dyslexia [15:43:43] and *every* person working on Java and especially Oozie has that disorder [15:44:07] so, does anyone know if webrequest.agent_type includes automata yet? [15:44:08] oh? you have problems with lots of text and especially font colors that blend with the background? Lemme DUMP EVERYTHING and make it really hard to read [15:44:22] Ironholds: you mean bot? 
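(An aside on jcrespo's TokuDB check in T120187 above: since his paste is truncated, here is a hypothetical reconstruction of what such a smoke test presumably amounts to; the table name mirrors the Test_12174936 he mentions later, and the column definition is an illustrative guess.)

```sql
-- Hypothetical reconstruction of the TokuDB smoke test on db1046:
-- create a throwaway table with ENGINE=TokuDB, confirm it is listed
-- and reports the right engine, then clean up.
CREATE TABLE Test_12174936 (id INT PRIMARY KEY) ENGINE=TokuDB;
SHOW TABLES LIKE 'Test\_%';
SHOW CREATE TABLE Test_12174936;  -- should report ENGINE=TokuDB
DROP TABLE Test_12174936;
```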
[15:44:30] milimetric, hmn? [15:44:34] agent_type can be spider, user, bot [15:44:42] and I haven't seen bot lately [15:44:46] so it got implemented? [15:44:56] because the comment says it's just spider and user and will contain automata at some point [15:45:04] bot is implemented but i don't think people self-identify with our convention since we haven't communicated it at all [15:45:18] what's our convention? [15:45:24] WikimediaBot I think [15:45:26] I don't think even I've heard of it [15:45:30] (in the UA string) [15:45:36] yeah, not what I meant; automata. Robots that aren't crawlers. [15:45:43] that makes sense, because we implemented it and there's been a task to communicate it on the backlog for like 3 months [15:45:52] not sure about that [15:45:57] I haven't seen plans [15:46:04] augh [15:46:07] but we have a UDF! [15:46:12] We're already including it in the pageviews table! [15:46:42] wait, no we're not [15:46:44] I made that up. Hah. [15:47:05] joal, does agent_type incorporate the automata UDF as "spider", say? [15:47:09] because I swear we did a load of work on this [16:10:31] Ironholds: how is 'automata' defined? [16:10:41] Ironholds: like bots doing work in wikipedia? [16:12:04] nuria, like scrapers and random people with BeautifulSoup and curl [16:12:17] there's a UDF we wrote explicitly for this and I'm wondering where it's actually used [16:12:29] like, is it factored into the spider/user determination, or? [16:12:35] Ironholds: but how would you distinguish those from crawlers without counting request ratio? [16:12:44] Ironholds: a UDF works on a per column basis [16:13:17] "doesn't meet the 'spider' heuristics, does meet other user agent standards" [16:13:22] Ironholds: what you describe i would call "bots" [16:13:33] so would I, but apparently "bots" means something different [16:13:39] I am not asking "do we have some magical way of detecting non-spider automata" [16:13:42] Ironholds: anyone with curl, but maybe this is a WMF convention [16:13:44] I am asking explicitly: we wrote a UDF for this [16:13:49] is it used anywhere [16:13:56] if so, where is it used, since I'd like to be able to rely on that data. [16:14:55] okay, found [16:15:06] https://github.com/wikimedia/analytics-refinery-source/blob/master/refinery-core/src/main/java/org/wikimedia/analytics/refinery/core/Webrequest.java#L101 sorry, misremembered the name ;) [16:15:36] so: is this UDF now factored into the agent_type field? Does anyone know? [16:17:19] Ironholds: i do not think so (i can look) but I am not sure it should be, seems like the oddest of criteria, it [16:17:21] Analytics-EventLogging, Analytics-Kanban, EventBus, Patch-For-Review: Build tornado-sprocket python packages - https://phabricator.wikimedia.org/T121112#1869627 (Ottomata) NEW a:Ottomata [16:17:37] the oddest of criteria? [16:17:39] would make more sense to mark 'wikipedia bots' as such and bot everything else [16:18:06] Ironholds: yes, cause you will be marking as "spider" anything the bot regex misses [16:18:16] I don't get it. [16:18:30] Ironholds: ay sorry, let me explain [16:19:10] Ironholds: let me understand what you are interested in [16:19:25] I want to be able to exclude automata, broadly-construed, from our data [16:19:28] or more accurately, my managers do. [16:19:29] Ironholds: you want to distinguish wikipedia bots from "other" bots, is that so?
[16:19:33] no [16:19:36] I want to be able to identify both [16:19:49] I am asking if the "wikipedia bots" checker is factored into the "spider" tag in agent_type [16:20:04] if agent_type == 'Spider' does that mean it matched the UAParser definition or this definition or either/or or what? [16:20:20] (PS3) Milimetric: Oozie-fy Country Breakdown Pageview Report [analytics/refinery] - https://gerrit.wikimedia.org/r/256355 [16:20:24] I don't care about distinguishing them [16:20:40] I care about being able to distinguish users from inconsiderate imbeciles with BeautifulSoup [16:21:04] (PS4) Milimetric: Oozie-fy Country Breakdown Pageview Report [analytics/refinery] - https://gerrit.wikimedia.org/r/256355 (https://phabricator.wikimedia.org/T118323) [16:21:06] Ironholds: sorry, i just do not understand what automata is [16:21:11] if the BeautifulSoup users get clumped in with GoogleBot, that's fine [16:21:48] nuria: it would be easier if we had not named this UDF poorly (most of what it tracks is not spiders) [16:22:25] Ironholds: i see, i think the correct thing would be now not to use that code as naming is all off and confusing but let [16:22:28] let's just avoid the "automata" thing, then; I am asking if the definition of "spider", in the agent_type field, in the webrequest table, factors in the isSpider UDF [16:22:35] s try to see if we can get you what you are interested in [16:23:17] is the isSpider UDF used in generating that field [16:23:28] I will settle for someone showing me where the code to generate that field lives so I can work it out myself ;p [16:23:31] Ironholds: let me look but if it does i think it would be incorrect [16:23:54] well, we need the things that UDF matches identified and excluded in one way, so. [16:23:57] *in some way [16:24:11] the problem is that "spider" and "user" is a false dichotomy [16:24:21] in practice it is "user" and "spider" and "bots", and we aren't representing that. [16:24:23] Ironholds: ya but naming is so off that it confuses more than helps [16:24:41] yeah, as this conversation demonstrates, but it's still better than not tracking that class of traffic [16:24:44] actually in practice we cannot tell spiders from bots [16:24:50] heh; fair [16:24:57] unless they are wikimediabots that abide by a bot convention [16:24:58] then the clustering of those together is fine :D [16:25:45] Yes, we are working on looking at bot traffic and how much of our traffic tagged as user is actually from bots (crawlers or otherwise) [16:25:57] Ironholds: let me 1) find out for you whether that udf is used [16:26:03] 2) file a ticket to rename stuff [16:26:08] kk; thanks!
[16:29:11] Ironholds: see: https://github.com/wikimedia/analytics-refinery/blob/master/oozie/webrequest/refine/refine_webrequest.hql [16:29:56] Ironholds: https://github.com/wikimedia/analytics-refinery/blob/master/oozie/webrequest/refine/refine_webrequest.hql#L98 [16:30:17] in my opinion that naming is all off [16:30:37] (PS6) Milimetric: Add pages edited metric [analytics/wikimetrics] - https://gerrit.wikimedia.org/r/174773 (https://bugzilla.wikimedia.org/73072) (owner: Mforns) [16:30:42] (CR) Milimetric: Add pages edited metric (5 comments) [analytics/wikimetrics] - https://gerrit.wikimedia.org/r/174773 (https://bugzilla.wikimedia.org/73072) (owner: Mforns) [16:31:03] Ironholds: BTW, take a look at this still WIP, pertains to the bigger picture of bots: [16:31:55] Ironholds: https://wikitech.wikimedia.org/wiki/Analytics/Unique_clients/Last_access_solution/BotResearch#Results [16:35:54] mforns: on the tests failing on reports, I tried stopping the queue and running them, and they work fine [16:36:19] I don't know why we need to do that though - something seems off with the queue configs [16:37:33] (CR) Mforns: Add pages edited metric (1 comment) [analytics/wikimetrics] - https://gerrit.wikimedia.org/r/174773 (https://bugzilla.wikimedia.org/73072) (owner: Mforns) [16:38:42] nuria, aha, so it is integrated! [16:38:43] (CR) Mforns: [C: 2 V: 2] "Cool!" [analytics/wikimetrics] - https://gerrit.wikimedia.org/r/174773 (https://bugzilla.wikimedia.org/73072) (owner: Mforns) [16:39:06] no, it's not [16:39:09] * Ironholds sighs, kicks things [16:39:12] madhuvishy: why do we need to stop the queue? [16:39:23] milimetric, I merged your last wikimetrics patch [16:39:27] Ironholds: ay ay [16:39:29] why do we have a UDF for identifying and excluding a vast amount of necessary traffic, and we're not using it? [16:39:45] this removes a heck of a load of utility from the spider/user distinction in the pageviews API [16:39:50] these heuristics really need integrating. [16:39:55] milimetric, but I forgot to look at the non-code-related comment, about the tests failing [16:40:02] we can rename it to...isAutomata or something, but it needs to be in there [16:40:23] milimetric, just in case you want to deploy this, I wanted to be sure the tests are only failing for me [16:40:36] oh wait [16:40:40] nuria, hang on, it is included [16:40:45] WHEN ((ua_parser(user_agent)['device_family'] = 'Spider') OR (is_spider(user_agent))) THEN 'spider' [16:40:48] nuria: the tests only seem to pass with whatever queue config the tests start the queue with. [16:40:51] CREATE TEMPORARY FUNCTION is_spider as 'org.wikimedia.analytics.refinery.hive.IsSpiderUDF'; [16:40:52] milimetric, oh, I just read madhuvishy's message now [16:40:56] will try that [16:40:58] yay! [16:41:15] Analytics-Kanban, DBA, Patch-For-Review: Pending maintenance on the eventlogging databases (db1046, db1047, dbstore1002, other dbstores) - https://phabricator.wikimedia.org/T120187#1869749 (Nuria) @jcrespo: ah, it just took longer than i was expecting. Excellent then. Just confirmed that it is indeed... [16:41:21] nuria, thank you! I appreciate this was a lot of time to dedicate to answering one person's question. It is most appreciated :D [16:41:28] mforns: I replied to that as well... did i forget to submit the draft?
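(Putting the two snippets just quoted back into context: the refine step in refine_webrequest.hql evidently derives agent_type with something like the following. This is a sketch reconstructed from the quoted lines, not the exact file; the ua_parser registration and the ELSE branch are assumptions.)

```sql
-- Sketch of the agent_type derivation in refine_webrequest.hql,
-- reconstructed from the lines quoted above: the IsSpiderUDF result is
-- OR-ed with ua-parser's device_family, so the UDF *is* factored in.
CREATE TEMPORARY FUNCTION is_spider AS 'org.wikimedia.analytics.refinery.hive.IsSpiderUDF';
-- (ua_parser is assumed to be registered the same way from the refinery jar)

SELECT
  CASE
    WHEN ((ua_parser(user_agent)['device_family'] = 'Spider')
          OR (is_spider(user_agent))) THEN 'spider'
    ELSE 'user'  -- assumed fallback; the real file may differ
  END AS agent_type
FROM wmf_raw.webrequest;
```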
[16:41:52] mforns: yeah, here: https://gerrit.wikimedia.org/r/#/c/174773/5/tests/test_metrics/test_pages_edited.py [16:42:02] it's normal - anytime mediawiki mappings are changed [16:42:05] Ironholds: i think we are still going to work on renaming...although ua parser uses 'spider' [16:42:22] yeah, renaming is probably the right call, but as long as it's communicated it shouldn't be a problem [16:42:29] I can just add an extra condition to my CASE WHENs, sorted. [16:42:34] milimetric, there was another comment on the main page, not file page [16:42:54] milimetric, 5 non-related tests were failing for me [16:43:09] milimetric, but madhuvishy says this gets fixed by restarting the queue [16:43:24] *stopping the queue [16:43:36] I forgot that I needed to do that [16:43:41] my baaad [16:46:45] milimetric, madhuvishy, yes it works. thx! So, patch merged [16:48:23] oh right, sorry didn't see that one [16:55:09] (CR) Nuria: [C: 2 V: 2] "Tests pass, merging." [analytics/refinery/source] - https://gerrit.wikimedia.org/r/258153 (owner: Milimetric) [16:55:16] Analytics-Kanban, DBA, Patch-For-Review: Pending maintenance on the eventlogging databases (db1046, db1047, dbstore1002, other dbstores) - https://phabricator.wikimedia.org/T120187#1869775 (jcrespo) `CHARSET=binary`, didn't you use to create tables with utf8 charset? Most others are utf8 indeed. We ca... [16:56:56] Analytics-Kanban, DBA, Patch-For-Review: Pending maintenance on the eventlogging databases (db1046, db1047, dbstore1002, other dbstores) - https://phabricator.wikimedia.org/T120187#1869776 (Nuria) >CHARSET=binary, didn't you use to create tables with utf8 charset? we changed nothing in that regard, d... [17:00:15] a-team: standdduppppp [17:02:49] Analytics-Kanban: Troubleshooting limn1 and wikimetrics1 self-hosted puppet woes [3 pts] - https://phabricator.wikimedia.org/T120968#1869790 (Milimetric) [17:05:16] Analytics-Kanban, DBA, Patch-For-Review: Pending maintenance on the eventlogging databases (db1046, db1047, dbstore1002, other dbstores) - https://phabricator.wikimedia.org/T120187#1869800 (jcrespo) The server's default is binary, as that is what mediawiki uses, but the software used to set utf8 in the... [17:08:20] Analytics-Kanban, DBA, Patch-For-Review: Pending maintenance on the eventlogging databases (db1046, db1047, dbstore1002, other dbstores) - https://phabricator.wikimedia.org/T120187#1869809 (jcrespo) Actually I see them as utf8 myself. I am not sure what server you are querying? But the right one has th... [17:10:14] a-team: at the spark training! I worked on sending stats to tornado using the sprockets library, and tested it on labs. Pushed an initial patch. I'm also working on the puppet module to host multiple static sites using apache with yuvi's help. there's a patch for that - but need to make a task. (waiting on next steps for wikimetrics so did this) [17:10:47] Thanks madhuvishy :) [17:11:28] thanks madhuvishy, I'm reviewing your patch now [17:16:14] ottomata: I saw your comment but I don't know who should merge it either :) [17:16:16] thx for looking though [17:16:24] what does it do? [17:16:32] I'll show you, hang on [17:16:46] My new "network_origin" function doesn't seem to be live on stat1002 yet. Is there something I need to do to request that the refinery jars are updated?
[17:16:55] ottomata: if you go to wikimedia.org/api you get a 404 [17:17:13] and instead, it should work like en.wikipedia.org/api [17:17:19] so I think this patch does some work towards that [17:17:53] bd808, yeah, they get updated on an intermittent schedule :/ [17:18:09] in the meantime, if you are trying to run queries using it, I would recommend building a JAR, importing that and loading the UDF there [17:18:33] *nod* I can do that. I was just looking to see if it was no longer needed [17:24:46] Analytics-Backlog, Reading-Infrastructure-Team, Patch-For-Review, User-bd808: Create user defined function to classify network origin of an IP address - https://phabricator.wikimedia.org/T118592#1869866 (bd808) The new UDF can be used from my homedir on stat1002 until such time as the shared refiner... [17:42:52] Analytics-Kanban, DBA, Patch-For-Review: Pending maintenance on the eventlogging databases (db1046, db1047, dbstore1002, other dbstores) - https://phabricator.wikimedia.org/T120187#1869900 (Nuria) > I am not sure what server you are querying? The slave, as i am going through 1002. >Let's schedule 2 h... [17:43:21] Analytics-Kanban, DBA, Patch-For-Review: 2 hour outage to update mysql on El slaves - https://phabricator.wikimedia.org/T121120#1869905 (Nuria) NEW a:mforns [17:44:32] Analytics-Kanban, DBA, Patch-For-Review: 2 hour outage to update mysql on EL slaves - https://phabricator.wikimedia.org/T121120#1869905 (Nuria) [17:48:02] so, if no one contradicts me, we will schedule 2 hours of downtime for m4-master next Tuesday [17:48:36] ^ottomata [17:52:30] Analytics-Kanban, DBA: 2 hour outage to update mysql on EL slaves - https://phabricator.wikimedia.org/T121120#1869947 (jcrespo) [17:55:12] Analytics-Kanban, DBA: 2 hour outage to update mysql on EL slaves - https://phabricator.wikimedia.org/T121120#1869957 (jcrespo) It is very important to send a notice to all users with enough time in advance. As the affected services are not yet clear, please remember it should be done as soon as it is de... [17:59:45] Analytics-Kanban, DBA: 2 hour outage to update mysql on EL slaves - https://phabricator.wikimedia.org/T121120#1869982 (jcrespo) BTW, this will only affect the MASTER (the data saving) not the slaves, that will continue to be available for querying, but not updated. [18:04:31] Analytics-Kanban, DBA: 2 hour outage to update mysql on EL slaves - https://phabricator.wikimedia.org/T121120#1869992 (Nuria) [18:08:34] Analytics-Kanban, DBA: 2 hour outage to update mysql on EL slaves - https://phabricator.wikimedia.org/T121120#1870019 (Nuria) p:High>Unbreak!
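(A sketch of the sideloading workaround ottomata suggests at 17:18, for anyone following along; the jar path, version, and UDF class name are illustrative guesses, not the actual refinery names.)

```sql
-- Build refinery-source locally, copy the refinery-hive jar to
-- stat1002, then register the UDF per Hive session. Path, version,
-- and class name below are placeholders.
ADD JAR /home/bd808/refinery-hive-0.0.24-SNAPSHOT.jar;
CREATE TEMPORARY FUNCTION network_origin
  AS 'org.wikimedia.analytics.refinery.hive.NetworkOriginUDF';

SELECT network_origin(ip) AS origin, COUNT(*) AS requests
FROM wmf.webrequest
WHERE year = 2015 AND month = 12 AND day = 10 AND hour = 0
GROUP BY network_origin(ip);
```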
[18:08:44] Analytics: THEME: Analyst uses an operationalized Saiku - https://phabricator.wikimedia.org/T75246#1870020 (Milimetric) [18:09:16] Analytics-Engineering, Analytics-EventLogging: Epic: Engineer has simpler way to deploy dashboard from EL data - https://phabricator.wikimedia.org/T75836#1870023 (Milimetric) Open>declined a:Milimetric Epics are not part of our process any more [18:09:23] Analytics-Engineering: EPIC: data warehouse - https://phabricator.wikimedia.org/T76382#1870029 (Milimetric) [18:09:24] Analytics-Engineering, Analytics-Wikimetrics: Epic: WikimetricsUser has all metrics reimplemented with Data Warehouse - https://phabricator.wikimedia.org/T76387#1870026 (Milimetric) Open>declined a:Milimetric Epics are not part of our process any more [18:09:30] Analytics-Engineering: EPIC: data warehouse - https://phabricator.wikimedia.org/T76382#1870033 (Milimetric) Open>declined a:Milimetric Epics are not part of our process any more [18:09:37] Analytics-Engineering, Analytics-Visualization: Epic: User sees correct comScore attribution on reportcard - https://phabricator.wikimedia.org/T75344#1870039 (Milimetric) Open>declined a:Milimetric Epics are not part of our process any more [18:12:36] Analytics-Engineering, Analytics-EventLogging: Automate pruning of sampled logs after 90 days [0 pts] - https://phabricator.wikimedia.org/T74743#1870049 (Milimetric) Open>Resolved a:Milimetric [18:13:10] Analytics-Engineering, Analytics-EventLogging: Epic: Engineer has simpler way to deploy dashboard from EL data - https://phabricator.wikimedia.org/T75836#1870056 (Milimetric) [18:13:11] Analytics-Engineering: Write new Config Script [13 pts] - https://phabricator.wikimedia.org/T76408#1870053 (Milimetric) Open>declined a:Milimetric reportupdater took the place of this [18:13:25] Analytics-Engineering, Analytics-EventLogging: Epic: Engineer has simpler way to deploy dashboard from EL data - https://phabricator.wikimedia.org/T75836#783494 (Milimetric) [18:13:26] Analytics-Engineering, Analytics-Visualization: [Volunteer] Improve Generate.py [13 pts for the Analytics Eng team] - https://phabricator.wikimedia.org/T76407#1870057 (Milimetric) Open>Resolved reportupdater took the place of this [18:13:43] Analytics-Engineering: EPIC: data warehouse - https://phabricator.wikimedia.org/T76382#1870063 (Milimetric) [18:13:44] Analytics-Engineering: raw warehouse data moved to labs, no pre-calculations [8 pts] - https://phabricator.wikimedia.org/T76383#1870060 (Milimetric) Open>Invalid a:Milimetric data warehouse was too hard for a variety of reasons [18:13:58] Analytics-Engineering, Analytics-Wikimetrics: Epic: WikimetricsUser has all metrics reimplemented with Data Warehouse - https://phabricator.wikimedia.org/T76387#1870071 (Milimetric) [18:14:00] Analytics-Engineering, Analytics-Wikimetrics: RAE implemented using Data Warehouse [13 pts] - https://phabricator.wikimedia.org/T76441#1870068 (Milimetric) Open>declined a:Milimetric data warehouse was too hard [18:14:07] Analytics-Engineering, Analytics-Wikimetrics: Story: Wikimetrics compiles target-site breakdown for remaining metrics [34 pts] - https://phabricator.wikimedia.org/T74738#1870074 (Milimetric) [18:14:08] Analytics-Engineering, Analytics-Wikimetrics: Story: Wikimetrics has connection to Data Warehouse [13 pts] - https://phabricator.wikimedia.org/T74737#1870072 (Milimetric) Open>declined a:Milimetric [18:17:31] Analytics-Cluster, Analytics-Engineering: Analytics Eng has duplicate monitoring for partitions coming through Kafka - 
https://phabricator.wikimedia.org/T86197#1870103 (Milimetric) Open>Resolved a:Milimetric Done with the webrequest_statistics part of refinery [18:17:32] Analytics-Backlog, Analytics-Cluster: Epic: qchris transition - https://phabricator.wikimedia.org/T86135#1870106 (Milimetric) [18:17:50] Analytics-Backlog, Analytics-EventLogging, Documentation, Epic: {epic} Product Instrumentation and Visualization {oryx} - https://phabricator.wikimedia.org/T76795#1870111 (Milimetric) [18:17:51] Analytics-Engineering, Analytics-EventLogging: Analytics Eng has architecture review of EL - https://phabricator.wikimedia.org/T78443#1870108 (Milimetric) Open>Resolved a:Milimetric We're all experts [18:21:13] Analytics-Engineering: Community has a developer doc "Getting Started with Wikimetrics" - https://phabricator.wikimedia.org/T77075#1870131 (Milimetric) Open>Resolved Handled by the README and wiki documentation [18:23:00] Analytics-Engineering, Analytics-EventLogging: Epic: Engineer has simpler way to deploy dashboard from EL data - https://phabricator.wikimedia.org/T75836#1870145 (Milimetric) [18:23:01] Analytics-Engineering: Write new Test Script for pipeline to generate visualizations from EL data - https://phabricator.wikimedia.org/T76409#1870142 (Milimetric) Open>declined a:Milimetric we're not going to do this anytime soon, but reportupdater and dashiki solve some of the problem [18:24:00] Analytics-Kanban, Community-Wikimetrics, Patch-For-Review: Story: WikimetricsUser reports pages edited by cohort {kudu} [13 pts] - https://phabricator.wikimedia.org/T75072#1870154 (Milimetric) a:mforns>Milimetric [18:25:18] Analytics-Cluster, Analytics-Engineering: analytics1032 has / mounted ro - https://phabricator.wikimedia.org/T118175#1870171 (Milimetric) a:Ottomata [18:27:21] milimetric: yall can't hear me? [18:27:30] ottomata: no! [18:27:32] weird [18:27:33] what?! [18:27:37] we don't even see you [18:27:48] ottomata: Where are YOUUUUU ? [18:27:53] you can't see me either?! [18:27:55] i'm in the batcave! [18:28:00] Sure not ! [18:28:09] no!? [18:28:13] WE are in the batcave :) [18:28:16] I AM THERE WITH YOU [18:28:21] am I dead? [18:28:23] am i a ghost? [18:28:27] can you see me in hangout chat? [18:28:37] don't know, you talk too much for being either dead or a ghost [18:28:40] haha [18:28:45] weeiiiirrrd [18:28:48] Can [18:28:52] Analytics-Engineering, Community-Tech, Community-Tech-fixes: Add page view statistics to page information pages (action=info) [AOI] - https://phabricator.wikimedia.org/T110147#1870191 (Milimetric) I'll remove analytics for now, please ping me or add it again if there's specific work you need us to do [18:28:53] Can't see you at all [18:29:17] Analytics-Engineering: stat1003 - git pull geowiki scripts fails - https://phabricator.wikimedia.org/T109594#1870193 (Milimetric) Open>Resolved a:Milimetric [18:29:25] ottomata: anything we could help with ? [18:29:35] haha [18:29:36] wow [18:29:42] Analytics-Engineering: Analytics-Engineering availability to support VE A/B test during hackathon - https://phabricator.wikimedia.org/T99014#1870200 (Milimetric) Open>Invalid a:Milimetric no longer relevant [18:29:43] both safari and chrome do the same thing [18:29:53] naw i'm just chillin, heard yall talking about a varnishkafka task [18:30:00] was going to say 'whaa?'
but then no one heard me [18:30:04] thought you were just ignoring me [18:30:11] since I shun these meetings too often [18:30:15] huhuh [18:30:36] Analytics-Engineering: Hive error accessing block - https://phabricator.wikimedia.org/T98622#1870209 (Milimetric) Open>Resolved a:Milimetric [18:30:37] hi analytics! [18:30:38] We are doing the winter cleaning (other boards) [18:31:13] bd808, Ironholds : we normally do not deploy every week nor do we have a fixed deployment schedule, we try to have a deployment every couple weeks but you can request we do one earlier [18:31:37] do you know if we have any data on the number of file uploads to Commons using UploadWizard, by day? [18:31:50] Analytics-Engineering: Pageviews definition undercounts app requests - https://phabricator.wikimedia.org/T93255#1870212 (Milimetric) Open>Resolved a:Milimetric [18:31:51] (or would i have to crunch the numbers myself if i wanted to know?) [18:32:28] i'm wondering because i crunched the numbers for the past two months, and they seem to be quite clearly decreasing. [18:32:59] nuria, when was the last one? Trying to work out if I can deprecate some code [18:33:09] and i'm wondering if it's just winter making people sleepy, every year, or a real effect. [18:33:19] Ironholds: it is on the changelog, see: [18:33:32] MatmaRex: that's a good question but I don't know of any way to check [18:33:46] does UploadWizard create pages with a specific tag in the revision tags? [18:34:03] Ironholds: https://github.com/wikimedia/analytics-refinery-source/commits/master/changelog.md [18:34:28] Ironholds: but that doesn't tell you exact date of deploy, it just tells you what is deployed how [18:34:31] *now [18:34:39] Ironholds: let me know if that is sufficient [18:35:19] nuria, the changelog reflects what is merged, not what is deployed, no? [18:35:25] milimetric: no, but the upload log comment is always the same [18:35:28] because the changelog contains the code bd808 just confirmed isn't deployed. [18:35:34] "User created page with UploadWizard" [18:36:03] MatmaRex: oh ok, then let's see if this is available on labsdb [18:39:24] milimetric: what i noticed is here: https://phabricator.wikimedia.org/T120867#1870223 [18:40:11] (that task is primarily about uploads using the new cross-wiki tool in the editor, and why so many of them are really poor quality) [18:40:32] (but i'm noticing other things about the data now that i've gotten it and plotted) [18:40:51] MatmaRex: so how'd you get the numbers in the first place? [18:41:05] Ironholds: let me talk to mforns about something else and we can see about latest jar [18:41:08] milimetric: from the API. https://github.com/MatmaRex/commons-crosswiki-uploads [18:41:15] nuria, yes [18:41:21] https://github.com/MatmaRex/commons-crosswiki-uploads/blob/master/getlog.rb#L6 [18:42:36] I see, MatmaRex, ok, so I'm not going to do anything fancy, just query the db directly and look for that exact rev_comment [18:42:47] nuria, what's up? :] [18:44:41] mforns: we have to upgrade mysql on the slaves for EL [18:44:52] nuria, aha [18:44:52] jaime is going to do it next tuesday [18:45:25] mforns: but he will need your help as standby person. I *think* we should just deactivate the mysql consumer while the update is going on [18:45:32] nuria, did he tell you the hours? [18:45:36] ebernhardson: did you succeed in using HiveContext for your spark job?
[18:45:37] ok [18:45:46] mforns: yes, here is ticket: https://phabricator.wikimedia.org/T121120 [18:45:51] ok [18:46:01] mforns: i have assigned it to you cause it's ops on EU timezone [18:46:16] nuria, ok, I will arrange a time with jaime [18:46:19] mforns: if you cannot make it let me know and we can change the times [18:46:34] nuria, surely can [18:46:40] mforns: also, we might want to pursue some strategy other than deactivating the mysql consumer only [18:46:58] madhuvishy: yes! with joal's help [18:47:20] ebernhardson: oh cool! can you link me to your code? [18:47:20] mforns: but it seems to me: the less impact on the system, the better [18:47:39] nuria, I see deactivating the mysql-consumer as a very good way of dealing with the window [18:48:04] nuria, let kafka buffer the events [18:48:07] mforns: ok then once you sync up with jaime as of exact times please send e-mail to analytics list, will update ticket to this fact [18:48:29] madhuvishy: it mostly came down to this line: https://github.com/wikimedia/wikimedia-discovery-analytics/blob/master/oozie/popularity_score/workflow.xml#L141 [18:48:43] madhuvishy: using cmd line options to pass in hive site, and setting jar paths [18:49:04] ebernhardson: cool. thanks :) [18:49:09] nuria, the only thing I would do is change the queue size before that (to avoid oom's when backfilling), and maybe fix the bug in EL (push and pop to the same end of the queue) [18:49:43] nuria, ok, will do [18:50:12] Ironholds: last deploy with jar v0.0.23 2015-12-01 [18:50:27] mforns: ok, we can deploy the bugfix on Monday and see it work for 24hrs before we do the update, are you OK doing those code changes? I can CR and we can deploy together if you want. [18:50:58] Analytics-Kanban, DBA: 2 hour outage to update mysql on EL slaves - https://phabricator.wikimedia.org/T121120#1870271 (Nuria) @jcrespo: @mforns will coordinate with you as he is on the EU TZ Affected services is EL but only the DB consumer (data will flow into cluster and log files w/o issues) so I thin... [18:50:59] joal, awesome, thanks. [18:51:04] when was the one before that? [18:51:06] nuria, sure I can do the changes [18:51:11] Sorry Ironholds for not having been responsive today :) [18:51:31] I had answers to your questions, but only saw them when backlogging the chan :( [18:51:51] mforns: ok, will CR and if we can deploy today great, otherwise we can deploy Mon [18:52:04] Ironholds: v0.0.22 2015-10-28 [18:52:05] no problem! It's been worked out :) [18:52:11] nuria, cool :] [18:52:17] Ironholds: https://tools.wmflabs.org/sal/analytics [18:52:21] ;) [18:52:29] cool! [18:52:40] I just wanted to make sure my memory of them being infrequent was not some fever dream ;p [18:52:43] Ironholds: not before 2015-10-21 though :( [18:53:08] MatmaRex: I'm running that query on 8 days in december but it's crazy slow [18:53:09] Ironholds: the deploy process is VERY manual, so I tend to avoid doing it often :) [18:53:33] Ironholds: When luca arrives, I hope we'll be able to spend some time on continuous integration :) [18:53:40] aha [18:53:45] luca? [18:53:56] newcomer opsy style [18:53:59] MatmaRex: most likely the numbers will match the ones you get, so I'm just trying to validate that your method is as accurate as any other way to look for this.
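(The query milimetric is running is presumably shaped like the following; a sketch against the standard MediaWiki revision table using the comment string quoted above. The lack of an index on the comment columns, noted below, is why it's so slow.)

```sql
-- Approximate shape of the ad-hoc UploadWizard count: group uploads by
-- day using the fixed upload comment. rev_timestamp is the standard
-- 14-character MediaWiki timestamp, so LEFT(..., 8) yields YYYYMMDD.
-- Date window matches the Dec 1-8 comparison done later in the log.
SELECT LEFT(rev_timestamp, 8) AS day,
       COUNT(*) AS uploads
FROM revision
WHERE rev_comment = 'User created page with UploadWizard'
  AND rev_timestamp >= '20151201'
  AND rev_timestamp < '20151209'
GROUP BY LEFT(rev_timestamp, 8);
```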
[18:54:16] in an ideal world we'd have UploadWizard send an event to EventBus or Event Logging and track things that way [18:54:20] milimetric: hmm, yeah, doesn't seem to be any index on log_comment :/ [18:54:25] ooh cool [18:55:07] milimetric: you seem busy, tomorrow for oozie ? [18:55:25] i just downloaded all the data and i was running those scripts to work with it. it's just 300 MB of JSON for that time period. :P [18:55:30] joal: oh no, I was just running a query for fun, not particularly busy [18:55:35] ok :) [18:55:52] Can I help with something or just CR ? [18:56:01] also milimetric : https://wikitech.wikimedia.org/wiki/Analytics/Data/Pageview_hourly/Sanitization_algorithm_proposal [18:56:06] mforns: --^ [18:56:10] joal: just CR [18:56:15] mostly just curious, what is hive-hcatalog-core.jar? or more specifically why is it included in various hive jobs kicked off from oozie. Trying to figure out if i need it for some reason (optimization of some sort?) [18:56:15] ok great [18:56:35] ebernhardson: needed when parsing jason [18:56:36] joal, will have a look at it :] [18:56:38] JSON sorry [18:56:44] joal: ahh ok [18:56:54] thanks [18:56:58] So needed when using wmf_raw.webrequest mostly [18:57:25] Analytics-Kanban, DBA: 2 hour outage to update mysql on EL slaves - https://phabricator.wikimedia.org/T121120#1870327 (mforns) To deal with the mysql upgrade window, we analytics are planning to stop the (4) EL mysql consumers during that time span. Kafka will buffer the events that come in the meantime,... [18:57:33] joal: I think you mean threshold instead of trigger, but I quickly skimmed and that looks like what you were saying this morning so it makes sense to me [18:57:48] :) [18:58:45] milimetric: Actually, not a threshold, but the value being used with a threshold ... That's why I said trigger, maybe there's something more precise [19:00:28] joal: I see what you mean, sorry, you're right. I personally would just name the headings "Anonymize if there are less than K pageviews" and "Anonymize if there are less than K distinct IPs" [19:01:02] milimetric: You have the annoying habit of saying easily what I spend hours to [19:01:06] say wrongly :) [19:01:20] :P [19:01:23] Please feel free to update, as usual [19:02:27] milimetric: the geo dataset, do we plan to release it ? [19:02:30] oh, I think you're right though, that's a trigger, so it makes sense [19:02:40] joal: no plans to release it at hourly resolution [19:02:45] but we were thinking maybe weekly is ok [19:02:51] I think the same concerns would apply as before [19:02:57] great, that covers my concern :) [19:03:01] indeed :) [19:03:05] but that dataset is already released to some extent [19:03:12] ? [19:03:23] well, wikistats is already doing this just based on the sampled logs [19:03:26] Arf, by wikistats? [19:03:29] riiiight [19:03:49] hm, there is 1000-anonymization then [19:03:51] Analytics-Backlog: Fix EL mysql consumer's deque push/pop usage {oryx} - https://phabricator.wikimedia.org/T120209#1870345 (mforns) [19:04:06] Analytics-Kanban, DBA: 2 hour outage to update mysql on EL slaves - https://phabricator.wikimedia.org/T121120#1870347 (mforns) Before we do that, it's necessary to fix this bug: T120209 [19:04:15] heh, not quite, it's just 1000 times less likely that you *won't* be anonymous [19:04:51] hmmmm right, but 1 meaning a 1000, you are anonymous by default !
[19:04:53] Analytics-Kanban, DBA: 2 hour outage to update mysql on EL slaves - https://phabricator.wikimedia.org/T121120#1870350 (mforns) @jcrespo Hi Jaime :] When is the best time for you to do the mysql upgrade on Tuesday? [19:05:15] well, probably a lot more than 1000 because sampled logs include everything not just pageviews [19:05:27] oh yes, true as well [19:05:33] So wikistats is safe :) [19:05:36] so the relationship between those and page titles is a lot weaker [19:05:39] safe-ish [19:05:41] yup [19:06:04] i'm just always skeptical of this because I've never been able to do complicated math like this in my head :) [19:06:42] :D [19:07:24] MatmaRex: anyway the query is very simple, but quarry is refusing to run it: http://quarry.wmflabs.org/query/6382 [19:07:56] MatmaRex: I've been running it on analytics-store but that's *still* running so I'll paste you the results here when it's done [19:09:54] milimetric: hm. [19:10:25] milimetric: are you sure it's not a typo somewhere? does that match anything? like, `select * where rev_comment = 'User created page with UploadWizard' limit 1` - that matches any rows? [19:12:10] MatmaRex: yeah, I checked, that matched [19:12:14] nuria, btw one question on EL [19:12:29] MatmaRex: based on the number of rows it's scanned, it should be close to done [19:12:43] nuria, where are the tests? I only see .pyc files... O.o? [19:14:25] mforns: where are you looking? [19:15:08] https://github.com/wikimedia/eventlogging [19:15:17] do you have the new submodule? [19:15:22] Analytics-Tech-community-metrics, Developer-Relations, DevRel-December-2015: Check whether it is true that we have lost 40% of (Git) code contributors in the past 12 months - https://phabricator.wikimedia.org/T103292#1870381 (Aklapper) >>! In T103292#1857922, @Nemo_bis wrote: > Make a list of counted c... [19:15:25] madhuvishy, oh! no [19:15:48] mforns: ya that may be it. the server code was moved out into a git submodule [19:15:55] madhuvishy, I see [19:16:27] MatmaRex: doh, I messed up the left(8, rev_timestamp) call, it should be left(rev_timestamp, 8) haha [19:16:33] MatmaRex: but for those 8 days in December I got 37955 [19:16:39] madhuvishy, so how should I proceed to replace the old in-vagrant stuff with the new submodule? [19:16:42] that seems to roughly match what you're seeing [19:17:50] mforns: i just killed the eventlogging repo in /mediawiki/extensions and cloned eventlogging, did submodule update i think [19:17:56] there may be a better way [19:18:00] milimetric: maybe you could try querying from logging table? i think it's smaller, maybe it'll be faster. log_type='upload' and log_action='upload/upload' [19:18:03] madhuvishy, ok thx! [19:18:18] ottomata: I was waiting in da cave for the MEETING ! [19:18:19] :( [19:18:23] OH! [19:18:34] Don't worry, I'm gone now :_P [19:18:36] (my data is also from logging and not revision, my code uses 'logevents' in api.php) [19:18:38] no one came? [19:18:42] nup [19:18:46] hm ok [19:18:50] well, not much to talk about anyway, i guess? [19:18:53] I guess not confirmed means no meeting [19:18:56] (can't believe it's been a month!) [19:18:58] nope [19:19:03] Fast, huh [19:19:05] i guess leila is on vacation and ellery is here at the spark training [19:19:16] research meeting right?
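(Re the sanitization proposal discussed above: to make the "anonymize if there are less than K pageviews" trigger concrete — a sketch only. K is an arbitrary placeholder, the output table is invented, and the real proposal would presumably generalize sensitive fields rather than simply drop rows.)

```sql
-- Keep only rollup rows whose aggregate count clears the threshold.
-- K = 100 is a made-up value; the proposal leaves the threshold open.
INSERT OVERWRITE TABLE pageview_hourly_sanitized
SELECT project, page_title, country_code,
       SUM(view_count) AS view_count
FROM wmf.pageview_hourly
WHERE year = 2015 AND month = 12 AND day = 10 AND hour = 9
GROUP BY project, page_title, country_code
HAVING SUM(view_count) >= 100;
```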
[19:19:26] ottomata: Just said in retro it's been too long since we've spent some time working on something together [19:19:45] ja, s'ok though [19:19:46] yeah [19:19:49] I know it's a relief for you but still, I'll make sure to grab you for some task one of those days :) [19:19:51] you and me ? :) [19:19:59] oh please do! [19:20:19] You're all event-bus, I'm gonna distract :-P [19:20:38] joal: ottomata please upgrade Spark :D [19:20:47] YAY [19:20:49] :) [19:21:00] 1.5 is better with tungsten madhu ? [19:21:10] hehe [19:21:14] it's in the new cdh! [19:21:23] (CR) Joal: [C: -1] "Few changes, but globally ok :)" (7 comments) [analytics/refinery] - https://gerrit.wikimedia.org/r/256355 (https://phabricator.wikimedia.org/T118323) (owner: Milimetric) [19:21:29] yeah, and it has such cool visualizations on what's going on in each stage of the computation [19:21:35] in the spark UI [19:21:41] so easy to debug [19:21:47] Oh, that's great ! [19:22:04] they are showing 1.6 preview here though [19:22:17] madhu, do you know if they have solved the issue of very big files per worker (there was a hard limit of 2G) [19:22:26] ? [19:22:32] let me ask in a bit :) [19:22:36] great [19:22:40] ottomata: lets get the new cdh, new java, new spark, new kafka, new hue [19:23:00] to be precise madhu, it's a buffering file limitation I think [19:23:09] ottomata: NEW CLUSTER ! [19:23:12] :D [19:23:42] heheh [19:23:51] * YuviPanda looks around vaguely [19:23:59] MatmaRex: no, looking for that in the logging table seems to not work [19:24:05] hey, yeah maybe you me and luca can have fun next quarter doing tons of stuff like that [19:24:06] * joal waves [19:24:15] MatmaRex: but the other query works fine, check those numbers with yours, do they match? [19:24:15] ha ha [19:24:17] yesss [19:24:28] and YuviPanda will setup all the notebooks [19:24:39] ottomata: YESIR ! Plus some druidy thingies :) [19:24:49] Oh YEAH ! [19:24:55] tools.wmflabs.org/paws [19:25:02] notebook running on our kubernetes cluster [19:25:07] for pywikibot people :D [19:25:16] (WMF) accounts will not work yet [19:25:44] YuviPanda: That's awesome :) [19:26:02] milimetric: which days was this, again? [19:26:02] joal: :D I'm going to set up a more researcher focused one too soon [19:26:20] joal: with matplotlib / scipy / numpy / pandas stack, with easy access to dumps and labsdb [19:26:22] YuviPanda: I'll ask Ellery if it would be feasible to connect a notebook with pyspark on the cluster [19:26:26] MatmaRex: December 1 through December 8 inclusive [19:26:38] joal: he's sitting in front of me, that's what he uses all the time apparently [19:26:40] YuviPanda: That would just rock :) [19:26:58] joal: he runs a notebook server on stat1002 and tunnels to it [19:27:00] joal: I want to set up a notebook setup in production though [19:27:05] Yes, now the thing is, how to connect properly, but I'm sure it would be feasible [19:27:08] I think it'll allow people to self-serve [19:27:18] joal: jupyterhub has proper authentication and stuff now [19:27:19] ok YuviPanda [19:27:29] milimetric: in that case it seems to almost match. i see 37977 non-deleted uploads via UploadWizard in this period. i might have slightly old data, somebody might've deleted the missing 22 in that time.
[19:27:29] joal: so you can just set up LDAP based access and be done with that [19:27:36] I think we would want to restrict to in-wmf only [19:27:47] joal: yup, so LDAP with that group restriction would work [19:27:49] yoip YuviPanda [19:27:52] (39902 in total, including deleted files) [19:27:54] sorry missed the previous [19:28:00] That would just ROCK [19:28:04] :D [19:28:10] joal: you can also use MW OAuth with officewiki :D [19:28:14] MatmaRex: most likely, deletion drift has about that pace [19:28:22] joal: since I have merged a mediawiki auth provider upstrea [19:28:24] m [19:28:25] MatmaRex: ok, so that means your method is the better one since it seems to be faster [19:28:29] * YuviPanda has been doing a lot of upstream stuff with notebooks [19:28:41] MatmaRex: and if we want to monitor this ongoing, we should instrument it with EL [19:29:23] joal: I'm totally up for doing it, just needs enough 'demand' :D [19:29:32] YuviPanda: Then I'll become nasty: https://github.com/ibm-et/spark-kernel [19:29:38] :D [19:29:55] I can't let go scala for python using Spark, that'll not happen :) [19:30:18] YuviPanda: I'll gather the demands and get back :) [19:30:33] Thanks for setting that up [19:30:35] milimetric: i don't think it's faster, it's more like… pre-cached. it took a while to download all the data [19:30:36] joal: +1. I've been running it on tools for a few weeks [19:30:49] joal: I expect the more researcher focused one to be up later this week or next week [19:31:04] YuviPanda: Please let us know :) [19:31:09] joal: I will! [19:31:13] joal: oh and another thing [19:31:20] I use pandas regularly, so having that will actually help [19:31:28] joal: http://jupyter-mw.wmflabs.org/wiki/Notebook:Evil [19:31:32] joal: that look familiar? :D [19:31:39] it's notebooks integrated into mw [19:31:42] so you can publish on wiki [19:31:50] (PS5) Milimetric: Oozie-fy Country Breakdown Pageview Report [analytics/refinery] - https://gerrit.wikimedia.org/r/256355 (https://phabricator.wikimedia.org/T118323) [19:31:58] (CR) Milimetric: Oozie-fy Country Breakdown Pageview Report (6 comments) [analytics/refinery] - https://gerrit.wikimedia.org/r/256355 (https://phabricator.wikimedia.org/T118323) (owner: Milimetric) [19:32:13] milimetric: hmm, since i already have your attention, got a minute or two to chat? :P we want to run a small A/B test related to uploads, i was hoping for somebody from analytics to sanity-check the ideas [19:32:16] Mwaaahahahaha :) [19:32:25] Not even needed to write doc anymore ! [19:32:34] MatmaRex: sure, but usually you'd want someone from Research or an Analyst [19:32:45] I can help with tools but I'm famously shitty as an analyst :) [19:32:54] joal: :D [19:33:09] ok YuviPanda, I'll wait IMPATIENTLY :) [19:33:21] joal: :D I've been working with upstream on their HTML sanitization too https://github.com/jupyter/nbconvert/pull/172 [19:33:22] MatmaRex: but of course do ask what's up, I'll try to help [19:33:34] ellery: Hi! Can you point AndyRussG and I to your code and data repos?
[19:33:36] joal: I've basically been in notebookland for the last two weeks [19:33:43] brb [19:34:06] joal: but yeah, you need to gather more demands :D this might need a new node maybe (notebooks on production) although we can just run it on stat* if we need to [19:34:20] I'm gone for dinner a-team, YuviPanda, thanks for the heads-up, I'll try to bug you at all-hands to learn some more on notebooks hands-on :) [19:34:29] milimetric: the short version is, we discovered that the cross-wiki upload tool (in VE, Insert->Media->Upload) is used to upload around 1000 images per day. but a majority of these are copyright violations or otherwise useless. https://phabricator.wikimedia.org/T120867 [19:34:41] good night joal, see you [19:34:41] joal: oh yeah, I expect it to be pretty damn awesome in that time period :D [19:34:43] YuviPanda: duly noted [19:34:46] joal: have a good dinneR! [19:34:51] :] [19:34:52] Thanks :) [19:34:59] milimetric: we want to try out a few interfaces, hopefully before christmas, and see if any of them results in a better ratio of good:bad images [19:35:47] MatmaRex: ok, so in that case I'd definitely instrument UploadWizard with EventLogging [19:36:01] because otherwise you'd have no way of differentiating between uploads from one interface vs. another [19:36:16] milimetric: it's kind of short notice, because no one noticed that people are uploading junk until Commons people got upset and started yelling at us :/ and we're hoping not to have to disable it [19:36:31] milimetric: right now, to tell cross-wiki uploads apart from anything else, we just use a change tag [19:36:53] MatmaRex: gotcha, makes sense, but you'll need to tell one kind of cross-wiki upload from another kind [19:37:22] adding more tags seems harder than instrumenting with EL, so I'd recommend the latter [19:38:40] MatmaRex: is that not an option for any reason? [19:39:03] milimetric: so yeah, that is one thing i am wondering about. using change tags would be easier for me to write, and easier to query (i don't have analytics access at the moment because of management confusion) [19:39:12] it would also be nice and transparent :) [19:39:47] MatmaRex: well, EL schemas are public and they're about as transparent as you can get without inundating the public data with tags [19:39:54] milimetric: we couldn't measure anything other than the number of successful uploads. i don't think we want to measure anything else, though… [19:40:10] (when using tags, we couldn't…*) [19:40:15] MatmaRex: we can definitely help with the management confusion, you should have access to query and we can grant you that access, do you have an Ops-Access-Request? [19:40:32] joal: the 2gb restriction still exists [19:40:37] https://phabricator.wikimedia.org/T119404 [19:40:58] joal: they do want to work on getting around it - i've asked them to keep us posted [19:41:15] madhuvishy: thanks a lot :D [19:41:21] mforns: did you find tests on EL alreday? [19:41:27] *already [19:41:47] Analytics, Security-Team: Establish a process to periodically review and approve access for hadoop/hue users - https://phabricator.wikimedia.org/T121136#1870462 (csteipp) NEW [19:42:25] MatmaRex: ok, I guess escalate this to Trevor if you don't see him approve it today [19:42:43] MatmaRex: so I take it you haven't worked with EL much? [19:42:53] I think it's probably easier than you think...
[19:43:05] milimetric: not at all :) [19:43:10] milimetric: I'll let you test your last patch, if it works, merged tomorrow morning :) [19:43:12] milimetric: but they'll have a 3 day wait? :/ [19:43:23] but anyway, i think that's the least important part [19:43:26] madhuvishy: yeah, but the EL instrumentation wouldn't be done and deployed in that time anyway [19:43:32] * joal is gone for real, now [19:43:34] okay [19:43:38] bye joal [19:43:55] MatmaRex: what's least important you learning EL or the access? [19:44:11] Is the hive CLI deprecated by beeline? I'm trying to use beeline, but getting "not connected" errors. [19:44:24] awight: hive CLI is fine, beeline's not ready yet [19:44:25] I don't see anything about this on https://wikitech.wikimedia.org/wiki/Analytics/Data_access#Hive [19:44:26] milimetric: least important is which tool we use to track the results. i'd imagine both change tags and EL are equally reliable. [19:44:30] milimetric: ok, thank you! [19:44:33] awight: no, we use hive CLI still [19:44:56] nuria: excellent--the 2-hour waits I've grown accustomed to :p [19:45:08] MatmaRex: right, ok, so let's pretend we have some way of finding uploads and figuring out which interface they were uploaded from [19:45:28] awight: beeline is usable - we haven't played with it enough - if you do want it: beeline -u jdbc:hive2://analytics1027.eqiad.wmnet:10000 -n [19:45:43] ooh, fancy! [19:45:47] lol, beeline won't make anything faster, it'll just be more ... bzzzzz hive-y [19:45:51] madhuvishy: thank you! [19:45:54] milimetric: do you think it's going to be problematic to test more than 2 versions at once? people acted ominous when i mentioned that. :P [19:46:03] it has output formatting options though [19:46:09] Yes, I plan to leave the office today covered in honey and ants [19:46:11] you can get vertical, csv, tsv etc [19:46:13] awight: you might run into issues though [19:46:19] with hive, only tsv [19:46:38] MatmaRex: I don't know how that works in mediawiki, the only team I know that's done true A/B tests where they release one interface to one group of people and another interface to another group of people is the former E3 / Growth team [19:46:50] and beeline is configured with 1024mb memory - no OOM's mostly [19:46:58] MatmaRex: and they had some tools but I remember they had to jump through a lot of hoops, matt_flaschen might know more [19:47:34] milimetric: yeah, there seem to be about three different ways to bucket people in JS, all of them deprecated :P [19:48:04] MatmaRex: bucketing is not likely going to be the problem [19:48:36] MatmaRex: EL has code for proper bucketing, but the problem is rather to set up your test so you have a true A/B experiment [19:50:13] nuria: wait but EL can't change the interface [19:50:24] the problem here is we want to measure how interface A compares to interface B [19:50:47] EL would have a field that would capture which interface you're using, and that can be analyzed later [19:51:00] milimetric: right, stickiness of features -as far as we know- is only handled by signing up for beta features [19:51:01] but actually serving different interfaces to different people is the hard part [19:51:18] milimetric: sure, one that mediawiki has to solve [19:51:48] right, ok, MatmaRex so I don't think we can help you with splitting people up into different groups that get different interfaces [19:52:06] but do talk to matt and maybe ori, they both worked on this in their previous lives [19:52:25] hmm. right. [19:52:28] Ironholds: did you figure out whether bd808's code was deployed?
[19:53:03] MatmaRex: but say you got everything else working and the access is the only part you can't do, I'm happy to set up some recurrent queries that get the data you need and put it in a public file somewhere that you can easily graph [19:53:24] MatmaRex: for that we use this cool tool called reportupdater and it does lots of other things like this [19:53:24] also, do you think it's going to be enough to run this for a week? uploading files is a comparatively "low frequency" activity. we're getting ~1000 files a day currently via this tool, we're expecting to get fewer with the interfaces we'll be testing [19:53:44] MatmaRex: for that I'd talk to folks in the Research team [19:54:01] MatmaRex: Aaron Halfaker, Leila Zia, etc. [19:54:08] they'll be able to help you figure out how to run a proper experiment [19:54:12] * halfak responds to ping [19:54:14] Hey! [19:54:17] more channels? alright [19:54:21] oh, halfak is here <3 [19:54:35] :D [19:54:40] halfak: so, the summary: [19:55:38] halfak: we have a shiny file upload tool inside VisualEditor (Insert->Media->Upload). we're getting ~1000 images a day uploaded with it, which is nice. what isn't nice is that they are mostly copyright violations or otherwise junk, and Commons people are rightfully angry. (the uploads go to Commons) [19:55:52] * YuviPanda gets an intense sense of deja vu [19:56:03] Is it possible to get anonymous/logged-in status or interface language from the `webrequest` table? [19:56:09] halfak: we want to test some different interfaces, aiming to improve the ratio of good to bad uploads, and i'm basically asking how to go about this [19:56:10] YuviPanda: did they change something on teh matrix? [19:56:12] *the [19:56:17] the task is https://phabricator.wikimedia.org/T120867#1868368 [19:56:30] nuria: :D [19:56:32] YuviPanda: see, not quite. mobile selfiepocalypse was about selfies, not copybios. [19:56:42] copyvios* [19:56:52] MatmaRex, so outcome measure is "proportion of good uploads"? [19:56:59] great example of mobile vs desktop usage patterns, i guess. :P [19:57:03] lol, I miss Maryana explaining selfiepocalypse [19:57:09] I see 1000 uploads per day, but how many users is that? [19:57:41] halfak: most are new users, rarely uploading more than one image, i think. i can give you the numbers in a minute [19:58:12] What proportion of image uploads are currently "good"? [19:58:22] halfak: outcome is basically that, yes. (although it wouldn't be good if the interface stopped almost everyone from uploading.) [19:58:48] halfak: see graphs, compared to other upload methods: https://phabricator.wikimedia.org/T120867#1868258 - at least 20% are bad, maybe as much as 50% [19:59:06] (uploadwizard has around 7% bad, i think) [19:59:48] halfak: when people checked uploads from one day very carefully, they ended up deleting 50%, the average from other days is closer to 20% [20:00:02] milimetric, will join in a sec. [20:00:07] np halfak [20:00:09] I think I can quickly do a power analysis [20:02:22] Assuming we have ~7000 obs from a week, split evenly between two conditions (3500), we should be able to detect significant diffs at around 3%. [20:03:01] If you run the experiment for two weeks and get ~14000 obs, you can detect significant differences doen to 2%. [20:03:05] *down [20:03:14] So it looks like 1 week is good. [20:03:39] Would you value an effect that was only 2%? [20:03:46] Or would you like to see a 5-10% difference? [20:04:00] If only 5-10% is meaningful to you, then I'd advise running the experiment for a week.
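(halfak's numbers above are consistent with the textbook two-proportion power calculation; a sketch, assuming α = 0.05, power = 0.8, and a baseline bad-upload rate around p ≈ 0.2 — all assumptions, since he doesn't show his work.)

```latex
% Minimum detectable difference for two groups of n observations each:
%   z_{1-\alpha/2} = 1.96 (at \alpha = 0.05), z_{1-\beta} = 0.84 (power 0.8)
\delta \approx \left( z_{1-\alpha/2} + z_{1-\beta} \right)
               \sqrt{\frac{2\,p(1-p)}{n}}
% With p = 0.2: n = 3500 per condition gives
%   \delta \approx 2.8 \sqrt{0.32 / 3500} \approx 0.027  (~3%),
% and n = 7000 per condition gives \delta \approx 0.019 (~2%),
% matching the one-week and two-week estimates above.
```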
[20:04:22] You don't want to run it for less than a week because of periodic patterns in editor behavior around weekends/week-days. [20:04:33] * halfak needs to run now but can chat more later. [20:04:50] halfak: i'm not entirely sure what you're asking, but if you're saying a week is probably fine, that's good enough for me [20:05:34] thanks [20:06:29] halfak: hmm, if we were to compare more options than 2 (say, 4), would a week still be reasonable? [20:13:35] awight: no, webrequest is geared towards request data, rather than user data, some "user" data goes into x-analytics field but are things more like "were there cookies on this request?", "app version" [20:13:53] awight: https://wikitech.wikimedia.org/wiki/X-Analytics [20:24:11] if i'm creating a table in hive that holds hourly aggregated data, but it only gets ~4k rows/hour, would partitioning it only by (year,month) make the most sense? Or just skip partitions altogether [20:24:36] ebernhardson: up to you really :) [20:24:46] the partitioning is mostly just for efficiency [20:24:49] :P [20:25:04] so a bigger bucket would probably be ok [20:25:28] especially if you don't intend to query it on smaller granularities regularly [20:25:40] yea i'm thinking this data will be small enough to not matter...i was almost thinking of just exporting it from hive to something else. but that seemed like even more work that wasn't necessary :) [20:47:59] Analytics-Kanban, Patch-For-Review: Fix EL mysql consumer's deque push/pop usage {oryx} - https://phabricator.wikimedia.org/T120209#1870652 (mforns) [20:48:06] madhuvishy: are you done with training? [20:48:26] Analytics-Kanban, Patch-For-Review: Fix EL mysql consumer's deque push/pop usage {oryx} - https://phabricator.wikimedia.org/T120209#1870653 (mforns) a:mforns [20:48:58] nuria: Thanks! I just learned I should also be reading the pageviews_hourly table. [20:49:11] awight: it does not have that info either [20:49:16] Analytics-Kanban, Patch-For-Review: Fix EL mysql consumer's deque push/pop and size {oryx} - https://phabricator.wikimedia.org/T120209#1870656 (mforns) [20:49:37] awight: makes sense? [20:56:54] nuria: yes, thanks. [20:59:17] nuria: trying to use the network_origin() function without sideloading my own jars fails with "Invalid function 'network_origin'" so I think that means that the default jars on stat1002 do not include my patches yet. [20:59:37] bd808: mmmm... it should, let's look at that in a bit [21:00:00] nuria: it's the whole day - here till 5 [21:02:53] madhuvishy: ok, have changed a couple of things on the test of your patch, all minor though [21:03:03] nuria: oh sure [21:03:09] thanks [21:04:13] nuria, I pushed the EL changes for review [21:04:36] https://gerrit.wikimedia.org/r/#/c/258217/ [21:08:32] Analytics-Backlog: Remove batching queue from EL code - https://phabricator.wikimedia.org/T121151#1870731 (Nuria) NEW [21:08:36] cool, mforns what's the deal with the popping on the wrong side? [21:09:56] ottomata, it works the same, except it makes backfilling confusing, because there are some event batches that can be kept a long time in the deque [21:10:06] until the deque empties [21:10:59] instead of first in first out, the deque was last in first out [21:11:04] ohhh [21:11:19] like a stack? [21:11:22] yes [21:11:33] huh, interesting [21:11:46] so if the queue is full, then earlier events get inserted last [21:11:53] which would be confusing while waiting for things, eh?
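(Re ebernhardson's partitioning question above: at ~4k rows/hour the data is tiny, so coarse partitions are plenty. A sketch of the (year, month) option he describes — table name, columns, and storage format are all invented.)

```sql
-- Coarse monthly partitions for a small hourly aggregate (~4k rows/hr):
-- finer-grained day/hour values live in ordinary columns instead, so
-- the partition count stays low while per-hour queries remain easy.
CREATE TABLE hourly_aggregates (
  day          INT,
  hour         INT,
  metric       STRING,
  metric_count BIGINT
)
PARTITIONED BY (year INT, month INT)
STORED AS PARQUET;
```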
[21:17:37] nuria, thx for creating that task
[21:17:49] np
[21:20:48] How old is the data in pageview_hourly? I'm getting empty results for Dec 10, 09:00 UTC
[21:24:23] Sorry, I must be doing something else wrong. This also returned null results: https://phabricator.wikimedia.org/P2403
[21:27:58] Hi all! I'm curious about all the 000000_0_copy_XX files generated by my oozie job (https://gerrit.wikimedia.org/r/258058) - it's creating those for the textfile tables I add to every hour, but not the rollup ones I truncate and recreate each time
[21:28:38] I can easily cat them together to get the by-hour data, but I wonder if I'm doing something wrong
[21:30:32] just a heads up, i'll be deploying the new avro schema for cirrussearch in ~2.5 hours.
[21:31:21] nuria, wanna deploy EL?
[21:31:52] mforns: sure, do you have some time to babysit after? in 30 mins i need to leave for 1 hour
[21:32:03] nuria, yes
[21:32:15] mforns: ok, let's deploy
[21:32:19] batcave?
[21:32:24] sure
[21:34:41] awight: sounds like you want country_code
[21:35:30] but you are using country. and doing IN on a list of country codes
[21:35:46] madhuvishy: oooh thank you
[21:36:22] np :)
[21:36:45] madhuvishy: that was it!
[21:36:47] bd808: sorry, have not been able to look at your jar issue yet
[21:37:10] nuria: no worries. If it's on your radar I'm happy
[21:37:58] ebernhardson: you can !log here those things if you want :)
[21:42:48] madhuvishy: oh i didn't remember you had that here. I'll log it when it's deployed
[21:43:09] ebernhardson: great thanks :)
[21:48:35] ejegg: i can check in a bit - what do you wanna do?
[21:49:43] madhuvishy: Just wondering if it's normal for each insert into a table stored as textfile to generate another 000000_0_copy file
[21:50:13] I can totally work with it if so, but I wonder if I'm doing something wrong.
[21:50:53] hmmm we've never run into that - probably something in the way that you are using it. i'm at a talk, but let me read your code in a bit to see what's happening
[21:56:37] Analytics-Kanban, DBA: 2 hour outage to update mysql on EL slaves - https://phabricator.wikimedia.org/T121120#1870883 (mforns) @jaime The bug mentioned above has been fixed in production, we can proceed with the upgrade.
[21:57:30] bd808: https://gist.github.com/jobar/fff283765cb4aec276bf
[21:57:37] G'night all :)
[21:58:21] joal: ah. so the jar exists, it's just not default yet
[21:58:39] bd808: There is no "default" jar per se
[21:59:15] When you need some custom UDF, the jar needs to be added
[21:59:21] ooh a new version :)
[21:59:22] https://www.irccloud.com/pastebin/cldCI92c/
[21:59:45] Analytics-Kanban, Patch-For-Review: Fix EL mysql consumer's deque push/pop and size {oryx} [3 pts] - https://phabricator.wikimedia.org/T120209#1870891 (mforns)
[21:59:49] What's the current estimate of the average proportion of agent_type='user' pageviews coming from logged-in vs anonymous readers?
[22:00:00] joal: got it. I glossed over that part on https://wikitech.wikimedia.org/wiki/Analytics/Cluster/Hive/QueryUsingUDF
[22:00:08] thanks for the pointer
[22:00:11] np :)
[22:00:30] Have a good hive session bd808 :)
[22:00:42] ahh that's right, mine got merged after this version was cut. I guess i keep using my -snapshot version for now :)
[22:00:50] nuria: mischief managed. joal hit me with the cluebat
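
(Editor's note: for readers hitting the same "Invalid function" error as bd808, a sketch of the per-session setup described on the QueryUsingUDF page linked above. The jar path and the Java class name are illustrative assumptions, not copied from joal's gist.)

```
-- A custom UDF's jar has to be added to the session and bound to a function
-- name before use; skipping either step is what produces
-- "Invalid function 'network_origin'". Path and class below are assumed.
ADD JAR /srv/deployment/analytics/refinery/artifacts/refinery-hive.jar;

CREATE TEMPORARY FUNCTION network_origin
  AS 'org.wikimedia.analytics.refinery.hive.NetworkOriginUDF';

SELECT network_origin(ip) AS origin, COUNT(*) AS requests
FROM wmf.webrequest
WHERE webrequest_source = 'text'
  AND year = 2015 AND month = 12 AND day = 10 AND hour = 9
GROUP BY network_origin(ip);
```
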
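
(Editor's note: and to close the loop on awight's empty pageview_hourly results above, the fix madhuvishy pointed out was to filter on `country_code` rather than `country`. A sketch, using the wmf schema's column names as I understand them:)

```
-- pageview_hourly has both `country` (full names like 'United States') and
-- `country_code` (ISO 3166-1 alpha-2 like 'US'); an IN list of two-letter
-- codes matched against `country` silently returns nothing.
SELECT country_code, SUM(view_count) AS views
FROM wmf.pageview_hourly
WHERE year = 2015 AND month = 12 AND day = 10 AND hour = 9
  AND agent_type = 'user'
  AND country_code IN ('US', 'GB', 'DE')
GROUP BY country_code;
```
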
[22:00:52] bd808: ok, things should be working fine
[22:00:54] RECOVERY - Overall insertion rate from MySQL consumer on graphite1001 is OK: OK: Less than 20.00% under the threshold [100.0]
[22:01:18] bd808: haha, ya i figured it was something when defining the function. ok, solved now
[22:01:35] that icinga alarm i think is from mforns and me deploying EL
[22:06:06] milimetric: spark is just distributed LINQ!
[22:06:14] * YuviPanda feels happy C# feelings
[22:10:36] thanks madhuvishy !
[22:22:45] PROBLEM - Overall insertion rate from MySQL consumer on graphite1001 is CRITICAL: CRITICAL: 20.00% of data under the critical threshold [10.0]
[22:25:37] is stats.wikimedia.org in a repository somewhere?
[22:28:44] Analytics-Backlog: Upgrade Spark to 1.5 - https://phabricator.wikimedia.org/T121159#1870966 (ellery) NEW
[22:34:33] Analytics-Backlog: Upgrade Spark to 1.5 - https://phabricator.wikimedia.org/T121159#1870998 (ellery)
[22:51:20] ejegg: can you describe to me the flow of data you expect to happen in your hive ql, and how you want to access the data
[22:51:37] if there's a ticket somewhere explaining it, pointing me to it is fine too
[22:53:15] madhuvishy: kinda simple, I can just describe here
[22:53:19] sure
[22:53:37] first, distilling all requests to donate.wm.org/
[22:53:46] that have a contact_id param
[22:54:08] what is this source table?
[22:54:18] webrequest
[22:54:23] ok cool
[22:55:09] then saving contact_id and a couple other params, along with hr/day/mo/yr, to ejegg.donatewiki
[22:55:40] okay
[22:55:46] next, get unique contact_id / utm_source combinations.
[22:56:36] for the current hour, insert into ejegg.donatewiki_unique all the rows where that contact_id and utm_source don't exist
[22:57:03] then insert counts into a couple of tables stored as textfiles, for easy export as csv
[22:57:26] one pair of gross/unique tables with counts by hour
[22:58:04] and a pair of them with counts for each utm_source rolled up across all time
[22:59:46] then I'm hoping to export the count tables and make them available to the rest of the fundraising department
[23:00:18] can be public, though they won't be useful to anyone else
[23:00:58] Analytics-Wikistats, Internet-Archive: "Top month" and "Trend last 24 months" missing in Wikipedia columns - https://phabricator.wikimedia.org/T72900#1871174 (Reedy)
[23:01:22] ejegg: do you need these intermediate tables, donatewiki and donatewiki_unique?
[23:01:51] yep, I need to keep track of what contact_id / utm_source combos have already been seen
[23:02:31] FR wants counts of unique contact ids opening each email
[23:03:00] Or rather, clicking links from each email
[23:03:34] As well as total numbers of clicks, counting contacts that click twice
[23:05:03] madhuvishy: but I don't want to export those tables
[23:05:46] ejegg: i think the thing you are missing is the idea that the hive queries generate results, which should be written to files on hdfs. these files can be whatever format in some location. independently, you define an external hive table with a matching schema for these result files. whatever you are creating and inserting into in your hive ql are not real tables
[23:06:11] hmm, I could get away with just a contact_id / utm_source table to track what combinations have been seen
[23:06:12] look at some of our other oozie jobs
[23:06:27] to get an idea
[23:07:31] OK, so only the intermediate table(s) should be in hdfs, and the counts should be external tables?
[23:08:18] ejegg: "tables" are only metadata
[23:08:25] actual data is in files
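
(Editor's note: a sketch of the "insert only unseen contact_id / utm_source combinations" step ejegg describes above, written as a Hive left anti-join. Table and column names follow ejegg's description but are otherwise assumed; staging the hour's rows to a temporary table first would be a more cautious version of the same idea, since this statement reads and appends to donatewiki_unique at once.)

```
-- Keep only this hour's (contact_id, utm_source) pairs that have no match
-- in donatewiki_unique: unmatched left-join rows come back with NULLs on
-- the right side, and the WHERE clause keeps exactly those.
INSERT INTO TABLE ejegg.donatewiki_unique
SELECT cur.contact_id, cur.utm_source
FROM (
  SELECT DISTINCT contact_id, utm_source
  FROM ejegg.donatewiki
  WHERE year = 2015 AND month = 12 AND day = 10 AND hour = 9
) cur
LEFT JOIN ejegg.donatewiki_unique seen
  ON cur.contact_id = seen.contact_id
 AND cur.utm_source = seen.utm_source
WHERE seen.contact_id IS NULL;
```
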
[23:09:01] So I'd use a temp 'table' to pull in the current hour's data, which I'm currently storing in ejegg.donatewiki
[23:10:03] have a single persistent table/file stored on hdfs that just lists contact_id/utm_source combinations already seen
[23:10:26] your intermediate data and final counts should be in external tables. only you'll set the location for intermediate ones in /tmp or something. And create the external tables for the data you want to finally use, outside of this hql file
[23:10:26] if you look at the hive/ directory in analytics-refinery
[23:10:26] you can see a bunch of creates
[23:12:22] ejegg: the hive-action that you define in the workflow will run the query in the script param, and store it in the destination directory.
[23:12:38] OK, cool. But the persistent table I'm using to join to my intermediate data and filter out existing combos, that one should live in hdfs?
[23:12:45] that will happen if you don't have an external table defined on top of the final results
[23:12:59] everything is in hdfs - where else?
[23:13:30] oh, I thought external tables meant not-hdfs
[23:13:40] sorry, what is the 'external' distinction?
[23:15:04] Analytics-Backlog: Include all timezones in global metrics report interface {kudu} - https://phabricator.wikimedia.org/T121167#1871524 (Abit) NEW a:Milimetric
[23:16:01] nvm, I see the docs!
[23:16:03] ejegg: like i said - hive tables don't really "store" anything. I think the distinction between create table and create external table is just that with external you can tell it where to look
[23:17:20] and some other implementation details
[23:17:35] Would that help at all with the _copy_ issue? I'd still be adding rows each hour to the by_hour count tables
[23:17:57] ya i don't think you should get any of those files
[23:18:12] there may also be things missing in the oozie config to cause it
[23:18:49] hmm, ok. My google-fu is failing me on this one
[23:19:33] ejegg: have you tried running your query outside oozie?
[23:20:34] madhuvishy: it's getting me the results I want when I run in the hive cli, let me see if it's creating all those tables...
[23:21:00] there is a bunch of docs on wikitech
[23:21:55] yeah, I've been following those and peeking at all the o+r job files I can find on stat1002 to get this far
[23:24:43] ejegg: something like this is good to read - https://github.com/wikimedia/analytics-refinery/blob/master/oozie/mobile_apps/uniques/monthly/generate_uniques_monthly.hql
[23:25:16] thanks, I'll check it out!
[23:28:19] madhuvishy: I created a test table in ejegg (with the same storage format as in my hql) and ran a few different inserts via the hive cli. each one creates a new _copy_ file
[23:28:22] this job writes the final results into an external table (stored at a temporary directory), and then runs another oozie action to merge the results into another file, which is what we look at. we have an external table defined on top of that file. that's here - https://github.com/wikimedia/analytics-refinery/blob/master/hive/mobile_apps/create_mobile_apps_uniques_monthly_table.hql
[23:28:26] will try with external
[23:29:28] ah, maybe I don't need the serde with external...
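
(Editor's note: a sketch of the "an external table is just metadata over files" idea madhuvishy keeps pointing at, loosely modeled on the create_mobile_apps_uniques_monthly_table.hql file linked above. The schema and the HDFS path are hypothetical.)

```
-- An external table is a schema plus a LOCATION: Hive reads whatever files
-- sit in that directory, and DROP TABLE removes only the metadata, never
-- the files. The Oozie job just writes its result files into the directory.
CREATE EXTERNAL TABLE IF NOT EXISTS ejegg.donatewiki_counts (
  utm_source STRING,
  clicks     BIGINT,
  uniques    BIGINT
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
STORED AS TEXTFILE
LOCATION 'hdfs://analytics-hadoop/user/ejegg/donatewiki_counts';
```
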
[23:39:11] madhuvishy: ah well, even creating it external just like that gh link gives me a new _copy_ file
[23:39:16] for each insert
[23:39:28] ohwaitaminute... insert overwrite might be what i need
[23:42:32] nah, i don't actually want to overwrite
[23:44:37] huh, I'll look at a bunch more examples...
[23:45:07] thanks for the help madhuvishy !
[23:54:00] nuria: do we expect the mysql consumers to be lagging now?
[23:54:08] i got a burrow alert
[23:54:42] madhuvishy: lagging as in pulling from kafka slower than before?
[23:55:48] nuria: yeah
[23:56:00] not in error state but warning - so i guess slower
[23:56:09] madhuvishy: so marcel decreased the batch sizes, see: https://grafana.wikimedia.org/dashboard/db/eventlogging
[23:56:16] nuria: mm hmm
[23:56:39] madhuvishy: and looks like insertion rate is smaller.... mmmm
[23:57:03] hmmm, do you expect it to catch up?
[23:57:51] madhuvishy: it looks like having it this way decreased the inserted rate, which is bad
[23:58:00] madhuvishy: we are going to have to deploy
[23:58:03] ah yeah
[23:59:11] i see the inserted rate is going down
[23:59:41] madhuvishy: ok, changing sizes back to what they were
[23:59:47] madhuvishy: stand by to CR
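
(Editor's note: closing out ejegg's _copy_ thread from above. Each INSERT INTO appends another 000000_0_copy_N file beside the existing ones, which is where those files come from. One common way out, sketched here with assumed table names, is to partition the count table by hour and INSERT OVERWRITE only the current partition: earlier hours keep their data, so the overwrite loses nothing, and re-running an hour is idempotent.)

```
-- Overwriting one (year, month, day, hour) partition replaces only that
-- hour's files; all other partitions are untouched, and re-running the job
-- for the hour yields one clean output file instead of a pile of _copy_N.
INSERT OVERWRITE TABLE ejegg.donatewiki_counts_by_hour
  PARTITION (year = 2015, month = 12, day = 10, hour = 9)
SELECT utm_source,
       COUNT(*)                   AS clicks,
       COUNT(DISTINCT contact_id) AS uniques
FROM ejegg.donatewiki
WHERE year = 2015 AND month = 12 AND day = 10 AND hour = 9
GROUP BY utm_source;
```
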