[03:55:48] Analytics-Backlog, Analytics-Cluster, Easy: Mobile PM sees reports on browsers (Weekly or Daily) - https://phabricator.wikimedia.org/T88504#1615431 (Krinkle) See also T107175. I'm also eagerly waiting for this. Essentially a user-friendly insight to aggregated information from wmf.webrequest.user_agent... [07:06:01] Analytics-Cluster, operations: Fix llama user id - https://phabricator.wikimedia.org/T100678#1615603 (faidon) Any news here @ottomata? puppet has been failing on analytics1026 for more than 68 days now. [09:24:08] Analytics-Tech-community-metrics, ECT-September-2015: Automated generation of (Git) repositories for Korma - https://phabricator.wikimedia.org/T110678#1615864 (Dicortazar) @Qgil, that's the point. However, as the name of the git repositories can be inferred from the name of the gerrit projects, that shou... [09:50:15] Analytics-Tech-community-metrics, ECT-September-2015: Automate the identities generation - https://phabricator.wikimedia.org/T111767#1615957 (Dicortazar) NEW a:Dicortazar [09:51:56] Analytics-Tech-community-metrics, ECT-September-2015: Automate the identities generation - https://phabricator.wikimedia.org/T111767#1615957 (Dicortazar) [09:51:58] Analytics-Tech-community-metrics, ECT-September-2015: "Median time to review for Gerrit Changesets, per month": External vs. WMF/WMDE/etc patch authors - https://phabricator.wikimedia.org/T100189#1615966 (Dicortazar) [09:54:14] Analytics-Tech-community-metrics, ECT-September-2015: "Median time to review for Gerrit Changesets, per month": External vs. WMF/WMDE/etc patch authors - https://phabricator.wikimedia.org/T100189#1615983 (Dicortazar) We're now working on an automated way to deal with identities and affiliations. We alrea... [10:03:45] Analytics-Tech-community-metrics, ECT-September-2015: Provide open changeset snapshot data on Sep 22 and Sep 24 (for Gerrit Cleanup Day) - https://phabricator.wikimedia.org/T110947#1615994 (Dicortazar) @Aklapper, is this related to this panel?: http://korma.wmflabs.org/browser/scr-backlog.html This is a... [12:38:25] Hey halfAFK [12:38:37] nO NEWS ON MY SIDE TODAY [12:39:06] do you have things you'd like to discuss, or do we cancel the meeting ? [12:39:12] halfAFK: --^ [12:51:00] morning [12:51:06] Hi milil [12:51:11] milimetric sorry [12:51:18] hi joal :) [12:51:39] o/ joal & milimetric [12:51:43] hey, I'm going to skip our meeting this morning guys [12:51:49] Nothing here either. Let's all skip :) [12:51:53] ok ! [12:52:37] Something somewhat unrelated that I wanted to show you guys: https://www.mediawiki.org/wiki/User:Halfak_%28WMF%29/mediawiki-utilities [12:54:03] Anything with a [docs] link is ready to use. [12:54:12] Just had a quick glance : seems awesome halfak ! [12:55:11] I'm converting mediawiki utilities from a monolith into a set of small unixy packages. :) [13:05:40] halfak: microservices FTW ! [13:06:48] And composable parts! [13:13:12] :) [13:34:08] (PS2) Bmansurov: Add filters above timeseries graphs in the compare layout [analytics/dashiki] - https://gerrit.wikimedia.org/r/231424 (https://phabricator.wikimedia.org/T104261) (owner: Milimetric) [13:34:59] (PS2) Joal: Add json revisions sorted per page job [analytics/wikihadoop] - https://gerrit.wikimedia.org/r/233937 (https://phabricator.wikimedia.org/T108684) [13:36:00] (CR) Bmansurov: "Hi Dan, I'd appreciate some feedback on my monkey patch." 
[analytics/dashiki] - https://gerrit.wikimedia.org/r/231424 (https://phabricator.wikimedia.org/T104261) (owner: Milimetric) [13:36:13] bmansurov: looking now [13:36:19] thanks [13:58:39] Quarry: Time limit on quarry queries - https://phabricator.wikimedia.org/T111779#1616299 (Jarekt) NEW [14:00:13] Quarry: Time limit on quarry queries - https://phabricator.wikimedia.org/T111779#1616307 (yuvipanda) Those aren't actually running for months, I think - those happen when for some reason there's an unhandled exception in the query results serializer, preventing it from updating the status accordingly... Nee... [14:25:18] (CR) Joal: "Thanks Marcel for your comments :)" [analytics/wikihadoop] - https://gerrit.wikimedia.org/r/233937 (https://phabricator.wikimedia.org/T108684) (owner: Joal) [14:36:58] holaaaa [14:37:41] (CR) Milimetric: Update SQL scripts to reflect Edit schema change (4 comments) [analytics/limn-edit-data] - https://gerrit.wikimedia.org/r/236237 (https://phabricator.wikimedia.org/T111557) (owner: Neil P. Quinn-WMF) [14:39:12] Analytics-Dashiki, Analytics-Kanban, Browser-Support-Firefox: vital-signs doesn't display pageviews graph in Firefox 41, 42 {crow} [3 pts] - https://phabricator.wikimedia.org/T109693#1616415 (Milimetric) @Nemo_bis, I'm just making sure, you have the same issue despite clearing your cache? I tried it w... [14:45:00] joal:yt? [14:45:07] Yup nuria [14:45:17] what's up ? [14:45:18] I was doing the CR for the cassandra stuff [14:45:33] Yes [14:45:44] joal: and i am not sure what this class does: https://gerrit.wikimedia.org/r/#/c/236224/1/oozie/cassandra/refine_webrequest.hql [14:46:24] Oh ! That's a file copy-paste mistake :S [14:46:33] ahhhh [14:46:35] ok [14:46:39] that makes sense [14:46:43] Sorry for having spot it :) [14:47:46] sorry for NOT having spot it myself nuria [14:48:53] joal: np, then, idea is to load cassandra from pageview_hourly [14:49:07] and so far your changes are going into /tmp for testing right? [14:49:13] nuria: from pageview_hourly and projectview_hourly [14:49:19] joal:right [14:49:49] and tmp is just because I need some temp storage between my two steps (data computation and loading job) [14:53:11] nuria: --^ [14:53:28] joal:ok [14:54:49] So the last step of a successful job is to delete that temp folder [14:54:52] nuria: --^ [15:04:40] joal: right, looking into that now [15:04:54] Thanks nuria [15:14:44] joal: and the cassandra load job is to come then? [15:15:03] joal: as a future changeset, correct? [15:15:27] nope, it's actually a dependent change nuria [15:16:17] (CR) Nuria: [C: 1] "Looks good, just one file needs to be removed." 
(1 comment) [analytics/refinery] - https://gerrit.wikimedia.org/r/236224 (https://phabricator.wikimedia.org/T108174) (owner: Joal) [15:19:53] (PS3) Bmansurov: Add filters above timeseries graphs in the compare layout [analytics/dashiki] - https://gerrit.wikimedia.org/r/231424 (https://phabricator.wikimedia.org/T104261) (owner: Milimetric) [15:20:51] Analytics-Kanban: Delete obsolete schemas {tick} - https://phabricator.wikimedia.org/T108857#1616623 (mforns) a:mforns [15:21:12] (PS4) Bmansurov: Add filters above timeseries graphs in the compare layout [analytics/dashiki] - https://gerrit.wikimedia.org/r/231424 (https://phabricator.wikimedia.org/T104261) (owner: Milimetric) [15:21:33] Analytics-Kanban: Delete obsolete schemas {tick} - https://phabricator.wikimedia.org/T108857#1532313 (mforns) Here's the list of the schemas that should be permanently deleted: ``` Campaigns_5485644 Campaigns_5487321 Analytics_5751947 NewEditorMilestone_6538838 PerformanceMetric_6757313 AccountCreation_493334... [15:25:37] (CR) Milimetric: "This looks like it works, I'll try it after some meetings today. In the meantime, you should add a simple test for pickColumns in https:/" [analytics/dashiki] - https://gerrit.wikimedia.org/r/231424 (https://phabricator.wikimedia.org/T104261) (owner: Milimetric) [15:30:13] (PS1) Nuria: [WIP] Make pageview definition aware of preview parameter [analytics/refinery/source] - https://gerrit.wikimedia.org/r/236800 [15:31:10] (CR) Bmansurov: "OK" [analytics/dashiki] - https://gerrit.wikimedia.org/r/231424 (https://phabricator.wikimedia.org/T104261) (owner: Milimetric) [15:43:42] Analytics, Labs, Labs-Infrastructure, Labs-Sprint-108, Patch-For-Review: Set up cron job on labstore to rsync data from stat* boxes into labs. - https://phabricator.wikimedia.org/T107576#1616747 (ellery) Thanks Otto! [15:50:25] halfak: same as last week --> meeting with altiscale ? [16:02:53] (CR) Milimetric: [C: -1] Add filters above timeseries graphs in the compare layout (1 comment) [analytics/dashiki] - https://gerrit.wikimedia.org/r/231424 (https://phabricator.wikimedia.org/T104261) (owner: Milimetric) [16:10:53] halfak: do we spend the rest of the meeting discussing the collaboration with Nathan ? [16:11:56] Nathan ...? [16:12:00] joal, ^ [16:12:13] hm? I guess I didn't get the name correctly :S [16:12:14] Oh! Nitin! [16:12:16] :) [16:12:18] Sure! [16:12:19] Sorry [16:12:21] * halfak jumps back in [16:12:25] Or batcave? [16:12:44] joining the previous one [16:14:49] ottomata: you lemme know when you need me to read your writeup [16:14:57] ottomata: will be testing udfs until then [16:16:00] k am in ops meeting, still tweaking some [16:22:37] milimetric: i have emailed lydia for hive access but got no response, let me know if she pings you [16:22:50] will do nuria [16:29:34] Hey milimetric [16:29:40] hi joal [16:29:42] Got out of meeting with Aaron [16:29:48] ooh, i got one with kevin now :/ [16:29:52] after :) [16:29:59] np ! [16:30:16] * joal gets back to bots [16:30:27] (PS5) Bmansurov: Add filters above timeseries graphs in the compare layout [analytics/dashiki] - https://gerrit.wikimedia.org/r/231424 (https://phabricator.wikimedia.org/T104261) (owner: Milimetric) [16:32:05] (CR) Bmansurov: "I've added some tests too, but getting a weird error related to karma: "Can not load "requireglobal", it is not registered! 
Perhaps you ar" (1 comment) [analytics/dashiki] - https://gerrit.wikimedia.org/r/231424 (https://phabricator.wikimedia.org/T104261) (owner: Milimetric) [16:32:42] Hey mforns [16:33:07] Got halfak validation: do you mind merging the XML->JSON patch in wikihadoop ? [16:33:17] +1 [16:35:45] thx halfak [16:50:02] ottomata: looking ... [17:12:01] joal, I need more time with these comments man. I have another meeting soon too, so let's talk tomorrow [17:16:07] madhuvishy: HEWO [17:16:16] ottomata: Hiii [17:16:26] shall we? [17:16:35] milimetric: ok no problem [17:16:35] yes, omw to batcave [17:21:38] I guess I haven't been on stat1003 in a while, the research password changed [17:22:01] marktraceur, if you are in the right group, you can read a special file [17:22:02] How do I get the secrets [17:22:04] what are your groups? [17:22:08] Errr [17:22:27] mforns: around? [17:22:37] madhuvishy, in meeting [17:22:38] statistics-users researchers [17:22:43] mforns: we are seeing a weird exception because of dupliate key error with eventlogging [17:22:48] that is causing the process to die [17:22:49] marktraceur, check out /etc/mysql/conf.d/research-client.cnf [17:22:51] Ta [17:22:54] And add a symlink [17:23:00] So when we change it again, you don't notice :D [17:24:03] Indeed. [17:24:11] Oh, now there's some other silly error. Sigh. [17:24:26] Possible the mysql server needs to be restarted? [17:24:47] Looks mostly OK to me, but I can't connect [17:25:30] milimetric: , yt? [17:26:20] hey ottomata [17:26:31] can you come to batcave for a sec [17:26:33] madhu and I have a q [17:26:40] about el mysql consumer [17:26:46] brt. [17:30:27] halfak: Isn't the sql server on a different host? Looks like the file you mentioned points at localhost. [17:33:06] marktraceur, the file doesn't reference a host [17:33:11] Right [17:33:18] OK, I just added it to the command, whatever [17:33:31] Yeah. That's what I do. I set up an alias. [17:33:50] alias dbstore='mysql --defaults-file=~/.my.research.cnf -h analytics-store.eqiad.wmnet -u research' [17:34:08] ~/.my.research.cnf is my symlink [17:36:18] Right [17:36:51] OK, looks like I've got access and all...now to start hammering stat1003 with some queries... [17:36:56] Well, "hammering" [17:37:02] Really it shouldn't be too bad, I'm dealing with uploads [17:42:09] Analytics-Tech-community-metrics, ECT-September-2015: Automated generation of (Git) repositories for Korma - https://phabricator.wikimedia.org/T110678#1617460 (Aklapper) >>! In T110678#1590569, @Qgil wrote: > because this will allow us to mark repositories as "Inactive"in code review terms, but we will st... [17:45:03] (PS2) Joal: [WIP] Add cassandra load job for pageview API [analytics/refinery] - https://gerrit.wikimedia.org/r/236224 (https://phabricator.wikimedia.org/T108174) [17:46:35] (CR) Joal: [WIP] Add cassandra load job for pageview API (1 comment) [analytics/refinery] - https://gerrit.wikimedia.org/r/236224 (https://phabricator.wikimedia.org/T108174) (owner: Joal) [17:47:42] Analytics-Tech-community-metrics, ECT-September-2015: Provide open changeset snapshot data on Sep 22 and Sep 24 (for Gerrit Cleanup Day) - https://phabricator.wikimedia.org/T110947#1617472 (Aklapper) @Dicortazar: It's basically about those two "total" numbers of [[ http://korma.wmflabs.org/browser/gerrit_... [18:08:11] madhuvishy, hi [18:08:38] mforns: ottomata and i had a question on the EL consumer. milimetric helped out :) [18:08:51] or did you ping about something else? 
:) [18:09:02] madhuvishy, oh sorry, I was in the meeting with jaime crespo [18:09:06] no, just that [18:09:30] mforns: no problem! [18:09:34] ok [18:14:03] actually mforns_brb, milimetric we do have a big [18:14:05] question [18:14:07] batcave again? [18:26:16] mforns_brb: milimetric, take a look [18:26:16] https://gerrit.wikimedia.org/r/#/c/236850/ [18:26:24] madhuvishy: and I are going to merge and deploy this [18:34:54] Quarry: Time limit on quarry queries - https://phabricator.wikimedia.org/T111779#1617653 (Jarekt) Ok So you are saying I should stop waiting for my http://quarry.wmflabs.org/query/5045 query results? ;) If they are all killed after 20 min., as queries from other tools like CatScan2, than no need for " easy... [18:52:21] ottomata: was away, looking [18:52:50] Is there a performance difference between Hive and Beeling when querying? [18:52:58] Any feedback on this Hive query? [18:52:59] SELECT CONCAT(user_agent_map['browser_family'], ' ', user_agent_map['browser_major']), COUNT(*) FROM webrequest WHERE year=2015 AND month=8 AND is_pageview GROUP BY user_agent_map['browser_family'], user_agent_map['browser_major']; [18:53:17] Beeline* [18:55:34] madhuvishy / ottomata: crazy that setup.py didn't have a mysql driver until now :) [18:55:40] the patch looks good to me [18:55:47] +2 post-merge [18:56:11] hey krinkle [18:56:13] Krinkle: I don't think there's a perf. difference, but we haven't tried beeline yet. [18:56:26] Shouldn't be any perf diff [18:56:35] your query looks fine too [18:56:45] But there is a faster way to get your result [18:56:50] I was hoping there is [18:56:56] Krinkle: the only thought is maybe you'd want to sample a bit, 'cause a month of data is a LOT to crunch over [18:57:21] Krinkle, milimetric: is_pageview --> pageview_hourly [18:57:21] is there reason to not sample? [18:57:40] yeah, querying pageview_hourly works too [18:58:00] or using TABLESAMPLE: https://cwiki.apache.org/confluence/display/Hive/LanguageManual+Sampling [18:58:24] So you mean query pageview_hourly instead of webrequest and omit the condition [18:58:31] you could go for: correct Krinkle [18:58:37] sorry :) [18:58:43] Krinkle: exactly [18:59:00] Krinkle: data size is 1 over 30 [18:59:06] should be faster [18:59:19] and that table is not sampled? [18:59:25] nope [19:00:10] And by the way, another thing you would need to do if you wanted to stay with webrequest table is to add a restriction clause on webrequest_source partition [19:00:43] like: "AND webrequet_source IN ('text', 'mobile')" [19:00:50] Krinkle: --^ [19:00:57] Why? [19:01:10] Cause is_pageview only applies for data in those two partitions [19:01:21] No pageview in uploads for instance :) [19:01:24] Right [19:01:29] Ah, so it optimises the query [19:01:33] correct [19:01:35] not needed for _hourly? [19:01:47] nope, pageview_hourly has that job done for you :) [19:02:06] Oh, I was about to forget [19:02:11] So my current workflow is to run this query, dump it into google docs, populate an additional percentage column, copy it into plot.ly and generate a bar graph [19:02:28] You need to SUM(view_count) instead of COUNT(*) [19:02:38] hm [19:02:54] Krinkle: if you dump it into /a/aggregate-datasets on stat1002, it'll end up here: http://datasets.wikimedia.org/aggregate-datasets/ [19:03:01] Cause pageview_hourly pre-aggregates on dimensions [19:03:02] Hm.. intersting. view_count can be > 1? 
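Pulling joal's suggestions so far together, the webrequest query above can be rewritten against pageview_hourly roughly as sketched below: that table is already restricted to pageviews and pre-aggregated per hour, so the is_pageview clause goes away and COUNT(*) becomes SUM(view_count). This is only a sketch of the advice given in this conversation, not a tested query.

    -- Browser breakdown for August 2015 from the pre-aggregated table.
    SELECT
        CONCAT(user_agent_map['browser_family'], ' ', user_agent_map['browser_major']) AS browser,
        SUM(view_count) AS views
    FROM wmf.pageview_hourly
    WHERE year = 2015
      AND month = 8
    GROUP BY user_agent_map['browser_family'], user_agent_map['browser_major'];

If staying on webrequest instead, joal's other point applies: add "AND webrequest_source IN ('text', 'mobile')" so Hive only scans the partitions where is_pageview can be true at all.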
[19:03:06] so you can get data out publicly that way too, and load from plotly [19:03:27] sure can be :) [19:04:00] Krinkle: "views" in pageviews_hourly are the hourly views for the values in the other columns [19:04:03] ah, there are no query, path, or user columns [19:04:18] yeah, and no varnish host [19:04:33] okay, that makes it quite plausible indeed [19:04:53] K. query now takes 10min instead of 45 :) [19:06:05] nice :) [19:11:40] (PS1) OliverKeyes: Add non-API searches to search UDFs [analytics/refinery/source] - https://gerrit.wikimedia.org/r/236860 [19:12:40] ottomata, madhuvishy, I just got back [19:12:47] looking at the patch [19:13:36] (CR) OliverKeyes: "These are teeny tiny petty tweaks and shouldn't impact the mergability of the code." (3 comments) [analytics/refinery/source] - https://gerrit.wikimedia.org/r/236800 (owner: Nuria) [19:19:26] joal: milimetric: Thanks [19:19:27] https://docs.google.com/spreadsheets/d/1n9FhSqcBGM9iKXrlHsP0EZI0gU89Rmz5m51uglUGVjs/edit#gid=0 [19:19:42] cool [19:20:00] Nice :) [19:21:31] It'd be cool to maybe get some telemetry into "Other" and see what major User-Agent may be missing support in ua-parser [19:21:37] Krinkle: This pageview_hourly table is really made for the kind of analysis you are doing, so please go for it ! :) [19:21:59] over 10% of global traffic is "Other" [19:22:14] yeah Krinkle [19:22:33] A first thing we have to do is to update our version of ua-parser regexp [19:22:47] Ours are a bit outdated (and UAs change fast) [19:23:11] It's on the go: https://phabricator.wikimedia.org/T106134 [19:23:18] Krinkle: --^ [19:24:53] Hn.. bingbot and googlebot are also part of it by default I see [19:24:58] I guess they qualify as page views [19:25:35] Excluding may not be trivial since Google is also starting to pre-render stuff in various experiments so it really could be a page view [19:25:47] They qualify as pageviews, but you can remove (most of) them easily with 'AND agent_type = 'user'' [19:25:51] Krinkle: --^ [19:26:06] joal: What are the distinct values? [19:26:19] for the moment we have 'user' and 'spider' [19:26:28] It's not part of ua-parser? [19:26:39] We are in the process of adding bot (for specific wiki bots we know about [19:28:09] agent_type = 'spider' === (user_agent_map['device_family'] = 'Spider' OR user_agent RLIKE '(goo wikipedia|MediaWikiCrawler-Google|wikiwix-bot)' [19:28:17] Thanks Ironholds for CR, let me get the final changes in and you can CR-gain, ccjoal [19:28:19] Forgot q parenthesis [19:28:23] Thanks Ironholds for CR, let me get the final changes in and you can CR-gain, cc joal [19:28:35] nuria, np! Everything else LGTM [19:28:36] thanks nuria for letting me know [19:28:53] Krinkle: --^ [19:29:15] OK. [19:29:19] And we are also currently working on that (bots): https://phabricator.wikimedia.org/T108598 [19:29:37] You couldn't be more on the spot Krinkle :) [19:30:18] The bot noise isn't too bad for the numbers, but it's deceptive when looking at percentages [19:30:36] Right [19:31:05] Hm.. agent_type does slow down the query a fair bit but I'll wait patiently :) [19:31:40] Really Krinkle ? 
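Related sketch: since agent_type currently takes only the values 'user' and 'spider', a quick way to see how much of the month's traffic the spider filter would remove (before deciding whether the extra cost is worth it) is to group on it directly, as below. The spider definition joal quotes above, with the closing parenthesis he notes he dropped, is: user_agent_map['device_family'] = 'Spider' OR user_agent RLIKE '(goo wikipedia|MediaWikiCrawler-Google|wikiwix-bot)'.

    -- Share of August 2015 pageviews per agent_type ('user' vs 'spider').
    SELECT
        agent_type,
        SUM(view_count) AS views
    FROM wmf.pageview_hourly
    WHERE year = 2015
      AND month = 8
    GROUP BY agent_type;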
[19:32:28] A good way to know if your query is really slowed down or if you hit resource bottleneck is to watch, at the end of the query, the real CPU time [19:32:54] Since hadoop is shared resource and paralellize load, it's a better estimation of query efficiency than speed [19:32:58] Krinkle: --^ [19:33:58] And when I say real CPU time, I mean the line that starts with: Total MapReduce CPU Time Spent [19:35:48] Analytics-EventLogging: Package the Avro PHP library for easier Composer usage - https://phabricator.wikimedia.org/T111851#1617951 (bd808) NEW a:Ottomata [19:36:32] joal: it's about the same [19:36:38] Analytics-EventLogging: Package the Avro PHP library for easier Composer usage - https://phabricator.wikimedia.org/T111851#1617966 (bd808) a:Ottomata>None Should we put this in Gerrit or just have it be GitHub only? I'd lean towards Gerrit. [19:36:40] So Chrome 44 changed from 11 to 19% [19:36:45] when filtering out non-user [19:39:56] Hm.. there's a long train of non-sensical values [19:39:59] like Firefox 8013 [19:40:14] Safari 38 [19:41:04] Maybe I should filter out entries with < X number of hits [19:41:12] Analytics-Kanban: Change the agent_type UDF to have three possible outputs: spider, bot, user {hawk} [13 pts] - https://phabricator.wikimedia.org/T108598#1617983 (JAllemandou) Quick analysis over a recent hour using three regexp, '(?i).*bot.*', '(?i).*crawler.*', and the one define by Bob (link in the task de... [19:41:41] Meh, then the total would be off. I'll leave it as is and just display only the top 100 [19:41:41] nuria, milimetric, Ironholds : Do you mind having a quick look --^ [19:42:13] I'd like a confirmation about the WikiBot convention, and the addition of the regexps to our bot filtering system. [19:42:25] If possible please :) [19:42:29] joal, the phab link above or is there a gerrit patch? [19:42:33] Krinkle: that sounds right, ua parsing will always be imprecise as it is hard to keep up, but the majority of the reporting should be correct [19:42:56] Ironholds: Nope, analysis only before coding (I managed to do that, do you ealize ? ;) [19:43:04] Yeah, i've contributed a fair bit to ua-parser myself when working on jQuery TestSwarm [19:43:12] joal, so, the phab ticket? [19:43:17] Yes [19:43:20] The patterns are mostly just that, patterns. Not bound to sanity ranges [19:43:35] yeah, ua-parser is (speaking as one of the technical architects) fucking terrible [19:43:36] so people making up user agents that match the pattern of a real browser will get reported as such [19:43:41] it remains, however, less terrible than the alternative [19:43:48] I just wish I could say totally rebuild the schema ;p [19:43:48] Yeah, it's one of the best. [19:44:09] the one big change I'd make is adding more metadata to the YAML file to indicate conflicts. [19:44:14] Ironholds: Hm.. what kind of ideals do you have in mind in this regardd? [19:44:23] Krinkle I have used uadetector, works reasonably well as well [19:44:37] at the moment if a known UA would match multiple patterns, we solve that by...putting the "right" one further up in the file, so it's hit first [19:44:47] Ironholds: Right :D [19:44:58] Well Ironholds, that works ! 
[19:45:00] Opera > Chrome > Safari [19:45:01] problem; that "right" regex right for that one known UA may be ~~totally stupid to run against everything and yet we have to~~ [19:45:07] Krinkle has it [19:45:15] another common problem is in relation to particular iOS versions [19:45:19] joal, it works but it's inefficient [19:45:26] yup :) [19:45:26] http://webaim.org/blog/user-agent-string-history/ [19:45:33] every UA you submit, or a vast chunk of them, must always be checked to see if they're Opera. [19:45:55] (from 2007) [19:45:59] joal: do you want comments on tt? [19:46:04] Zakas did a nice rendition of it in 2010 https://www.nczonline.net/blog/2010/01/12/history-of-the-user-agent-string/ [19:46:05] so one thing I wanted to do was refactor the YAML schema to specify names for each pattern and some sort of run_if_match parameter [19:46:26] basically if you DID match Safari, it'd know to check if you were Chrome and/or Opera before confirming a match [19:46:41] which means we could order regexes in order of commonality rather than primacy [19:46:47] yes pliz nuria, two things: Do I include the three regexp given what we have seen, and second, are you ok with WikiBot convention [19:46:51] ambiguous UAs would take longer to match but "most" UAs would take much less time [19:46:58] Ironholds: Hm.. bottom up essentially? [19:47:22] more binary-tree like rather [19:47:25] indeedy [19:47:42] but I never got around to it because I didn't particularly want to handle the work of /also/ refactoring all the /implementations/ :/ [19:47:51] running the C++ and R versions are enough work for me [19:48:04] joal, it looks good. I can think of a couple of not-used heuristics we could incorporate, actually. [19:48:17] Please comment Ironholds :) [19:48:19] I've actually been working on a very similar problem for search [19:48:21] cool! [19:48:38] Ironholds: If you find an exploit in the regex implementations you can just ship C code in them and not have to update the other implementations. [19:48:44] :P [19:49:51] ahahah [19:49:53] Analytics, Engineering-Community, MediaWiki-API, Research consulting, and 3 others: Metrics about the use of the Wikimedia web APIs - https://phabricator.wikimedia.org/T102079#1618011 (SVentura) @spage, thanks for adding @Ironholds. APIs use: We have a really big need to find out the WHO, HOW, W... [19:50:12] I think I'm just going to hyper-optimise the C++ implementation (which is /terrible/) and call that victory. [19:50:12] Ironholds: About WikiBot convention, if you have an opinion, please share as well :) [19:50:17] joal, yep [19:50:26] thanks man [19:50:35] Guys, It'll be my day ! [19:50:42] Have a good end of day a-team ! [19:50:52] Analytics-Kanban: Change the agent_type UDF to have three possible outputs: spider, bot, user {hawk} [13 pts] - https://phabricator.wikimedia.org/T108598#1618014 (Nuria) >if user agent contains (eg. WikiBot) then mark agent_type = bot Sounds good. >Overall, if we were to use the three regexp in addition to... [19:51:01] milimetric: let me know tomorrow if you want us to read some puppet [19:51:28] Analytics-Kanban: Change the agent_type UDF to have three possible outputs: spider, bot, user {hawk} [13 pts] - https://phabricator.wikimedia.org/T108598#1618018 (Ironholds) Agreed and agreed. A couple more heuristics I've found useful: 1. Looking for an email address and/or URL. This would need to be tested... [19:52:25] ottomata, madhuvishy, I added a comment in the mysql consumer patch, just an idea for error recovery improvement :]. 
The code looks cool, you found a real issue: the mysql writer handles all other queues before exiting but not the one that had the error. [19:53:32] ottomata: I'm back - batcave? [19:53:46] ottomata1: ^ [19:53:52] oo mforns nice idea [19:53:56] Analytics, Engineering-Community, MediaWiki-API, Research consulting, and 3 others: Metrics about the use of the Wikimedia web APIs - https://phabricator.wikimedia.org/T102079#1618022 (Ironholds) >>! In T102079#1618011, @SVentura wrote: > @spage, thanks for adding @Ironholds. > APIs use: We have... [19:54:06] madhuvishy: lets do IRC for a bit [19:54:16] ottomata: okay [19:54:16] so, it deployed in labs [19:54:19] have also done in prod now [19:54:24] ah cool [19:54:26] doing in labs first was good, found a couple of tweaks to ake [19:54:27] make [19:54:31] aha [19:54:44] so, forwarderon eventlog1001 is now producing to kafka as well as 0mq [19:54:47] so you added the kafka forwarder to prod [19:54:52] cool [19:55:06] i think we should do the eventlogging server side processor next [19:55:15] yeah alright [19:55:16] make it consume from kafka instead of zmq [19:55:23] we can either just do that [19:55:29] or we can make it do that, and produce to kafka too [19:55:35] in addition to producing to zm1 [19:55:37] 0mq [19:55:43] thoughts? [19:55:45] just consume first? [19:55:52] i don't think it will hurt to output to both. [19:56:01] hmmm [19:56:26] if we test switching off one in labs [19:56:43] and it's fine, may be we can just follow through in prod? [19:57:04] given that the events are going into kafka now, we won't lose data right [19:57:36] hm,yes, but we can't use multiplexer with anyting but 0mq [19:57:46] Ironholds: Can I make this spreadsheet public? I know it's only ua-parser aggregated data, but since those values can be abused it does expose a few unique entries. [19:57:48] so we can't turn off server side valid 0mq until we also do client side [19:57:54] I dont' imagine so but just want to sanity check [19:58:01] ottomata: aah okay, lets keep both then [19:58:07] k [19:58:16] Krinkle, on the phab ticket? [19:58:16] ja i think we should do this for server side [19:58:17] then client side [19:58:21] with both 0mq and kafka on [19:58:27] it makes me a bit uncomfortable but it's more the lawyers or security peeps to ask than I [19:58:27] ottomata: ya alright [19:58:28] Ironholds: g docs https://docs.google.com/spreadsheets/d/1n9FhSqcBGM9iKXrlHsP0EZI0gU89Rmz5m51uglUGVjs/edit#gid=0 [19:58:30] then we can change mysql consumer, and maybe hafnium stuff [19:58:38] ottomata: yup cool [19:58:42] oh, and we can talk to Krinkle about changing their stuff on hafnium too [19:58:43] :) [19:58:49] * Krinkle hides [19:58:51] heheh [19:58:55] Krinkle, previously we've done that with a minimum cutoff for values [19:58:57] Krinkle: we might do it for you... :) [19:58:58] and it's been fine [19:59:13] what timeframe is this over? [19:59:13] ok, madhuvishy preparing patch [19:59:17] (1 day, 1 week, 1...?) [19:59:17] Ironholds: August [19:59:20] oh, wow [19:59:25] awesome! 
[19:59:39] I'd say if you roll everything with sub-10k entries up into "Other" just to be on the safe side you'll probably be fine [19:59:49] Okay [19:59:51] but you may want to poke Chris or whoever Michelle's delegate is [19:59:53] just to be sure [20:01:32] ottomata: okay, let me know when you push [20:04:28] madhuvishy: https://gerrit.wikimedia.org/r/236921 [20:05:46] Analytics-Tech-community-metrics, Research consulting, Research-and-Data: Data for audit report - https://phabricator.wikimedia.org/T110067#1568254 (DarTar) @ezachte, I understand this is completed on our end, moving it to Done. [20:07:19] Krinkle: when we reported browsers for teams before we "eat" the longtail [20:07:30] ottomata: do we still need the processor:kafka role [20:07:44] yes, it is still pushing client side events to kafka [20:07:58] aah right [20:08:04] on analytics1010 [20:08:09] right i forgot [20:08:14] i mean, until we move it to eventlog1010, it isn't produciton [20:08:15] but ja [20:08:20] 1001* [20:08:48] madhuvishy: ok to merge? [20:08:51] ottomata: cool. i saw you named something as server-side-0, was wondering why [20:09:03] ah, yes, just anticipating running more processorws [20:09:09] for now we just run 1 [20:09:34] ah okay cool, ya lgtm [20:11:09] running puppet in labs [20:13:07] Ironholds: When did we last update ua-parser? does our version support Microsoft Edge yet? [20:13:16] looks good in labs madhuvishy, running puppet in prod [20:13:27] ottomata: alright [20:14:05] Krinkle, I don't actually know, that's an AnEng Q [20:14:16] the new version does because I had to patch it ([expletive] microsoft) [20:14:20] Yeah [20:14:44] I know because it forced me to refactor jQuery TestSwarm to finally switch off tobie/ua-parser [20:14:53] in order to get Edge detection [20:20:32] ok madhuvishy its running in prod [20:20:37] yay [20:20:54] in parallel? :D [20:21:25] nono, just one server side processor [20:21:38] so, now we can turn off the zmq server side forwarder [20:21:46] cause nothing is using that now [20:21:48] oo [20:21:50] is that true? [20:21:51] hm [20:21:53] no, maybe hafnium [20:22:10] hafnium - does that do the reporting? [20:22:36] Ironholds: Hm.. I just found http://datavis.wmflabs.org/agents/ [20:22:56] ah, madhuvishy, no reporter is on evnetlog1001 [20:23:04] it reports on all tcp:/// streams [20:23:07] soooo, yes that is using that [20:23:12] we will have to chnage alerts before we turn that of [20:23:14] off [20:23:14] Krinkle, yep, that's a static run from ~6 months ago [20:23:19] ottomata: right [20:23:26] maybe we should do that now... [20:23:29] let's see [20:23:50] Ironholds: Ah, okay [20:24:09] ottomata: https://phabricator.wikimedia.org/T106254? [20:24:13] ja [20:24:14] there :) [20:24:22] i was wrong about where reporter was running [20:24:39] I'm hoping we can eventually provide some kind of visualisation with this data once a month. And the ability to group by several dimensions (e.g. family, family/major, device/os etc.) without needing to manually combine certain rows [20:28:32] Analytics-EventLogging: Package the Avro PHP library for easier Composer usage - https://phabricator.wikimedia.org/T111851#1618171 (EBernhardson) Might as well throw it in gerrit [20:44:12] nuria: I updated https://wikitech.wikimedia.org/wiki/Analytics/Cluster/Spark [20:44:35] ottomata: working on the alerts? 
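As an illustration of the three-way agent_type split analyzed in T108598 above (user / spider / bot), a rough HiveQL sketch over one hour of webrequest follows. The WikiBot naming convention and the extra '(?i).*bot.*' / '(?i).*crawler.*' regexps are still being discussed in the task, and the real classification lives in a refinery UDF, so treat this as an illustration of the idea rather than the actual implementation; the hour chosen is arbitrary.

    -- Count requests per proposed agent_type value for a single hour.
    SELECT agent_type_proposed, COUNT(*) AS requests
    FROM (
        SELECT
            CASE
                -- proposed convention: self-identified wiki bots
                WHEN user_agent RLIKE 'WikiBot' THEN 'bot'
                -- current spider heuristics plus the two regexps under evaluation
                WHEN user_agent_map['device_family'] = 'Spider'
                  OR user_agent RLIKE '(?i).*(bot|crawler).*'
                  OR user_agent RLIKE '(goo wikipedia|MediaWikiCrawler-Google|wikiwix-bot)' THEN 'spider'
                ELSE 'user'
            END AS agent_type_proposed
        FROM wmf.webrequest
        WHERE year = 2015 AND month = 9 AND day = 8 AND hour = 14
          AND webrequest_source IN ('text', 'mobile')
    ) t
    GROUP BY agent_type_proposed;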
[20:44:41] yes [20:44:59] have to templatize the weird kafka broker jmxtrans names in puppet [20:45:07] $graphite_kafka_brokers_wildcard = inline_template('{<%= @kafka_brokers_array.join("_#{@kafka_jmx_port},").tr(".","_") + "_#{@kafka_jmx_port}" %>}') [20:45:08] :p [20:45:37] lol i wont even pretent to understand that [20:45:44] mforns: saw your comment on the patch, super cool idea :D [20:45:50] :] [20:46:31] mforns: it's the closest i'll come to using merge sort in real life [20:46:58] (the underlying logic) [20:47:19] hehehe [20:48:18] madhuvishy, the only concern is: does mysql duplicate key errors penalize performance? [20:48:33] like, is 1 error worth 10, 20, 200 inserts? [20:48:46] I don't know that [20:48:50] mforns: yeah that'd be what we've to talk about [20:49:27] since the events are all in kafka, we won't lose data if there's delay in the mysql insertion [20:49:46] aha, ok then [20:49:50] so may be we can afford it to be less performant? [20:50:11] mmm [20:50:26] so, maybe then, any of both solutions is ok [20:50:44] before, if there was an error in a batch, it looks like the whole batch would fail and when the consumer started up, it would consume from the end of the zmq stream? [20:50:52] that means it'd just lose data right? [20:50:58] madhuvishy, yes [20:51:25] now that we have it all in kafka, we can attempt to reinsert [20:51:42] but, I think mysql performance is important anyway, because the faster mysql is, the more throughput we'll be able to handle [20:51:59] Analytics-EventLogging: Package the Avro PHP library for easier Composer usage - https://phabricator.wikimedia.org/T111851#1618280 (bd808) a:bd808 Gerrit repo created https://gerrit.wikimedia.org/r/#/admin/projects/avro-php and populated from https://github.com/bd808/avro-php [20:52:00] even if we do not loose data because kafka [20:52:17] but it'd be a big change, which is why we din't do it right away [20:52:19] hmmm [20:52:38] yeah, mforns, the mysql perf is important, but a short hit in performance is ok [20:52:56] it also depends on the frequency those duplicate key errors happen! [20:53:03] mforns: yes true [20:53:04] because it just means the consumer will delay for a bit while its splitting those batches and inserting [20:53:05] but yes [20:53:08] if key errors happen a lot [20:53:10] it'll slow things down [20:53:56] madhuvishy, do you know what is the condition that created the duplicate key errors? [20:54:02] *creates [20:54:20] mforns: yeah, if you insert an event with the same primary key(id?) again [20:54:27] mforns: we just saw it when we were testing reconsuming events [20:54:31] but, in practice [20:54:38] it could happen whenever a consumer is restarted [20:54:39] I mean in kafka [20:54:48] anything offsets that haven't been comitted to kafka yet will be reconsumed [20:54:50] any* [20:55:01] mmmmm [20:55:33] so it is possible that in a given batch there are lots of repeated events! [20:55:35] Dear milimetric, help [20:55:36] mforns: i think we have it set to commit every 10 seconds right now [20:55:41] milimetric: Looking for Dashiki access. [20:55:41] yes, especially on startup [20:55:43] ottomata: we could commit offsets more often if the key errors happen a lot and see if it improves [20:55:44] this changes things [20:56:16] mforns: hhe, yes you have gone from maybe once delivery to at least once delivery :p [20:56:46] ottomata, I'm lost, what happens every 10 sec? 
[20:57:13] mforns: every 10 seconds, pykafka will tell kafka the high water mark offset of the last message it has consumed [20:57:21] the 10 seconds are configurable. [20:57:33] but we shouln't commit every message [20:58:12] ottomata, so every 10 secs there's the posibility of having duplicate key insertion? [20:58:17] nono [20:58:20] mforns: noo [20:58:20] i mean, there is always the possibility [20:58:23] ok ok [20:58:28] but, it is more likely when eventlogging is restarted [20:58:37] say the consumer is killed ungracefully [20:58:37] I see [20:58:50] it consumed a message from kafka, and inserted into mysql [20:58:50] Analytics-EventLogging: Package the Avro PHP library for easier Composer usage - https://phabricator.wikimedia.org/T111851#1618292 (bd808) Published on Packagist: https://packagist.org/packages/wikimedia/avro [20:58:59] but the 10 second offset commit interval wasn't reachced before it died [20:59:12] when it starts back up, it will continue from the most recently committed offset [20:59:22] I think I understand, so every 10 secs kafka knows the until when the consumer read, is that it? [20:59:27] yes [20:59:58] ok, so the number of duplicate events is at maximum what fitted in the last 10 secs of execution, no? [21:00:06] yes [21:00:11] ok, got it [21:00:13] likely no more t han that [21:00:18] we can commit more often too [21:00:22] Analytics, Engineering-Community, MediaWiki-API, Research consulting, and 3 others: Metrics about the use of the Wikimedia web APIs - https://phabricator.wikimedia.org/T102079#1618293 (DarTar) @SVentura if I understand where you are coming from with this request (after talking to Sheree), I suspec... [21:00:22] aha [21:00:23] every second isn't so bad [21:01:59] marktraceur: hey! access? [21:02:26] milimetric: Yeah, I want to add some graphz [21:02:46] madhuvishy: thanks for updating the doc! [21:02:49] I have 9 TSVs in the multimedia/ directory on datasets.wm.o [21:03:05] * milimetric looks [21:03:33] marktraceur: where are the TSVs? [21:03:39] http://datasets.wikimedia.org/public-datasets/all/multimedia/ [21:05:30] marktraceur: aha, ok, so dashiki gives you two pre-built layouts and tools to make more layouts if you want [21:06:08] milimetric: OK. How do I use them? [21:06:10] it looks like you have some metrics measured across wikis [21:06:26] so take a look at this layout and see if it would make sense for your data: https://vital-signs.wmflabs.org/#empty [21:06:28] milimetric: Yes and no, I have per-wiki metrics [21:06:28] sorry [21:06:31] https://vital-signs.wmflabs.org/ [21:07:11] milimetric: Yeah, that sort of thing should be perfect [21:07:33] ok, there's a simpler example I can start you with then: [21:07:40] I have different files for each wiki's data, but I could consolidate maybe [21:07:41] https://language-reportcard.wmflabs.org [21:08:00] I mean, they both seem fine? [21:08:12] Is there documentation for how to actually make a dashboard? 
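Stepping back to the mysql consumer discussion above: the failure being described is the ordinary MySQL duplicate-primary-key error, which shows up when events replayed from Kafka (anything after the last committed offset, up to roughly the 10-second commit interval) are inserted a second time. A minimal, self-contained illustration is below; the table and columns are invented for the example and are not the real per-schema EventLogging tables.

    -- A stand-in for a per-schema EventLogging table, keyed on the event uuid.
    CREATE TABLE events_demo (
        uuid        CHAR(32)    NOT NULL PRIMARY KEY,
        `timestamp` VARCHAR(14) NOT NULL,
        event       TEXT
    );

    -- First consumption of the message inserts fine.
    INSERT INTO events_demo (uuid, `timestamp`, event)
    VALUES ('0123456789abcdef0123456789abcdef', '20150909120000', '{"action": "save"}');

    -- After an ungraceful restart, offsets not yet committed are consumed again,
    -- so the same row comes back and MySQL raises ER_DUP_ENTRY (error 1062),
    -- which previously made the whole batch insert fail.
    INSERT INTO events_demo (uuid, `timestamp`, event)
    VALUES ('0123456789abcdef0123456789abcdef', '20150909120000', '{"action": "save"}');

The options weighed above are to fall back to row-by-row inserts for a failed batch (so only the true duplicates are rejected) and/or to commit offsets more often so fewer events are replayed after a restart.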
[21:08:17] they're the same layout, but the second one is simpler to explain because it just uses one data convention [21:08:33] the raw data for that dashboard is here: http://datasets.wikimedia.org/limn-public-data/metrics/beta-feature-enables/ [21:08:46] so what you need to define is a "metric" [21:08:55] in this case it's "beta-feature-enable" [21:09:01] madhuvishy: https://gerrit.wikimedia.org/r/#/c/236947/ [21:09:09] and in that metric, you can put submetrics, in this case the subfolders there [21:09:15] ottomata, madhuvishy, I think neither of the 2 solutions is good for this case, because there may be lots of repeated events across all schema queues [21:09:15] I have three: "upload", "uploader", and "first-upload" [21:09:21] and in those subfolders, there is one file per wiki [21:09:40] OK, shouldn't be too hard to sort that out [21:09:47] ok, then you need configs [21:09:52] one sec, I'll point you to those [21:09:57] mforns: oh right, and 'events' are grouped by schema when they come into those functions? [21:10:23] marktraceur: this is the config of that dashboard: https://meta.wikimedia.org/wiki/Config:LanguageReportcard [21:10:24] madhuvishy, ottomata, so I think the easiest is reducing the offset interval and just dropping those queues that contain repeated events [21:10:40] ottomata, yes, they would be grouped by schema, [21:11:05] mforns: or just iterating calling insert_single, no? [21:11:06] marktraceur: this is where you'd need to register your metric metadata, notice the definition of "Content Translation" for example: https://meta.wikimedia.org/wiki/Dashiki:CategorizedMetrics [21:11:14] milimetric: OK, so I create Config:MultimediaHealth [21:11:17] _insert_sequential [21:11:18] i mean [21:11:37] Analytics, Engineering-Community, MediaWiki-API, Research consulting, and 3 others: Metrics about the use of the Wikimedia web APIs - https://phabricator.wikimedia.org/T102079#1618346 (SVentura) @DarTar, that is interesting....trying to wrap my head around this and making sense of this info to inf... [21:11:59] ottomata, yes, that is an option, but you may have up to 400 duplicate key errors in a row, times lots of schemas... [21:12:06] Analytics-EventLogging, Librarization, Patch-For-Review: Package the Avro PHP library for easier Composer usage - https://phabricator.wikimedia.org/T111851#1618349 (bd808) [21:12:08] I don't know, maybe this is OK [21:12:08] marktraceur: exactly, and then add the metric information to that CategorizedMetrics page, it will be essentially the same as the last revisions from Marcel [21:12:44] ottomata: at 1-1 with Kevin [21:12:57] once all that's up, you can check out dashiki and run this locally: "gulp --layout metrics-by-project --config MultimediaHealth" [21:13:13] madhuvishy: no worries, i think i won't merge today [21:13:21] gonna pause this here and continue tomorrow [21:13:34] ottomata: cool [21:13:36] milimetric: And is there documentation that says all of this stuff, somewhere? [21:13:39] marktraceur: and then if you run "python -m SimpleHTTPServer 5001" you can browse to localhost:5001/dist/metrics-by-project-MultimediaHealth [21:13:46] marktraceur: https://github.com/wikimedia/analytics-dashiki/ [21:13:50] Perfect. [21:13:50] ottomata, probably you're right! 
if the offset interval is small enough, I think mysql can handle this individual inserts for a while :] [21:13:57] aye :) [21:13:58] marktraceur: it's just rough right now [21:14:12] not detailed, we haven't had anyone use it yet, we've just set up their dashboards for them [21:14:13] btw mforns, with madhu's balanced consumer stuff, its possible to parallelize the mysql consumer too [21:14:17] as long as mysql could handle it :) [21:14:18] which I'm happy to do, marktraceur, if you get into trouble [21:14:31] milimetric: Noted, but I should be okay with the examples [21:14:47] Assuming I can fix my generator script...vim seems to be dying on stat1003 [21:14:52] the only thing I'd ask you to do is to set up the data in the structure it expects, with the per-wiki breakdowns [21:14:55] Analytics-EventLogging, Analytics-Kanban, Patch-For-Review, WMF-deploy-2015-09-08_(1.26wmf22): eventlogging tests in mainline error - https://phabricator.wikimedia.org/T111438#1618360 (kevinator) [21:15:01] ottomata, aha cool! [21:15:18] marktraceur: how are you generating data? [21:16:01] A few different sql scripts run for each wiki [21:16:14] mforns: also, you'll be happy to know, the kafka based metrics look a lot smoother than the eventlogging reporter ones. [21:16:26] Right now just nine queries, not too big either really [21:16:39] ottomata, \o/ [21:17:06] marktraceur: you might like the way that dashboard generates data, it explodes per-wiki and everything: https://github.com/wikimedia/analytics-limn-language-data/blob/master/language/config.yaml [21:17:15] that runs the scripts in that same folder [21:17:37] mforns: http://grafana.wikimedia.org/#/dashboard/db/eventlogging?panelId=8&fullscreen&from=now-24h&to=now-5m&var-kafka_brokers=All [21:18:04] marktraceur: the nice thing about those scripts is it timeboxes and only generates the days that haven't been generated since the last run [21:18:27] and explodes by wiki, or other parameters you put in the templates [21:19:18] Analytics-EventLogging, Librarization, Patch-For-Review: Package the Avro PHP library for easier Composer usage - https://phabricator.wikimedia.org/T111851#1618364 (bd808) [21:19:50] milimetric: Question, which of the files in the Content Translation directories matters - the .tsv or the extensionless TSV? [21:20:00] Looks like one has a lot more data [21:20:32] * milimetric looks [21:20:45] (some of those files may be leftover from the old rsyncs) [21:21:45] marktraceur: it's the <>.tsv files that are being picked up by the dashboard [21:21:59] the others are old remnants due to rsync not having the --delete flag [21:22:14] OK.\ [21:22:25] so the structure is limn-public-data/metrics/<>/<>/<>.tsv [21:22:40] Right. [21:22:45] where metric and submetric are fetched from the CategorizedMetrics config I linked [21:23:02] and wiki is any available wiki database name [21:23:15] (enwiki, enwiktionary, commonswiki, etc.) [21:24:07] ottomata, do you have such a chart with the sums instead of the rates? [21:24:07] marktraceur: here's another example from a different dashboard: http://datasets.wikimedia.org/limn-public-data/metrics/sessions/wikitext/ [21:24:34] mforns: instead of the diffs? [21:24:44] http://grafana.wikimedia.org/#/dashboard/db/eventlogging [21:25:11] ottomata, ... wait a second, processing :] [21:25:54] ottomata: do you know anyone setting X-Analytics headers on the request? [21:26:10] madhuvishy: i see something weird with mysql insertAttempted rate since i made the server-side raw processor consume from kafka, hm. 
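On the per-wiki generation marktraceur describes ("a few different sql scripts run for each wiki"): the config.yaml approach milimetric links runs templated SQL files, timeboxed so each run only fills in the days not yet generated, and exploded per wiki. A rough sketch of what one such query could look like is below; the {wiki_db}, {from_timestamp} and {to_timestamp} placeholders are purely illustrative (check the limn-language-data repo for the real template variables), and the exact definition of an "upload" here is marktraceur's call, not taken from his scripts.

    -- Hypothetical daily 'uploads' report for one wiki and one timebox.
    SELECT
        LEFT(log_timestamp, 8) AS `day`,
        COUNT(*)               AS uploads
    FROM {wiki_db}.logging
    WHERE log_type = 'upload'
      AND log_timestamp >= '{from_timestamp}'
      AND log_timestamp <  '{to_timestamp}'
    GROUP BY LEFT(log_timestamp, 8);

The output of each run then lands as the per-wiki TSVs under limn-public-data/metrics/<metric>/<submetric>/<wiki>.tsv that the dashboard reads.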
[21:26:17] ottomata, eventlogging.overall.raw.sum - eventlogging.overall.valid.sum [21:26:19] milimetric: not off the top of my head [21:26:23] oh [21:26:23] thx [21:26:31] mforns: you can edit the graphs ( just don't save) [21:26:38] ok [21:27:08] mforns: i think the kafka one is [21:27:11] Count [21:27:15] instead of OneMinuteRate [21:27:46] Krinkle: in the office? [21:27:53] ottomata, because both metrics seem different no? [21:27:59] Nope, UK [21:28:26] mforns: alias(movingAverage(absolute(diffSeries(sumSeries(kafka.$kafka_brokers.kafka.server.BrokerTopicMetrics.MessagesInPerSec.{eventlogging-client-side,eventlogging-server-side}.Count),sumSeries(kafka.${kafka_brokers}.kafka.server.BrokerTopicMetrics.MessagesInPerSec.eventlogging_*.Count))),10), "Kafka Raw - Valid Count") [21:28:42] mforns: eh? [21:29:20] ottomata, I see, it is a moving average [21:29:35] yes [21:29:40] which i suppose would smooth it out :) [21:30:04] madhuvishy: i don't see anything wrong with mysql consumer...not sure why there are more insertAttempted now [21:30:38] Analytics, Engineering-Community, MediaWiki-API, Research consulting, and 3 others: Metrics about the use of the Wikimedia web APIs - https://phabricator.wikimedia.org/T102079#1618385 (SVentura) Follow up questions to this group: Where should I look to find the content syndication/reuse numbers... [21:30:59] hm, actually, it is the same [21:31:24] oh hmmm [21:31:25] hm. [21:31:31] because mysql is consuming from zmq [21:31:31] hm [21:32:16] AH [21:32:24] i think because reporter doesn't work for server side raw anymore [21:32:27] with the double outputs [21:32:27] hm. [21:33:11] yeah [21:33:18] ok, i think everything is ok [21:35:48] milimetric: I have a feeling it's supposed to look a bit better than it does right now [21:36:22] marktraceur: should I build with --config MultimediaHealth and take a look? [21:36:36] If you'd like, sure [21:37:04] Also fyi your readme says http://localhost:5000/compare-MultimediaHealth when it needs to be /dist/compare-MultimediaHealth [21:37:18] oops, thx [21:38:08] marktraceur: looks to me like it's working nicely [21:38:12] but the data's just not there: http://datasets.wikimedia.org/limn-public-data/metrics/multimedia-health/uploads/enwiki.tsv [21:38:16] Hrm [21:38:24] Might be waiting for rsync to catch up [21:38:26] yes [21:38:31] nicely done on the config definitions though [21:38:46] Copy and paste is one of my best developed skills [21:38:52] marktraceur: so once the data's there I'd expect this to just work, when that happens you can host that dist directory out of any apache [21:39:00] Hoorah. [21:39:16] I've been hosting these things on limn1, but am currently working on a nice puppetized version of that for dashiki1 [21:39:23] if you want it on limn1, I'm happy to add it [21:39:30] I look forward to finding creative and opsen-cringe-inducing places to host our critical metrics infrastructure [21:39:37] Or that. [21:39:52] marktraceur: and one other thing, if you want basic analytics on the dashboard itself, there's a --piwik flag that we added recently, you can give it a piwik instance [21:40:13] Not sure what that means, tbh. [21:40:24] piwik is like open source google analytics [21:40:29] and we have a basic instance running in labs [21:40:46] The idea being you track how many views you get on the dashboard? [21:40:59] And where from, and at what time, etc. 
[21:41:03] so you can build with "--piwik 'piwik.wmflabs.org,5'" if your piwik site id is 5, and the dashboard will report stats to that piwik instance in labs [21:41:09] yep [21:41:24] I cannot think of why I would need to know how much traffic my dashboard gets [21:41:29] But thanks for letting me know :) [21:41:29] k :) [21:42:29] (PS1) Milimetric: Fix README mistake [analytics/dashiki] - https://gerrit.wikimedia.org/r/236967 [21:42:39] (CR) Milimetric: [C: 2 V: 2] Fix README mistake [analytics/dashiki] - https://gerrit.wikimedia.org/r/236967 (owner: Milimetric) [21:43:09] marktraceur: i think that thing rsyncs every hour, so let me know if you still have troubles after that [21:43:17] On the hour? OK. [21:44:05] Also I'm pretty sure the volume that I get on these queries is the kind that would make you scoff at me for trying to consider performance/load, so I'm probably just going to make them run the full queries every month via cronjob. [21:44:34] I'm not sure it's on the hour, but it might be [21:44:42] Encouraging [21:44:52] :) sorry, i don't understand computers [21:44:59] I'm with you there buddy [21:45:12] understood on the sql perf. Let's revisit if it brings the server down [21:49:34] running home, back sh ortly for a bit [21:51:35] PROBLEM - Check status of defined EventLogging jobs on eventlog1001 is CRITICAL: CRITICAL: Stopped EventLogging jobs: forwarder/server-side-raw-zmq [21:52:00] oops [21:54:17] mforns: ottomata said that alert was ok [21:54:28] oh ok! [21:54:33] he just texted me [21:54:40] ok ok [21:54:42] cool :] [22:17:49] back [22:19:25] Analytics-Kanban, RESTBase, Patch-For-Review: Create a metric for overall RESTBase request rates from Varnish logs {hawk} [13 pts] - https://phabricator.wikimedia.org/T109547#1618468 (kevinator) Open>Resolved [22:19:45] Analytics-Kanban, RESTBase, Patch-For-Review: Create a metric for overall RESTBase request rates from Varnish logs {hawk} [13 pts] - https://phabricator.wikimedia.org/T109547#1551633 (kevinator) [22:19:47] Analytics-Kanban, RESTBase, Patch-For-Review: Productionize the Spark job that sends RESTBase stats to Graphite {hawk} - https://phabricator.wikimedia.org/T110691#1618470 (kevinator) Open>Resolved [22:25:30] Analytics-Kanban: Archive obsolete schema pages {tick} [5 pts] - https://phabricator.wikimedia.org/T110247#1618490 (kevinator) [22:25:39] Analytics-Kanban: Archive obsolete schema pages {tick} [5 pts] - https://phabricator.wikimedia.org/T110247#1618491 (kevinator) Open>Resolved [22:25:40] Analytics-Kanban: Delete obsolete schemas {tick} - https://phabricator.wikimedia.org/T108857#1618492 (kevinator) [22:26:02] Analytics-EventLogging, Analytics-Kanban, Patch-For-Review: Fix auto_offset_reset bug in pykafka and build new .deb package {stag} [8 pts] - https://phabricator.wikimedia.org/T111182#1618493 (kevinator) Open>Resolved [22:26:03] Analytics-EventLogging, Analytics-Kanban, Patch-For-Review: Prep work for Eventlogging on Kafka {stag} - https://phabricator.wikimedia.org/T102831#1618495 (kevinator) [22:26:50] ottomata: need some help? [22:27:21] madhuvishy: sure, i think things are ok, i was about to restart the forwarder in prod [22:27:33] ottomata: ah cool [22:27:37] was just testing on labs. something was weird for a second, but think that the kafka labs instance's disk filled up [22:27:42] and that was what was weird [22:27:47] madhuvishy: Did you want to ask me something? 
[22:27:50] so madhuvishy [22:27:51] Analytics, Engineering-Community, MediaWiki-API, Research consulting, and 3 others: Metrics about the use of the Wikimedia web APIs - https://phabricator.wikimedia.org/T102079#1618500 (Krenair) Should probably add rcstream to the list. I don't think there are any numbers on use of IRC feeds. DBPed... [22:28:00] instead of running forwarder with 2 outputs [22:28:02] i'm going to run 2 instances [22:28:08] and have the zmq one consume from kafka [22:28:17] that way reporter will still work for now [22:28:22] until we are ready to turn all the zmq off [22:28:32] ottomata: okay [22:29:47] Krinkle: yeah, wanted to ask about deploying a cookie on javascript in mediawiki [22:30:22] Analytics, Engineering-Community, MediaWiki-API, Research consulting, and 3 others: Metrics about the use of the Wikimedia web APIs - https://phabricator.wikimedia.org/T102079#1618519 (SVentura) Thanks @Krenair ! [22:30:55] Krinkle: we are setting this WMF-Last-Access cookie in varnish now, as a counting uniques mechanism, and have a bunch of drawbacks with this method - because we can't estimate bot traffic. nuria had this idea that we could instead set the cookie on JS [22:31:38] madhuvishy: what kind of bots [22:32:11] UNGGH madhuvishy deployment-eventlogging02 is precise [22:32:14] can't install python-pykafka [22:32:15] uNngh [22:32:30] ottomata: aah [22:32:52] :/ [22:32:57] gonna upgrade it to trusty [22:33:00] well, make a new one [22:33:14] Krinkle: hmmm, the ones we don't already catch as spider/or filter with UA regexes [22:33:35] madhuvishy: I mean, software-wise, what kind of bots are these. [22:33:38] screen scrapers? [22:33:43] basically, anything that sends multiple requests to our site, but doesn't set a cookie [22:34:03] are you worried about bots that do process cookies or those that don't? [22:34:09] Krinkle: I don't really know what kind of bots they are. [22:34:15] those that don't [22:34:24] Right, as they'd be new visitors each time. [22:34:27] yes [22:34:38] so every request gets counted as a unique [22:34:52] which bloats our numbers way too much [22:34:55] madhuvishy: So, you currently count uniques as any request not having that cookie. [22:35:12] Krinkle: it's a little more complicated than that. [22:35:25] Analytics-EventLogging, Analytics-Kanban: Load test parallel eventlogging-processor {stag} [5 pts] - https://phabricator.wikimedia.org/T104229#1618545 (kevinator) Open>Resolved [22:35:25] Analytics-EventLogging, Analytics-Kanban: {stag} EventLogging on Kafka - https://phabricator.wikimedia.org/T102225#1618546 (kevinator) [22:35:29] beware that at least google does run javascript. [22:35:32] It sets a date, and if you send a request with no cookie, or date before today [22:35:53] In fact they called us out very quickly when we accidentally disabled access to javascript files in robots.txt [22:35:56] you are unique for today. we'd send back a response with the current date [22:36:07] and not count them the rest of the day [22:36:13] (same for month) [22:36:45] Krinkle: yeah, but cases like google would have well defined UAs and we can catch them? [22:37:40] so we are wondering if moving the cookie to be set in JS would make sense [22:37:47] and how to go about it [22:39:02] also, even if they crawled js, would they be executing it and setting cookies? 
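To make the counting rule concrete: per the wiki page linked above, a request counts toward today's uniques if it arrives with no WMF-Last-Access cookie or with a date before today, and the response then sets the cookie to today so the client is not counted again that day (same idea per month). Assuming the cookie value is echoed back via X-Analytics and exposed as x_analytics_map['WMF-Last-Access'] on wmf.webrequest with a value like '09-Sep-2015' (the field and key names here are assumptions, not verified), a rough HiveQL sketch of that rule is:

    -- Rough sketch only, not the production uniques job.
    SELECT
        uri_host,
        SUM(CASE
                WHEN x_analytics_map['WMF-Last-Access'] IS NULL
                  OR unix_timestamp(x_analytics_map['WMF-Last-Access'], 'dd-MMM-yyyy')
                     < unix_timestamp('09-Sep-2015', 'dd-MMM-yyyy')
                THEN 1 ELSE 0
            END) AS daily_unique_estimate
    FROM wmf.webrequest
    WHERE webrequest_source IN ('text', 'mobile')
      AND year = 2015 AND month = 9 AND day = 9
      AND is_pageview
    GROUP BY uri_host;

This also shows the over-count being discussed: a client that never stores cookies (most crawlers) hits the first branch on every request, so it is counted once per request instead of once per day, which is why setting the cookie from JavaScript, which such clients mostly never execute, is attractive.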
[22:39:44] Analytics-Kanban, Research-and-Data, Patch-For-Review: Backfill pageview data correcting space in title bug {hawk} [5 pts] - https://phabricator.wikimedia.org/T110614#1618565 (kevinator) Open>Resolved note: records are versioned 0.0.7 now (https://wikitech.wikimedia.org/wiki/Analytics/Data/Webrequ... [22:39:57] ottomata: hmmm, surprised we din't catch this before [22:40:29] madhuvishy: which? [22:40:43] oh we haven't tried to use the newer python-pykafka stuff, it is installed there [22:40:44] not sure how [22:40:44] that pykafka wont work in this [22:40:50] maybe pipi? [22:40:51] pip? [22:41:32] it would have been installed when I ran setup.py [22:41:38] ah, hm. [22:41:39] bye a-team, see you tomorrow! [22:41:53] hey [22:42:01] ori: hi :) [22:42:02] ahh, madhuvishy i guess it just doesn't have the buffix [22:42:03] madhuvishy: what's up? [22:42:03] that's why [22:42:06] that's not in pip [22:42:09] just our package [22:42:29] madhuvishy: I'm deferring to ori :) [22:43:06] ori, was just asking Krinkle about setting a cookie on javascript, rather that in varnish, to filter for bot requests [22:43:23] Krinkle: thanks [22:43:28] Analytics-Kanban, Patch-For-Review: Beta Cluster Event Logging mysql consumer not working {oryx} [5 pts] - https://phabricator.wikimedia.org/T110462#1618572 (kevinator) Open>Resolved [22:43:30] what's the question? [22:43:35] ori: if you are in the office, I can come explain [22:44:08] question is mostly if setting a cookie in js is a good idea, and how to go about it [22:44:15] https://wikitech.wikimedia.org/wiki/Analytics/Unique_clients/Last_access_solution [22:44:41] actually, https://wikitech.wikimedia.org/wiki/Analytics/Unique_clients/Last_access_solution#How_will_we_be_counting:_Plain_English [22:44:46] is the current way [22:44:55] sure, I'll come by [22:44:58] give me a minute [22:45:09] i'm on 5th, coming down [22:45:13] ah ok [22:49:02] hey nuria, yt? [22:50:31] Analytics-Kanban, Research-and-Data: Legal request, data point 1 [13 pts] - https://phabricator.wikimedia.org/T109626#1618586 (kevinator) Open>Resolved [22:54:52] Analytics-Kanban, Patch-For-Review: Check and potentially timebox limn-language-data reports {frog} [8 pts] - https://phabricator.wikimedia.org/T107504#1618592 (kevinator) Open>Resolved [22:55:49] Analytics-EventLogging, Analytics-Kanban, Patch-For-Review, WMF-deploy-2015-09-08_(1.26wmf22): eventlogging tests in mainline error - https://phabricator.wikimedia.org/T111438#1618596 (kevinator) Open>Resolved [22:59:08] Analytics-Kanban, Patch-For-Review, WMF-deploy-2015-09-08_(1.26wmf22): Use formatversion=2 API to fetch EventLogging schemas - https://phabricator.wikimedia.org/T110450#1618607 (kevinator) Open>Resolved [23:24:56] RECOVERY - Check status of defined EventLogging jobs on eventlog1001 is OK: OK: All defined EventLogging jobs are runnning. [23:28:47] doh madhuvishy, ha, i am all wrong about this. [23:28:50] what I just did is fine [23:28:52] but doesn't fix anything. [23:29:00] reporters monitors just in processors/ i should have read the code closer [23:29:02] ottomata: [23:29:05] aah [23:32:27] madhuvishy: hm. welp. 
[23:32:29] i could fix this by: [23:32:35] making reporter smarter [23:32:40] ottomata: i can hang out in batcave if you want [23:32:41] or running 2 server side processors [23:32:43] ok [23:32:45] real quick [23:33:14] ok i'm there [23:35:45] Analytics-EventLogging, Patch-For-Review: Kafka Client for MediaWiki - https://phabricator.wikimedia.org/T106256#1618715 (bd808) [23:35:46] Analytics-EventLogging, Librarization, Patch-For-Review: Package the Avro PHP library for easier Composer usage - https://phabricator.wikimedia.org/T111851#1618713 (bd808) Open>Resolved [23:59:19] Analytics-Kanban: Set up auto-purging after 90 days {tick} - https://phabricator.wikimedia.org/T108850#1618811 (mforns) a:mforns>jcrespo