[02:42:43] Analytics-Kanban, user-notice: Pageviews API reporting inaccurate data for pages titles containing special characters - https://phabricator.wikimedia.org/T128295#2093342 (Shizhao) The bug have fixed? [03:42:47] Analytics-Wikistats: Total page view numbers on Wikistats do not match new page view definition - https://phabricator.wikimedia.org/T126579#2093366 (Tbayer) @ezachte : Any updates? [05:13:38] (CR) Nuria: "Looks good. Let me know if you have tested it in on cluster." [analytics/refinery] - https://gerrit.wikimedia.org/r/274187 (https://phabricator.wikimedia.org/T126767) (owner: Joal) [08:04:56] Analytics-Kanban, user-notice: Pageviews API reporting inaccurate data for pages titles containing special characters - https://phabricator.wikimedia.org/T128295#2093596 (Slaporte) From what I can see, the data from Feb. 23-29 [looks fixed](https://wikimedia.org/api/rest_v1/metrics/pageviews/top/zh.wikiped... [09:59:28] (CR) Joal: [V: 2] "Self merging documentation." [analytics/refinery] - https://gerrit.wikimedia.org/r/268216 (owner: Joal) [09:59:47] (CR) Joal: [C: 2] "Self merging documentation." [analytics/refinery] - https://gerrit.wikimedia.org/r/268216 (owner: Joal) [10:04:15] (PS1) Joal: Correct webrequest refine HQL. [analytics/refinery] - https://gerrit.wikimedia.org/r/275382 [10:10:01] Analytics-Kanban, user-notice: Pageviews API reporting inaccurate data for pages titles containing special characters - https://phabricator.wikimedia.org/T128295#2093735 (JAllemandou) Hi, Backfilling has finished yesterday night. Data should be correct now :) Sorry for the inconvenience and thanks again fo... [10:27:41] (CR) Joal: [C: 1] "Looks good, minor comments not impacting functionnality." (3 comments) [analytics/refinery] - https://gerrit.wikimedia.org/r/273557 (https://phabricator.wikimedia.org/T108618) (owner: BryanDavis) [10:33:16] hi a-team :] [10:33:42] hellooooooo [10:33:47] how is India mforns?? [10:34:09] hey elukey! 
awesome so far :] [10:35:17] \o/ [10:48:07] Analytics-Kanban, Patch-For-Review: Create reportupdater browser reports that query hive's browser_general table {lama} - https://phabricator.wikimedia.org/T127326#2093829 (mforns) Awesome! [10:50:06] Hi lads :) [10:51:27] elukey: cluster is back to normal, you can backfill pageview data if you want :) [10:52:30] joal: thanks! I'll try to find a bit of time today/tomorrow to finalize my wiki page, and then I'll ask you and milimetric again if I can proceed with 1 hr [10:52:40] on hdfs [10:52:42] awesome :) [11:03:29] hey joal :] [11:03:43] Hi :) [11:11:02] (PS1) Joal: Remove mobile/zero from dataset_dump script [analytics/refinery] - https://gerrit.wikimedia.org/r/275392 [11:15:34] Analytics-Kanban: Parse User-Agent strings with OS like "Windows 7" correctly into the user agent map {hawk} - https://phabricator.wikimedia.org/T127324#2093889 (mforns) @madhuvishy This seems correct, you're right. I agree we should leave it as it is. Nevertheless, my point was: in the case of //Windows 7//... [11:56:48] hey mforns, are you working today [11:57:02] milimetric, hey! yes [11:57:11] I am catching up on email [11:57:20] do you have an idea for me to do afterwards? [11:59:38] you can grab anything you like. I can catch you up on the reportupdater stuff if you want [12:00:48] milimetric, I've seen the latest changes and the reports backfilled and running, thanks for finishing it! [12:00:58] Analytics-Tech-community-metrics, DevRel-March-2016: gerrit_review_queue.html: List of Repositories has not been updated recently - https://phabricator.wikimedia.org/T128170#2094032 (Lcanasdiaz) One of the log files of octopus_gerrit was not correctly setup. Now, this is the output of the gerrit retrieval... [12:01:13] there are a couple of things left. Are you at standup today? [12:05:18] milimetric, yes [12:06:00] you mean the hadrcoded user and the line endings? 
[12:17:32] Analytics-Tech-community-metrics, DevRel-March-2016: gerrit_review_queue.html: List of Repositories has not been updated recently - https://phabricator.wikimedia.org/T128170#2094084 (Lcanasdiaz) This should be fixed with the next execution which is planned for tomorrow morning. [12:37:25] joal, hi! the text partition of the webrequest table at 2016-03-05T07/1H seems to have failed computation, is that expected? (I'm on ops week, can I do someting?) [12:38:50] Hi mforns :) [12:38:58] o/ [12:39:30] The alert email is from 2 days ago: no email today, no error today ;) [12:40:21] mforns: double checking the exact possible problem: have a look in hue in refine-webrequest-text coordinator, and see if the given time has really failed (just did it, it has not) [12:41:18] mforns: Usually when there is only the last lined (up to 2) of the alert email that sais not computed, I don't bother too much: it usually means the cluster is a bit late :) [12:41:26] mforns: makes sense ? [12:42:01] joal, aha understand [12:42:14] makes sense, thanks! [12:43:25] np mforns :) [12:49:59] Analytics-Cluster, Operations, hardware-requests: eqiad: New Hive / Oozie server node in eqiad Analytics VLAN - https://phabricator.wikimedia.org/T124945#2094118 (mark) I approve using one of the old rb servers for this, as soon as available. Let's make sure we have disks for them? [12:52:44] Analytics-Kanban: Communicate the WikimediaBot convention {hawk} - https://phabricator.wikimedia.org/T108599#2094157 (mforns) @jayvdb Yes, we'll send an email to wikitech-l. @bd808 I'd like to reach out to bot/framework maintainers and ask them to implement the optional improvement ('bot') of the User-Agent... 
[13:01:21] Analytics-Tech-community-metrics, Developer-Relations, DevRel-March-2016, Gerrit-Migration: Make MetricsGrimoire/korma support gathering Code Review statistics from Phabricator's Differential - https://phabricator.wikimedia.org/T118753#2094187 (Aklapper) [13:26:19] Analytics-Kanban, Patch-For-Review: Create reportupdater browser reports that query hive's browser_general table {lama} - https://phabricator.wikimedia.org/T127326#2094213 (mforns) @milimetric Looking into the line ending thing. [13:36:36] Analytics-Tech-community-metrics, Developer-Relations, DevRel-March-2016, Gerrit-Migration: Make MetricsGrimoire/korma support gathering Code Review statistics from Phabricator's Differential - https://phabricator.wikimedia.org/T118753#2094217 (Lcanasdiaz) @Aklapper any news about when you guys are... [14:31:11] Analytics-Kanban, Patch-For-Review: Create reportupdater browser reports that query hive's browser_general table {lama} - https://phabricator.wikimedia.org/T127326#2094290 (mforns) @milimetric I saw your patch removing the CRLF from the .gitignore file. In addition to it, in stat1002 the file browser/con... [14:34:58] (PS1) Mforns: Remove CRLFs from file [analytics/reportupdater] - https://gerrit.wikimedia.org/r/275470 (https://phabricator.wikimedia.org/T127326) [14:41:24] (CR) Mforns: [C: 2 V: 2] "Self-merge :] just removing CRLFs from a file." [analytics/reportupdater] - https://gerrit.wikimedia.org/r/275470 (https://phabricator.wikimedia.org/T127326) (owner: Mforns) [14:46:31] Analytics-Kanban, Patch-For-Review: Create reportupdater browser reports that query hive's browser_general table {lama} - https://phabricator.wikimedia.org/T127326#2094305 (mforns) Done. Now both reportupdater and reportupdaterq-queries have no CRLF files neither in the original repos, nor in stat1002. Ho... 
[15:18:09] Analytics-Kanban, Editing-Analysis, Patch-For-Review: Re-enable the edit analysis dashboard - https://phabricator.wikimedia.org/T126058#2094519 (mforns) a:Neil_P._Quinn_WMF>mforns [15:20:46] (PS1) Mforns: Increase the log level of the executor log. [analytics/reportupdater] - https://gerrit.wikimedia.org/r/275493 (https://phabricator.wikimedia.org/T126058) [15:47:45] milimetric, can you merge this if ok please? https://gerrit.wikimedia.org/r/#/c/275493/ [15:48:32] I'm checking if the edit RU is executing the edit queries or not, but I need more logs [15:49:08] (CR) Milimetric: [C: 2 V: 2] Increase the log level of the executor log. [analytics/reportupdater] - https://gerrit.wikimedia.org/r/275493 (https://phabricator.wikimedia.org/T126058) (owner: Mforns) [15:49:30] mforns: I'm pretty sure it started and picked back up where it was [15:49:43] so we made some changes to the puppet and stuff [15:49:56] milimetric, but I didn't see any new results so far... [15:50:09] we moved all the output to /srv/reportupdater/output/ and symlinked it from /a/limn-public-data [15:50:09] thx! 
[15:50:20] I see [15:50:28] when I looked on Friday there were some files that had been updated since the upgrade [15:50:34] oh [15:50:38] ok [15:50:41] there/s /a/limn-public-data.orig which I was comparing it too [15:50:43] let me have another look [15:50:47] (it's a copy of the original) [15:50:50] aaaah [15:53:33] it looks like of all the limn-edit-data metrics, the only one that's making progress is failure_rates_by_type [15:53:56] so if you ls -alArt /srv/reportupdater/output/metrics/failure_rates_by_type/visualeditor [15:54:09] you'll see some stuff updated March 5th and 7th [15:54:15] so I think those queries continue to be quite slow [15:54:48] aha [15:54:49] I realize now that RU doesn't run in parallel like we can do with wikimetrics [15:54:50] makes sense [15:54:58] yes you're right [15:55:22] but I don't think we want to go down that path again [15:55:43] that was a hard-ish problem to solve and I'd rather focus on making a druid cluster where we can get these metrics super fast [15:55:53] aha sure [15:55:59] than make it feasible to use this much slower way of getting data [16:01:13] Analytics-Kanban, Wikipedia-Android-App-Backlog: Count requests to RESTBase from the Android app - https://phabricator.wikimedia.org/T128612#2094593 (Nuria) a:Milimetric>Nuria [16:03:15] Analytics-Kanban, Patch-For-Review: Create reportupdater browser reports that query hive's browser_general table {lama} - https://phabricator.wikimedia.org/T127326#2094598 (Milimetric) Thanks very much for cleaning that. I continue to be overwhelmed by unseen enemies like encoding and line endings :) [16:07:51] Analytics, Pageviews-API: Wikimedia pageviews API blocked by ad blockers - https://phabricator.wikimedia.org/T126947#2094604 (Milimetric) >>! In T126947#2091634, @MusikAnimal wrote: > Apparently any kind of pageviews statistics stuff is considered an ad, as I see they have several other similar routes bl... 
[16:16:15] Analytics, Pageviews-API: Wikimedia pageviews API blocked by ad blockers - https://phabricator.wikimedia.org/T126947#2094623 (MusikAnimal) @Milimetric Of course! We're going to try to use our own backend server to make the requests and get around the ad blockers using the route `/pv`, which I don't think... [16:18:34] Analytics: Invalid page titles are appearing in the top_articles data - https://phabricator.wikimedia.org/T117346#2094641 (Milimetric) We'd love to get pageId into each of our pageview requests, but that's not happening any time soon. When we have that, yes, we can add poor old en.wikipedia.org/wiki/- back... [16:24:11] (CR) Nuria: Update camus to support reading avro schemas from an avro protocol (1 comment) [analytics/refinery/source] - https://gerrit.wikimedia.org/r/274307 (https://phabricator.wikimedia.org/T128530) (owner: EBernhardson) [16:24:18] Analytics, Pageviews-API: Document that wikimedia pageviews API is blocked by ad blockers - https://phabricator.wikimedia.org/T126947#2094678 (Milimetric) p:Triage>Normal [16:26:44] Analytics-Kanban: Parse User-Agent strings with OS like "Windows 7" correctly into the user agent map {hawk} - https://phabricator.wikimedia.org/T127324#2094693 (Nuria) It will be good to take the opportunity of having looked into this to upgrade ua parser if proceeds. [16:28:15] a-team: I am working on https://phabricator.wikimedia.org/T128491 atm so I might need to take the first "free to go" ticket to skip standup :( any issue with that? [16:28:56] elukey, np for me [16:29:58] (CR) Nuria: [C: 2 V: 2] "Right, cause x-analytics argument is optional. 
I do not think this had any functional effect given than 'ispreview' header was not being s" [analytics/refinery] - https://gerrit.wikimedia.org/r/275382 (owner: Joal) [16:34:12] (CR) EBernhardson: Update camus to support reading avro schemas from an avro protocol (2 comments) [analytics/refinery/source] - https://gerrit.wikimedia.org/r/274307 (https://phabricator.wikimedia.org/T128530) (owner: EBernhardson) [16:43:48] Analytics-Kanban: Communicate the WikimediaBot convention {hawk} - https://phabricator.wikimedia.org/T108599#2094762 (bd808) >>! In T108599#2094157, @mforns wrote: > @bd808 I'd like to reach out to bot/framework maintainers and ask them to implement the optional improvement ('bot') of the User-Agent policy. C... [17:06:06] Analytics, Pageviews-API: Pageviews data for most recent day is missing when given one date range but not for other date ranges - https://phabricator.wikimedia.org/T128925#2094935 (Milimetric) Open>Invalid Our cluster might serve incomplete data because we're going for eventual consistency. So whi... [17:08:31] Analytics, Analytics-Cluster: Ensure file.encoding is UTF-8 for all JVMs in the Analytics Cluster - https://phabricator.wikimedia.org/T128607#2080328 (Milimetric) p:Triage>High [17:12:49] Analytics, RESTBase: REST API entry point web request statistics at the Varnish level - https://phabricator.wikimedia.org/T122245#1899324 (Milimetric) @GWicke: can you give us some examples of what you'd like to see in these reports? 
[17:14:15] Analytics: Better redirect handling for pageview API - https://phabricator.wikimedia.org/T121912#2094987 (Milimetric) p:Triage>Normal [17:14:53] Analytics, Datasets-General-or-Unknown, Operations, Traffic: http://dumps.wikimedia.org should redirect to https:// - https://phabricator.wikimedia.org/T128587#2094989 (ArielGlenn) [17:15:32] Analytics, Analytics-Cluster, Operations, Traffic: Enable Kafka native TLS in 0.9 and secure the kafka traffic with it - https://phabricator.wikimedia.org/T121561#1881737 (Milimetric) p:Triage>Normal [17:16:54] Analytics: kafka-tools fails to run on stat1002 - https://phabricator.wikimedia.org/T121552#2094994 (Milimetric) Open>Invalid You should specify the port number: kafka-tools -b kafka1012.eqiad.wmnet:9092 print_topics [17:17:17] Analytics, RESTBase: REST API entry point web request statistics at the Varnish level - https://phabricator.wikimedia.org/T122245#2094997 (GWicke) @Milimetric: The main bit of information we are looking for is number of Varnish requests per API entry point. We already have per-entrypoint information fro... [17:19:26] Analytics: eventlogging user agent data should be parsed so spiders can be easily identified {flea} - https://phabricator.wikimedia.org/T121550#2095015 (Nuria) [17:21:23] Analytics: MobileWikiAppDailyStats should not count Googlebot - https://phabricator.wikimedia.org/T117631#2095024 (Nuria) Removing analytics. Note that our work on this regard can be tracked on T121550 ( eventlogging user agent data should be parsed so spiders can be easily identified {flea}." ) [17:23:49] Analytics: Track overall traffic, without any filtering, broken down into major categories, for internal use. - https://phabricator.wikimedia.org/T117236#2095036 (Milimetric) We're going to try to accomplish this via loading wmf.webrequest data into Druid without the page_title dimension. We'll keep it in t... 
[17:24:04] Analytics: Track overall traffic, without any filtering, broken down into major categories, for internal use. - https://phabricator.wikimedia.org/T117236#1769279 (Milimetric) p:Triage>Normal [17:25:13] Analytics, Reading-Admin, Zero: Country mapping routine for proxied requests - https://phabricator.wikimedia.org/T116678#2095047 (Milimetric) p:Triage>Normal [17:28:22] Analytics, Datasets-General-or-Unknown, Operations, Wikidata: Requests to dumps.wikimedia.org should end up in hadoop wmf.webrequest via kafka! - https://phabricator.wikimedia.org/T116430#2095058 (Milimetric) Open>declined So we're leaning towards declining this unless requests for dumps.wik... [17:44:08] Analytics-Kanban: Integrate new browser visualization into wikistats - https://phabricator.wikimedia.org/T129101#2095134 (Nuria) [17:45:58] Analytics-Kanban: Browser visualizations should include a table visualization that displays browser and percentage in text form - https://phabricator.wikimedia.org/T129102#2095150 (Nuria) [17:46:55] Analytics-Kanban: Integrate new browser visualization into wikistats - https://phabricator.wikimedia.org/T129101#2095164 (Nuria) The socalled squid reports can be found here: https://stats.wikimedia.org/wikimedia/squids/SquidReportClients.htm [17:59:21] a-team signing off for today! bye [17:59:28] mforns: o/ [17:59:36] Good night mforns :) [17:59:40] night all [18:34:22] madhuvishy: yt? [18:34:42] madhuvishy: is there anything else besides fab to be able to deploy to prod? 
[18:34:56] madhuvishy: fab dashboard:vital-signs production deploy -u nuria doesn't update source [18:36:29] cc milimetric [18:39:37] nuria: yes [18:39:39] for prod [18:39:49] you have to explicitly specify hostname [18:40:28] like this - [18:40:30] fab dashboard:edit-analysis,layout=compare,hostname=edit-analysis-test.wmflabs.org staging deploy [18:40:36] in your case [18:41:01] fab dashboard:vital-signs,hostname=vital-signs.wmflabs.org production deploy [18:43:18] madhuvishy: k [18:49:49] nuria: all good? [18:50:22] madhuvishy: indeed!!! [18:50:33] madhuvishy: now i can remove the evil cron [18:50:36] :D [18:50:58] milimetric: all deployed now: https://vital-signs.wmflabs.org/#projects=dewiki,frwiki,enwiki,eswiki,jawiki/metrics=Pageviews [19:14:23] joal:yt? [19:16:17] joal: looks like last access uniques for february was restarted , correct? [19:17:37] nuria: how come commons doesn't show up any more ... hmmm https://vital-signs.wmflabs.org/#projects=commonswiki/metrics=Pageviews [19:18:43] milimetric: looking [19:19:53] I'm gonna go get some food [19:20:23] other than that the deploy looks good, nuria, sorry I missed that. It looks like all the non-standard wikis have the problem, like mediawiki, commons, etc. [19:20:52] milimetric: non standard in what way? [19:25:15] joal: nuria I'm reading the conclusion of the Wikimedia Bot convention thread and am a bit confused [19:25:37] madhuvishy: the e-mail thread? aham [19:25:43] nuria: yeah [19:26:00] https://lists.wikimedia.org/pipermail/analytics/2016-February/004905.html [19:26:17] so we were gonna match more things according to user agent polic [19:26:22] which is fine [19:26:35] and users would be encouraged to use the word bot in the UA [19:26:38] madhuvishy: yes [19:26:46] users using bots yes [19:27:00] currently mentioning the word bot classifies you as spider [19:27:08] and WikimediaBot classifies you as bot [19:27:16] but WikimediaBot didn't get any buy in [19:27:24] so are we going to remove the bot flag? 
[19:27:28] joal: ^ [19:29:03] madhuvishy: the bot versus spider is super confusing [19:29:13] madhuvishy: but i do not think we have to deal with that now [19:29:30] madhuvishy: right? [19:29:36] nuria: i understand - right but the code that marks WikimediaBot as bot is not relevant anymore? [19:30:00] can i remove it with this patch - because the task is to update our code to match the policy [19:30:02] madhuvishy: the important thing is that if "pattern-x-we-are-missing" is on UA traffic is not classied as coming from a user [19:30:12] madhuvishy: if that makes sense [19:31:03] nuria: no no i understand - I'm just asking if we are not longer going to have the bot distinction and I can remove the isWikimediaBot code [19:31:41] madhuvishy: ok, anything that matches "bot" should be getting tagged as non-human [19:31:59] nuria: that is already happening [19:32:01] madhuvishy: if so we can remove any specific code for "wikimediabot" and update docs on that regard [19:32:05] it's tagged spider [19:32:09] alright [19:32:21] madhuvishy: spider... man... [19:32:42] madhuvishy: when you remove teh code it will be tagged simply as "bot" correct? [19:32:51] as in "robot" [19:32:52] nuria: ha ha [19:32:56] madhuvishy: great. [19:33:01] no - it will be spider [19:33:05] which is what it was [19:33:20] what else is being tagged as spider? [19:33:30] okay so - there's a huge regex that matches all the things - if anything is matched with this regex it's a spider [19:33:35] everything is a spider currently [19:33:59] we added this thing were, if you mentioned "WikimediaBot" specifically - it would be tagged bot [19:34:12] that proposal seems to have been rejected in the thread [19:34:49] so I'm saying/asking if I should add more regex to the spider identifying regex and remove the isWikimediaBot code [19:35:00] which was never really used [19:35:18] yes, please. 
[19:35:20] and everything, and now more things (emails etc) will be matched for the is_spider classification [19:35:31] Let's remove the code that is ener been used, update the docs [19:35:35] okay [19:35:36] cool [19:35:44] *never [19:36:08] nuria: also - is there a good way to match emails with regex? [19:36:19] madhuvishy: email adresses? [19:36:23] yes [19:36:27] the UA policy [19:36:43] encourages bot devs to put in their email addresses [19:36:51] madhuvishy: ya, and we have a few of those [19:37:05] now we have to add a regex to match them [19:37:24] madhuvishy: it can be another step after the main regex, commons does it [19:37:27] madhuvishy: let me see [19:38:47] madhuvishy: do we use commons.. mmm.. i think not [19:39:06] nuria: i think we have it [19:39:13] the file has import org.apache.commons.lang3.StringUtils; [19:39:44] madhuvishy: then we can use commons e-mail validator , which is build for a different purpose but should work for this [19:40:10] madhuvishy: https://commons.apache.org/proper/commons-validator/apidocs/org/apache/commons/validator/routines/EmailValidator.html [19:40:29] nuria: but would that match for an occurrence of email anywhere within a string? [19:40:44] madhuvishy: ah, no wait [19:41:15] we dont care so much if the email is valid as much as finding if there is a semblance of an email in our UA string no? [19:41:31] user@.gmail.com may be invalid - but definitely a bot [19:41:42] madhuvishy: ya, but if UA is only an e-mail that class would do the job, do they send anything else besides e-mail in that case according to poilicy? 
[19:41:52] yes [19:42:03] https://meta.wikimedia.org/wiki/User-Agent_policy [19:42:21] it encourages them to name the tool, and also leave contact info like email etc [19:42:33] the implication is emails imply bots [19:43:15] i'm wondering that if anything matches the _@_._ pattern - whether or not it's a "valid" email - it definitely is a bot [19:43:21] madhuvishy: i rather not use regexex if possible [19:43:29] nuria: how else? [19:43:48] madhuvishy: or "precise" regexes that is [19:43:58] https://github.com/wikimedia/analytics-refinery-source/blob/master/refinery-core/src/main/java/org/wikimedia/analytics/refinery/core/Webrequest.java#L49 [19:44:25] nuria: yeah super long precise regexes may not be necessary for our usecase [19:48:28] madhuvishy: and it cannot be done as far as i know, so let's be super coarse? \w{1,}@\w {1,}? [19:48:33] madhuvishy: jajajaj http://emailregex.com/ [19:48:59] madhuvishy: please look at the perl one wowowow [19:50:10] nuria, madhuvishy : was way for diner [19:50:27] joal: want to chime in on e-mail detection [19:50:32] nuria: Yes, I have restarted monthly uniques for february [19:50:35] joal: I say we keep it super corse [19:50:41] *coarse [19:50:51] joal: thanks for restarting job [19:51:13] sure, batcave? [19:51:35] madhuvishy: batcave for A=sec? [19:51:39] * a sec? [19:53:30] madhuvishy, nuria : We should spend some time with mforns on that UA thing [19:57:43] Analytics, MediaWiki-extensions-WikimediaEvents, The-Wikipedia-Library, Wikimedia-General-or-Unknown, Patch-For-Review: Implement Schema:ExternalLinkChange - https://phabricator.wikimedia.org/T115119#2095926 (Sadads) @Legoktm and/or @Beetstra can we implement the last couple bits so that we c... [19:58:11] joal: nuria are you on batcave? 
[19:58:12] madhuvishy: just talked with nuria t, please remove the code on WikimediaBot [19:58:16] okay [19:58:16] We just were :) [19:58:25] want to chat a bit, can join bck :) [19:58:31] sorry i missed the messages [19:58:34] np [19:58:35] madhuvishy: and let's keep e-mail check super coarse [19:58:43] no if you've already discussed it i'm good [19:58:43] Let's batcave again ;) [19:58:49] ok [19:59:05] okay cool, got it - so everything will be spider [20:00:00] milimetric: ya, sitematrix is not finding those projects, let me see what is different [20:00:48] removal of WikimediaBot, email addresses very simple: \W@\W\.[a-zA-Z]{2,3} [20:01:10] joal: perfect thanks [20:01:33] madhuvishy: correct, everything is spider, and we might maybe solve the naming thing on that one day (nuria doesn't like the spider terminology :) [20:01:43] I'm also adding User:, User_talk:, github and tools.wmflabs.org [20:01:44] well everything, everythin but user :) [20:01:48] yes [20:02:08] hm, not sure I understand madhuvishy [20:02:19] tools.wmflabs, ok sure, but User: ? [20:02:21] joal: nah, if everyone is happy with spider i am happy to ahem....not talk about that EVER again [20:02:33] Muhahaha :D [20:02:46] joal: yeah the bot policy encourages bot devs to leave their User pages there or email or some contact info [20:02:55] K makes sense [20:03:04] Awesome madhu ! [20:03:04] cool [20:03:07] Thanks a lot :) [20:03:13] * madhuvishy heads to lunch and office [20:03:15] np! [20:03:22] Bye bye :) [20:06:25] Bye joal, thanks for helping out :) [20:22:44] milimetric: From reading scrollback it sounds like edit-analysis queries are running, albeit slowly? [20:23:09] James_F: yes, very slowly [20:25:36] nuria: Such is life. [20:25:45] * James_F adopts Zen patience. 
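Editor's note on the e-mail check agreed above: a minimal sketch, in Python rather than the Java used in refinery-source, of what a "super coarse" contact-info check on a User-Agent string could look like. The names `EMAIL_LIKE`, `CONTACT_MARKERS` and `looks_like_spider` are hypothetical and are not taken from the actual patch. The pattern quoted in the log uses uppercase `\W`, which in most regex flavors matches a non-word character; this sketch assumes lowercase `\w` was intended. It deliberately accepts malformed addresses such as `user@.gmail.com`, since the goal discussed is spotting a semblance of an e-mail, not validating one.

```python
import re

# Hypothetical coarse pattern: one word character, "@", anything host-like,
# a dot, then a short alphabetic ending. It does not validate addresses.
EMAIL_LIKE = re.compile(r"\w@[\w.-]*\.[a-zA-Z]{2,}")

# Contact markers mentioned in the discussion (User pages, github, tools.wmflabs.org).
CONTACT_MARKERS = ("User:", "User_talk:", "github", "tools.wmflabs.org")


def looks_like_spider(user_agent: str) -> bool:
    """Return True if the UA carries something resembling contact info."""
    if EMAIL_LIKE.search(user_agent):
        return True
    return any(marker in user_agent for marker in CONTACT_MARKERS)
```

In this sketch the check would run as an extra step after the main spider regex, as suggested in the conversation; how it is wired into the real classification code is not shown here.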
[20:26:09] * nuria wishes i could do same [20:27:43] milimetric: fixed special projects, will submit patch [20:29:35] Analytics-Kanban: Special projects not showing up on dashiki after Pageview API migration - https://phabricator.wikimedia.org/T129131#2096138 (Nuria) [20:40:51] cool, I'll review it when it's up, nuria [20:41:47] milimetric: ok, trying to fix wikidata too [20:43:44] milimetric: wikidata is going to need a pull request [20:47:21] (PS1) Nuria: Correcting parsing of sitematrix for special projects and wikidata [analytics/dashiki] - https://gerrit.wikimedia.org/r/275618 (https://phabricator.wikimedia.org/T129131) [20:48:36] (CR) Nuria: "Not so fan of special casing" (1 comment) [analytics/dashiki] - https://gerrit.wikimedia.org/r/275618 (https://phabricator.wikimedia.org/T129131) (owner: Nuria) [20:56:45] (CR) Milimetric: Correcting parsing of sitematrix for special projects and wikidata (2 comments) [analytics/dashiki] - https://gerrit.wikimedia.org/r/275618 (https://phabricator.wikimedia.org/T129131) (owner: Nuria) [21:01:40] (CR) Nuria: Correcting parsing of sitematrix for special projects and wikidata (1 comment) [analytics/dashiki] - https://gerrit.wikimedia.org/r/275618 (https://phabricator.wikimedia.org/T129131) (owner: Nuria) [21:07:27] (PS2) Nuria: Correcting parsing of sitematrix for special projects and wikidata [analytics/dashiki] - https://gerrit.wikimedia.org/r/275618 (https://phabricator.wikimedia.org/T129131) [21:08:48] (CR) Nuria: Correcting parsing of sitematrix for special projects and wikidata (1 comment) [analytics/dashiki] - https://gerrit.wikimedia.org/r/275618 (https://phabricator.wikimedia.org/T129131) (owner: Nuria) [21:20:32] hi! Is there a way to get the aggregated pageviews over the last N months other than fetching all the hourly files? 
[21:22:46] Analytics, Deployment-Systems, scap, Scap3 (scap3-adoption): Deploy analytics-refinery with scap3 - https://phabricator.wikimedia.org/T129151#2096617 (thcipriani) [21:23:20] Analytics, Analytics-Cluster, Deployment-Systems, scap, Scap3 (scap3-adoption): Deploy analytics-refinery with scap3 - https://phabricator.wikimedia.org/T129151#2096628 (greg) [21:25:50] nuria: hmmmm... this doesn't work: https://wikimedia.org/api/rest_v1/metrics/pageviews/aggregate/www.wikidata.org/all-access/user/daily/2015010100/2016030712 [21:28:36] Analytics, Analytics-Cluster, Deployment-Systems, scap, Scap3 (scap3-adoption): Deploy analytics-refinery with scap3 - https://phabricator.wikimedia.org/T129151#2096708 (thcipriani) [21:33:24] ok, I'm not crazy, it was a task that's not done yet, nuria: https://phabricator.wikimedia.org/T127030 [21:48:51] milimetric: indeed, i think i spaced out with how many tabs did i have open! [21:52:24] (CR) Milimetric: Correcting parsing of sitematrix for special projects and wikidata (1 comment) [analytics/dashiki] - https://gerrit.wikimedia.org/r/275618 (https://phabricator.wikimedia.org/T129131) (owner: Nuria) [21:52:49] nuria: I'll submit a patch to strip www. in the api itself, seems safe in my opinion [21:53:07] but for now, you don't need to special case anything, just use that regex I pointed out in my last comment ^ [21:54:25] milimetric: right, but api client below still assumes projects have a "." [21:57:08] (PS3) Nuria: Correcting parsing of sitematrix for special projects and wikidata [analytics/dashiki] - https://gerrit.wikimedia.org/r/275618 (https://phabricator.wikimedia.org/T129131) [21:57:27] sylvinus: aggregated as in, aggregated by project or per-article? 
[21:57:44] (CR) Nuria: Correcting parsing of sitematrix for special projects and wikidata (1 comment) [analytics/dashiki] - https://gerrit.wikimedia.org/r/275618 (https://phabricator.wikimedia.org/T129131) (owner: Nuria)
[21:57:45] madhuvishy: per article, for all articles
[21:57:48] nuria: all the weird projects that don't have <>.<> in the URL are in data.sitematrix.specials
[21:58:11] in the _.forEach, you can examine the key (second param) and if it's === 'specials', you can add '.org'
[21:59:58] sylvinus: you can get daily aggregates through the api
[22:00:10] milimetric: we don't have monthly aggregated per-article yet?
[22:01:40] madhuvishy: but I shouldn't get the N million articles from the api right?
[22:01:59] I want to get all of them :)
[22:02:01] madhuvishy / sylvinus: we have monthly, yep: https://wikimedia.org/api/rest_v1/metrics/pageviews/aggregate/en.wikipedia/all-access/all-agents/monthly/2015100100/2015123000
[22:02:17] sylvinus: you want the counts for each article, per article, right?
[22:02:21] yes
[22:02:33] for that you need to use the dumps: http://dumps.wikimedia.org/other/
[22:02:44] yeah - and dumps are only hourly?
[22:03:02] sylvinus: ^ and we're in the process of re-processing that. The best dataset to use out of there right now is Erik Zachte's: http://dumps.wikimedia.org/other/pagecounts-ez/
[22:03:19] milimetric: my other question was why we don't have monthly on the per-article endpoint (although sylvinus cannot use this)
[22:03:19] sylvinus: contrary to what that description says, that dataset is now using the *new* pageview definition, as of last December I think
[22:03:23] yes I just wanted to make sure there was no other way (like a monthly dump somewhere ;)
[22:03:28] (we're in the process of updating docs)
[22:03:40] sylvinus: oh you want per-article monthly?
[22:03:50] yeah he does
[22:03:58] sylvinus: that's a good request, you should file it at: https://phabricator.wikimedia.org/tag/analytics/
[22:04:17] oh wait... sylvinus I think erik's data *does* aggregate monthly
[22:04:18] one sec
[22:04:20] ok! :) will do. sorry if it was unclear
[22:04:29] no it's ok, I jumped into the conversation late
[22:06:07] sylvinus: ok, right, so they're "monthly" in the sense that hourly data is compressed into one line per month, with that format that Erik describes
[22:06:09] http://dumps.wikimedia.org/other/pagecounts-ez/
[22:06:34] so it's /almost/ what you need :)
[22:06:53] yes that's pretty close!
[22:06:59] there's a -totals file that looks promising
[22:07:01] thanks a lot!
[22:07:05] np, good luck
[22:07:30] (we'll work to add some parsing tools in python for this stuff, in a central place somewhere, but I'm sure people have written that a million times, so someone's bound to have it)
[22:07:39] (I'm planning to integrate that into the rankings for https://about.commonsearch.org)
[22:07:45] cool!!!
[22:08:26] Analytics-Wikistats: Total page view numbers on Wikistats do not match new page view definition - https://phabricator.wikimedia.org/T126579#2097005 (ezachte) @Not much. I did some consistency checks, but nothing conclusive yet. My approach is to compare Wikistats counts with ad hoc aggregated webstatscollect...
[22:10:49] (CR) Milimetric: [C: 2 V: 2] Correcting parsing of sitematrix for special projects and wikidata [analytics/dashiki] - https://gerrit.wikimedia.org/r/275618 (https://phabricator.wikimedia.org/T129131) (owner: Nuria)
[22:12:11] I deployed it to prod, nuria
[22:12:14] all good now
[22:12:42] https://vital-signs.wmflabs.org/#projects=metawiki/metrics=Pageviews
[22:14:05] 22:13 < icinga-wm> PROBLEM - Kafka Broker Replica Max Lag on kafka1014 is CRITICAL: CRITICAL: 53.33% of data above the critical threshold [5000000.0]
[22:14:10] and kafka1020
[22:14:32] milimetric: ah! I was deploying first to staging
[22:14:41] milimetric: but hey there it is
[22:15:23] oops :) sorry, and I was just saying how careful I am while deploying :P
[22:15:48] Analytics-Kanban: Remove cron on wikimetrics instance that updates vital signs - https://phabricator.wikimedia.org/T125751#2097052 (Nuria)
[22:16:11] (PS1) Milimetric: Strip out www. in front of project names [analytics/aqs] - https://gerrit.wikimedia.org/r/275681 (https://phabricator.wikimedia.org/T127030)
[22:16:36] nuria: that's the patch to strip www. ^
[22:16:58] greg-g: what's going on - ottomata our kafka guy is out today
[22:18:13] Analytics-Kanban, Patch-For-Review: Strip out a www. prefix for the "project" parameter passed into the pageview API - https://phabricator.wikimedia.org/T127030#2097059 (Milimetric) a:Milimetric
[22:18:51] madhuvishy: not sure exactly, that's why I pasted in here so y'all would see the monitoring complaining about the hosts :)
[22:20:24] Analytics-Kanban, Patch-For-Review: Special projects not showing up on dashiki after Pageview API migration [3] - https://phabricator.wikimedia.org/T129131#2096138 (Nuria)
[22:22:14] (CR) Yurik: [C: -1] Strip out www. in front of project names (1 comment) [analytics/aqs] - https://gerrit.wikimedia.org/r/275681 (https://phabricator.wikimedia.org/T127030) (owner: Milimetric)
[22:44:33] (PS2) Milimetric: Strip out www. in front of project names [analytics/aqs] - https://gerrit.wikimedia.org/r/275681 (https://phabricator.wikimedia.org/T127030)
[22:45:22] (CR) Yurik: [C: 1] Strip out www. in front of project names [analytics/aqs] - https://gerrit.wikimedia.org/r/275681 (https://phabricator.wikimedia.org/T127030) (owner: Milimetric)
[22:45:36] (CR) Milimetric: "1. can't believe you saw this, thanks!" (1 comment) [analytics/aqs] - https://gerrit.wikimedia.org/r/275681 (https://phabricator.wikimedia.org/T127030) (owner: Milimetric)
[22:53:43] (CR) Ppchelko: Strip out www. in front of project names (1 comment) [analytics/aqs] - https://gerrit.wikimedia.org/r/275681 (https://phabricator.wikimedia.org/T127030) (owner: Milimetric)
[22:55:32] (CR) Milimetric: Strip out www. in front of project names (1 comment) [analytics/aqs] - https://gerrit.wikimedia.org/r/275681 (https://phabricator.wikimedia.org/T127030) (owner: Milimetric)
[23:29:57] Analytics-Kanban, Research-and-Data, Patch-For-Review: Remove Client IP from Eventlogging capsule {mole} - https://phabricator.wikimedia.org/T128407#2097399 (madhuvishy) Thanks @leila :) @Nuria @Ottomata should we start this deployment process tomorrow (Tuesday)? The plan is here - https://etherpad.wi...
[23:47:38] (PS5) BryanDavis: Add initial oozie job for ApiAction [analytics/refinery] - https://gerrit.wikimedia.org/r/273557 (https://phabricator.wikimedia.org/T108618)
[23:47:45] (CR) BryanDavis: Add initial oozie job for ApiAction (3 comments) [analytics/refinery] - https://gerrit.wikimedia.org/r/273557 (https://phabricator.wikimedia.org/T108618) (owner: BryanDavis)