[00:06:34] Hi analytics! Pls let me bother u with a few questions: (1) does there exist a standard place and way to run jobs that read data from Hive, create some aggregate numbers, and make it public somewhere? [00:07:14] More specifically (2) a way to pull data from Hive, aggregate and make it available to our Graph and Map wiki extensions? [00:07:37] And even more specifically (3) do the above with specific pageview and webrequest data? [00:07:43] Thx in advance! [00:13:22] ottomata milimetric madhuvishy nuria_ ^ ? :) [00:14:29] AndyRussG: yes [00:14:37] quick thing though, Madhu moved to the labs team [00:14:44] AndyRussG: one sec, I'll paste you some links [00:14:52] the tool you're looking for is called reportupdater, you heard of it? [00:15:18] https://github.com/wikimedia/analytics-reportupdater-queries/ is an example of using it [00:15:35] one query is configured here: https://github.com/wikimedia/analytics-reportupdater-queries/blob/master/browser/config.yaml#L3 [00:15:45] (ah right sorry madhuvishy :) ) [00:16:00] and defined here (notice the $1 templating): https://github.com/wikimedia/analytics-reportupdater-queries/blob/master/browser/all_sites_by_browser_family_and_major [00:16:51] basically, you write some yaml, you write some sql, and you magically get data out. In this case, data gets written and rsynced out to here: https://datasets.wikimedia.org/limn-public-data/metrics/browser/ [00:17:01] (the name of the TSV file is the name of the query) [00:17:16] more on reportupdater: https://wikitech.wikimedia.org/wiki/Analytics/Reportupdater [00:17:25] lemme know if you have questions AndyRussG [00:17:43] Wow that sounds amazingly ready-to-eat, microwave straight from frozen! :) [00:18:14] * AndyRussG looks... [00:19:10] milimetric: yeah I'll surely let you know if I have questions, which is likely... thx so much!! :D [00:24:25] So basically it's an analytics-specific cron job runner I guess? [00:24:40] * AndyRussG paraphrases doc [00:29:00] Hmmm [00:29:09] So I can write python scripts too? [00:31:38] Here is the development workflow I'm imagining... I want to process some data for a specific set of bugs, and was planning to work stuff out using pyspark jupyter notebooks... But a bunch of the processing I want to do would be nice to run repeatedly with broad criteria and make public in the future... So I was thinking of trying to make what I do for these bugs feed into that longer-term goal [00:32:27] maybe I'm imagining stuff backwards nonsensically, dunno [00:32:35] milimetric ^ [00:48:31] Mmm I suppose it can be made to do hourly? [00:48:38] (or something more frequent than daily?) [01:15:24] It can do hourly, yes, AndyRussG [01:15:31] And it can do python [01:15:42] Any executable script really [01:16:47] AndyRussG: so just get some scripts first, then give configuring reportupdater a shot and ping me on the code review [01:17:05] milimetric: sounds great! thx!!! [01:17:10] That repo I showed you is a decent place, or you can start another [01:17:32] That one is analytics/reportupdater-queries in gerrit [01:18:51] I guess the analytics cluster would be the place to run stuff? [01:21:18] It's for two things... well no, three, potentially: monitoring the relationship of pageviews and CN banner impressions for active campaigns, making banner impression numbers public and easily graphable/mappable, and replacing our aging partly-broken campaign "allocation" pages (that show who's being targeted by CN campaigns, where, which devices, etc.)
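To make the reportupdater pieces linked above concrete, here is a minimal sketch of a "script"-style report, modeled loosely on the linked browser/ examples. Everything here is hypothetical (report name, table, fields), so verify against the repo and the wikitech page: reportupdater invokes the executable with the report date as its first argument (the "$1 templating" milimetric points out) and expects a TSV, header plus data rows, on stdout.

```bash
#!/bin/bash
# daily_views_by_browser -- hypothetical reportupdater script report.
# reportupdater passes the date to compute as $1 and captures stdout,
# writing it to a TSV named after the report.
d="$1"
year=${d:0:4}
month=$((10#${d:5:2}))   # strip leading zeros: Hive partitions are ints
day=$((10#${d:8:2}))

echo -e "date\tbrowser_family\tviews"    # header row
hive -S -e "
  SELECT '$d', user_agent_map['browser_family'], SUM(view_count)
  FROM wmf.pageview_hourly
  WHERE year = $year AND month = $month AND day = $day
  GROUP BY user_agent_map['browser_family'];
"
```

The matching stanza under `reports:` in config.yaml names the report and gives it a granularity and start date; the resulting TSV, named after the report, is what ends up rsynced out to datasets.wikimedia.org as described.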
[01:21:56] Right now I'm working specifically on tracking down details and figuring out the causes of a spate of reports of CN banners not being shown as expected [01:22:32] So hopefully for the latter bug-solving I can hone some queries and code snippets that can eventually feed into the former medium-term goals... [01:23:35] So I'll try first putting stuff in that repo! [01:24:03] (first, as in, first step once I have code I'd like to repeatedly call to create reports, then) [05:33:54] AndyRussG|zzz: there is also this project which will enable accessing the data in jupyter notebooks directly https://wikitech.wikimedia.org/wiki/PAWS/Internal [05:40:42] Analytics, Analytics-EventLogging: eventlogging user agent data should be parsed so spiders can be easily identified {flea} - https://phabricator.wikimedia.org/T121550#2642773 (Nuria) [05:42:16] Analytics, Analytics-EventLogging: eventlogging user agent data should be parsed so spiders can be easily identified {flea} - https://phabricator.wikimedia.org/T121550#1881486 (Nuria) [06:03:30] Analytics, MediaWiki-extensions-WikimediaEvents, The-Wikipedia-Library, Wikimedia-General-or-Unknown, Patch-For-Review: Implement Schema:ExternalLinksChange - https://phabricator.wikimedia.org/T115119#2642815 (Nuria) @Samwalton9 : Before enabling enwiki please calculate how many events you mi... [08:17:20] Analytics-Tech-community-metrics, Developer-Relations: Sudden rise of changesets in wikimedia.biterg.io metrics - https://phabricator.wikimedia.org/T145849#2642955 (Qgil) [08:34:08] joal: o/ [08:34:27] I got https://gerrit.wikimedia.org/r/#/c/310831/4 reviewed so if you don't have anything against it, I'd like to merge it [08:34:45] it will add the new aqs nodes to the LVS pool but they won't get traffic [08:36:52] Analytics, MediaWiki-extensions-WikimediaEvents, The-Wikipedia-Library, Wikimedia-General-or-Unknown, Patch-For-Review: Implement Schema:ExternalLinksChange - https://phabricator.wikimedia.org/T115119#2642976 (Samwalton9) Assuming that 9:27 and 9:33 today were representative, something like 0... [09:24:41] !log added aqs100[456] to conftool-data (not pooled but the load balancer is doing health checks) [09:24:43] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log, Master [09:26:10] aaaaand we have health checks from LVS to aqs100[456] !!! [09:27:56] now I am going to double check the puppet config but we should be able to switch (at least from the puppet point of view) [10:23:28] Hi elukey, sorry I missed the ping :) [10:23:34] You did well merging that :) [10:23:51] don't worry! I always ask you for advice to preserve mental sanity [10:24:02] but checking everything, it seemed not impactful [10:24:13] I also checked puppet and I think we are good [10:24:20] I would have thought I have the opposite effect on most people (about mental sanity ;) [10:24:28] nono :D [10:25:02] That's great, I guess we are gonna try to put traffic on it early next week [10:25:20] * joal is relieved this thing is coming to an end [10:26:00] yeah [10:26:14] tentatively the switch could be made in the middle of next week [10:26:52] Which would also mean a rise in traffic-limitation :) [10:29:21] not immediately but yes :D [11:16:12] Analytics-Tech-community-metrics, Developer-Relations: Sudden rise of changesets in wikimedia.biterg.io metrics - https://phabricator.wikimedia.org/T145849#2642955 (Dicortazar) Indeed, according to korma, the number of changesets make sense after that huge increase. Those are at the level of 1 or 2K. On...
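As a sketch of the notebook-first workflow described above (prototype in pyspark, fold into reportupdater later), something like the following could run in a PAWS Internal session. This is a sketch under assumptions: the notebook is assumed to provide a SparkContext as `sc`, and the wmf.pageview_hourly fields should be double-checked in Hive.

```python
# Aggregate a day of pageviews per project from Hive -- a minimal
# pyspark (Spark 1.x era) sketch of the prototyping step described above.
from pyspark.sql import HiveContext

sqlContext = HiveContext(sc)  # `sc` is assumed to come with the notebook kernel

daily_views = sqlContext.sql("""
    SELECT project, SUM(view_count) AS views
    FROM wmf.pageview_hourly
    WHERE year = 2016 AND month = 9 AND day = 15
    GROUP BY project
    ORDER BY views DESC
""")

daily_views.show(20)  # sanity-check the aggregate before productionizing it
```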
[11:36:24] ottomata: For when you get online - https://gist.github.com/jobar/878a7dd7ec039559ee0aac225bb445e7 [11:41:25] Analytics-Tech-community-metrics, Developer-Relations: Sudden rise of changesets in wikimedia.biterg.io metrics - https://phabricator.wikimedia.org/T145849#2643377 (Dicortazar) I see the issue, in the new platform we're losing some packages when uploading the data. Issues in our infrastructure that we al... [11:56:30] * elukey lunch! [13:04:49] Analytics-Kanban: Better identify varnish/vcl timeouts and document - https://phabricator.wikimedia.org/T138511#2643514 (elukey) Last two puppet patches applied are: - https://gerrit.wikimedia.org/r/#/c/307246 - Raised the maximum number of incomplete log transactions from 1000 (default) to 5000. - https://... [13:07:07] Analytics-Kanban: AQS Cassandra READ timeouts caused an increase of 503s - https://phabricator.wikimedia.org/T143873#2643519 (elukey) As agreed with the team we will concentrate on putting the new cluster in production rather than keep going with this investigation. There are too many variables changed betwe... [13:24:46] ottomata: halo [13:25:08] not sure you've seen, but sent you some spark goodies [13:27:20] oo i did not see...:) [13:27:30] where did you send them? [13:28:05] ottomata: above in chan, here it is - https://gist.github.com/jobar/878a7dd7ec039559ee0aac225bb445e7 [13:29:27] Analytics-Kanban: Create documentation for edit history reconstruction - https://phabricator.wikimedia.org/T139763#2643556 (mforns) [13:34:43] Analytics, GLAM-Tech, Pageviews-API: WMF pageview API (404 error) when requesting statitsics over around 1000 files on GLAMorgan - https://phabricator.wikimedia.org/T145197#2643582 (Sadads) @Nuria This isn't primarily an outreach.wikimedia problem: most of the files used for GLAMMorgan end up on othe... [13:37:48] huh i see joal [13:37:49] cool [13:37:54] that is a good change [13:37:58] we should apply that to master [13:38:16] ottomata: Will submit a patch [13:38:32] :) [13:39:17] ottomata: o/ [13:39:29] hii [13:39:38] time to chat about pivot? [13:39:46] sure, i'm on the phone though! [13:39:49] but ja [13:42:04] on IRC it is fine :) [13:42:42] so the idea that we discussed yesterday (briefly on standup) would be to create a module called imply_pivot to wrap service::node, and then a simple role that enables it [13:43:00] afaics service::node will handle scap [13:43:17] now if this is correct, my next question is.. users/groups? [13:43:31] pivot/pivot? Because scap will also need to know who owns the files [13:43:37] (and who deploys) [13:43:47] sure, maybe just call the module pivot? Oorrrrrrrrr ja [13:43:53] it should be possible to re-use our analytics-deploy account [13:43:54] i think just pivot, cause we didn't call druid imply_druid [13:44:17] I thought about it but maybe it could overlap/confuse people? [13:44:25] not a big deal for me, pivot is fine [13:44:43] i don't know of a naming conflict with pivot so i think that's good [13:45:08] analytics-deploy group is good for deployment too [13:45:19] account [13:45:45] and pivot/pivot for user/group? [13:45:47] ok as well? [13:45:54] ja [13:46:00] good thanks :) [13:46:33] last question - do we want to use a separate service runner config or should we use the default?
It has some extra things but it looks good [13:47:02] (there is also the pivot config, I'll add it but theoretically we shouldn't need it) [13:47:20] (or maybe we should, to not rely on auto-discovery, don't know) [13:53:46] (PS1) Joal: Add singleton capactiy to UAParser [analytics/refinery/source] - https://gerrit.wikimedia.org/r/311127 (https://phabricator.wikimedia.org/T121550) [13:57:43] elukey: auto-discovery, i don't know [13:58:07] to discover measures, dimensions, etc. [13:58:13] service runner config, hah, i also don't know, never looked at it :) [13:58:15] oh [13:58:15] hm [13:58:32] and auto-discovery would be people entering those into the UI? [13:58:44] nono it does everything by itself [13:58:48] oh ok [13:58:49] asking the druid broker [13:58:55] i think it'd be good if we could puppetize our druid cluster info, but aside from that i think it's fine [13:58:57] i think that stuff will change [13:59:07] but if you add a measure for example it doesn't pick it up on the fly, you need to restart [13:59:19] okok [13:59:25] I'll add the possibility to specify a config file [14:03:30] ok [14:03:46] we'd have to restart if we puppetized it anyway, no? [14:08:18] yeah maybe it could be cleaner since we could add an explicit subscribe [14:08:21] or similar [14:08:27] (notify IIRC) [14:35:37] hey mforns [14:35:44] hi milimetric! [14:35:54] how goes [14:35:59] good :] sup? [14:36:21] Hi guys [14:36:29] hry joal 1 [14:36:31] ! [14:36:36] hey [14:36:40] all wrong [14:36:42] it's a party [14:36:45] oh no! [14:37:49] milimetric, what? [14:38:01] you said "all wrong" [14:38:09] i thought you meant literally everything was wrong [14:38:19] xD, no, I typed all wrong [14:38:21] sorry [14:38:43] wanna chat before standup? [14:39:44] mforns / joal ^ [14:39:45] Analytics, GLAM-Tech, Pageviews-API: WMF pageview API (404 error) when requesting statitsics over around 1000 files on GLAMorgan - https://phabricator.wikimedia.org/T145197#2643754 (Mrjohncummings) When I spoke to @Magnus before, he said this was the problem causing his tools to return false results... [14:39:48] sure milimetric [14:40:01] milimetric, yea [14:40:03] omw [14:51:17] Analytics, MediaWiki-extensions-WikimediaEvents, The-Wikipedia-Library, Wikimedia-General-or-Unknown, Patch-For-Review: Implement Schema:ExternalLinksChange - https://phabricator.wikimedia.org/T115119#2643774 (Nuria) @Samwalton9: are you taking into account the wiki size when it comes to numb... [14:56:02] Analytics, GLAM-Tech, Pageviews-API: WMF pageview API (404 error) when requesting statitsics over around 1000 files on GLAMorgan - https://phabricator.wikimedia.org/T145197#2643781 (Nuria) @Sadads: Throttling requests is not a mistake, is done on purpose as API cannot support arbitrary traffic. In y... [15:00:59] Analytics, MediaWiki-extensions-WikimediaEvents, The-Wikipedia-Library, Wikimedia-General-or-Unknown, Patch-For-Review: Implement Schema:ExternalLinksChange - https://phabricator.wikimedia.org/T115119#2643788 (Samwalton9) >>! In T115119#2643774, @Nuria wrote: > @Samwalton9: are you taking int...
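For the record, the module/role split being discussed might look roughly like the following. This is a sketch only, not the real code: the service::node parameter names (port, deployment, config) are assumptions about its interface at the time, and the pivot config key is invented for illustration.

```puppet
# modules/pivot/manifests/init.pp -- hypothetical sketch of a thin
# wrapper around service::node, deployed with scap3 by the
# analytics-deploy account and running as the pivot user/group.
# Parameter names below are assumptions, not the verified interface.
class pivot(
    $port         = 9090,
    $druid_broker = 'localhost:8082',
) {
    service::node { 'pivot':
        port       => $port,
        deployment => 'scap3',
        config     => {
            # explicit config instead of broker auto-discovery; puppet can
            # then notify/restart the service when e.g. a measure is added
            'druidHost' => $druid_broker,
        },
    }
}

# A simple role that enables it, as discussed above:
class role::analytics::pivot {
    include ::pivot
}
```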
[15:01:24] a-team: standduppp [15:03:45] ouch, joining, I was doing puppet stuff :/ [15:10:17] ottomata: https://gerrit.wikimedia.org/r/#/c/311139/1 - let me know if it makes sense to you [15:10:49] argh the message is not right [15:10:51] updating [15:11:35] https://gerrit.wikimedia.org/r/#/c/311139/2/modules/service/manifests/node.pp is better [15:13:34] (CR) Milimetric: [C: 2] Disable the deprecated option by_wiki [analytics/reportupdater] - https://gerrit.wikimedia.org/r/306968 (https://phabricator.wikimedia.org/T132481) (owner: Mforns) [15:14:19] (PS5) Milimetric: Add re-run script [analytics/reportupdater] - https://gerrit.wikimedia.org/r/308977 (https://phabricator.wikimedia.org/T117538) (owner: Mforns) [15:14:21] (CR) Mforns: [C: 2 V: 2] Make use of the new explode by file feature [analytics/limn-edit-data] - https://gerrit.wikimedia.org/r/307274 (https://phabricator.wikimedia.org/T132481) (owner: Mforns) [15:20:45] Analytics, GLAM-Tech, Pageviews-API: WMF pageview API (404 error) when requesting statitsics over around 1000 files on GLAMorgan - https://phabricator.wikimedia.org/T145197#2643808 (Mrjohncummings) @Nuria Is there a way round this throttling specifically for these tools? I use these tools as part of... [15:43:46] (CR) Ottomata: [C: 1] Add singleton capactiy to UAParser [analytics/refinery/source] - https://gerrit.wikimedia.org/r/311127 (https://phabricator.wikimedia.org/T121550) (owner: Joal) [16:07:08] milimetric, mforns: still in cave? [16:11:06] I'm grabbing lunch, joal, but we have a plan for pairing and finishing this up tonight. [16:11:36] milimetric: not sure what 'this' is, but sounds great :D [16:11:53] joal, we wanted to add the event_user_groups to the denormalized schema [16:12:06] and also fix the "latest" fields back in the reconstruction [16:12:20] yep [16:12:43] mforns, milimetric: anything against adding all user-related fields at once then? [16:13:06] joal, sure, just blocks missing, no? [16:13:19] have a good weekend team! [16:13:22] * elukey afk! [16:13:24] bye elukey ! [16:14:02] mforns: the easiest is to reuse the user sub-object, see the todo in the EditHistory object [16:14:17] joal, aha, I understand [16:14:20] mforns: And when I say easier, I should say cleaner ! [16:14:28] hehe [16:15:08] mforns, milimetric: As said after standup, looks like we have to process restores from the logging table [16:15:27] the few examples I looked at were all covered by that [16:15:28] joal, page restores? [16:15:31] correct mforns [16:15:47] to solve the wiki terror problem? [16:16:19] to solve the page-history shortening problem [16:16:37] I see, cool! Do we want to look at that today? I guess no.. right? [16:16:45] that's too many changes [16:16:51] mforns: was planning to actively start on Monday [16:17:09] ok [16:17:12] mforns: It would be awesome if you could push a patch tonight [16:17:21] for me to start afresh [16:17:29] but we won't block sending the data to Erik right? [16:17:35] sure, we'll commit tonight [16:18:31] (small changes I can take care of: add isDeleted, isReverted, rename sha1 revert to isIdentityRevert) [16:18:36] awesome guys ! [16:18:40] thanks a lot :) [16:22:13] Hello. Do you happen to know what is the best way to convert \u30b8\u30ab\u30a6\u30a4\u30eb\u30b9 into メインページ?
This is for an SQL query on the Wikimedia data [16:37:09] Analytics, GLAM-Tech, Pageviews-API: WMF pageview API (404 error) when requesting statitsics over around 1000 files on GLAMorgan - https://phabricator.wikimedia.org/T145197#2643934 (MusikAnimal) Hey, I didn't author this tool (not sure if that's why I was pinged). When I tried it out the only failed... [16:39:06] heading home, back shortly [16:45:32] hey mforns, I'm done eating [16:45:38] so whenever you want, we can cave [16:45:54] milimetric, hi, let's do this :] [16:50:12] milimetric: good morning! which LDAP credentials should I use for https://piwik.wikimedia.org/? I tried my wikitech name/pw but that didn't work. [16:50:22] (PS9) Mforns: [WIP] Join and denormalize all histories into one [analytics/refinery/source] - https://gerrit.wikimedia.org/r/307903 (owner: Milimetric) [16:51:00] bearloga: for LDAP credentials, I use the wikitech credentials, but that site needs you to log in as a piwik user as well [16:51:25] so are you having trouble with LDAP or piwik credentials? [16:52:51] milimetric: I'm having trouble with the first layer of authentication. after my wikitech login failed, I tried both name/password pairs from the piwik.credentials file in your dir. [16:53:18] k, lemme double check what ldap I use there [16:53:34] milimetric: I have 2FA turned on on my wikitech account if that helps you figure it out [16:53:53] ooh... I don't know how that works with piwik actually [16:54:10] bearloga: maybe you can ask in labs or ops? I'm no LDAP expert by any stretch :) [16:54:21] but yeah, I'm using my wikitech credentials there [16:54:29] will do! :) [17:05:01] milimetric: turned off 2FA, still won't accept my credentials. [17:05:31] ok, bearloga, I can try to dig into it later [17:05:37] but I'm in a long meeting now [17:05:59] milimetric: no worries! let me know if you find something later :) thanks! [17:06:12] np, sorry for the delay / bouncing around [17:17:26] mforns: https://www.mediawiki.org/wiki/Manual:Ipblocks_table [17:28:03] Analytics, Reading-analysis, Research-and-Data, Research-consulting: Collect feedback for the current state of the WMF core metrics proposal - https://phabricator.wikimedia.org/T145886#2644087 (leila) [17:33:28] Analytics, MediaWiki-extensions-WikimediaEvents, The-Wikipedia-Library, Wikimedia-General-or-Unknown, Patch-For-Review: Implement Schema:ExternalLinksChange - https://phabricator.wikimedia.org/T115119#2644117 (kaldari) >Before enabling enwiki please calculate how many events you might be send... [17:40:04] (PS17) Milimetric: Script sqooping mediawiki tables into hdfs [analytics/refinery] - https://gerrit.wikimedia.org/r/306292 (https://phabricator.wikimedia.org/T141476) [18:15:42] Analytics-Kanban, EventBus, Wikimedia-Stream: Public Event Streams - https://phabricator.wikimedia.org/T130651#2644237 (Ottomata) Been thinking about service names. 'Kasocki' is the node library that provides Kafka Consumer -> socket.io integration. We need a Wikimedia product name for this as well...
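Circling back to the \uXXXX question above (it never got an answer in-channel): one common approach in Hive is the reflect() UDF calling commons-lang's StringEscapeUtils.unescapeJava, which decodes Java-style \uXXXX escapes. For what it's worth, \u30b8\u30ab\u30a6\u30a4\u30eb\u30b9 decodes to ジカウイルス rather than メインページ. A sketch with a hypothetical table and column:

```sql
-- Decode Java-style \uXXXX escapes back into the underlying characters.
-- reflect() calls a static Java method; commons-lang ships with Hive, so
-- this usually works out of the box. `my_table` and `escaped_title` are
-- hypothetical stand-ins.
SELECT reflect('org.apache.commons.lang.StringEscapeUtils',
               'unescapeJava',
               escaped_title) AS decoded_title
FROM my_table
LIMIT 10;
```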
[18:39:37] mforns: https://github.com/shagabutdinov/sublime-local-variable [18:59:35] (PS26) Mforns: [WIP] Refactor Mediawiki History scala code [analytics/refinery/source] - https://gerrit.wikimedia.org/r/301837 (https://phabricator.wikimedia.org/T141548) (owner: Joal) [19:33:19] (PS18) Milimetric: Script sqooping mediawiki tables into hdfs [analytics/refinery] - https://gerrit.wikimedia.org/r/306292 (https://phabricator.wikimedia.org/T141476) [19:42:58] (PS10) Mforns: [WIP] Join and denormalize all histories into one [analytics/refinery/source] - https://gerrit.wikimedia.org/r/307903 (owner: Milimetric) [20:45:15] (PS19) Milimetric: Script sqooping mediawiki tables into hdfs [analytics/refinery] - https://gerrit.wikimedia.org/r/306292 (https://phabricator.wikimedia.org/T141476) [20:59:38] mforns: https://gerrit.wikimedia.org/r/#/c/301837/26/refinery-job/src/main/scala/org/wikimedia/analytics/refinery/job/edithistory/user/UserState.scala [21:13:55] (PS11) Mforns: [WIP] Join and denormalize all histories into one [analytics/refinery/source] - https://gerrit.wikimedia.org/r/307903 (owner: Milimetric) [21:15:39] milimetric, ... [21:15:43] it failed already [21:15:48] with a timeout [21:16:02] Futures timed out after [120 seconds]. This timeout is controlled by spark.rpc.askTimeout [21:16:30] oh [21:16:54] do you guys normally increase that when running? [21:16:58] or do you think something's wrong? [21:17:29] milimetric, I saw that error before when trying this same code [21:17:30] uh... I'll run it again too, see what I get [21:17:37] don't remember how I fixed it [21:17:48] probably it was just by bumping the memory [21:17:58] oh ok, I'll google around or bump the memory, yea [21:18:10] I'll launch it again now with 64 partitions instead of 32 [21:18:37] milimetric, done, will be back in 20 mins [21:19:03] ok (I'm still setting up my spark stuff) [21:44:15] milimetric, it finished! [21:44:26] I hope the data is healthy [21:44:35] woo! [21:44:38] awesome, thx mforns [21:44:40] good night! [21:44:43] I'm gonna take a look [21:44:44] nite [21:44:49] cool [22:41:21] bearloga: sorry, I'm running super late today and still swamped [22:41:58] bearloga: I won't be able to try and debug the login issue, but I'll try Monday if you remind us. Other people here might be able to help too, I'm really the worst at opsy stuff :) [22:55:47] milimetric: np! [22:55:53] milimetric: have a good weekend! [22:56:06] thx, you too!
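On the timeout exchange above: the error names spark.rpc.askTimeout, a real Spark setting whose default is 120 seconds, and the two workarounds tried were more memory and more partitions. A sketch of raising them when building a context, with illustrative (untuned) values:

```python
# Raise the RPC timeout from the default 120s and give executors more
# memory -- the knobs from the exchange above. Values here are
# illustrative guesses, not recommendations.
from pyspark import SparkConf, SparkContext

conf = (SparkConf()
        .set("spark.rpc.askTimeout", "600s")   # timeout named in the error message
        .set("spark.executor.memory", "4g"))   # "bump the memory"

sc = SparkContext(conf=conf)

# The 32 -> 64 partition change could be expressed as an explicit
# repartition on the job's RDD/DataFrame, e.g.:
# data = data.repartition(64)
```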