[00:09:13] (PS3) Neil P. Quinn-WMF: Update SQL scripts to reflect Edit schema change [analytics/limn-edit-data] - https://gerrit.wikimedia.org/r/236237 (https://phabricator.wikimedia.org/T111557) [00:11:29] (CR) Neil P. Quinn-WMF: "Okay, I think I've taken care of this. Look at patch set 3." (4 comments) [analytics/limn-edit-data] - https://gerrit.wikimedia.org/r/236237 (https://phabricator.wikimedia.org/T111557) (owner: Neil P. Quinn-WMF) [00:13:38] Analytics-EventLogging, MediaWiki-API, Patch-For-Review: Mediawiki API is returning empty strings for 'required' boolean fields - https://phabricator.wikimedia.org/T97487#1618828 (bd808) Open>Resolved >>! In T97487#1594989, @Milimetric wrote: > The solution in https://phabricator.wikimedia.org/T... [00:32:18] Analytics-Kanban: Set up auto-purging after 90 days {tick} - https://phabricator.wikimedia.org/T108850#1618861 (mforns) Here is the discussed white-list. {F2559632} **This white-list specifies which data must be kept indefinitely, the rest of the data must be auto-purged after 90 days.** It is a TSV file wi... [00:37:11] Analytics-Kanban: Set up auto-purging after 90 days {tick} - https://phabricator.wikimedia.org/T108850#1618887 (mforns) @jcrespo This is the list we talked about. Just to make clear that we can not start auto-purging before T108856 is set up, because we'd loose the history of the column editCount, which is n... [00:37:29] Analytics-Kanban: Set up bucketization of editCount fields {tick} - https://phabricator.wikimedia.org/T108856#1532299 (mforns) [00:37:30] Analytics-Kanban: Set up auto-purging after 90 days {tick} - https://phabricator.wikimedia.org/T108850#1618891 (mforns) [00:42:32] Analytics-Kanban: Set up bucketization of editCount fields {tick} - https://phabricator.wikimedia.org/T108856#1618897 (mforns) a:mforns>jcrespo [00:43:32] Analytics-Kanban: Set up bucketization of editCount fields {tick} - https://phabricator.wikimedia.org/T108856#1532299 (mforns) @jcrespo I assigned the task to you, as we spoke. Please, let me know if I can help you in any way. Thanks! [00:44:24] Analytics-Kanban: Delete obsolete schemas {tick} - https://phabricator.wikimedia.org/T108857#1618903 (mforns) a:mforns>jcrespo [00:45:23] Analytics-Kanban: Delete obsolete schemas {tick} - https://phabricator.wikimedia.org/T108857#1532313 (mforns) @jcrespo Hey, I also assigned this task to you, as we combined. Thanks! [01:17:14] madhuvishy: that google runs js doesn't mean googlebot accepts cookies, those are two different things [01:17:32] madhuvishy: sorry i was not here earlier [01:18:57] madhuvishy: did you find the answer to the EL udp server side issue? [01:31:57] Analytics, Engineering-Community, MediaWiki-API, Research consulting, and 3 others: Metrics about the use of the Wikimedia web APIs - https://phabricator.wikimedia.org/T102079#1618989 (Qgil) Having metrics about the use of our web APIs is still a good goal per se. The methods described by Dario ha... [02:01:58] (PS2) Nuria: [WIP] Make pageview definition aware of preview parameter [analytics/refinery/source] - https://gerrit.wikimedia.org/r/236800 [10:31:02] (CR) Milimetric: [C: 2 V: 2] Update SQL scripts to reflect Edit schema change [analytics/limn-edit-data] - https://gerrit.wikimedia.org/r/236237 (https://phabricator.wikimedia.org/T111557) (owner: Neil P. 
Quinn-WMF) [10:33:50] marktraceur: looks like http://datasets.wikimedia.org/limn-public-data/metrics/multimedia-health/uploads/ is not rsync-ed yet, so ping me when you're around and we can troubleshoot [10:37:26] (CR) Milimetric: "That might be my fault, did you try "npm install -g karma-cli" ? Also, "npm install" for the other dependencies. I had previously said "" [analytics/dashiki] - https://gerrit.wikimedia.org/r/231424 (https://phabricator.wikimedia.org/T104261) (owner: Milimetric) [10:45:10] Analytics-Kanban: Delete obsolete schemas {tick} - https://phabricator.wikimedia.org/T108857#1620169 (jcrespo) p:High>Triage [10:46:27] Analytics-Kanban: Set up bucketization of editCount fields {tick} - https://phabricator.wikimedia.org/T108856#1620180 (jcrespo) p:High>Triage [10:47:52] Analytics-Kanban: Set up auto-purging after 90 days {tick} - https://phabricator.wikimedia.org/T108850#1620196 (jcrespo) p:High>Triage [11:18:27] joal: I've added changes based on Alex's guidance: https://gerrit.wikimedia.org/r/#/c/231574/3..4/hieradata/role/common/analytics/cassandra.yaml [11:18:36] mind taking a look? [11:18:47] I said I'd get your blessing before we bothered him to look at it [11:50:35] hey milimetric [11:51:06] I'm gonna look at that, but maybe I'll need some help ! [11:52:28] joal: batcave? [11:52:33] sure ! [11:52:47] omw [12:13:30] (PS6) Bmansurov: Add filters above timeseries graphs in the compare layout [analytics/dashiki] - https://gerrit.wikimedia.org/r/231424 (https://phabricator.wikimedia.org/T104261) (owner: Milimetric) [12:14:13] (CR) Bmansurov: "Yes, I followed the readme and everything worked out fine." [analytics/dashiki] - https://gerrit.wikimedia.org/r/231424 (https://phabricator.wikimedia.org/T104261) (owner: Milimetric) [12:14:56] milimetric: I think roughly I put it in the wrong place [12:15:11] milimetric: Looks like I don't have the permissions to put stuff in limn-public-data [12:15:40] (PS7) Bmansurov: Add filters above timeseries graphs in the compare layout [analytics/dashiki] - https://gerrit.wikimedia.org/r/231424 (https://phabricator.wikimedia.org/T104261) (owner: Milimetric) [12:29:26] marktraceur: sux, do you have sudo -su stats access? [12:58:02] ottomata: Morniiing :) [12:58:20] morning! [12:58:37] Quick question: an1015-1016-1017 are already available, or do you need to decommision them from hadoop ? [12:58:47] an15 is avail, i need to decom the others [12:58:51] when do you need them? [12:59:08] Not known yet, but wanted to be sure for us not break anything :) [12:59:36] k [12:59:49] milimetric has patched the puppet code base on alex comments, so hoppefully it could be reasonnably fast [13:09:39] ok, it might be good to start decoming them soon [13:09:42] i will try to start that today [13:09:47] best to do one at a time i thikn [13:09:50] but we can do more if we need to [13:14:41] no need to rush ottomata, we were mostly wondering if those specific servers will be available for this [13:14:46] and if you wanted to rename them [13:17:13] milimetric: they will need to be reinstalled, so ja we can rename [13:17:16] and probably should [13:17:22] what are the restbase casses named? [13:17:39] milimetric: it takes a day or two for a hadoop node to be properly decommed [13:17:42] so it might be good to start now [13:21:48] milimetric: Not sure, I don't think so [13:24:49] marktraceur: what server are you doing this on? stat1002 or 1003? [13:25:15] ottomata: restbase casses? 
[13:25:44] cassandra servers [13:33:09] milimetric: 1003 [13:35:29] ottomata: restbase100[1-9] https://github.com/wikimedia/operations-puppet/blob/c89185d3f713a262906be5a60c6be091d318db10/hieradata/role/common/restbase.yaml [13:35:42] oh right its colocated [13:36:02] marktraceur: try ssh-ing into there and do "sudo -su stats" [13:36:13] if it asks you for a password, don't try to type it in or someone will yell at you :) [13:36:17] that just means you don't have access [13:36:21] hm, should we rename them using "api" term as suggested by alex ? [13:36:28] brb [13:36:41] something like analytics-api[1-3] ? [13:36:50] milimetric: i doubt he has sudo to stats access [13:37:21] ottomata, milimetric ---^ [13:37:22] ? [13:38:55] milimetric: Oh, oops, I tried to type it in. [13:39:00] * marktraceur braces himself [13:39:14] I guess it's ottomata who will yell. [13:39:25] Sorry ottomata [13:39:30] haha i aint a yeller [13:39:52] marktraceur: https://xkcd.com/838/ [13:46:06] Analytics-Kanban: Change the agent_type UDF to have three possible outputs: spider, bot, user {hawk} [13 pts] - https://phabricator.wikimedia.org/T108598#1620628 (JAllemandou) Thanks to @milimetric I corrected a typo in my usage of Bob's regexp --> It covers more than in the previous report. The numbers in th... [13:46:37] marktraceur: ah, so you have two choices, you either ask for access from Ops-Access-Requests, you need the group statistics-users [13:46:41] (https://wikitech.wikimedia.org/wiki/Analytics/Data_access) [13:46:41] milimetric: --^ with new numbers [13:47:09] thx joal, remind me if I haven't looked until after standup, I'm getting piled on :) [13:47:22] np milimetric, good luck depiling :) [13:47:40] ottomata: have you see my comment on names? [13:48:13] joal / ottomata: on general naming conventions, alex was saying we shouldn't have anything related to our team or software that's used currently, but I think analytics-api works (analytics as in purpose not team) [13:48:18] yes, not a fan of hyphens if we can avoid them but maybe! [13:49:30] hm, ottomata, can't think of no hyphen case here: cassandra is too braos name, restbase is already used an too broad as well ;;; [13:49:56] But analytics-restbase works as well [13:49:56] yeah, but node names are one case where i'm ok with abbreviating and concatenating :) [13:50:04] Right :) [13:50:12] anapi ? [13:51:02] apilytics ? [13:51:28] Trying to avoid the sooooo many puns you can do playing that game [13:52:21] (CR) Mforns: [C: 2 V: 2] "LGTM!" [analytics/wikihadoop] - https://gerrit.wikimedia.org/r/233937 (https://phabricator.wikimedia.org/T108684) (owner: Joal) [13:52:55] Thanks mforns ! [13:53:13] hey, np! [13:53:57] (CR) OliverKeyes: [C: 2 V: 2] [WIP] Make pageview definition aware of preview parameter [analytics/refinery/source] - https://gerrit.wikimedia.org/r/236800 (owner: Nuria) [13:59:02] (CR) Nuria: "ay ay , note that change was still WIP, it was not tested on the cluster yet" [analytics/refinery/source] - https://gerrit.wikimedia.org/r/236800 (owner: Nuria) [13:59:39] Ironholds: ay....the changes for pageviews had not been tested on cluster yet, that is why they said WIP [13:59:47] ..whoops [13:59:52] how does one un-do a merge? ;p [13:59:56] Ironholds: so they were not ready to merge [14:00:19] well, this is awkward. [14:00:19] Ironholds: let me test, cause they are not deployed yet [14:00:24] *thumbs up* [14:01:41] milimetric: ...or? 
[14:02:34] marktraceur: lol, omg, sorry, or you can use the report-updater script runner I was telling you about [14:02:47] for that you have to convert your SQL to timeboxed, but it shouldn't be too bad [14:02:59] * marktraceur isn't so sure but is willing to try [14:03:00] usually it just means adding a date range and taking in the parameters that it fills in [14:03:06] Besides it should be more better in general [14:03:12] marktraceur: which way? access request or sql? [14:03:22] SQL sounds better [14:03:35] As much as I like getting access to random things [14:03:35] * Ironholds sighs [14:03:40] why did we ever rewrite this project [14:04:58] marktraceur: here's a simple example then, of a script that runs with reportupdater: https://gerrit.wikimedia.org/r/#/c/227911/3/mobile/mobile-options.sql [14:05:32] marktraceur: so the easiest way is for me to make the repo and all the stuff you need [14:05:40] wait, do you already have a "limn-multimedia-data" repo? [14:05:58] (the limn-*-data is load bearing - it's a convention we've hard-coded into puppet and makes everything way easier) [14:06:50] milimetric: We have some limn things happening, but I don't know if it has anything in it [14:07:01] I mean if there's a limn-multimedia-data thing [14:07:17] milimetric: I think we have a lot of scripts to pass data around on stat1003 and not much else [14:07:25] We kinda wrote our own scripts for it, I think. [14:07:47] yep, i remember that. [14:07:52] ok, lemme look around for a sec [14:09:06] ok marktraceur looks like there's no repo specifically named "limn-multimedia-data". So I'll make it, and add basic structure, you put SQL into the multimedia/ folder, I'll point you to some examples, and we can go from there [14:10:08] marktraceur: I see, there's a limn-multimedia-data on github. That should be in gerrit, so I've made it in gerrit and we can migrate as you want [14:15:03] Cool, thanks. [14:21:23] (PS1) Milimetric: Add basic structure [analytics/limn-multimedia-data] - https://gerrit.wikimedia.org/r/237098 [14:21:52] marktraceur: https://gerrit.wikimedia.org/r/#/c/237098/ congratulations, it's a baby patch! [14:22:03] I'll comment with some examples there [14:22:18] AOK. [14:22:48] (CR) Milimetric: "Some other repos that use reportupdater for reference:" [analytics/limn-multimedia-data] - https://gerrit.wikimedia.org/r/237098 (owner: Milimetric) [14:25:41] (PS8) Milimetric: Add filters above timeseries graphs in the compare layout [analytics/dashiki] - https://gerrit.wikimedia.org/r/231424 (https://phabricator.wikimedia.org/T104261) [14:26:05] (PS9) Milimetric: Add filters above timeseries graphs in the compare layout [analytics/dashiki] - https://gerrit.wikimedia.org/r/231424 (https://phabricator.wikimedia.org/T104261) [14:26:31] (CR) Milimetric: [C: 2 V: 2] "Baha, this was awesome, thanks very much. I just updated the style of the checkboxes a little from what I originally put in there." [analytics/dashiki] - https://gerrit.wikimedia.org/r/231424 (https://phabricator.wikimedia.org/T104261) (owner: Milimetric) [14:26:37] Ironholds: http://spark.apache.org/releases/spark-release-1-5-0.html [14:28:41] joal, yay, the big day is finally here! 
[14:30:46] (PS1) Milimetric: Fix optimizer config error for compare [analytics/dashiki] - https://gerrit.wikimedia.org/r/237102 [14:31:04] (CR) Milimetric: [C: 2 V: 2] Fix optimizer config error for compare [analytics/dashiki] - https://gerrit.wikimedia.org/r/237102 (owner: Milimetric) [14:32:56] bmansurov: https://edit-analysis.wmflabs.org/compare/ [14:33:02] (I deployed your changes) [14:33:07] let Neil know [15:03:38] ottomata, yt? [15:03:56] Analytics-Backlog: Give /aggregate-datasets/ on stat1002 open permissions - https://phabricator.wikimedia.org/T111956#1620813 (Ironholds) NEW [15:04:37] milimetric, looks great! [15:30:46] Analytics-Kanban: enforce group-writeable in stat1002:/a/aggregate-datasets/ and stat1003:/a/public-datasets/ - https://phabricator.wikimedia.org/T111956#1620927 (Ironholds) a:Ottomata [15:33:40] (PS1) Milimetric: Version the bundles same as the main scripts [analytics/dashiki] - https://gerrit.wikimedia.org/r/237108 [15:34:04] nuria: I added you to that ^ [15:34:23] k, will look after standup [15:34:45] Analytics-Backlog: Set up bucketization of editCount fields {tick} - https://phabricator.wikimedia.org/T108856#1620957 (ggellerman) [15:35:10] Analytics-EventLogging, Analytics-Kanban, Patch-For-Review: Make EventLogging monitoring and alerts based on Kafka metrics {stag} [8 pts] - https://phabricator.wikimedia.org/T106254#1620958 (Ottomata) [15:35:41] Analytics-Backlog: Set up auto-purging after 90 days {tick} - https://phabricator.wikimedia.org/T108850#1620963 (ggellerman) [15:36:27] Analytics-Backlog: Delete obsolete schemas {tick} - https://phabricator.wikimedia.org/T108857#1620968 (ggellerman) [15:49:34] halfak: BTW, re. https://meta.wikimedia.org/w/index.php?title=Meta:Requests_for_comment/Enable_flow_in_the_Research_talk_(203)_namespace&diff=13537632&oldid=13535483 note that Flow isn't on TWN – they're probably talking about LQT: https://translatewiki.net/wiki/Special:Version [15:50:09] mforns: https://plus.google.com/hangouts/_/wikimedia.org/am [15:57:22] Interesting. Thanks James_F [15:57:47] halfak: Getting TWN to use Flow and convert their legacy LQT installation is indeed a longer-term objective. :-) [15:57:53] James_F, would you like to note that in the discussion or should I? [15:58:17] halfak: I'm intentionally staying out of the conversation so far. [15:58:26] (Sorry.) [15:58:28] kk. Will post then. Thanks for the info. [15:58:43] No not at all. Your best judgment and all that :) [16:01:41] Analytics-Kanban: Change the agent_type UDF to have three possible outputs: spider, bot, user {hawk} [13 pts] - https://phabricator.wikimedia.org/T108598#1621095 (Milimetric) As for the WikiBot convention. Joseph was just saying that MediawikiBot is used by Bing in the user agent. So we have to be more care... [16:09:47] (CR) MarkTraceur: [C: 2] "Merging so I have something to build on" [analytics/limn-multimedia-data] - https://gerrit.wikimedia.org/r/237098 (owner: Milimetric) [16:11:20] (CR) Nuria: [C: 2 V: 2] "Tested locally, looks great." [analytics/dashiki] - https://gerrit.wikimedia.org/r/237108 (owner: Milimetric) [16:12:35] milimetric: I guess Jenkins won't merge for me. :( [16:12:42] (CR) MarkTraceur: [V: 2] "RIGHT." 
[analytics/limn-multimedia-data] - https://gerrit.wikimedia.org/r/237098 (owner: Milimetric) [16:13:21] Analytics-Backlog, Analytics-Dashiki, Editing-Analysis, VisualEditor, Patch-For-Review: Improve the edit analysis dashboard {lion} - https://phabricator.wikimedia.org/T104261#1621141 (bmansurov) [16:14:36] milimetric: Does reportupdater do different time ranges? I want per-month, not per-day [16:16:28] marktraceur: yes, monthly is fine, I just saw daily in some of your files yesterday [16:17:38] Some of the old ones maybe [16:17:46] But those are event tracking, so it makes sense [16:17:55] Uploads and unique uploaders per day is less useful IMO [16:20:45] Analytics-Kanban: Change the agent_type UDF to have three possible outputs: spider, bot, user {hawk} [13 pts] - https://phabricator.wikimedia.org/T108598#1621169 (JAllemandou) No bot match the '.*WikimediaBot.*' for the hour of analysis while the list of bots containing WikiBot is not: DotNetWikiBot/2.101 (M... [16:24:34] ottomata: wanna continue the deployment? [16:24:51] Analytics, Engineering-Community, MediaWiki-API, Research consulting, and 3 others: Metrics about the use of the Wikimedia web APIs - https://phabricator.wikimedia.org/T102079#1621206 (SVentura) @Qgil there is no disagreement here. "I'd rather focus on obtaining useful metrics of our web APIs" Co... [16:26:58] madhuvishy: ya first i gotta merge that metrics chnage, just made some lucnh [16:27:33] ottomata: okay cool, ping me. [16:28:38] (PS1) MarkTraceur: Flesh out SQL, tweak configuration [analytics/limn-multimedia-data] - https://gerrit.wikimedia.org/r/237134 [16:28:42] milimetric: ^^ :) [16:29:49] marktraceur: I'll check it out after lunch [16:29:55] Excellent plan. [16:30:02] Both lunch, and looking at the patch. [16:32:35] joal: if (at your convenience) you could look at the patch that ironholds merged just in case something might jump to your eye... i need a few minutes to troubleshoot my internet connection and will be testing on cluster shortly [16:35:33] joal: https://gerrit.wikimedia.org/r/#/c/236800/ [16:40:48] milimetric: If I wanted to add a metric to that collection that I didn't think I could use sql for, would I need to bend over backwards to do it now? [16:41:07] nuria: after interview :) [16:41:16] joal: whenever, of course. [16:50:44] Analytics, Engineering-Community, MediaWiki-API, Research consulting, and 3 others: Metrics about the use of the Wikimedia web APIs - https://phabricator.wikimedia.org/T102079#1621294 (Qgil) [16:54:00] Analytics, Engineering-Community, MediaWiki-API, Research consulting, and 3 others: Metrics about the use of the Wikimedia web APIs - https://phabricator.wikimedia.org/T102079#1621297 (Qgil) I have added a "Metrics requested" section in the description and I have added the metric that the upcoming... [17:01:43] Analytics, Engineering-Community, MediaWiki-API, Research consulting, and 3 others: Metrics about the use of the Wikimedia web APIs - https://phabricator.wikimedia.org/T102079#1621334 (SVentura) @qgil, you're right, different groups will have different KPI lenses - we're meeting this afternoon to... [17:03:04] joal, do you want to debrief? [17:03:10] hehe mforns [17:03:15] was writing the same :) [17:03:18] xD [17:03:20] batcave [17:03:24] omw [17:04:59] marktraceur: we were going to add arbitrary script running for another project, but we don't have that yet. You can talk to mforns, he said he was going to do it in his volunteer time. 
I'd be obliged to you if you gave him a reason to do it during working hours :) [17:07:49] Got it. [17:08:08] milimetric: I ask because I have a metric, "illustrated pages", which I'm probably going to need to use bloody mwgrep for [17:08:19] Or something. [17:08:22] Maybe I could use imagelinks [17:11:33] ottomata: heya, can you come to batcave for a minute ? [17:11:44] we're with mforns discussing Julio's interview [17:11:51] (CR) Milimetric: Flesh out SQL, tweak configuration (5 comments) [analytics/limn-multimedia-data] - https://gerrit.wikimedia.org/r/237134 (owner: MarkTraceur) [17:12:21] milimetric, marktraceur, I've already started this, will take one week or two I thjink [17:12:39] joal in meeting with d'ana [17:12:40] and kevin [17:12:52] ok ottomata, let's discuss later [17:12:55] thx :) [17:14:04] I'm actually starting to think I could use imagelinks. [17:14:12] But I also approve of your efforts, mforns [17:14:21] Analytics-EventLogging, Patch-For-Review: Kafka Client for MediaWiki - https://phabricator.wikimedia.org/T106256#1621406 (csteipp) [17:15:26] marktraceur, cool, let me know if you decide to go with scripts, thanks! [17:15:45] Analytics-Tech-community-metrics, ECT-September-2015: Provide open changeset snapshot data on Sep 22 and Sep 24 (for Gerrit Cleanup Day) - https://phabricator.wikimedia.org/T110947#1621412 (Qgil) This is one case where, for once, we are happy with a simple report and we don't need changes in the dashboard... [17:22:46] Analytics-Tech-community-metrics, ECT-September-2015: Automated generation of (Git) repositories for Korma - https://phabricator.wikimedia.org/T110678#1621439 (Qgil) As I see it, our interest in Gerrit data is mostly about the present: how is the queue doing? are we progressing? The interest in Git data... [17:56:41] Analytics-Kanban, Patch-For-Review: Write scripts to track cycle time of tasked tickets and velocity [8 pts] - https://phabricator.wikimedia.org/T108209#1621599 (kevinator) Open>Resolved [18:06:31] Analytics-Kanban, Patch-For-Review: Write scripts to track cycle time of tasked tickets and velocity [8 pts] - https://phabricator.wikimedia.org/T108209#1621639 (ksmith) This could be really useful to other teams, so please share the results publicly when you are ready. [18:27:29] Question for the group...if you wanted historical data on whether a page had an image (or rather, how many pages on a wiki had images), how would you do that? [18:27:50] If I wanted to start tracking that now, I can use page_image in page_props, but historical data isn't a thing in page_props [18:28:09] I could run through every revision, parse it, parse the old revision, and count image transclusions, but ew. [18:29:13] marktraceur: I don't have a better idea :( [18:32:15] I'm not sure there is one...I might be stuck choosing between a long-running script and not having historical data [18:34:09] marktraceur: Ask halfak, he might have a data set that would better to work with (diffs instead of text) [18:35:11] marktraceur, I can show you how to write a 20 line python script that will generate your answer overnight on stat3. [18:35:36] halfak: how awesome :) [18:38:35] marktraceur, how OK is it to only consider [[File|Image:...]]? [18:39:10] halfak: Roughly OK, but there's a good number of pages that only have an image in the infobox, and that's OK [18:39:16] Like, that should count [18:39:30] OK. We might need to be clever for that, but not that clever. [18:39:44] halfak: Overnight for all of the numbers historically? 
[18:39:50] Yeah [18:39:56] * halfak flexes muscles [18:40:13] Damn :) [18:41:01] I guess I should write up a sql query for daily stats on page_props, then. [18:41:38] * halfak works on gist [18:41:57] mforns_gym: madhuvishy, deployed eventlogging alert change, looks good [18:43:52] Actually, I guess just running the script on revisions starting from the last one would be fine... [18:44:02] halfak: Thanks for the help :) [18:44:09] Sure. No problem. :) [18:44:22] halfak: Will the data get put in a database on stat3, then? [18:44:32] Or just output as TSV? [18:44:41] You'll be running the script. You can output however you like. [18:44:45] Oh, rockin' [18:44:49] I suggest TSV and load that into MySQL if necessary [18:44:55] Here's where I try to confirm that I have access to stat3 [18:45:38] Weren't you just asking about the MySQL password? [18:45:48] For stat1003 [18:46:01] Sorry, stat3 == stat1003 [18:46:04] Ah. [18:46:07] I'm just careless with my typing :) [18:46:24] I type the damn name so often! [18:46:57] What chars are valid in a filename? [18:47:00] Well, in that case, I might like to have write access to mysql so I can dump changes (i.e. "new image" or "removed image" on any revision would generate a row) [18:47:14] halfak: Hmm, should be anything but slash, colon, and # [18:47:24] Maybe a few others are illegal [18:47:32] Bar, I suppose [18:47:34] Maybe {{}} [18:47:36] Dunno if [] and {} are illegal [18:47:41] I guess they'd have to be [18:47:44] That'd be weird [18:48:02] Dunno if you've noticed, mediawiki is pretty weird [18:49:07] :D [18:49:56] mark, do you want a count of images? [18:50:04] Do you want to know when new images are added? [18:50:22] I'm thinking that changes per-revision is the best way to do it [18:50:40] So, a count per revision or only output those revisions that have a change in the set of images? [18:51:24] Like, revision 129 | +2 | 20040918101010 || revision 212 | -3 | 20040921123821 [18:51:56] At least...hm [18:51:57] Sure. [18:52:09] Do you care if an image was replaced? [18:52:16] Now I have to think about this, actually. I ultimately need a count, so changes could be aggregated but maybe slowly [18:52:24] That would be a nil change, I think [18:52:31] So no change for my purposes [18:52:44] Sure. We can always take multiple passes too if you want something else. [18:53:03] I guess I can do sum(img_delta) where timestamp < x and timestamp > y [18:54:31] A few more than 20 lines. You're going to need to debug my regexes. [18:54:56] I'm cool with that. [18:59:13] 58 lines https://gist.github.com/halfak/c7a6bb267fcefb3aa14c [18:59:34] But it comes complete with docopt and good script structure. [19:00:30] See also http://pythonhosted.org/mwxml/ [19:00:34] marktraceur, ^ [19:00:46] (Python 3 is required) [19:01:01] If you haven't done the whole virtualenv dance, I can show you how to do that too. [19:02:06] I already see a couple of typos. [19:02:22] I hope they're as obvious to you :) [19:05:00] Thanks halfak [19:06:23] 2 spaces!?!?!? [19:06:25] :P [19:06:50] marktraceur, it was the default in the gist editor :( [19:06:58] I usually go for 4 spaces [19:07:13] Yes, yes, no problem [19:15:57] (CR) Joal: "Didn't reviewed the tests, quite some comments to discuss." (7 comments) [analytics/refinery/source] - https://gerrit.wikimedia.org/r/236800 (owner: Nuria) [19:16:08] halfak: Looks mostly fine to me... [19:16:19] One issue in the regexes, and I added some extensions [19:16:32] Hey nuria, reviewed the commited code --^ [19:16:34] Yeah. I kinda rushed those. 
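(Editor's note: halfak's 58-line gist is linked but not reproduced in this log. For orientation, here is a rough, untested sketch of the kind of script being described: walk the XML dump with mwxml, regex-match [[File:...]] and [[Image:...]] links in each revision, and print a TSV row whenever a page's image count changes. The names extract_image_links and image_link_deltas.py come from the conversation; the regex, the bz2 handling, and the sequential file loop are this note's assumptions, not the gist's code.)

    """Rough sketch only, not halfak's gist. Walks XML dump files with mwxml,
    regex-matches [[File:...]] / [[Image:...]] links per revision, and prints a
    TSV row whenever a page's image count changes."""
    import bz2
    import re
    import sys

    import mwxml  # pip install mwxml (Python 3), as linked above


    # Deliberately naive, as discussed: misses images added via templates/infoboxes.
    IMAGE_LINK_RE = re.compile(r"\[\[\s*(?:File|Image)\s*:\s*([^|\]]+)", re.IGNORECASE)


    def extract_image_links(text):
        """Return the set of image names linked from a chunk of wikitext."""
        if not text:  # revision text can be None (e.g. deleted/suppressed)
            return set()
        return {name.strip() for name in IMAGE_LINK_RE.findall(text)}


    def process_dump(dump):
        """Yield (page_id, rev_id, timestamp, delta) whenever the image count changes."""
        for page in dump:
            previous = set()
            for revision in page:
                current = extract_image_links(revision.text)
                delta = len(current) - len(previous)
                if delta != 0:
                    yield page.id, revision.id, revision.timestamp, delta
                previous = current


    def main():
        # Sequential for simplicity; the real script parallelises across dump files,
        # which is why it can use every CPU on stat1003.
        for path in sys.argv[1:]:
            opener = bz2.open if path.endswith(".bz2") else open
            with opener(path, "rb") as f:
                dump = mwxml.Dump.from_file(f)
                for page_id, rev_id, timestamp, delta in process_dump(dump):
                    print("{0}\t{1}\t{2}\t{3}".format(page_id, rev_id, timestamp, delta))


    if __name__ == "__main__":
        main()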
[19:16:40] Glad it seems reasonable to you. [19:16:57] Let's discuss the comments tomorrow (end of day for me) [19:17:09] BTW, you might also consider using mwparserfromhell to parse links and template params *BUT* that's substantially slower than the regex scan. [19:17:17] halfak: Does stat1003 have the python libraries or do I need to do "the whole virtualenv dance" as well? [19:17:29] You need to do the virtualenv dance. [19:17:39] But this is happy. It will give you flexibility. [19:17:42] (I'm going to assume that false positives are relatively rare and risk it, not doing the parser) [19:17:43] And it's easy with a little help. [19:17:51] * marktraceur is prepared. [19:18:00] marktraceur, I'd suggest writing up some test cases for those regexes. [19:18:38] Fair enough [19:18:43] https://gist.github.com/halfak/9f4830895496af9e9731 [19:18:46] ^ How to virtualenv [19:19:00] Those commands will set you up like I manage my virtualenvs [19:19:15] commands should all work as expected on stat1003 [19:22:05] Coolio. [19:25:56] halfak: I believe "pip install mwxml" is not a thing I should/can do on the cluster [19:26:16] Oh! You have to set up a proxy for http [19:26:19] * halfak gets the docs [19:26:37] https://wikitech.wikimedia.org/wiki/Http_proxy [19:26:56] Note the https proxy points to an http URL and that is OK [19:27:00] (apparently) [19:27:14] Geez. [19:27:50] Hooray. [19:27:56] halfak: didn't know about python virtual envs ! [19:28:03] halfak: thx for teaching :) [19:28:09] :D! [19:28:52] halfak: How would you suggest testing the script? I can come up with examples, just need to know the format I guess [19:29:01] Maybe just passing straight text in... [19:29:51] I'd write a separate script that imports extract_image_links and runs a few chunks of text against it and compares the output. [19:30:01] a-team, I'm off for today ! [19:30:09] o/ joal [19:30:19] nuria: ping me tomorrow whenever you want to dicuss the CR [19:30:34] milimetric: I'll try to contact alex tomorrow about machine choice and puppet [19:31:09] See you tomorrow [19:31:30] ok joal, I'll try to address marco's comments by then [19:31:38] have a nice night [19:31:53] oh, I have seen them :( [19:31:59] I'll have a look tomorrow [19:32:18] halfak: OK, it looks like it's working fine to me...final piece of the puzzle, where is the dump? [19:35:56] marktraceur, lates enwiki is stat1003:/mnt/data/xmldatadumps/public/enwiki/20150901/ [19:36:10] Fun times [19:36:29] BTW, when you kick that script off, it's going to use all of the CPUs on stat1003 [19:36:33] So you should NICE it. [19:36:53] NICE? [19:37:57] "man nice" [19:38:12] TL;DR: it lowers the priority of your processes so that other processes can take priority. [19:38:26] This allows you to use all of the computation resources without degrading performance for others. [19:38:31] Neat. [19:38:33] Er, nice. [19:38:43] Basically "nice " [19:38:51] And it all just works. [19:38:53] I'll probably do something as generous as nice -n 19 because I don't have any particular timeframe [19:39:10] So "nice python image_link_deltas.py /mnt/data/..." [19:39:14] No! [19:39:22] Lower numbers are higher priority [19:39:37] Right, so 19 is lowest priority [19:39:43] Wait... hmmm [19:40:02] Oh! It adds that onto the default priority. [19:40:15] Default is 20, so -20 would make the priority 0 [19:40:24] I dunno why they don't just have you set the priority directly. [19:40:50] Anyway, -n19 would be appreciated. [19:40:57] Awesome. 
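(Editor's note: a sketch of the separate test script halfak suggests above, which imports extract_image_links and runs a few hand-written wikitext snippets through it. The module name and the expected sets below match the naive regex in the sketch earlier in this note, not necessarily the real gist's behaviour.)

    # Hypothetical module/function names, taken from the sketch above rather than the gist.
    from image_link_deltas import extract_image_links

    CASES = [
        ("[[File:Example.jpg|thumb|caption]]", {"Example.jpg"}),
        ("[[Image:Old style.png]] and [[File:New style.svg]]",
         {"Old style.png", "New style.svg"}),
        ("No images here, just [[a wikilink]].", set()),
        ("[[:File:Linked, not transcluded.jpg]]", set()),  # arguable edge case
    ]


    def main():
        failures = 0
        for text, expected in CASES:
            actual = extract_image_links(text)
            if actual != expected:
                failures += 1
                print("FAIL {0!r}: expected {1}, got {2}".format(text, expected, actual))
        print("{0} of {1} cases failed".format(failures, len(CASES)))


    if __name__ == "__main__":
        main()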
[19:41:12] And these dump files, the latest non-tmp one has all the revisions? [19:41:39] I'll give you a string. One sec. [19:42:34] You need to process 177 files. This GLOB will get them all /mnt/data/xmldatadumps/public/enwiki/20150901/enwiki-20150901-pages-meta-history*.xml*.bz2 [19:42:54] You can pass that GLOB to the script as the ... arg and it will work as expected. [19:43:04] Ah, K. [19:43:14] so "nice python extract_image_deltas.py /mnt/data/xmldatadumps/public/enwiki/20150901/enwiki-20150901-pages-meta-history*.xml*.bz2 > my_output.tsv" [19:44:41] I guess you forgot to put in a call to main() [19:44:44] * marktraceur does [19:45:02] Yea. [19:45:16] if __name__ == "__main__": main() [19:45:25] That way, you can import it and it won't try to call the main() function [19:46:05] Yup [19:46:18] And...trying to get the len() of a generator [19:47:07] OK, running now [19:47:10] ha [19:47:27] I can see it churning :) [19:47:54] using 0.1% of CPU and 100% of CPU at the same time :) [19:47:59] Fun. [19:48:15] > Avail 48G [19:48:17] Should be fine. [19:48:22] :P [19:58:48] ottomata: around? I am around if you want to continue deploying [19:58:56] hey! [19:59:13] yes, have been sidetracked all day by meetings, and have been slowly finishing up the eventlogging monitoring ticket [19:59:14] just merged this [19:59:17] https://gerrit.wikimedia.org/r/#/c/237188/ [19:59:48] ottomata: ah cool! [20:00:19] so, yes!, atlhough, i don't have much time to work more tonight [20:00:22] got about an hour left. [20:00:22] hm. [20:01:16] ottomata: we can switch a few more pieces may be? [20:01:54] also, we should write down the rest of the deployment plan? [20:01:59] yeah, lets try to do at least one. i think the next peace is the client side processor [20:02:04] ok [20:02:10] cool, batcave? [20:02:14] https://etherpad.wikimedia.org/p/eventlogging_stag [20:02:17] madhuvishy: lets IRC [20:02:22] alright [20:03:47] madhuvishy: do you think it is better to run 2 client side process [20:03:53] one still consuming from zmq, and the other from kafka [20:03:55] or 1 process [20:03:58] with an extra output [20:03:59] so [20:04:04] hm, will write in ether pad [20:05:09] hmmm, ottomata where do the balanced processors come in? [20:05:46] madhuvishy: anything consuming from kafka will use balanced processor [20:05:51] i'm not going to start more than one of them yet [20:05:55] we'll do that after we turn of zmq [20:06:00] ottomata: right, okay makes sense [20:06:12] so, see etherpad, i'm not sure if it really matters which of those we do. [20:06:29] 2 might be simpler, because the existing one then doesn't change at all [20:06:43] and then to turn zmq off we just remove it [20:07:06] also, that is how things are right now, except the kafka client side processor is running on an10 [20:07:13] this woudl just be moving it to eventlog1001 [20:07:16] hmmm [20:07:22] yeah i guess that is better [20:07:23] ja? [20:07:42] hmmm, thinking [20:09:04] ottomata: will we do the consumers right away? [20:09:34] if not we'd be writing to kafka for a while and not consume to mysql no? if we go with the second option and not process from the zmq stream [20:10:04] the second option has a zmq stream [20:10:07] its the 3rd output [20:10:50] ottomata: aah right, okay that makes sense, then it would be easy to turn of the zmq forwarder [20:10:55] lets go with that [20:11:05] the 2nd option? [20:11:29] yup [20:11:31] ok [20:14:32] ottomata: what does the multiplexer do now? 
[20:15:29] it consumes from the processed client and server side zmq streams and joins them into the all-events stream on :8600 [20:16:15] ottomata: ah okay [20:20:56] Analytics-EventLogging, Analytics-Kanban: Send raw client side events to Kafka using varnishkafka instead of varnishncsa {stag} - https://phabricator.wikimedia.org/T106255#1622331 (Ottomata) a:Ottomata [20:21:42] Analytics-EventLogging, Analytics-Kanban, Patch-For-Review: Deploy EventLogging on Kafka to eventlog1001 (aka production!) {stag} [3 pts] - https://phabricator.wikimedia.org/T106260#1622336 (Ottomata) [20:22:32] Analytics-EventLogging, Analytics-Kanban, Patch-For-Review: Deploy EventLogging on Kafka to eventlog1001 (aka production!) {stag} [3 pts] - https://phabricator.wikimedia.org/T106260#1622345 (Ottomata) a:Ottomata [20:22:38] madhuvishy: https://gerrit.wikimedia.org/r/#/c/237249/ [20:25:17] ottomata: small typo in a comment - These will auto balance amongts themselves. s/amongts/amongst [20:25:22] otherwise looks good [20:26:27] fixed. [20:28:03] madhuvishy: ok, i'm going to stop puppet on prod hosts, and test in labs. [20:28:25] ottomata: alright [20:29:01] Analytics-EventLogging, MediaWiki-extensions-NavigationTiming, Performance-Team, operations: Increase maxUrlSize from 1000 to 1500 - https://phabricator.wikimedia.org/T112002#1622378 (Krinkle) NEW [20:29:41] Analytics-EventLogging, MediaWiki-extensions-NavigationTiming, Performance-Team, operations: Increase maxUrlSize from 1000 to 1500 - https://phabricator.wikimedia.org/T112002#1622394 (Krinkle) p:Triage>High [20:30:29] joal|night: sounds good, let's discusstomorrow. [20:30:29] Analytics-EventLogging, MediaWiki-extensions-NavigationTiming, Performance-Team, operations: Increase maxUrlSize from 1000 to 1500 - https://phabricator.wikimedia.org/T112002#1622378 (Krinkle) [20:32:03] oh madhuvishy i jsut realized i broke something sorta earlier [20:32:11] the raw server side events log is no longer being output [20:32:20] because I disable the raw server side zmq stream [20:32:32] do we need to output that? [20:32:41] now that events are buffered in kafka? [20:34:04] hmmm, so the server side zmq forwarder is off, but it's forwarding to kafka [20:34:19] yeah, i think it's fine [20:34:46] yes [20:34:53] (CR) Nuria: "Joseph: can you "unmerge" these changes or is that something we have to do via gerrit." (1 comment) [analytics/refinery/source] - https://gerrit.wikimedia.org/r/236800 (owner: Nuria) [20:35:05] the only reason we output raw files was to deal with backfilling, right? [20:35:06] but the server side processor is still writing to zmq [20:35:09] yes [20:35:46] ottomata: aah i don't know that, but because they are in kafka it feels pretty safe [20:36:28] aye [20:36:37] ok, looks good in labs [20:36:40] letting puppet run in rpod [20:38:37] hmm, madhuvishy, i lied, the config chagnes that i just said I made, i made [20:38:42] but i hadn't restarted eventlogging on eventlog1001 [20:38:46] so, i can safely turn the zmq one back on [20:38:51] before i restart [20:38:54] i think i will, just to be safe [20:38:59] ottomata: okay fine [20:42:22] PROBLEM - Check status of defined EventLogging jobs on eventlog1001 is CRITICAL: CRITICAL: Stopped EventLogging jobs: processor/client-side-0 [20:42:54] that's fine! gah, icinga i haven't started it yet, geeez! [20:43:00] sorry for the pings [20:43:27] :D [20:43:52] ok, restarting el! [20:44:32] RECOVERY - Check status of defined EventLogging jobs on eventlog1001 is OK: OK: All defined EventLogging jobs are runnning. 
[20:44:42] :) [20:45:36] madhuvishy: looking good! [20:45:43] awesome [20:46:20] want to do the consumers now? [20:46:45] hmm, Uhh [20:47:13] if it's late, we can do tomorrow [20:47:18] yeah, kinda late. hm. [20:47:32] madhuvishy: maybe files? [20:47:34] ottomata, sorry I wasn't there to +1 your patch, but you alredy knew right? [20:47:39] yup :) [20:47:40] thank you! [20:47:42] ottomata: okay [20:47:44] madhuvishy: lets do files. [20:47:46] that is easy [20:47:48] and hurts nobody :) [20:47:51] :) [20:48:35] ottomata: because we have both streams, won't we be writing everything twice to files? [20:48:51] no [20:48:56] we choose which one we want to consume from [20:49:13] oh okay, so effectively, we stop consuming from zmq [20:49:17] cool [20:50:39] yup [20:53:51] Analytics-Kanban, Patch-For-Review: Write scripts to track cycle time of tasked tickets and velocity [8 pts] - https://phabricator.wikimedia.org/T108209#1622474 (mforns) @ksmith Sure, maybe I can give a lightning talk about this. Thanks for the heads up! [20:53:52] Analytics-EventLogging, Analytics-Kanban, Patch-For-Review: Deploy EventLogging on Kafka to eventlog1001 (aka production!) {stag} [8 pts] - https://phabricator.wikimedia.org/T106260#1622475 (Ottomata) [20:54:33] madhuvishy: ^ [20:54:43] https://gerrit.wikimedia.org/r/#/c/237261/ [20:57:22] ottomata: cool [20:58:25] Analytics-Kanban, Patch-For-Review: Write scripts to track cycle time of tasked tickets and velocity [8 pts] - https://phabricator.wikimedia.org/T108209#1622487 (ggellerman) @ksmith: I've also let @Awjrichards know about the script as well [20:58:50] I wonder if the auto commit interval is defined repeatedly in many places, and if we want to change it, we have to change each consumer. but it also makes sense that we might want to only change it for the processor etc [21:00:11] madhuvishy: it is defined as a default [21:00:15] it is changeable via hiera [21:00:30] but ja, i used the same variable for both processors, and for all file consumers [21:00:41] we can refactor to be more flexible if we need to [21:00:47] ottomata: yeah okay [21:04:52] Analytics-Engineering, Community-Tech: [AOI] Add page view statistics to page information pages (action=info) - https://phabricator.wikimedia.org/T110147#1622527 (kaldari) @Milimetric: Is there a Phabricator task for that API? Just wanting to add the proper blocker. [21:08:54] Analytics-Engineering, Community-Tech: [AOI] Add page view statistics to page information pages (action=info) - https://phabricator.wikimedia.org/T110147#1622567 (Milimetric) You can use the one you have in the description, @kaldari: T44259 That's the main task that I promised I'd keep updated with progre... [21:10:37] Analytics-Engineering, Community-Tech: [AOI] Add page view statistics to page information pages (action=info) - https://phabricator.wikimedia.org/T110147#1622585 (kaldari) [21:10:48] madhuvishy: looking good! [21:10:50] Analytics-Engineering, Community-Tech: [AOI] Add page view statistics to page information pages (action=info) - https://phabricator.wikimedia.org/T110147#1569847 (kaldari) [21:10:55] ottomata: yay [21:11:09] all events at leastz [21:11:11] mh [21:11:19] serer side and client side... [21:11:19] hm [21:11:21] okay so almost there [21:12:18] hmmm [21:12:27] OH [21:12:31] i took off raw=True [21:12:32] fixing. [21:15:14] thaats better [21:15:15] cool [21:15:33] ok! [21:15:36] they are going [21:15:38] zuper cool. 
[21:15:52] madhuvishy: we shoudl let this run, and then tomorrow, we can compare archived files on stat1002 [21:16:00] and make sure things look right [21:16:06] # events, file sizes, etc. [21:16:16] ottomata: yeah sounds good [21:17:30] ok, cool, thanks madhuvishy time for me to run! [21:17:32] tty tomorrow [21:17:44] i'll be sorta around for another 30 mintes, gotta pack some stuff [21:17:50] so i'll check back and glance to make sure its ok [21:17:55] ottomata: byee! okay :) [21:18:05] bye ottomata see ya [21:31:44] (PS1) Nuria: Revert "[WIP] Make pageview definition aware of preview parameter" [analytics/refinery/source] - https://gerrit.wikimedia.org/r/237272 [21:32:44] milimetric, madhu or mforns could you check that this is right to revert the changes that Ironholds merged this morning, that way we do not have dirty code in the refinery master branch [21:32:48] https://gerrit.wikimedia.org/r/#/c/237272/ [21:32:56] sorry again :/ [21:33:08] yes nuria I chan check [21:33:10] Ironholds: np at all [21:33:12] *can [21:33:17] Ironholds: that is what git revert is for [21:33:55] nuria: not familiar with the patch, did you just use git revert? [21:34:19] milimetric: yes, patch was not done and labelled as "WIP" so it is not finished [21:34:31] milimetric: i just git revert [21:34:47] milimetric: and push to gerrit, which i think *sounds* right [21:35:01] nuria: sounds good to me [21:35:48] Analytics-EventLogging, MediaWiki-extensions-NavigationTiming, Performance-Team, operations: Increase maxUrlSize from 1000 to 1500 - https://phabricator.wikimedia.org/T112002#1622642 (BBlack) Varnish is the primary issue here. Raising shm_reclen is non-trivial, especially to much-larger values. W... [21:37:12] milimetric: i thought there might be a "revert" button on gerrit but ... i do not see [21:37:35] nuria: yeah, there's a "revert change" button on a merged change [21:37:39] nuria, milimetric, looks good to me, too. May I merge it? [21:37:54] milimetric: where is revert button cc mforns [21:37:56] ? [21:38:45] nuria, next to review button [21:39:27] yea, maybe better to revert via gerrit [21:39:29] mforns: ahahahahah [21:39:36] ok, that makes sense [21:39:37] :] [21:39:41] let's use the force then [21:40:19] Analytics-EventLogging, MediaWiki-extensions-NavigationTiming, Performance-Team, operations: Increase maxUrlSize from 1000 to 1500 - https://phabricator.wikimedia.org/T112002#1622651 (ori) >>! In T112002#1622642, @BBlack wrote: > I think we can work out how to raise it to 2048 safely pretty easily.... [21:40:20] (Abandoned) Nuria: Revert "[WIP] Make pageview definition aware of preview parameter" [analytics/refinery/source] - https://gerrit.wikimedia.org/r/237272 (owner: Nuria) [21:40:32] (PS1) Nuria: Revert "[WIP] Make pageview definition aware of preview parameter" [analytics/refinery/source] - https://gerrit.wikimedia.org/r/237273 [21:41:17] mforns, milimetric made a new change via gerrit: https://gerrit.wikimedia.org/r/#/c/237273/ [21:41:21] will push then [21:41:24] rightt? [21:41:36] nuria: sounds good [21:41:44] that way the commit is linked and all [21:41:53] (CR) Nuria: [C: 2 V: 2] Revert "[WIP] Make pageview definition aware of preview parameter" [analytics/refinery/source] - https://gerrit.wikimedia.org/r/237273 (owner: Nuria) [21:42:43] nuria, oh! so when you push that button, gerrit creates a new change for you? 
[21:42:54] mforns: ya, executes git revert i guess [21:43:05] cool :] [21:46:36] (PS1) Nuria: [WIP] Make pageview definition aware of preview parameter [analytics/refinery/source] - https://gerrit.wikimedia.org/r/237274 [21:47:17] Analytics-Kanban: Document work so far on Last access uniques and validating the numbers {bear} - https://phabricator.wikimedia.org/T112010#1622658 (madhuvishy) NEW a:madhuvishy [21:51:21] Analytics, Analytics-Kanban: Transform to XML-->JSON in sorted file format [8 pts] - https://phabricator.wikimedia.org/T108684#1622669 (kevinator) Open>Resolved [21:58:23] (CR) Nuria: "Please see: https://gerrit.wikimedia.org/r/#/c/237274/ for continuation of patch" [analytics/refinery/source] - https://gerrit.wikimedia.org/r/236800 (owner: Nuria) [22:17:24] Analytics, Engineering-Community, MediaWiki-API, Research consulting, and 3 others: Metrics about the use of the Wikimedia web APIs - https://phabricator.wikimedia.org/T102079#1622754 (Tgr) > Number of active users of Wikimedia web APIs hosted in Wikimedia Labs and third party servers (requested b... [22:29:20] Analytics, Engineering-Community, MediaWiki-API, Research consulting, and 3 others: Metrics about the use of the Wikimedia web APIs - https://phabricator.wikimedia.org/T102079#1622808 (Tgr) We discussed this recently in Reading Infrastructure; there are three approaches to get the data: # add user... [22:35:16] Hm.. is there any measuring of eventlogging errors? E.g. parse errors or other validation errors [22:36:06] Krinkle, there's the validation logs in the eventlogging machine [22:36:41] I'd like to add client-side logging for urls that are too long, so that I can e.g. graph 'eventlogging.errors.*.*' in Grafana to see if something goes wrong. Right now it's very risky for developers to break schemas that are too long and you'll never find out because the client swallows it (e.g. "it works on my machine", but breaks in prod for some users where there are different values in the sa [22:36:41] me properites) [22:37:06] mforns: Where does this logic live and which schema does it use (if any) or does it write to statsd direcrtlyt? [22:37:07] we're considering sending the eventlogging validation logs to logstash, would that be useful to you? [22:37:38] Logtash is useful to investigate problems, but not so to discover problems and trends. [22:37:52] Having a simple incrementer for invalidations would be useful [22:38:08] e.g. statsd/graphite eventlogging.error.. or something liek that [22:38:12] Krinkle, but if the logs are in logstash, they can be browsed in kibana, no? [22:38:59] yeah, but I don't think it's the kind of place one would look when youre not in charge of that area. E.g. I look in kibana frequently for resourceloader and memcached [22:39:03] but I don't maintain eventlogging itself [22:39:06] I wouldn't look there [22:39:13] things get deployed and they seem to work [22:39:35] Krinkle, graphite holds metrics by schema, but they are only a valid event count [22:39:37] but I can add a small graph to the performance grafana dashboard to become red if there are errors in any of our schemas [22:39:43] we can have both. we have the technology to count logstash events in graphite via statsd integration [22:39:47] not invalid, not broken down by error [22:39:58] Yeah, that's why I'm asking [22:40:11] aha [22:40:33] Yep, bd808, that's exactly what Im recommending. statsd for trends and historical measures. 
logstash for recent data in detail (e.g complete packets, not aggregated) [22:41:45] mforns: I could add code to WikimedaiEvents maybe that listens for 'eventlogging.error' and pushes a metric to statsv perhaps? [22:41:53] And update the server to do that for any errors as well [22:41:58] Krinkle: all invalid events will now be written to Eventlogging_eventerror schema [22:42:11] when the kafka switch is complete (hopefully tomorrow) [22:42:12] Where does the server code live? [22:42:22] then it should be fairly easy to monitor [22:42:31] Krinkle, bd808, I don't have any experience with logstash, but I believe that if the logs are pushed there, all teams can: 1) Debug there for new schemas 2) create patterns to count any event and redirect that into statsd and graphite no? [22:43:29] Krinkle: https://github.com/wikimedia/mediawiki-extensions-EventLogging/tree/master/server [22:44:45] I don't think we send to statsd after logstash, that would happen beforehand from the application that sends the log [22:44:52] mforns: sounds right. The trick for logstash is getting structured log events in so that useful analysis can be done. [22:45:19] madhuvishy: Ah, we use that in production? [22:45:23] we do now count mediawiki log events by type + channel into graphite from logstash [22:45:54] Krinkle: we are currently deploying all the kafka parts and tomorrow the existing zeromq system will be disabled [22:46:14] after that all the invalid events will go into EventError topic in Kafka [22:46:39] i'm sure we'll add a graph measuring counts of events that go into that topic on graphite [22:47:05] ottomata has been working on this majorly, i've been helping out [22:47:18] Krinkle: here's an example of how logstash can push to statsd -- https://github.com/wikimedia/operations-puppet/blob/production/manifests/role/logstash.pp#L137-L143 [22:47:27] bd808, cool thanks [22:47:32] I'd like something slightly more details than just a global count of EventError (which we could already get through eventlogging.schema.* count) but e.g. broken down by high-level error types and most importantly which schema is erroring. [22:47:49] that way the metric becomes useful for individual teams to subscribe to their schemas and their errors [22:48:07] Krinkle: yeah, i'm sure we can do that - need to have a consumer for the eventerror topic that will log metrics via statsd [22:48:19] Yeah [22:48:20] that's one way [22:48:32] So zmq going away feels scary [22:48:41] But I'm sure it's an easy transition? [22:48:50] at the moment both systems are up [22:49:23] statsv has been switched to use kafka I see. [22:49:25] https://github.com/wikimedia/operations-puppet/tree/production/modules/webperf/files [22:49:36] but navtiming, deprecate and ve still use zmw [22:49:37] zmq [22:50:22] so all the events are being written to kafka. there is a files consumer that is not consuming from the zmq stream anymore, we are switching out the mysql one tomorrow. after that we can safely disable the zmq forwarder and processors. [22:50:36] your stuff on hafnium will die though! [22:50:52] They're quite actively used so I'd like to minimise any disruption as we actively make changes and react to the data every day. [22:51:09] aah [22:51:11] okay [22:51:23] yes, what Krinkle said [22:52:01] As I understand ori and I have both kind of not kept track of all the infrastructure improvements. I know I like them, a lot. But not sure where to begin with the migration at the moment. 
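(Editor's note: a hedged sketch of the statsd-feeding EventError consumer discussed above; nothing like this existed at the time of this log. It assumes eventlogging.connect() can read the EventError stream from a URI passed on the command line, that each error event exposes the failing schema name roughly as event['event']['schema'], and that statsd listens on statsd.eqiad.wmnet:8125; all three are assumptions to verify against the real EventError schema and the puppet config.)

    import argparse
    import socket

    import eventlogging  # the EventLogging server library, as used on hafnium


    def main():
        parser = argparse.ArgumentParser(
            description="Count EventLogging validation errors per schema in statsd")
        parser.add_argument("input_uri",
                            help="URI of the EventError stream (e.g. a kafka:// URI)")
        parser.add_argument("--statsd", default="statsd.eqiad.wmnet:8125",
                            help="statsd host:port (assumed default)")
        args = parser.parse_args()

        host, port = args.statsd.split(":")
        addr = (host, int(port))
        sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)

        for event in eventlogging.connect(args.input_uri):
            # Hypothetical field access: adjust to the real EventError capsule layout.
            schema = event.get("event", {}).get("schema", "unknown")
            metric = "eventlogging.errors.{0}:1|c".format(schema)  # statsd counter
            sock.sendto(metric.encode("utf-8"), addr)


    if __name__ == "__main__":
        main()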
[22:52:32] Krinkle: i don't understand fully, could you explain what your dependency on zmq is? [22:52:41] https://github.com/wikimedia/operations-puppet/blob/production/modules/webperf/files/navtiming.py [22:52:55] The other files in that directory run on hafnium, streaming events to statsd live [22:53:19] where does it consume from? [22:53:26] ZMQ [22:53:33] we made a task for this on phab no? /me looks [22:53:46] Some of them use the eventlogging python library which is more abstracted [22:53:50] like this one https://github.com/wikimedia/operations-puppet/blob/production/modules/webperf/files/ve.py [22:54:04] Could that one be updated to use kafka so that the migration is transparent? [22:54:17] the API is quite easy to support I think [22:56:02] Krinkle: https://phabricator.wikimedia.org/T110903 [22:57:57] I don't see an actionable for me in that task at the moment. Is there a working example we can draw from? [22:58:06] Or some documentation? [23:00:59] Krinkle: from what I understand there - it seems like you can change the input url from tcp://..:8600 to the kafka:/// one [23:01:43] also the kafka urls can take a topic argument, so you can just consume the events for the topic you're looking for [23:03:57] Hm... not sure I follow [23:04:12] Which program takes the kafka url as input? [23:05:21] Analytics-EventLogging, Performance-Team: Make webperf eventlogging consumers use eventlogging on Kafka - https://phabricator.wikimedia.org/T110903#1623002 (Krinkle) [23:07:34] I assume that url is not compatible with zmq [23:07:58] I also don't know what "schema based topics" are, or how I would subscribe to one with kafka. [23:10:13] Krinkle: okay, I don't know everything, but will try to explain what I know. Eventlogging consumers usually take an input url and an output one - the handlers.py code handles this. For example this is the one that reads from kafka - [23:10:13] https://github.com/wikimedia/mediawiki-extensions-EventLogging/blob/91e7b0877d247f429e70928c8fd342ffd72bf350/server/eventlogging/handlers.py#L304 [23:11:48] OK. From what I know about EventLogging, none of our subscribers are EventLogging consumers. [23:12:19] They don't tie into EventLogging directly, rather they subscribe to an open topic-based broadcast (or in the case of the current zmq one, to everything). [23:12:34] zmq, redis, some kind of subscribable protocol. [23:15:29] alright - you would now have to consume from kafka instead. thinking about how that's possible [23:15:35] rather, how to do that [23:16:09] Analytics-Backlog: Doc cleanup day 2.0 - https://phabricator.wikimedia.org/T112024#1623045 (ggellerman) NEW [23:16:33] so varnishkafka consumers all requests I guess, and some are consumed by eventlogging for beacon/event. Then it presumably validates the schemas and then sends it on towards any eventlogging consumers. [23:16:43] as well as storage for querying [23:16:57] Krinkle: we use https://github.com/Parsely/pykafka to consume from kafka. would it make sense for you to use something like that? I am not sure if you can plug into the eventlogging handler somehow [23:17:05] the only piece of the puzzle I find is https://github.com/wikimedia/operations-puppet/blob/HEAD/modules/role/manifests/cache/kafka/eventlogging.pp [23:17:05] ori will know better maybe? [23:17:24] I have no idea where it goes from there or how I can subscribe to it. [23:18:10] Seeing how an existing zmq subscriber was converted in the past would help a lot [23:18:28] Or anything that's using eventlogging via kafka. 
[23:20:10] Krinkle: okay, I dont think we have anything right away, because the EL consumers dint have to be changed. We only needed a handler. I can do a test one today and give it to you [23:20:59] seeing a simple python or bash script that prints out a stream of eventlogging json data (for all or one topic) would be awesome. I can take it from there. [23:21:11] cool, let me write up something [23:21:14] Most of the wikitech documentation seems to be about the old way of doing things. Or maybe it hasnt' changed as much as I think. [23:21:36] yeah. we do have some diagram on the new architecture [23:21:38] * madhuvishy looks [23:21:45] * Krinkle likes diagrams :) [23:22:24] Krinkle: :) https://phabricator.wikimedia.org/T102225 [23:23:46] also, on the question about schema based topics - in this system - after an event is validated, it will go into a kafka topic specific to the schema - like all of NavTiming would be in Eventlogging_NavTiming [23:23:58] Hm.. I see. So EventLogging itself hasn't radically changed. This mostly revolves around the internals of using varnishkafka instead of varnishncsa+zmq? [23:24:15] That makes a lot more sense now [23:24:18] and all the pipes now kafka [23:24:24] instead of zmq streams [23:24:41] Right. Both for inputs (from varnish requests) and for output (kafka EL_ topics) [23:24:57] yeah [23:25:16] from the schema based topics, it gets written into hadoop, partitioned by topic [23:25:55] all of the events are also written into the "mixed" topic, from where it goes to mysql and files [23:26:18] Yeah [23:26:21] in your case, you can take advantage of the schema based ones to not have to filter the full stream [23:26:37] So those consumers of kafka's eventlogging* topics, we'd have our scripts become one of those [23:26:43] Yup [23:27:02] Im curious whether each of those scripts running somewhere in the data centre should subscribe directly to all kafka servers, or to have a pubsub system in between [23:27:26] Mostly with regards to duplication of information (e.g. the names of all kafka servers, which is more than one) and when those change. [23:27:34] Having some kind of abstraction for this in puppet would help. [23:28:05] but aside from the mysql and filesystem consumers, are there any other consumers right now that (in)directly end up sending data to statsd? [23:28:07] yeah, right now, your scripts are the only ones consuming apart from the regular EL ones. [23:28:28] and we din't even know those existed until you mentioned it that meeting :) [23:29:06] Krinkle: there's also hadoop now, but no, nothing that feeds to statsd [23:31:18] Hm.. okay. so yeah, it comes down to seeing an example of an eventlogging consumer (any). Depending on how tightly integrated the mysql or file consumer is, those could be good examples. [23:32:03] those are all in here https://github.com/wikimedia/mediawiki-extensions-EventLogging/blob/91e7b0877d247f429e70928c8fd342ffd72bf350/server/eventlogging/handlers.py#L162 [23:33:57] Interesting [23:36:10] I'm not sure I can figure out how to use this from an independent script on a separate server (e.g. hafnium or terbium) when subscribing to a topic to catch some events. [23:36:50] The list of kafka servers is in a puppet variable I imagine. [23:37:41] Krinkle: yeah, the pieces are defined here - https://github.com/wikimedia/operations-puppet/blob/production/manifests/role/eventlogging.pp. 
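(Editor's note: a minimal sketch of the "drop-in" conversion described here: the same shape as the webperf ve.py consumer, but handed a kafka:// input URI (for example one pointing at a per-schema topic like the NavTiming one mentioned above) instead of the tcp://...:8600 ZMQ endpoint. The exact URI lives in madhuvishy's pastebin and the puppet config and is not reproduced here, so the sketch takes it from the command line rather than hardcoding it.)

    import argparse

    import eventlogging  # same library the existing ve.py consumer imports


    def handle(event):
        # Placeholder: ve.py pushes counts/timings to statsd here; print while testing.
        print(event)


    def main():
        parser = argparse.ArgumentParser(description="Consume one EventLogging schema")
        parser.add_argument("input_uri",
                            help="assumed form: kafka:///<brokers>?topic=<schema topic>; "
                                 "take the real URI from puppet / the pastebin above")
        args = parser.parse_args()

        # eventlogging.connect() picks the reader (zmq, kafka, ...) from the URI scheme,
        # so the consuming loop itself does not change across the migration.
        for event in eventlogging.connect(args.input_uri):
            handle(event)


    if __name__ == "__main__":
        main()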
I'll write an example for you [23:51:46] Krinkle: i din't know this before but [23:51:47] http://pastebin.com/VMzqfneP [23:51:56] https://github.com/wikimedia/operations-puppet/blob/production/modules/webperf/files/ve.py [23:52:03] it would be exactly like this [23:52:16] but the input url would be the one i put in there [23:52:21] aha, eventlogging.connect "speaks" kafka [23:52:25] yess [23:52:27] yay [23:52:41] sorry I wasn't sure that would happen before [23:52:46] I thought the eventlogging library available on the cluster in general only supported zmq [23:53:03] nope, i'm sure this is some ori magic. [23:53:17] since I generally use it as a drop-in replacement for code like this https://github.com/wikimedia/operations-puppet/blob/production/modules/webperf/files/navtiming.py#L12-L31 [23:53:19] which is almost the same [23:53:22] but used zmq directly [23:53:50] yeah, but you can use connect instead? [23:54:03] Indeed [23:54:05] * Krinkle tries [23:54:26] that huge url is somewhat uncanny though, but we can abstract that [23:54:32] let me also find the topic names for you, you would just have to change the topic param ther [23:54:42] you pass that as an arg anyway right? [23:54:48] Yeah [23:54:51] it wont be in this code [23:55:01] But I don't want to hardcode it. I'll have puppet substitute it or pass as cli param [23:56:15] hmmm, I thought that's how it was already, but ya that would be the way to go [23:56:54] Aye eventlogging is not available on terbium. [23:56:57] * Krinkle tries on tin instead [23:58:17] At the moment this is my standard template for new one-off subscribers when debugging things [23:58:18] https://gist.github.com/Krinkle/22e2101bff0b156276db