[07:35:18] question if any of the lovely analytics engineers turn up; hadoop's config directory, where is it? [07:35:27] * Ironholds can't sleep and so is experimenting with getting RHive up and running. [07:39:15] Ironholds: You might be looking for /etc/hadoop/conf [07:39:31] That's hadoop's config directory. [07:39:55] For Hive (since you mentioned RHive) it is /etc/hive/conf [07:40:10] qchris, aha, thank you very much! [07:40:19] You're welcome :-D [07:40:26] * Ironholds is arguing with some lovely South Korean developers who wrote a Hive/R interface [07:40:32] well, s/arguing/fixing their documentation [08:18:09] (CR) Hashar: [C: -1] "I agree default tests should probably not provide coverage. If you want to retain the ability to easily run coverage tests, you could add" [analytics/wikimetrics] - https://gerrit.wikimedia.org/r/149384 (owner: QChris) [08:18:15] qchris: ^^^ :-D [08:18:23] * qchris looks [08:18:25] I am around if you want to talk about it [08:20:09] hashar: so the -1 is because I removed it only in one place. Right? [08:20:39] yup [08:20:44] Ok. Thanks. [08:20:49] for some reason it is in both setup.cfg and tox.ini [08:20:57] Yes. Good catch. [08:20:59] Thanks. [08:21:01] OpenStack by convention does not use setup.cfg, only tox.ini [08:21:08] but milimetric proposed to use setup.cfg [08:21:23] I guess you can push a patch that removes the config sections from tox.ini [08:21:26] then rebase yours on top of it [08:21:47] also, a tox environment would let one easily produce a coverage report. Might want to make it another patch as well or just integrate that in yours [08:21:48] I'll just amend. [08:21:51] tox -e cover rocks [08:22:16] Looks good. But I guess that'll be a separate change. 
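The opt-in coverage environment discussed above ("tox -e cover") is usually a small tox.ini addition. A sketch of what such a section might look like; the env names, deps, and flags here are illustrative guesses, not the actual wikimetrics config:

```ini
# Hypothetical sketch -- not the real wikimetrics tox.ini.
[tox]
envlist = py27

[testenv]
deps = -rrequirements.txt
       nose
# Default run: plain nosetests, no coverage, so `tox` stays fast.
commands = nosetests {posargs}

[testenv:cover]
# Opt-in coverage run, invoked as `tox -e cover`.
deps = {[testenv]deps}
       coverage
commands = nosetests --with-coverage --cover-html {posargs}
```

With this split, the default environment matches the "tests should not provide coverage by default" review comment, while `tox -e cover` keeps the report one command away.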
[08:22:18] at one point, maybe I will be able to have the HTML coverage report created and uploaded somewhere publicly so folks can easily review the report whenever a patch is proposed [08:22:47] Sounds good :-D [08:24:23] (PS3) QChris: Allow to run nosetests without coverage report [analytics/wikimetrics] - https://gerrit.wikimedia.org/r/149384 [08:25:33] qchris: would you add the 'cover' env in a different patch? [08:25:52] Yes. [08:26:18] (CR) Hashar: [C: 1] "Per IRC with christian, the introduction of a tox 'cover' target can be done in a follow up patch :)" [analytics/wikimetrics] - https://gerrit.wikimedia.org/r/149384 (owner: QChris) [08:26:20] \O/ [08:26:27] I love python [08:26:48] Currently, tests are not run directly, but through a separate script. [08:27:03] This separate script still runs the coverage report (on purpose). [08:27:11] yeah I don't think they pass out of the box [08:27:12] And I would not want to interfere with that too much. [08:27:14] iirc we need some daemon setup [08:27:32] It depends on other services. Yes. [08:28:11] i am sure you guys will figure out how to mock them with 'import mock' :-D [08:28:24] Ha :-D [08:28:36] Yes that would be great. [08:28:44] But we're constantly short on time. [08:29:16] And that would require reworking some parts. [08:29:29] Not sure how soon that'll happen :-/ [08:29:45] It seems we instead prefer to buy ourselves more and more tech debt. [08:29:51] Well ... :-D [08:42:06] qchris: common problem :] [08:42:35] if people write tests, run them locally and care about coverage, I guess that is good enough [08:43:44] Yup. [08:43:59] Luckily ... focus decreased a bit. [08:44:23] But "tests with good coverage" is better than "no tests". 
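The "mock them with 'import mock'" suggestion above is the standard way to make tests pass without the daemon setup. A minimal sketch; the client and function names are hypothetical, not wikimetrics code:

```python
# Hypothetical sketch: application code that depends on a live service,
# tested with the dependency stubbed out. On Python 3 this is
# unittest.mock; on Python 2 (the era of this log) the `mock` package.
from unittest import mock


def fetch_editor_count(db_client, wiki):
    """Pretend application code that needs a live database service."""
    rows = db_client.query("SELECT COUNT(*) FROM editors WHERE wiki = ?", wiki)
    return rows[0][0]


def test_fetch_editor_count_without_a_live_service():
    # No daemon needed: the Mock records calls and returns canned rows.
    fake_client = mock.Mock()
    fake_client.query.return_value = [(42,)]
    assert fetch_editor_count(fake_client, "enwiki") == 42
    fake_client.query.assert_called_once()
```

The trade-off mentioned in the chat is real: code written against live services usually needs some restructuring (dependency injection, as in the `db_client` parameter here) before mocks slot in cleanly.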
[08:44:24] :-) [08:44:36] s/focus/focus on coverage/ [08:48:35] fully agree [11:38:10] (CR) QChris: [C: -1] Fix slow Rolling Active Editor metric (1 comment) [analytics/wikimetrics] - https://gerrit.wikimedia.org/r/149482 (https://bugzilla.wikimedia.org/68596) (owner: Milimetric) [11:57:01] (CR) QChris: [C: 1] use protocol relative url for image links on stats homepage [analytics/wikistats] - https://gerrit.wikimedia.org/r/147876 (owner: Chmarkine) [12:21:09] Analytics / Wikimetrics: Backing up wikimetrics data fails if data is written while we back it up - https://bugzilla.wikimedia.org/68731 (christian) NEW p:Unprio s:normal a:None The run of the hourly script for 2014-07-28 05:00 failed with tar: /var/lib/wikimetrics/public/69987: file cha... [12:22:56] milimetric: hey! around? [12:23:14] DarTar: hey! around? [12:23:22] hi there [12:23:40] DarTar: heya! [12:23:48] how’s it going [12:23:50] DarTar: http://quarry.wmflabs.org/ now actually runs queries :) [12:23:56] neat :) [12:24:11] * DarTar checking it out [12:24:11] and has shareable results as well: http://quarry.wmflabs.org/query/22 [12:24:23] DarTar: I currently have it set to kill queries after 1 minute, need to tune that. [12:24:34] kk [12:25:19] DarTar: still a work in progress (need to add CSV download, etc). do give me feedback / tell other things that might need fixing [12:25:44] totally, what’s the best way to send feedback / issues? 
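The "file changed as we read it" backup failure in the bug report above (68731) happens when tar reads files that writers are still touching. One common mitigation is to copy the live directory to a private snapshot first and archive that; a hedged Python sketch with illustrative paths, not the actual wikimetrics backup job:

```python
# Sketch: avoid tar's "file changed as we read it" by archiving a
# private snapshot instead of the live directory. Paths and names are
# illustrative. Note the copy itself can still race a writer; a real
# job would pause writers or use filesystem-level snapshots.
import os
import shutil
import tarfile
import tempfile


def backup_directory(live_dir, archive_path):
    snapshot_root = tempfile.mkdtemp(prefix="backup-snapshot-")
    try:
        snapshot = os.path.join(snapshot_root, os.path.basename(live_dir))
        # Copy first: writers keep touching live_dir, but the snapshot
        # is ours alone, so tar reads a consistent tree.
        shutil.copytree(live_dir, snapshot)
        with tarfile.open(archive_path, "w:gz") as tar:
            tar.add(snapshot, arcname=os.path.basename(live_dir))
    finally:
        shutil.rmtree(snapshot_root)
```

Alternatively, GNU tar's `--ignore-failed-read`/warning suppression options only hide the symptom; snapshotting addresses the consistency problem itself.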
[12:26:34] DarTar: talk page of the meta page [12:26:41] DarTar: https://meta.wikimedia.org/wiki/Research:Ideas/Public_query_interface_for_Labs [12:26:46] sounds good [12:30:01] YuviPanda: I just noticed that the default license for the queries is CC BY SA, I would set as a default the least restrictive license (CC BY or maybe even CC0) to maximize reuse [12:30:48] definitely expect some feature requests, a query interface could become really big ;) [12:31:05] DarTar: :D [12:31:07] DarTar: yeah :) [12:31:11] DarTar: I'm adding a 'fork' button soon [12:31:18] nice [12:31:40] DarTar: right, so I wanted CC0, and then halfak pointed out that wikipedia content is CC BY-SA, and then someone else pointed that mw.org's content is something else, and wikidata's is CC0... [12:31:55] DarTar: so clarification needed on that, but I want to have a consistent license for all the SQL [12:32:28] DarTar: btw, don't publicize the URL yet. I need to puppetize this and make this scale slightly better. A couple more days. And I also suspect that the current data will be lost, but that should be ok [12:33:27] sure, we should have a discussion about this but in general when it comes to data with no specific attribution requirements the less restrictive the better (also to avoid attribution stacking problems which are a nightmare for open data) [12:33:39] yeah sure [12:36:11] issues with SA for data and code are well summarized here: http://www.dcc.ac.uk/resources/how-guides/license-research-data [12:38:17] DarTar: yeah, I agree. [12:38:31] DarTar: hmm, so I just realized - it will be trivial to mark the SQL as CC0, since the users are creating it [12:38:35] DarTar: but the data is more complicated [12:38:47] DarTar: I'm just going to clarify the wording to make sure it mentions just the SQL, and make it CC0 [12:39:21] Legal can help us here [12:42:11] DarTar: yeah [12:42:29] DarTar: need to get a conversation started. 
I want to publicly put this out during the wikimania research hackathon [12:42:38] nice [12:44:23] DarTar: I'm also wondering what a nice 'kill timeout' would be for these. Right now at 1m, but I'm guessing 10m is ok too. [12:45:07] yes 1m sounds a bit restrictive, maybe there could be different user groups with different limits one day? [12:45:57] DarTar: that wouldn't be too hard to do, no. I was thinking '10m per query, and 3 concurrent queries per user', and relax / tighten after seeing how usage pans out [12:46:12] sounds reasonable [12:47:12] yeah, will have to implement the concurrent queries thing in a bit [12:47:44] DarTar: I'm also considering adding 'recurring runs' at some point, perhaps to only trusted users. That + embedding support should make this really useful [12:49:22] it looks like this will be getting quite similar to wikimetrics (public reports, priv levels, recurrent reports etc), did you get a chance to talk to anyone in analytics dev to see if there’s an opportunity to reuse / share code? [12:50:40] DarTar: yeah, talked to milimetric about it at the start. At that point decided to not make it the same tool, though. Mostly because there wasn't too much non-trivial code sharing required, and the wikimetrics model seemed to have timeseries as the underlying structure rather than arbitrary sql [12:51:32] right [12:53:46] DarTar: a '+1' from the analytics/research team as such would be awesome as well :) [12:54:58] everybody in the research team I talked to is excited about this project, how can we make the endorsement more useful than just a +1 ? [12:58:27] DarTar: hmm, I don't know. 
Some form of saying 'Analytics/Research team greatly approves and loves this idea and thinks it would do a lot of good' in a public forum would be great [13:14:56] Analytics / General/Unknown: Pagecounts too high for June 2014 on stats.grok.se - https://bugzilla.wikimedia.org/68734 (christian) NEW p:Unprio s:normal a:None It seems that for all of June 2014, the per day data shown on stats.grok.se is about twice as high as it should be. See http://s... [13:15:55] Analytics / General/Unknown: Pagecounts too high for June 2014 on stats.grok.se - https://bugzilla.wikimedia.org/68734#c1 (christian) Created attachment 16057 --> https://bugzilla.wikimedia.org/attachment.cgi?id=16057&action=edit Screenshot of enwiki's Main_Page on stats.grok.se [13:16:38] Analytics / General/Unknown: Pagecounts too high for June 2014 on stats.grok.se - https://bugzilla.wikimedia.org/68734#c2 (christian) Created attachment 16058 --> https://bugzilla.wikimedia.org/attachment.cgi?id=16058&action=edit Screenshot of dewiki's Wikipedia:Hauptseite on stats.grok.se [13:17:39] Analytics / General/Unknown: Pagecounts too high for June 2014 on stats.grok.se - https://bugzilla.wikimedia.org/68734#c3 (christian) Created attachment 16059 --> https://bugzilla.wikimedia.org/attachment.cgi?id=16059&action=edit Screenshot of plwiki's Wikipedia:Strona_główna on stats.grok.se [13:18:39] Analytics / General/Unknown: Pagecounts too high for June 2014 on stats.grok.se - https://bugzilla.wikimedia.org/68734 (christian) a:christian [13:20:26] Analytics / General/Unknown: Packetloss issues on oxygen (and analytics1003) - https://bugzilla.wikimedia.org/67694 (christian) a:Dan Andreescu>christian [13:20:55] Analytics / General/Unknown: ULSFO post-move verification - https://bugzilla.wikimedia.org/68199 (christian) a:Dan Andreescu>christian [13:41:38] Analytics / General/Unknown: Pagecounts too high for June 2014 on stats.grok.se - https://bugzilla.wikimedia.org/68734#c4 (christian) The June 2014 upward jump is not 
present in webstatscollector's hourly raw pagecount files. Neither is it in the per day aggregations we provide. Also other tools consu... [14:02:02] hangoutprobss.... [14:16:59] halfak: you didn't have to load it again, I fixed up that table [14:17:21] I had a lot of errors. I'm worried about bad data. [14:17:33] It's no issue to re-load the dataset. I don't have to re-run queries. [14:17:33] yeah, I was going to bring that up next [14:17:41] the numbers I'm getting are low [14:17:46] about 10% low I think? [14:17:57] The queries ran as expected. [14:17:59] at least compared to running a very similar query on the raw data [14:18:20] yeah, it's def. weird, i saw the query you ran and it made sense [14:18:21] Not sure what's up with that. [14:18:30] Yeah. Could be related to import errors. [14:18:31] also, about 1 million records had 0 for user_id [14:18:51] yeah. Since I took out the inner-join to user, 0 is a valid rev_user value. [14:18:57] So, we'll want to filter them. [14:19:25] oh interesting, what does that mean in the data? Just a bug in the system? [14:19:42] When an anon makes an edit, rev_user = 0 [14:19:56] So, all edits by anons are lumped together. [14:24:14] oh, doh, why do i always forget that - sorry [14:24:47] so halfak: either way the table is awesome and speeds things up a lot, but I still don't think we'll be able to backfill in time [14:24:51] I thought of another way to speed it up [14:25:14] We could put an index on it that includes "revision". [14:25:16] so when you're done loading the data, I'd like to try adding a new column: revisions_for_past_30_days [14:25:23] That would let you slice with the btree. [14:25:32] Oh sure. That works too. [14:25:42] the problem is that RAE touches each record 30 times as it's rolling back over it [14:25:42] I also have the sorted dataset. You could just run a script over it. [14:26:16] It's sorted by wiki ASC, day ASC. 
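The rev_user = 0 point above (all anonymous edits share user id 0, so they must be filtered before counting editors) can be sketched against a toy table; sqlite stands in for MediaWiki's revision table here, and the data is made up:

```python
# Toy sketch of counting "active editors": registered users only
# (rev_user != 0) with at least 5 edits. sqlite stands in for the
# MediaWiki revision table; column names mirror it, data is invented.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE revision (rev_user INTEGER, rev_timestamp TEXT)")
rows = (
    [(0, "20140720120000")] * 10     # anon edits, all lumped under user 0
    + [(7, "20140720120000")] * 6    # registered user over the threshold
    + [(9, "20140720120000")] * 2    # registered user under the threshold
)
conn.executemany("INSERT INTO revision VALUES (?, ?)", rows)

active = conn.execute(
    """
    SELECT COUNT(*) FROM (
        SELECT rev_user
        FROM revision
        WHERE rev_user != 0       -- drop anons: user id 0 is not one person
        GROUP BY rev_user
        HAVING COUNT(*) >= 5      -- >= 5, not > 5: an easy off-by-one
    )
    """
).fetchone()[0]
print(active)  # 1: only user 7 qualifies; the 10 anon edits never count
```

Without the `rev_user != 0` filter, the anonymous bucket would look like one very prolific "editor" and inflate the count by exactly one per wiki-day.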
[14:26:52] the index on the table is working really well at getting specific days / etc [14:27:06] i think it's just basically 30x slower than it should be [14:39:37] doh halfak: i think the low numbers were user error. I was doing > 5 as opposed to >= 5 [14:39:42] sorry [14:40:00] Woot! I was getting worried. [14:40:04] I love dumb user errors. [14:40:09] I had deep compiler errors. [14:40:12] *hate [14:41:01] Every time I get stuck, I always hope that I did something stupid and simple wrongly so that I can fix it without submitting a bug. [14:41:29] Or worse, recompiling the kernel and sending a patch upstream. [14:41:39] hm... the numbers are still a tiny bit off [14:41:46] here's etwiki from editor_day: [14:41:54] | 20140723 | 118 | [14:41:54] | 20140724 | 119 | [14:41:54] | 20140725 | 120 | [14:41:55] | 20140726 | 119 | [14:42:08] and from the classic query: [14:42:10] "2014-07-25 00:00:00": 117.0, [14:42:10] "2014-07-26 00:00:00": 119.0, [14:42:10] "2014-07-27 00:00:00": 121.0, [14:42:12] "2014-07-28 00:00:00": 118.0 [14:42:55] * halfak *shrugs* [14:43:20] Or wait. I see that we're off by a few days in the listing. [14:43:44] Are we regularly high in editor_day? 
[14:44:06] i'm running for arwiki, one sec [14:44:09] http://pastebin.com/xWA5AbMP btw [14:44:29] so arwiki: [14:44:30] | 20140718 | 935 | [14:44:30] | 20140719 | 943 | [14:44:31] | 20140720 | 946 | [14:44:33] | 20140721 | 958 | [14:44:35] | 20140722 | 987 | [14:44:37] | 20140723 | 1003 | [14:44:39] | 20140724 | 1012 | [14:44:42] | 20140725 | 1026 | [14:44:44] | 20140726 | 1016 | [14:44:46] | 20140727 | 992 | [14:46:09] and classic query: [14:46:09] "2014-07-18 00:00:00": 912.0, [14:46:10] "2014-07-19 00:00:00": 915.0, [14:46:10] "2014-07-20 00:00:00": 918.0, [14:46:10] "2014-07-21 00:00:00": 926.0, [14:46:12] "2014-07-22 00:00:00": 941.0, [14:46:14] "2014-07-23 00:00:00": 972.0, [14:46:16] "2014-07-24 00:00:00": 981.0, [14:46:18] "2014-07-25 00:00:00": 992.0, [14:46:21] "2014-07-26 00:00:00": 992.0, [14:46:23] "2014-07-27 00:00:00": 1000.0, [14:46:26] Let's pick a day. I propose 2014-07-20 [14:46:49] looks like editor_day is consistently higher except for the last day [14:47:05] k, 7-20 [14:51:27] Hmm... I'm struggling to replicate. Sorry for the delay. [14:52:41] http://pastebin.com/E5xFXV0q [14:52:45] milimetric, ^ [14:53:13] So I can replicate the 946 - anon = 945. [14:53:30] Now, let me try to do it with arwiki's db [14:53:56] hm... is this just another +-1 [14:54:42] I filtered the user_id = 0 [14:54:59] right [14:57:59] yeah, my query for RAE gets 918 on 7-20 [14:58:16] (the result above was cached from wikimetrics running it a few days ago, but re-running manually got the same thing) [14:58:39] Sure enough. I get it when I manually go to the wiki to run the query! [14:58:39] http://pastebin.com/B2wa0WVL [14:58:42] WTF [14:58:53] ^ forgot to filter anons there [15:01:54] OH! [15:02:01] I think I've got it. [15:02:06] So, dates. Ha. They are strings. [15:03:18] The HHMMSS is the problem. I'm re-running a query to demo. (if I'm right) [15:03:44] The editor_day table doesn't have an HHMMSS, so the BETWEEN pulls in the full day of July 20th. 
[15:04:01] However the revision table has HHMMSS, so the BETWEEN only pulls in the first second of July 20th. [15:05:16] See my solution: http://pastebin.com/5Ece76Vk [15:05:40] huh... [15:06:12] TL;DR: BETWEEN "20140620" AND "20140720" means something different when timestamps look like YYYYMMDD (%Y%m%d) vs YYYYMMDDHHMMSS (%Y%m%d%H%i%S). [15:06:18] ok, i get it [15:06:36] but for our purpose, 7/20 00:00:00 would be the end date [15:06:56] so the between is OK there, the fix is to make it between start and 7/19 when we want 7/20 00:00:00 labeled-data [15:07:06] +1 [15:07:20] gotcha, ok, awesome! thanks for that, my brain no worky yet i guess [15:07:27] * milimetric should really look into this coffee thing [15:07:57] If I had a nickel for every time I derped a string/date comparison. [15:08:29] * YuviPanda gives everyone caffeine [15:24:37] halfak: is it ok to add that column to editor_day now? [15:24:51] or is it still loading? [15:25:25] Go ahead and add it. [15:25:53] I'm doing a clean load with another name -- just in case we need it. [15:26:20] "staging.editor_day_fixed" [15:26:32] "staging.editor_day" should be fine to modify [15:28:41] gotcha, ok [15:38:25] Analytics / Wikimetrics: metrics.wikimedia.org (Wikimetrics) unresponsive - https://bugzilla.wikimedia.org/68743 (christian) NEW p:Unprio s:normal a:None https://metrics.wmflabs.org/ is currently (2014-07-28 15:29) very unresponsive (and may appear down). Some pages (like uploading a new c... [15:41:25] Analytics / Wikimetrics: Backing up wikimetrics data fails if data is written while we back it up - https://bugzilla.wikimedia.org/68731#c1 (christian) It happened again for the 2014-07-28 14:00 run: tar: /var/lib/wikimetrics/public/69989: file changed as we read it tar: /var/lib/wikimetrics/publi... [15:46:33] milimetric: [15:46:44] happy monday -- looks like WM is having some issues [15:46:54] hi tnegrin [15:46:55] should ottomata take a look? 
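The BETWEEN pitfall diagnosed above is easy to check directly: MediaWiki timestamps are 14-character strings, and comparing them against a shorter date-only bound is lexicographic, so the upper bound silently drops the last day's full timestamps. Plain Python string comparison behaves analogously to the query's byte-wise comparison:

```python
# The BETWEEN string-comparison pitfall, in plain Python. MediaWiki
# timestamps are YYYYMMDDHHMMSS strings and compare lexicographically.
def between(ts, lo, hi):
    return lo <= ts <= hi

# editor_day-style values: date only, 8 characters. July 20 is included.
assert between("20140720", "20140620", "20140720")

# revision-style values: full 14-character timestamps. Any real edit on
# July 20 sorts AFTER the 8-character bound "20140720", so it is dropped.
assert not between("20140720120000", "20140620", "20140720")

# One fix: pad the bounds so the window covers the whole day. (The chat's
# alternative: end the window a day earlier when the label means
# "as of July 20 00:00:00".)
assert between("20140720120000", "20140620000000", "20140720235959")
```

The same two queries can thus return different editor counts for "the same" date range, which is exactly the small, consistent discrepancy seen between editor_day and the classic query.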
[15:47:09] tnegrin: https://bugzilla.wikimedia.org/68743 [15:47:16] nah, it's basically just not able to handle the kind of strain we're trying to put on it [15:47:31] the backfilling? [15:47:34] yes [15:47:41] ok -- thanks qchris btw [15:47:48] like, running enwiki and frwiki back to 2007 was fine [15:48:00] but running like 5 projects or more at a time created all the issues you see above [15:48:29] got it [15:48:32] i mean, that makes sense, 5 projects at a time means it's trying to queue up like 15,000 processes [15:48:53] so we just never thought about this very carefully and we acknowledged that at the time [15:49:07] it's time to pay down some tech debt and to optimize how backfilling works [15:49:20] also, RAE is a gnarly metric. Even with perfect optimization it would never be done in time [15:49:41] ok -- let's get that figured out. in the short term, can we reduce the processing? [15:49:55] we created a temp table over the weekend and are trying to pre-compute some aggregates to try to make it faster [15:50:06] i can kill the recurrent reports on the server, no problem [15:50:15] i just wanted to let them run so we had enough data to debug [15:50:19] kk [15:50:19] but i think we do now - so killing [15:50:24] let's talk during our 1:1 [15:50:33] thanks folks [15:50:35] k [15:53:39] Analytics / Wikimetrics: metrics.wikimedia.org (Wikimetrics) unresponsive - https://bugzilla.wikimedia.org/68743#c1 (christian) a:Dan Andreescu Assigning to milimetric, as he is about to kill the relevant jobs. 
[15:55:41] milimetric: poke me when you've some spare cycles, want to understand the architecture of wikimetrics a little better [15:55:58] YuviPanda: k [15:59:53] Analytics / Wikimetrics: metrics.wikimedia.org (Wikimetrics) unresponsive - https://bugzilla.wikimedia.org/68743 (christian) [16:03:23] Analytics / Wikimetrics: metrics.wikimedia.org (Wikimetrics) unresponsive - https://bugzilla.wikimedia.org/68743#c2 (Dan Andreescu) This is due to recurring reports I ran to test wikimetrics and see if it could handle back-filling lots of data. It back-filled 2 large wikis at a time all the way to 200... [16:08:53] Analytics / Wikimetrics: metrics.wikimedia.org (Wikimetrics) unresponsive - https://bugzilla.wikimedia.org/68743#c3 (Dan Andreescu) NEW>RESO/FIX also, I deleted the symlinks from the /var/lib/wikimetrics/public/datafiles folder. This leaves the system in a fairly clean state. I left the old rep... [16:39:19] milimetric, if it's of use, the new "editor_day_fixed" table has loaded with no errors [16:54:55] Analytics / General/Unknown: Pagecounts too high for June 2014 on stats.grok.se - https://bugzilla.wikimedia.org/68734#c5 (Toby Negrin) Hi Christian -- thanks for this. Let me know if you have trouble getting in touch with Henrik. I can ping him as well. -Toby [17:05:58] thanks halfak, I'm still futzing with filling that column (just learned about insert ... on duplicate update) [17:06:19] but the numbers seem to correspond with querying the raw wiki so i think we're ok [17:49:22] Analytics / General/Unknown: Pagecounts too high for June 2014 on stats.grok.se - https://bugzilla.wikimedia.org/68734#c6 (christian) (In reply to Toby Negrin from comment #5) > Let me know if you have trouble getting in > touch with Henrik. I can ping him as well. I received a response from him alrea... 
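The "insert ... on duplicate update" mentioned above is MySQL's INSERT ... ON DUPLICATE KEY UPDATE, which makes backfilling an aggregate column idempotent: re-running a batch accumulates or overwrites instead of failing on the primary key. A sketch using sqlite's equivalent UPSERT syntax (sqlite >= 3.24); the table is a toy stand-in, not the real editor_day schema:

```python
# Sketch of an idempotent aggregate fill. MySQL spelling would be
# INSERT ... ON DUPLICATE KEY UPDATE revisions = revisions + VALUES(revisions);
# sqlite's equivalent is ON CONFLICT ... DO UPDATE with `excluded`.
# Toy schema, not the actual staging.editor_day table.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """CREATE TABLE editor_day (
           wiki TEXT, day TEXT, user_id INTEGER, revisions INTEGER,
           PRIMARY KEY (wiki, day, user_id)
       )"""
)

upsert = """
    INSERT INTO editor_day (wiki, day, user_id, revisions)
    VALUES (?, ?, ?, ?)
    ON CONFLICT (wiki, day, user_id)
    DO UPDATE SET revisions = revisions + excluded.revisions
"""
conn.execute(upsert, ("arwiki", "20140720", 7, 3))  # first batch: plain insert
conn.execute(upsert, ("arwiki", "20140720", 7, 2))  # same key: accumulate

count = conn.execute(
    "SELECT revisions FROM editor_day WHERE user_id = 7"
).fetchone()[0]
print(count)  # 5
```

In the UPDATE clause, the bare column refers to the existing row and `excluded.revisions` (MySQL: `VALUES(revisions)`) to the value the failed insert carried.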
[18:01:52] Analytics / Tech community metrics: Wrong data at "Update time for pending reviews waiting for reviewer in days" - https://bugzilla.wikimedia.org/68436#c1 (Alvaro) Working on it. Thanks for the detection Quim. [18:28:39] Analytics / Tech community metrics: Wrong data at "Update time for pending reviews waiting for reviewer in days" - https://bugzilla.wikimedia.org/68436#c2 (Alvaro) NEW>PATC Quim, the problem is related to the update of messages in this page. I think you worked on an HTML page not in sync with master... [20:19:07] Analytics / General/Unknown: Pagecounts too high for June 2014 on stats.grok.se - https://bugzilla.wikimedia.org/68734#c7 (christian) NEW>RESO/FIX The June 2014 bump on stats.grok.se graphs is gone, and graphs look as expected again. Thanks for the fix Henrik! [21:25:22] Anyone know if we have an ottomata today? [21:25:44] For some reason we don't have gcc on stat3 and I need to install a python package. [21:25:51] That requires some compilation. [22:37:40] Analytics / Refinery: Story: HadoopUser has refined data available in clustered bucketed Hive tables - https://bugzilla.wikimedia.org/67127#c1 (Kevin Leduc) p:High>Low Moving to low priority. We should define "refined data" and ETL first.