[00:00:15] and only for cache misses [00:00:53] gwicke: ok, i see. my main concern is not to see it fall over, not that limit is too low [00:00:54] Analytics, CirrusSearch, Discovery, operations: Delete logs on stat1002 in /a/mw-log/archive that are more than 90 days old. - https://phabricator.wikimedia.org/T118527#1802620 (EBernhardson) NEW [00:01:44] kk [00:02:38] it looks like aqs is much happier with DTCS [00:03:03] looking at https://grafana.wikimedia.org/dashboard/db/aqs-cassandra-system?panelId=12&fullscreen&from=1446768172128&to=1447372972129&var-node=All [00:12:58] gwicke: wait what is DTCS? [00:13:14] https://labs.spotify.com/2014/12/18/date-tiered-compaction/ [00:16:14] madhuvishy: i was falling behind on recruiting stuff with today's e-mail issues, need to finish that today and will join you soon in wikimetrics work [00:18:33] milimetric: one great thing would be to document how to handle page titles that include slashes, question marks, apostrophes, and such. [00:19:37] milimetric: or better, provide per-article views based on page-id instead of title. [00:22:30] ragesoss: yes, that's a known bug / annoyance [00:22:45] right now you have to double URI-encode the article titles [00:22:47] which is not ok [00:22:54] ew [00:23:11] if it turns out to be really hard to fix, we'll just document [00:23:15] but hopefully we can just fix it... 
[00:23:24] ew indeed [00:24:11] oh, sorry, I missed your other ping, catching up [00:25:11] ragesoss: as far as difference with stats.grok.se, it's hard to comment because I don't know how far they got [00:25:18] *how often they update now [00:25:49] so, technically, the Pageview API should have lower numbers by a little bit than the previous stats (which stats.grok.se was ingesting) [00:26:23] but we've vetted the data in a number of ways and it looks good to us [00:27:06] milimetric: compare: http://stats.grok.se/en/latest60/Selfie vs https://wikimedia.org/api/rest_v1/metrics/pageviews/per-article/en.wikipedia/all-access/user/Selfie/daily/2015100100/2015103000 [00:28:04] k, i'll do a little more thorough work on that specific article, I'll look at the raw data that feeds both [00:28:19] The other more-popular articles from the categories I looked at had similar patterns: 2-3 times the page views reported by Pageview API, instead of the slightly lower I was expecting. [00:28:34] although most of the articles I spot checked were very similar. [00:28:58] that's... 
pretty weird [00:29:06] ok, skeptical hat on :) [00:31:02] similarly: http://stats.grok.se/en/latest60/Female%20genital%20mutilation (about 2000 per day) vs https://wikimedia.org/api/rest_v1/metrics/pageviews/per-article/en.wikipedia/all-access/user/Female_genital_mutilation/daily/2015100100/2015103000 (closer to 6000) [00:32:04] similarly [[Feminism]] and [[Misogyny]] [00:32:19] (those three being the most viewed articles in Category:Feminism) [00:34:00] Analytics, Analytics-Backlog, Fundraising research: Provide performant query access to banner show/hide numbers - https://phabricator.wikimedia.org/T90649#1802757 (Jgreen) [00:34:03] Analytics, Analytics-Cluster, Fundraising Tech Backlog, Fundraising-Backlog, and 3 others: Verify kafkatee use for fundraising logs on erbium - https://phabricator.wikimedia.org/T97676#1802755 (Jgreen) Open>Resolved (07:14:09 PM) awight: Jeff_Green: so the new data is good, and is correctly... [00:41:18] Analytics-Kanban, Analytics-Wikistats, Patch-For-Review: Feed Wikistats traffic reports with aggregated hive data {lama} [21 pts] - https://phabricator.wikimedia.org/T114379#1802784 (Nuria) @ezachte: I created https://phabricator.wikimedia.org/T118323 to keep track of the "pageviews per country" report... [00:44:04] ragesoss: sorry, I forgot stats.grok.se is using *really* old data. So they don't even have mobile web numbers. It looks like that explains the difference. [00:44:07] so, for October 10: [00:44:17] old data, en.wikipedia: 1200 [00:44:25] old data, en.m.wikipedia: 991 [00:44:28] ah. okay, cool. [00:44:35] thanks milimetric. [00:44:42] new data, all of en.wikipedia: 2220 [00:45:24] kind of cool the API lets you break that down by site :) I'm gonna go pat myself on the back now [00:48:34] milimetric: so, in Ruby, I did URI.encode (twice) but that didn't affect the forward slash character. 
I got all the articles with slashes when I did `title = URI.encode(CGI.escape(title))` [00:48:55] milimetric: but there are still 45 articles from my batch of several thousand that I can't get data for. [00:49:30] looks like these remaining ones are maybe not an encoding problem. [00:49:50] ragesoss: encodeURIComponent (javascript) is what seems to work. The bug we're going to look at very soon is: https://phabricator.wikimedia.org/T118403 [00:50:03] and you can see there an example of successfully querying with a slash [00:51:23] milimetric: what about this? https://wikimedia.org/api/rest_v1/metrics/pageviews/per-article/en.wikipedia/all-access/user/Fourth-wave_of_feminism/daily/2015100100/2015103000 [00:51:24] ragesoss: http://ruby-doc.org/stdlib-1.9.3/libdoc/uri/rdoc/URI.html#method-c-encode_www_form_component [00:51:50] according to http://stackoverflow.com/questions/2858345/ruby-equivalent-to-javascript-s-encodeuricomponent-that-produces-identical-outpu, URI.escape is deprecated in 1.9.2 [00:52:03] gwicke: while you're on this context switch :) what's up with that, why does it need two encodes? [00:52:12] it shouldn't [00:52:17] https://phabricator.wikimedia.org/T118403 [00:52:23] in JS, you'd use encodeURIComponent [00:52:47] it looks like it needs /HIV%252FAIDS/ instead of /HIV%2FAIDS/ [00:53:13] milimetric: any idea why that last query I posted doesn't work? [00:53:22] (checking) [00:53:30] It's an article with sparse, but not zero, page views according to grok. [00:53:49] and there are no problem characters in the title. [00:54:00] https://en.wikipedia.org/api/rest_v1/page/html/HIV%2fAIDS [00:54:40] gwicke: my thought was maybe because we're passing this to the backend AQS? [00:55:11] like is there some decoding going on at each layer and then maybe the {+path} rewriting hits the "/" and breaks? 
[00:55:26] oh, hm [00:55:28] (just making stuff up, but we're not decoding anything anywhere) [00:55:51] yeah, {+path} is problematic [00:56:19] {title} or {/title} would be escaped as expected, but {+path} will contain multiple parts [00:56:43] ragesoss: it looks like maybe we have no views for October for that (or at least no loaded views): https://wikimedia.org/api/rest_v1/metrics/pageviews/per-article/en.wikipedia/all-access/user/Fourth-wave_of_feminism/daily/2015100100/2015113000 [00:57:19] right, so because we're using {+path} maybe normal stuff needs to be escaped just once but "/" needs to be escaped twice [00:57:22] and I guess the system doesn't know the difference between zero views and not-loaded views? [00:57:34] hm... that would suck, I'll just change it to manually map instead of {+path} [00:57:53] ragesoss: we're having a bit of a hot debate about that, but it's a pretty tricky problem in general [00:57:56] milimetric: the issue would be in the decoding part [00:58:10] {+path} shouldn't be decoded [00:58:43] so you think it's a bug in how that's handled in restbase? [00:58:59] possibly, looking at https://github.com/wikimedia/swagger-router/tree/master/lib [00:59:29] ragesoss: so for now I think we decided sparse results with no difference between "not loaded" and "actually 0" are easier. But we can chat maybe at the developer summit in Jan. about finer points like that [00:59:50] ragesoss: but I still think it's weird that we don't have data for that if stats.grok.se does, we should have at least as much [01:01:02] ragesoss: lol, it was just created: https://en.wikipedia.org/w/index.php?title=Fourth-wave_of_feminism&action=history [01:01:19] yeah. so, data ends on Pageviews API when? 
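The {title} vs {+path} distinction gwicke describes comes from RFC 6570 URI Templates: simple {var} expansion percent-encodes reserved characters, while reserved {+var} expansion passes them through. A minimal toy sketch of just that difference (not swagger-router's actual implementation):

```python
from urllib.parse import quote

RESERVED = "/:?#[]@!$&'()*+,;="

def expand(template: str, **variables) -> str:
    """Toy RFC 6570 expansion supporting only simple {var} and reserved {+var}."""
    out = template
    for name, value in variables.items():
        # {+var}: reserved expansion -- "/" and other reserved chars survive
        out = out.replace("{+%s}" % name, quote(value, safe=RESERVED))
        # {var}: simple expansion -- everything reserved gets percent-encoded
        out = out.replace("{%s}" % name, quote(value, safe=""))
    return out

assert expand("/page/{title}", title="HIV/AIDS") == "/page/HIV%2FAIDS"
assert expand("/page/{+path}", path="HIV/AIDS") == "/page/HIV/AIDS"
```

Because a {+path} value may legitimately contain multiple slash-separated segments, a router must not decode it wholesale, which is the bug being diagnosed here.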
[01:02:33] milimetric: here's another new-ish article, but definitely in the range where the new api has data for other articles: https://wikimedia.org/api/rest_v1/metrics/pageviews/per-article/en.wikipedia/all-access/user/Hartmann_alligator_forceps/daily/2015100100/2015103000 [01:02:48] oh, whoops. [01:02:51] ragesoss: should be current, so it will have the last full day [01:03:00] my query ended in october right before it was created. [01:03:02] okay. [01:03:04] user error. [01:03:18] sok, thx for poking at it [01:04:27] gwicke: I gotta run, but thanks a lot for looking at that, we can check it out tomorrow too [01:04:39] o/ all, nite [01:05:03] there is definitely a bug on the decoding side [01:05:16] milimetric: nite! [01:06:47] sweet. looks like all the cases except for the slash encoding that didn't return results were articles that got created outside of my query time brackets. [01:21:59] a fix for the slash encoding issue is ready at https://github.com/wikimedia/swagger-router/pull/28 [01:23:54] sweeet [01:24:07] thanks for the report! [01:28:44] oh god, URL decoding [01:41:38] lol Ironholds :) [01:42:26] milimetric, I wrote the standard URL decoder behind httr. [01:42:39] Anything that starts with "okay first, convert individual characters into their hex equivalents" - no. Just no. 
[01:42:41] Analytics-Kanban: AQS should expect article names uriencoded just once {slug} - https://phabricator.wikimedia.org/T118403#1802881 (Milimetric) Gabriel took a look at this and his fix looks good to me: https://github.com/wikimedia/swagger-router/pull/28 [01:42:50] the nicest thing about URL encoding/decoding is that it's not punycode [01:43:12] i don't know what any of what you just said is, and I absolutely refuse to look it up [01:43:14] Ironholds: I feel your pain ;) [01:44:10] basically I look at encoding and how it's handled in languages like python 2.x as one of the biggest fails of the engineering profession [01:44:17] milimetric, oh, punycode is a standard for handling UTF-8 URLs in an encoded/decoded fashion so they can be handled on machines where the service doesn't necessarily LIKE random kanji [01:44:19] it's like Software's Tacoma Narrows Bridge [01:44:25] and it makes URL encoding look trivial [01:44:36] milimetric: so you are a Python 3.0 believer? [01:44:53] yeah [01:44:58] particularly if you implement it in say, C because then you're dealing with multibyte characters from the get-go and nobody actually agrees how big a wchar is (thank you Microsoft for implementing a minimum value that is smaller. than. some. characters. WTF.) [01:45:22] yeah, C and C++ have messed that up [01:45:25] yuuup [01:45:44] newer languages like Rust and Go have gone for UTF8 throughout, which I think is the best option [01:45:55] like, I wrote the standard URL encoder/decoder/parser for my environment, and it's got a C++ backend for speed. And it's v.fast! 1m URLs encoded in 930ms, 1m parsed in 1.3s [01:46:02] agreed, UTF8 [01:46:06] unless your main use case is measuring the length of strings ;) [01:46:07] and people keep going "you should implement a punycode part of it" and I go "nope nope NOPE NOPE" [01:46:24] Yeah, I've been looking at Rust. A friend and I decided we should learn Rust and Go and flipped coins. 
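For the curious, the punycode being dreaded here is short to demonstrate with Python's stdlib "idna" codec (IDNA 2003, which wraps punycode for DNS labels; strictly it applies to hostnames, not whole URLs):

```python
# Non-ASCII hostname labels become ASCII-compatible "xn--" labels for DNS.
label = "münchen"
encoded = label.encode("idna")
assert encoded == b"xn--mnchen-3ya"

# The transformation round-trips
assert encoded.decode("idna") == "münchen"
```

The "xn--" prefix plus a suffix encoding the positions of the non-ASCII characters is why it "makes URL encoding look trivial".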
[01:46:33] The native UTF8/native parallelism elements are extremely attractive [01:46:39] the FFI needs a bit of work but it's close. [01:47:14] yeah, should revisit my Rust node module [01:47:41] just saw that compilation speed has improved a lot over the last six months [01:48:01] https://github.com/gwicke/html5tidy [01:48:17] da! [01:50:28] when there's a good HTTP2 server in Rust I'll be in trouble [01:56:30] haha [01:56:36] I mean, they've got a really good pure maths section already [01:56:47] I was thinking of implementing some spherical trig functions, since that's a prerequisite of geospatial work [01:56:51] just haversine and similar things [02:31:08] Analytics-Kanban: Pageview API showcase App {slug} - https://phabricator.wikimedia.org/T117224#1802918 (Milimetric) +2, all is well except we'll have to decide on a better URL (we can host it on limn1). I was thinking about this and had a couple of suggestions: analytics.wmflabs.org/demo/pageview-api demo.... [02:59:15] Analytics-Kanban, Analytics-Wikistats, Patch-For-Review: Feed Wikistats traffic reports with aggregated hive data {lama} [21 pts] - https://phabricator.wikimedia.org/T114379#1802927 (Milimetric) Wow, Erik, the commons numbers do indeed drop a lot on September 15th: https://vital-signs.wmflabs.org/#proj... [03:02:57] Analytics-Kanban, Analytics-Wikistats, Patch-For-Review: Send email out to community notifying of change - https://phabricator.wikimedia.org/T115922#1802930 (Milimetric) @ezachte: oops, I think @nuria sent the email yesterday. Unless this is meant to go out more broadly on a different list I don't kno... 
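The haversine function mentioned above, for great-circle distance on a sphere, is short enough to sketch; this assumes a mean Earth radius of 6371 km:

```python
import math

def haversine_km(lat1, lon1, lat2, lon2, radius_km=6371.0):
    """Great-circle distance between two (lat, lon) points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = math.radians(lat2 - lat1)
    dlam = math.radians(lon2 - lon1)
    # Haversine of the central angle between the two points
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2
    return 2 * radius_km * math.asin(math.sqrt(a))

# London to Paris comes out at roughly 340-345 km
d = haversine_km(51.5074, -0.1278, 48.8566, 2.3522)
assert 330 < d < 360
```

It assumes a spherical Earth, so expect errors up to ~0.5% versus an ellipsoidal model.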
[03:05:44] Analytics-Backlog: Make Pageview API date formats more flexible {slug} - https://phabricator.wikimedia.org/T118543#1802934 (Milimetric) NEW [06:35:48] Analytics-Backlog, Reading-Admin, Zero: Country mapping routine for proxied requests - https://phabricator.wikimedia.org/T116678#1803075 (kevinator) [06:51:48] Analytics-Backlog, Reading-Admin, Zero: Country mapping routine for proxied requests - https://phabricator.wikimedia.org/T116678#1803086 (Yurik) I suspect this might be related to T89688 [08:21:50] Analytics-Kanban, CirrusSearch, Discovery, Discovery-Cirrus-Sprint, Patch-For-Review: Setup oozie task for adding and removing CirrusSearchRequestSet partitions in hive - https://phabricator.wikimedia.org/T117575#1803168 (dcausse) Moving to done. Now waiting for T117873 to be reviewed. This is t... [09:36:10] Analytics-Backlog, Analytics-Cluster, operations: Audit Hadoop worker memory usage. - https://phabricator.wikimedia.org/T118501#1803197 (JAllemandou) @Ottomata: Let me know when you want to have a go on that, I'm interested in helping :) [09:49:09] joal: around? [09:58:53] hi dcausse [09:59:00] hi! [09:59:22] I'm trying to run camus job to pull everything from kafka [10:00:11] so I've deleted all my hdfs directories where camus stored its states from previous jobs [10:00:30] and set kafka.max.pull.hrs=264 and kafka.max.historical.days=11 [10:01:02] dcausse: I think on kafka side of things, you only get 7 days of data saved IIRC [10:01:18] but I see: 15/11/13 09:50:38 ERROR kafka.CamusJob: The earliest offset was found to be more than the current offset: ... [10:01:32] I wonder where this offset is stored? [10:02:26] hm dcausse [10:02:48] Let's take it from the beginning :) [10:02:52] :) [10:03:00] where is your camus properties files? 
[10:03:20] /home/dcausse/avro-kafka/mediawiki.properties [10:03:26] on stat1002 [10:03:50] * joal looks [10:04:50] note that the job is still running (/home/dcausse/avro-kafka/camus-run.log) [10:08:15] actually dcausse : /home/dcausse/avro-kafka/camus-run.log is yesterday's run, /home/dcausse/avro-kafka/camus-run-10.log is today --> still running, no error [10:08:29] and used this command : /srv/deployment/analytics/refinery/bin/camus --run --job-name camus-avro-test-dcausse -l ./refinery-camus-0.0.23-SNAPSHOT.jar mediawiki.properties --check &> camus-run-10.log [10:08:49] oops yes you're right [10:09:02] :) [10:09:33] but I think it will fail like yesterday after the mr job :/ [10:09:42] Some topics skipped due to offsets from Kafka metadata out of range. [10:11:31] or maybe not, let's see :) [10:19:24] dcausse: quick question about the number of mappers you set in camus: Why 25 ? [10:19:43] dcausse: I think you have 12 partitions, so 12 mappers should do, no ? [10:20:14] I have no idea, this file is from puppet [10:20:44] hm [10:20:57] I've just changed timestamp to ts and tweaked the kafka.max.historical things [10:22:40] dcausse: Still 1/2h to wait, let's wrap up at that time :) [10:22:55] ok :) [10:50:09] Ohhhhhh ! dcausse ! [10:50:13] I got the error :) [11:01:55] dcausse: you here ? [11:28:48] joal: yes [11:29:04] :( [11:30:30] I think kafka stores this offset on its side. Wondering if changing kafka.client.name can force it to "restart" [11:33:15] dcausse: not a kafka issue in fact :) [11:33:22] ah ? 
[11:33:29] I should have noticed straight from the beginning, I'm sorry [11:33:39] It's a checker issue [11:33:44] camus run went fine [11:34:17] But at 'first run', it seems camus logs current timestamp as the previous one, triggering an error in checker [11:34:25] oh yes you're right, I can see data in /user/dcausse/camus/raw/mediawiki/mediawiki_CirrusSearchRequestSet/hourly/2015/11/06/23 [11:34:38] but _IMPORTED is not set [11:35:03] dcausse: because it's the checker job [11:35:29] ok, do you think the error is related to my patch? [11:35:51] https://gerrit.wikimedia.org/r/#/c/251267/ [11:35:51] So, I think the easiest is to have a second run of camus + checker, ensure everything is fine with that one, and manually add the flag for the first set [11:35:57] dcausse: can't say really [11:36:11] so I run it again with the same params? [11:36:19] dcausse: another way to do it: make a very short first camus run (instead of 55 minutes) [11:36:34] like that, the number of flags to create manually will be small [11:36:53] dcausse: I think it's fine to run again with the same params [11:37:01] ok will try [11:37:28] dcausse: I'm monitoring the thing as well [11:38:23] ok running it with logs: camus-run-10-take2.log [11:39:25] k dcausse, next checkpoint in 55 minutes :) [11:39:32] :) [11:39:45] forgot to change this setting :( [11:40:03] dcausse: we could have been smarter, and make smaller camus runs ;) [11:44:49] git log [11:44:53] oops :) [12:06:19] Analytics-Kanban, Analytics-Wikistats, Patch-For-Review: Feed Wikistats traffic reports with aggregated hive data {lama} [21 pts] - https://phabricator.wikimedia.org/T114379#1803341 (JAllemandou) @ezachte, @Milimetric, I have the reason for the commons drop: [[ https://wikitech.wikimedia.org/w/index.p... [12:07:15] joal: it worked! [12:08:34] hmm.. 
but did not create the _IMPORTED flag on previous data :/ [12:09:04] dcausse: expected behavior :) [12:09:06] ok [12:09:30] so when we deploy, and if we want to backfill to fix current partitions we will: [12:09:30] As I said, after the successful run, we would have to manually create the flags for the failing first run [12:09:44] 1/ a small run to initialize the offset [12:09:50] 2/ a backfill run [12:10:05] 3/ create _IMPORTED manually on the first run [12:10:25] 4/ run oozie to create all those partitions in hive [12:10:28] dcausse: I'd say a very small first run (maybe not even _IMPORTED flags to be created ? [12:10:48] joal: oh I see, good idea [12:10:49] Then a cron with regular run (backfill will happen automagically :) [12:10:55] ah cool :) [12:11:29] I think the size of the very small run still needs to be tested (like 1min ?) [12:11:41] dcausse: easy to do though :) [12:11:53] ok will check with a new hdfs path [12:13:26] dcausse: let me know how it goes :) [12:13:32] sure [12:32:34] Analytics-Kanban, Analytics-Wikistats, Patch-For-Review: Feed Wikistats traffic reports with aggregated hive data {lama} [21 pts] - https://phabricator.wikimedia.org/T114379#1803374 (ezachte) @JAllemandou, Wow, great find! I guess this affects mostly wikis where a large percentage of page views is from... [12:37:56] joal: our data is too small, with 1 min it was able to pull 4 full partitions and a partial one :) [12:52:34] joal: in the end with an initial run at 1min we will have to flag 2 partitions manually [13:13:42] (CR) DCausse: "We talked about this issue with Joal on IRC." 
[analytics/refinery/source] - https://gerrit.wikimedia.org/r/251267 (https://phabricator.wikimedia.org/T117873) (owner: DCausse) [13:17:06] Analytics, Traffic, operations: Replace Analytics XFF/client.ip data with X-Client-IP - https://phabricator.wikimedia.org/T118557#1803407 (BBlack) NEW [13:29:08] dcausse: sorry went away [13:29:26] dcausse: 2 partitions flagged by hand, that's still not too bad, no ? [13:30:23] joal: sure, also should I write something somewhere about the things we will have to do when we deploy the fix? [13:30:43] would be a good idea dcausse :) [13:31:35] concerning the first at 1min, will we use gerrit or is it possible to set this property on the command line? [13:31:47] s/the first/the first camus run/ [13:32:29] dcausse: I think we'll ask ottomata to do it manually, then flag the partitions manually, then merge the puppet camus cron to be executed [13:32:42] ok [13:38:29] Analytics-Backlog, Reading-Admin, Zero: Country mapping routine for proxied requests - https://phabricator.wikimedia.org/T116678#1803443 (BBlack) In general, we shouldn't be trying to manually decode the XFF header: we should be moving towards relying on the X-Client-IP header we're setting in Varnish t... [13:39:53] Analytics-Backlog, Reading-Admin, Zero: Country mapping routine for proxied requests - https://phabricator.wikimedia.org/T116678#1803444 (BBlack) I should have said, at the top of the above comment: The new X-Client-IP code should resolve most issues, but there may be remaining issues specific to Intern... 
[13:41:20] Analytics, Traffic, operations, Patch-For-Review: Replace Analytics XFF/client.ip data with X-Client-IP - https://phabricator.wikimedia.org/T118557#1803453 (BBlack) [13:55:31] Analytics-Backlog, Reading-Admin, Zero: Country mapping routine for proxied requests - https://phabricator.wikimedia.org/T116678#1803486 (Yurik) https://developers.facebook.com/docs/internet-org/platform-technical-guidelines mentions that IP is added to XFF, and the detection can be done by checking th... [14:05:52] Analytics-Kanban: AQS should expect article names uriencoded just once {slug} - https://phabricator.wikimedia.org/T118403#1803523 (mobrovac) The patch has been merged and a new version of the package is available. Hence, this should be corrected on the next deploy. [14:13:34] hey mobrovac [14:14:26] looking at your comment on article encoding: are you talking about restbase-aqs deploy, or restbase-restbase deploy? [14:14:31] mobrovac: --^ [14:17:56] joal: both :) [14:18:40] so if I understand correctly, for the patch to take effect, we need both restbases to be deployed, mobrovac [14:19:40] lemme check [14:20:26] Analytics-Backlog, WMDE-Analytics-Engineering, Wikidata, Graphite, Patch-For-Review: Enable retention of daily metrics for longer periods of time in Graphite - https://phabricator.wikimedia.org/T117402#1803550 (Addshore) After many discussions in many places we have decided to try and push forwar... [14:21:17] Analytics-Backlog, WMDE-Analytics-Engineering, Wikidata, Graphite: Create a Graphite instance in the Analytics cluster - https://phabricator.wikimedia.org/T117732#1803551 (Addshore) Open>stalled After many discussions in many places we have decided to try and push forward with the config chang... [14:23:05] joal: euh, sorry, no deploy needed for restbase-aqs, only restbase^2 :) [14:23:24] ok mobrovac, understood :) [14:23:29] i'll edit the comment [14:23:45] mobrovac: would you let us know when the next deploy happens ? 
[14:24:27] joal: sure, you can expect one on monday most likely [14:24:45] today's friday, so no deploys :P [14:24:48] ok mobrovac, thanks ! [14:25:04] For sure mobrovac :) Weekend preservation is good for all of us [14:25:10] :) [14:49:31] Analytics-Backlog, Privacy, Varnish: Connect Hadoop records of the same request coming via different channels - https://phabricator.wikimedia.org/T113817#1803654 (Ottomata) @csteipp, yes. If `request_id` is on webrequests as well as on app server generated events, then researchers like @leila can assoc... [14:49:54] * joal is away, will be back for standup [15:00:39] Analytics-Kanban: Pageview API showcase App {slug} - https://phabricator.wikimedia.org/T117224#1803681 (mforns) Cool. Thx for the review! Between the two urls I lean towards 'demo.pageviews.wmflabs.org'. I think we should be consistent in using either 'pageview api' or 'analytics query service' everywhere t... [15:05:41] * milimetric thinks he solved his wifi issues by taking his laptop apart and putting it back together again. [15:06:03] Does anyone know what the distribution of page views looks like? How many pages are not read at all, etc.? [15:06:26] (English Wikipedia, specifically.) [15:08:53] dcausse: if you can write deployment steps in wikitech it will be best for future reference, I am a little lost with the need to have _IMPORTED files [15:10:20] nuria: sure, but I think this one is really exceptional [15:22:23] mforns: I forget what I said we should work on together this morning. was it the encoding thing that got fixed already? [15:22:27] no!! I remember, the spec [15:22:41] milimetric, yes the spec [15:22:56] cool, wanna work on it? 
[15:23:12] I was going to jump into it after I finish the engagement survey [15:23:15] yes sure [15:23:19] give me 2 mins [15:23:23] ah, good, I'll read your docs, no rush [15:29:01] Analytics, CirrusSearch, Discovery: Deploy refinery-camus 0.0.23 to fix partition issues with mediawiki.CirrusSearchRequestSet - https://phabricator.wikimedia.org/T118562#1803763 (dcausse) [15:33:36] Analytics-Kanban: Pageview API showcase App {slug} - https://phabricator.wikimedia.org/T117224#1803769 (Nuria) Let's please use pageview-api as it is a lot more self-explanatory. [15:38:05] Analytics-Kanban: Pageview API showcase App {slug} - https://phabricator.wikimedia.org/T117224#1803778 (Milimetric) There are only a few people that have rights to make gists on wikimedia, I think. Everyone else just shares their own personal gists, I think that's totally fine. I agree the naming has been... [15:38:42] Analytics, CirrusSearch, Discovery: Deploy refinery-camus 0.0.23 to fix partition issues with mediawiki.CirrusSearchRequestSet - https://phabricator.wikimedia.org/T118562#1803779 (Nuria) In my opinion we should be ok losing data, this is analytics data, of tier-2 nature and we should be fine starting... [15:41:51] Analytics-Kanban, Analytics-Wikistats, Patch-For-Review: Feed Wikistats traffic reports with aggregated hive data {lama} [21 pts] - https://phabricator.wikimedia.org/T114379#1803784 (Milimetric) @JAllemandou - thank you, I've added an annotation to explain: https://vital-signs.wmflabs.org/#projects=com... [15:46:42] Analytics, CirrusSearch, Discovery: Deploy refinery-camus 0.0.23 to fix partition issues with mediawiki.CirrusSearchRequestSet - https://phabricator.wikimedia.org/T118562#1803788 (dcausse) Created https://wikitech.wikimedia.org/wiki/Analytics/Cluster/Deploy_a_fix_to_incorrect_camus_partitionning but t... 
[15:46:43] joal: we should sync up real quick on data loading [15:46:57] milimetric, done [15:47:10] mforns: https://ide.c9.io/milimetric/restbase [15:47:14] let's give that a try [15:47:15] the last part was the largest [15:47:21] ok [15:47:21] mforns: can you send me your link with pageview API docs? [15:47:38] nuria: https://wikitech.wikimedia.org/wiki/Analytics/Analytics_query_service_(v1) [15:47:52] milimetric: can we move that to pageview API? [15:48:01] that's exactly what mforns and I are doing now [15:48:09] (mforns I'm also in the batcave) [15:48:13] it is a lot more self-explanatory and easy to find when we search on wikitech ccmf [15:48:18] oohhh [15:48:18] oh wait [15:48:29] you mean move the page title [15:48:32] yes [15:48:38] ok [15:48:39] um... it's called AQS actually [15:48:55] I was saying on the thread that we should probably stick with that or risk confusing the hell out of everyone [15:48:59] but we can chat [15:49:09] ok, going to batcave [16:01:31] Analytics-Backlog, Privacy, Varnish: Connect Hadoop records of the same request coming via different channels - https://phabricator.wikimedia.org/T113817#1803803 (Ottomata) @bblack, would it be possible to add a uuid `X-Request-Id` header at the varnish level? [16:01:55] Analytics, Traffic, operations, Patch-For-Review: Replace Analytics XFF/client.ip data with X-Client-IP - https://phabricator.wikimedia.org/T118557#1803813 (Ottomata) > Add client_ip field based on the above to the webrequest varnishkafka log format (We can probably start getting this into e.g. oxy... [16:31:29] milimetric: I'm here [16:31:48] milimetric: wassup on data loading ? [16:39:53] milimetric: batcave ? 
[16:40:28] joal: sorry :) yes, we're in batcave [16:40:33] we're working on the spec [16:40:35] ok; joining [16:43:08] joal: nice work on plot on https://phabricator.wikimedia.org/T114379 to explain the differences between bot detection before and after [16:43:14] Analytics-Kanban, CirrusSearch, Discovery, Discovery-Cirrus-Sprint, Patch-For-Review: Setup oozie task for adding and removing CirrusSearchRequestSet partitions in hive - https://phabricator.wikimedia.org/T117575#1778160 (dcausse) Adding a blocking task as we need to make a last minute update to... [16:44:00] nuria: :) [16:48:43] Analytics-Kanban: Create celery chain or other organization that handles validation and computation {kudu} [8 pts] - https://phabricator.wikimedia.org/T118308#1803890 (madhuvishy) [16:50:41] Analytics-Backlog, Reading-Admin, Zero: Country mapping routine for proxied requests - https://phabricator.wikimedia.org/T116678#1803898 (BBlack) Yeah it kinda sucks if they can't actually give us a list of proxy IPs or networks we can maintain. The Via header is nice, but that still leaves us with the... [16:57:12] Analytics-Backlog, Reading-Admin, Zero: Country mapping routine for proxied requests - https://phabricator.wikimedia.org/T116678#1803910 (dr0ptp4kt) I agree with @bblack, "leftmost" on its own isn't a good heuristic (it seems to be okay in this specific case, but not the general one). It has to go right... [17:00:11] Analytics-Backlog: MobileWikiAppDailyStats should not count Googlebot - https://phabricator.wikimedia.org/T117631#1803919 (dr0ptp4kt) [17:00:51] nuria: ottomata: standup :) [17:00:56] milimetric: tring to join [17:01:02] want me call you? 
[17:01:02] *trying [17:01:42] gah [17:04:49] Analytics-Kanban: Bring wikimetrics staging up to date {dove} [1 pts] - https://phabricator.wikimedia.org/T118484#1803922 (Milimetric) [17:10:14] Analytics-Kanban: Troubleshoot Hebrew characters in Wikimetrics {dove} [2 pts] - https://phabricator.wikimedia.org/T118574#1803948 (mforns) NEW a:mforns [17:12:36] Analytics-Kanban: Troubleshoot Hebrew characters in Wikimetrics {dove} - https://phabricator.wikimedia.org/T118574#1803968 (mforns) [17:12:42] (PS1) DCausse: Add 2 payloads map fields to CirrusSearchRequestSet avro schema [analytics/refinery] - https://gerrit.wikimedia.org/r/252956 (https://phabricator.wikimedia.org/T118570) [17:12:51] Analytics-Kanban: Troubleshoot Hebrew characters in Wikimetrics {dove} - https://phabricator.wikimedia.org/T118574#1803948 (mforns) I checked that the mentioned cohorts are storing the Hebrew usernames correctly. And also created a cohort with Hebrew usernames and created reports on that. Everything seems to... [17:14:57] (PS1) DCausse: Add 2 payloads map fields to CirrusSearchRequestSet avro schema [analytics/refinery/source] - https://gerrit.wikimedia.org/r/252958 (https://phabricator.wikimedia.org/T118570) [17:16:55] Analytics, CirrusSearch, Discovery: Deploy refinery-camus 0.0.23 to fix partition issues with mediawiki.CirrusSearchRequestSet - https://phabricator.wikimedia.org/T118562#1803995 (EBernhardson) I'm fine losing the old data, we haven't built up anything around this yet. 
- https://phabricator.wikimedia.org/T97294#1804027 (Ottomata) [17:39:38] Analytics-Cluster, Analytics-Kanban, operations, Patch-For-Review: Turn off webrequest udp2log instances. - https://phabricator.wikimedia.org/T97294#1804032 (Ottomata) p:Triage>Normal [17:40:10] Analytics-Cluster, Analytics-Kanban, operations, Patch-For-Review: Turn off webrequest udp2log instances. - https://phabricator.wikimedia.org/T97294#1238209 (Ottomata) Let's turn off the udp2log instance on erbium. @Jeff_Green, if you say this ok, we will do it! [17:40:35] Analytics-Backlog, Reading-Admin, Zero: Country mapping routine for proxied requests - https://phabricator.wikimedia.org/T116678#1804038 (BBlack) Well, even then, junk or untrusted ones we don't want to pay attention to may not always be private IPs. People also use browser extensions or other local ha... [17:41:16] Analytics-Cluster, Analytics-Kanban, operations, Patch-For-Review: Turn off webrequest udp2log instances. - https://phabricator.wikimedia.org/T97294#1804040 (Jgreen) >>! In T97294#1804032, @Ottomata wrote: > Let's turn off the udp2log instance on erbium. @Jeff_Green, if you say this ok, we will do i... [17:46:09] Analytics-Kanban, EventBus, Services: Package EventLogging and dependencies for Jessie - https://phabricator.wikimedia.org/T118578#1804053 (Ottomata) NEW a:Ottomata [18:03:17] milimetric: great article :) [18:03:39] milimetric: I added a comment about top per year (probably not feasible, too big) [18:03:50] but except for that; looks great to me ! [18:04:39] a-team, I'm off for tonight. I'll monitor / restart cassandra backfilling over the weekend. [18:04:46] Have a good one ! [18:05:03] laters! [18:34:28] Analytics-Cluster, Analytics-Kanban, operations, Patch-For-Review: Turn off webrequest udp2log instances. 
- https://phabricator.wikimedia.org/T97294#1804301 (Jgreen) [18:34:55] Analytics-Backlog, Reading-Infrastructure-Team, User-bd808: Create user defined function to classify network origin of an IP address - https://phabricator.wikimedia.org/T118592#1804308 (bd808) NEW a:bd808 [18:38:34] Analytics, Design: Collect font support metrics - https://phabricator.wikimedia.org/T108879#1804340 (Tgr) Existing examples of doing font detection this way are [[ http://flippingtypical.com/ | flipping typical ]] and [[ http://www.lalit.org/lab/javascript-css-font-detect/ | fontdetect.js ]] (the latter w... [18:41:17] milimetric, do you remember if top is available monthly and yearly? [19:25:34] Quick Question: Is the "event_userSessionToken" field in eventlogging the same as the "xxwikiSession" cookie? [19:30:05] omg, it's so sweet to see ewulczyn_____ here. :D [19:30:57] ottomata: I made a task for hashed IP discussion and assigned it to you, not as an authoritarian assignment, but as a friendly assignment. :D ewulczyn_____ and I would benefit from a longer chat with you today if you're around. [19:37:36] sure, am around gimme a few mins [19:53:47] hey nuria yt? [19:54:22] oh no, it's Friday [20:25:13] Analytics-Kanban, EventBus, Services, Patch-For-Review: Package EventLogging and dependencies for Jessie - https://phabricator.wikimedia.org/T118578#1804606 (Ottomata) Turns out we won't need the one I just made for dotted. We do need a later version of tornado. There is this: https://packages.de... [21:40:25] Analytics-Backlog: Community has a Stats landing page with links - https://phabricator.wikimedia.org/T117496#1804707 (Addshore) [22:03:47] Analytics, Traffic, operations, Patch-For-Review: Replace Analytics XFF/client.ip data with X-Client-IP - https://phabricator.wikimedia.org/T118557#1804749 (BBlack) >>! In T118557#1803813, @Ottomata wrote: > Hm, not sure about this. `ip`, maybe. 'ip' is mostly-pointless even now, as it's almost...
[22:08:39] ottomata: are you still around for a quick meeting? [22:11:19] ya [22:11:22] am here [22:11:39] ewulczyn: ja [22:14:27] ewulczyn: where? [22:14:50] ottomata: hangout? [22:16:32] ja, where? :) [22:16:33] batcav? [22:16:38] ja [22:16:42] https://plus.google.com/hangouts/_/wikimedia.org/a-batcave [22:16:47] oh you are there [22:38:46] (PS1) BryanDavis: Rename ipAddressMatcherCache -> trustedProxiesCache [analytics/refinery/source] - https://gerrit.wikimedia.org/r/253045 (https://phabricator.wikimedia.org/T118592) [22:38:48] (PS1) BryanDavis: Add UDF for network origin [analytics/refinery/source] - https://gerrit.wikimedia.org/r/253046 (https://phabricator.wikimedia.org/T118592) [22:56:53] i tried to run a query in hive but got 'GC overhead limit exceeded'. There are no joins, this is a simple filter on the wmf.webrequest table over a few days. The query works when limited to a single hour, but when expanding it to cover 10 days (the days when our test was running) i ran into this issue, any ideas? [22:57:16] the query is: INSERT OVERWRITE LOCAL DIRECTORY '/home/ebernhardson/hive' ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t' STORED AS TEXTFILE select uri_host, uri_path, uri_query, accept_language, referer from wmf.webrequest where year=2015 and month=11 and day between 3 and 13 and x_analytics_map['wprov'] = 'iwsw1'; [22:57:41] also, relatedly but not the same issue, running that query set to a single hour works fine in hive, but if i do it in beeline i don't get an output directory [22:59:03] although beeline does say: INFO : Copying data to local directory /home/ebernhardson/hive from hdfs://analytics-hadoop/tmp/hive/ebernhardson/80234ac4-34ad-447a-93f4-0b4a265d95b4/hive_2015-11-13_22-55-23_399_1362233851818051683-6/-mr-10000 [22:59:07] there just isn't anything there [23:01:58] ebernhardson: Hmmm not too sure but have you tried doing export HADOOP_HEAPSIZE=1024 before running it using hive?
not sure about beeline's behavior [23:02:12] madhuvishy: i have not, will try. thanks [23:05:48] madhuvishy: looks to have done the trick, it's counting down now. thanks again! [23:05:58] oh cool! [23:06:02] np [23:21:37] Analytics-Backlog, Fundraising research, Research-and-Data: FR tech hadoop onboarding - https://phabricator.wikimedia.org/T118613#1804907 (DarTar) NEW [23:39:36] Analytics-Backlog: MobileWikiAppDailyStats should not count Googlebot - https://phabricator.wikimedia.org/T117631#1804931 (Tbayer) FWIW, here is a closer look at how frequently this has occurred over time - while the ratio of Googlebot-run active apps is low currently, it has reached up to 83.6% in the past (...
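[editor's note] The heap workaround madhuvishy suggested above can be sketched as a shell session. The HQL is verbatim from the log; the `/tmp/wprov_extract.hql` path is an illustrative choice, and the `hive -f` invocation is left commented out because the query only runs on the analytics cluster:

```shell
# HADOOP_HEAPSIZE (in MB) sizes the client-side JVM that the hive CLI
# launches; the default can be too small when query planning touches
# ~10 days of hourly webrequest partitions, which is what produces the
# 'GC overhead limit exceeded' error seen above.
export HADOOP_HEAPSIZE=1024

# Write the query from the log to a file so it can be run with hive -f.
cat > /tmp/wprov_extract.hql <<'HQL'
INSERT OVERWRITE LOCAL DIRECTORY '/home/ebernhardson/hive'
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
STORED AS TEXTFILE
SELECT uri_host, uri_path, uri_query, accept_language, referer
FROM wmf.webrequest
WHERE year = 2015 AND month = 11
  AND day BETWEEN 3 AND 13
  AND x_analytics_map['wprov'] = 'iwsw1';
HQL

# hive -f /tmp/wprov_extract.hql   # only runnable on the analytics cluster
```

This likely also explains the beeline puzzle from the log: under beeline the statement executes on the HiveServer2 host, so `INSERT OVERWRITE LOCAL DIRECTORY` writes to that host's local filesystem rather than the client machine's, which is why nothing appears in the client's `/home/ebernhardson/hive`.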