[01:23:51] halfak: in the Scorer class, what is "source" in "def __init__(self, source)" ? [01:24:15] danilo, something like revscores.APIExtractor. [01:25:26] hmm, ok [01:26:59] Yeah. It's a little weird. I made a card for that tech debt. [01:27:08] https://trello.com/c/NsNp5OSD/20-apiextractor-source-featuresource-datasource [02:12:10] halfak: some languages will not provide some features, right? when writing a MLScorerModel, how can I say to model only use a feature if the language has suport for it? [02:23:35] danilo, in the MLScorer case, that'll be OK. [02:23:44] You pick the features when you train a model. [02:23:57] In the case of a custom rule-based scorer, I guess that's not so clear. [02:36:31] halfak: MLScores extract the features using 'self.model.features', if a particular model uses features that is not provide for some languages, we need to use a filter for features that depends on the language, won't we? [02:37:15] example: self.source.extract(rev_id, self.model.features) -> self.source.extract(rev_id, self.model.features(language)) [12:22:16] morning. [13:42:57] darnit, where's leila? I need a leila [13:43:10] also, I've just realised her name has the right number of syllables to fit in with Eminem's "I Need A Doctor". [13:43:14] perfect. [14:12:07] morning halfak :) [14:12:15] o/ Ironholds [14:12:26] I've been doing STATS [14:12:29] it's mad. [14:12:57] the dip test shows that the sessions-per-user is bimodal but not significantly enough to factor into testing. [14:14:25] I wanna talk to leila just to make sure I'm not totally crazy, though. [14:33:54] ooh, it's sunny outside [14:34:00] I declare today spraypainting day! [14:41:01] whatcha sparypainting? [14:41:06] 3Quarry, Tool-Labs-tools-tsreports: Quarry-TSreports feature parity - https://phabricator.wikimedia.org/T78549#848037 (10valhallasw) 3NEW [14:43:34] I want to redo my haloween costume [14:43:52] say, do we have any machines that have R but don't have libmaxminddb? [14:43:54] I wanna test something. [14:44:59] Ironholds, you could use your local machine. [14:45:09] that does have libmaxminddb [14:45:12] that's where I built the library :D [14:47:22] wait, got ti [14:47:23] *it [14:48:24] yay, it breaks! [14:51:26] halfak, mind installing an R package locally and testing if a couple of commands run? I promise it's benign. [14:55:56] halfak: i prepared some more data for experimenting! [14:56:05] https://wikitech.wikimedia.org/wiki/Analytics/Cluster/xmldumps#Results [14:56:27] got time today to show me how you run a streaming task? doesn't matter what, just something you've already got [14:56:34] that we can run on each format [14:59:11] am I meant to be in the analytics standup? [14:59:40] only if you want to [14:59:46] i think everyone was just invited [15:00:51] ottomata, will email [15:01:41] huh. First invite I've got, I think. [15:02:10] Ironholds, it's so that we can all show up in the batcave. [15:03:05] gotcha [15:04:05] Google is weird about invites. [15:04:52] huh. this query just won't launch [15:05:01] there's nothing on the cluster, it just...sits there, for no apparent reason [15:07:56] Ironholds, you sure the cluster is free? [15:08:13] That sounds like the behavior when I was hammering the cluster with a streaming job. [15:08:27] oh, you're right [15:08:32] there's now a load of stuff running [15:08:45] but none of them are crazy-big [15:09:50] Could be one of the jobs that ottomata is testing re. diff parsing. [15:11:20] naw, not running anything right now [15:11:53] nothing much there except for the usual camus/webrequest jobs [15:12:06] Gotcha. email sent. [15:12:23] adnke [15:12:24] danke [15:12:48] ottomata, still want me to run tests or should I leave them to you? [15:17:36] oh, looks like it was running out of heapsize [15:17:47] halfak, apparently you and I have to negotiate over ottomata's time today [15:17:51] I...don't know what that looks like. [15:17:56] Do we have a rap battle? dance-off? [15:17:58] knife fight? [15:18:39] ha, its ok, we'll just do it asynchronously :) [15:18:59] Nope. I'm good. Can run tests myself :) [15:19:01] I am going to try to run some jobs and do experiments to fill in this table: [15:19:01] https://wikitech.wikimedia.org/wiki/Analytics/Cluster/xmldumps#Results [15:19:12] halfak, okie! [15:19:14] halfak: you can run those? [15:19:42] ottomata, well, I can run the streaming tests I have on the datasets you produced. [15:19:57] ok awesome, do you know if the streaming tests will report cumulative cpu time and # mappers? [15:20:11] Not sure. I'm looking into that now. [15:20:13] ok [15:20:18] also, halfak, don't run on avro-uncompressed for now [15:20:21] i think it is bad data [15:20:23] not correct [15:20:24] kk [15:20:32] i tried to uncompress the snappy one and I think I did it wrong [15:20:39] but that was end of day friday so I stopped trying :) [15:21:00] the json data is yours, so in your format (with page.redirect.title vs page.redirect_title) [15:21:22] the parquet one will need a different input format [15:21:23] hm [15:21:45] ottomata, we could just handle both schemas in the script. [15:21:49] hmmm, the hive table is using: parquet.hive.DeprecatedParquetInputFormat [15:21:52] aye [15:21:53] Rather, the script doesn't look for redirect_title [15:22:50] Oh! Wait, I see. You must specify schema. Hmm.. I could convert the json to the new schema. I have a script for that. [15:23:27] k if you like...you must specify schema? [15:23:41] i thikn the data will look like json to your jobs either way, right? [15:23:51] and i thank that is the only real difference between the schemas [15:23:53] It did when I ran snappy/avro [15:24:10] yes, you will need the schema specified when running with avro, because the input format needs it [15:24:18] but, your json test doesn't need that, (right?) [15:24:33] and, you could just have the logic check for redirect_title vs redirect.title [15:24:34] re. "you must specify schema" -- I figured that avro stored the field names in a schema that was referenced for each record. [15:24:40] hm. [15:24:44] wait [15:24:50] But either way, the python doesn't care. [15:24:55] it should be ok without schema, in this case, the schema should be stored in the header of each file [15:24:57] hm. [15:24:59] So long as the input format converts to json then. [15:25:02] (it isn't always) [15:25:06] will have to look into that [15:25:12] also, will ahve to look into how to do parquet + streaming [15:25:16] not sure atm [15:25:20] kk. [15:25:30] anyway, i gotta run for a bit, i will be back shortly (getting coffee and running an errand) [15:25:58] So, I have some other stuff to do before 3PM PST, but I'll pick up this stuff as soon as I can. [15:26:06] If you beat me to it, let me know :) [15:27:52] k [15:59:07] wow [15:59:30] the query is not running, with no jobs in front of it :/ [16:03:42] java.lang.OutOfMemoryError: Java heap space [16:03:44] that explains it. [16:10:41] ok Ironholds back [16:10:51] yay! [16:10:53] wasssuuup [16:12:54] so, I wrote a prototype query for grabbing pageviews under the new definition [16:12:57] testing it on an hour's worth of data [16:13:10] one MINOR problem: even with export HADOOP_HEAPSIZE=1024, it idles for ages and then runs out of memory. [16:13:15] ...this seems suboptimal. [16:13:30] how much data are you trying to run it on? [16:13:42] an hour of text and mobile [16:13:56] midday UTC on 7 August, if that makes a difference ;p [16:14:33] 7 august? [16:14:38] there is no data for august [16:14:47] we only keep 30 days [16:17:21] Ironholds: ^? [16:18:04] December [16:18:06] sorry, brainfart [16:22:40] ottomata, December! sorry, my brain melted [16:23:25] ah ok [16:24:02] Ironholds: tell me how to reproduce [16:24:44] ottomata, https://gist.github.com/Ironholds/96558613fe38dd4d1961 [16:25:58] halfak: [16:25:59] oh [16:25:59] btw [16:26:10] ah, nm, will reply via email :) [16:47:03] Ironholds: how long does it take for you to get the OOM? [16:47:10] a good five minutes [16:47:13] in the meantime it just idles [16:47:56] Ironholds: I think this is not a hive client OOM [16:48:04] hurm [16:48:08] what do you think's going on? [16:48:14] i thikn this oom is on the cluster, not sure yet [16:48:18] 999 reducers is a lot. [16:49:46] Ironholds: I'm looking through this presentation, there are some goodies in there, you might want to check it [16:49:47] http://www.slideshare.net/altiscale/debugging-hive-with-hadoop-in-the-cloud [16:50:58] thanks! [17:39:13] lzia: loved that link you gave on mailing list, which I didn't know :) [17:39:16] https://en.m.wikipedia.org/wiki/Kaarle_Krohn [18:22:20] hey y’all [18:22:44] problem solved, flight finally took off, 8 hours later [18:22:54] glad to hear it DarTar [18:23:56] yeah, I was freaking out when seeing other flights do the exact same thing (another AA flight stopped over Scotland, made a U turn and returned to LHR roughly around the same time) [18:46:21] Ironholds: I think we need to examine your query, i'm not really sure. i increased heapsize on the server. your job isn't actually being submitted to hadoop [18:46:26] so hive must be getting stuck on it somewhere [19:01:33] heya halfak, do you have bandwidth to jump into an invitebot meeting with fhocutt and I? [19:01:48] if not today, I can circle back with you later. [19:01:53] Yup. Running a little bit late from the last meeting. [19:02:48] ottomata, huh. Any theories? [19:02:50] halfak, sorry! :/ [19:04:58] Ironholds: not yet. [19:05:12] is it possible to get rid of the subquery? [19:05:22] and just run it all as one query? [19:05:22] 'hmn [19:05:32] i forget, can you reference the aliased fields (like project_suffix) inside the same query? [19:05:39] i am thinking you cannot... [19:05:44] I do not think so [19:05:49] they're only aliased BECAUSE it is a subquery, mind [19:05:57] yeah, i thikn i've done that before too [19:06:00] subqueries without aliases work fine for qchris but make hive complain whenever I use them ;p [19:06:11] I think the server likes his beard, or something. [19:07:21] {{cn}} [19:13:16] Ironholds: ~ how many different qualifiers will there be? [19:13:30] qualifiers? [19:13:46] oh [19:14:03] >= 700, I guess? [19:14:08] however many projects we have right now [19:14:33] ok [19:51:30] hey halfak wanna hear something hilarious? [19:51:35] sure [19:51:38] The gerrit ops seem to have accidentally given me +2...everywhere. [19:51:56] fundraising repos? Yep. MW core? Yep. [19:52:19] ...so that should change. [20:06:27] Ironholds, loool [20:17:55] halfak: we have things set up so that if you want to view graphite data, you automatically get ability t o +2 a lot of things [20:18:25] This is SO DUMB ;p [20:18:53] "I'm sorry, sir, but if you want a bottle of water you'll also have to accept this free tactical nuclear weapon" [20:19:03] "none of the buttons are marked in a language you read, have fun" [20:20:50] That... makes no sense. [20:21:22] halfak: so when I got +2 everywhere for same reason about… 2 years ago? I protested as well, and was told ‘well, when you were hired into WMF we trust you to not do stupid things, so do not do stupid things' [20:21:31] and nobody who has +2 has done stupid things so far [20:21:32] so... [20:21:47] Ironholds: I mean, SWalling had same amount of +2/-2 powers you do now :) [20:21:59] yes, but I don't consider him as incompetent as I consider myself. [20:22:13] it’s ok, we have a ‘revert’ button too [20:22:32] Oh. Maybe not so bad then. [20:27:30] okay, this memory leak is now officially pissing me off [20:27:37] it's a STACK OBJECT. How is a stack object leaking?! [20:29:05] bits of it get stuck? [20:31:01] I give up. [20:34:47] Ironholds: valgrind is your friend [20:36:28] oh, I've found the line causing it [20:36:32] I just don't know why it's causing it. [20:41:54] if only I knew someone who had a ton of experience writing C++, YuviPanda! :P [20:43:05] if only [20:45:32] YuviPanda, https://github.com/Ironholds/rgeoip/blob/master/src/lookup.h#L41 [20:45:40] that line. That line is causing memory leaks. And I do not understand HOW. [20:46:18] Ironholds: what does converters::stcp return? [20:47:20] YuviPanda, character pointe--oh. [20:47:40] see https://github.com/Ironholds/rgeoip/blob/master/src/converters.h#L20-L25 [20:48:06] yeah, so someone needs to free that [20:48:10] yup ;p [20:48:11] 3Tool-Labs-tools-tsreports, Quarry: RSS feeds - https://phabricator.wikimedia.org/T60830#848877 (10valhallasw) [20:48:13] hum [20:48:18] trying to think of how I'd even get to it... [20:48:26] also who is freeing ‘result’? [20:48:31] I guess I'll have to convert distinctly [20:48:38] Ironholds: put it in a temp var, free that at end? [20:48:41] exactly [20:48:54] re "result", the underlying struct is ALLEGEDLY freed when the database is. [20:49:02] aaah [20:49:03] right [20:49:05] I'll see what happens when I fix the pointer problem. [20:49:22] 3Quarry: Show a list of recently finished queries (possibly also via RSS) - https://phabricator.wikimedia.org/T60823#848882 (10valhallasw) [20:49:22] yeah [20:49:52] 3Quarry: Create 'reports' feature - https://phabricator.wikimedia.org/T78593#848886 (10valhallasw) 3NEW [20:50:07] 3Quarry: Show a list of recently finished queries (possibly also via RSS) - https://phabricator.wikimedia.org/T60823#601832 (10valhallasw) [20:50:32] 3Quarry: Create 'reports' feature - https://phabricator.wikimedia.org/T78593#848886 (10valhallasw) [20:50:53] 3Tool-Labs-tools-tsreports, Quarry: Quarry-TSreports feature parity - https://phabricator.wikimedia.org/T78549#848902 (10valhallasw) [20:53:09] 3Tool-Labs-tools-tsreports, Quarry: Provide redirects from old tsreports reports to new quarry reports - https://phabricator.wikimedia.org/T78595#848910 (10valhallasw) 3NEW [20:55:57] mwahaha, I’ve succeeded in bringing botspam to this channel [20:56:05] 3Quarry: Ideas for reports - https://phabricator.wikimedia.org/T78597#848934 (10valhallasw) 3NEW [20:56:18] oh? [20:56:27] motherFU- YuviPanda that call was the source of /all of the memory leaks/ [20:56:36] I have no idea why I didn't look at the arguments. I'm a frickin' idiot. [20:56:37] thank you! [20:56:43] Ironholds: [20:56:44] why [20:56:45] AND uri_path NOT LIKE('undefined') [20:56:47] Ironholds: :D [20:56:50] you are not using any wildcards [20:56:53] why not just != ? [20:57:11] 3Quarry: REPORTS-49 IP block SQL Query on en-wiki - https://phabricator.wikimedia.org/T60835#602808 (10valhallasw) [20:57:28] ottomata, because uri_paths consist of /dir/undefined [20:57:54] and I do not know if /wiki/undefined is the constant, or if all the localised project dirs can also have /undefined [21:29:31] helder, you around? [21:31:53] kind of [21:32:15] what is up halfak ? [21:32:35] Got a second to review a simple pull request? [21:32:42] I think so [21:32:44] https://github.com/halfak/Revision-Scoring/pull/12 [21:33:07] I just fixed an issue that occurred when an edit was the page creation. [21:33:26] danilo is currently using this, so I'd like to have the fix in master. :) [21:34:42] halfak: I commented too [21:35:45] halfak: does ".parent_id" corresponds to the "previous revision"? [21:35:52] It does. [21:36:15] and the "" in 'previous_rev_doc.get("*", "")' is used as a default value in case "*" doesn't exists? [21:36:21] Yup. [21:36:42] ok, so my only concern is about the language changes [21:36:49] from there: "I think the language should be changed to Portuguese (and the import command too), for consistency. [21:36:49] " [21:36:50] You're right. I'm updating that now. [21:38:40] how does it works when a pull request needs to be updated? [21:38:49] in gerrit we have a new patch set, [21:38:59] is there something analogous on github? [21:39:00] Helder: you just push another commit [21:39:31] will it override the current one? [21:39:44] Indeed. [21:39:51] hmm ok [21:42:53] halfak: should we listen to YuviPanda about the language changes? "Unrelated change, should've been in a separate commit" [21:43:35] I think that YuviPanda is trolling us. [21:43:39] :P [21:43:47] no! [21:43:48] honest [21:45:05] it makes sense, however the change will also help me when testing the extractor on ptwiki (for two times I forgot to change the English -> Portuguese and just replaced the en -> pt in the API url) [21:45:21] The change to the demo script tests the issue I resolved. [21:45:40] The demoscript is a crappy test, I agree. [21:46:29] and danilo also had problems figuring that out today [21:48:48] merged [21:49:14] Great. Thanks. [21:49:21] danilo, ^ [21:50:32] thanks [21:50:50] * Helder wonders if there is and/or it is viable to have some nice graphs for history pages on wikis, analogous to [21:50:50] https://github.com/halfak/Revision-Scoring/network [21:51:40] YuviPanda: I <3 the idea of providing a backward-compatible RCStream+ [21:52:00] DarTar: indeed, it’s really tempting to do as well. should be fairly simple. [21:52:03] backward compatible? [21:52:04] glad that Max started that thread, I’m partly the culprit [21:52:23] halfak: yeah, in the sense it’ll have same config as RCStream except will have diffs too [21:52:24] halfak: someone suggested making this a parameter [21:52:28] yup [21:52:39] Gotcha. [21:53:55] also, this is going to make it way easier to plug RCStream into IFTTT (without the hurdle of extra API calls on their end) [21:56:37] IFTTT? [21:56:47] brb [21:58:07] DarTar: oh, I don’t think we can actually put RCStream+ *in prod*, at least anytime soon... [21:58:26] well, we theoretically could, but that’ll kill save performance on the cluster :) [21:58:56] "kill save"? [22:00:28] halfak: yes, RCStream works by having us emit a message *during* the save operation on each edit on the culture, and then that goes through a pipeline... [22:00:42] halfak: so if we try to gather diff info during that, we’ll have toa ctually generate the diff on save, and that kills performance [22:01:00] Oh! [22:01:08] Good to know. [22:01:20] Yeah... generating diffs is computationally complex :) [22:01:38] halfak: now i got the error http://pastebin.com/e5q2PZD5 again, at the revision https://pt.wikipedia.org/w/index.php?diff=40837381 (a deleted revision) [22:01:39] BUT if we could stand up a service that listens to RCStream and creates its own diff... [22:02:16] danilo, gotcha. I might have to leave this one to you. Although at least this time, the issues is only in revscores. [22:03:01] Ironholds: do we want to continue using site qualfiers like webstatscollector does? [22:03:13] language_and_site + project_suffix [22:03:14] ? [22:03:19] danilo, I think that we'll want to skip deleted revisions either way. [22:03:37] ottomata, honestly I don't know. I mean, at the moment I just use a straight regex [22:03:54] right, but i mean, you are making this thing up from scratch [22:03:54] ((commons|meta|species)\\.((m|mobile|wap|zero)\\.)?wikimedia\\.)|(? my question is why do the weird suffix thing? [22:04:07] yeah, I just stole it on account of being convenient [22:04:08] the big case statement that abbreviates things [22:04:11] ok [22:04:13] i think it is confusing [22:04:18] I don't see any reason why we can't just filter with ((commons|meta|species)\\.((m|mobile|wap|zero)\\.)?wikimedia\\.)|(? don't do what webrequest does! [22:04:19] agreed [22:04:23] agree. [22:04:33] danilo, either way, if you could record a collection of revids that fail, I'd like to use them to test the system later. :) [22:04:47] halfak: yeah, that’s what I was going to do (setup a service, listen to RCStream, create own diff, stream) [22:04:50] halfak: should be fairly trivial to do [22:05:01] +1 YuviPanda. [22:05:17] Now, do these people really want a stream or do they want something they can replay if they go offline? [22:05:35] hmn. we need to filter out .m, then [22:05:44] urgh [22:05:51] an unrelated query is not launching and I cannot work out why [22:08:13] halfak: ok [22:08:31] YuviPanda, halfak: yes, *that* [22:08:40] > BUT if we could stand up a service that listens to RCStream and creates its own diff... [22:09:04] halfak: I know use cases for each of these two scenarios [22:09:13] Indeed DarTar [22:09:42] there are people interested in a stream just to be able to filter and generate other events, where losing data is not an issue [22:09:52] vs people interested in building an incremental corpus [22:10:19] edsu-style bots are a good example of the former [22:10:40] metamarkets or datasift reconstructing the whole RC log from the stream are example of the latter [22:12:37] Ironholds: link me to pageview definition on meta? [22:14:13] DarTar, or Snuggle. [22:14:27] Or counter-vandalism tools. [22:14:30] halfak: yup [22:14:32] And page patrollers [22:14:40] ottomata, https://meta.wikimedia.org/wiki/Research:Page_view/Generalised_filters [22:14:42] Don't want to miss something when the system hiccups [22:14:59] Ironholds, dataset descriptions complete. New edit dataset uploaded. [22:15:13] yaaay! [22:15:14] I was thinking of actual *external* consumers (not directly affiliated with the Wikimedia movement), but you’re totally right [22:15:15] you happy with it? [22:15:39] Ironholds, I think we are ready to publish. If we get questions, we can expand the docs. [22:16:00] One thing that might be nice to do is to use your session code on the datasets we have here as a test. [22:16:36] Can do! [22:16:42] This evening, when my brain works :D [22:16:54] okay, this is ridiculous. Why will this query not even launch? [22:18:57] Ironholds: ok [22:19:01] i think this is not correct [22:19:02] but. [22:19:05] will you try something? [22:19:46] https://gist.github.com/ottomata/8a65b3468bc981d43c37 [22:20:54] sure! [22:21:07] the result i got doesn't look right to me, but it does something! [22:21:32] awesome! [22:21:47] Okay, see, THIS query WORKS [22:21:51] why the hell is the other one not launching? [22:22:10] i don't know, but hive sure is spending a long time doing something, eventually OOming [22:22:17] AND, this is a better way to do it anyway :) [22:22:36] no, different different query [22:22:37] https://gerrit.wikimedia.org/r/#/c/180023/1/refinery-core/src/main/java/org/wikimedia/analytics/refinery/Webrequest.java [22:22:39] something legal asked me for [22:22:43] oh [22:24:32] Ironholds: so, do those results look wrong to you? [22:24:36] the counts seem way too low to me [22:24:58] oh dear. yes, much too low ;p [22:25:13] also, wow, that was FAST! [22:25:40] halfak: DarTar well, it’s also trivial to cache, say, the last hour of changes, and let you provide a ?continue parameter to continue streaming from there [22:25:42] ok so, do you happen to have a good list of test requests? [22:25:51] ottomata, matches(): is that a direct match, or is that a regex? [22:25:57] that is a regex [22:26:01] its called on the pattern object [22:26:03] YuviPanda, what if I need to bootstrap? [22:26:08] and, does Java need to escape semicolons, or just Hive? [22:26:23] oh! [22:26:25] hm [22:26:25] lines 36-41 [22:26:26] probably not! [22:26:27] OOoo [22:26:29] yeah. [22:26:33] trying that [22:26:36] you need to do that in hive, because hive genuinely can't determine when ; is in a string or not [22:26:38] stupid hive ;p [22:26:43] halfak: ah, then you’ve to use the dump anyway, and RCStream won’t help you. also if you need to bootstrap, you probably shouldn’t be streaming all the requests from the start of time [22:26:45] err, revisions [22:26:47] aye [22:27:07] YuviPanda, but what if that is necessary for what you are doing? [22:27:11] e.g. Wikitrust [22:27:31] halfak: bootstrap from dump, hit the API until ‘current’, and then keep up with RCStream+? [22:27:32] Ironholds: if you are looking at that code, tell me how to get rid of proejctSuffix and languageAndSite [22:27:37]