[03:45:40] evening fhocutt
[03:45:49] evening, Ironholds.
[04:03:27] evening lzia
[04:03:36] evening Ironholds. :-)
[04:05:37] Evening all and GOODNIGHT! :)
[04:05:53] good night halfak.
[17:06:11] nobody is in the batcave?
[17:08:19] i'm trying to python
[17:08:31] gotcha
[17:08:36] it's hard.
[17:10:09] Right now I am looking for Python's equivalent to PHP's isset(); I'm probably a jerk for wanting one :(
[17:14:17] There's a thing I am using that works but I don't know that it works for while loops?
[17:14:58] ahh
[17:15:08] luckily I know a total python-head or twelve!
[17:15:13] DarTar, ^
[17:15:40] I'm trying to get around API limits by iterating over each continue
[17:15:55] and I test whether I need to continue by checking for a certain bit of JSON
[17:16:08] but Python gets upset at me for assuming this bit of JSON exists when it doesn't
[17:18:25] http://pastebin.com/LVFScFdR
[17:19:39] what are you trying to get out of the API?
[17:20:44] a list of all subpages of a given page. the list itself isn't interesting so much as what i am going to do with the list
[17:53:00] morning fhocutt, wb harej!
[17:53:06] aye
[17:53:18] I think I fixed my problem. I just needed to do more of what I was already doing. And now I have... other problems.
[17:53:35] morning, Ironholds
[20:53:47] Ironholds: is there a way in the page table to identify redlinks?
[20:54:03] or, is that the right table to look into?
[20:54:43] * Ironholds thinks
[20:55:17] I mean, that'd just give you pages that used to exist
[20:55:43] page table, Ironholds? I thought that's all the pages, deleted/archived or not
[20:55:56] yes, but it doesn't have an entry for pages that never existed
[20:56:02] because it doesn't know those are pages
[20:56:05] oww, I see what you say
[20:56:13] I have an idea. Let me run a test.
[20:56:55] yup
[20:56:59] leila, you want the pagelinks table
[20:57:12] lemme look
[20:57:22] you want to join it on to the page table in a way that leaves you with just "those entries where pl_title is not found in page_title"
[20:57:38] that will unfortunately exclude pages that used to exist, though.
[20:57:40] ah I see. that makes sense. thanks!
[20:57:44] cool
[20:57:50] well, those I can get out of archive?
[20:58:54] actually, it's fine. I don't need to worry about pages that used to exist. I need the current snapshot of redlinks. but to your point, I think looking at archive table can solve that.
[20:58:57] thanks Ironholds
[20:59:51] np
[21:19:19] wheeeee
[21:19:24] Coren got web proxies working
[22:43:27] grrrr
[22:43:37] * Ironholds stares daggers at Shiny
[22:44:21] DarTar, http://datavis.wmflabs.org/where/
[22:44:31] I got a Shiny instance spun up on Labs where we can host multiple visualisations/sets/etc
[22:44:43] and yes, I am KEEPING datavis.wmflabs.org ;p
[23:12:23] Ironholds: that’s cool, as long as it’s not dataviZ ;)
[23:13:27] DarTar, nobody speaks your language
[23:13:32] DEATH TO AMERICOCENTRISM
[23:14:19] seriously, pretty cool to have a shiny server up
[23:14:30] I gotta read some docs
[23:14:32] DarTar, yep! Now go reply to my email about user agents so I have multiple things to host there ;p
[23:14:43] I'm happy to do a walkthrough of spinning up shiny visualisations, too.
[23:17:57] I missed the UA idea, but that’s pretty cool though. I don’t think there are privacy implications with extracting and tallying OS / browser information but looping in Michelle sounds like a good plan. I imagine you’d be bucketing anything that doesn’t match the top strings into other?
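
[A minimal Python sketch of the continuation loop being described, in place of the pastebin code (not reproduced here). It assumes the list=allpages module for enumerating subpages and the API's top-level "continue" object; older MediaWiki versions signalled continuation with "query-continue" instead. The endpoint, namespace, and function name are illustrative.]

    import requests

    API_URL = "https://en.wikipedia.org/w/api.php"  # illustrative endpoint

    def list_subpages(base_title, namespace=2, session=None):
        """Yield subpages of base_title (e.g. namespace=2, "Example" for
        User:Example/*), following API continuation until no "continue"
        key comes back."""
        session = session or requests.Session()
        params = {
            "action": "query",
            "list": "allpages",
            "apnamespace": namespace,      # allpages wants the namespace separately
            "apprefix": base_title + "/",  # subpages share the "Title/" prefix
            "aplimit": "max",
            "format": "json",
            "continue": "",                # opts in to the newer continuation style
        }
        while True:
            data = session.get(API_URL, params=params).json()
            for page in data.get("query", {}).get("allpages", []):
                yield page["title"]
            # The closest things Python has to PHP's isset(): the `in` test,
            # dict.get(), or try/except KeyError. Testing membership here is
            # what keeps the while loop from blowing up when the key is absent.
            if "continue" not in data:
                break
            params.update(data["continue"])

[Calling list(list_subpages("Example")) would then pull every User:Example/... title across however many batches the API needs.]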
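[And a sketch of the pagelinks-to-page anti-join Ironholds describes for the redlinks question: link targets with no matching row in page are the current redlinks. It assumes the classic pagelinks schema (pl_namespace, pl_title) and a Labs-style replica; the host, database name, and pymysql are illustrative choices, not anything specified in the discussion.]

    import os.path
    import pymysql

    REDLINKS_SQL = """
    SELECT pl.pl_namespace, pl.pl_title, COUNT(*) AS inbound_links
    FROM pagelinks AS pl
    LEFT JOIN page AS p
           ON p.page_namespace = pl.pl_namespace
          AND p.page_title = pl.pl_title
    WHERE p.page_id IS NULL               -- no page row at all => redlink
    GROUP BY pl.pl_namespace, pl.pl_title
    ORDER BY inbound_links DESC
    LIMIT 100
    """

    conn = pymysql.connect(
        host="enwiki.labsdb",                               # illustrative host
        db="enwiki_p",
        read_default_file=os.path.expanduser("~/.my.cnf"),  # replica credentials
    )
    with conn.cursor() as cur:
        cur.execute(REDLINKS_SQL)
        for namespace, title, links in cur.fetchall():
            # Titles may come back as bytes depending on the connection charset.
            print(namespace, title, links)

[Deleted pages drop out of page entirely, so this is the current snapshot of redlinks; anything that used to exist would have to be cross-checked against the archive table, as discussed above.]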
[23:18:05] I like, not missed
[23:18:25] sorry, my brain is running slow today
[23:18:37] DarTar, actually it just wouldn't appear
[23:18:50] and, yeah, I don't think there are problems with the extraction, but I want to release the raw UAs as well
[23:19:08] using the (>50 editors) || (> 1000 requests) threshold to avoid including PII
[23:19:24] hmmm that may misrepresent the stats then, what if 80% for some combination of dimensions are unrecognized
[23:19:51] oh. wait. sorry; rephrase.
[23:20:01] DarTar: non-recognised agents are bundled as "other" all the way down
[23:20:09] I misunderstood what you were saying
[23:20:14] ok cool, that’s what I meant
[23:20:29] so, if "flippertigibbet" makes 200 requests from 60 editors, it'll come out as "Other, Other, Other, 200"
[23:20:36] and device class identification is still an issue, right?
[23:20:43] and it'll be aggregated with any other "Other, Other, Other"
[23:20:52] yep; that's why the call on mobile/desktop is being done based on the site they used
[23:21:14] ok, then you need to fix the column header :)
[23:21:25] class sounds like device class
[23:21:43] i.e. a property of the client, not the destination
[23:22:08] (looking at the screenshot)
[23:23:29] DarTar, will do!
[23:23:35] okay, I should loop in michelle
[23:23:42] I've spent two weeks doing work for her so she owes me a favour.
[23:23:55] the raw UAs are what are potentially a risk
[23:23:57] I think they're fine but
[23:24:23] I’d stay a way from raw as much as I can
[23:24:26] http://datavis.wmflabs.org/ is gonna be the front page, btw
[23:24:29] away, even
[23:24:31] can I ask why?
[23:24:44] because, they are raw, and raw stuff brings diseases
[23:24:56] yeah, which is why we set minimum thresholds for even being evaluated
[23:25:12] I've actually just got back the reader-side ones, so I can do a test to see what N is in "exclude if fewer than N pageviews"
[23:26:00] what extra value do I get from the raw dataset that I don’t get in your aggregate tables?
[23:26:03] 500,000
[23:26:13] DarTar, you? None. The people who maintain the ua-parser we depend on to be able to do this?
[23:26:26] me as in an external consumer, yes
[23:26:28] the people we are no longer providing direct support to, because the project architect we shipped them isn't engineering enough to do engineering?
[23:26:37] the people, I would reiterate, we depend on?
[23:26:48] they get a lot of value, and then they update their parser, and then we get value because our mappings suck less.
[23:27:04] the parsed logs mean engineers make better decisions
[23:27:13] ok, that’s a clear use case then, but before we release anything raw, even truncated above a threshold, we should have a serious privacy audit
[23:27:17] the raw logs mean upstream engineers build tools we can use to generate better parsed logs, making for better-better decisions
[23:27:27] agreed.
[23:27:32] you don’t have to sell to me why supporting ua-parser is important
[23:27:34] DarTar, it's actually only 352 UAs (the reader-side stuff)
[23:27:40] so we can hand-code it looking for stuff if needed
[23:27:49] that's every UA with more than 500,000 pageviews through it in the last month.
[23:27:59] sorry, 335
[23:28:13] wow
[23:28:16] I can throw the raw files onto stat2 if we want to do an audit
[23:28:22] (some of them are, frankly, going to be bots anyway)
[23:28:48] yes, that would be a good start
[23:28:52] kewl
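
[A rough Python sketch of the aggregation rules being agreed on above, with hypothetical field names: anything the parser doesn't recognise collapses into "Other", and a row is only released if it clears the (>50 editors) || (>1,000 requests) bar. Where exactly the threshold is applied (per raw UA or per aggregated bucket) isn't spelled out in the discussion; this version applies it to the aggregated rows.]

    from collections import defaultdict

    MIN_EDITORS = 50       # release thresholds from the discussion above
    MIN_REQUESTS = 1000

    def aggregate(rows):
        """rows: (browser, os_family, access_class, editor, requests) tuples,
        with None for anything ua-parser could not recognise."""
        buckets = defaultdict(lambda: {"editors": set(), "requests": 0})
        for browser, os_family, access_class, editor, requests in rows:
            # Unrecognised dimensions are bundled as "Other" all the way down.
            key = (browser or "Other", os_family or "Other", access_class or "Other")
            buckets[key]["editors"].add(editor)
            buckets[key]["requests"] += requests
        released = []
        for key, stats in buckets.items():
            if len(stats["editors"]) > MIN_EDITORS or stats["requests"] > MIN_REQUESTS:
                released.append(key + (stats["requests"],))
        return released

    # e.g. a "flippertigibbet" UA that nothing parses, seen 200 times from 60
    # editors, clears the editor threshold and comes out as
    # ("Other", "Other", "Other", 200), merged with any other unparsed agents.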