[12:46:49] Analytics-Kanban: Clean up references on puppet code to old AQS cluster - https://phabricator.wikimedia.org/T147461#2696317 (elukey) a:JAllemandou>elukey
[12:49:49] joal: --^
[12:50:20] if you give me the green light, I will clean up the old aqs cluster on monday from puppet
[12:50:43] this means also stopping some oozie jobs, since cassandra will be stopped on aqs100[123]
[14:05:48] Analytics-EventLogging, Technical-Debt: EventLogging uses deprecated EditFilterMerged hook - https://phabricator.wikimedia.org/T147564#2696672 (Reedy)
[14:42:26] Analytics-Kanban: Create documentation for edit history reconstruction - https://phabricator.wikimedia.org/T139763#2696786 (mforns) I factored out some of the documentation about the page and user history reconstruction into a generic history reconstruction page. And moved the specific things in separate pag...
[14:53:58] elukey: Hi !
[14:54:06] elukey: Just finished my classes :)
[14:54:11] o/
[14:54:31] elukey: We'll discuss that tomorrow after I get home (like in the afternoon), if it's ok?
[14:54:57] sure sure, even next week, I am not going to pull the plug this week
[14:55:31] sure, still interesting to discuss !
[14:56:02] I ran the puppet compiler and everything looks sound
[15:27:38] (CR) Nuria: "Updating per standup conversation, there is no need to build the code, we can just merge to master and current build is this same code from " [analytics/aqs] - https://gerrit.wikimedia.org/r/314284 (https://phabricator.wikimedia.org/T144497) (owner: Nuria)
[16:04:14] Analytics-Kanban: Count pageviews for all wikis/systems behind varnish - https://phabricator.wikimedia.org/T130249#2697036 (Nuria)
[16:40:56] (CR) Nuria: "Rephrasing last comment: there is no need to build the code, we can just merge to master." [analytics/aqs] - https://gerrit.wikimedia.org/r/314284 (https://phabricator.wikimedia.org/T144497) (owner: Nuria)
[16:48:04] login nuria
[17:08:03] elukey: ping
[17:10:39] nuria: are there any plans for the aqs nodes that are coming down?
[17:42:54] urandom: plans? as in recycling them for another service?
[17:54:10] (same question, will read later on :)
[18:01:06] (PS6) Nuria: [WIP] Service Worker to cache locally AQS data [analytics/dashiki] - https://gerrit.wikimedia.org/r/302755 (https://phabricator.wikimedia.org/T138647)
[18:05:34] nuria: yeah
[18:06:13] urandom: no, we have no plans; those hosts will not be well suited for cassandra, we are giving them back to the "pile"
[18:06:35] nuria: why are they not well suited, is it just the disks?
[18:07:14] urandom: yes, rotating disks, no ssds
[18:07:43] urandom: which made it impossible to use the compaction strategy we are using now on the new hosts
[18:07:44] nuria: the TL;DR is that we're trying to put together a test/staging env for RESTBase/Cassandra that's closer to production (something that we can reason about in relation to production, at least)
[18:08:11] Ops is going to let us use two Varnish nodes that came down in AMS, and we're looking to buy disks for them
[18:08:12] urandom: that is a GREAT effort, but these hosts are not the ones you want for that
[18:08:25] meaning, we're already planning to buy disks
[18:08:28] SSDs
[18:09:01] the AMS nodes aren't ideal mostly because they're in AMS, and rely on mark if something needs hands-on
[18:09:22] plus we need to be more careful about PII
[18:10:15] anyway, we were going to use them because there was nothing else reasonable in the "pile", but maybe now there is (or will be)...
[18:10:27] Whether SSDs can be fitted in these hosts I do not know; I also don't know if ops already has plans for these hosts (space wise in the datacenter, we are going to need new rack space for other things), but we can ask
[18:10:39] yup
[18:10:47] will do.
[18:11:09] just wanted to make sure i wasn't beating down doors for hosts you already had plans for
[18:12:28] ok, on our end we need new hosts for other things for which we cannot use these; ops might have plans for them, budget wise and space wise, so do check. you can coordinate with elukey
[18:13:33] kk
[18:13:41] elukey: gimme all yer nodes.
[18:13:46] :)
[18:14:30] i feel like the weird kid in the lunch room: "So... you gonna eat that?"
[18:15:17] urandom: juas, having a prod-like env to test will be great, we fully support that.
[18:16:30] yeah, i've been wanting it for going on two years!
[18:16:49] milimetric, I can not ssh into limn1.eqiad.wmflabs :[
[18:17:20] right, of course
[18:17:46] mforns: only I can, but maybe yuvi can add you to the allowed-to-be-root user list?
[18:17:55] I do ssh root@limn1.eqiad.wmnet
[18:18:00] oh, or madhuvishy might be able to
[18:18:05] milimetric, aha
[18:18:05] otherwise just tell me what you want to do
[18:18:14] milimetric, well, do I need to delete files there?
[18:18:31] or we just leave them there until limn1 gets decommissioned
[18:18:52] we should disable anything that is answering http requests
[18:19:18] nuria, even if the web proxy is deleted?
[18:19:28] mforns: I can delete the files, easy enough, you can disable the proxies and tell me what to delete. I'll wipe out the apache configs
[18:19:40] k
[18:19:41] mforns: ah well, that's right, i ALWAYS forget
[18:19:55] nuria, milimetric, so.. leave everything there?
[18:19:55] that wikitech manages the proxies
[18:19:56] mforns: so, nah
[18:20:42] ok, nuria, I'll leave the files in there
[18:21:10] mforns: I still think we should all be able to ssh
[18:21:40] nuria, makes sense, I'll ask Yuvi
[18:22:12] nuria / mforns: I disagree, they'll delete this instance; the reason we can't ssh is because it's not maintained
[18:22:24] so in this case it's not an emergency, I wouldn't bother them
[18:23:18] milimetric: as you want, but in order to clean up we might need (any of us) to ssh
[18:23:42] milimetric: but sure, it is not an emergency
[18:23:48] nuria, milimetric, sure, but there is no private data to clean, right?
[18:23:50] nuria: I'd agree if they didn't warn us several times that this would happen
[18:27:00] milimetric, nuria, another question :] will we remove the files from datasets.w.o?
[18:27:19] mforns: yeah, I think we should, it will make our cleanup of that easier
[18:27:23] mforns: for the dashboards no longer being used?
[18:27:26] mforns: i think so
[18:27:29] exactly
[18:27:33] which btw I want to do soon
[18:27:39] ok, will ask andrew when he comes back
[18:27:56] mforns: I guess any reportupdater jobs should be disabled in puppet too
[18:28:09] milimetric, yes, I have that on my list, too
[18:28:31] I can create the puppet gerrit changes
[18:53:20] Analytics-Dashiki, Analytics-Kanban: Dashiki should load stale data if new data is not available due to network/api conditions - https://phabricator.wikimedia.org/T138647#2697491 (Nuria)
[19:29:12] milimetric: Hi! need any help? :)
[19:29:46] madhuvishy: sorry to bother, no, we're good
[19:29:56] okay cool
[19:30:04] madhuvishy: but I was wondering about labs folks' plans with labsdb this quarter
[19:30:12] you involved with that or just chasemp?
[19:30:24] chasemp: I wanna talk about that sometime
[19:30:33] (labsdb)
[19:30:33] milimetric: ah yes - i was in the meetings at the offsite - i probably will be involved with it
[19:30:52] ok, we should talk
[19:32:34] milimetric: sure
[19:33:34] madhuvishy: whenever you have time, I can catch you up and chase separately, since you have a lot more context
[19:33:44] or together, however you want
[19:34:39] milimetric: we can do it together, whenever chase is around - https://www.mediawiki.org/wiki/Wikimedia_Engineering/2016-17_Q2_Goals#Technology lists the labs goal
[19:35:59] k, should I just set up a hangout madhuvishy or wait for chase to ping on IRC?
[19:36:46] milimetric: probably wait - haven't seen him around today
[19:37:12] milimetric: do you wanna chat about how this plays with some of the analytics work?
[19:38:05] madhuvishy: yeah, we now have three ways to extract and publish data from mediawiki databases
[19:38:11] dumps, labsdb, and our wikistats 2 pipeline.
[19:38:34] I'm about, just off doing involved things.
[19:38:36] We probably should've reached out earlier, but I didn't know there were plans to improve labsdb
[19:38:48] chasemp: k, should I set up a hangout?
[19:39:28] sure, we can also chat on irc a bit if you like
[19:39:56] I read your response to the task, thank you; I was mulling it over a bit, and also some of the way forward is purely in jynus's hands I think afa priority
[19:40:21] iiuc the focal point of things is one sanitization process from which multiple streams can emit
[19:40:40] so a sanitarium equivalent
[19:54:05] chasemp: that's one part of it, but another is the way data is structured in mediawiki in the first place. I wonder if there are use cases in labs that actually need data to be in that schema, or if they would all benefit from a simpler analytics-oriented OLAP-type schema
[19:55:08] quick example: "count edits" becomes weirdly complicated when you consider deleted pages or namespace filters
[19:55:20] I don't know that anyone has ever asked that question, yeah
[19:55:38] users don't really truly even see that schema necessarily, though the majority of views mirror it
[19:55:50] we do have some "custom views" that more or less change the schema, but mostly for privacy reasons
[19:56:08] so I think it hasn't received a lot of thought, but the buy-in now on the existing schema is probably fairly deep
[19:59:38] that's ok, the existing wouldn't go away in my opinion, it would just be queried a lot less as people migrate
[20:02:31] that makes sense
[20:04:53] good point though: just because this might be better doesn't mean it'll be easy for people to switch, that I need to think about more, and ask people
[20:05:26] it's the most difficult and long-tailed problem with everything in this space
[20:05:43] I think
[20:05:58] long tailed and angry headed, yeah :)
[20:07:13] we have a few use cases that we don't know what to do w/; one of them is "user tables", or tables created at some point by someone, maybe used or not, to do joins on "their" data against some or many wiki dbs
[20:07:30] it's not a thing that scales, but it's a thing we have to quantify the impact of affecting
[20:07:33] one example
[20:08:21] at the moment they are ephemeral (not backed up or replicated and very best effort) but that may not have prevented serious data from living there
[20:08:45] so one thing the new approach lets us do very easily is turn enwiki.page, dewiki.page, ..., zhwiktionary.page into page with a wiki_db column
[20:09:06] how does that turn out performance wise?
[20:09:08] with that, people should have less reason to extract things like that themselves
[20:09:10] we could do the same w/ a view now right?
[20:09:31] well, in hadoop it's amazing
[20:09:36] understanding what ppl are using in this capacity is on the agenda tbh
[20:09:40] we know it's there :)
[20:09:47] in postgresql with vertical partitioning, I'd imagine it's still amazing
[20:09:47] that is most of the extent of it
[20:10:06] labsdb is mariadb 10.x atm
[20:10:16] we were looking at 10.1 for a few features like user roles
[20:10:26] iirc
[20:11:13] looks like maria supports partitioning too
[20:11:40] I think that'll probably perform a little better than views, but we'd have to ask jaime
[20:12:54] it would be good to run a user survey about what current folks do with labsdb
[20:13:12] in my understanding of the world, tools like quarry are the biggest consumers. And for that, the denormalized schema is a strict upgrade
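As a concrete illustration of the "count edits" complication milimetric mentions above: in the stock MediaWiki schema, edits to live pages and edits to since-deleted pages live in different tables, and a namespace filter needs a join in one case but not the other. A minimal sketch (revision, page, and archive are the real MediaWiki tables; the metric definition itself is just illustrative):

    -- Hypothetical "edits in the main namespace" count for one wiki.
    -- Live revisions need a join to page to get the namespace;
    -- revisions of deleted pages live in archive, which keeps the
    -- namespace denormalized in ar_namespace.
    SELECT
      (SELECT COUNT(*)
         FROM revision r
         JOIN page p ON p.page_id = r.rev_page
        WHERE p.page_namespace = 0)
      +
      (SELECT COUNT(*)
         FROM archive
        WHERE ar_namespace = 0) AS main_namespace_edits;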
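And a sketch of the single page-table-with-wiki_db idea combined with the MariaDB partitioning mentioned above. The table name, column subset, and partition count here are made up for illustration, not an agreed design:

    -- One cross-wiki page table; KEY partitioning on wiki_db lets the
    -- optimizer prune to one partition for single-wiki queries without
    -- enumerating every wiki in the DDL.
    CREATE TABLE page_all_wikis (
      wiki_db          VARBINARY(64)  NOT NULL,  -- 'enwiki', 'dewiki', ...
      page_id          INT UNSIGNED   NOT NULL,
      page_namespace   INT            NOT NULL,
      page_title       VARBINARY(255) NOT NULL,
      page_is_redirect TINYINT(1)     NOT NULL DEFAULT 0,
      PRIMARY KEY (wiki_db, page_id)
    )
    PARTITION BY KEY (wiki_db)
    PARTITIONS 32;

The "do the same w/ a view" alternative chasemp raises would be a UNION ALL over hundreds of per-wiki page tables, which works but is typically much harder for the optimizer to prune and index than a partitioned base table.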
[20:13:34] the problem is we get a somewhat low response rate %-wise I think
[20:13:46] people use it for data mining to be sure
[20:13:49] but also as an event feed
[20:14:02] we could also analyze the actual queries running on the servers
[20:14:05] and also as a general sanity check for various oddities and to keep projects in sync
[20:14:28] yeah, event feeds should really be slowly migrated to public event streams (coming up too)
[20:14:33] right, that was kind of the thought on user tables
[20:14:46] agreed, and I think that's the most directly sunsettable case
[20:14:54] sunsettable(tm)
[20:14:59] nice :)
[20:15:08] but it's not something we have the bw to tackle from a labs team pov
[20:15:31] yeah, once the new tech is there it'll be a matter of outreach and monitoring to see if that use case dies down, then sending out some sunset emails
[20:15:49] so public event streams
[20:15:57] does this kill rcstream and irc and labsdb hacks?
[20:16:02] I mean in an ideal world
[20:16:21] ideal or not ideal, it should kill all that or we haven't done our jobs properly
[20:16:42] we'd have a socket.io interface, so it's a strict upgrade to rcstream
[20:16:47] when does that become a thing ppl can consume?
[20:16:52] (I don't know much)
[20:16:53] because on top of what rcstream does we can implement resume and filters and whatever
[20:17:03] our plan is by the end of the year
[20:17:08] nice
[20:17:25] would you be willing to give a brief "do this... not that" session at the dev summit?
[20:17:26] guys, the denormalized schema thing is nice too, come on :)
[20:17:31] evangelize and all that
[20:17:37] it's the biggest missing piece I think for a lot of this
[20:18:09] milimetric: yes yes :)
[20:18:10] yeah, sure, I can, though I don't want to steal Andrew's spotlight, he'll be leading that effort
[20:18:38] nobody cares madhuvishy, I'll die alone under a rock cuddling a hypercube
[20:18:39] I'll bother otto about it too then
[20:18:46] he's back from his trip next week or?
[20:19:05] yes, next week
[20:20:02] milimetric: coincidentally I've been banging this drum recently :)
[20:20:06] so how's this for a plan:
[20:20:08] the getting-ppl-to-do-the-right-thing express
[20:20:10] https://www.mediawiki.org/wiki/Wikimedia_Developer_Summit/2017/Building_on_Wikimedia_services:_APIs_and_Developer_Resources
[20:20:16] when public event streams are out, we evangelize and get people to stop using labsdb for that
[20:20:16] or helping them know what that is
[20:20:36] very much in favor of this slice of the use case
[20:20:42] when the denormalized schema is out we evangelize, hook up quarry to it, and get people to switch
[20:20:48] my impression is much of the rest that is ill-advised is probably just that as well
[20:20:52] then we monitor what's left on labsdb and sunsetizize
[20:21:15] well, research has a plan to give quarry/paws its own dbs as well
[20:21:19] concurrently?
[20:21:31] the denormalized stuff is all sanitized from hadoop?
[20:21:46] no, but it should be if we clean up the sanitization logic
[20:22:04] ok, so end-of-whatever-quarter-this-is idea from our pov
[20:22:24] is some new labsdb machines we have that are big, with an haproxy setup in front, and users move to using "service urls" for accessing the replicas
[20:22:37] one for slow queries, one for short ones, and one for quarry/paws (roughly)
[20:22:47] I said that to say this: we'll be more flexible
[20:23:35] and we could have a 4th service url to a setup rooted in your thinking here, for some version of a migration
[20:23:39] but I don't know where the hardware would come from for it, or if it exists necessarily
[20:23:48] not sure if I'm making sense
[20:24:06] yeah, well, here's the good news
[20:24:19] if you get a metric like "rolling active editor" from the current labsdb setup vs. the denormalized schema, it's literally 1000 times slower
[20:24:41] I'm no math wiz but that seems significant
[20:24:44] so the machine that runs this new schema would just need a decent amount of space and good partitioning. And performance shouldn't be a problem.
[20:25:03] so we can skimp on performance for it, is what I mean
[20:25:05] can you ballpark decent amount of space?
[20:25:50] well, that depends on what we want to load. If it's all history for everything, then right now it's page + user + revision from each wiki. And probably more as people ask for more stuff to be accessible
[20:26:19] but we have a lot of control over what we load, so we can start with something small and useful and grow too
[20:26:28] (having different tables for different types of analysis needs)
[20:27:23] I'm running some -du in hadoop to find out how big
[20:28:45] probably like... 200G I know of right now
[20:29:14] 200G for all of history?
[20:29:43] (that's just page, revision, user, logging, and archive from all wikis, but kind of roughly estimated for enwiki)
[20:29:58] let's say 300G to be safe, because I'm not sure enwiki copied properly
[20:31:24] fwiw the normalization of history as otto described briefly at the ops offsite is pretty awesome
[20:32:40] current replicas are about 2.7T I think
[20:33:05] 2+ in any case
[20:33:16] that gets bigger though as we move to innodb
[20:34:33] yeah, I'm not sure what engine would be best for OLAP purposes.
[20:36:20] right, so our data would never be bigger than 2.7T because we're re-arranging stuff, not creating anything new. And tables like logging are going away
[20:37:01] atm this is tokudb, which has one thing it does ok -- compression; not sure how it grows from here
[20:37:06] it's a surmountable problem though, I'm sure
[20:37:44] milimetric: I didn't get a cal invite? do you want to kind of meet minds and include otto late next week or so?
[20:37:47] what about sanitization, are there plans to clean that up?
[20:37:54] oh yeah, I'll do that
[20:38:10] yes, but not radically or seriously at the sanitarium level
[20:38:22] most of this is a constraint of time / people
[20:40:02] it's a tale of woe that is long, but if I were to boil it down for myself: we have dropped tables outright at sanitarium, then we have removed structured data or so via triggers, then we have to replicate that to labsdb to create views to further obfuscate and limit
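The view layer in that last step looks roughly like this. This is a simplified sketch modeled on the generated labsdb view definitions; the real views have longer column lists and more conditional logic than a simple filter:

    -- labsdb-style view in the <wiki>_p database: expose only safe
    -- columns of enwiki.revision, and hide rows whose visibility has
    -- been restricted (rev_deleted != 0).
    CREATE OR REPLACE VIEW enwiki_p.revision AS
    SELECT rev_id, rev_page, rev_user, rev_user_text, rev_timestamp,
           rev_minor_edit, rev_len, rev_parent_id
      FROM enwiki.revision
     WHERE rev_deleted = 0;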
[20:40:17] a lot of this afaict is because mediawiki does not easily denote things that are sensitive
[20:40:33] so we have layers on layers to compensate for the unstructured nature of what can't be revealed
[20:40:56] and that's where the really hard work would have to happen, and that's a time sink of epic proportions possibly
[20:41:12] that's a very cavalier and broad assessment
[20:42:13] even as recently as a few months ago we had an "oops this table has things no one realized because x" incident, which resurfaced 2 days ago due to other issues
[20:42:42] where x is basically no discernment of this type of reasoning from the start
[20:42:57] maybe I'm overstating it, I'm not sure, but this is my impression
[20:43:12] k, sent
[20:43:27] and the sanitarium stuff is mostly just scripts run to generate sql to do the masking, which is not ideal and difficult to maintain and audit
[20:43:51] so I can imagine furthering that w/ more things stacked on top probably seems like bad news from a dba perspective
[20:44:30] the most significant thinking on this I'm aware of in the last year or so is https://github.com/wikimedia/operations-software-labsdb-auditor
[20:44:33] chasemp: you don't scare me :) I'll tackle this problem myself if need be, can't be harder than what we just went through (hell)
[20:45:29] but really, I'm happy to do this. We have a mandate on the team to think public-first, and if this is an obstacle, then fine
[20:45:43] (reading that github thing)
[20:46:08] that was a "let's whitelist what exactly people should see at the labsdb layer"
[20:46:21] which in theory could happen one layer up w/ a more sane sanitarium equivalent
[20:46:31] which had output that was entirely ...sanitary
[20:49:00] milimetric: not intending to scare :) I think with much of this, the questions will generate more questions for a bit, but it seems pretty important and you may be the best situated mentally person to reason on it
[20:49:03] seems ideal :)
[20:49:41] yeah, I want this problem solved for too many reasons to not solve it. dumps should be refactored on this process too, because that does its own "sanitariumization"
[20:49:56] right, it seems nuts doesn't it?
[20:50:18] w/ all of these cheesecloth approaches, one of them is bound to fail in a novel way
[20:50:26] everything does 'till you find a nutcracker. Then it's all delicious
[20:50:51] I hear some nuts pass right through you and can't be digested
[20:50:57] ew
[20:51:04] not touching that
[20:51:41] if that was a pun, well done; if not, it worked out
[20:52:50] I think if you were to determine that tackling a one-stop-shop sanitization process was feasible, it would clear away a lot of other issues; much of the labsdb layer as it were is tied up in covering those bases
[20:53:15] another consideration is, if you make a public service, anyone in labs could reach it
[20:53:38] what do you mean by that?
[20:54:34] labsdb is a thing we offer only to labs, mostly due to logistics
[20:54:51] and we even have guidelines that say, don't make some service that lets anon folks hammer labsdb
[20:54:55] quarry notwithstanding :)
[20:55:05] chasemp: what I'd hate to see happen is yall put a ton of work into getting labsdb cleaned up, and then we replace part of it with PES and the rest of it with the denorm table and cross-wiki tables next year
[20:55:05] so it's shielded to only labs vm's
[20:55:27] right, makes sense
[20:55:48] we have a sense of immediacy w/ our current problem set
[20:56:02] oh yeah, sure, but if people put up mondrian / saiku on top of the denorm table, it caches really nicely and is built to handle many users a lot better than quarry
[20:56:03] in that labsdb is down to 2 machines that have no storage redundancy and no failover mechanism for maint
[20:56:39] yep, that's totally priority 1 and I'd never ask you to wait on us for that
[20:57:27] I think all of the things we have on the table are probably have-to's in the very near term, so it would be awesome to navigate beyond that, even ideal at this point
[20:58:00] but something like making sanitarium sane... we basically said, we have no ability to do this, so let's fix over there
[20:58:03] ok, sweet. Then we'll keep chugging and work together on a use-case by use-case migration to better tools?
[20:58:47] yeah, and I got sanitarium / redactatron / whatever else it's called, I'll refactor all that into something nice and simple
[20:58:58] * milimetric has said the famous last words, countdown to doom commencing
[21:00:05] I think yuvi's rationale on the auditor was probably in the right direction
[21:00:13] we spend so much time blacklisting, and poorly
[21:00:20] let's toss it all out and whitelist and see where we stand
[21:00:31] (paraphrasing the idea)
[21:00:48] if we can't whitelist our own data, we got real problems
[21:01:15] but something like the user table as it were is the canonical example for complexity maybe
[21:01:22] ppl want to query across all users and user count etc
[21:01:35] but it's not normalized and kept in the same table as passwords themselves
[21:02:00] so we fire a trigger and blank that column and keep the table
[21:03:04] CommTech is doing some work in CentralAuth for cross-wiki watchlists that may help with the user linking problem. A new column is being added to the local wiki table to store the per-wiki user id
[21:03:40] so we will end up with a table of (central id, wikidb, username, wiki id)
[21:03:59] or something like that anyway
[21:04:02] I'm not sure what else is in that table but it may cover that case, yeah
[21:04:12] a separate auth service ideally does too?
[21:04:27] yeah. on security's roadmap
[21:04:32] but not holding my breath
[21:04:36] right
[21:04:59] there is some code for it on github from the services folks
[21:06:23] bd808: any interest in chatting about possibles next week?
[21:08:40] Sure. I'll be out Monday and Tuesday but could talk after that
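The trigger-and-blank approach chasemp describes for the user table above would look something like this on the sanitized replica. This is an illustrative sketch, not the production definition: user_password, user_email, and user_token are real MediaWiki columns, but the trigger name is made up, and a matching BEFORE UPDATE trigger would also be needed:

    -- On the sanitized replica: keep the row, blank the sensitive
    -- columns as they arrive via replication.
    CREATE TRIGGER user_blank_sensitive
    BEFORE INSERT ON user
    FOR EACH ROW
    SET NEW.user_password = '',
        NEW.user_email    = '',
        NEW.user_token    = '';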
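And the (central id, wikidb, username, wiki id) shape bd808 sketches might end up looking roughly like this. Table and column names here are hypothetical guesses at the eventual shape; CentralAuth's actual tables are globaluser and localuser, with the per-wiki user id being the new column under discussion:

    -- Hypothetical flattened central-to-local user mapping.
    CREATE TABLE user_link (
      central_id INT UNSIGNED   NOT NULL,  -- CentralAuth global user id
      wiki_db    VARBINARY(64)  NOT NULL,  -- e.g. 'enwiki'
      user_name  VARBINARY(255) NOT NULL,
      local_id   INT UNSIGNED   NOT NULL,  -- the new per-wiki user id
      PRIMARY KEY (wiki_db, local_id),
      KEY (central_id)
    );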
[21:09:42] having new OLAP data sources will be neat for one-off queries and new code, but there is honestly no way that you are going to get all tools, bots, and cron jobs to rewrite their data layer to migrate
[21:10:02] that cross-wiki watchlist would totally benefit from our data, too bad this sanitizing monster is holding us back :(
[21:10:25] especially for anything that was purposefully written to work with any MediaWiki install
[21:11:00] if the data isn't in MediaWiki's tables then we really can't build a special page around it
[21:11:09] bd808: not right away, definitely
[21:12:04] milimetric: there is a lot of code running in Tool Labs that hasn't been touched since it was moved from toolserver, and only touched then to stop on one box and start on another
[21:12:52] and 1500+ tool accounts today running an uncounted number of scripts, bots and tools
[21:12:56] so long term, data from Public Event Streams will feed into this store as well, and people running other mediawiki instances should be able to upgrade their servers with an analytics package
[21:13:59] like I said, good stuff and useful, but the long tail is years long
[21:14:09] yeah, I mean, I'm not looking forward to helping people rewrite 1500+ tools. But I'm thinking a lot of those will become unnecessary with better infrastructure
[21:14:36] most of them are probably unnecessary today ;)
[21:14:48] :) maybe, but definitely a big unknown right now
[21:14:48] that's no reason to kill a project
[21:15:37] the flip side is the longer we don't have good alternatives, the more things are written on bad alternatives
[21:16:20] yeah, but I take the point: if we get 5 new tools on this infrastructure, it probably wouldn't be worth it
[21:16:50] it's worth it internally for us anyway, and for refactoring dumps
[21:17:29] maybe that's an approach: leave it internal for now, demo at hackathons and see who likes it and wants to switch
[21:17:56] the best thing we can do is evangelize it, show why it's neat, and use it for nontrivial things ourselves
[21:18:10] *nod*
[21:18:11] yeah, wikistats 2.0 will be anything but trivial
[21:18:12] yeah, i'm with evangelizing it
[21:18:48] cool, good talk everyone, I have a much clearer path forward, thank you
[21:19:35] milimetric: btw, awesome work with the data migration <3 saw some of the new edit dashboards on analytics.wikimedia :)
[21:20:33] this thing? https://analytics.wikimedia.org/dashboards/standard-metrics/
[21:20:40] but, madhu, that's a secret!
[21:20:41] :)
[21:21:02] ;) ha ha I saw on some email you sent!
[21:21:14] but also, all the things that needed to happen to get here :)
[21:24:04] thanks madhuvishy, I know you know how much work that was, and I'll subject the rest of the foundation to tech talks so they know too :)
[21:24:26] milimetric: looking forward to it :)
[22:07:54] Analytics-EventLogging, Technical-Debt: EventLogging uses deprecated EditFilterMerged hook - https://phabricator.wikimedia.org/T147564#2697594 (Legoktm) a:Legoktm https://gerrit.wikimedia.org/r/307039
[22:08:18] Analytics-Kanban: Kill dashboards on limn1 that are no longer used - https://phabricator.wikimedia.org/T147000#2697641 (mforns) @Nuria @Milimetric Hey, I have a couple questions: 1) analytics.wmflabs.org just contains the pageview-api demo that I did. Do you think this has a reason to exist after musikanima...
[22:08:40] Analytics-Kanban: Improve mediawiki data redaction - https://phabricator.wikimedia.org/T146444#2697661 (Milimetric)
[22:08:42] Analytics-Kanban: Improve mediawiki data redaction - https://phabricator.wikimedia.org/T146444#2661250 (Milimetric) a:Milimetric
[22:08:51] Analytics-Kanban: Improve mediawiki data redaction - https://phabricator.wikimedia.org/T146444#2697668 (Milimetric) FYI: I scoped this to "just" a refactor of Sanitarium and friends. I'll try to tackle it this quarter in between other work, but we're not committing to it. I'd appreciate any help in the for...
[22:09:01] ooh, wiki bugs catching up :)
[22:09:08] slacker
[22:17:08] Analytics-Kanban: Improve mediawiki data redaction - https://phabricator.wikimedia.org/T146444#2661250 (AlexMonk-WMF) >>! In T146444#2697668, @Milimetric wrote: > * what the custom views in labsdb hide in addition to Sanitarium Theoretically https://phabricator.wikimedia.org/diffusion/OSOF/browse/master/main...
[22:19:02] thanks Krenair
[22:19:45] milimetric, the script that should be controlling this is called maintain-replicas
[22:20:15] there's a little bit of a story behind why it's not actually currently in use: https://phabricator.wikimedia.org/T138450
[22:20:37] don't know how relevant this is to what you're looking at though
[22:20:55] well, I've heard a few different stories about how this is actually supposed to be working :)
[22:21:14] so I'm going to try and bring all of that logic into one place
[22:21:35] what are you planning to do wrt. labsdb?
[22:22:02] we have to export data from the mediawiki dbs into hadoop, Krenair
[22:22:15] and ideally it would be as sanitized and public as labsdb
[22:22:18] ah
[22:22:34] and there's lots of potential for refactoring / restructuring pipelines of data there
[22:22:38] so you just want to make it generic enough to send data from core production to analytics with the same level of security as if it were going to labs
[22:22:46] but for now it seems the place to start would be a generic sanitizer
[22:22:48] that's fine
[22:22:59] yep
[22:24:14] Right now I'm trying to get labsdbs back under the control of the script
[22:26:09] two wikis created this year don't have any labs views because it's not currently being run
[22:30:31] Analytics-Kanban: Kill dashboards on limn1 that are no longer used - https://phabricator.wikimedia.org/T147000#2697908 (Nuria) >analytics.wmflabs.org just contains the pageview-api demo that I did. Do you think this has a reason to exist after musikanimal's pageview tool? I guess we can kill that, right? Yes...
[22:49:43] (PS1) Milimetric: Improve build a bit more [analytics/dashiki] - https://gerrit.wikimedia.org/r/314622