[10:48:33] Analytics-Tech-community-metrics, Developer-Relations (Oct-Dec-2016): Deployment of Mediawiki panels - https://phabricator.wikimedia.org/T138006#2813712 (Qgil)
[12:56:20] hi a-team :]
[13:07:09] mmmmmm, alarms are crazy...
[13:10:00] oh ok
[13:39:35] (CR) Joal: [C: 1] "Looks good to me." [analytics/refinery/source] - https://gerrit.wikimedia.org/r/305989 (https://phabricator.wikimedia.org/T142955) (owner: Addshore)
[13:46:17] hey yall
[13:58:37] Hi guys
[14:03:33] mforns, milimetric: from yesterday backlog - Should we do some scala today ?
[14:07:06] joal: sure, I'm down
[14:07:13] I'm learning Mediawiki extension development
[14:07:20] wow
[14:07:21] (slooooooooooooooooooooooo
[14:07:22] oooooooooooooooo
[14:07:24] oooowly)
[14:07:28] :D
[14:07:31] :)
[14:09:45] but joal, yeah, anytime, whenever mforns is around
[14:10:55] sure milimetric, thanks :)
[14:11:11] joal: oh but I can catch you up at least
[14:11:23] so we found two main things we'd like to fix for the next run
[14:11:24] We can indeed do that :)
[14:11:31] k
[14:11:31] it's easy enough to type
[14:11:54] 1. archive records with ar_page_id is null will not be imported as revision/create records
[14:12:39] It's only before 2007 and they do have ar_user, so we can import them fairly well
[14:12:52] they'd just not be linked to a page, which I guess is ok
[14:13:33] good so far joal?
[14:14:01] so far ok
[14:14:06] milimetric: I have comments
[14:14:12] on that topic particularly
[14:14:27] 2. some altergroups records are missing from the logging table because the user_groups_latest doesn't match the user groups of some users as they are at the time of the import
[14:14:51] k
[14:15:00] so we probably should use the latest anyway
[14:15:03] all that means is that we should back-propagate the user_groups as they are today, at least for user_groups_latest and maybe for user_groups until we find different in the logs
[14:15:20] milimetric: that makes sense
[14:15:20] yep
[14:15:27] milimetric: 2. is rather a big thing
[14:15:32] why?
[14:15:35] milimetric: 1. is fairly easy I think
[14:16:02] 2. involves joining users with groups table, there reading-extracting refactor
[14:16:15] completely feasible, but not a minor change
[14:16:20] While 1. is minor
[14:16:22] I think
[14:17:18] also milimetric, about data imported in druid and clickhouse: rows with no or badly formatted event_timestamp are not imported on those systems
[14:18:58] that makes sense, joal, maybe we can default them to like 20000101000000?
[14:19:09] so that they're outside of history but... nah
[14:19:25] one thought I had is a separate data source for the more trashy data
[14:19:31] milimetric: While we could, I'd rather remove them than having misleading data points
[14:19:36] yeah
[14:19:42] milimetric: That is doable
[14:19:45] we'll see how their loss affects metrics
[14:19:54] so far I didn't see that being a factor in the differences
[14:20:05] makes sense what you say about 2., it's more involved than 1.
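A minimal sketch of the "remove rather than default" choice discussed above for rows with a missing or badly formatted event_timestamp. It assumes the 14-digit yyyyMMddHHmmss layout implied by the 20000101000000 example; the row dictionaries, the event_entity field, and the helper names are hypothetical and only for illustration, not the team's actual import code.

```python
from datetime import datetime


def parse_event_timestamp(value):
    """Return a datetime for a well-formed 14-digit timestamp, else None.

    Assumption: timestamps look like 20000101000000 (yyyyMMddHHmmss).
    """
    if not isinstance(value, str) or len(value) != 14:
        return None
    if not (value.isascii() and value.isdigit()):
        return None
    try:
        return datetime(
            int(value[0:4]), int(value[4:6]), int(value[6:8]),
            int(value[8:10]), int(value[10:12]), int(value[12:14]),
        )
    except ValueError:
        # Digits only, but not a real date or time (month 13, hour 40, ...).
        return None


def split_rows(rows):
    """Separate importable rows from rows with missing or malformed timestamps."""
    kept, dropped = [], []
    for row in rows:
        if parse_event_timestamp(row.get("event_timestamp")) is None:
            dropped.append(row)
        else:
            kept.append(row)
    return kept, dropped


# Hypothetical sample rows: one valid, three that would be left out.
rows = [
    {"event_entity": "revision", "event_timestamp": "20061231235959"},
    {"event_entity": "revision", "event_timestamp": None},
    {"event_entity": "user", "event_timestamp": "2006-12-31"},
    {"event_entity": "page", "event_timestamp": "20061340000000"},
]
kept, dropped = split_rows(rows)
print(len(kept), "kept,", len(dropped), "dropped")
```

Keeping the dropped rows in their own list, instead of discarding them silently, is one way to "see how their loss affects metrics" or to feed a separate data source for the messier data.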
[14:20:34] yep, so that's all we really found different, the rest of the data seems really clean, which is amazing
[14:20:50] milimetric: I AM SO HAAAAAAAAPPY about that :)
[14:20:59] we need those two things and that page_is_redirect_latest and I think we're gold
[14:21:12] milimetric: That rocks
[14:21:31] ottomata: Hi sir, just had a thought related to you
[14:22:13] ottomata: I don't know if you recall, but I told you some time ago that I'd rather wait for spark 2.0 being included in CDH before upgrading
[14:22:27] ottomata: Well, we have a good argument in favor of upgrading sooner
[14:22:40] ottomata: Let's discuss whenever you're around
[14:23:21] milimetric: I assume mforns has started to code on changes 1. and 2. and 3. (being page_redirect_latest) ?
[14:23:47] joal: oh ya?
[14:26:38] joal: not sure, I think we were waiting for you
[14:28:23] ottomata: details: we use GraphX in history reconstruction and graphX is dead
[14:28:53] ottomata: Now we should use GraphFrames (another graph lib for spark, better integrated with dataframes than GraphX)
[14:29:15] ottomata: In addition to being better integrated with DF, graphFrames also solves a bug we ran into using GraphX !
[14:29:44] ottomata: And the lowest version of GraphFrames we could use that solves this bug goes with Spark 1.6 minimum
[14:29:51] ottomata: You know it all :)
[14:30:07] latest cdh spark is 1.6?
[14:30:15] milimetric: ok, let's see when mforns comes back
[14:30:24] ottomata: I think so yes, let me check
[14:30:51] joal: so you'd prefer if we upgraded before we fully productionized history job?
[14:31:09] ottomata: That'd be awesome
[14:31:48] ottomata: We could also wait and not merge the GraphFrame update before next upgrade, but if it's not too expensive, an upgrade would be great
[14:34:10] ottomata: CDH 5.9.0 has Spark 1.6.0 and is fully compatible with Spark 2.0 (they have it ready in the download section, even if not yet integrated)
[14:34:45] ok great, cool. lets talk about that at goals today
[14:34:50] awesome
[14:34:57] (afk for a bit... gotta help move 2 giant woodstoves out of a basement....)
[14:35:10] good luck !
[14:43:07] ooh, watch out not to mess up your back!
[15:15:53] lift with your legs!
[15:15:54] i did!
[15:15:55] we did it!
[15:27:19] hey milimetric and joal, I'm back, sorry had unexpected visits
[15:27:26] I read your scrollback
[15:27:39] Hi mforns, in interview now, later for me
[15:27:45] ok
[15:57:25] Analytics, Research-and-Data: Hash IPs on webrequest table - https://phabricator.wikimedia.org/T150545#2814635 (Tbayer) >>! In T150545#2810760, @Tbayer wrote: > The task description currently reads (in full): > //"Hash IPs on webrequest table if the research for understanding how this data is being curr...
[16:00:32] a-team: finishing interview, will join standup in minutes
[16:37:45] Analytics, Research-and-Data: Hash IPs on webrequest table - https://phabricator.wikimedia.org/T150545#2814736 (Jsalsman) @tbayer, I proposed that the IP address and HTTPS proxy information both be included in the hash. Please correct me if I am wrong, but that would not be reversible. Also, everyone, I...
[16:52:12] Analytics-Kanban: Ongoing: Give me permissions in LDAP - https://phabricator.wikimedia.org/T150790#2814788 (HJiang-WMF) @elukey I tried to access pivot with my Wikitech credential, and now it finally worked. Thanks for pointing me to the right direction! Much appreciated.
[17:16:00] Analytics, Research-and-Data: Hash IPs on webrequest table - https://phabricator.wikimedia.org/T150545#2814898 (Nuria) @bd808 : if we are to do this (which is not clear, depends on ops use cases, it might not happen) we will not use a method that exposes us to a dictionary attack. A HMAC with a rotating...
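The last message above is cut off after "A HMAC with a rotating...", so what exactly would rotate is not stated in this log. The sketch below assumes it is the secret key, replaced once per period and never stored, and uses Python's standard hmac module with SHA-256. The class name, the 32-byte key size, and the rotation trigger are assumptions for illustration; this is not the method the team settled on.

```python
import hashlib
import hmac
import secrets


class RotatingIpHasher:
    """Keyed hashing of IP strings with a key that is replaced each period.

    Hypothetical helper: within one period the same IP maps to the same
    digest (so it can still be counted); after rotate() the old digests
    cannot be recomputed from this object, because the old key is gone.
    """

    def __init__(self):
        self._key = secrets.token_bytes(32)

    def rotate(self):
        # Replace the key. Note that Python gives no guarantee the old
        # bytes are wiped from memory; a real deployment would need more.
        self._key = secrets.token_bytes(32)

    def digest(self, ip):
        return hmac.new(self._key, ip.encode("utf-8"), hashlib.sha256).hexdigest()


hasher = RotatingIpHasher()
first = hasher.digest("192.0.2.77")   # 192.0.2.0/24 is a documentation range
again = hasher.digest("192.0.2.77")
hasher.rotate()
after_rotation = hasher.digest("192.0.2.77")
print(first == again, first == after_rotation)
```

Without the key, an attacker cannot precompute a dictionary of digests for all addresses, which is the difference from a plain unkeyed hash.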
[17:25:07] Analytics: Add global last-access cookie for top domain (*.wikipedia.org) - https://phabricator.wikimedia.org/T138027#2814901 (Nuria) Ping @bblack, now that varnish4 migration is over do you think we could get to deploy a global cookie soon?
[17:45:27] Analytics, Research-and-Data: Hash IPs on webrequest table - https://phabricator.wikimedia.org/T150545#2789216 (awight) We should also be discussing how often we can rotate the salt used for hashing, and whether we can destroy the salt permanently at the end of each period. The EFF seems to rotate on a...
[17:50:31] (PS2) Mforns: Add template for migration to reportupdater+dashiki [analytics/limn-ee-data] - https://gerrit.wikimedia.org/r/320433 (https://phabricator.wikimedia.org/T126358)
[17:51:23] Analytics, Research-and-Data: Hash IPs on webrequest table - https://phabricator.wikimedia.org/T150545#2815042 (bd808) >>! In T150545#2814898, @Nuria wrote: > @bd808 : if we are to do this (which is not clear, depends on ops use cases, it might not happen) we will not use a method that exposes us to a di...
[17:55:04] Analytics, Reading-analysis, Research-and-Data, Research-consulting: Propose metrics along with qualifiers for the press kit - https://phabricator.wikimedia.org/T144639#2815069 (leila) Update: We met with Neil from Editing and discussed the Editing related metrics. Next steps: There are at least...
[17:56:39] Analytics, Research-and-Data: Hash IPs on webrequest table - https://phabricator.wikimedia.org/T150545#2815070 (awight) >>! In T150545#2815042, @bd808 wrote: > If it is not reversible and it is not stable over time then why would we keep the hashed data at all? What specific use case is supported by reta...
[17:59:14] Hi a-team, do we meet in batcave or dedicated room?
[17:59:22] batcave
[17:59:24] joal, I'm in batcave
[17:59:32] nobody external's coming
[18:00:13] ACK
[18:00:33] a-team - joining da cave
[18:01:42] nuria: we're in batcave if you're looking for us
[18:01:53] ping nuria ^ :)
[18:02:40] AGH
[18:03:27] Analytics, Research-and-Data: Hash IPs on webrequest table - https://phabricator.wikimedia.org/T150545#2815121 (Nuria) >If it is not reversible and it is not stable over time then why would we keep the hashed data at all? Because we enable computation of temporary signatures we use for counting. This i...
[18:05:54] wow, why internet die right when meeting
[18:08:35] Analytics, Research-and-Data: Hash IPs on webrequest table - https://phabricator.wikimedia.org/T150545#2815138 (bd808) >>! In T150545#2815070, @awight wrote: >>>! In T150545#2815042, @bd808 wrote: >> If it is not reversible and it is not stable over time then why would we keep the hashed data at all? Wha...
[18:12:18] ottomata: it happened to me yesterday. I had internet for the whole day with no interruption and right at the time of the one meeting I had, it died. :/
[18:13:05] :)
[18:13:06] yeah
[18:13:36] nuria: the point Brandon made about hashing of IPs basically means that this would not be an effective defense against subpoenas or security letters
[18:13:59] gwicke1: right, on meeting i can talk later
[18:33:27] Analytics, Research-and-Data: Hash IPs on webrequest table - https://phabricator.wikimedia.org/T150545#2789216 (Tgr) The whole IPv4 space is roughly equal to that of a six-character alphanumeric password (a few billion possibilities); a password cracker running on a decent GPU can brute-force that in abo...
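The size comparison in the last message above can be checked with a little arithmetic, and the underlying concern (an unkeyed hash of an IP can be reversed by hashing every candidate address) can be shown on a small range. This is an illustrative sketch only: SHA-256, the 192.0.2.0/24 documentation range, and the function names are assumptions, and it says nothing about how long a full search would take.

```python
import hashlib
import ipaddress
import string

# Size of the search spaces being compared.
IPV4_SPACE = 2 ** 32                                                # 4,294,967,296
LOWER_ALNUM_6 = len(string.ascii_lowercase + string.digits) ** 6   # 36**6
MIXED_ALNUM_6 = len(string.ascii_letters + string.digits) ** 6     # 62**6


def sha256_hex(text):
    """Plain, unkeyed SHA-256 of a string, as a hex digest."""
    return hashlib.sha256(text.encode("ascii")).hexdigest()


def reverse_unsalted_hash(target_digest, network):
    """Recover an IP from its unkeyed hash by trying every address in a network."""
    for address in ipaddress.ip_network(network):
        candidate = str(address)
        if sha256_hex(candidate) == target_digest:
            return candidate
    return None


leaked = sha256_hex("192.0.2.77")
print(IPV4_SPACE, LOWER_ALNUM_6, MIXED_ALNUM_6)
print(reverse_unsalted_hash(leaked, "192.0.2.0/24"))
```

The same loop over the whole IPv4 range is only about 4.3 billion hash computations, which is why the thread keeps returning to secret keys or salts and how often they rotate.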
[18:39:24] Analytics, Pageviews-API: No results for Special:BlankPage or Special:BlankPage/RTRC - https://phabricator.wikimedia.org/T151363#2815250 (Mattflaschen-WMF)
[18:49:50] Analytics-EventLogging, Editing-Analysis: Record an EventLogging event every time a new mainspace page is created - https://phabricator.wikimedia.org/T150369#2815319 (Jdforrester-WMF) p:Triage>Low
[19:00:52] oh sorry joal!
[19:00:53] what?
[19:01:06] ottomata: just said it was not fun time for you :)
[19:01:41] haha
[19:01:45] it was fine! :)
[19:01:47] but i need lunch!
[19:05:11] gwicke1: back, yes i understand that. And i made that point earlier on an e-mail thread
[19:06:12] gwicke1: It reduces the possibility of anyone making an (honest) mistake and cutting and pasting raw ips in a ticket
[19:07:06] gwicke1: the only way to reduce subpoena concerns is reduce data retention
[19:07:33] gwicke1: and that is why we do not have IPs in any eventlogging data
[19:08:58] halfak: Hi ! quick check for tomorrow's meeting --> Do we have new stuff to discuss or can we cancel that round?
[19:09:34] Hey joal, there's some stuff to discuss. Will PM in a moment.
[19:10:01] k halfak, noted
[19:10:05] thanks for quick answer
[19:12:10] halfak: joal's leaving soon if you wanted to pre-chat, but we can meet tomorrow, no problem
[19:16:14] Analytics, Research-and-Data: Hash IPs on webrequest table - https://phabricator.wikimedia.org/T150545#2815492 (Nuria)
[19:20:23] success!! "Class undefined: Dashiki\DashikiContent"
[19:28:47] (PS2) MaxSem: Optionally record data to Graphite [analytics/reportupdater] - https://gerrit.wikimedia.org/r/322365 (https://phabricator.wikimedia.org/T150187)
[19:29:14] (CR) jenkins-bot: [V: -1] Optionally record data to Graphite [analytics/reportupdater] - https://gerrit.wikimedia.org/r/322365 (https://phabricator.wikimedia.org/T150187) (owner: MaxSem)
[19:29:51] (PS4) MaxSem: reportupdater queries for EventLogging [analytics/discovery-stats] - https://gerrit.wikimedia.org/r/322007 (https://phabricator.wikimedia.org/T147034)
[19:31:39] (PS3) MaxSem: Optionally record data to Graphite [analytics/reportupdater] - https://gerrit.wikimedia.org/r/322365 (https://phabricator.wikimedia.org/T150187)
[19:31:59] (CR) jenkins-bot: [V: -1] Optionally record data to Graphite [analytics/reportupdater] - https://gerrit.wikimedia.org/r/322365 (https://phabricator.wikimedia.org/T150187) (owner: MaxSem)
[19:32:06] <3 jerkins
[19:33:57] (PS4) MaxSem: Optionally record data to Graphite [analytics/reportupdater] - https://gerrit.wikimedia.org/r/322365 (https://phabricator.wikimedia.org/T150187)
[19:37:08] (CR) jenkins-bot: [V: -1] Optionally record data to Graphite [analytics/reportupdater] - https://gerrit.wikimedia.org/r/322365 (https://phabricator.wikimedia.org/T150187) (owner: MaxSem)
[19:47:25] (PS5) MaxSem: Optionally record data to Graphite [analytics/reportupdater] - https://gerrit.wikimedia.org/r/322365 (https://phabricator.wikimedia.org/T150187)
[19:48:54] (CR) jenkins-bot: [V: -1] Optionally record data to Graphite [analytics/reportupdater] - https://gerrit.wikimedia.org/r/322365 (https://phabricator.wikimedia.org/T150187) (owner: MaxSem)
[19:49:45] Analytics-Tech-community-metrics: Missing time units for percentile values - https://phabricator.wikimedia.org/T145425#2815650 (Lcanasdiaz) It seems clear we need to polish up these panels. In the meantime, @Dicortazar could you please clarify the data shown in the widget?
[19:53:13] (PS6) MaxSem: Optionally record data to Graphite [analytics/reportupdater] - https://gerrit.wikimedia.org/r/322365 (https://phabricator.wikimedia.org/T150187)
[20:00:24] Analytics-Tech-community-metrics, Regression: Git repo blacklist config not applied on wikimedia.biterg.io? - https://phabricator.wikimedia.org/T146135#2815705 (Lcanasdiaz) a:Dicortazar>Lcanasdiaz I confirm that blacklist is not working. Working on it ..
[20:01:00] MaxSem: Super thanks for your changes to reportupdater, I probably cannot get to review those this week but will try to do it as soon as possible
[20:01:10] :)
[20:27:24] Analytics-Tech-community-metrics: Check whether mailing list activity per person on korma is in sync with current "mlstats_mailing_lists.conf" - https://phabricator.wikimedia.org/T132907#2815834 (Lcanasdiaz) We're investigating it.
[20:30:15] Analytics-Tech-community-metrics: Deployment of Maniphest panel - https://phabricator.wikimedia.org/T138002#2815877 (Lcanasdiaz)
[20:30:17] Analytics-Tech-community-metrics: Maniphest support to be added to GrimorieLab - https://phabricator.wikimedia.org/T138003#2815876 (Lcanasdiaz) Open>Resolved
[20:30:19] Analytics-Tech-community-metrics, Phabricator: Decide on wanted metrics for Maniphest in kibana - https://phabricator.wikimedia.org/T28#2815878 (Lcanasdiaz)
[20:31:45] nuria: My advice would be to clearly document which threats you are interested in; the current discussion started by asking for the feasibility of a solution, and people are trying to figure out the threat model / problem that motivated the proposed solution.
[20:32:43] gwicke1: but see that is already side tracked from where the original discussion started which was "use cases for 60 day retention of raw ip data"
[20:32:50] /cc leila
[20:33:11] Analytics-Tech-community-metrics: Deployment of Maniphest panel - https://phabricator.wikimedia.org/T138002#2815927 (Lcanasdiaz) This new panel should be published within the next 7 days. I'm asking the devops team to get confirmation.
[20:33:50] gwicke1: We are not proposing a solution to better improve privacy but rather 1st trying to understand (beyond the use cases we know) what is that data used for
[20:33:51] the discussion seems to be on track under the assumption that the goal is a general "protect user's IP addresses against all common threats"
[20:34:08] gwicke1 thanks for the note. We are still trying to keep this a contained discussion. we are just interested in learning how the raw IP is being used at the moment.
[20:34:51] whatever changes will be made in the future, if any, will need to make sense from the technical pov, but if, as we see now, Ops says we absolutely need this data for reasons x and y, we need to hear this and take into account for any future solution.
[20:35:24] gwicke1: well, i have sent about 5 e-mails saying that is not the case but everyone feels compelled to move discussion to cryptography
[20:35:33] :P
[20:36:29] Analytics-Tech-community-metrics, Developer-Relations: Sudden rise of changesets in wikimedia.biterg.io metrics - https://phabricator.wikimedia.org/T145849#2642955 (Lcanasdiaz) Gerrit data was updated. This issue does no longer exist as far as I see. Check it out here: https://wikimedia.biterg.io:443/go...
[20:36:56] nuria, leila: The other bit that's unclear to me is which kinds of logs / data this would apply to.
[20:37:24] gwicke1: do you want to jump in a call? It may be faster to converge there.
[20:37:27] it sounds like it's about access logs only, but at the same time the protection goals seem more broad
[20:37:46] gwicke1: well , you might have not seen ticket right? Research to understand how raw IPs on webrequest table are used now.
[20:37:49] sure
[20:37:55] gwicke1: https://phabricator.wikimedia.org/T150545
[20:38:04] gwicke1: title is pretty telling
[20:38:27] gwicke1: so we are just compiling use cases
[20:38:45] okay, that helps -- just mentioning it as the question about use of logs seems to have become broader
[20:39:29] leila: are you in the office?
[20:39:34] gwicke1: because ahem ... nobody took time to read the premise of ticket
[20:39:39] no, but available in Hangout gwicke1.
[20:41:10] Analytics, Research-and-Data: Hash IPs on webrequest table - https://phabricator.wikimedia.org/T150545#2815971 (Nuria)
[20:41:33] gwicke1: edited description again to make it more clear
[20:45:56] Analytics-Tech-community-metrics: "Backlog" widget on "Gerrit-Backlog" seems to cover only last two years, misses oldest open changesets - https://phabricator.wikimedia.org/T146893#2815979 (Lcanasdiaz) a:Lcanasdiaz You can find reviews created on 2013 using these filters: * https://wikimedia.biterg.io:44...
[20:50:57] Analytics-Tech-community-metrics: "Last Attracted Developers" on Git-Demographics has incorrect date values for "First Commit Date" - https://phabricator.wikimedia.org/T151161#2815993 (Lcanasdiaz) a:Dicortazar I'm not sure, so please @Dicortazar correct me if I'm wrong, but the Git demographics panel is...
[21:05:38] Analytics-Tech-community-metrics, Differential, Developer-Relations (Jan-Mar-2017): Make MetricsGrimoire/korma support gathering Code Review statistics from Phabricator's Differential - https://phabricator.wikimedia.org/T118753#2816033 (Lcanasdiaz) I'm trying to get a realistic date for this new pane...
[21:09:04] Analytics, Research-and-Data: Hash IPs on webrequest table - https://phabricator.wikimedia.org/T150545#2816062 (leila)
[21:10:06] gwicke1: we updated the task https://phabricator.wikimedia.org/T150545 and I've asked nuria to review and correct whichever part needs correction.
[21:10:09] thanks for the feedback.
[21:11:29] that's much clearer, thanks!
[21:11:57] Analytics-Tech-community-metrics: Deployment of Gerrit Delays panel for engineering - https://phabricator.wikimedia.org/T138752#2816074 (Lcanasdiaz) The thing about this panel is to measure the amount of time need for a ticket to reach its final state. Having only //open// changesets makes sense when you are...
[21:13:15] Analytics, Research-and-Data: Hash IPs on webrequest table - https://phabricator.wikimedia.org/T150545#2816090 (Nuria) @tgr and @bd808 please be so kind as to re-read premise of this ticket, this is not a security measure but rather we are trying to compile use cases to see how rawIPs are used to evaluat...
[21:24:29] (CR) Yurik: [C: 1] reportupdater queries for EventLogging [analytics/discovery-stats] - https://gerrit.wikimedia.org/r/322007 (https://phabricator.wikimedia.org/T147034) (owner: MaxSem)
[21:30:14] (PS5) MaxSem: reportupdater queries for EventLogging [analytics/discovery-stats] - https://gerrit.wikimedia.org/r/322007 (https://phabricator.wikimedia.org/T147034)
[21:32:07] Analytics, Research-and-Data: Hash IPs on webrequest table - https://phabricator.wikimedia.org/T150545#2816132 (bd808) Thanks for clarifying the scope in primary description of the ticket @Nuria > We would like to reduce the chances of IP addresses (as one form of sensitive data) to be by mistake expos...
[21:38:34] Analytics, Research-and-Data: Hash IPs on webrequest table - https://phabricator.wikimedia.org/T150545#2816152 (GWicke) It might help to clarify who would have access to the time -> salt / pepper mapping. If only very few researchers had access to this information & the salt was rotated frequently, then...
[22:28:42] Analytics-Kanban, Operations, hardware-requests: stat1001 replacement box in eqiad - https://phabricator.wikimedia.org/T149911#2816480 (RobH) I should note that the spare pool system WMF4726 was purchased in December of 2015. It is 1/3rd of the way through its 3 year warranty.