[01:02:55] (03PS8) 10Awight: Schema for ORES scores [analytics/refinery] - 10https://gerrit.wikimedia.org/r/481025 (https://phabricator.wikimedia.org/T209732) [01:02:57] (03PS4) 10Awight: [WIP] Oozie jobs to produce ORES data [analytics/refinery] - 10https://gerrit.wikimedia.org/r/482753 [01:04:53] 10Analytics, 10ORES, 10Patch-For-Review, 10Scoring-platform-team (Current): Wire ORES scoring events into Hadoop - https://phabricator.wikimedia.org/T209732 (10awight) Updated patches should have working DDL and HQL scripts, but I still need to refine and smoke test the job definitions. Denormalized outpu... [01:08:03] (03PS5) 10Awight: [WIP] Oozie jobs to produce ORES data [analytics/refinery] - 10https://gerrit.wikimedia.org/r/482753 (https://phabricator.wikimedia.org/T209732) [06:48:58] morning! [07:40:32] 10Analytics, 10Analytics-EventLogging, 10Analytics-Kanban, 10MediaWiki-Vagrant: How to use Wikipedia EventLogging schemas in Vagrant setup? - https://phabricator.wikimedia.org/T153641 (10srishakatux) @Milimetric Thanks! I'm now able to see the journal logs. But, still running into exactly the same error as... 
[08:12:17] mooorning [08:12:46] hola :) [09:08:56] Goat Morning [09:53:05] 10Analytics, 10Analytics-Kanban, 10DBA, 10Data-Services, 10Core Platform Team Backlog (Watching / External): Not able to scoop comment table in labs for mediawiki reconstruction process - https://phabricator.wikimedia.org/T209031 (10Banyek) [09:53:09] 10Analytics, 10Analytics-Kanban, 10DBA, 10Data-Services, and 3 others: Create materialized views on Wiki Replica hosts for better query performance - https://phabricator.wikimedia.org/T210693 (10Banyek) 05Open→03Resolved a:03Banyek I cleaned up the tables, so I close the ticket [11:12:59] ok so we are almost ready to flip the first camus job to systemd timer [11:13:07] everything is in code review [11:13:24] I chose to migrate only netflow as testing use case [11:13:31] if good I'll move all the others [11:13:43] and also Marcel's Hive2Druid stuff [11:14:34] ah and also sanitization [11:14:48] I'd need to move fast since we keep adding crons! :P [11:32:31] fdans: going afk for lunch + errand, if you want we can catch up with Superset when I am back [11:33:04] elukey: sounds good! [12:57:22] 10Analytics, 10Analytics-EventLogging, 10Analytics-Kanban, 10MediaWiki-Vagrant: How to use Wikipedia EventLogging schemas in Vagrant setup? - https://phabricator.wikimedia.org/T153641 (10Milimetric) I'm sorry, I focused on the devserver problems and completely missed the more obvious error you posted. Tha... [13:02:48] 10Analytics, 10Analytics-EventLogging, 10Analytics-Kanban, 10MediaWiki-Vagrant: How to use Wikipedia EventLogging schemas in Vagrant setup? - https://phabricator.wikimedia.org/T153641 (10Milimetric) And yes, file an access request with #sre-access-requests for analytics-privatedata-users, you can cc me in... [13:09:06] elukey: o/ [13:09:39] elukey: I'm a bit lost in the discussion in T172410 . Our team was generally fine with the original proposal you had in the description, but things may have changed recently? 
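[aside] The camus cron-to-systemd-timer migration mentioned above would, on a typical systemd host, consist of a timer/service unit pair roughly like the following. Unit names, the wrapper path, and the schedule are illustrative guesses, not the actual puppet-managed units:

```ini
# camus-netflow.timer -- hypothetical name; replaces a crontab entry
[Unit]
Description=Periodically launch the Camus netflow import job

[Timer]
# Run every 15 minutes, mirroring an assumed cron schedule
OnCalendar=*:0/15
# Spread start times a little after boot/reload
RandomizedDelaySec=60

[Install]
WantedBy=timers.target

# camus-netflow.service -- the one-shot unit the timer triggers
[Unit]
Description=Camus netflow import job

[Service]
Type=oneshot
ExecStart=/usr/local/bin/camus-netflow-wrapper
```

Compared to cron, this gives per-job logs in the journal and `systemctl list-timers` visibility, which is presumably part of the motivation for the migration.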
[13:09:40] T172410: Replace the current multisource analytics-store setup - https://phabricator.wikimedia.org/T172410 [13:47:31] 10Analytics, 10Analytics-EventLogging, 10Analytics-Kanban, 10MediaWiki-Vagrant: Code sample for extension.json is wrong - https://phabricator.wikimedia.org/T213285 (10Milimetric) p:05Triage→03Normal [13:54:22] 10Analytics, 10Contributors-Analysis, 10Product-Analytics, 10Epic: Support all Product Analytics data needs in the Data Lake - https://phabricator.wikimedia.org/T212172 (10SBisson) >>! In T212172#4849443, @Milimetric wrote: >>> The ultimate purpose of collecting this data is to personalize new users' exper... [13:55:29] leila: o/ [13:55:58] so there are two main points [13:56:28] 1) we need to move dbstore1002 to a 3 hosts solution, each one running multiple mysql instances that replicate a wiki section (like s1, s2, etc..) [13:56:52] together with the staging db and some others (I pinged you to verify one of them in a subtask) [13:57:05] this is basically what we have been discussing during these months, nothing changed [13:57:52] 2) eventually in the bright future we'd have only the Data Lake on Hadoop and nothing more, so the more use cases moved to Hadoop the better in the medium future [13:57:59] -- [13:59:15] what we are discussing in the task now is how to support some use cases from the data analysis world to avoid breaking people's daily workflows when we decommission dbstore1002 [13:59:54] so in theory you guys should be ok, we'll have of course to sync for the migration to the new hosts but nothing more [14:00:13] Not sure if this is clearer or not :( [14:00:25] elukey: great. this is clear. [14:00:37] * leila looks for the subtask that elukey mentions [14:03:34] elukey: if by subtask you mean T212487, I have already responded. 
[14:03:35] T212487: Review dbstore1002's non-wiki databases and decide which ones needs to be migrated to the new multi instance setup - https://phabricator.wikimedia.org/T212487 [14:07:02] leila: yep yep sorry I didn't mean that you didn't, it was only to add context :) [14:07:05] sorry [14:07:36] no worries. I'm catching up with emails and I may have missed it. happy that I'm not /that/ behind this one. ;) [14:08:16] :) [14:13:03] hey fdans I have a couple other things to catch up with this morning [14:13:19] I had to put out a couple fires yesterday [14:13:28] so let's do our next scheduled eye bleed tomorrow morning [14:13:51] !log shutdown all the hdfs datanode daemons on the decom nodes (analytics1028->41) [14:13:54] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [14:14:11] from the logs those daemons are only collecting deleted [14:14:14] *deletes [14:14:20] I'll start with a couple [14:29:00] ottomata: o/ [14:30:29] o/ [14:31:11] 10Analytics, 10MediaWiki-API, 10PageViewInfo, 10Pageviews-API: API Analytics - page views by country - https://phabricator.wikimedia.org/T213221 (10Anomie) [14:31:14] if you are caffeinated - https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/482767/ - I'd merge this and then restart namenodes to complete decom [14:36:09] ottomata: --^ [14:37:15] elukey: +1 [14:37:20] and last https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/483136/ [14:37:32] elukey: should we merge those superset patches? 
[14:37:58] ottomata: I didn't have time today but please do if you have, you are surely more knowledgeable than me and less prone to failures :D [14:38:09] there is a deployment server in labs + superset node [14:38:19] I can quickly deploy and let Fran test [14:39:10] i think if superset runs from the latest stuff there we should just merge [14:39:16] if we need more patches to fix bugs we can make them [14:39:46] sure [14:40:02] actually, i'm going to go ahead and merge the first 2, the last one that actually bumps the version we can wait to verify that ^^ [14:40:23] I am going to restart namenodes in the meantime [14:40:43] k! [14:40:49] (i'm also a little confused about the state of my patches...) [14:41:11] I messed up the last one with a rebase, sorry [14:42:29] elukey: ottomata helloooo I can test whatever if you want :) [14:45:44] just restarted an-master1002, the old nodes are gone [14:46:05] I am going to wait a bit, failover, restart namenode on an-master1001, wait a bit, failover again [14:46:12] and then clean up hosts.exclude [14:50:52] gr8 :) [14:57:05] (03PS2) 10Ottomata: Use wikimedia superset fork to build_wheels. @wikimedia branch currently at 0.26.3 [analytics/superset/deploy] - 10https://gerrit.wikimedia.org/r/481053 [14:57:24] (03CR) 10Ottomata: [V: 03+2 C: 03+2] Use wikimedia superset fork to build_wheels.
@wikimedia branch currently at 0.26.3 [analytics/superset/deploy] - 10https://gerrit.wikimedia.org/r/481053 (owner: 10Ottomata) [14:58:30] ottomata: whenever you have time, sanity check for https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/483136/1/manifests/site.pp [15:01:17] (03PS2) 10Ottomata: Update to build from wikimedia's superset fork [analytics/superset/deploy] - 10https://gerrit.wikimedia.org/r/481054 [15:02:03] (03CR) 10Ottomata: [V: 03+2 C: 03+2] Update to build from wikimedia's superset fork [analytics/superset/deploy] - 10https://gerrit.wikimedia.org/r/481054 (owner: 10Ottomata) [15:06:04] elukey: looks right to me [15:06:22] thanks :) [15:08:40] (03PS5) 10Ottomata: Bump to superset version 0.26.3-wikimedia1 [analytics/superset/deploy] - 10https://gerrit.wikimedia.org/r/481056 [15:20:27] (03PS6) 10Ottomata: Bump to superset version 0.26.3-wikimedia1 [analytics/superset/deploy] - 10https://gerrit.wikimedia.org/r/481056 [15:20:47] ottomata: https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/482790/ \o/ [15:21:01] today I've worked a bit on adding the new stuff to camus [15:21:06] still wip but looking good [15:21:44] oh nice [15:22:19] eventually everything should be all timers and adding kerberos support should be easy [15:22:54] awesooome [15:23:37] elukey: how do you deploy superset in analytics labs? [15:23:40] i see the deployment-server [15:23:44] but no scap environments [15:23:52] i'm going to go ahead and deploy there so fdans can check [15:24:09] coool beans [15:24:12] there is a /srv/deployment/etc.. super set dir [15:24:35] yes [15:24:37] and then a superset.eqiad.wmflabs host (works but scap host list needs to be updated with it) [15:24:49] ahhh ok I got the scap environment thing now [15:24:49] ah you just manually edit? k [15:24:52] you mean the host list [15:24:56] ya [15:24:56] yeah sorry [15:25:08] np! [15:27:51] ok hadoop nodes officially decommed!
Going to add some notes to the admin docs [15:28:04] next step is to build the testing cluster [15:31:43] 10Analytics, 10Analytics-Kanban, 10Patch-For-Review, 10User-Elukey: Decommission old Hadoop worker nodes and add newer ones - https://phabricator.wikimedia.org/T209929 (10elukey) Nodes completely removed: * removed from the network topology and restarted namenodes * assigned role::spare:system and removed... [15:31:45] 10Analytics: Add Chinese Wikiversity edit-related metrics to Wikistats2 - https://phabricator.wikimedia.org/T213290 (10mforns) [15:32:19] 10Analytics, 10Analytics-Kanban, 10Patch-For-Review, 10User-Elukey: Decommission old Hadoop worker nodes and add newer ones - https://phabricator.wikimedia.org/T209929 (10elukey) As mentioned before these nodes will become a new testing cluster, more info in T212256 [15:32:25] 10Analytics, 10Analytics-Kanban, 10Patch-For-Review, 10User-Elukey: Decommission old Hadoop worker nodes and add newer ones - https://phabricator.wikimedia.org/T209929 (10elukey) [15:36:29] ok elukey this superset thing is not yet working....my change to build the static files i thought only ran during build, but now its trying to run on deploy too (when setting up the venv) [15:36:32] so i gotta figure that out [15:36:36] i'll work on that later today [15:37:05] super [15:37:22] at some point I hope that 0.29 goes out so we can go back to the previous release [15:37:29] (if it is stable of course) [15:38:24] 10Analytics, 10Contributors-Analysis, 10Product-Analytics, 10Epic: Support all Product Analytics data needs in the Data Lake - https://phabricator.wikimedia.org/T212172 (10mpopov) >>! In T212172#4853129, @chelsyx wrote: > Here're some use cases from my work for the iOS app team: > > - Of course, as @Neil_...
[15:44:57] 10Analytics, 10Research, 10WMDE-Analytics-Engineering, 10User-Addshore, 10User-Elukey: Provide tools for querying MediaWiki replica databases without having to specify the shard - https://phabricator.wikimedia.org/T212386 (10mpopov) By the way, on ouR side, [[ https://github.com/wikimedia/wikimedia-disco... [16:17:23] hey a-team, any thing you want me to mention in SoS? [16:17:39] mforns: maybe the thing about user partial blocks [16:18:01] https://phabricator.wikimedia.org/T202781#4865947 [16:18:05] ottomata, you want me to flip the table? [16:18:46] right [16:18:55] mforns: you mean (╯°□°)╯︵ ┻━┻ ? [16:19:02] yea xD [16:19:05] haha [16:19:12] only if you do this after: [16:19:19] (•_•) [16:19:19] ( •_•)>⌐■-■ [16:19:20] (⌐■_■) [16:19:23] xDD [16:24:49] 10Analytics, 10Operations, 10ops-eqiad: Rack A2's hosts alarm for PSU broken - https://phabricator.wikimedia.org/T212861 (10jcrespo) [16:26:51] 10Analytics, 10Operations, 10ops-eqiad: Rack A2's hosts alarm for PSU broken - https://phabricator.wikimedia.org/T212861 (10jcrespo) I rebuilt db1082- we are no blocker for any maintenance on those servers, but we would prefer to stop mysql if there is a chance for the server to lose power, while it does not... [16:31:43] elukey: ops sync? 
[16:51:48] a-team I’m not sure I’ll make stand up, I’m out to get some meds for lauren who’s sick in bed [17:10:24] 10Analytics: Reportupdater should not fail if pid file is malformed - https://phabricator.wikimedia.org/T213308 (10Milimetric) p:05Triage→03High [17:11:59] 10Analytics: Reportupdater should alert if it fails over and over - https://phabricator.wikimedia.org/T213309 (10Milimetric) p:05Triage→03High [17:12:16] 10Analytics: Add Chinese Wikiversity edit-related metrics to Wikistats2 - https://phabricator.wikimedia.org/T213290 (10JAllemandou) It's not present in the wiki-list we sqoop: https://github.com/wikimedia/analytics-refinery/blob/master/static_data/mediawiki/grouped_wikis/labs_grouped_wikis.csv Providing a patch... [17:19:44] 10Analytics, 10Anti-Harassment, 10Product-Analytics: Add partial blocks to mediawiki history tables - https://phabricator.wikimedia.org/T211950 (10Milimetric) p:05Normal→03High [17:19:58] * elukey afk for a bit [17:22:21] (03PS1) 10Joal: Add zhwikiversity to the labs sqoop list [analytics/refinery] - 10https://gerrit.wikimedia.org/r/483186 (https://phabricator.wikimedia.org/T213290) [17:22:29] milimetric: --^ if you want [17:23:08] (03CR) 10Joal: [V: 03+1] "Access tested on labsdb" [analytics/refinery] - 10https://gerrit.wikimedia.org/r/483186 (https://phabricator.wikimedia.org/T213290) (owner: 10Joal) [17:23:18] (03CR) 10Milimetric: [V: 03+2 C: 03+2] Add zhwikiversity to the labs sqoop list [analytics/refinery] - 10https://gerrit.wikimedia.org/r/483186 (https://phabricator.wikimedia.org/T213290) (owner: 10Joal) [17:23:35] Thanks milimetric :) [17:23:39] ty! 
[17:24:23] 10Analytics, 10Analytics-Kanban, 10Patch-For-Review: Add Chinese Wikiversity edit-related metrics to Wikistats2 - https://phabricator.wikimedia.org/T213290 (10JAllemandou) a:03JAllemandou [17:42:40] thanks for Chinese Wikiversity joal :] [17:44:49] 10Analytics, 10Anti-Harassment, 10Product-Analytics: Add partial blocks to mediawiki history tables - https://phabricator.wikimedia.org/T211950 (10dbarratt) [17:57:14] (03PS1) 10WMDE-Fisch: Add script to count user setting for disabled AdvancedSearch [analytics/wmde/scripts] - 10https://gerrit.wikimedia.org/r/483194 (https://phabricator.wikimedia.org/T211090) [18:02:13] (03PS2) 10WMDE-Fisch: Add script to count user setting for disabled AdvancedSearch [analytics/wmde/scripts] - 10https://gerrit.wikimedia.org/r/483194 (https://phabricator.wikimedia.org/T211090) [18:04:55] * elukey off! [18:19:16] (03CR) 10Thiemo Kreuz (WMDE): [C: 03+1] "Based on what I see and know I can't spot any mistake. But I don't feel like I know enough to be qualified to merge this." [analytics/wmde/scripts] - 10https://gerrit.wikimedia.org/r/483194 (https://phabricator.wikimedia.org/T211090) (owner: 10WMDE-Fisch) [18:39:48] 10Analytics, 10Analytics-EventLogging, 10Analytics-Kanban, 10MediaWiki-Vagrant: How to use Wikipedia EventLogging schemas in Vagrant setup? - https://phabricator.wikimedia.org/T153641 (10Legoktm) [18:39:52] 10Analytics, 10Analytics-EventLogging, 10Analytics-Kanban, 10MediaWiki-Vagrant, 10Patch-For-Review: Code sample for extension.json is wrong - https://phabricator.wikimedia.org/T213285 (10Legoktm) 05Open→03Invalid It's correct, as long as your extension is using `manifest_version: 2` (https://www.medi... 
[18:40:03] 10Analytics, 10Analytics-Kanban, 10DBA, 10User-Elukey: Review dbstore1002's non-wiki databases and decide which ones needs to be migrated to the new multi instance setup - https://phabricator.wikimedia.org/T212487 (10Marostegui) [18:41:34] 10Analytics, 10Analytics-Kanban, 10DBA, 10User-Elukey: Review dbstore1002's non-wiki databases and decide which ones needs to be migrated to the new multi instance setup - https://phabricator.wikimedia.org/T212487 (10Marostegui) @elukey I have updated the original task, to add the last statuses of the curr... [18:55:24] 10Analytics, 10Analytics-Kanban, 10DBA, 10User-Elukey: Review dbstore1002's non-wiki databases and decide which ones needs to be migrated to the new multi instance setup - https://phabricator.wikimedia.org/T212487 (10Marostegui) [18:55:58] 10Analytics, 10Analytics-Kanban, 10DBA, 10User-Elukey: Review dbstore1002's non-wiki databases and decide which ones needs to be migrated to the new multi instance setup - https://phabricator.wikimedia.org/T212487 (10Marostegui) @elukey For those databases that we have decided, so far, to backup and archiv... [18:57:23] (03CR) 10Joal: [C: 04-1] "A bunch of comments, nothing major, but still can't go as-is." (037 comments) [analytics/refinery] - 10https://gerrit.wikimedia.org/r/481025 (https://phabricator.wikimedia.org/T209732) (owner: 10Awight) [18:57:27] 10Analytics, 10Analytics-Kanban, 10DBA, 10User-Elukey: Review dbstore1002's non-wiki databases and decide which ones needs to be migrated to the new multi instance setup - https://phabricator.wikimedia.org/T212487 (10Marostegui) I asked @chasemp about `fab_migration` and I think we need to have a final wor... 
[18:57:54] 10Analytics, 10Analytics-Kanban, 10DBA, 10User-Elukey: Review dbstore1002's non-wiki databases and decide which ones needs to be migrated to the new multi instance setup - https://phabricator.wikimedia.org/T212487 (10Marostegui) [18:59:01] 10Analytics, 10Analytics-Kanban, 10DBA, 10User-Elukey: Review dbstore1002's non-wiki databases and decide which ones needs to be migrated to the new multi instance setup - https://phabricator.wikimedia.org/T212487 (10Marostegui) [19:06:59] (03CR) 10Joal: [C: 04-1] "Small addition to one comment." (031 comment) [analytics/refinery] - 10https://gerrit.wikimedia.org/r/481025 (https://phabricator.wikimedia.org/T209732) (owner: 10Awight) [19:07:59] 10Analytics, 10Analytics-EventLogging, 10Analytics-Kanban, 10MediaWiki-Vagrant: How to use Wikipedia EventLogging schemas in Vagrant setup? - https://phabricator.wikimedia.org/T153641 (10srishakatux) > "EventLoggingSchemas": { > "CentralAuth": 5690875 > } By registering the schema in the... [19:11:46] hmmmm, EL sanitization's rotating salt seems not to be compatible with running sanitization in 2 steps... :( [19:14:56] mforns: oh y? [19:15:38] yes... at the end of quarter salt rotates, so 45 days before, second pass of sanitization also changes salt [19:16:01] and starts overwriting data and changing the old salt by the new salt [19:16:58] there's a backup of the salt that lives for 2 weeks now... [19:17:07] we could use that for the second pass maybe [19:17:43] something like: if exists backup, use it, otherwise, use the actual salt file [19:18:10] oh [19:18:22] hm [19:18:56] i think keeping the salt longer is ok? [19:19:03] we can just always keep the last quarter's salt? [19:19:09] is that bad?
[19:20:03] ottomata, keeping the old salt is, theoretically, the same as not hashing the marked fields for that period [19:20:13] so, yes, a bit bad [19:20:42] we're keeping it already for 2 extra weeks, to allow for backfilling in case of fireworks [19:21:16] maybe I can set up the second pass also after 2 weeks of initial sanitization [19:21:48] or maybe keep it for 3-4 weeks? I think one full quarter would be too much [19:22:05] and maybe 4 weeks too [19:23:05] ottomata, would it be possible in puppet to pass one path if exists, otherwise pass another path to the job? [19:24:14] bash hack [19:26:53] $(if [ -f "$old_salt_path" ]; then echo "$old_salt_path"; else echo "$new_salt_path"; fi) [19:43:12] mforns: hm not easily in puppet, but in a shell script yes [19:43:23] yea [19:43:29] we already deploy a wrapper for spark jobs... but we might need a custom one to do that [19:44:17] ottomata, so you think it deserves a specific wrapper? I can do that [19:44:29] mforns: not sure, would be nicer if it didn't... but [19:44:30] hm [19:44:35] yea [19:44:46] it's not going to be reused... [19:45:08] what's the 2 pass plan btw? to re-refine everything in bulk later? [19:45:21] what's the period? [19:45:24] for the second pass? [19:45:32] wait...1 week, sanitize last 4? [19:45:35] i don't remember [19:45:49] since=46days until=45days [19:45:52] (03CR) 10Awight: Schema for ORES scores (033 comments) [analytics/refinery] - 10https://gerrit.wikimedia.org/r/481025 (https://phabricator.wikimedia.org/T209732) (owner: 10Awight) [19:45:57] or sth like that [19:46:07] oh, so refine a full day from 45 days ago [19:46:08] i see. [19:46:13] yea [19:46:17] and the salt rotates every quarter? [19:46:20] yes [19:46:27] so we'd need the salt from 45 days ago to do that properly?
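[aside] To make the salt discussion above concrete: sanitization hashes identifying fields with a per-quarter salt, so a second pass run 45 days later can only reproduce (and safely overwrite) the first pass's output if it uses the same salt the first pass used. A minimal Python sketch; the salt values are made up and the exact hash construction (HMAC-SHA256) is an assumption, not the actual refinery implementation:

```python
import hashlib
import hmac

def sanitize_field(value: str, salt: bytes) -> str:
    # Hash a sensitive field with a salt; re-running sanitization
    # reproduces the same output only with the same salt.
    return hmac.new(salt, value.encode("utf-8"), hashlib.sha256).hexdigest()

old_salt, new_salt = b"salt-2018-Q4", b"salt-2019-Q1"  # hypothetical values

first_pass = sanitize_field("session-abc", old_salt)
second_pass_right = sanitize_field("session-abc", old_salt)   # same salt: stable
second_pass_wrong = sanitize_field("session-abc", new_salt)   # rotated salt: diverges
```

If the second pass runs with the rotated salt, it silently rewrites already-sanitized data with inconsistent hashes, which is exactly the problem mforns describes.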
[19:46:38] yes, but that is super-easy [19:46:47] (03CR) 10Awight: Schema for ORES scores (031 comment) [analytics/refinery] - 10https://gerrit.wikimedia.org/r/481025 (https://phabricator.wikimedia.org/T209732) (owner: 10Awight) [19:46:47] already working, just have to change a number in puppet [19:46:59] right, it seems we'd want to keep it longer, is what you are saying? [19:47:03] just in case? [19:47:10] yes [19:47:34] keep the old salt for extra 45 days and use it in the second pass if present [19:47:50] if not present, use the regular salty [19:47:52] salt [19:48:00] joal: o/ I responded to the path question in CR, but might need more chatting because I think I'm failing to understand something here. [19:48:37] ottomata, can you pass a dict with the params to refine_job? [19:48:54] hm, mforns, maybe it'd be best to make the logic detect which salt to use based on time period? [19:48:57] so that I can pass the same dict (with overrides) to both refine_jobs? [19:49:02] the salt is passed into the job directly, right? its not discovered by job? [19:49:12] yes, passed [19:49:21] Hi awight :) [19:49:23] mforns: yes you can do that, we'd need to likely use merge() to merge the overrides onto the hashes [19:49:30] Reading your comments [19:49:32] mforns: hm. [19:49:40] its too bad the salt finding logic isn't something more like [19:50:59] date = 2018-12-01 [19:50:59] salt_file = ${path_to_salt}/${date}.salt [19:50:59] if !exists(salt_file) [19:50:59] salt_file = $path_to_salt/current.salt [19:51:00] ? [19:51:09] we could use: $(backup="foo"; if [ -f "$backup" ]; then echo "$backup"; else echo "bar"; fi) [19:51:30] that way the salt to use is based on the time period (assuming day only) [19:51:31] hm [19:51:41] then its way more flexible. [19:51:48] or even some index somewhere, that maps time periods to salt files [19:52:32] it seems fragile to just assume two salts, and use one if not the other.
also hard to test [19:52:51] better if the salt to use is predictable [19:53:21] mforns: i think when you ask me questions i make your life more complicated... :p [19:53:27] hehehe, no [19:54:09] awight: My bad about partition-ordering - If models/versions belong to wikis, then indeed let's use the order you specified (maybe not for the public one though :) [19:54:17] I was looking at what the filename of the salt was, checking if it already contains the date, but no [19:55:23] joal: oh good point! The way that table will be queried is only by wiki, and we'll probably only purge by snapshot [19:55:29] (03CR) 10Ottomata: Schema for ORES scores (031 comment) [analytics/refinery] - 10https://gerrit.wikimedia.org/r/481025 (https://phabricator.wikimedia.org/T209732) (owner: 10Awight) [19:55:38] oops--queries will be by (wiki, snapshot) [19:55:46] mforns: what is the salt named now? [19:55:50] just eventlogging.salt [19:55:50] ? [19:56:01] yes [19:56:08] awight: also, the snapshot is a group for all the joint-subfolders :) [19:56:12] /user/hdfs/eventlogging-sanitization-salt.txt [19:56:16] aye [19:56:22] and what is the backup called? [19:56:35] /user/hdfs/eventlogging-sanitization-salt.txt.old I think [19:56:37] lookin [19:56:39] right ok [19:56:43] something like that anyway [19:56:59] yeah, it sounds like we should just expect a list of salts and some way to figure out which ones should be used for which time period [19:57:09] just rotating one backup is a little fragile [19:57:15] joal: cool, so now I have /wmf/data/ores/revision/score_public/snapshot=2018-12/wiki=enwiki, if that sounds right to you? [19:57:18] better to include the date even on the current one [19:57:26] then the logic to find the proper salt will always be the same [19:57:32] might even be better to not fall back to the latest salt [19:57:37] but then... how does puppet know the name of the salt?
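[aside] The dated-salt-file pseudocode ottomata sketches above can be written as a small runnable function. The naming convention (`<date>.salt`, `current.salt`) is the hypothetical one from the chat, and this uses a local filesystem for illustration, whereas the real salt lives on HDFS at /user/hdfs/eventlogging-sanitization-salt.txt:

```python
import os

def find_salt_file(path_to_salt: str, date: str) -> str:
    # Prefer a salt file named after the time period being sanitized;
    # fall back to the current salt only if no dated file exists.
    salt_file = os.path.join(path_to_salt, f"{date}.salt")
    if not os.path.exists(salt_file):
        salt_file = os.path.join(path_to_salt, "current.salt")
    return salt_file
```

With dated salt files, the second sanitization pass deterministically resolves the same salt it used the first time, instead of guessing between a "current" file and a single ".old" backup.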
[19:57:45] maybe if the expected salt is not present, the job should just die [19:57:47] (open to a better name than "score_public", I want to say "score_with_context") [19:57:48] awight: yes thank you :) [19:58:04] refining with the wrong salt will lead to unexpected data, right? [19:58:09] yes [19:58:09] awight: I'll leave you with our naming champion ottomata ;) [19:58:18] mforns: puppet won't ... [19:58:22] something will have to [19:58:28] wrapper script...or logic in scala somewhere? [19:58:43] haha :) I would trust him with a nick like that [19:58:46] haha [19:59:31] awight: i'd advise that if possible, you should try and keep your database/table_name dirs flat [19:59:35] ottomata, OK, will think, going to eat sth, will be back in a bit! [19:59:45] that way it is easy to know exactly what table a file path belongs to [19:59:49] (03CR) 10Joal: [C: 04-1] Schema for ORES scores (031 comment) [analytics/refinery] - 10https://gerrit.wikimedia.org/r/481025 (https://phabricator.wikimedia.org/T209732) (owner: 10Awight) [19:59:57] so /wmf/data/ores is your base location path for all tables in the ores database [20:00:07] and then any tables in there should have directories named after the tables themselves [20:00:24] so if your table is ores_revision_score_public, the path would be /wmf/data/ores/ores_revision_score_public [20:00:30] ok mforns_brb [20:00:37] ottomata: thanks, will do [20:01:14] ottomata: Nice one - I've messed up long ago in some places on the cluster (pageview/hourly for instance), and now realize that it doesn't help [20:02:08] ottomata: is "ores_revision_score_archive" a consistent name to give a table which is essentially a copy of "ores_revision_score" but with mediawiki_history metadata included?
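[aside] The flat database/table layout ottomata recommends above can be sketched as a tiny path helper. This is illustrative only (real locations are declared in the Hive DDL); the partition keys shown are the ones from this discussion:

```python
def table_location(base: str, database: str, table: str, **partitions) -> str:
    # Flat layout: <base>/<database>/<table>/<key>=<value>/...
    # so any file path maps unambiguously back to its database and table.
    parts = [base.rstrip("/"), database, table]
    parts += [f"{key}={value}" for key, value in partitions.items()]
    return "/".join(parts)

# e.g. the ORES table discussed above (kwargs keep insertion order in py3.7+)
path = table_location("/wmf/data", "ores", "revision_score_public",
                      snapshot="2018-12", wiki="enwiki")
```

The point of the convention is the inverse mapping: given only a file path under /wmf/data, you can read off the database and table without consulting the metastore, which an extra nesting level (like the earlier `ores/revision/score_public` proposal) breaks.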
[20:02:23] (03CR) 10Ottomata: Schema for ORES scores (031 comment) [analytics/refinery] - 10https://gerrit.wikimedia.org/r/481025 (https://phabricator.wikimedia.org/T209732) (owner: 10Awight) [20:02:46] 10Analytics, 10Contributors-Analysis, 10Product-Analytics, 10Epic: Support all Product Analytics data needs in the Data Lake - https://phabricator.wikimedia.org/T212172 (10mpopov) Nevermind, per T170022#4800915 & T170022#4866564 I guess there's nobody actually managing Maps and RI is just doing maintenance... [20:03:17] awight: is this your _public table? [20:03:25] yes [20:03:42] It's for creating dumps. [20:04:12] awight: q, aren't all of these fields avail on the original mediawiki_revision_score table? [20:04:38] from the event? [20:04:43] ottomata: almost--the big catch is that I want to honor new suppressions. [20:04:46] or...am i confused [20:04:47] ah [20:05:05] that's just deleting/removing the appropriate row? [20:05:06] also, the schema is notably different for being normalized to one model per row [20:05:15] yes, schema different ya [20:05:16] ottomata: no, it's redacting the page_title and possibly user_text [20:05:21] ash [20:05:21] ah [20:05:29] so you need to use mw history to know that then i see [20:05:41] yeah, it's nasty [20:05:58] I'm open to any approach here, but this is all I've come up with so far. [20:06:24] awight: re: prediction, don't some models make a list of predictions? [20:06:33] yes [20:06:35] I actually view this as nice :) Being able to follow page-moves/user-renames is fun :) [20:06:47] you just gonna join with comma? [20:07:02] just noticed that you have prediction as a string [20:07:05] ottomata: oh sorry, not a list of predictions, but all models have a list of probabilities. current type is: [20:07:09] `probability` array<struct<name:string,value:double>> comment 'Predicted probability for each class.'
prediction will always be one string [20:07:20] awight: i think some have a possible list of predictions [20:07:29] otherwise we wouldn't have made predictions an array [20:07:34] at least, i think that's what aaron told us [20:07:38] 10Analytics, 10Research, 10WMDE-Analytics-Engineering, 10User-Addshore, 10User-Elukey: Provide tools for querying MediaWiki replica databases without having to specify the shard - https://phabricator.wikimedia.org/T212386 (10Neil_P._Quinn_WMF) >>! In T212386#4862292, @jcrespo wrote: > There is already an... [20:07:42] hmm /me checks that [20:08:08] ottomata: ah that may be to deal with multiple models per row? [20:08:31] I'd expect it to be a map from model_name to prediction actually. [20:08:39] no, because scores itself is already an array [20:08:47] each revision has an array of scores [20:09:03] and each score has an array of probabilities, and an array of predictions (which is usually single entry, but not always) [20:09:27] gotcha: `scores` array<struct<model_name:string,model_version:string,prediction:array<string>,probability:array<struct<name:string,value:double>>>>, [20:09:42] yup [20:10:16] 10Analytics, 10Contributors-Analysis, 10Product-Analytics: Set up automated email to report completion of mediawiki_history snapshot and Druid loading - https://phabricator.wikimedia.org/T206894 (10Neil_P._Quinn_WMF) >>! In T206894#4698640, @Milimetric wrote: > Sure, no problem. It's probably a good idea to... [20:10:29] i don't remember when the prediction has multiple values though...but i'm pretty sure its possible. since the schema needs to be the same for every score, we needed to support it [20:10:32] ottomata: okay we do have a model that can return a list of predictions, thanks for the catch! [20:10:38] :) [20:10:57] as for naming.... i don't know! i don't think _archive is quite right... [20:11:14] _dump? [20:11:24] _with_context?
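[aside] A sketch of the "one model per row" normalization being discussed, using Python dicts in place of the Hive structs. Field names follow the schema fragments in the chat (`model_name`, `prediction`, `probability`); the model names and values are made up for illustration:

```python
# One event row carries an array of scores, one element per model.
event = {
    "rev_id": 12345,
    "scores": [
        {"model_name": "damaging", "prediction": ["true"],
         "probability": [{"name": "true", "value": 0.91},
                         {"name": "false", "value": 0.09}]},
        # prediction is an array because some models emit several classes
        {"model_name": "articletopic", "prediction": ["History", "Military"],
         "probability": [{"name": "History", "value": 0.72}]},
    ],
}

# The normalized ORES table denormalizes this to one row per
# (revision, model), which is what the dump/join use cases want.
rows = [
    {"rev_id": event["rev_id"],
     "model_name": score["model_name"],
     "prediction": score["prediction"],
     "probability": score["probability"]}
    for score in event["scores"]
]
```

Keeping `prediction` as an array in the exploded rows preserves the multi-prediction models that came up in the review.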
i kinda like _public better, but it would be nice if it was clearer about how ores_revision_score is a different schema than ores_revision_score_public [20:11:36] _dump makes sense if the format is dump-oriented (CSV for instance) [20:11:39] yeah [20:11:48] the only reason you'd use this table is for the dump right? [20:11:56] otherwise folks would just query the event? [20:11:58] event table* [20:11:59] ? [20:12:22] ottomata: possibly not actually - missing a lot of event data in the explode version [20:12:33] explode+d [20:12:35] ya [20:12:45] hm [20:13:09] and its not adding more than the event table has [20:13:09] ottomata: I think the ores_revision_score table will become the most valuable for joins, actually. events only have the contemporary models, but the ores_revision_score table will be backfilled with new models run against old revisions. [20:13:19] oh! nice. [20:13:32] _export [20:13:33] ? [20:13:48] careful--I'm happy to run with any name here :) [20:13:54] haha [20:14:08] I still like _public the best [20:14:19] we do that with some other tables I think? [20:14:37] ...do we? [20:14:53] My only hesitation is that it makes ores_revision_score seem implicitly private, whereas it's just lacking the metadata columns entirely [20:14:54] maybe we don't! [20:15:02] hm [20:15:03] true [20:15:11] _dump is ok? [20:15:22] The other option is to rename the core table to something else, so that the table to be used by users is revision_score? [20:15:24] maybe _export is better than _dump ? [20:15:50] heh awight... [20:15:58] IF things other than revisions will be scored in the future [20:16:03] ... [20:16:09] you could make the small score even more generic [20:16:15] ores_score [20:16:32] `id` bigint, [20:16:32] `entity` string (e.g.
revision), [20:16:59] join on id where ores_score.entity = 'mediawiki_revision' [20:16:59] :p [20:17:05] probably not a good idea^ [20:17:08] but it is AN idea :p [20:17:48] ottomata: I've been moving away from this sort of polymorphism for ORES data in the MediaWiki DB and API, fwiw. [20:17:57] ok ok [20:18:00] it's bad? [20:18:11] I think it introduces extra complexity in the long run [20:18:14] aye [20:18:16] you are probably right [20:18:19] ores.revision_score_raw [20:18:28] ores.revision_score [20:18:28] naw not raw [20:18:35] ok [20:18:37] :) [20:18:38] also, use cases are very distinct, there's never a workflow that will query both page scores and revision scores in the same query [20:18:52] ores_revision_score_composite [20:18:54] hmm, naw [20:19:07] it's not adding anything that the event table doesn't already have [20:19:14] ores_revision_score_context ? although the trick is that it's a snapshot of context [20:19:15] i dunno _export is fine... it's really only for exporting right? [20:19:19] yes ^ [20:19:28] also, if the DB is ores, do we need the ores prefix for the table awight ? [20:19:37] of course, someone might get the idea that it's a fun table to use directly in hadoop :) [20:19:39] 10Analytics, 10Research, 10WMDE-Analytics-Engineering, 10User-Addshore, 10User-Elukey: Provide tools for querying MediaWiki replica databases without having to specify the shard - https://phabricator.wikimedia.org/T212386 (10Marostegui) I haven't found much on wikitech, so: ` marostegui@tools-bastion-03:...
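The generic `ores_score` idea floated above (one table keyed by `id` plus an `entity` discriminator column, queried as `join on id where ores_score.entity = 'mediawiki_revision'`) can be sketched to show the filtering cost it implies. All names below are illustrative; the idea was ultimately set aside in favor of dedicated per-entity tables.

```python
# Illustrative sketch of the polymorphic "ores_score" design discussed
# above: every lookup must filter on the entity type, which is the
# extra complexity the conversation decides against.
ores_score = [
    {"entity": "mediawiki_revision", "id": 101, "prediction": "damaging"},
    {"entity": "mediawiki_page", "id": 7, "prediction": "stub"},
]

# equivalent of: ... where ores_score.entity = 'mediawiki_revision'
revision_scores = [
    row for row in ores_score if row["entity"] == "mediawiki_revision"
]
assert [r["id"] for r in revision_scores] == [101]
```

With dedicated tables (e.g. `revision_score`, `page_score`), this filter disappears and each table can carry entity-specific columns.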
[20:19:45] awight: that's fine [20:19:47] joal: I'd love to drop [20:19:48] awight: that's my idea [20:20:01] joal: i'm fine with dropping the ores_ table prefix [20:20:02] so [20:20:23] awight: I think the data joined with mediawiki-history is actually more useful to others than the `raw` one [20:20:23] joal: I guess that's fine, but the tradeoff would be much older (1-30 days older) data [20:20:42] for many consumers, that'll probably be okay [20:21:03] awight: we're talking stats and trends - for real-time-ish, use events ;) [20:21:05] /wmf/data/ores/{revision_score, revision_score_error, revision_score_export} [20:21:19] heh [20:21:26] revision_score_history ? :[ [20:21:50] ottomata: kk [20:21:57] ottomata: the more I think of it, the more I'd like the 'export' table to be the base for others (in parquet etc - Because it contains metadata) [20:22:07] aye [20:22:14] joal that makes sense too, if that's the case _export is a bad name [20:22:18] :) [20:22:18] right [20:22:19] as is _dump [20:22:30] correct, and CSV format is wrong as well [20:22:31] revision_score_with_context? [20:22:48] revision_score_augmented? [20:22:51] mwarfv [20:22:53] hehe [20:23:01] Now I've really brought my bikeshed with me [20:23:26] this is more like where to put the doorknob on the bikeshed, not what color to paint :p [20:23:34] should we put the doorknob on the roof? [20:23:35] probably not! [20:23:59] :) let's make it big enough for electric cargo bikes [20:24:13] _public is still fine with me! [20:25:02] ottomata: or we reuse the webrequest approach using databases: ores_raw, ores [20:25:03] i think as long as there's docs about what the tables are, it is fine [20:25:09] +1 --^ [20:25:10] ok I'm going with that. It actually makes sense usage-wise, since the more normalized revision_score tables will often be in a semi-backfilled state. [20:25:19] joal: naw because raw as we mean it isn't what this is, [20:25:37] raw is event ...
[20:25:41] raw is more like unrefined input data, yes the _public table comes from somewhere else but [20:25:47] right [20:26:03] public it'll be :) [20:26:05] calling this raw would be like calling page_history raw but mediawiki_history refined [20:26:13] this is just a step in the pipeline, not raw [20:33:00] (03CR) 10Joal: [C: 04-1] Schema for ORES scores (031 comment) [analytics/refinery] - 10https://gerrit.wikimedia.org/r/481025 (https://phabricator.wikimedia.org/T209732) (owner: 10Awight) [20:33:15] awight: Just added a comment about format for the public table --^ [20:38:20] (03PS9) 10Awight: Schema for ORES scores [analytics/refinery] - 10https://gerrit.wikimedia.org/r/481025 (https://phabricator.wikimedia.org/T209732) [20:38:33] ^ integrates our discussion so far [20:42:14] (03CR) 10Awight: Schema for ORES scores (033 comments) [analytics/refinery] - 10https://gerrit.wikimedia.org/r/481025 (https://phabricator.wikimedia.org/T209732) (owner: 10Awight) [20:42:40] Arf I might have been misleading in my comment awight - For the public table I didn't mean remove model/version from the partitions, but rather move snapshot to the top level (as you did) - I can think of use-cases where having all models/versions in the same files will be useful - But I can also see the advantage of having them split [20:43:58] Interesting, I was thinking that the rebuild will always be the entire set with no model/version distinction, but reconsidering, if this will be used for queries "model" might help a lot [20:44:23] maybe not "model version" since end-users will be agnostic, and I don't think we'll be purging by model or model version ever [20:48:04] awight: If most querying should be done by model and version, then it's probably useful to add the version partition - If most queries are about comparing versions inside a model, then not [20:48:33] I think I don't understand the "end-users will be agnostic" part of your sentence :) [20:48:54] I think no queries will filter
on version, in other words. Only one version (the latest) should be provided in any given snapshot. [20:50:29] Ah I had missed that [20:50:33] It's only included as a column for informational purposes, so a researcher can look at our errata later and say "nuts, I used enwiki-damaging-0.4.0 which had these known problems" [20:50:46] thanks for helping me think through this stuff! [20:51:51] That's great awight - I had a demo doing the same exact thing you do (joining events with history) for analytics purposes :) Thanks for making it happen! [20:51:52] (03PS10) 10Awight: Schema for ORES scores [analytics/refinery] - 10https://gerrit.wikimedia.org/r/481025 (https://phabricator.wikimedia.org/T209732) [20:52:26] awight: For the export use-case, last-version for models makes sense - For analytics purposes, more makes sense :) [20:52:46] Being able to compare versions would be great I assume [20:53:27] hmm interesting. We do have version history for overall health statistics of each model version, but detailed scores might be neat also [20:53:41] I'll make a note about that... [20:54:02] awight: I'll show you my demo tomorrow if you wish (too late for me tonight) ;) [20:54:39] I can be a test audience, or let me know when you present to a larger group! [20:55:34] I showed halfak a while back - it hasn't evolved since then - mostly showing fun stats about models [20:56:14] ok - gone for tonight folks - See you tomorrow [20:56:17] o/ [20:56:38] Ooh. Pull me in for that demo too. I want to talk more about it. [20:56:39] o/ [20:56:44] good evening/night joal [20:57:19] halfak: Interesting point above about preserving scores from older model versions... [21:01:02] My general sense is that we shouldn't purge old scores if storage space isn't an issue. [21:01:23] I'd really like to make it easier for consumers to experiment with old models/old scores. [21:01:42] But I don't really see keeping old scores in hadoop as a good solution for that.
[21:03:15] They'll be nicely contained in directories, so it's easy to ignore them or monitor storage usage. [21:03:46] i.e. the partitioned data paths are like: /wmf/data/ores/revision_score/wiki=enwiki/model=damaging/model_version=0.0.1 [21:05:31] Makes sense. [21:13:49] either way, I'll plan the import jobs with the assumption that older model scores may or may not be present. [21:46:17] 10Analytics, 10Research, 10Wikidata: Copy Wikidata dumps to HDFs - https://phabricator.wikimedia.org/T209655 (10bmansurov) [21:47:54] 10Analytics, 10Operations, 10Research, 10Patch-For-Review, 10User-Banyek: Import recommendations into production database - https://phabricator.wikimedia.org/T208622 (10bmansurov) [21:48:47] 10Analytics, 10Research: Generate article recommendations in Hadoop for use in production - https://phabricator.wikimedia.org/T210844 (10bmansurov)
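The Hive-style partition layout mentioned above (`/wmf/data/ores/revision_score/wiki=enwiki/model=damaging/model_version=0.0.1`) can be sketched with a small helper. The base path and partition keys come from the log; the helper function itself is illustrative, not refinery code.

```python
# Sketch of building a Hive-style partition directory path, where each
# partition is a key=value path segment under the table's base path.
def partition_path(base, **partitions):
    """Join a base path with key=value partition segments, in order."""
    segments = [f"{k}={v}" for k, v in partitions.items()]
    return "/".join([base.rstrip("/")] + segments)

path = partition_path(
    "/wmf/data/ores/revision_score",
    wiki="enwiki", model="damaging", model_version="0.0.1",
)
print(path)
# /wmf/data/ores/revision_score/wiki=enwiki/model=damaging/model_version=0.0.1
```

This is what makes old model versions easy to ignore or monitor: each one lives under its own `model_version=...` directory, so partition pruning skips them unless a query asks for them.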