[02:30:41] 10Analytics, 10Product-Analytics, 10Epic: Provide feature parity between the wiki replicas and the Analytics Data Lake - https://phabricator.wikimedia.org/T212172 (10nshahquinn-wmf) 05Open→03Declined The main point of this task was to explain the data access needs of the Product Analytics team in the con... [02:58:33] 10Analytics: Automatically upload public Wikimedia datasets to Commons - https://phabricator.wikimedia.org/T244441 (10Yair_rand) [04:45:26] 10Analytics: Automatically upload public Wikimedia datasets to Commons - https://phabricator.wikimedia.org/T244441 (10Nuria) Not sure I understand how would the data be used in commons. Can you maybe upload a file and provide an example? [05:02:08] 10Analytics: Automatically upload public Wikimedia datasets to Commons - https://phabricator.wikimedia.org/T244441 (10Yair_rand) @Nuria Sorry, I didn't mean that it would be used on Commons itself. Data uploaded to Commons can be used on all Wikimedia wikis, via Lua modules. This would be useful on Meta and on s... [06:10:54] morning [06:55:57] hola! [07:19:25] o/ [08:02:01] Hi folks [08:32:36] 10Analytics, 10Multimedia, 10Tool-Pageviews: Add ability to the pageview tool in labs to get mediarequests per file similar to existing functionality to get pageviews per page title - https://phabricator.wikimedia.org/T234590 (10Gilles) I've never seen filenames collide, the hash part is a legacy thing that... [08:36:44] ok so presto seems to authenticate via kerberos/https/etc.. but now when I issue a query, it gets stuck in [08:36:49] Query 20200206_083454_00003_wq5p8, WAITING_FOR_RESOURCES, 0 nodes, 0 splits [08:36:52] lovely [08:36:54] hm [08:39:00] ah no there are https problems, the logs are a little bit weird [08:40:15] javax.net.ssl.SSLHandshakeException: General SSLEngine problem [08:40:17] very informative [08:41:05] elukey: presto non-verbosity that is nice for users can also be problematic is seems [09:03:57] joal: bonjour! 
quick question: whenever you do changes in aqs, do you run tests, etc in your local mac environment or you use vagrant or something like that? [09:04:06] aqs in mac seems super broken right now [09:04:17] fdans: I run debian [09:04:49] hmm [09:04:57] * fdans hmmmming intensifies [09:07:19] joal: ok it seems I'm the only aqs maintainer that uses mac, so I'm switching to vagrant [09:09:18] fdans: "seems super broken" --verbose [09:10:49] elukey: I think the sqlite and cassandra drivers, when upgraded to their latest version, point to linux distributions that don't have a mac counterpart [09:10:50] (i think) [09:11:15] sure but what is the error? [09:11:34] there might be some variable to tweak to make it work [09:14:41] joal: do you want to be the first to test presto kerberized ? :) [09:15:49] elukey: sure ) [09:19:11] oh elukey it might actually be my node version 😑 [09:28:05] * elukey coffee [09:55:57] joal: does it work? (if you didn't have time to test don't worry, it is fine even tomorrow) [09:56:09] elukey: I'm not sure how I should test :) [09:56:22] elukey: prod cluster? test cluster? [09:56:27] joal: prod prod [09:56:32] ah [09:56:34] ok :) [09:57:14] elukey: and obviously I can't recall the host nor the procedure :( [09:57:39] joal: ah ok! So for the moment the presto CLI is only available on an-coord1001 [09:57:42] I usually do [09:57:49] that's what I recalled, but wasn't sure [09:57:52] ok [09:57:57] presto --catalog analytics_hive [09:58:03] and then a sql query etc.. [09:58:07] sure [09:58:35] if this works then we could start thinking about adding the presto cli on all the analytics clients [09:58:40] \o/ [09:58:56] and let people use it as it is, maybe marking it as "experimental" [09:59:04] and then we'll see bottlenecks etc..
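A full interactive session on an-coord1001, as described here, would look roughly like the following sketch (the catalog name comes from the conversation; the `presto>` prompt is the standard Presto CLI, and the query and table are just an example):

```
$ kinit                              # Kerberos ticket needed first
$ presto --catalog analytics_hive
presto> SELECT count(*) FROM wmf.pageview_hourly
     -> WHERE year = 2020 AND month = 2 AND day = 5;
```

Without a valid Kerberos ticket the CLI should fail to authenticate rather than return results.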
[09:59:20] elukey: more importantly to me, adding presto to all hadoop worker nodes would be the first step (maybe I'm wrong) [09:59:33] elukey: collocating presto and yarn [10:00:24] joal: yes Andrew has some ideas about doing it on the new hadoop worker nodes, but I think that we should let people test it as it is to see if anything comes up etc.. [10:00:31] just to get some feedback [10:00:36] before starting a massive work [10:01:01] mforns: I can confirm that https://superset.wikimedia.org/superset/dashboard/73/ doesn't work now :( [10:03:36] elukey: works for me :) [10:04:09] elukey: and fails for me without kinit [10:04:12] \o/ [10:05:12] yessss [10:05:25] also traffic between nodes is now encrypted [10:05:29] and authenticated via TLA [10:05:31] *TLS [10:07:30] Man this is great :) [10:08:42] elukey: https://wikitech.wikimedia.org/wiki/Analytics/Systems/Presto [10:14:34] nice [10:15:44] I just sent an email to the superset dev@ mailing list, let's see if anybody comes up with an idea about pyhive and kerberos [10:30:37] from now on, there are Druid and Kafka left to Kerberize [10:45:20] sigh https://github.com/apache/incubator-superset/pull/7091/files [10:50:05] asked as well to dev@ [10:50:24] so in theory, this should be the final state of things [10:50:46] 1) users authenticate to superset via 2FA/cas and a username is set [10:51:11] 2) superset authenticates itself to Druid via a keytab [10:51:41] 3) superset then runs a query as the username (so on behalf of the user) [10:51:54] same thing for Turnilo [10:53:03] Then we'd need to figure out how to authenticate all the hadoop jobs that index data on Druid, but it shouldn't be that hard [10:53:21] in this way Druid would be locked down as well [10:53:27] does it make sense? [10:54:34] kinda [10:54:58] not a good start for me then :D [10:55:01] what is not clear? [10:55:19] questions (random): are superset/turnilo set up to run queries on behalf of other users?
[10:56:11] another one (the one making me not answer 'yes it does' previously): Aren't druid indexing jobs already authenticated on hadoop? [10:57:39] (03PS1) 10Fdans: Url encode file name before querying the data store [analytics/aqs] - 10https://gerrit.wikimedia.org/r/570608 (https://phabricator.wikimedia.org/T244373) [11:00:16] joal: so turnilo and superset should be able, in theory, to act as proxy, but I am not sure if the functionality is already there [11:00:32] about the indexing jobs, we currently auth druid itself [11:00:40] not hadoop -> druid [11:00:52] elukey: about turnilo/superset that was my question (feature not yet there) [11:01:10] elukey: hadoop -> druid ??? [11:02:12] joal: yes I mean sending indexing jobs is currently not authenticated to druid [11:02:23] say a worker that wants to index pageviews hourly [11:02:32] it just sends a HTTP request to druid [11:02:35] Ah so you mean authenticating to druid in order not only to query, but also to make actions [11:02:46] yes it is all or nothing [11:03:15] to me it is user -> druid auth, but through different services/apis [11:04:16] we want auth to be enabled when user access brokers(query), coordinator(manage), and overlord (indexation) [11:05:50] and we probably want historical as well, even if we don't query it manually (queryable even if we don't do it) [11:06:07] joal: yes but those are all http calls, that will start requiring auth when kerberos is set [11:06:22] right [11:06:23] so hadoop will need to comply as well, as regular user [11:06:31] hadoop is no druid user [11:06:53] and who sends indexation jobs to druid then?
[11:07:04] users :) [11:07:44] sure joseph I meant that from a worker node the http call needs to be authenticated [11:07:56] through oozie, but done as an http call, from user analytics [11:07:59] right [11:08:00] yes [11:08:14] I get it :) It's not hadoop itself as a system [11:08:28] I don't follow but ok [11:08:31] We're gonna find that usual shared understanding elukey :) [11:09:33] sure sure [11:10:13] elukey: for me, in order for indexation jobs to be launchable from oozie, we'll need analytics keytabs available on all worker nodes, and the script to use it [11:10:46] And having keytabs everywhere worries me - Maybe druid indexation jobs should be moved to airflow fast? [11:11:57] yes probably it would be great [11:12:22] there might be some workaround like what spark does (copy the keytab to the yarn shared cache etc..) [11:12:33] it would mean having an analytics keytab on the airflow runner, but only there [11:12:43] probably elukey [11:13:24] elukey: if we copy keytabs, I wish we encrypt all files :) [11:14:07] joal: spark doesn't do it, but Yarn is now authenticated+encrypted since it uses Hadoop RPC [11:14:29] so should be relatively safe [11:14:53] anyway, speaking of airflow! We have this among our goals for this q - https://phabricator.wikimedia.org/T241246 [11:15:04] is this something that we can work on during the next weeks? [11:15:15] yessir [11:15:21] It's on my plan [11:17:24] could airflow replace camus?
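For the worker-side call being described, a kerberized indexation submission might look like this sketch (`/druid/indexer/v1/task` is Druid's standard overlord endpoint and `--negotiate -u :` is curl's SPNEGO support; the keytab path, host, and task file are hypothetical):

```
$ kinit -kt /etc/security/keytabs/analytics.keytab analytics
$ curl --negotiate -u : -X POST \
    -H 'Content-Type: application/json' \
    -d @pageview_index_task.json \
    http://druid-overlord.example.wmnet:8090/druid/indexer/v1/task
```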
[11:17:48] I am wondering now what should we work on next, gobblin or airflow [11:17:53] elukey: airflow could replace camus crons(or timer) [11:18:25] but copying data is not something airflow does out of the box I think [11:19:41] maybe it could be done as part of the airflow spike [11:19:47] (checking that I mean) [11:20:03] if the answer is "better to use gobblin" then we can work on it [11:20:31] elukey: airflow is a scheduler - if we give to airflow a job that is copying data, it'll execute it - But the 'copying data' aspect needs to be taken care of [11:21:05] I guess it would need something like snakebite [11:21:10] elukey: sharing my way of thinking of airflow/gobblin --^ [11:21:50] elukey: and more: we're talking about reading from kafka in parallel, storing advancement, writing in timely partitions [11:22:06] ah yes [11:22:24] elukey: so airflow on its own will not be able to replace camus I think - We'll need something else [11:22:36] we are also assuming that gobblin will work with TLS Auth or SASL Auth to Kafka, but it is a big ? [11:22:50] Indeed! [11:22:56] sigh [11:23:04] on top of this mess there is also bigtop [11:23:10] Maybe we should first investigate that bit before starting playing with it [11:23:20] definitely [11:27:25] elukey: about wikidata dumps rsync - can you tell me which of the 2 syntaxes I should use? once this is settled and fixed, we can move forward (hdfs-rsync deployed yesterday, and new version works - test) [11:30:45] yes sure checking [11:30:54] elukey: about --^, this is my last patch regarding data rsync for a while, I promise :) [11:32:50] joal: in my mind, all those long commands should go in a /usr/local/bin script, I can take care of it if you want. It seems really the cleanest option [11:33:42] it would remove the need to think about what's wrong on the ExecStart command when needed etc..
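To illustrate the "timely partitions" part of what Camus does, discussed above: each event timestamp gets bucketed into an hourly import path. A minimal sketch (the base path and layout are assumptions for illustration, not the actual refinery configuration; GNU date assumed):

```shell
# Bucket an event timestamp (epoch seconds) into an hourly import
# path, the way a Camus-style ingester would (hypothetical base path).
ts=1580988000   # 2020-02-06T11:20:00Z
base=/wmf/data/raw/event/eqiad.mediawiki.recentchange
echo "${base}/$(date -u -d "@${ts}" +'hourly/%Y/%m/%d/%H')"
# → /wmf/data/raw/event/eqiad.mediawiki.recentchange/hourly/2020/02/06/11
```

The parallel Kafka reads and offset bookkeeping are the parts a plain scheduler like airflow does not provide on its own.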
[11:34:11] elukey: I can do it no problem - I will try to have a single template for both commands [11:34:28] super thanks [11:34:41] again I can take care of the extra hassle, I know it is painful [11:35:24] but when I see more than 10 escapes on the ExecStart line I start to wonder if in 6 months it will be debuggable or not [11:35:33] :) [11:44:15] * elukey lunch! [13:39:01] 10Analytics, 10Event-Platform: Evaluate possible replacements for Camus: Gobblin, Marmaray, etc. - https://phabricator.wikimedia.org/T238400 (10elukey) Note for whoever will test this - we need to make sure that the new tool works with either TLS or SASL auth when pulling data from Kafka. [13:46:45] 10Analytics: Upgrade Druid to its latest upstream version (currently 0.17) - https://phabricator.wikimedia.org/T244482 (10elukey) [13:54:18] joal: --^ [13:54:31] yes elukey - just read that - makes a lot of sense [14:02:21] my next step is now to work on the db1108 replica/backups, then I'd say I'll start working on BigTop [14:02:28] Druid could be a nice goal for next Q [14:23:52] 10Analytics: Issues querying table in Hive - https://phabricator.wikimedia.org/T244484 (10EYener) [14:51:04] 10Analytics: Issues querying table in Hive - https://phabricator.wikimedia.org/T244484 (10JAllemandou) Hi @EYener, * About the query: ` use cps; show partitions centralnoticebannerhistory20191202; OK partition year=2019/month=12/day=02/hour=16 year=2019/month=12/day=02/hour=17 year=2019/month=12/day=02/hour=... [15:01:05] (03PS1) 10Fdans: Moves all dist assets to ./assets-v2 in the production build [analytics/wikistats2] - 10https://gerrit.wikimedia.org/r/570667 (https://phabricator.wikimedia.org/T237752) [15:14:24] 10Analytics, 10Event-Platform, 10Pywikibot: EventStreams first message never found - https://phabricator.wikimedia.org/T244491 (10TheSandDoctor) [15:33:30] mforns: yoohoooo coming to thisi meeting? 
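The wrapper-script approach discussed above (long commands in /usr/local/bin instead of escaped ExecStart lines) can be sketched like this; the script name, path, and rsync arguments below are made up for illustration:

```shell
# The long command lives in a plain script, with normal shell quoting
# and no systemd escaping (/tmp stands in for /usr/local/bin here).
cat > /tmp/hdfs-rsync-wikidata <<'EOF'
#!/bin/bash
set -e
echo "would run: hdfs-rsync -r /srv/dumps/wikidata hdfs:///wmf/data/raw/wikidata"
EOF
chmod +x /tmp/hdfs-rsync-wikidata
/tmp/hdfs-rsync-wikidata
# The systemd unit then only needs:
#   ExecStart=/usr/local/bin/hdfs-rsync-wikidata
```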
[15:41:04] 10Analytics, 10Analytics-Kanban, 10Event-Platform, 10Security Readiness Reviews, 10user-sbassett: Security Review For EventStreamConfig extension - https://phabricator.wikimedia.org/T242124 (10Ottomata) I see this is In Progress...how goes?! :) [15:42:25] 10Analytics-EventLogging, 10Analytics-Kanban, 10Event-Platform, 10CPT Initiatives (Modern Event Platform (TEC2)), and 2 others: Modern Event Platform (TEC2) - https://phabricator.wikimedia.org/T185233 (10Ottomata) [15:47:47] 10Analytics, 10Analytics-Kanban, 10Event-Platform, 10Security Readiness Reviews, 10user-sbassett: Security Review For EventStreamConfig extension - https://phabricator.wikimedia.org/T242124 (10sbassett) @Ottomata - hope to have results by EOD tomorrow (2019-02-07). [15:48:54] 10Analytics, 10Analytics-Kanban, 10Event-Platform, 10Security Readiness Reviews, 10user-sbassett: Security Review For EventStreamConfig extension - https://phabricator.wikimedia.org/T242124 (10Ottomata) Woohoo ty [15:56:06] sorry ottomata and nuria, I completely missed the meeting [15:56:38] 10Analytics: Issues querying table in Hive - https://phabricator.wikimedia.org/T244484 (10EYener) Thank you, @JAllemandou! I've learned several new Hive features and commands. I have one additional (small) question. Using your guidance, I was able to CREATE and ALTER a table for CentralNoticeBannerHistory201912... [15:56:46] mforns, ottomata : i have a bit of time, we can touch base if you want , bc? [15:56:53] ok [15:57:52] omw [15:58:24] ping ottomata: bc? [16:07:35] 10Analytics, 10Analytics-Kanban, 10Product-Analytics (Kanban): Add new dimensions to virtual_pageview_hourly and pageview_hourly - https://phabricator.wikimedia.org/T243090 (10Nuria) @cchen: we need to deprioritizer this a bit, sorry about that. 
[16:07:43] 10Analytics, 10Analytics-Kanban, 10Product-Analytics (Kanban): Add new dimensions to virtual_pageview_hourly and pageview_hourly - https://phabricator.wikimedia.org/T243090 (10Nuria) a:05mforns→03None [16:12:27] 10Analytics, 10stewardbots, 10User-Elukey: Deprecation (if possible) of the #central channel on irc.wikimedia.org - https://phabricator.wikimedia.org/T242712 (10elukey) I tried a couple of things: * kafkacat -b kafka-jumbo1001.eqiad.wmnet:9092 -t eqiad.mediawiki.recentchange -C | grep login * curl https://... [16:12:45] (03PS1) 10Mforns: Add dimensions to druid pageview_hourly and virtualpageview_hourly [analytics/refinery] - 10https://gerrit.wikimedia.org/r/570681 (https://phabricator.wikimedia.org/T243090) [16:14:15] (03CR) 10Mforns: [C: 04-2] "This is still WIP." [analytics/refinery] - 10https://gerrit.wikimedia.org/r/570681 (https://phabricator.wikimedia.org/T243090) (owner: 10Mforns) [16:16:38] 10Analytics, 10Analytics-Kanban, 10Patch-For-Review, 10Product-Analytics (Kanban): Add new dimensions to virtual_pageview_hourly and pageview_hourly - https://phabricator.wikimedia.org/T243090 (10mforns) I pushed the code that I had, for when we resume this task, see above. [16:22:50] 10Analytics, 10Analytics-Cluster, 10User-Elukey: Upgrade the Hadoop test cluster to BigTop - https://phabricator.wikimedia.org/T244499 (10elukey) [16:26:43] nuria: hola! Are we going to do standup + grooming or staff meeting? [16:29:10] elukey: i was going to go for standup and grooming and watching staff later, can be convinced otherwise [16:36:15] fine with that, I just wanted to know to what meeting to join :D [16:46:46] 10Analytics, 10Privacy Engineering, 10Research, 10Privacy, 10Security: Release data from a public health related research conducted by WMF and formal collaborators - https://phabricator.wikimedia.org/T242844 (10Miriam) 05Open→03Resolved a:03Miriam The data was published: https://analytics.wikimedia... 
[16:52:52] Silesian Wikipedia looks fun: [16:52:52] https://stats.wikimedia.org/v2/#/szl.wikipedia.org/content/pages-to-date/normal|line|all|~total|monthly [16:54:17] oh hello https://usercontent.irccloud-cdn.com/file/qORl9He0/Screen%20Shot%202020-02-06%20at%205.54.00%20PM.png [16:55:28] awesome fdans :) [16:56:33] joal: it's kind of sad I think [16:56:38] look at that yellow line [16:56:46] organic, nice growth [16:56:58] and then HELLO I"M A BOT [16:57:07] fdans: I wasn't actually talking about the core, more about the finding itself [16:57:28] joal: hehe, I was putting together wikigrowth 2019 :) [16:57:36] nice :) [16:59:42] heya dcausse or ebernhardson - camus is failing to import eqiad.mediawiki.cirrussearch-request and eqiad.mediawiki.api-request - any idea? [17:01:29] still in a meeting, joining standup in a min [17:01:48] oh shoot [17:01:52] sorry i meant to loook ath that joal [17:01:54] will do [17:01:56] forgot beacuse meetings [17:02:04] ping elukey coming to standup? (fine if you want to go staff} [17:02:25] nuria: see above sorry, joining in a sec [17:02:32] joal: not sure, looking [17:02:40] OHHH i know what that is [17:02:41] dcausse: ottomata is also giving a look [17:02:54] i thikn alex did a failover test for eventgate-analytics [17:02:58] in prep for datacenter switchover [17:03:03] so the data is in codfw [17:03:05] will double check [17:03:22] joal: no clue [17:03:29] acl ebernhardson - thanks :) [17:03:32] ack sotrry [17:03:46] pfff - big fingers [17:13:48] 10Analytics: Request for database on statserver - https://phabricator.wikimedia.org/T244504 (10jkumalah) [17:15:40] 10Analytics: Request for database on hadoop user space - https://phabricator.wikimedia.org/T244504 (10Nuria) [17:16:56] 10Analytics, 10Editing-team (Tracking), 10Product-Analytics (Kanban): Enable Editing Team members to run queries independently - https://phabricator.wikimedia.org/T224029 (10JTannerWMF) 05Open→03Resolved Thanks @Nuria we are good to go! 
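The kafkacat and curl checks listed above for T242712 can be combined into one filter against the public EventStreams endpoint; a sketch (requires `jq`; `server_name` is a standard field of recentchange events, and the loginwiki hostname is an assumption):

```
$ curl -s https://stream.wikimedia.org/v2/stream/recentchange \
    | grep '^data: ' | sed 's/^data: //' \
    | jq -c 'select(.server_name == "login.wikimedia.org")'
```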
[17:20:17] 10Analytics, 10Analytics-Kanban: Presto access on jupyter notebooks - https://phabricator.wikimedia.org/T244505 (10Nuria) [17:21:06] 10Analytics, 10Product-Analytics: Enable shell access to presto from jupyter/stats machines - https://phabricator.wikimedia.org/T243312 (10Nuria) Let's please have a simple wikitech page that explains how to access presto as is now. [17:22:00] 10Analytics, 10Product-Analytics: Enable shell access to presto from jupyter/stats machines - https://phabricator.wikimedia.org/T243312 (10Nuria) [17:22:02] 10Analytics, 10Analytics-Kanban: Presto access on jupyter notebooks - https://phabricator.wikimedia.org/T244505 (10Nuria) [17:22:13] 10Analytics, 10Product-Analytics: Enable shell access to presto from jupyter/stats machines - https://phabricator.wikimedia.org/T243312 (10Nuria) [17:23:24] 10Analytics, 10Operations, 10ops-eqiad: rack/setup/install kafka-jumbo100[789].eqiad.wmnet - https://phabricator.wikimedia.org/T244506 (10RobH) p:05Triage→03Medium [17:23:35] 10Analytics, 10Operations, 10ops-eqiad: rack/setup/install kafka-jumbo100[789].eqiad.wmnet - https://phabricator.wikimedia.org/T244506 (10RobH) [17:25:53] 10Analytics, 10Operations, 10ops-eqiad: rack/setup/install kafka-jumbo100[789].eqiad.wmnet - https://phabricator.wikimedia.org/T244506 (10wiki_willy) test [17:46:22] 10Analytics, 10Analytics-EventLogging, 10Analytics-Kanban, 10Event-Platform, and 2 others: Create common http subobject for re-use in event schemas - https://phabricator.wikimedia.org/T242363 (10Ottomata) 05Open→03Resolved [17:46:25] 10Analytics, 10Analytics-EventLogging, 10Analytics-Kanban, 10Event-Platform, and 2 others: Modern Event Platform: Schema Registry: Implementation - https://phabricator.wikimedia.org/T206789 (10Ottomata) [17:46:48] 10Analytics, 10Analytics-Kanban: Presto access on jupyter notebooks - https://phabricator.wikimedia.org/T244505 (10fdans) p:05Triage→03Medium [17:47:43] 10Analytics, 10Analytics-Cluster: Consider replacing Cloudera's 
hadoop-hdfs-fuse with a newer, better and writeable HDFS FS mount - https://phabricator.wikimedia.org/T243460 (10Ottomata) 05Open→03Declined The candidates are bad. Declining. Candidates: ===== [[ https://hadoop.apache.org/docs/r2.6.0/hado... [17:47:45] 10Analytics, 10Patch-For-Review: Newpyter - First Class Jupyter Notebook system - https://phabricator.wikimedia.org/T224658 (10Ottomata) [17:49:04] 10Analytics: Request for database on hadoop user space - https://phabricator.wikimedia.org/T244504 (10fdans) a:03JAllemandou [17:49:27] 10Analytics, 10Analytics-Cluster, 10Patch-For-Review, 10User-Elukey: Upgrade the Hadoop test cluster to BigTop - https://phabricator.wikimedia.org/T244499 (10fdans) p:05Triage→03High [17:57:48] 10Analytics: Issues querying table in Hive - https://phabricator.wikimedia.org/T244484 (10Ottomata) @EYener hiya! It seems you are using the raw json data. It'd be better to use the refined table in the event database: `event.CentralNoticeBannerHistory`. See also https://wikitech.wikimedia.org/wiki/Analytics/... [17:58:41] 10Analytics: Upgrade Druid to its latest upstream version (currently 0.17) - https://phabricator.wikimedia.org/T244482 (10fdans) p:05Triage→03Medium [18:01:02] 10Analytics, 10Operations, 10ops-eqiad: rack/setup/install kafka-jumbo100[789].eqiad.wmnet - https://phabricator.wikimedia.org/T244506 (10elukey) These hosts need to be in 10G racks :) [18:02:22] 10Analytics: Analytics Hardware for Fiscal Year 2019/2020 - https://phabricator.wikimedia.org/T244211 (10elukey) [18:04:20] 10Analytics: Automatically upload public Wikimedia datasets to Commons - https://phabricator.wikimedia.org/T244441 (10fdans) 05Open→03Declined In order for this data to be of more use we might want to put it on an API. But uploading dumps to commons requires no action from us. Feel free to do so! 
[18:05:45] 10Analytics: Ingest data quality into druid for visualization - https://phabricator.wikimedia.org/T244388 (10fdans) 05Open→03Declined [18:07:01] 10Analytics: Convert siteinfo dumps from json to parquet - https://phabricator.wikimedia.org/T244380 (10fdans) p:05Triage→03Medium [18:07:08] 10Analytics: Convert siteinfo dumps from json to parquet - https://phabricator.wikimedia.org/T244380 (10fdans) a:03JAllemandou [18:07:54] 10Analytics: Add druid load job for data quality table - https://phabricator.wikimedia.org/T244379 (10fdans) a:03mforns [18:08:32] 10Analytics, 10Analytics-Kanban, 10Multimedia, 10Tool-Pageviews, 10Patch-For-Review: Fix double encoding of urls on mediarequests api - https://phabricator.wikimedia.org/T244373 (10fdans) p:05Triage→03Unbreak! [18:08:37] 10Analytics: Request for database on hadoop user space - https://phabricator.wikimedia.org/T244504 (10JAllemandou) Hi @jkumalah, you should actually be able to create the table yourself. I took the ooportunity to write https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hive#Create_your_own_database .... [18:09:29] JAllemandou: thank you. I will do that shortly then. [18:09:54] 10Analytics, 10Growth-Team (Current Sprint), 10Product-Analytics (Kanban): Homepage: purge sanitized event data through 2019-11-04 - https://phabricator.wikimedia.org/T244312 (10fdans) Do we need to delete all data in the tables or just some specific partitions? [18:11:11] 10Analytics, 10Growth-Team (Current Sprint), 10Product-Analytics (Kanban): Homepage: purge sanitized event data through 2019-11-04 - https://phabricator.wikimedia.org/T244312 (10fdans) a:03fdans [18:13:01] 10Analytics: Database creation in Hive - https://phabricator.wikimedia.org/T244292 (10JAllemandou) Hi @EYener, You can setup the database yourself: https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hive#Create_your_own_database Also if you're new to hive, I suggest you have a look at the page above a... 
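Given the folders-as-partitions model explained above, partitions can be registered one at a time, or in bulk with Hive's standard `MSCK REPAIR` (table name from the thread; the partition values are illustrative):

```sql
-- Register one folder as a partition:
ALTER TABLE centralnoticebannerhistory20191202
  ADD PARTITION (year=2019, month=12, day=2, hour=16);

-- Or scan the table location and register every partition folder found:
MSCK REPAIR TABLE centralnoticebannerhistory20191202;
```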
[18:13:18] 10Analytics: Database creation in Hive - https://phabricator.wikimedia.org/T244292 (10JAllemandou) a:03JAllemandou [18:14:01] 10Analytics: Unify puppet roles for stat and notebook hosts - https://phabricator.wikimedia.org/T243934 (10fdans) p:05Triage→03High [18:16:26] mforns_: can you re-join? [18:16:31] elukey, sure [18:18:03] 10Analytics: Database creation in Hive - https://phabricator.wikimedia.org/T244292 (10EYener) Thank you, @JAllemandou! I am new to Hive, and was not aware I could do this myself. It worked well, though, and I appreciate the resources. You can close this ticket; much appreciated. [18:26:38] 10Analytics: Issues querying table in Hive - https://phabricator.wikimedia.org/T244484 (10JAllemandou) Hi again @EYener, Partitions in hive are a SQL representation for folders. Adding a partition only tells hive that it should look into a folder to find files related to the values of the partition (for instance... [18:30:38] (03CR) 10Nuria: "Let's please modify the commit message to acknowledge that these urls are really not stored properly, they should be stored unencoded enti" [analytics/aqs] - 10https://gerrit.wikimedia.org/r/570608 (https://phabricator.wikimedia.org/T244373) (owner: 10Fdans) [18:31:22] 10Analytics: Database creation in Hive - https://phabricator.wikimedia.org/T244292 (10JAllemandou) 05Open→03Resolved [18:43:48] ottomata: did you see the camus errors? Do you think that it was a temp issue? [18:45:07] 10Analytics, 10Growth-Team (Current Sprint), 10Product-Analytics (Kanban): Homepage: purge sanitized event data through 2019-11-04 - https://phabricator.wikimedia.org/T244312 (10nettrom_WMF) >>! In T244312#5857137, @fdans wrote: > Do we need to delete all data in the tables or just some specific partitions?... 
[18:45:13] 10Analytics, 10Event-Platform, 10Pywikibot: EventStreams first message never found - https://phabricator.wikimedia.org/T244491 (10TheSandDoctor) [18:46:55] 10Analytics: Kerberos password for Trey Jones (tjones) - https://phabricator.wikimedia.org/T244416 (10elukey) Hi! You seem to already have a kerberos principal registered: ` elukey@krb1001:~$ sudo manage_principals.py get tjones Principal: tjones@WIKIMEDIA [..] ` Is this a request for a password reset? [18:53:25] 10Analytics: Kerberos password for Trey Jones (tjones) - https://phabricator.wikimedia.org/T244416 (10TJones) 05Open→03Resolved a:03TJones @elukey, thanks for the reminder! I'd never tried to use kerberos before yesterday so I totally spaced on the fact that passwords were sent out in November. I was able... [18:57:35] * elukey off! [18:59:43] 10Analytics, 10Better Use Of Data, 10Product-Infrastructure-Team-Backlog: EventLogging MEP Upgrade Phase 1 - https://phabricator.wikimedia.org/T244521 (10jlinehan) [18:59:52] 10Analytics, 10Better Use Of Data, 10Product-Infrastructure-Team-Backlog: EventLogging MEP Upgrade Phase 1 - https://phabricator.wikimedia.org/T244521 (10jlinehan) [19:22:18] I was scraping https://wikitech.wikimedia.org/wiki/Server_admin_log/ to find dates of deployments [19:22:35] elukey: yes it was due to alex's test of the eventgate-analytics dc switchover. [19:22:39] i mentioned in standup! [19:22:40] will email. [19:22:59] I was scraping https://wikitech.wikimedia.org/wiki/Server_admin_log to find dates of deployments, but entries for January and the first few days of feb are missing [19:23:10] oops [19:23:11] sorry [19:23:17] for double-posting [19:27:39] groceryheist: Yeah, that's not a long-term source.
See https://wikitech.wikimedia.org/wiki/Server_admin_log/Archive_39 if you really want to scrape, or https://tools.wmflabs.org/sal for a searchable proper version [19:28:42] hi James_F so the archive still doesn't have entries for Jan 2020 [19:28:53] groceryheist: Yeah, those archives are made manually. [19:29:03] (I think?) [19:29:05] Is there a way to scrape and parse the wmflabs tool [19:29:07] ? [19:29:30] I mean, is there an API for it or anything? [19:29:39] No, don't think so. [19:29:50] not very useful to me if I have to page through manually, and scripting that will be a PIA. [19:30:08] Yeah, I think there's a source of deployments data in statsd somewhere. [19:30:24] It's a source in Grafana used in some graphs. [19:30:50] E.g. https://grafana.wikimedia.org/d/000000208/edit-count?orgId=1&refresh=5m has the sync-wikiversions dashed line. [19:31:31] cool [19:34:01] so getting stuff out of statsd seems like more effort than I can put into this [19:34:23] I can stop my analysis at Dec. 2019 [19:34:44] but I'm still mildly concerned about why entries are missing from the wiki [19:37:12] groceryheist: They're manually archived by a volunteer who's not been around for a few weeks. [19:37:24] The wiki is not a canonical source forever. [19:37:33] It's just a "glancable" page. [19:44:42] I see [19:47:25] thanks [19:56:59] 10Analytics: Issues querying table in Hive - https://phabricator.wikimedia.org/T244484 (10EYener) Thanks for the suggestion, @Ottomata - I'll take it back to the team and see what makes sense. Since we've been using the json_string format since 2016, it might make sense to have the 2019 data in the same format a... [20:04:31] ottomata, I'm not pinging you because I'm still fighting with vagrant virtualbox and guest additions... [20:04:43] oof whaaa [20:10:22] Gone for diner [20:48:18] mmmwwwaaaarghh vagrant :( [20:56:07] yargghgh [20:56:15] mforns: i usually have trouble but not that much if i start brand new [20:56:18] waht's the problem? 
[21:00:32] 10Analytics: Issues querying table in Hive - https://phabricator.wikimedia.org/T244484 (10EYener) I'm curious, @Ottomata and @JAllemandou, if there is an elegant solution to dynamically filling partitions. IE, once a table is created with the partition types declared and a main location established, is the best... [21:08:20] 10Analytics, 10Product-Analytics, 10Growth-Team (Current Sprint): Homepage: purge sanitized event data through 2019-11-04 - https://phabricator.wikimedia.org/T244312 (10nettrom_WMF) [21:09:21] 10Analytics: Issues querying table in Hive - https://phabricator.wikimedia.org/T244484 (10Ottomata) Heh, this is one of the reasons not to use the raw data; partitions are added automatically to the refined tables in the `event` database. The refined table also includes pre-geocoded information. I think there... [21:11:37] 10Analytics, 10Analytics-Kanban, 10Product-Analytics, 10Patch-For-Review: Add new dimensions to virtual_pageview_hourly and pageview_hourly - https://phabricator.wikimedia.org/T243090 (10LGoto) [21:12:13] ottomata, I don't know exactly. It hangs when trying to mount the nfs shared folders. The docs say it's the guest additions version not matching virtualbox version. I didn't manage to upgrade guest additions as explained in the docs, the plugin that is supposed to do that is not working for me. I tried to downgrade virtualbox to the same version as the guest additions that I have, but no luck [21:12:47] your host os is linux? [21:12:50] or mac? [21:12:54] ubuntu [21:12:56] ah [21:12:57] yeahhhhhh [21:12:58] right. [21:12:59] hm [21:13:16] might be why, i think dan has simliiar problems [21:13:20] hm [21:13:37] I remember having had problems with the guest additions already [21:13:42] but I managed to fix them [21:15:06] mforns: can't you do away with guest additions entirely? [21:15:19] mforns: it should work just fine [21:15:55] nuria, you mean without nfs shared folder? 
[21:16:27] mforns: if guess addtions is needed to see the /vagrant folder ya, you need it, is that what you mean? [21:16:36] yes [21:20:21] (03CR) 10Nuria: Moves all dist assets to ./assets-v2 in the production build (031 comment) [analytics/wikistats2] - 10https://gerrit.wikimedia.org/r/570667 (https://phabricator.wikimedia.org/T237752) (owner: 10Fdans) [21:26:59] maybe you don't need it? [21:27:02] you can just log in and edit? [21:27:06] instead of using hostOS apps? [21:27:07] :) [21:40:00] ottomata, nuria it was my firewall... [21:40:16] seems to be working now [21:40:40] ! [21:40:40] :) [21:40:49] :( [21:41:38] mforns: fire wall on your computer? all right, we ARE ALL different [21:42:05] xD it was the ubuntu default one! I didn't put it there [21:43:34] 10Analytics: Automatically upload public Wikimedia datasets to Commons - https://phabricator.wikimedia.org/T244441 (10Yair_rand) APIs can't normally be used from wiki pages. And doing just a one-time upload of existing data sets means that Commons won't reliably have an up-to-date dataset, meaning wikis wouldn't... [21:44:47] mforns: if you will [21:44:52] cd into the puppet dir [21:44:58] and checkout this patch [21:45:02] oo lemme rebase it [21:45:12] https://gerrit.wikimedia.org/r/c/mediawiki/vagrant/+/556221 [21:45:16] then [21:45:27] vagrant roles enable eventlogging [21:45:29] vagrant provision [21:45:44] fingers crossed and you will get kafka and eventgate and eventlogging all ready to go. [21:46:06] oh you don't need to cd into puppet, sorry [21:46:13] that is part of the main mw vagrant repo [21:46:14] but ya same deal [21:46:48] ottomata, OK will try, still executing vagrant up, though, will take a while [21:47:03] ok [22:29:00] mforns: i'm out for the day, try to get that stuff up i guess and we can work on this together tomorrow yA? 
[22:58:21] (03PS6) 10Nuria: Classification of actors for bot detection [analytics/refinery] - 10https://gerrit.wikimedia.org/r/562368 (https://phabricator.wikimedia.org/T238361) [22:59:16] (03CR) 10Nuria: [C: 04-1] "Still need to test oozie workflows and augment labels per our conversation so "reason" for labeling (such us "too many nocookies request")" (034 comments) [analytics/refinery] - 10https://gerrit.wikimedia.org/r/562368 (https://phabricator.wikimedia.org/T238361) (owner: 10Nuria)