[05:06:52] morning!
[06:04:15] early Francisco is early :)
[06:04:18] morning fdans
[06:11:40] Analytics, Analytics-Kanban: Move the Analytics infrastructure to Debian Buster - https://phabricator.wikimedia.org/T234629 (elukey)
[07:10:04] hi team
[07:11:45] (PS21) Fdans: Add Pageviews Complete dumps backfilling job [analytics/refinery] - https://gerrit.wikimedia.org/r/597541 (https://phabricator.wikimedia.org/T252857)
[07:16:35] bonjour!
[07:54:47] (PS2) Fdans: Add UDF that transforms Pagecounts-EZ projects into standard [analytics/refinery/source] - https://gerrit.wikimedia.org/r/597740 (https://phabricator.wikimedia.org/T252857)
[07:54:54] Analytics: Revision_text field of mediawiki_wikitext_current is Not properly mapped - https://phabricator.wikimedia.org/T253484 (JAllemandou) Hi @DED - The problem is with showing the data only, not decoding rows. For instance: ` select page_id, page_namespace, page_title, page_redirect_title, page...
[07:55:30] Analytics: Revision_text field of mediawiki_wikitext_current is Not properly mapped - https://phabricator.wikimedia.org/T253484 (JAllemandou) Open→Invalid
[08:00:30] bbiab
[08:02:30] Morning!
[08:03:24] joal: I saw your reply to T253484. Is this a limitation in Hive?
[08:03:24] T253484: Revision_text field of mediawiki_wikitext_current is Not properly mapped - https://phabricator.wikimedia.org/T253484
[08:03:45] djellel: I don't know :) I think it's a bug in the Hive UI
[08:03:58] djellel: data is fine however
[08:04:05] djellel: it's only the printing that fails
[08:04:17] I figured. Though, the same goes for the command line.
[08:04:34] what do you mean command-line?
[08:05:03] (CR) Joal: Add special explode UDTF that turns EZ-style hourly strings into rows (2 comments) [analytics/refinery/source] - https://gerrit.wikimedia.org/r/596605 (https://phabricator.wikimedia.org/T252857) (owner: Fdans)
[08:05:13] hive CLI
[08:05:41] on a stat machine, I was trying to run my sql query from file and output to csv.
[08:05:55] yes djellel - Hive has a bug printing, whether in CLI or in HUE (same engine)
[08:06:54] not sure it's UI
[08:07:31] It is reading a newline as a new record.
[08:08:26] I guess I have to work with Spark
[08:08:45] djellel: from the 2 other queries I ran, data seems correctly extracted but printed incorrectly
[08:09:13] yes, I confirm
[08:09:56] using spark you mean?
[08:10:08] ?
[08:10:28] which queries did you run and where?
[08:10:29] Ah - no printing bug in spark, nope
[08:10:56] ok.
[08:11:29] thank you for the review joal :)
[08:12:22] I flagged an issue with Abstracts dumps. Who is responsible for that? I believe the bug report should be picked up from the backlog.
[08:12:49] np fdans :) Let's wait for others' input on whether to extract business logic or not :)
[08:15:19] no idea djellel - I don't even know what `Abstracts dumps` are - If they are XML related, it's probably in Ariel Glenn's realm (apergos on IRC)
[08:16:15] T111775
[08:16:16] T111775: Infoboxes are mistaken for abstracts in page abstract dumps. - https://phabricator.wikimedia.org/T111775
[08:16:43] ack djellel
[08:20:34] ok, I'll ping @apergos on the task
[08:25:21] joal: may I suggest that mediawiki_wikitext_current contains only the last snapshot, and mediawiki_wikitext_history all snapshots?
[08:25:30] nope djellel
[08:25:45] djellel: we have snapshots for both, history contains full historical dumps
[08:26:32] djellel: we will at some point add a 'latest' partition (see https://phabricator.wikimedia.org/T252148)
[08:29:05] it's more from a speed perspective. and +1 for latest!
[08:29:26] though, if `latest` is a standalone table, you can partition it further.
[08:29:52] djellel: the snapshot field is a partition, therefore only data from the selected snapshot is read, not all
[08:30:16] and actually, wiki_db is a partition field as well
[08:30:45] good point
[08:39:00] joal: I found a solution to run my query on Hive :)
[08:39:04] set hive.query.result.fileformat = SequenceFile;
[08:40:48] Analytics: Revision_text field of mediawiki_wikitext_current is Not properly mapped - https://phabricator.wikimedia.org/T253484 (DED) Invalid→Resolved To run the query on Hive if some fields contain a newline char: > set hive.query.result.fileformat = SequenceFile; reference: https://www.phdata.i...
[08:48:18] nice djellel
[08:51:27] I checked a bit the gunicorn configs for superset, and atm we are using a prefork-sync scheme with 8 workers
[08:51:42] what I want to try after 0.36 is using gevent
[08:51:56] so allow each process to go async when doing I/O
[08:52:06] hm
[08:52:11] elukey: --verbose?
[08:53:12] joal: gevent is a python lib to manage greenlets based on libevent. We use gunicorn as a WSGI backend for Superset, currently set to just create 8 processes that can each handle one request at a time
[08:53:58] gunicorn can be set to use greenlets instead, more or less like how node works
[08:54:42] I think I get it - This is for HTTP-level threads
[08:54:42] I think that this is what Nuria experienced when she talked about Superset slowness
[08:54:52] right
[08:55:27] she usually works when more people from SF use superset, so I think that 8 workers are not enough
[08:55:32] in sync mode I mean
[08:55:35] makes sense
[08:56:03] elukey: I was also thinking - How complicated is it for us to start moving presto to collocated?
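[Editor's note] The workaround above (`set hive.query.result.fileformat = SequenceFile;`) works because the failure is purely in result serialization: a text field containing a literal newline gets split into bogus rows by a plain line-oriented writer, while a format that frames or quotes records survives the round trip. A minimal Python sketch of the same effect, using CSV quoting as the record-framing format (the data here is made up for illustration, not from the actual wikitext table):

```python
import csv
import io

# A record whose text field contains a literal newline, like the
# revision_text column discussed above, plus a plain record.
rows = [["page_1", "line one\nline two"], ["page_2", "plain text"]]

# Naive newline-delimited text output: the embedded newline makes a
# line-oriented reader see three records instead of two.
naive = "\n".join("\t".join(r) for r in rows)
assert len(naive.split("\n")) == 3  # one bogus extra "record"

# A quoting format frames each record, so the round trip preserves rows:
# the field with the newline comes back intact as a single value.
buf = io.StringIO()
csv.writer(buf).writerows(rows)
assert list(csv.reader(io.StringIO(buf.getvalue()))) == rows
```

Same idea with SequenceFile: record boundaries are framed by the container format rather than inferred from newlines, so embedded newlines in field values stop corrupting the output.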
[08:56:54] joal: it shouldn't be super hard but requires time to plan everything correctly, I guess that we could do it next Q
[08:57:14] but we have to drop something probably if we want to prioritize it
[08:57:15] ok elukey - I think there would be tremendous value :)
[08:59:51] (PS1) Elukey: Release upstream version 0.36.0 [analytics/superset/deploy] - https://gerrit.wikimedia.org/r/598423 (https://phabricator.wikimedia.org/T249495)
[09:01:11] Just in case, in enwiki: ns = 0 is 15Mio, ns = 3 is 15Mio, other namespaces have 19Mio records.
[09:50:01] * elukey errand + lunch!
[10:21:14] (PS22) Fdans: Add Pageviews Complete dumps backfilling job [analytics/refinery] - https://gerrit.wikimedia.org/r/597541 (https://phabricator.wikimedia.org/T252857)
[11:16:40] Analytics-Kanban, Better Use Of Data, Product-Analytics, Patch-For-Review: Upgrade to Superset 0.36.0 - https://phabricator.wikimedia.org/T249495 (elukey) New version deployed in staging, waiting for another round of tests before releasing to production.
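[Editor's note] The prefork-sync-to-gevent change elukey describes is a small gunicorn configuration tweak. A hypothetical `gunicorn.conf.py` sketch, not the deployed Superset config (the bind address and connection count are placeholders; only the worker count and worker classes come from the discussion above):

```python
# gunicorn.conf.py (sketch): Superset currently runs 8 prefork "sync"
# workers, each handling one request at a time. Switching worker_class
# to "gevent" lets each worker process multiplex many requests on
# greenlets, yielding whenever a request blocks on I/O.
bind = "0.0.0.0:9080"       # placeholder listen address
workers = 8                 # process count mentioned in the chat
worker_class = "gevent"     # instead of the default "sync"
worker_connections = 1000   # max concurrent greenlets per worker
timeout = 120               # seconds before a silent worker is restarted
```

With `sync` workers, 8 slow dashboard queries saturate the pool and everyone else queues; with `gevent`, workers keep accepting requests while earlier ones wait on the database.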
[11:16:51] Analytics-Kanban, Better Use Of Data, Product-Analytics, Patch-For-Review: Upgrade to Superset 0.36.0 - https://phabricator.wikimedia.org/T249495 (elukey) p: Triage→Medium
[11:17:33] Analytics-Kanban, Better Use Of Data, Product-Analytics: Superset Updates - https://phabricator.wikimedia.org/T211706 (elukey)
[11:23:26] Analytics: Test superset running on gunicorn + gevent - https://phabricator.wikimedia.org/T253545 (elukey)
[11:43:49] Analytics, Analytics-Kanban, Operations: Create a profile to standardize the deployment of JVM packages and configurations - https://phabricator.wikimedia.org/T253553 (elukey) p: Triage→Medium
[11:48:42] Analytics, Analytics-Cluster, Analytics-Kanban, User-Elukey: Upgrade the Hadoop test cluster to BigTop - https://phabricator.wikimedia.org/T244499 (elukey)
[11:52:36] Analytics, Operations, Traffic: Remove ganglia leftovers from ops/puppet - https://phabricator.wikimedia.org/T253555 (ema)
[11:53:58] Analytics, Operations, Traffic: Remove ganglia leftovers from ops/puppet - https://phabricator.wikimedia.org/T253555 (ema) p: Triage→Low
[11:54:51] Analytics, Operations, Traffic: Remove ganglia leftovers from ops/puppet - https://phabricator.wikimedia.org/T253555 (ema)
[12:05:54] Analytics, Operations, Traffic, Patch-For-Review: Remove ganglia leftovers from ops/puppet - https://phabricator.wikimedia.org/T253555 (ema)
[12:51:21] (PS23) Fdans: Add Pageviews Complete dumps backfilling job [analytics/refinery] - https://gerrit.wikimedia.org/r/597541 (https://phabricator.wikimedia.org/T252857)
[13:02:30] joal: wdqs meeting?
[13:02:46] gehel: YES
[14:01:32] hellooo team :]
[14:02:55] Hi mforns
[14:03:06] :]
[14:17:34] (CR) Mforns: [C: +1] "LGTM!" [analytics/reportupdater-queries] - https://gerrit.wikimedia.org/r/593092 (https://phabricator.wikimedia.org/T247099) (owner: Nuria)
[14:44:05] a-team: will be working today on and off as schedule allows but not on meetings, it is a US holiday
[14:44:23] ack nuria
[14:55:37] (CR) Nuria: [C: +2] "Merging as results coincide with the ones jennifer had earlier" [analytics/reportupdater-queries] - https://gerrit.wikimedia.org/r/593092 (https://phabricator.wikimedia.org/T247099) (owner: Nuria)
[14:55:39] (CR) Nuria: [V: +2 C: +2] Automate calculations for number of pages using wikidata items [analytics/reportupdater-queries] - https://gerrit.wikimedia.org/r/593092 (https://phabricator.wikimedia.org/T247099) (owner: Nuria)
[15:28:59] heya elukey :] I'd like to install mysql under my user in an-launcher to be able to test Airflow with the ParallelExecutor. Any things I should know, be careful with, or any heads up? I wouldn't like to break anything there. Also, I'm not sure I can apt install there...
[15:30:49] mforns: if you need a temporary database, I can create one for you on analytics1030 (test host) so you can skip that
[15:31:15] something like 'airflow_mforn_test'
[15:31:17] elukey: oh, I think that would work, would be great!
[15:31:44] sure, airflow_mforns_test is awesome
[15:37:32] mforns: on an-launcher1001 you have a file called airflow_db_creds (in your home)
[15:37:39] the db is created
[15:37:51] analytics1030.eqiad.wmnet:3306 to access it
[15:38:00] test it and tell me if it works :)
[15:44:07] Analytics: Grant not able to access superset - https://phabricator.wikimedia.org/T253281 (elukey) What error is returned? Is the LDAP access failing? The uid is `gsingers` but I can't find it in httpd's logs..
[15:45:27] elukey: super thanks! :D
[15:45:44] configuring airflow now
[15:55:33] elukey: Airflow docs ask me to: We rely on more strict ANSI SQL settings for MySQL in order to have sane defaults.
Make sure to have specified explicit_defaults_for_timestamp=1 in your my.cnf under [mysqld]
[15:56:41] Where in analytics1030 can I find that?
[15:59:22] it is under /etc/my.cnf
[15:59:36] IIRC we had to add it with Erik at the time
[15:59:38] I found /etc/mysql/conf.d/mysql.cnf
[15:59:52] oh
[16:00:22] ah yes
[16:00:22] # Required for search/airflow installation. This will also
[16:00:22] # be the mariadb default some day, as the old behaviour
[16:00:22] # is deprecated.
[16:00:23] explicit_defaults_for_timestamp = on
[16:00:28] so all good
[16:01:20] ok! thx
[16:01:47] going to step away for a bit!
[16:01:56] k!
[16:30:14] mforns: all good with the db?
[16:44:49] elukey: one question
[16:45:07] the password you created is for my user
[16:45:28] now, I need to execute airflow as analytics, because the job is spark and needs access to the metastore
[16:46:10] so, is test mysql also kerberized? will it have problems when the analytics user tries to connect to mysql using mforns credentials?
[16:46:17] mforns: yes I used your username but in theory I could have used also "batman", the important bit is that you configure it in airflow's config
[16:46:32] ok, cool :]
[16:46:36] yes no krb :)
[16:46:46] shame you didn't use batman though :[
[16:46:55] xD just kidding
[16:50:58] I wish I had!
[16:54:56] elukey: not working... :[ it cannot connect to mysql, maybe I built the connection URI wrong.. but I think it's OK, hmm
[16:55:27] says access denied to mforns, using password yes
[16:57:34] I used mysql://mforns:@analytics1030.eqiad.wmnet:3306/airflow_mforns_test
[17:00:04] should I use analytics1030's IP directly? probably
[17:01:31] in theory no
[17:01:35] maybe I missed some grants
[17:03:48] no, same problem
[17:04:41] yes I can repro via mysql cli, lemme check what I missed
[17:04:48] ok, thanks :]
[17:05:41] I noticed you wrote airflow_mforn_test before (without the 's'), maybe that's it?
[17:08:56] mforns: can you retry now?
[17:09:01] sure
[17:09:42] elukey: working!
[17:09:49] mforns: sorry my bad, PEBCAK
[17:10:02] no no, the opposite, thanks! :D
[17:10:06] Analytics, Analytics-Cluster, Analytics-Kanban, User-Elukey: Upgrade the Hadoop test cluster to BigTop - https://phabricator.wikimedia.org/T244499 (elukey) Upgraded a second time and failed with a different issue. This time, I ended up with a lot of missing/under-replicated blocks and also ~7% of...
[17:24:20] mforns: all good? If so I'll log off, otherwise I can help :)
[17:27:43] elukey: all seems good! thanks a lot
[17:31:23] super
[17:31:26] * elukey off!
[17:59:54] Analytics: Measure DNT usage across geographies and wikis - https://phabricator.wikimedia.org/T187376 (Nuria) Open→Declined
[18:00:27] Analytics: Measure DNT usage across geographies and wikis - https://phabricator.wikimedia.org/T187376 (Nuria) declined per DNT being (mostly) a failed experiment not supported going forward by a number of browsers
[18:05:24] Analytics, Product-Analytics: [Spike] Should EventLogging support DNT? - https://phabricator.wikimedia.org/T252438 (Nuria) >We have a whole group of smart, technically educated, privacy-minded experts weighing in and even we do not entirely agree what DNT means. could not agree more, this has been the ca...
[22:56:26] Analytics, Event-Platform, Growth-Team, MediaWiki-Recent-changes, and 2 others: Remove deprecated RCFeedEngine support - https://phabricator.wikimedia.org/T250628 (Krinkle) @Milimetric The above finishes the deprecation we started as part of setting up the EventBus/EventStreams consumer of RCFeed...
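[Editor's note] As a footnote to the Airflow debugging session above: the access-denied error turned out to be a missing grant on elukey's side (the database had been created as `airflow_mforn_test`, without the 's'), but the shape of the connection URI matters too. A hedged sketch of building the `sql_alchemy_conn` value for airflow.cfg; the host and database name are from the log, while the password is a placeholder standing in for whatever is in the `airflow_db_creds` file:

```python
# Build the SQLAlchemy connection URI Airflow expects for its metadata DB.
# The URI pasted in the chat (mysql://mforns:@...) had an empty password
# between ':' and '@'; quote_plus() guards against passwords containing
# characters like '@' or '/' that would break URI parsing.
from urllib.parse import quote_plus

user = "mforns"
password = "PLACEHOLDER_FROM_airflow_db_creds"  # not the real value
host = "analytics1030.eqiad.wmnet"
database = "airflow_mforns_test"

sql_alchemy_conn = (
    f"mysql://{user}:{quote_plus(password)}@{host}:3306/{database}"
)
```

This string goes under the `[core]` section of airflow.cfg as `sql_alchemy_conn`; with an empty password segment the client attempts a passwordless login, which MySQL reports as the same "access denied" error a missing grant produces.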