[07:23:00] goood morning!
[07:24:05] !log create /usr/lib/x86_64-linux-gnu/libcrypto.so on all the analytics nodes via puppet
[07:24:07] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[07:55:13] really interesting
[07:55:15] https://grafana.wikimedia.org/d/000000585/hadoop?orgId=1&fullscreen&panelId=25&from=now-90d&to=now
[07:55:36] we are close to 1.9PB of space used on Hadoop, out of the 2.5PB available
[07:58:21] !log roll restart hadoop yarn node managers to pick up new libcrypto.so link (shouldn't be necessary but just in case)
[07:58:23] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[08:21:15] Good morning :)
[08:21:30] elukey: I wanted to check disk space with you indeed
[08:21:46] bonjour!
[08:22:09] joal: all the hadoop workers are now openssl-aware
[08:22:15] elukey: on a more personal matter: my car crashed on me this morning, I'll be changing the battery later this morning
[08:22:19] \o/ !
[08:22:25] ouch! sure!
[08:22:56] elukey: There are downsides in living in the countryside ;)
[08:30:22] joal: if you are ok I can deploy the spark 2.4 encryption settings
[08:30:32] +1 elukey
[08:30:41] elukey: let's manually test first I guess
[08:31:15] joal: everything is running fine in test but we can do something like
[08:31:26] 1) stop timers and drain the cluster
[08:31:44] 2) deploy settings and roll restart nodemanagers (to enable the auth settings)
[08:31:47] 3) test manually
[08:31:50] 4) re-enable
[08:32:10] works for me elukey :)
[08:32:18] all right :)
[08:32:36] elukey: I'm gonna change the battery now, and hopefully should be back in 1h or so help
[08:32:49] joal: super, otherwise I can do it tomorrow
[08:32:54] there is really no rush
[08:32:58] let's do it !
[08:33:21] gone doing some mechanics, back in a bit
[08:34:43] !log stop timers on an-coord1001 to drain the cluster and ease the deploy of spark encryption settings
[08:34:44] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[08:57:57] brb coffee
[09:14:22] so this was my fear https://docs.confluent.io/current/kafka/authentication_sasl/index.html
[09:14:34] "Apache Kafka® brokers supports client authentication via SASL. SASL authentication can be enabled concurrently with SSL encryption (SSL client authentication will be disabled)."
[09:15:06] atm varnishkafka authenticates via TLS client certs to kafka jumbo
[09:15:46] so enabling kerberos to lock down some topics would be a problem
[09:15:46] sigh
[09:22:34] (kerberizing the cp hosts for this use case seems to be a little bit too much :P)
[09:23:10] (or maybe it could be even thinkable, not sure)
[09:27:59] will open the task with some ideas
[09:32:47] !log roll restart yarn node managers again to pick up spark encryption/authentication settings
[09:32:49] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[09:40:20] done!
[09:53:26] Analytics, WMDE-Analytics-Engineering, Privacy, User-GoranSMilovanovic: Public data set review for T237728 - https://phabricator.wikimedia.org/T239393 (GoranSMilovanovic) Open→Invalid @Nuria Thank you very much for your assessment and your suggestions. Closing the ticket.
[09:53:52] I am running some spark sql queries now on stat1004, so far it looks good
[09:54:04] both pyspark and scala-spark
[09:56:51] interesting, when closing i get warnings like
[09:56:52] 20/02/03 09:55:11 WARN YarnSchedulerBackend$YarnSchedulerEndpoint: Attempted to get executor loss reason for executor id 146 at RPC address 10.64.5.25:57668, but got no response. Marking as slave lost.
[09:58:17] elukey: the above happened already before
[09:58:36] elukey: sometimes spark doesn't manage to kill its executors properly
[10:00:35] joal: okok, good :)
[10:00:41] joal: let me know if you find anything weird
[10:01:24] sure elukey
[10:10:41] elukey: my small tests look good as well :)
[10:10:59] gooooood!
[10:11:08] shall we re-enable timers?
[10:11:24] Please let's do so
[10:11:52] !log enable all timers on an-coord1001 after spark encryption/auth settings
[10:11:53] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[10:12:29] today I'll try to tackle the presto auth settings
[10:12:36] then there is kafka and druid
[10:12:44] ok
[10:14:45] I am wondering if airflow picks up the new settings without any issue
[10:14:48] in theory yes
[10:15:02] elukey: I'm going to manually restart a spark oozie job, to triple check
[10:16:26] joal: +1
[10:16:51] going afk for 10 mins, will be back soon to check
[10:16:51] Actually it's not even needed elukey - Monitoring the next run of https://hue.wikimedia.org/oozie/list_oozie_coordinator/0000375-191216160148723-oozie-oozi-C/ should be enough
[10:16:57] ack elukey
[10:28:49] lovely this java.io.IOException: Stream is corrupted
[10:28:52] in refine
[10:33:45] trying to re-run refine_mediawiki_events
[10:35:06] the issue seems something related to https://issues.apache.org/jira/browse/SPARK-18105
[10:35:13] in fact we are now encrypting shuffle files
[10:35:36] so I am wondering if we hit a bug
[10:36:12] (in the sense that in hadoop test I didn't check compression + encryption of shuffle files)
[10:37:10] spark.io encryption may probably be not needed in our case
[10:37:18] yeah the failure seems consistent
[10:42:39] trying to restart the refine job without io encryption settings
[10:44:11] no failure report this time
[10:49:38] joal: I am inclined to disable spark io encryption in the default settings, and possibly report the issue upstream
[10:49:43] what do you think?
[10:50:14] RPC auth + encryption seems a must, shuffle files encryption on worker disks probably not so much (but desirable of course)
[11:04:59] ok done :)
[11:05:07] (removed from puppet)
[11:18:31] ahh now moar failures
[11:18:31] ERROR org.apache.spark.executor.CoarseGrainedExecutorBackend - Executor self-exiting due to : Unable to create executor due to Unable to register with external shuffle server due to : java.lang.IllegalStateException: Expected SaslMessage, received something else (maybe your client does not have SASL enabled?
[11:19:04] this is from api-coord and data quality
[11:19:18] that tells me it is oozie not picking up the changes
[11:19:20] ufff
[11:20:44] !log restart oozie on an-coord1001
[11:20:46] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[11:21:09] 2020-02-03 11:20:48,525 INFO SparkConfigurationService:520 - SERVER[an-coord1001.eqiad.wmnet] Loaded Spark Configuration: *=/etc/spark2/conf/spark-defaults.conf
[11:22:12] I suspect that our dear oozie loads the spark defaults only on startup
[11:22:16] let's see
[11:23:54] yes
[11:28:06] all right all jobs re-running, looks good now
[11:28:28] going afk for a couple of hours, I hope that now all jobs are stable, in case there are issues I'll fix them as soon as I am back :)
[11:42:32] elukey: sorry I was gone battery again - Let's talk when you're back so that I completely understand what we hit
[11:58:28] Analytics, Better Use Of Data, Desktop Improvements, Performance-Team, and 6 others: Client side error logging production launch - https://phabricator.wikimedia.org/T226986 (Krinkle)
[13:58:26] joal: I am back!
[13:58:36] heya elukey - so am i
[13:58:50] elukey: it's actually more than the battery - alternator is dead :(
[13:59:08] ah there you go, this is why the battery was not charging :(
[13:59:58] so I haven't really understood the shuffle file encryption issue
[14:00:10] yeah, me kinda neither
[14:00:14] but removing the spark.io options worked
[14:00:15] you wanna batcave?
[14:00:18] sure
[14:27:47] Analytics, Android-app-Bugs, Wikipedia-Android-App-Backlog (Android-app-release-v2.7.30x-O-Ontbijtkoek): EventLogging sees MobileWikiAppFindInPage parsing errors - https://phabricator.wikimedia.org/T147196 (Dbrant) Open→Resolved
[14:50:36] Analytics, Cite, Reference Previews, Research, and 2 others: Instrument Cite to record the nubmer of footnote marks and references list entries rendered in each article - https://phabricator.wikimedia.org/T241833 (awight) @Miriam This might relate to your citation usage research. Feedback welcomed!
[15:00:40] Analytics: Unify puppet roles for stat and notebook hosts - https://phabricator.wikimedia.org/T243934 (Ottomata) FYI, after some chats at all hands, I'm encouraged that a 'thin' jupyterhub with notebook servers running in Yarn is actually feasible!
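(Editorial note, not part of the channel log.) The split discussed above, keeping RPC authentication and encryption while dropping shuffle-file (IO) encryption, corresponds to separate Spark properties. The sketch below is only an illustration of that split: the property names are standard Spark settings, but the values actually deployed via puppet are not shown in this log and are assumed here, and `render_spark_defaults` is a hypothetical helper, not anything from the cluster's tooling.

```python
# Sketch only. The property names are standard Spark settings; the values the
# cluster really used are NOT in the log and are assumed for illustration.

# RPC authentication + encryption ("seems a must" in the discussion).
RPC_SECURITY = {
    "spark.authenticate": "true",
    "spark.network.crypto.enabled": "true",
}

# Encryption of shuffle/spill files on worker disks (the part that was
# removed after the "Stream is corrupted" failures).
IO_ENCRYPTION = {
    "spark.io.encryption.enabled": "true",
}


def render_spark_defaults(io_encryption: bool = False) -> str:
    """Render the chosen settings as spark-defaults.conf style lines.

    Hypothetical helper: each line is "<property> <value>".
    """
    settings = dict(RPC_SECURITY)
    if io_encryption:
        settings.update(IO_ENCRYPTION)
    lines = [f"{key} {value}" for key, value in sorted(settings.items())]
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Configuration matching the outcome of the chat: RPC security on,
    # IO encryption left out.
    print(render_spark_defaults(io_encryption=False), end="")
```

As the log notes, services that read spark-defaults.conf only at startup (oozie in this case) need a restart before a change like this takes effect.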
[15:52:44] (going afk a bit, since no meetings for today, but if needed ping me on the phone :)
[17:11:11] Analytics, Analytics-Kanban, Event-Platform, Security Readiness Reviews, user-sbassett: Security Review For EventStreamConfig extension - https://phabricator.wikimedia.org/T242124 (sbassett) a:sbassett
[17:12:39] Analytics, Fundraising-Backlog, WMDE-Analytics-Engineering, WMDE-FUN-Team, WMDE-Fundraising-Tech: Find a better way for WMDE to get impression counts for their banners - https://phabricator.wikimedia.org/T243092 (DStrine)
[17:27:50] Analytics, Fundraising-Backlog, WMDE-Analytics-Engineering, WMDE-FUN-Team, WMDE-Fundraising-Tech: Find a better way for WMDE to get impression counts for their banners - https://phabricator.wikimedia.org/T243092 (Nuria) Can someone describe what fields/columns does this dataset have? Where i...
[17:41:59] Analytics: Unify puppet roles for stat and notebook hosts - https://phabricator.wikimedia.org/T243934 (Nuria) >notebook servers running in Yarn is actually feasible! nice, let's flush out this more
[18:55:05] "Presto does not currently support impersonating the end user when accessing the Hive metastore."
[18:58:06] but still doesn't make sense why presto's coordinator allows me to issue queries without kinit
[19:59:42] Analytics, Cite, Reference Previews, Research, and 2 others: Instrument Cite to record the nubmer of footnote marks and references list entries rendered in each article - https://phabricator.wikimedia.org/T241833 (Nuria) >Would this belong in a shared feature store? This seems very useful, it sho...
[21:15:24] Analytics, Better Use Of Data, Desktop Improvements, Product-Infrastructure-Team-Backlog, and 6 others: Client side error logging production launch - https://phabricator.wikimedia.org/T226986 (Gilles)
[22:01:12] Analytics, Contributors-Team, Growth-Team, MediaWiki-Page-editing, and 6 others: statistics about edit conflicts according to page type - https://phabricator.wikimedia.org/T139019 (nshahquinn-wmf) Product Analytics has no plans to work on this, but I'm putting it in the general research backlog t...
[22:01:22] Analytics, Growth-Team, MediaWiki-Page-editing, StructuredDiscussions, and 4 others: statistics about edit conflicts according to page type - https://phabricator.wikimedia.org/T139019 (nshahquinn-wmf)
[23:56:38] Analytics, Multimedia, Tool-Pageviews: Add ability to the pageview tool in labs to get mediarequests per file similar to existing functionality to get pageviews per page title - https://phabricator.wikimedia.org/T234590 (MusikAnimal) https://tools.wmflabs.org/mediaviews is back! I would still conside...
[23:57:21] Analytics, Fundraising-Backlog, WMDE-Analytics-Engineering, WMDE-FUN-Team, WMDE-Fundraising-Tech: Find a better way for WMDE to get impression counts for their banners - https://phabricator.wikimedia.org/T243092 (Nuria) And also, where does the data that ends up on pgehres.bannerimpression co...
[23:58:42] Analytics, Multimedia, Tool-Pageviews: Statistics for views of individual Wikimedia images - https://phabricator.wikimedia.org/T210313 (MusikAnimal) https://tools.wmflabs.org/mediaviews/ has been revived, making use of the new media request APIs :) Please create a task with #tool-pageviews if you enc...