[00:25:33] RECOVERY - Check unit status of monitor_refine_eventlogging_analytics on an-launcher1002 is OK: OK: Status of the systemd unit monitor_refine_eventlogging_analytics https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers [06:29:13] Good morning [06:29:43] Investigating sqoop failure [06:36:00] 10Analytics, 10Analytics-Kanban: Data drifts between superset_production on an-coord1001 and db1108 - https://phabricator.wikimedia.org/T279440 (10elukey) >>! In T279440#7050361, @razzi wrote: > I want to work on this! Is it ok to drop superset_production on db1108 in order to do this? If so, I think I'll be a... [06:37:22] joal: bonjour :) [06:37:27] let me know if I can help [06:37:49] sure elukey [06:38:20] elukey: I don't know if it's related to db-change last month, but we've experienced a lot of failures during sqoop job :( [06:42:01] ls [06:42:03] oops [06:42:39] joal: Isn't private sqooped from the dbstores? [06:42:53] (trying to understand where the issue lies) [06:42:54] it is elukey! [06:43:14] but I see from the logs that we don't really have a good indication of what failed :( [06:43:18] elukey: we're running into problems both for private and labs ( [06:44:11] elukey: I'm gonna try a manual run of one of the failed tables [06:45:33] :( [07:14:45] elukey: 2 different errors from logs [07:14:52] com.mysql.jdbc.exceptions.jdbc4.MySQLNonTransientConnectionException: Communications link failure during commit(). Transaction resolution unknown. [07:14:55] and [07:14:59] org.apache.hadoop.mapred.YarnChild: Error running child : java.lang.OutOfMemoryError: GC overhead limit exceeded [07:15:20] despite the order, I think the problem comes from GC - Will try to bump memory of mappers [07:15:41] 10Analytics-Clusters, 10Analytics-Kanban, 10Patch-For-Review: Upgrade the Hadoop masters to Debian Buster - https://phabricator.wikimedia.org/T278423 (10elukey) >>! 
In T278423#7049845, @razzi wrote: > Alright, here's my plan @elukey, perhaps we can discuss this next week and if it looks good we can plan the... [07:18:54] 10Analytics: Spike. Try to ML models distributted in jupyter notebooks with dask - https://phabricator.wikimedia.org/T243089 (10elukey) To keep archives happy - we are already testing https://github.com/criteo/tf-yarn with Miriam and Aiko, that behind the scenes uses [[ https://pypi.org/project/skein/ | Skein ]]. [07:21:27] 10Analytics-Clusters, 10Analytics-Kanban, 10Technical-blog-posts: Story idea for Blog: Migration of the Analytics Hadoop infrastructure to Apache Bigtop - https://phabricator.wikimedia.org/T277133 (10elukey) @srodlund sorry for the lag, me and Joseph should have a draft for this week :) [07:28:07] Cannot seem to use data in spark. Im using stat1008. Error: [07:28:08] 21/05/03 07:24:39 WARN metastore: Failed to connect to the MetaStore Server... [07:28:08] 21/05/03 07:24:40 ERROR TSaslTransport: SASL negotiation failure [07:28:08] javax.security.sasl.SaslException: GSS initiate failed [Caused by GSSException: No valid credentials provided (Mechanism level: Failed to find any Kerberos tgt)] [07:28:22] Hi tanny411 [07:28:29] tanny411: you need to kinit :) [07:30:32] Oops, totally forgot! Thanks! 🤐 [07:32:33] 10Analytics: Fix sqoop script to use timestamp limits in `--boundary-query` queries - https://phabricator.wikimedia.org/T281668 (10JAllemandou) [07:32:51] elukey: it seems that we hit https://issues.apache.org/jira/browse/SQOOP-1400 [07:33:41] elukey: I'm surprised, as we use sqoop 1.4.6, for which it is supposed to be fixed - might be related to the mysql-connector version [07:34:40] joal: can you repro reliably? 
[07:35:13] I am asking since with Goran sqoop worked without any `--driver` option, so maybe it is worth trying [07:35:16] elukey: trying yet another time [07:36:20] we have 5.1.49-0+deb9u1 on launcher [07:36:30] so in theory not 1.17 [07:36:33] Yup I've seen that elukey [07:36:38] :S [07:37:09] elukey: this investigation allowed me to find a nasty bug (see T281668 above) [07:37:10] T281668: Fix sqoop script to use timestamp limits in `--boundary-query` queries - https://phabricator.wikimedia.org/T281668 [07:37:33] lovely [07:39:43] anyhow elukey, I suggest we kill the currently running prod sqoop job, and relaunch after we fix [07:40:16] joal: whatever you prefer to do [07:40:44] elukey: I confirm that the problem (GC overhead) happens reliably [07:41:18] ah wait GC Overhead? [07:41:24] yes [07:41:48] so sqoop/jvm limits related? [07:41:56] I think so yes [07:43:10] elukey: we have a fetch-size parameter that we can use, it is currently set to None [07:43:17] This is the trick I use to overcome the problem [07:43:50] joal: mmm I see -Xmx1000m listed for sqoop jobs, isn't it a little low?
[07:44:46] elukey: This limit has worked fine for months - I think we're hitting the problem of trying to load the entire mysql result-set in memory [07:45:10] joal: it makes sense yes, but I was wondering if a little bump could help [07:45:23] elukey: I tried, didn't work [07:45:27] perfect [07:45:41] the fetch size looks like a good option [07:46:07] !log Kill prod sqoop job to restart after fix [07:46:11] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [07:47:07] elukey: Trying without the driver parameter [07:54:23] PROBLEM - Check unit status of refinery-sqoop-whole-mediawiki on an-launcher1002 is CRITICAL: CRITICAL: Status of the systemd unit refinery-sqoop-whole-mediawiki https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers [07:59:40] elukey: the job worked without the driver parameter - weird :( [08:00:13] elukey: I don't know if we should remove the driver parameter or add the fetch-size one :S [08:00:28] I lean toward trying without the driver param [08:00:40] yes let's try to do it [08:01:05] ack - sending a PR soon [08:03:47] joal: the --driver thing is a little weird, I thought that it worked for Goran due to hive, but in this case it doesn't make sense [08:04:41] MEH [08:04:54] elukey: could it be that the driver availa [08:04:58] sorry again [08:05:24] available in the jar has a different class than the one we provide as a parameter, and that one is a legacy version?
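The fetch-size and --driver behavior discussed above can be sketched as a tiny argument builder, in the spirit of the refinery sqoop wrapper. This is not the actual refinery code; the function name, defaults, and sample URL are hypothetical, purely to illustrate the two options being debated:

```python
# Hypothetical sketch of how a sqoop wrapper might assemble CLI
# arguments -- not the actual refinery sqoop.py. The idea discussed
# above: omit --driver so Sqoop picks its MySQL-specific connection
# manager, and pass --fetch-size so the JDBC driver streams rows
# instead of buffering the whole result set in mapper memory (the
# "GC overhead limit exceeded" error seen in the logs).

def build_sqoop_args(jdbc_url, table, fetch_size=None, driver=None):
    args = ["sqoop", "import", "--connect", jdbc_url, "--table", table]
    if fetch_size is not None:
        # A positive value asks the connector to stream results.
        args += ["--fetch-size", str(fetch_size)]
    if driver is not None:
        # Forcing --driver makes Sqoop fall back to the generic JDBC
        # manager, which can change defaults (see SQOOP-1400).
        args += ["--driver", driver]
    return args

# Without --driver, Sqoop auto-detects the manager from the URL.
print(build_sqoop_args("jdbc:mysql://db1108/enwiki", "revision",
                       fetch_size=10000))
```

Omitting `--driver` lets Sqoop auto-detect its MySQL connection manager from the JDBC URL, which is the behavior that ended up working here.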
[08:06:56] sqoop uses /usr/lib/sqoop/lib/mysql-connector-java.jar [08:07:12] that is a symlink to the new mysql jar that we have also deployed on the coordinator nodes [08:07:17] the forward port from stretch [08:08:46] before that we were using the mariadb one [08:17:07] elukey@an-launcher1002:~$ jar -tvf /usr/lib/sqoop/lib/mysql-connector-java.jar | grep com/mysql/jdbc/Driver 692 Wed Jun 03 17:30:08 UTC 2020 com/mysql/jdbc/Driver.class [08:17:44] and this makes sense, otherwise we'd get a class not found error [08:17:50] of course [08:18:13] so maybe this version of the mysql driver is more brittle than the mariadb one for some use cases? [08:18:16] like the sqoop ones [08:18:18] I think it must come from a difference of default settings when specifying the driver class versus not specifying it [08:18:32] in sqoop [08:18:41] in sqoop or in Java (can't say) [08:18:48] makes a lot of sense [08:19:15] for instance: when you let Java decide your MysqlDriver based on available classes in class path, some settings are applied by default if driver X is chosen [08:19:34] elukey: this is not sure, but could be [08:19:58] elukey: in any case, I confirm I get errors when specifying driver, and no error when not specifying it [08:20:18] joal: ok then let's kick off the jobs like this [08:20:20] elukey: can you confirm we have the correct setup (mysql-driver link) everywhere on the cluster? [08:20:40] ack elukey - we need two patches, one for refinery and one for puppet [08:21:44] puppet?
[08:21:48] I thought only refinery [08:22:20] to answer your question above - we do have libmariadb-java on the cluster [08:22:41] but the symlinks look ok [08:22:49] elukey: I can't recall if we specify the driver option explicitly in puppet - checking [08:23:13] joal: IIRC no, only in refinery [08:23:57] joal: another test that we could do is to apt-get remove elukey@an-worker1080:~$ jar -tvf /usr/lib/sqoop/lib/mysql-connector-java.jar | grep com/mysql/jdbc/Driver 692 Wed Jun 03 17:30:08 UTC 2020 com/mysql/jdbc/Driver.class [08:24:00] ufff sorry [08:24:06] I meant apt-get remove libmariadb-java [08:24:17] from clients and workers [08:26:32] joal: if you can test one thing - can you run sqoop with the org.mariadb.jdbc.Driver? [08:26:50] sure elukey [08:27:01] I want to make sure that removing the --driver doesn't make sqoop use mariadb for some reason [08:27:13] testing now [08:27:17] it shouldn't work in theory [08:29:01] elukey: Could not load db driver class: org.mariadb.jdbc.Driver [08:29:11] super thanks for double checking [08:29:14] np [08:29:42] joal: one last thing - lemme purge libmariadb-java, it is not in puppet anymore [08:29:46] and then we re-test [08:29:50] ack [08:29:51] and make sure without --driver it works [08:33:37] !log clean up libmariadb-java from hadoop workers and clients [08:33:38] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [08:33:41] joal: ready :) [08:33:44] ack [08:33:48] testing [08:38:45] (03PS2) 10Joal: Update sqoop (driver, logging, boundary-query) [analytics/refinery] - 10https://gerrit.wikimedia.org/r/683485 [08:39:02] elukey: I took advantage of having an existing small patch on sqoop [08:39:34] (03PS3) 10Joal: Update sqoop (driver, logging, boundary-query) [analytics/refinery] - 10https://gerrit.wikimedia.org/r/683485 (https://phabricator.wikimedia.org/T281688) [08:39:41] elukey: --^ please [08:40:01] elukey: job with driver parameter still fails [08:44:29] joal: okok perfect [08:45:25] 
(03CR) 10Elukey: [C: 03+1] Update sqoop (driver, logging, boundary-query) [analytics/refinery] - 10https://gerrit.wikimedia.org/r/683485 (https://phabricator.wikimedia.org/T281688) (owner: 10Joal) [08:50:50] (03CR) 10Joal: [V: 03+2 C: 03+2] "Merging for hotfix deploy" [analytics/refinery] - 10https://gerrit.wikimedia.org/r/683485 (https://phabricator.wikimedia.org/T281688) (owner: 10Joal) [08:53:03] !log Deploy refinery for sqoop hotfix [08:53:05] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [08:54:36] elukey: I confirm puppet doesn't need to be changed (no specification of driver) [08:55:03] elukey: we can restart the sqoop timers after hotfix deploy [08:56:08] yep [08:58:44] joal: going to get a coffee but feel free to restart the timers anytime, +1 from me [08:58:51] ack elukey [09:15:38] TIL https://rapids.ai/ [09:17:23] all CUDA based are meh, but it uses Apache Arrow etc.. [09:17:26] looks interesting [09:17:41] very interesting elukey - I've been following for some time [09:18:39] I keep disliking NVIDIA's approach to open source [09:28:40] (03PS1) 10Awight: Add cawiki to the databases we check for preferences [analytics/reportupdater-queries] - 10https://gerrit.wikimedia.org/r/684303 (https://phabricator.wikimedia.org/T271894) [09:38:42] !log Drop already sqooped data to restart jobs [09:38:44] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [09:43:16] joal: I am watching videos of https://www.applyconf.com/agenda/, really interesting [09:43:26] a lot of topics and big names [09:52:39] interesting! thanks for the link elukey [09:53:20] elukey: while reviewing data to drop, there is only a very small number of tables that have failed for private sqoop (3 to be precise) - Shall I rerun them manually instead of re-sqooping everything? [09:53:47] joal: sure makes sense! 
[09:53:52] elukey: https://www.applyconf.com/agenda/data-observability-the-next-frontier-of-data-engineering/ - We were exactly on that topic when discussing with the team the other day [09:53:58] ack - thanks [09:54:15] it was organized by Tecton [09:54:20] elukey: I'm gonna reset the timer for private [09:54:32] joal: +1 [09:56:23] !log Reset refinery-sqoop-mediawiki-private timer [09:56:24] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [09:57:21] !log restart refinery-sqoop-mediawiki-private timer after patch [09:57:22] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [10:00:22] joal: https://www.applyconf.com/agenda/third-generation-production-ml-architectures-lessons-from-history-experiences-with-ray/ is also very nice [10:04:04] elukey: I don't understand :( I have done "sudo systemctl start refinery-sqoop-whole-mediawiki" but nothing has happened [10:04:14] elukey: I have tried it with .service at the end - same [10:04:47] elukey: and also, same for reset-failed :( [10:05:08] joal: ah snap check status [10:05:10] syntax error [10:05:20] two ,, [10:05:24] :( [10:06:03] Oh my! [10:06:03] I suggest to sudo live edit and restart, so we can check if other errors are there [10:06:05] I'm sorry [10:06:13] sure [10:07:00] actually I can't do that elukey [10:07:55] RECOVERY - Check unit status of refinery-sqoop-mediawiki-private on an-launcher1002 is OK: OK: Status of the systemd unit refinery-sqoop-mediawiki-private https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers [10:08:17] ok that has worked at least --^ [10:08:33] elukey: may I let you sudo edit? [10:11:46] ah sure! 
[10:12:03] (03PS1) 10Joal: Fix sqoop bug introduced in previous patch [analytics/refinery] - 10https://gerrit.wikimedia.org/r/684314 [10:12:22] joal: done [10:12:29] elukey: here is the patch --^ [10:12:33] elukey: there were 2 spots [10:12:50] joal: fixed both [10:12:55] you can retry [10:13:16] (03CR) 10Elukey: [C: 03+1] "Completely missed them :(" [analytics/refinery] - 10https://gerrit.wikimedia.org/r/684314 (owner: 10Joal) [10:14:30] elukey: job started, all good so far - thanks for the hotfix [10:14:42] elukey: let's wait for possible new errors before merging [10:16:36] +1 [10:23:19] RECOVERY - Check unit status of refinery-sqoop-whole-mediawiki on an-launcher1002 is OK: OK: Status of the systemd unit refinery-sqoop-whole-mediawiki https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers [10:24:38] 10Analytics, 10Analytics-Kanban: Fix sqoop script to use timestamp limits in `--boundary-query` queries - https://phabricator.wikimedia.org/T281668 (10JAllemandou) [10:36:53] * elukey lunch! [10:43:33] !log Add _SUCCESS flag to /wmf/data/raw/mediawiki_private/tables/cu_changes/month=2021-04 after having manually sqooped missing tables [10:43:35] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [12:28:07] hi team [12:28:09] :] [12:34:06] (03CR) 10Svantje Lilienthal: [C: 03+1] Add cawiki to the databases we check for preferences [analytics/reportupdater-queries] - 10https://gerrit.wikimedia.org/r/684303 (https://phabricator.wikimedia.org/T271894) (owner: 10Awight) [12:35:34] 10Analytics, 10Analytics-Kanban: Data drifts between superset_production on an-coord1001 and db1108 - https://phabricator.wikimedia.org/T279440 (10Ottomata) Instead of doing this work to recreate the replicas with a different binlog format now, could we wait for the new db hardware, set up multi instance Maria... 
[12:45:32] 10Analytics, 10Analytics-Kanban: Data drifts between superset_production on an-coord1001 and db1108 - https://phabricator.wikimedia.org/T279440 (10elukey) I would do it anyway since these are the dbs that we back up periodically, and it may take a while (namely months) to get everything set up and running and... [12:51:07] (03CR) 10Ottomata: "Did I already ask this? Is this a link-recommendation-ui-interaction?" [schemas/event/secondary] - 10https://gerrit.wikimedia.org/r/681052 (https://phabricator.wikimedia.org/T278177) (owner: 10Kosta Harlan) [12:51:44] hello mforns ! [12:54:30] hola hola people [12:57:13] hello! [12:57:21] (03PS1) 10GoranSMilovanovic: T239205 [analytics/wmde/WD/WikidataAnalytics] - 10https://gerrit.wikimedia.org/r/684387 [12:57:37] (03CR) 10GoranSMilovanovic: [V: 03+2 C: 03+2] T239205 [analytics/wmde/WD/WikidataAnalytics] - 10https://gerrit.wikimedia.org/r/684387 (owner: 10GoranSMilovanovic) [12:58:44] hey ottomata :] I'm still troubleshooting the deletion script, cannot manage to make paths with datacenter= be purged... [13:00:46] huh interesting! [13:01:25] the regex matches manually with any path, but when running the script no such paths are added for deletion [13:01:44] mforns: what is your regex currently? [13:02:00] oh! wait, if I remove the ()? around datacenter=[^/]+, it works... [13:02:22] ottomata: it was this one: '[^/]+/(datacenter=[^/]+/)?year=(?P<year>[0-9]+)(/month=(?P<month>[0-9]+)(/day=(?P<day>[0-9]+)(/hour=(?P<hour>[0-9]+))?)?)?' [13:03:09] ok almost the same as the first one i got from you except for no slashes in datacenter=[^/]+ [13:03:15] cool yeah my regex tester says that works great [13:03:34] even with the parents [13:03:36] parens [13:03:42] you are saying when used with the script it doesn't work? [13:04:34] yes, when used in the script, paths with datacenter= are not added for purging [13:04:49] unless you remove the optionality of the datacenter= part [13:04:59] then they match! 
O.o [13:07:24] OK, this does not work: [^/]+/(datacenter=[^/]+/)?year=(?P<year>[0-9]+)... [13:07:35] but this does: [^/]+(/datacenter=[^/]+)?/year=(?P<year>[0-9]+) [13:07:58] executing the whole thing now [13:08:23] right put the slash outside of the paren like in the other date partitions? [13:09:00] and in that, year is the only non optional partition? [13:09:25] yes [13:09:33] k [13:10:16] (03CR) 10Kosta Harlan: "> Patch Set 7:" [schemas/event/secondary] - 10https://gerrit.wikimedia.org/r/681052 (https://phabricator.wikimedia.org/T278177) (owner: 10Kosta Harlan) [13:15:32] (03CR) 10Ottomata: "> it seems like you could make the case that each one of those dialogs should have its own" [schemas/event/secondary] - 10https://gerrit.wikimedia.org/r/681052 (https://phabricator.wikimedia.org/T278177) (owner: 10Kosta Harlan) [13:37:27] mforns: wanna hangout real quick and do the drop table renaming for the renames we did? e.g. quicksurveysresponses_t280813 [13:37:28] ? [13:42:18] ottomata: sure [13:42:20] batcave? [13:42:32] ya [13:59:40] !log dropped all obsolete (upper cased location) event_sanitized.*_T280813 tables created for T280813 [13:59:43] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [13:59:43] T280813: Rename event_sanitized partition directories to lowercase - https://phabricator.wikimedia.org/T280813 [14:19:40] 10Analytics, 10Better Use Of Data, 10Event-Platform, 10Gerrit-Privilege-Requests, and 2 others: Create or identify an appropriate Gerrit group for +2 rights on schemas/event/secondary - https://phabricator.wikimedia.org/T279089 (10Mholloway) Maybe the [[ https://gerrit.wikimedia.org/r/admin/groups/2021f25e... 
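The working regex variant above (slash outside the optional datacenter group, mirroring the other optional date partitions) can be checked directly with Python's re module. This is a standalone sketch with made-up sample paths, not the actual refinery deletion script:

```python
import re

# The variant that ended up working in the deletion script: the slash
# sits outside the optional datacenter group, and year is the only
# non-optional partition. Sample paths below are illustrative.
PATTERN = re.compile(
    r'[^/]+(/datacenter=[^/]+)?/year=(?P<year>[0-9]+)'
    r'(/month=(?P<month>[0-9]+)(/day=(?P<day>[0-9]+)'
    r'(/hour=(?P<hour>[0-9]+))?)?)?'
)

for path in [
    'navigationtiming/year=2021/month=04/day=30/hour=23',
    'mediawiki_api_request/datacenter=eqiad/year=2021/month=04',
]:
    m = PATTERN.search(path)
    print(path, '->', m.group('year'), m.group('month'))
```

The earlier failing variant differed only in where the optional group's slash sat (`(datacenter=[^/]+/)?` vs `(/datacenter=[^/]+)?`); both match these paths in isolation, which is why the failure only showed up inside the script.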
[14:23:43] !log stopping all venv based jupyter singleuser servers - T262847 [14:23:45] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [14:23:46] T262847: Decomission SWAP - https://phabricator.wikimedia.org/T262847 [14:40:59] ottomata: I checked and all seems OK, I got the checksum too, do you want me to create a patch for the test cluster first? Or for both? [14:41:22] test first please! :) [14:44:19] 10Analytics-Clusters, 10Analytics-Kanban, 10Technical-blog-posts: Story idea for Blog: Migration of the Analytics Hadoop infrastructure to Apache Bigtop - https://phabricator.wikimedia.org/T277133 (10srodlund) No problem! I'll keep an eye out for it! [14:45:41] 10Analytics, 10Analytics-Kanban, 10Patch-For-Review: Decomission SWAP - https://phabricator.wikimedia.org/T262847 (10Ottomata) Did the following on each stat box: ` # find all jupyterhub singleuser processes using a venv and stop them. for u in $(ps aux | grep -E 'jupyterhub.*singleuser' | grep venv | awk... [14:46:15] 10Analytics, 10Analytics-Kanban, 10Patch-For-Review: Decomission SWAP - https://phabricator.wikimedia.org/T262847 (10Ottomata) [14:56:28] 10Analytics, 10Analytics-Kanban, 10Patch-For-Review: Newpyter - SWAP Juypter Rewrite - https://phabricator.wikimedia.org/T224658 (10Ottomata) [14:56:30] ottomata: https://gerrit.wikimedia.org/r/c/operations/puppet/+/684427 [14:56:39] 10Analytics, 10Analytics-Kanban, 10Patch-For-Review: Newpyter - SWAP Juypter Rewrite - https://phabricator.wikimedia.org/T224658 (10Ottomata) There are still some child tasks of this Newpyter parent task, but as of today I think we can call the 'Newpyter' project done. [14:56:48] great mforns merging [15:01:00] 10Analytics, 10WMDE-Analytics-Engineering, 10WMDE-New-Editors-Banner-Campaigns: Drop old WMDEBanner events from Hive - https://phabricator.wikimedia.org/T281300 (10mforns) Ping? :-) If you need to keep this data, I can help in determining what can be kept indefinitely. Thanks! 
[15:02:29] ottomata: probably we'll have to wait until tomorrow to test that the purging script in test is working properly, without -skipTrash [15:02:33] ok [15:02:44] i can run it manually [15:02:45] ? [15:08:24] 10Analytics, 10Analytics-Kanban: Fix sqoop script to use timestamp limits in `--boundary-query` queries - https://phabricator.wikimedia.org/T281668 (10JAllemandou) a:03JAllemandou [15:15:39] ah right, everything is already deleted in test cluster today, ok we check tomorrow (or wed) [15:31:53] 10Analytics-Radar, 10AbuseFilter, 10BetaFeatures, 10BlueSpice, and 45 others: Prepare User group methods for hard deprecation - https://phabricator.wikimedia.org/T275148 (10fdans) [15:32:38] 10Analytics, 10Analytics-Kanban, 10Event-Platform: Deploy schema repos to analytics cluster and use local uris for analytics jobs - https://phabricator.wikimedia.org/T280017 (10fdans) [15:34:52] 10Analytics: Fix default ownership and permissions for Hive managed databases in /user/hive/warehouse - https://phabricator.wikimedia.org/T280175 (10fdans) p:05Triage→03High [15:37:32] 10Analytics, 10Analytics-Kanban: Delete UpperCased eventlogging legacy directories in /wmf/data/event 90 days from 2021-04-15 (after 2021-07-14) - https://phabricator.wikimedia.org/T280293 (10fdans) p:05Triage→03High [15:41:06] 10Analytics, 10Product-Analytics: Aggregate table not working after superset upgrade - https://phabricator.wikimedia.org/T280784 (10fdans) p:05Triage→03High a:03razzi [15:47:43] 10Analytics-Radar, 10AbuseFilter, 10BetaFeatures, 10BlueSpice, and 45 others: Prepare User group methods for hard deprecation - https://phabricator.wikimedia.org/T275148 (10Legoktm) @Vlad.shapik with the projects you added, 277 people are now receiving notifications for every update on this task. I think t... 
[15:52:41] 10Analytics, 10Analytics-Kanban: Stop Refining mediawiki_job events in Hive - https://phabricator.wikimedia.org/T281605 (10fdans) p:05Triage→03High [15:55:06] 10Analytics, 10Analytics-Wikistats: Wikistats shows 0 views for April when data isn't available yet - https://phabricator.wikimedia.org/T281617 (10fdans) p:05Triage→03High [15:56:46] 10Analytics, 10Analytics-Kanban: Fix sqoop script to use timestamp limits in `--boundary-query` queries - https://phabricator.wikimedia.org/T281668 (10fdans) 05Open→03Resolved [16:13:31] 10Analytics, 10Better Use Of Data, 10Event-Platform, 10Product-Data-Infrastructure, and 2 others: VirtualPageView should use EventLogging api to send virtual page view events - https://phabricator.wikimedia.org/T279382 (10Edtadros) === Test Result - Beta **Status:** ✅ PASS **Environment:** beta/xyzwiki **... [16:13:50] 10Analytics-Clusters, 10Analytics-Kanban, 10DBA, 10Data-Services, and 2 others: Convert labsdb1012 from multi-source to multi-instance - https://phabricator.wikimedia.org/T269211 (10Bstorm) >>! In T269211#7010741, @razzi wrote: > @Marostegui do you have any advice on how to configure clouddb1021 memory / m... [16:14:26] 10Analytics, 10Better Use Of Data, 10Event-Platform, 10Product-Data-Infrastructure, and 2 others: VirtualPageView should use EventLogging api to send virtual page view events - https://phabricator.wikimedia.org/T279382 (10Edtadros) [16:21:27] 10Analytics, 10Product-Analytics: Aggregate table not working after superset upgrade - https://phabricator.wikimedia.org/T280784 (10Esanders) Thanks - I had to recreate most of the filters and groups from scratch as well. 
[16:25:09] 10Analytics-Radar, 10AbuseFilter, 10BetaFeatures, 10BlueSpice, and 45 others: Prepare User group methods for hard deprecation - https://phabricator.wikimedia.org/T275148 (10AndyRussG) [16:25:43] 10Analytics-Radar, 10AbuseFilter, 10BetaFeatures, 10BlueSpice, and 46 others: Prepare User group methods for hard deprecation - https://phabricator.wikimedia.org/T275148 (10AndyRussG) [16:31:24] (03PS1) 10Mholloway: Review access change [schemas/event/secondary] (refs/meta/config) - 10https://gerrit.wikimedia.org/r/684452 [16:31:45] (03Abandoned) 10Mholloway: Review access change [schemas/event/secondary] (refs/meta/config) - 10https://gerrit.wikimedia.org/r/684452 (owner: 10Mholloway) [16:39:25] (03Restored) 10Mholloway: Review access change [schemas/event/secondary] (refs/meta/config) - 10https://gerrit.wikimedia.org/r/684452 (owner: 10Mholloway) [16:39:38] 10Analytics, 10WMDE-Analytics-Engineering, 10WMDE-New-Editors-Banner-Campaigns: Drop old WMDEBanner events from Hive - https://phabricator.wikimedia.org/T281300 (10Verena) Thanks for the ping. We have to check if we need the data and get back asap. 
[16:40:56] (03PS2) 10Mholloway: Review access change [schemas/event/secondary] (refs/meta/config) - 10https://gerrit.wikimedia.org/r/684452 [16:41:59] (03PS3) 10Mholloway: Review access change [schemas/event/secondary] (refs/meta/config) - 10https://gerrit.wikimedia.org/r/684452 (https://phabricator.wikimedia.org/T279089) [16:47:12] (03PS2) 10Ottomata: Remove requiredness of fields from mediawiki common schema fragments [schemas/event/primary] - 10https://gerrit.wikimedia.org/r/683980 (https://phabricator.wikimedia.org/T275674) [16:48:13] (03CR) 10Ottomata: [C: 03+1] Review access change [schemas/event/secondary] (refs/meta/config) - 10https://gerrit.wikimedia.org/r/684452 (https://phabricator.wikimedia.org/T279089) (owner: 10Mholloway) [16:49:00] 10Analytics, 10Better Use Of Data, 10Event-Platform, 10Gerrit-Privilege-Requests, and 3 others: Create or identify an appropriate Gerrit group for +2 rights on schemas/event/secondary - https://phabricator.wikimedia.org/T279089 (10Mholloway) The [[ https://gerrit.wikimedia.org/r/admin/groups/2021f25e751518... [16:53:12] (03CR) 10Mholloway: [C: 03+2] Review access change [schemas/event/secondary] (refs/meta/config) - 10https://gerrit.wikimedia.org/r/684452 (https://phabricator.wikimedia.org/T279089) (owner: 10Mholloway) [16:53:28] (03CR) 10Mholloway: [V: 03+2 C: 03+2] Review access change [schemas/event/secondary] (refs/meta/config) - 10https://gerrit.wikimedia.org/r/684452 (https://phabricator.wikimedia.org/T279089) (owner: 10Mholloway) [16:58:01] (03PS1) 10Mholloway: [DNM] Test Gerrit voting rights [schemas/event/secondary] - 10https://gerrit.wikimedia.org/r/684468 [16:59:31] 10Analytics, 10Better Use Of Data, 10Event-Platform, 10Gerrit-Privilege-Requests, and 3 others: Create or identify an appropriate Gerrit group for +2 rights on schemas/event/secondary - https://phabricator.wikimedia.org/T279089 (10Ottomata) Thanks Michael! 
:) [17:00:00] (03CR) 10Mholloway: "Sharvani, do you have the ability to vote Code Review +2 on this change?" [schemas/event/secondary] - 10https://gerrit.wikimedia.org/r/684468 (owner: 10Mholloway) [17:04:34] 10Analytics, 10Analytics-Kanban, 10Patch-For-Review: Newpyter - SWAP Juypter Rewrite - https://phabricator.wikimedia.org/T224658 (10yuvipanda) >>! In T224658#7053302, @Ottomata wrote: > There are still some child tasks of this Newpyter parent task, but as of today I think we can call the 'Newpyter' project d... [17:05:25] 10Analytics, 10Analytics-SWAP: Notebook machine to double as RStudio Server? - https://phabricator.wikimedia.org/T190769 (10yuvipanda) Pretty well supported with github.com/jupyterhub/jupyter-rsession-proxy/, although you need to be running inside a container (or something with network namespace isolation) for... [17:12:27] I heard SWAP is getting decommissioned [17:12:33] so came to say good bye :) [17:17:44] 10Analytics, 10Better Use Of Data, 10Event-Platform, 10Gerrit-Privilege-Requests, and 2 others: Create or identify an appropriate Gerrit group for +2 rights on schemas/event/secondary - https://phabricator.wikimedia.org/T279089 (10Mholloway) 05Open→03Resolved a:03Mholloway Still waiting on confirmati... [17:20:14] 10Analytics, 10Event-Platform, 10Inuka-Team: InukaPageView Event Platform Migration - https://phabricator.wikimedia.org/T267344 (10SBisson) @Ottomata our migration is done and released to one of the stores. Release to the other store is pending. Waiting to hear back from them. 
[17:23:04] yuvipanda: it just evolved, your efforts have been appreciated and used by a ton of people <3 [17:23:32] (that was SWAP on stat100x though :) [17:23:46] (I don't recall how it started :D) [17:25:57] mostly used it as an excuse to write github.com/jupyterhub/systemdspawner/ :D [17:26:19] and https://github.com/jupyterhub/ldapauthenticator [17:26:31] elukey: madhu also did a lot of that [17:30:08] yuvipanda: :) the underlying arch is pretty much the same [17:30:31] it just uses conda instead of virtualenvs, and the user envs are disposable [17:30:52] still using systemdspawner and ldapauthenticator ;) [17:32:26] yeah, totally re: conda. [17:32:32] what does 'disposable' mean? [17:33:03] ottomata: while you talk about that ... I managed to pip install toree, but I fail to install it within [17:33:28] jupyter: jupyter toree install --spark_home=/usr/lib/spark2/ fails with Read-only file system: '/usr/local/share/jupyter' [17:33:52] oh i think you need a --user flag [17:33:53] or the like [17:34:54] arf - my ignorance of python and notebooks is as big as expected [17:35:02] indeed ottomata - many thanks [17:35:13] yuvipanda: https://github.com/wikimedia/puppet/blob/production/modules/jupyterhub/files/config/spawners.py [17:35:41] users can create new conda envs at will (stacked on top of a big read only anaconda env) [17:35:57] and they can select which conda env will be used to launch their notebook server [17:36:20] https://wikitech.wikimedia.org/wiki/Analytics/Systems/Anaconda [17:39:36] (03CR) 10Ottomata: [C: 03+2] Remove requiredness of fields from mediawiki common schema fragments [schemas/event/primary] - 10https://gerrit.wikimedia.org/r/683980 (https://phabricator.wikimedia.org/T275674) (owner: 10Ottomata) [17:39:49] ottomata: very interesting! is this 'stacking' overlayfs? or something native to conda? [17:40:05] (03CR) 10Ottomata: [C: 03+2] "Merging, these are new schema versions. I don't plan on updating any concrete schemas. 
New concrete schema versions can use version 2.0." [schemas/event/primary] - 10https://gerrit.wikimedia.org/r/683980 (https://phabricator.wikimedia.org/T275674) (owner: 10Ottomata) [17:40:13] yuvipanda: partly [17:40:25] conda stacking doesn't do much more than set a few env vars and update PATH [17:40:30] it mostly stacks the bin directories [17:40:39] (03Merged) 10jenkins-bot: Remove requiredness of fields from mediawiki common schema fragments [schemas/event/primary] - 10https://gerrit.wikimedia.org/r/683980 (https://phabricator.wikimedia.org/T275674) (owner: 10Ottomata) [17:40:40] oh, not overlayfs [17:40:41] dunno what that is [17:40:52] interesting! and that works with PYTHONPATH and libraries too? [17:40:53] i added an anaconda.pth file into the user's conda env [17:41:07] so that the base anaconda env is always in the PYTHONPATH [17:41:36] https://gerrit.wikimedia.org/r/plugins/gitiles/operations/debs/anaconda-wmf/+/refs/heads/debian/extra/bin/conda-create-stacked [17:41:47] aaah [17:41:48] right, ok [17:42:40] so now, technically, i guess it is only PATH and PYTHONPATH that are 'stacked' [17:42:56] other conda supported packages, R or whatever, may not be really 'stacked'. [17:43:01] but we're mostly focusing on python support [17:43:15] right [17:43:20] https://conda.io/projects/conda/en/latest/user-guide/tasks/manage-environments.html#nested-activation [17:44:23] woah TIL [17:44:51] yeah this big base anaconda + disposable user conda envs solves a couple of pain points of swap [17:44:58] 1. is we can install common deps across the whole hadoop cluster [17:45:11] so distributed work doesn't always need to solve how to distribute deps [17:45:22] so there's some tight smart integration with spark and the stacked conda envs [17:46:02] and 2. with swap, the user venvs were created once and needed to always work. 
users could dispose of their venvs, but they'd have to do it manually, and we had no control over the base dependencies [17:46:38] i think we may run into some difficulties if/when we upgrade the base anaconda-wmf envs, but at least the users will have a supported action: use the UI to create a new conda env [17:47:04] right, that makes great sense [17:47:09] and this is a pattern that many people want!!! [17:47:38] yeah, i kinda wish i had been able to support running multiple single-user servers from different conda envs at the same time [17:47:49] that would have required some systemdspawner hackery i think though [17:47:59] ottomata: what was preventing that? [17:48:13] not sure i totally remember, maybe i didn't try hard enough? [17:48:18] named servers should let you support that pretty easily i'd have thought [17:48:30] but i thought it might be confusing, i'd have to make the profilespawner ui much smarter [17:48:51] i think named server operates one level above profilespawner [17:48:58] and each 'server' then just triggers profilespawner [17:49:06] basically by default, you have a 'default' server [17:49:10] and named servers are turned off [17:49:18] so the flow you're seeing is going through a 'default' server [17:49:23] and you can enable multiple ones [17:49:31] https://jupyterhub.readthedocs.io/en/stable/reference/config-user-env.html#named-servers [17:49:39] https://github.com/wikimedia/puppet/blob/production/modules/jupyterhub/templates/config/jupyterhub_config.py.erb#L237-L240 [17:49:50] i did try it... [17:50:04] i can't remember what didn't work about it [17:50:18] but yeah it does seem like that should just work [17:50:34] yeah [17:50:44] if you run into bugs plz2report [17:50:53] (I work on core JupyterHub now...) [17:50:56] ottomata: toree/scala/spark all good :) [17:50:59] joal: nice! [17:51:02] ottomata: Will add a line to the doc [17:51:07] * elukey afk! [17:51:08] yuvipanda: what's up with all this jupyter server stuff?
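For reference, the named-servers behaviour yuvipanda describes is toggled by a couple of JupyterHub config options; a sketch of the relevant `jupyterhub_config.py` fragment (the limit value is just an example, not what production uses):

```python
# jupyterhub_config.py fragment: enable named servers so each user can
# run several single-user servers at once (e.g. one per conda env).
c.JupyterHub.allow_named_servers = True

# Optional cap on how many named servers a user may start.
c.JupyterHub.named_server_limit_per_user = 3
```

With named servers off (the default), every user goes through the single implicit 'default' server, which is the flow described above.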
[17:51:16] Bye elukey [17:51:19] i just recently saw that and was like 'aww man i just got it all working the old way!' :) [17:51:25] ottomata: it's confusing. [17:51:28] I don't fully know tbh [17:51:50] ottomata: IIRC you can just swap out a package and things should continue working as they have before [17:56:39] aye [17:58:39] ottomata: would <3 a post on discourse.jupyter.org with the stacking setup you did [17:58:51] would also <3 to adopt it and ship it by default with tljh.jupyter.org/ [17:59:04] TLJH was somewhat inspired by the original SWAP setup [18:02:56] huh i wonder... i'm not sure how easy it would be to make the stacking bits generic [18:03:24] i mean i guess anyone could use the anaconda-wmf deb [18:03:29] which has those scripts in it [18:03:44] but ya hm i could write a blog post [18:03:50] i'm a little worried it's a bit hacky [18:03:52] hm [18:05:27] 10Analytics, 10Analytics-Kanban, 10Event-Platform, 10Product-Data-Infrastructure, 10Patch-For-Review: MEP: Schema fragments shouldn't require fields - https://phabricator.wikimedia.org/T275674 (10Ottomata) Ok, I've removed the requiredness in all the mediawiki fragments. I'm not going to touch common, a... [18:07:12] 10Analytics, 10Analytics-Kanban: Stop Refining mediawiki_job events in Hive - https://phabricator.wikimedia.org/T281605 (10Ottomata) If there are no objections, we will stop refining these and remove them from the `event` database during the week of May 10. [18:08:02] yuvipanda: also, i feel like a more ideal setup would be to just use docker [18:08:16] for reasons we can't (yet?) [18:08:20] but most folks probably can [18:10:35] ottomata: yeah, I agree. but many people can't.
[18:10:54] ottomata: also there's lots of people who *just* want to use conda - particularly in teaching situations - and not learn about 'images' and 'containers' [18:11:02] TLJH explicitly made a choice to avoid docker [18:14:39] ottomata: docs updated [18:17:57] 10Analytics, 10Analytics-Kanban: Refine + EventLoggingSchemaLoader should use api.svc instead of meta.wikimedia.org directly. - https://phabricator.wikimedia.org/T247510 (10Ottomata) [18:18:53] 10Analytics-Kanban: Add Presto to Analytics' stack - https://phabricator.wikimedia.org/T243309 (10Ottomata) [18:18:55] 10Analytics: debianize presto python package so it is available by default - https://phabricator.wikimedia.org/T245194 (10Ottomata) 05Open→03Declined This is now available in wmfdata via anaconda-wmf [18:19:19] joal: link? [18:19:32] ottomata: https://wikitech.wikimedia.org/wiki/Analytics/Systems/Jupyter#Scala-Spark_or_Spark-SQL_using_Toree [18:20:09] thanks joal! i wonder if we should put that in https://wikitech.wikimedia.org/wiki/Analytics/Systems/Jupyter/Tips [18:20:26] i'm trying to keep the main page more minimal; the swap docs main page got really big [18:21:03] ottomata: https://wikitech.wikimedia.org/wiki/Analytics/Systems/Jupyter/Querying [18:21:06] ? [18:21:25] with the overall section on querying? [18:22:06] joal hm why a new page? maybe just in Tips with that? [18:22:16] we can rename Tips to something better?
[18:22:59] ottomata: to keep querying-solutions together - but I won't argue over that :) [18:24:03] hm 'querying' just doesn't sound like where 'how to use scala notebook' belongs [18:24:26] i'd maybe move the pyspark stuff off the main page too, but i think that is one of the most common use cases for jupyter [18:24:53] Tips has a section for building a custom pyspark kernel notebook, i think building a custom scala spark notebook should go there too [18:25:57] joal: it'll be so cool if/when we get spark 3 + almond [18:26:14] fabian and I tried pretty hard to get it working with spark 2.4.4 but i think we need scala > 2.11 [18:26:33] if we get that i think i'll start using notebooks more [18:37:01] 10Analytics: Make RefineFailuresChecker checker jobs use the same parameters as Refine jobs - https://phabricator.wikimedia.org/T274376 (10Ottomata) 05Open→03Resolved a:03Ottomata Done in recent refactors. [18:44:31] I hear you ottomata on almond - Spark 3 is not far :) [18:44:39] Gone for tonight :) [18:44:53] laters <3 [19:00:41] anyone else having trouble building refinery-source? [19:00:46] (on a stat box?)
[19:04:37] Could not resolve dependencies for project org.wikimedia.analytics.refinery.job:refinery-job:jar:0.1.10-SNAPSHOT: Failed to collect dependencies at com.criteo:rsvd:jar:1.0 -> org.apache.spark:spark-core_2.11:jar:2.3.1 -> net.java.dev.jets3t:jets3t:jar:0.9.4 -> commons-codec:commons-codec:jar:1.15-SNAPSHOT: Failed to read artifact descriptor for commons-codec:commons-codec:jar:1.15-SNAPSHOT: Could not [19:04:37] transfer artifact commons-codec:commons-codec:pom:1.15-SNAPSHOT from/to wmf-mirror-spark [19:36:15] Funny :) https://twitter.com/indygupta/status/1388588905927221249 [20:02:21] 10Analytics, 10Data-release, 10Privacy Engineering, 10Research, 10Privacy: Apache Beam go prototype code for DP evaluation - https://phabricator.wikimedia.org/T280385 (10Isaac) > When using the Laplace distribution, the noise doesn't consume the δ, however some of the δ is consumed by something we call p... [20:02:27] (03CR) 10Mforns: [C: 03+1] Update cassandra jobs for double loading (031 comment) [analytics/refinery] - 10https://gerrit.wikimedia.org/r/681678 (https://phabricator.wikimedia.org/T280649) (owner: 10Joal) [20:34:37] 10Analytics, 10Analytics-EventLogging, 10Event-Platform, 10Product-Data-Infrastructure, and 3 others: Replace usages of Linker::link() and Linker::linkKnown() in extension EventLogging - https://phabricator.wikimedia.org/T279328 (10Mholloway) 05Open→03Resolved [20:59:41] 10Analytics, 10Data-release, 10Privacy Engineering, 10Research, 10Privacy: Apache Beam go prototype code for DP evaluation - https://phabricator.wikimedia.org/T280385 (10Nuria) >what it would take to migrate some of this to the cluster where the Apache Spark runner could be tested We probably do not want... [21:29:59] (03CR) 10Mholloway: [C: 03+1] "> Patch Set 8:" [schemas/event/secondary] - 10https://gerrit.wikimedia.org/r/680798 (https://phabricator.wikimedia.org/T254891) (owner: 10Neil P. 
Quinn-WMF) [21:49:02] 10Analytics, 10Data-release, 10Privacy Engineering, 10Research, 10Privacy: Apache Beam go prototype code for DP evaluation - https://phabricator.wikimedia.org/T280385 (10Htriedman) Hi all — just finished updating the demo to get it into a good place. You can see the finished product (UI, user- and pagevi... [22:02:10] PROBLEM - Hadoop NodeManager on an-worker1131 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [22:02:58] PROBLEM - Hadoop NodeManager on an-worker1123 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [22:25:00] RECOVERY - Hadoop NodeManager on an-worker1131 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [22:25:58] RECOVERY - Hadoop NodeManager on an-worker1123 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process