[02:03:56] (CR) Razzi: "Looks like this patch renamed org.wikimedia.analytics.refinery.job.refine.EventLoggingSanitization which is used by a timer: https://gerri" [analytics/refinery/source] - https://gerrit.wikimedia.org/r/670321 (https://phabricator.wikimedia.org/T273789) (owner: Ottomata)
[02:25:11] Analytics-Clusters, Analytics-Kanban: Balance Kafka topic partitions on Kafka Jumbo to take advantage of the new brokers - https://phabricator.wikimedia.org/T255973 (razzi) kafka-kit depends on other packages such as https://tracker.debian.org/pkg/golang-github-zorkian-go-datadog-api, which is not availa...
[02:30:24] (CR) Razzi: "Here's my attempt at fixing https://alerts.wikimedia.org/?q=%40state%3Dactive&q=alertname%3DIcinga%2FCheck%20unit%20status%20of%20refine_s" [analytics/refinery/source] - https://gerrit.wikimedia.org/r/670321 (https://phabricator.wikimedia.org/T273789) (owner: Ottomata)
[02:43:15] o/ razzi hiya
[02:43:22] just saw the sanitization thing
[02:43:32] the jar version shouldn't have changed yet
[02:43:39] we should have to make a puppet patch for that to affect anything
[02:43:50] unless the data_purge puppet isn't specifying a specific refinery version
[02:44:26] aH it isn't!
[02:44:31] that is a mistake
[02:44:32] fixing that.
[02:56:36] Analytics, Product-Analytics: Hive Runtime Error - Query on event.MobileWikiAppDailyStats failing with errors - https://phabricator.wikimedia.org/T277348 (SNowick_WMF) Open→Resolved Thank you for the thorough response, I used the solution 1 query successfully, will try the spark-sql query next ti...
[03:11:34] RECOVERY - Check unit status of refine_sanitize_eventlogging_analytics_immediate on an-launcher1002 is OK: OK: Status of the systemd unit refine_sanitize_eventlogging_analytics_immediate https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[04:21:34] PROBLEM - Check unit status of monitor_refine_sanitize_eventlogging_analytics_immediate on an-launcher1002 is CRITICAL: CRITICAL: Status of the systemd unit monitor_refine_sanitize_eventlogging_analytics_immediate https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[04:29:16] PROBLEM - Check unit status of monitor_refine_sanitize_eventlogging_analytics_delayed on an-launcher1002 is CRITICAL: CRITICAL: Status of the systemd unit monitor_refine_sanitize_eventlogging_analytics_delayed https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[06:20:27] Hello hello :)
[06:21:05] I have run into the following error upon trying to spawn a stacked conda environment in JupyterHub from stat1004: Error: [Errno 30] Read-only file system: '/run/jupyter-goransm-singleuser'
[06:21:17] What could this mean? Please advise. Thank you!
[06:22:56] This happens when I click "Start" to Create and use new stacked conda environment...
[06:29:24] Analytics-Clusters, Analytics-Kanban: Balance Kafka topic partitions on Kafka Jumbo to take advantage of the new brokers - https://phabricator.wikimedia.org/T255973 (elukey) >>! In T255973#6959367, @razzi wrote: > kafka-kit depends on other packages such as https://tracker.debian.org/pkg/golang-github-zo...
[06:36:31] GoranSM: Hi! How did you connect to Jupyterhub? I mean following what wiki page, since things changed a bit (if you are connecting to port :8000 it is the old stack, so it might not work)
[06:43:02] strange I don't see your user logged in either the new or the old jupyterhub on stat1004
[06:43:33] I'll wait for more info from your side :)
[07:29:18] good motivation to think about kafka 2.x https://www.confluent.io/blog/kafka-without-zookeeper-a-sneak-peek/
[07:29:21] :)
[08:20:40] Analytics: Review the usage of dns_canonicalize=false for Kerberos - https://phabricator.wikimedia.org/T278353 (elukey) Interesting: https://docs.oracle.com/javase/8/docs/technotes/guides/security/jgss/tutorials/KerberosReq.html Up to https://bugs.java.com/bugdatabase/view_bug.do?bug_id=8211380 the dns_cano...
[08:21:20] Analytics-Radar, Patch-For-Review, WMDE-TechWish-Sprint-2021-03-31: Reportupdater output can be corrupted by hive logging - https://phabricator.wikimedia.org/T275757 (awight)
[09:18:37] Analytics: Produce a list of wiki projects ranked by number of eligible voters in Board elections - https://phabricator.wikimedia.org/T278815 (Qgil) Gosh, guys, you're fast! Thank you so much. @mforns this query looks good to me, and the list of results is very interesting already. Except that "LIMIT 100" w...
[09:20:23] (CR) Gilles: "@ottomata is there more to do for this new schema to be deployed than just merging this patch?" [schemas/event/secondary] - https://gerrit.wikimedia.org/r/675105 (https://phabricator.wikimedia.org/T277769) (owner: Gilles)
[10:13:10] Analytics: Produce a list of wiki projects ranked by number of eligible voters in Board elections - https://phabricator.wikimedia.org/T278815 (Qgil) @mforns I'm not sure this would be more accurate or technically possible (or whether it would make a big difference in the ranking), but here is an idea for bet...
[10:36:04] Analytics: Review the usage of dns_canonicalize=false for Kerberos - https://phabricator.wikimedia.org/T278353 (elukey) Indeed this is what Java does: https://github.com/openjdk/jdk/blob/master/src/java.security.jgss/share/classes/sun/security/krb5/PrincipalName.java#L424-L441 So my understanding is the fol...
[10:36:12] /o\ --^
[10:38:24] another rabbit hole
[10:52:25] (PS11) Phuedx: Add new analytics/pref_diff schema [schemas/event/secondary] - https://gerrit.wikimedia.org/r/668529 (https://phabricator.wikimedia.org/T261842) (owner: Nray)
[10:53:04] (CR) Phuedx: [C: +2] "Neato!" [schemas/event/secondary] - https://gerrit.wikimedia.org/r/668529 (https://phabricator.wikimedia.org/T261842) (owner: Nray)
[10:55:08] (Merged) jenkins-bot: Add new analytics/pref_diff schema [schemas/event/secondary] - https://gerrit.wikimedia.org/r/668529 (https://phabricator.wikimedia.org/T261842) (owner: Nray)
[11:20:20] * elukey lunch!
[13:15:38] Analytics: Review the usage of dns_canonicalize=false for Kerberos - https://phabricator.wikimedia.org/T278353 (Ottomata) Interesting! > The next test that I want to do is to execute pyspark in yarn mode and try to connect to analytics-test-hive.wikimedia.org. You'll probably need the modified krb5.conf fil...
[13:16:34] (CR) Ottomata: [C: +1] "Nope, merge away!" [schemas/event/secondary] - https://gerrit.wikimedia.org/r/675105 (https://phabricator.wikimedia.org/T277769) (owner: Gilles)
[13:17:15] (CR) Ottomata: [C: +1] "> Patch Set 6:" [schemas/event/secondary] - https://gerrit.wikimedia.org/r/675105 (https://phabricator.wikimedia.org/T277769) (owner: Gilles)
[13:18:21] o/ looking into the sanitize failures
[13:23:22] good morning :)
[13:23:51] joal: qq - should we have a maximum-capacity set (say 80%) for the users queues?
[13:24:07] hi team
[13:24:13] just to prevent the basic use cases like an uncontrolled spark job eating all the ram
[13:24:14] elukey: you have a sensor :)
[13:24:18] :)
[13:24:32] I sense joseph :D
[13:25:09] * joal usually doesn't make sense :S
[13:25:35] elukey: I don't think reducing max capacity is the correct approach
[13:27:00] elukey: By reducing max capacity, we lower resource usage elasticity - The correct approach is to not disallow resource preemption
[13:27:03] IMO
[13:28:32] elukey: does that ... make sense :D
[13:28:34] ?
[13:28:36] joal: sure but preemption is technically enabled also for Fair, and we ended up several times in seeing production starving
[13:28:57] elukey: preemption has been set up very late for fair
[13:29:35] elukey: and, it's been quite some time since we've seen prod starving :)
[13:29:39] joal: I am not getting why all the users would need more than say 80%
[13:29:59] joal: because you followed up with people promptly a lot of times
[13:30:04] elukey: we can set up that limit
[13:30:18] elukey: if I had not, preemption would have done its job
[13:30:45] elukey: me following up is mostly for users to share - prod jobs have not been slow for a very long time (even 1st of month)
[13:30:53] joal: I am not completely sure that it would have not caused problems, but you know best :)
[13:31:00] elukey: it's also true that resources have grown significantly
[13:31:28] elukey: user queue maxing at 80% of the global usage is super ok :) we can set that if you wish
[13:31:52] elukey: it's just that, if users need resources at a moment prod jobs are not running, why not use 100% for users?
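The trade-off discussed above (capping the users queue at 80% versus leaving it fully elastic) can be illustrated with a minimal Python sketch. It assumes the maximum-capacity setting acts as a plain upper bound on the share of the cluster one queue can hold. The function name `users_queue_share` and every percentage below are hypothetical and are not taken from the actual cluster configuration. The model ignores preemption timing and per-user limits.

```python
def users_queue_share(demand_pct, max_capacity_pct, production_usage_pct):
    """Return the percent of the cluster the users queue can hold.

    Simplified model (an assumption, not the real scheduler logic): the
    queue gets what it asks for, bounded by its configured maximum and by
    whatever production jobs leave free.
    """
    free_pct = 100.0 - production_usage_pct
    return max(0.0, min(demand_pct, max_capacity_pct, free_pct))


# Production idle, users ask for everything: an 80% cap leaves 20% unused ...
capped = users_queue_share(100.0, 80.0, 0.0)
# ... while a 100% cap lets users fill the otherwise idle cluster.
elastic = users_queue_share(100.0, 100.0, 0.0)
# With production holding 30%, both settings give users the same share.
busy_capped = users_queue_share(100.0, 80.0, 30.0)
busy_elastic = users_queue_share(100.0, 100.0, 30.0)
```

In this model the cap only changes the outcome while production is idle, which is joal's point. What it cannot show is how quickly preemption hands resources back to production once prod jobs start, which is elukey's concern.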
[13:32:24] joal: it makes sense sure
[13:34:18] I am following the same approach for ram usage on stat boxes, leaving space for the OS even if not immediately needed
[13:35:10] anyway, I don't have a lot of decision power now so I'll leave the decision to the team, we can review the setting later on if needed
[13:35:14] :D
[13:35:24] thanks a lot ottomata for checking the sanitization - let me know if you wish me to help
[13:35:47] elukey: you're the puppet-master - you do what you want :)
[13:37:01] Analytics: Review the usage of dns_canonicalize=false for Kerberos - https://phabricator.wikimedia.org/T278353 (elukey) @Ottomata if this is the case then I'd just leave dns_canonicalize_hostname = false on all worker nodes as well, so people will not have to wonder all these extra things. It is interesting...
[13:44:26] (PS1) DCausse: Make LocaleUtil and UserEventBuilder independent from system locale [analytics/refinery/source] - https://gerrit.wikimedia.org/r/676075
[13:45:45] elukey: why do they need to be re-ran via the unit?
[13:45:52] systemctl status shows them as succeeded
[13:46:03] the unit has been launched and succeeded since those failures
[13:46:07] by the tiner
[13:46:12] timer*
[13:46:13] ottomata: monitor_ vs refine_ :)
[13:46:18] OHHHH DUH
[13:46:30] right right iriiiiight
[13:46:31] ok thank you
[13:46:46] I was puzzled by the email and I checked, I thought something horrible was happening on the timers
[13:46:51] joal: thanks, yeah data_purge.pp was just using refine_job without specifying the refinery version!
[13:46:58] so it's been using the latest deployed refinery always
[13:47:10] will try to make some puppet today to use the new version
[13:47:12] yeah so that ottomata - great finding :)
[13:47:18] although...lots of meetings today
[13:47:20] so maybe tomorrow
[13:47:43] RECOVERY - Check unit status of monitor_refine_sanitize_eventlogging_analytics_delayed on an-launcher1002 is OK: OK: Status of the systemd unit monitor_refine_sanitize_eventlogging_analytics_delayed https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[13:50:57] RECOVERY - Check unit status of monitor_refine_sanitize_eventlogging_analytics_immediate on an-launcher1002 is OK: OK: Status of the systemd unit monitor_refine_sanitize_eventlogging_analytics_immediate https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[14:15:59] (PS1) DCausse: Switch to eventutilities 1.0.4 [analytics/refinery/source] - https://gerrit.wikimedia.org/r/676081
[14:16:24] (CR) jerkins-bot: [V: -1] Switch to eventutilities 1.0.4 [analytics/refinery/source] - https://gerrit.wikimedia.org/r/676081 (owner: DCausse)
[14:18:12] !log starting copy of large tables from aqs1007 to aqs1011
[14:18:14] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[14:20:39] Analytics-Radar, Cassandra, ContentTranslation, Event-Platform, and 9 others: Rebuild all blubber build docker images running on kubernetes - https://phabricator.wikimedia.org/T274262 (akosiaris) Open→Resolved All action items have been addressed, many many thanks to everyone who contribu...
[14:41:44] (CR) DCausse: Switch to eventutilities 1.0.4 (1 comment) [analytics/refinery/source] - https://gerrit.wikimedia.org/r/676081 (owner: DCausse)
[14:57:54] Analytics, Data-Services, Machine-Learning-Team, ORES, and 2 others: Generate dump of scored-revisions from 2018-2020 for English Wikipedia - https://phabricator.wikimedia.org/T277609 (Suriname0) Hi @JAllemandou, thanks so much for doing this. Inspecting the 2021 data, format looks great (and cov...
[15:06:35] (CR) Gilles: [C: +2] Add cacheHost field to NavigationTiming schema [schemas/event/secondary] - https://gerrit.wikimedia.org/r/675105 (https://phabricator.wikimedia.org/T277769) (owner: Gilles)
[15:07:16] elukey: I think I found the rabbit, btw: https://gerrit.wikimedia.org/r/c/operations/puppet/+/675907/
[15:07:19] (Merged) jenkins-bot: Add cacheHost field to NavigationTiming schema [schemas/event/secondary] - https://gerrit.wikimedia.org/r/675105 (https://phabricator.wikimedia.org/T277769) (owner: Gilles)
[15:08:23] awight: hi! I wanted to follow up with you, from the task I didn't get if the RU jobs are working or not now
[15:16:49] elukey: That's the thing I can't tell on my own, unfortunately. Nothing is landing in Graphite, but I don't have access to the RU error logs.
[15:22:39] Analytics: Produce a list of wiki projects ranked by number of eligible voters in Board elections - https://phabricator.wikimedia.org/T278815 (KCVelaga) Hi, I have a couple of suggestions. I am a part-time facilitator on Quim's team. # A rule for user not blocked (on any Wikimedia project) # [[ https://...
[15:23:23] Hi team, good morning! I'm planning to deploy superset today; could elukey / ottomata review https://gerrit.wikimedia.org/r/c/analytics/superset/deploy/+/665130?
[15:25:42] razzi: hi! I am reviewing, but there are some things that are unclear.. why are we using a different npm and not the buster one?
[15:27:01] same thing for pip, we are not using anymore what is provided by buster
[15:29:03] it should be fine since we don't really rely on buster nodejs, but it is another dep from the internet rather than from debian
[15:29:47] also "curl -fsSL https://deb.nodesource.com/setup_14.x | bash -" may be a little dangerous, we are blindly trusting a bash script that can contain anything
[15:31:17] elukey: npm version is to match the one superset uses: https://github.com/apache/superset/blob/master/superset-frontend/package.json#L60
[15:31:17] True regarding nodesource curl | bash; would be better to install from the deb
[15:31:17] The pip version is so that we can use the latest wheel format, which the required arrow version now comes in
[15:38:47] (CR) Elukey: Upgrade superset to 1.0.1 (5 comments) [analytics/superset/deploy] - https://gerrit.wikimedia.org/r/665130 (https://phabricator.wikimedia.org/T272390) (owner: Razzi)
[15:39:10] razzi: sure, but please document them, otherwise anybody else doing an upgrade knows it
[15:39:40] for the curl | bash it is sufficient to expand it with the steps needed, likely adding apt source to config, apt-get update and install?
[15:40:35] *so that anybody else doing ..
[15:47:08] a-team remember to join the right meeting for the CTO check-in :)
[15:47:25] Thanks fdans --^ I'd have missed that
[15:52:25] does that mean no standup? I've been to the platform CTO check-ins so I figure it wouldn't make sense for me to be in the analytics one :)
[16:02:14] hnowlan: yes yes no standup, free time for you :)
[16:02:35] but feel free to come if you want!
[16:02:51] * hnowlan runs outside
[16:03:24] ahahahah
[16:19:39] awight: (sorry a lot of meetings, I promise that I'll work on the patch tomorrow)
[16:31:08] Analytics-Clusters, Product-Analytics: Can't re-run failed Oozie workflows in Hue/Hue-Next (as non-admin) - https://phabricator.wikimedia.org/T275212 (nshahquinn-wmf) >>! In T275212#6958479, @razzi wrote: > @nshahquinn-wmf any luck with that setting? Thanks for the reminder—`mapreduce.job.acl-modify-job...
[16:33:37] Analytics: Hive log4j logging is misconfigured - https://phabricator.wikimedia.org/T216294 (nshahquinn-wmf) Open→Resolved a:nshahquinn-wmf >>! In T216294#6957007, @elukey wrote: > Neil I think that this can be closed, what do you think? Yes, I think so. Now we have //new// logspam to worry about...
[16:33:42] Analytics: Hive log4j logging is misconfigured - https://phabricator.wikimedia.org/T216294 (nshahquinn-wmf) a:nshahquinn-wmf→None
[16:34:47] Analytics, Analytics-Dashiki: npm install gives Verification failed while extracting mediawiki-storage@https://github.com/wikimedia/analytics-mediawiki-storage/archive/master.tar.gz - https://phabricator.wikimedia.org/T278982 (Urbanecm)
[16:35:22] Analytics, Analytics-Dashiki: npm install gives Verification failed while extracting mediawiki-storage@https://github.com/wikimedia/analytics-mediawiki-storage/archive/master.tar.gz - https://phabricator.wikimedia.org/T278982 (Urbanecm)
[16:37:15] Analytics-Clusters, Product-Analytics: Can't re-run failed Oozie workflows in Hue/Hue-Next (as non-admin) - https://phabricator.wikimedia.org/T275212 (nshahquinn-wmf) >>! In T275212#6958479, @razzi wrote: > @nshahquinn-wmf any luck with that setting? Thanks for the reminder—`mapreduce.job.acl-modify-job...
[17:31:24] fdans: meeting?
[17:32:29] i’m in it?
[17:32:36] I think so yes :)
[17:32:39] fdans: --^
[17:33:01] i mean in the hangout
[17:33:14] nope
[18:01:06] * elukey afk!
[18:01:14] thanks all for the extra time on this :)
[18:53:43] (PS1) Ottomata: Include RefineFailuresChecker functionality into RefineMonitor [analytics/refinery/source] - https://gerrit.wikimedia.org/r/676131
[19:03:43] (PS2) Ottomata: [RIP] Include RefineFailuresChecker functionality into RefineMonitor [analytics/refinery/source] - https://gerrit.wikimedia.org/r/676131
[19:06:50] (CR) jerkins-bot: [V: -1] [RIP] Include RefineFailuresChecker functionality into RefineMonitor [analytics/refinery/source] - https://gerrit.wikimedia.org/r/676131 (owner: Ottomata)
[19:37:15] Analytics, Patch-For-Review: Newpytyer python spark kernels - https://phabricator.wikimedia.org/T272313 (fkaelin) Thanks @Ottomata, I can also confirm that the certificates work now too, ie a request with `verify=False` now fails on the workers as well.
[22:23:16] (CR) Razzi: Upgrade superset to 1.0.1 (5 comments) [analytics/superset/deploy] - https://gerrit.wikimedia.org/r/665130 (https://phabricator.wikimedia.org/T272390) (owner: Razzi)