[03:01:13] (PS1) BrandonXLF: Use one_or_none to handle non-existent queries [analytics/quarry/web] - https://gerrit.wikimedia.org/r/682030 (https://phabricator.wikimedia.org/T280915)
[03:02:24] (PS2) BrandonXLF: Use one_or_none to handle non-existent queries [analytics/quarry/web] - https://gerrit.wikimedia.org/r/682030 (https://phabricator.wikimedia.org/T280915)
[04:17:56] Analytics, Analytics-Kanban, Event-Platform: Rename event_sanitized partition directories to lowercase - https://phabricator.wikimedia.org/T280813 (Ottomata) ` 21/04/22 21:42:14 INFO Refine: Successfully refined 20 of 20 dataset partitions into table `event_sanitized`.`netflow` (total # refined recor...
[04:19:07] Analytics, Analytics-Kanban, Event-Platform: Rename event_sanitized partition directories to lowercase - https://phabricator.wikimedia.org/T280813 (Ottomata) Ok, should be good here. Will check up on the RefineSanitize jobs in the morning.
[04:19:35] Analytics, Analytics-Kanban, Event-Platform, Patch-For-Review: Sanitize and ingest all event tables into the event_sanitized database - https://phabricator.wikimedia.org/T273789 (Ottomata)
[04:27:01] PROBLEM - Check unit status of monitor_refine_event_sanitized_analytics_delayed on an-launcher1002 is CRITICAL: CRITICAL: Status of the systemd unit monitor_refine_event_sanitized_analytics_delayed https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[04:30:35] PROBLEM - Check unit status of monitor_refine_event_sanitized_analytics_immediate on an-launcher1002 is CRITICAL: CRITICAL: Status of the systemd unit monitor_refine_event_sanitized_analytics_immediate https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[05:03:54] Analytics, Analytics-Kanban, Dumps-Generation, Patch-For-Review: Mention QRank in “Analytics Datasets” - https://phabricator.wikimedia.org/T278416 (ArielGlenn) Open→Resolved Live on the web server.
Have a great weekend when it arrives!
[06:04:34] (CR) Awight: "Thanks for all these tips, they dramatically simplify this patch! I'm out of my element here so I've blindly copied from https://github.c" (8 comments) [analytics/refinery/source] - https://gerrit.wikimedia.org/r/681933 (owner: Awight)
[06:13:59] (PS7) Awight: [WIP] Report on test coverage [analytics/refinery/source] - https://gerrit.wikimedia.org/r/681933
[06:24:39] (CR) Awight: "Now this step is failing," [analytics/refinery/source] - https://gerrit.wikimedia.org/r/681933 (owner: Awight)
[06:30:52] (PS8) Awight: [WIP] Report on test coverage [analytics/refinery/source] - https://gerrit.wikimedia.org/r/681933
[06:38:58] (PS9) Awight: Report on test coverage [analytics/refinery/source] - https://gerrit.wikimedia.org/r/681933
[06:40:28] (CR) Awight: "Works! See https://sonarcloud.io/dashboard?id=org.wikimedia.analytics.refinery%3Arefinery&branch=681933-8" (1 comment) [analytics/refinery/source] - https://gerrit.wikimedia.org/r/681933 (owner: Awight)
[07:01:24] (CR) Gehel: [C: +1] "Looks good! Much simpler this way!"
[analytics/refinery/source] - https://gerrit.wikimedia.org/r/681933 (owner: Awight)
[07:02:56] Good morning
[07:03:02] o/
[08:20:05] (PS1) BrandonXLF: Add stop button to running queries [analytics/quarry/web] - https://gerrit.wikimedia.org/r/682092 (https://phabricator.wikimedia.org/T71037)
[08:20:37] (PS2) BrandonXLF: Add stop button to running queries [analytics/quarry/web] - https://gerrit.wikimedia.org/r/682092 (https://phabricator.wikimedia.org/T71037)
[08:22:13] (PS3) BrandonXLF: Add stop button to running queries [analytics/quarry/web] - https://gerrit.wikimedia.org/r/682092 (https://phabricator.wikimedia.org/T71037)
[08:24:30] (CR) jerkins-bot: [V: -1] Add stop button to running queries [analytics/quarry/web] - https://gerrit.wikimedia.org/r/682092 (https://phabricator.wikimedia.org/T71037) (owner: BrandonXLF)
[08:29:13] (PS1) Gehel: Upgrade Findbugs to Spotbugs and integrate with Sonar. [analytics/refinery/source] - https://gerrit.wikimedia.org/r/682093
[08:36:53] (PS4) BrandonXLF: Add stop button to running queries [analytics/quarry/web] - https://gerrit.wikimedia.org/r/682092 (https://phabricator.wikimedia.org/T71037)
[08:39:14] hello folks
[08:39:44] if nobody opposes I'd deploy https://gerrit.wikimedia.org/r/c/operations/puppet/+/680383
[08:39:52] it seems to work fine in hadoop test
[08:40:03] to enable it we'll have to restart all the daemons
[08:40:17] so I'll start with some for the weekend
[08:40:31] then we'll do a roll restart when the next jvm upgrade is due
[08:42:46] joal: --^
[08:50:54] (PS10) Awight: Report on test coverage [analytics/refinery/source] - https://gerrit.wikimedia.org/r/681933
[08:58:21] (PS1) Awight: Fail CI when test coverage is below 60% [analytics/refinery/source] - https://gerrit.wikimedia.org/r/682100
[08:59:59] (CR) Awight: Fail CI when test coverage is below 60% (1 comment) [analytics/refinery/source] - https://gerrit.wikimedia.org/r/682100 (owner: 
Awight)
[09:00:09] (CR) jerkins-bot: [V: -1] Fail CI when test coverage is below 60% [analytics/refinery/source] - https://gerrit.wikimedia.org/r/682100 (owner: Awight)
[09:06:24] (PS11) Awight: Report on test coverage [analytics/refinery/source] - https://gerrit.wikimedia.org/r/681933
[09:06:26] (PS2) Awight: Fail CI when test coverage is below 54% [analytics/refinery/source] - https://gerrit.wikimedia.org/r/682100
[09:07:48] sorry elukey, I was in meeting - reading the CR
[09:08:00] all good for me elukey
[09:09:57] (CR) jerkins-bot: [V: -1] Fail CI when test coverage is below 54% [analytics/refinery/source] - https://gerrit.wikimedia.org/r/682100 (owner: Awight)
[09:11:02] joal: <3
[09:12:46] !log change default log4j hadoop config to include rolling gzip appender
[09:12:48] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[09:20:56] Analytics-Clusters, Analytics-Kanban, Patch-For-Review: Configure the HDFS Namenodes to use the log4j rolling gzip appender - https://phabricator.wikimedia.org/T276906 (elukey) Verified on an-test-worker1001 that the Yarn NM uses the gzip appender now: ` elukey@an-test-worker1001:~$ ls -lht /var/log...
[09:22:41] for the yarn NM logs there is a reduction from 256MB -> 13MB
[09:22:52] \o/
[09:23:08] the .out files though are not gzipped, they are created outside log4j
[09:23:29] can we maybe not have them?
[09:24:06] in some cases they are important, but I think that their definition is buried inside the scripts shipped with upstream packages :(
[09:26:52] :/
[09:38:00] Hops seems to use Hive+Hudi for their features store (historical / offline use case)
[09:39:36] We'd go for hive/Spark + iceberg, but the approach is cool --^
[09:41:50] the only two open source solutions that I found are Feast and Hops, and the former seems to not have any support outside BigTable
[09:42:06] (but it supports Cassandra/Redis for the online/serving use case)
[10:30:20] !log restart hadoop daemons (NM, DN, JN) on an-worker1080 to further test the new log4j config - T276906
[10:30:23] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[10:30:23] T276906: Configure the HDFS Namenodes to use the log4j rolling gzip appender - https://phabricator.wikimedia.org/T276906
[10:37:38] * elukey lunch!
[11:12:50] Break!
[12:11:40] (CR) Gehel: [C: -1] "I would prefer to leave this to SonarCloud quality gates. The current quality gates will fail the build on coverage <80% on new code. Focu" [analytics/refinery/source] - https://gerrit.wikimedia.org/r/682100 (owner: Awight)
[12:20:40] (CR) Gehel: [C: +1] "Still LGTM and looks ready to be merged." [analytics/refinery/source] - https://gerrit.wikimedia.org/r/681933 (owner: Awight)
[12:22:40] elukey, joal: do you know which group membership Aisha will need? T280967
[12:23:35] elukey: for context, Aisha will be working with Search Platform and Joseph to do some analysis around WDQS. I don't know much about how access to the analytics resources work :/
[12:29:23] gehel: we have https://wikitech.wikimedia.org/wiki/Analytics/Data_access#What_access_should_I_request? with some info
[12:29:39] elukey: thanks! that might help!
[12:29:44] we reduced a lot the complexity around group membership, basically we now have analytics-privatedata-users
[12:29:56] + ldap (wmf/nda) + kerberos if needed
[12:30:43] I am going to be back in a few but let's follow up if something is not clear!
[12:35:10] elukey: I think it's clear enough. I've updated T280967 with what I understand (wmf + analytics-privatedata-users)
[12:35:35] joal: if you can think of anything else that tanny411 might need, can you update the task? Please and thank you!
[12:50:47] gehel: if any data access to hadoop will be required (hive spark hdfs blabla) let's also add kerberos
[12:51:03] going to add a note
[12:51:06] (to the task)
[12:53:17] gehel, elukey: Thanks, task updated.
[12:54:01] hi tanny411!!
[12:54:36] tanny411: https://wikitech.wikimedia.org/wiki/Analytics/Systems/Kerberos/UserGuide to start digesting kerberos, ping us for any doubt :)
[13:10:29] tests on an-worker1080 for logging are good
[13:13:00] hi teammm
[13:13:21] hola marcel!
[13:13:29] :]
[13:28:04] https://lwn.net/Articles/852112/
[13:28:20] interesting article, the author is a very famous italian person
[13:28:38] \o/ :D
[13:32:18] elukey: Thanks!
[13:49:31] (PS3) Mforns: Add --no-graphite flag [analytics/reportupdater] - https://gerrit.wikimedia.org/r/681746 (https://phabricator.wikimedia.org/T280823)
[14:22:25] going to roll restart the hadoop masters to pick up the new log4j settings
[14:23:01] !log roll restart an-master100[1,2] daemons to pick up new log4j settings - T276906
[14:23:05] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[14:23:05] T276906: Configure the HDFS Namenodes to use the log4j rolling gzip appender - https://phabricator.wikimedia.org/T276906
[14:27:28] master is 1002, waiting 10/15 mins to failback
[14:35:35] Hi gehel - I was in break, sorry
[14:36:50] gehel: private-data-users and kerberos is what I'd have asked for
[14:37:08] joal: yeah, I think we got all the info needed on the ticket now
[14:37:19] sorry to be late for the party gehel :(
[14:37:25] all good!
[14:37:58] joal: and if you want other fun stuff to make up for it: https://gerrit.wikimedia.org/r/c/analytics/refinery/source/+/682093
[15:00:59] starting to failback 1001
[15:08:39] joal: looks like we have the green light for the capacity scheduler!
[15:08:50] Hurray elukey :)
[15:08:55] When do you wish us to do it?
[15:09:38] joal: monday?
[15:09:52] YES!
[15:13:09] super :)
[15:13:21] ottomata, razzi - ok if we deploy the capacity scheduler on monday?
[15:15:53] Analytics-Clusters, Analytics-Kanban, Patch-For-Review: Configure the HDFS Namenodes to use the log4j rolling gzip appender - https://phabricator.wikimedia.org/T276906 (elukey) I'll keep monitoring the an-master100x nodes and an-worker1080, but we should be good from preliminary results. The complete...
[16:07:46] * elukey bbiab
[16:15:47] (CR) Mforns: [V: +2 C: +2] "I tested this and all jobs ran fine, LGTM!" [analytics/reportupdater-queries] - https://gerrit.wikimedia.org/r/681707 (https://phabricator.wikimedia.org/T193169) (owner: Awight)
[16:23:27] (CR) Neil P. 
Quinn-WMF: Create content_translation_event schema (2 comments) [schemas/event/secondary] - https://gerrit.wikimedia.org/r/680798 (https://phabricator.wikimedia.org/T254891) (owner: Neil P. Quinn-WMF)
[16:43:25] RECOVERY - Check unit status of monitor_refine_event_sanitized_analytics_immediate on an-launcher1002 is OK: OK: Status of the systemd unit monitor_refine_event_sanitized_analytics_immediate https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[16:48:27] RECOVERY - Check unit status of monitor_refine_event_sanitized_analytics_delayed on an-launcher1002 is OK: OK: Status of the systemd unit monitor_refine_event_sanitized_analytics_delayed https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[17:09:46] (CR) Awight: [C: +1] "Would merge." (1 comment) [analytics/reportupdater] - https://gerrit.wikimedia.org/r/681746 (https://phabricator.wikimedia.org/T280823) (owner: Mforns)
[17:15:06] mforns: FYI, I've purged graphite again so everything should be ready for reenabling my jobs. I'll rebase the puppet patch...
[17:15:45] thanks awight, pinging an SRE
[17:15:53] mforns: I think you might need to purge a bunch of the RU output files to force re-runs for the backfill.
[17:16:02] (CR) Mforns: [C: +2] Add --no-graphite flag [analytics/reportupdater] - https://gerrit.wikimedia.org/r/681746 (https://phabricator.wikimedia.org/T280823) (owner: Mforns)
[17:16:16] (CR) Mforns: [V: +2 C: +2] Add --no-graphite flag [analytics/reportupdater] - https://gerrit.wikimedia.org/r/681746 (https://phabricator.wikimedia.org/T280823) (owner: Mforns)
[17:17:33] awight: purge files, yes you're totally right, doing
[17:17:47] mforns: Happy Friday!
[17:18:07] Analytics-Radar, observability, Graphite, Patch-For-Review, WMDE-TechWish-Sprint-2021-04-14: Broken reportupdater queries: edit count bucket label contains illegal characters - https://phabricator.wikimedia.org/T279046 (awight)
[17:18:42] awight: :] thanks! you too..
[17:19:06] this is relaxed work, so totally cool
[17:19:22] Just a lot of it ;-)
[17:21:48] PROBLEM - Check unit status of monitor_refine_event_sanitized_analytics_delayed on an-launcher1002 is CRITICAL: CRITICAL: Status of the systemd unit monitor_refine_event_sanitized_analytics_delayed https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[17:37:58] elukey: +1 to capacity sched on monday
[17:40:23] ottomata: thanks!
[18:00:23] hi razzi :] can you please look at https://gerrit.wikimedia.org/r/c/operations/puppet/+/680021 everything that needed to be done prior to re-activating these jobs has been done by awight and me, we've tested the jobs with real data, including posting to graphite. The start dates for those reports have been updated, and the report files purged of corrupted rows that need to be re-run. We think it's ready to
[18:00:23] merge :] thanks!!
[18:04:20] Analytics-Radar, observability, Graphite, Patch-For-Review, WMDE-TechWish-Sprint-2021-04-14: Broken reportupdater queries: edit count bucket label contains illegal characters - https://phabricator.wikimedia.org/T279046 (awight)
[18:14:31] Analytics-Radar: Migrate all reportupdater queries to hive - https://phabricator.wikimedia.org/T205296 (awight) I'd like to understand what this task is about. Was it similar to {T193169}, or was the intent to migrate SQL to Hive?
[18:17:32] Analytics: [Reportupdater] Support category of jobs that cannot be backfilled - https://phabricator.wikimedia.org/T280997 (awight)
[18:25:57] (Abandoned) Awight: Fail CI when test coverage is below 54% [analytics/refinery/source] - https://gerrit.wikimedia.org/r/682100 (owner: Awight)
[18:32:27] * elukey afk, have a good weekend folks
[18:32:29] :)
[19:13:36] Analytics, Better Use Of Data, Event-Platform, Product-Data-Infrastructure, Readers-Web-Backlog: VirtualPageView should use EventLogging api to send virtual page view events - https://phabricator.wikimedia.org/T279382 (mforns) > FYI, @mforns has begun work on this I think. @Jdlrobson, @phuedx...
[19:13:50] RECOVERY - Check unit status of monitor_refine_event_sanitized_analytics_delayed on an-launcher1002 is OK: OK: Status of the systemd unit monitor_refine_event_sanitized_analytics_delayed https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[19:27:59] Hi team, I didn’t mention I was taking today as a vacation. 
I apologize for not letting you all know in advance
[19:47:24] Analytics-Radar, Data-Services, Developer-Advocacy (Apr-Jun 2021), cloud-services-team (Kanban): Mitigate breaking changes from the new Wiki Replicas architecture - https://phabricator.wikimedia.org/T280152 (Jhernandez)
[19:50:05] Analytics-Radar, Data-Services, Developer-Advocacy (Apr-Jun 2021), cloud-services-team (Kanban): Mitigate breaking changes from the new Wiki Replicas architecture - https://phabricator.wikimedia.org/T280152 (Jhernandez)
[20:13:02] Analytics-Radar, Data-Services, Developer-Advocacy (Apr-Jun 2021), cloud-services-team (Kanban): Mitigate breaking changes from the new Wiki Replicas architecture - https://phabricator.wikimedia.org/T280152 (Jhernandez)
[20:50:30] (CR) Mholloway: Create content_translation_event schema (2 comments) [schemas/event/secondary] - https://gerrit.wikimedia.org/r/680798 (https://phabricator.wikimedia.org/T254891) (owner: Neil P. Quinn-WMF)