[01:36:28] (PS5) Milimetric: Update mysql resolver to work with cloud replicas [analytics/refinery] - https://gerrit.wikimedia.org/r/666209 (https://phabricator.wikimedia.org/T274690)
[01:38:44] (CR) Milimetric: "cc @Joal & @Razzi, is this ready? I just wanna make sure it lands along with the corresponding puppet patches (remember the --clouddb par" (1 comment) [analytics/refinery] - https://gerrit.wikimedia.org/r/666209 (https://phabricator.wikimedia.org/T274690) (owner: Milimetric)
[05:57:10] gooood morning
[06:01:25] Analytics, WMDE-Analytics-Engineering: wmde-toolkit-analyzer-build.service fails on stat1007 - https://phabricator.wikimedia.org/T278665 (elukey)
[06:41:08] Good morning :)
[07:06:09] bonjour!
[07:06:40] I found out that the Resource Manager exposes some JMX metrics related to the status of the various queues, same thing for fair, so I am trying to add basic root queue metrics
[07:06:47] like apps completed/killed/etc..
[07:10:30] if we think about hourly metrics aggregation, we could have an SLI-like metric for apps completed/total and one for total/hour jobs
[07:13:01] That'd be really great elukey!
[07:13:15] elukey: would there by any chance be metrics about resource consumption?
[07:14:01] yep!
[07:14:07] and there are metrics for each queue
[07:14:11] elukey: \o/
[07:14:19] This is really great :)
[07:14:37] but I'd start only from root
[07:14:41] sure
[07:15:36] ah also the labels are working!
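A minimal sketch of the SLI idea floated above (apps completed / apps submitted for the root queue). It assumes the ResourceManager JMX endpoint returns JSON with a `beans` list, and that the root queue bean is named `Hadoop:service=ResourceManager,name=QueueMetrics,q0=root` with `AppsSubmitted` and `AppsCompleted` counters; both the bean name and the sample numbers are assumptions for illustration, not something taken from the cluster.

```python
from typing import Optional

# Assumed bean name for the root queue; adjust to whatever /jmx actually reports.
ROOT_QUEUE_BEAN = "Hadoop:service=ResourceManager,name=QueueMetrics,q0=root"


def root_queue_bean(jmx_payload: dict) -> Optional[dict]:
    """Return the root-queue metrics bean from a parsed JMX JSON payload, if present."""
    for bean in jmx_payload.get("beans", []):
        if bean.get("name") == ROOT_QUEUE_BEAN:
            return bean
    return None


def completion_sli(queue_metrics: dict) -> Optional[float]:
    """Ratio of completed to submitted apps, or None when nothing was submitted."""
    submitted = queue_metrics.get("AppsSubmitted", 0)
    completed = queue_metrics.get("AppsCompleted", 0)
    if submitted <= 0:
        return None
    return completed / submitted


if __name__ == "__main__":
    # Hypothetical payload, shaped like a /jmx response.
    sample = {
        "beans": [
            {"name": ROOT_QUEUE_BEAN, "AppsSubmitted": 200, "AppsCompleted": 190},
        ]
    }
    bean = root_queue_bean(sample)
    print(completion_sli(bean))
```

Since these counters are cumulative, an hourly SLI would more likely be computed from the difference between two scrapes than from the raw totals.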
[07:16:04] I am going to deploy the analytics-search keytab in a bit
[07:16:23] joal: https://blog.cloudera.com/managing-cpu-resources-in-your-hadoop-yarn-clusters/ is also nice to read
[07:16:36] ack elukey - will do
[07:17:09] but possibly for later, we can keep the default for the moment
[07:36:41] Analytics-EventLogging, Analytics-Radar, Front-end-Standards-Group, MediaWiki-extensions-WikimediaEvents, and 4 others: Provide a reusable getEditCountBucket function for analytics purposes - https://phabricator.wikimedia.org/T210106 (awight)
[07:42:01] joal: just added some metrics to the RM tab in grafana
[07:42:55] the allocated MBs seem to be weird
[07:43:02] maybe it is in bytes
[07:44:07] elukey: it would make more sense if in bytes :)
[07:44:25] elukey: I'm completely ignorant in the space though
[07:44:38] it seems in MB from the docs, weird
[07:44:43] Analytics-EventLogging, Analytics-Radar, Front-end-Standards-Group, MediaWiki-extensions-WikimediaEvents, and 3 others: Provide a reusable getEditCountBucket function for analytics purposes - https://phabricator.wikimedia.org/T210106 (awight)
[07:44:44] elukey: those metrics are super great :)
[07:46:53] joal: the apps completed/submitted may be a good SLI, together with the rate of completed / timeframe
[07:47:10] I mean something that we can alarm on, in a more SRE-like way
[07:47:20] yup I hear that
[07:57:07] the metrics in the Test cluster (with capacity scheduler) seem to work in a better way
[07:57:10] weird
[07:57:19] anyway, will review them again later on :)
[07:57:56] deploying the keytab on an-test-client
[08:11:45] joal: ok analytics-search deployed in test :)
[08:12:01] Yes elukey !P
[08:12:04] let's test!
[08:12:43] I can create a spark2 session
[08:12:51] elukey: great
[08:13:36] elukey: please go ahead :)
[08:13:48] joal: done
[08:14:11] elukey: I confirm I can see it
[08:14:17] elukey: I'm gonna try to move it
[08:14:28] elukey: I should be able as I am in analytics-admin
[08:16:19] elukey: confirmed
[08:16:23] \o/
[08:16:38] elukey: can you sudo as a user not being in analytics-admin?
[08:17:02] elukey: or temporarily remove me from analytics-admin
[08:17:03] ?
[08:17:05] joal: I can yes
[08:17:14] what command should I run?
[08:17:27] ah elukey - this user should be kerberized
[08:17:32] ah joal you can use analytics
[08:17:39] Ah!
[08:17:42] ok great, testing
[08:18:21] elukey: it looks like analytics user is in analytics-admins
[08:18:26] elukey: I managed to move the job
[08:20:48] joal: nope it is not
[08:20:59] Arf :(
[08:21:16] It feels like an ACL config problem then (or maybe it's a me problem)
[08:21:42] elukey: IIUC the analytics user shouldn't be able to 'manage' queues
[08:21:54] joal: what queue did you move it from?
[08:21:56] And, it did
[08:22:10] elukey: I moved it back and forth in analytics/search
[08:27:05] weird
[08:27:28] joal: can you try to submit a job to the search queue as analytics?
[08:27:35] elukey: trying
[08:27:43] if it works acls are not enabled
[08:28:36] elukey: refun
[08:28:41] elukey: sorry - again
[08:28:56] elukey: refused: User analytics does not have permission to submit application_1616774365241_1568 to queue search
[08:29:08] ACLs are enabled
[08:29:22] but, it seems that queue-management is maybe not what we expected?
[08:29:30] probably yes, interesting
[08:29:48] I checked on the RM logs and it was the user 'analytics' to issue the move
[08:30:07] just to confirm that maybe there was no weird sudo -> RM interaction
[08:30:15] ack
[08:31:57] joal: I think that is a queue vs job kind of ACL
[08:32:11] elukey: I was reading on that as well
[08:32:42] elukey: but then, what is the queue-management stuff???
[08:34:00] elukey: I'm only reading about 'submit', not manage
[08:34:39] ah - it's about getting info and killing
[08:35:52] elukey: I managed to kill your spark-shell while in product queue with analytics-user
[08:36:02] I think this is not expected
[08:36:10] * joal doesn't understand
[08:36:53] https://docs.cloudera.com/documentation/enterprise/5-12-x/topics/cm_mc_yarn_acl.html#concept_yarn_app_acls
[08:37:14] so in theory, IIUC, we should also set mapreduce.job.acl-modify-job
[08:37:51] the "how" part I am not sure
[08:37:58] hm
[08:39:02] elukey: from https://docs.cloudera.com/runtime/7.2.7/yarn-allocate-resources/topics/yarn-control-access-to-queues-using-acls.html last 2 paragraphs, the queue-administration ACLs should have prevented user analytics from killing your spark-shell, no?
[08:41:34] in theory yes, this was my understanding
[08:42:58] https://docs.cloudera.com/runtime/7.2.7/yarn-security/topics/yarn-admin-acl.html
[08:43:04] * elukey cries in a corner
[08:43:08] joal: --^
[08:43:45] /o\
[08:43:58] Man - this is like broken badly
[08:46:06] joal: ok added the option with ' analytics-admins', let's see what happens
[08:46:13] I just re-created the spark session
[08:46:34] testing
[08:47:09] elukey: User analytics cannot perform operation MODIFY_APP on application_1617007527356_0001
[08:47:19] elukey: trying with my own user
[08:47:38] ah!
[08:48:02] elukey: same for moving app
[08:48:19] elukey: moving worked with joal user
[08:48:35] and so did kill
[08:48:39] elukey: you nailed it!!!!
[08:48:54] * joal bows to elukey once again :)
[08:50:34] the usual intuitive hadoop settings :(
[08:50:37] adding it to puppet!
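For readers puzzled by the leading space in ' analytics-admins': Hadoop ACL strings take the form "users groups" (comma-separated lists split by one space), with "*" meaning everybody, so a value that starts with a space lists no users and only groups. The sketch below illustrates that format in Python; it is not Hadoop's own code. That the option in question is `yarn.admin.acl` is an assumption based on the page linked above, and the group memberships in the usage lines are hypothetical.

```python
from typing import Iterable, Set, Tuple


def parse_acl(acl: str) -> Tuple[bool, Set[str], Set[str]]:
    """Split a Hadoop-style ACL string into (allow_all, users, groups)."""
    if acl.strip() == "*":
        return True, set(), set()
    # Everything before the first space is the user list, the rest is the group list.
    users_part, _, groups_part = acl.partition(" ")
    users = {u for u in users_part.split(",") if u}
    groups = {g for g in groups_part.strip().split(",") if g}
    return False, users, groups


def is_allowed(acl: str, user: str, user_groups: Iterable[str]) -> bool:
    """True if the user, or any of the user's groups, is covered by the ACL."""
    allow_all, users, groups = parse_acl(acl)
    if allow_all:
        return True
    return user in users or bool(groups.intersection(user_groups))


if __name__ == "__main__":
    admin_acl = " analytics-admins"  # leading space: no users, one group
    # Hypothetical memberships, for illustration only.
    print(is_allowed(admin_acl, "joal", ["analytics-admins"]))
    print(is_allowed(admin_acl, "analytics", ["analytics"]))
```

With a wildcard value every user passes this check, which would match what was seen before the change: the analytics user could move and kill applications in queues it could not submit to.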
[08:59:26] all right so we are good :)
[08:59:45] the labels settings are a little weird
[09:00:10] I had to add the following
[09:00:12] 'yarn.scheduler.capacity.root.accessible-node-labels' => 'GPU',
[09:00:16] 'yarn.scheduler.capacity.root.accessible-node-labels.GPU.capacity' => '100',
[09:00:19] 'yarn.scheduler.capacity.root.users.accessible-node-labels' => 'GPU',
[09:00:22] 'yarn.scheduler.capacity.root.users.accessible-node-labels.GPU.capacity' => '100',
[09:00:25] 'yarn.scheduler.capacity.root.production.accessible-node-labels' => ' ',
[09:00:26] elukey: the queues are weird as well in UI
[09:00:28] 'yarn.scheduler.capacity.root.users.fifo.accessible-node-labels.GPU.capacity' => '100',
[09:00:31] 'yarn.scheduler.capacity.root.users.fifo.accessible-node-labels.GPU.max-capacity' => '100',
[09:00:34] 'yarn.scheduler.capacity.root.users.default.accessible-node-labels.GPU.capacity' => '0',
[09:00:37] 'yarn.scheduler.capacity.root.users.default.accessible-node-labels.GPU.max-capacity' => '0',
[09:00:40] 'yarn.scheduler.capacity.root.users.fifo.default-node-label-expression' => 'GPU',
[09:00:49] so when we add labels, it is as if we created a sub-cluster
[09:00:58] and the queues get duplicated (sort of)
[09:01:27] in our case I had to specifically enable the label from the root to the leaf
[09:01:35] and then assign new limits
[09:03:29] ack elukey - it's weird
[09:04:37] It kinda makes sense when you enter in the mentality of the cluster partitioning, but I wouldn't have expected this complexity
[09:04:46] maybe code-wise it was easier
[09:05:32] do you think that we are ready-ish to think about deploying this?
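A rough sketch of the "enable the label from the root to the leaf" point, using the `accessible-node-labels` settings pasted above. It assumes, as the Capacity Scheduler documentation describes, that a queue without its own `accessible-node-labels` entry inherits the parent's value and that a single space means "no labels". The helper `effective_labels` is hypothetical and only mimics that lookup; it is not the scheduler's implementation.

```python
from typing import Dict, Set

PREFIX = "yarn.scheduler.capacity."
SUFFIX = ".accessible-node-labels"


def effective_labels(props: Dict[str, str], queue: str) -> Set[str]:
    """Labels a queue can access, walking up to the parent when the queue sets none."""
    path = queue
    while path:
        value = props.get(PREFIX + path + SUFFIX)
        if value is not None:
            # A value of ' ' yields an empty set: default partition only.
            return {label.strip() for label in value.split(",") if label.strip()}
        path = path.rpartition(".")[0]  # drop the last queue component
    return set()


if __name__ == "__main__":
    # Subset of the settings from the log above.
    props = {
        "yarn.scheduler.capacity.root.accessible-node-labels": "GPU",
        "yarn.scheduler.capacity.root.users.accessible-node-labels": "GPU",
        "yarn.scheduler.capacity.root.production.accessible-node-labels": " ",
    }
    for queue in ("root.users.fifo", "root.users.default", "root.production"):
        print(queue, sorted(effective_labels(props, queue)))
```

Under these assumptions both `root.users.fifo` and `root.users.default` can access GPU, so it is the per-label `capacity` / `max-capacity` values (100 vs 0) that actually keep GPU nodes for the fifo queue.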
[09:05:40] of course we'll have to present it to the team
[09:14:00] elukey: presentation to the team (and other teams), and then move
[09:14:01] +1
[09:15:25] I can prep one, for other teams it might be a little too much, we can advertise in a wiki page the main features
[09:15:33] in theory nobody will really care
[09:15:41] as long as they'll be able to submit jobs
[09:15:55] (with some guidance about queues etc..)
[09:16:00] what do you think?
[09:17:45] works for me elukey
[10:03:59] joal: created a bare min presentation :)
[10:04:12] Thanks a million elukey :)
[10:04:42] joal: I'll need your help when presenting since I'll make mistakes for sure (especially for fair)
[10:05:02] elukey: I think you know more than I do, but I'll be with you for sure :)
[10:05:37] joal: I highly doubt it, as always :)
[12:18:22] hey teamm!
[12:47:21] hola!
[13:20:11] Analytics-Radar, Growth-Scaling, Growth-Team (Current Sprint), Patch-For-Review, Product-Analytics (Kanban): Growth: update welcome survey aggregation schedule - https://phabricator.wikimedia.org/T275172 (kostajh) >>! In T275172#6930651, @nettrom_WMF wrote: > The first part of this work is no...
[13:27:37] elukey: heya, I'm reading your slides, qq: when you say i.e. the production queue preempts if needed, does it mean the production jobs get interrupted if needed, or that other jobs with lower priority get interrupted to ensure resources for the production ones?
[13:31:05] (PS5) Gilles: Add cacheHost field to NavigationTiming schema [schemas/event/secondary] - https://gerrit.wikimedia.org/r/675105 (https://phabricator.wikimedia.org/T277769)
[13:32:03] (CR) jerkins-bot: [V: -1] Add cacheHost field to NavigationTiming schema [schemas/event/secondary] - https://gerrit.wikimedia.org/r/675105 (https://phabricator.wikimedia.org/T277769) (owner: Gilles)
[13:32:16] mforns: the scheduler should try to reduce container allocations (when possible) from other jobs in other queues, assigning the capacity to production
[13:32:23] I can change the wording in case
[13:33:45] https://hadoop.apache.org/docs/r2.10.1/hadoop-yarn/hadoop-yarn-site/FairScheduler.html
[13:33:52] (PS6) Gilles: Add cacheHost field to NavigationTiming schema [schemas/event/secondary] - https://gerrit.wikimedia.org/r/675105 (https://phabricator.wikimedia.org/T277769)
[14:28:32] elukey: I understand, thanks!
[14:29:25] no need to change, just me not understanding the jargon
[14:31:44] mforns: keep in mind that my understanding of the schedulers is not great, Joseph is the one to ask :D
[14:32:10] let me know if you want to chat about the slides, the more eyes/questions/etc.. the better
[14:32:28] I am planning to quickly present this when the team wants to see if we can proceed or not
[14:37:57] Analytics, Cassandra: AQS Cassandra driver needs to be updated - https://phabricator.wikimedia.org/T278699 (hnowlan)
[14:39:06] elukey: slides look good to me!
[14:39:51] (CR) Mforns: [V: +2 C: +2] "Self-merging this, as it's just config." [analytics/refinery] - https://gerrit.wikimedia.org/r/673547 (owner: Mforns)
[14:41:04] (Abandoned) Mforns: POC Airflow Refine [analytics/refinery] - https://gerrit.wikimedia.org/r/597623 (https://phabricator.wikimedia.org/T241246) (owner: Mforns)
[14:49:43] (PS1) Hnowlan: package: bump restbase-mod-table-cassandra [analytics/aqs] - https://gerrit.wikimedia.org/r/675523 (https://phabricator.wikimedia.org/T278699)
[14:50:02] (CR) Mforns: Add monthly pageview complete job (2 comments) [analytics/refinery] - https://gerrit.wikimedia.org/r/658348 (https://phabricator.wikimedia.org/T265732) (owner: Fdans)
[14:53:16] who is a good person to curse with *ahem* help with review on a minor AQS change?
[14:55:32] Analytics, Cassandra: Store AQS schema and grants in git - https://phabricator.wikimedia.org/T278701 (hnowlan)
[14:55:51] Analytics, Cassandra, Patch-For-Review: AQS Cassandra driver needs to be updated - https://phabricator.wikimedia.org/T278699 (hnowlan) a:hnowlan
[14:58:42] hnowlan: for the aqs change you can add Joseph and Dan
[14:58:47] and also Marcel
[14:59:50] and we have https://wikitech.wikimedia.org/wiki/Analytics/Systems/AQS#Deployment
[15:03:21] fdans: should I create a task to add the pageviews per country metrics to Wikistats, or do we usually wait until the endpoint's stability is 'stable' to do that? I could also add it anyways and note that we need to wait for the stability change
[15:28:18] Quarry: Quarry cannot store results with identical column names - https://phabricator.wikimedia.org/T170464 (Tgr) duplicate→Open Not really related.
[15:32:30] elukey: took a quick glance at httpbb that you suggested, and it seems that it's more intended for validity testing (to make sure an expected response is served) rather than load testing (to test load on Cassandra) - do you know if it's possible to get Cassandra load metrics for httpbb? would I just have to use something like prometheus to monitor Cassandra?
[15:39:39] Analytics-Clusters: Upgrade the rest of the Hadoop test cluster to Buster - https://phabricator.wikimedia.org/T278422 (Ottomata) a:razzi
[15:39:48] Analytics-Clusters: Upgrade the Hadoop masters to Debian Buster - https://phabricator.wikimedia.org/T278423 (Ottomata) a:razzi
[15:39:58] Analytics-Clusters: Upgrade the Hadoop coordinators to Debian Buster - https://phabricator.wikimedia.org/T278424 (Ottomata) a:elukey
[15:41:04] Analytics-Clusters, Patch-For-Review: Install Debian Buster on Hadoop - https://phabricator.wikimedia.org/T231067 (Ottomata) @razzi, FYI in ops sync today we decided that you could drive a few of these upgrade tasks in Q4, while Luca would drive the Hadoop coordinator node one. I've assigned them accord...
[15:42:55] Analytics-Clusters: Migrate eventlog1002 to buster - https://phabricator.wikimedia.org/T278137 (Ottomata) a:hnowlan
[15:43:33] Analytics-Clusters: AQS Cassandra storage: Investigate incorrect storage report on Grafana - https://phabricator.wikimedia.org/T278234 (Ottomata) a:hnowlan
[15:44:39] Analytics-Clusters: Upgrade Druid to 0.20.1 (latest upstream) - https://phabricator.wikimedia.org/T278056 (Ottomata) a:elukey Luca to work with @razzi on this.
[15:47:42] Analytics-Clusters, Analytics-Kanban: AQS Cassandra storage: Investigate incorrect storage report on Grafana - https://phabricator.wikimedia.org/T278234 (Ottomata)
[15:48:12] Analytics-Clusters, Analytics-Kanban, observability, User-fgiunchedi: Setup Analytics team in VO/splunk oncall - https://phabricator.wikimedia.org/T273064 (Ottomata)
[15:48:34] Analytics-Clusters, Analytics-Kanban, DBA, Data-Services, and 2 others: Convert labsdb1012 from multi-source to multi-instance - https://phabricator.wikimedia.org/T269211 (Ottomata)
[15:56:58] gonna depool aqs1004 while syncing over the local_group_default_T_pageviews_per_article_flat tables if that's ok
[16:02:20] elukey: just an update on the rack/topology question in AQS - turns out that if you import all data from all hosts in a single rack, that'll be enough. Which makes total sense in hindsight g
[16:18:12] hnowlan: \o/
[16:18:22] that is a lot less data moving and loading!
[16:18:42] +1 for depool and sync!
[16:50:54] thanks a lot hnowlan :)
[16:51:04] Gone for dinner team - will be back after
[16:57:57] later!
[16:58:08] I'm guessing aqs1004 is gonna be out until 1800 UTC at least
[16:59:20] ack!
[17:12:29] Analytics, Analytics-Wikistats, Patch-For-Review, good first task: Wikistats Bug - easy to understand language for pageviews - https://phabricator.wikimedia.org/T263973 (Bharatkhatri351) a:Bharatkhatri351→None
[17:24:56] elukey: I think you may have missed my message above around 2 hours ago, would you mind taking a look?
[17:27:49] lexnasser: ah sorry indeed I missed it!
[17:28:20] yes yes it may not be the right tool, feel free to use siege or ab
[17:28:35] it also depends what is the goal of your load testing
[17:29:18] but siege might give you good latency metrics related to the client, plus you can incorporate it with more generic metrics from grafana about the cluster in general
[17:29:40] need to go now sorry if I have missed the ping, we can chat anytime tomorrow or after it!
[17:29:43] * elukey afk!
[17:35:45] thanks for the info, chat tomorrow!
[19:06:55] aqs1004 is back in
[19:07:17] starting the import on aqs1010, it'll be a while :)
[19:12:32] (CR) Mforns: [C: +1] "LGTM!" [analytics/refinery/source] - https://gerrit.wikimedia.org/r/670321 (https://phabricator.wikimedia.org/T273789) (owner: Ottomata)
[19:35:43] (CR) Mforns: Add support for finding RefineTarget inputs from Hive (2 comments) [analytics/refinery/source] - https://gerrit.wikimedia.org/r/673604 (https://phabricator.wikimedia.org/T212451) (owner: Ottomata)
[19:38:01] (CR) Mforns: [C: +1] "LGTM!" [analytics/refinery/source] - https://gerrit.wikimedia.org/r/674304 (owner: Ottomata)
[19:43:05] (CR) Mforns: [V: +2 C: +2] "LGTM! Merging. Thanks a lot for this!" [analytics/reportupdater] - https://gerrit.wikimedia.org/r/667192 (https://phabricator.wikimedia.org/T193169) (owner: Awight)
[20:00:06] Analytics-Radar, WMDE-Analytics-Engineering, User-GoranSMilovanovic: Wikidata Usage and Coverage ETL failing from stat1004 - https://phabricator.wikimedia.org/T278299 (GoranSMilovanovic) Open→Resolved
[20:10:52] Quarry: Bad database name: nds_nlwiki - https://phabricator.wikimedia.org/T278715 (Bdijkstra)
[20:25:37] (CR) Ottomata: Add support for finding RefineTarget inputs from Hive (14 comments) [analytics/refinery/source] - https://gerrit.wikimedia.org/r/673604 (https://phabricator.wikimedia.org/T212451) (owner: Ottomata)
[20:26:00] (CR) Ottomata: "Ah while, unsubmitted comments from Friday! Sorry about that." [analytics/refinery/source] - https://gerrit.wikimedia.org/r/673604 (https://phabricator.wikimedia.org/T212451) (owner: Ottomata)