[00:01:34] (03PS2) 10MNeisler: Add DesktopWebUIActionsTracking fields to eventlogging allowlist [analytics/refinery] - 10https://gerrit.wikimedia.org/r/631988 (https://phabricator.wikimedia.org/T263143) [00:06:06] (03CR) 10MNeisler: Add DesktopWebUIActionsTracking fields to eventlogging allowlist (031 comment) [analytics/refinery] - 10https://gerrit.wikimedia.org/r/631988 (https://phabricator.wikimedia.org/T263143) (owner: 10MNeisler) [05:37:29] !log decom analytics1047 from the Hadoop cluster [05:37:32] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [05:37:35] bonjour [06:38:20] Bonjour ! [06:38:48] Yeah, more and more nodes decom elukey - Congrats :) [06:39:30] joal: congrats?? :D [06:39:59] Well, no alert, everything works and all - while still removing nodes :) [06:40:20] too much credit, it is really a two command thing :) [06:40:32] anyway, hdfs on the test cluster should be working :) [06:40:57] there is not a lot of space, I think it is probably best to reduce replication to 2 [06:41:03] that's great! I'll test functional distcp cross-cluster today [06:41:28] My test is really not about space for now, rather about functionality [06:42:19] yes ok but I'll need space for sure in the future to test :D [06:42:27] For sure [06:43:15] elukey: interesti [06:43:21] elukey: memory write [06:43:55] elukey: I wonder how nodes deal with memory pressure when doing such a thing [06:44:21] elukey: or, we reduce available memory for compute and preallocate ram for fs - probably the best approach I guess [06:45:12] yes we'd need to add tmpfs mountpoints, that preallocate X amount of RAM when mounted [06:45:35] right - And that RAM cannot be shared with others [06:49:25] elukey: can you tell me more about that test cluster (master address for instance), please?
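[Editor's note] The tmpfs idea discussed above can be sketched as follows; the mount point and size are illustrative, not actual production values. One nuance versus the chat: tmpfs does not preallocate pages at mount time, it allocates lazily up to the declared cap — but that cap is indeed RAM that effectively cannot be promised to other consumers.

```shell
# Sketch (illustrative size/path): RAM-backed filesystem capped at 16G.
# tmpfs allocates pages lazily as files are written, up to the size= cap.
sudo mount -t tmpfs -o size=16g tmpfs /mnt/hdfs-ramdisk
# A persistent /etc/fstab entry would look like:
#   tmpfs  /mnt/hdfs-ramdisk  tmpfs  size=16g  0  0
```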
[06:49:51] so the nodes are [06:49:56] an-test-master1001 [06:50:00] an-test-master1002 [06:50:07] an-test-coordinator1001 (still wip) [06:50:14] an-test-worker100[1-3] [06:50:14] nice and easy :) [06:50:57] better than before yes [06:55:28] elukey: do you by any chance know which engine in MariaDB? [06:56:02] elukey: + we use - sorry - not caffeinated enough [06:56:10] joal: should be innodb! [06:56:36] ok - then this could actually be very fun: https://issues.apache.org/jira/browse/CALCITE-4034 [06:56:37] also I assume the mariadb on an-coord1001 right [06:56:40] ? [06:56:55] elukey: I was thinking on our prod DBs more [06:57:06] wow! [06:57:18] I think it should be innodb as well [06:57:32] elukey: I assume we might have backups of innodb files ? [06:58:02] Which we possibly could load into hdfs - and maybe convert there [06:58:20] so we do have a lot of backups but we'd need to ask data persistence about details [06:58:33] I can imagine that :) [06:59:35] IIRC we do use xtrabackup that should backup innodb files [07:00:29] so ideally this could eliminate our need for the huge db replica for sqoop? [07:00:58] it could, depending on how well it works and how data-copy for innodb-files can be done [07:01:13] really interesting [07:02:21] I love calcite elukey :) [07:02:30] !log reduce hdfs block replication factor on Hadoop test to 2 [07:02:32] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [07:02:43] thanks elukey --^ [07:03:18] ah also the test cluster runs bigto [07:03:20] *bigtop [07:03:26] \o/ [07:12:39] elukey: I assume it's you swapping active node on test cluster?
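[Editor's note] Lowering the block replication factor from 3 to 2, as !logged above, grows usable HDFS capacity by roughly 50%, since usable capacity is approximately raw disk capacity divided by the replication factor. A quick sanity check, assuming ~84T of raw disk across the three test workers (an illustrative figure, not a measured one); the cluster-side command in the comment is the standard way to apply the change to existing files:

```shell
# usable capacity ~= raw capacity / block replication factor
# (on a live cluster the change itself would be something like:
#   sudo -u hdfs hdfs dfs -setrep -R 2 /
#  plus dfs.replication=2 in hdfs-site.xml for newly written files)
raw_tb=84
for rep in 2 3; do
  echo "replication=$rep -> usable ~$((raw_tb / rep))T"
done
```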
[07:14:13] yes just did [07:17:46] ack - something else we'll want to be able to do: handle HA-HDFS while copying - https://docs.cloudera.com/HDPDocuments/HDP3/HDP-3.1.0/administration/content/distcp_between_ha_clusters.html [07:19:14] makes sense yes, some of those should already be there in theory [07:19:56] ah nice, one test datanode wasn't working, now space is ~42T with replication 2 [07:28:02] Ok I have a working distcp command :) [07:30:32] niceeee [08:09:45] 10Analytics, 10Operations, 10SRE-Access-Requests: Renable SSH access for Lex Nasser, analytics intern - https://phabricator.wikimedia.org/T265071 (10Kormat) 05Open→03Resolved a:03Kormat Sounds like this is complete, so resolving. [08:19:50] 10Analytics, 10Operations, 10SRE-Access-Requests: Renable SSH access for Lex Nasser, analytics intern - https://phabricator.wikimedia.org/T265071 (10elukey) Just executed: ` elukey@krb1001:~$ sudo manage_principals.py create lexnasser --email_address=lexnasser@icloud.com Principal successfully created. Make... [08:27:30] * elukey bbiab! [08:55:14] 10Analytics-Clusters, 10Analytics-Kanban, 10Patch-For-Review: Create the new Hadoop test cluster - https://phabricator.wikimedia.org/T255139 (10elukey) The basic test cluster is set up: an-master100[1,2] - Hadoop masters an-coord1001 - test coordinator an-worker100[1-3] - workers The total HDFS space is ar... [09:27:37] 10Analytics-Clusters, 10Operations, 10ops-eqiad, 10User-Elukey: replace onboard NIC in kafka-jumbo100[1-6] - https://phabricator.wikimedia.org/T236327 (10elukey) a:05Jclark-ctr→03Cmjohnson [09:28:14] 10Analytics-Clusters, 10Operations, 10ops-eqiad, 10User-Elukey: replace onboard NIC in kafka-jumbo100[1-6] - https://phabricator.wikimedia.org/T236327 (10elukey) @Cmjohnson if you have some time during the next days can we swap the NIC on one node only? (to verify the procedure and make sure that the NICs...
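[Editor's note] The linked Cloudera page on distcp between HA clusters boils down to making each cluster's client configuration aware of both HA nameservices, then addressing the copy by nameservice ID rather than by a single NameNode host (so a NameNode failover mid-copy does not break the job). A sketch, where the nameservice IDs and paths are illustrative, not the real WMF values:

```shell
# Illustrative nameservice IDs ("prod-ns", "test-ns") and paths.
# Both nameservices must be declared in the client's hdfs-site.xml:
#   dfs.nameservices, dfs.ha.namenodes.<ns>,
#   dfs.namenode.rpc-address.<ns>.<nn>,
#   dfs.client.failover.proxy.provider.<ns>
hadoop distcp \
  hdfs://prod-ns/wmf/data/example \
  hdfs://test-ns/wmf/data/example
```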
[09:34:58] 10Analytics-Clusters: Create a temporary hadoop backup cluster - https://phabricator.wikimedia.org/T260411 (10elukey) [09:35:00] 10Analytics-Clusters, 10Analytics-Kanban, 10User-Elukey: Create temporary cluster to hold a copy of data for backup purposes - https://phabricator.wikimedia.org/T263814 (10elukey) [09:45:55] 10Analytics: Configure Yarn to be able to locate nodes with a GPU - https://phabricator.wikimedia.org/T264401 (10elukey) I found https://www.ibm.com/support/pages/node/6260093 that says: > Recommended versions > The YARN node labels feature was introduced in Apache Hadoop 2.6, but it’s not mature in the first o... [09:50:48] very nice https://github.com/criteo/tf-yarn#running-on-gpu [09:52:28] joal: one thing that I see in --^ is that people also have a separate yarn queue for hosts with GPUs, could be interesting [09:53:04] elukey: interesting! [09:53:20] the labels that I have in mind are related to GPUs and 10G NICs, but IIUC from an IBM link hadoop 2.6 is no bueno [09:53:47] elukey: We'll soon have 2.7, and even 2.9 IIRC, no? [09:54:06] (10G NICs since some nodes will stay on 1G even after the refreshes etc.., so it would be nice to have, say, the heavy spark jobs only running on 10G) [09:54:16] 2.8.5 as first step, then 2.10 [09:55:47] so yes definitely better support [09:56:08] one thing that I am wondering is if, at the end, we'll really need native yarn support for rocm or not [09:56:33] or if node labels, in practice, will solve our issues [09:57:10] (rocm is not really nice now with concurrency on the gpu, but it should hopefully improve in the future) [09:57:30] and also, how does this work relate with the new ML platform?
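[Editor's note] The node-label approach discussed above is managed with `yarn rmadmin`; a sketch, where the label and host names are illustrative:

```shell
# Illustrative label/host names. Requires yarn.node-labels.enabled=true
# (and a label store configured) in yarn-site.xml.
yarn rmadmin -addToClusterNodeLabels "gpu(exclusive=true)"
yarn rmadmin -replaceLabelsOnNode "an-worker1100=gpu"
# A queue is then granted access to the label in capacity-scheduler.xml via
# yarn.scheduler.capacity.<queue-path>.accessible-node-labels=gpu; the
# "one job at a time" policy mentioned below can be crudely approximated
# with yarn.scheduler.capacity.<queue-path>.maximum-applications=1.
```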
[10:01:42] elukey: My understanding is that the new ML platform will use k8s as a scheduler through kubeflow, so probably no need for support from us [10:02:14] elukey: About rocm support in YARN, if we don't get it it means we need to enforce a single job running in the GPU queue to prevent conflicts [10:03:09] joal: but if say a job scheduled on a GPU that is busy can "wait" for resources to be freed, it might also be fine to schedule multiple jobs on the same nodes [10:03:33] OR we could simply create a dedicated queue with GPU labels and one job at a time [10:03:53] until we have a better solution I mean [10:04:20] elukey: jobs waiting for GPU to be available is exactly what yarn-first-class-support is about [10:04:27] joal: re: ML platform - I get the k8s part, what I mean is the role of the hadoop cluster to train models etc.. [10:04:58] so for now it'll need to be labels (for placement) and queue (for a single job at a time, even if not all GPUs are used - what a shame) [10:05:22] elukey: As I understand models are trained on k8s, not on hadoop [10:05:29] what I mean is that even if two tf jobs are running on the same GPU, maybe it is fine since tf will wait by itself [10:06:15] elukey: If jobs behave correctly through inner APIs when run concurrently, why not allow multiple jobs at a time [10:06:48] it is a big if, needs to be tested etc.. [10:06:55] yup [10:08:19] about training - there might be people that will need to train on gpus no?
Otherwise I don't explain why we are working on GPU Nodes :D [10:08:42] this is why I am asking what the relationship will be between us and the ML infra [10:09:01] elukey: I assume training on GPUs will also be done through kubeflow - And I think we should postpone actual work on GPUs on yarn [10:09:15] elukey: let's ask nuria while she's still here :) [10:09:51] I am very confused, but I think that eventually we'll have two ways of training and people will choose depending on their needs [10:10:25] ok I'll stop, thanks for the brainbounce :) [10:10:41] hm - maybe - it seems like effort-duplication to me, but eh, who knows [10:12:45] klausman: good morning :) let's have a chat about --^ during the next days, so we have a clear path in mind [10:13:21] yep [10:13:30] also, good morning [10:13:49] Hi klausman :) [10:52:04] (03PS1) 10Lucas Werkmeister (WMDE): Remove terms_by_language script [analytics/wmde/scripts] - 10https://gerrit.wikimedia.org/r/633515 (https://phabricator.wikimedia.org/T265272) [10:53:23] (03Abandoned) 10Lucas Werkmeister (WMDE): Fix and reenable terms_by_language script [analytics/wmde/scripts] - 10https://gerrit.wikimedia.org/r/603557 (https://phabricator.wikimedia.org/T154601) (owner: 10Lucas Werkmeister (WMDE)) [11:23:36] !log remove analytics-meta lvm backup settings from an-master1002 [11:23:38] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [11:25:22] elukey: Would you have some time today to try druid configs for nested-group-by query? [11:27:53] sure! [11:28:04] I am going afk from ~14->15:30 [11:28:08] elukey: let me know what moment is best for you [11:28:14] ok :) [11:28:28] do we batcave now to prepare? [11:29:08] I am currently finishing one thing before leaving, ok if we do at 15:30? [11:29:19] Actually, I wonder if I shouldn't do more testing on my own before involving you - I have ideas :) [11:29:45] elukey: let's do tomorrow - it gives me time to test before [11:31:55] sure!
[11:31:57] anytime [11:32:51] !log remove analytics-meta lvm backup settings from an-coord1001 [11:32:53] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [11:33:29] elukey: the new druid UI is so much easier - this is really great [11:33:33] \o/ [11:34:14] there are a couple of things that we still didn't do [11:34:27] 1) fold the overlord into the coordinator (should be possible now) [11:34:48] 2) move middle manager to multi-thread (vs multi process via peons) [11:35:16] ack - 1) allows for one less component? [11:36:24] !log Clean druid test-datasources [11:36:25] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [11:38:39] joal: yes exactly, we'd remove the need for the overlord as standalone daemon [11:38:49] it doesn't use a lot of space etc.. atm so not a big deal [11:38:51] nice elukey [11:39:03] elukey: 1 less daemon to monitor is no small thing :) [11:39:31] similarly, multi-thread middle-manager instead of multi-process peons seems nice [11:43:35] GoranSM: Hello - I have a vague memory of possibly you owning druid datasources (test_gsc_all and test_gsc_rich) - Am I right? [11:47:34] going afk! ttl [11:47:38] bye elukey [11:57:51] (03CR) 10Tobias Andersson: [C: 03+1] Remove terms_by_language script [analytics/wmde/scripts] - 10https://gerrit.wikimedia.org/r/633515 (https://phabricator.wikimedia.org/T265272) (owner: 10Lucas Werkmeister (WMDE)) [12:18:37] joal: No, my friend, I own no Druid sources.
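[Editor's note] Point 1) above (folding the Overlord into the Coordinator) is a Druid runtime property on the Coordinator; once enabled, the standalone Overlord daemon is no longer needed. A sketch of the relevant config, assuming a reasonably recent Druid release:

```properties
# coordinator runtime.properties: run the Overlord inside the Coordinator
druid.coordinator.asOverlord.enabled=true
druid.coordinator.asOverlord.overlordService=druid/overlord
```

For point 2), the multi-threaded alternative to MiddleManager-plus-peons is Druid's Indexer process, which runs ingestion tasks as threads in a single JVM rather than as forked peon processes.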
[12:19:16] Thanks GoranSM :) I'm gonna continue my research of owners [12:54:02] * klausman out for lunch and groceries [13:19:33] 10Analytics-Radar, 10ChangeProp, 10Event-Platform, 10WMF-JobQueue, and 2 others: Better way to pause writes on elasticsearch - https://phabricator.wikimedia.org/T230730 (10Gehel) [13:46:48] joal: if you need a brainbounce for anything I am back [13:56:21] also I am more and more interested in https://issues.apache.org/jira/browse/CALCITE-4034 [14:01:29] 10Analytics, 10EventStreams: EventStreams socket stays connected without any traffic incoming - https://phabricator.wikimedia.org/T250912 (10Ikkingjinnammebetinke) I believe it is solved for my case. I ditched Java's internal HTTP handling and replaced it with our internal utility using Apache HttpClient. Up u... [14:26:43] (03PS2) 10Tobias Andersson: Remove terms_by_language script [analytics/wmde/scripts] - 10https://gerrit.wikimedia.org/r/633515 (https://phabricator.wikimedia.org/T265272) (owner: 10Lucas Werkmeister (WMDE)) [14:47:13] (03CR) 10Lucas Werkmeister (WMDE): [C: 03+2] Remove terms_by_language script [analytics/wmde/scripts] - 10https://gerrit.wikimedia.org/r/633515 (https://phabricator.wikimedia.org/T265272) [14:47:26] (03PS1) 10Lucas Werkmeister (WMDE): Remove terms_by_language script [analytics/wmde/scripts] - 10https://gerrit.wikimedia.org/r/633488 (https://phabricator.wikimedia.org/T265272) [14:47:46] (03Merged) 10jenkins-bot: Remove terms_by_language script [analytics/wmde/scripts] - 10https://gerrit.wikimedia.org/r/633515 (https://phabricator.wikimedia.org/T265272) (owner: 10Lucas Werkmeister (WMDE)) [14:48:03] (03CR) 10Lucas Werkmeister (WMDE): "I’m not sure who even has +2 rights on the deployment branch… at least Amir does, I think."
[analytics/wmde/scripts] - 10https://gerrit.wikimedia.org/r/633488 (https://phabricator.wikimedia.org/T265272) (owner: 10Lucas Werkmeister (WMDE)) [14:50:15] (03CR) 10Ladsgroup: [C: 03+2] Remove terms_by_language script [analytics/wmde/scripts] - 10https://gerrit.wikimedia.org/r/633488 (https://phabricator.wikimedia.org/T265272) (owner: 10Lucas Werkmeister (WMDE)) [14:50:44] (03Merged) 10jenkins-bot: Remove terms_by_language script [analytics/wmde/scripts] - 10https://gerrit.wikimedia.org/r/633488 (https://phabricator.wikimedia.org/T265272) (owner: 10Lucas Werkmeister (WMDE)) [15:03:24] ping mforns ? [15:03:25] mforns: standuuupp [15:27:50] 10Analytics-Clusters, 10Operations, 10Traffic: varnishkafka 1.1.0 CPU usage increase - https://phabricator.wikimedia.org/T264074 (10ema) >>! In T264074#6507717, @elukey wrote: > Most of the usage seems to be VUT related, especially for `fxstatat64` (no idea where it is used). You are indeed correct. The n... [15:45:41] 10Analytics-Radar, 10Wikidata, 10Wikidata-Query-Service, 10Discovery-Search (Current work): PoC on anomaly detection with Flink - https://phabricator.wikimedia.org/T262942 (10Ottomata) Hm, I think we need to find a use case that can be done using Stream SQL repl. I don't think SRE will deploy a Java app e... [15:56:18] 10Analytics, 10Analytics-Kanban, 10Operations, 10Traffic: ~1 request/minute to intake-logging.wikimedia.org times out at the traffic/service interface - https://phabricator.wikimedia.org/T264021 (10Ottomata) a:03Ottomata Interesting. So we don't know exactly where the timeout is occurring? Assigning to... [16:00:11] ottomata: o/ [16:01:13] yoeee [16:01:40] how are things??? All good? [16:04:45] 10Analytics-Clusters, 10Operations, 10Traffic: varnishkafka 1.1.0 CPU usage increase - https://phabricator.wikimedia.org/T264074 (10elukey) Thanks Ema, really great analysis! 
I am wondering if we could quickly test how varnishncsa behaves when we pass `-q`, that seems to be the big difference between the tw... [16:10:34] elukey: ya! kind of a holiday today i guess? [16:10:38] i'm working through emails [16:10:44] trying to catch up but woweeeee [16:15:31] yep I figured! [16:18:03] going to log off earlier today, ttl! [16:19:27] laters! [17:26:24] elukey: on GPUs, that is correct, all training will happen (i think) on k8 [17:27:01] elukey: now, the 1st iteration of that platform (per chris' plan) is about serving models rather than training (cc klausman for correction) [17:28:43] elukey: the efforts towards GPU in debian should continue, cause we will continue to use those more and more [17:30:31] 10Analytics, 10Analytics-Kanban, 10Analytics-Wikistats, 10Patch-For-Review: Combine filters and splits on wikistats UI - https://phabricator.wikimedia.org/T249758 (10Nuria) ping on announcement e-mail to wikitech-l (cc @fdans ) [17:30:42] 10Analytics, 10Analytics-Kanban: Analytics Ops Technical Debt - https://phabricator.wikimedia.org/T240437 (10Nuria) [17:30:44] 10Analytics-Clusters, 10Analytics-Kanban, 10Patch-For-Review, 10User-Elukey: Repurpose db1108 as generic Analytics db replica - https://phabricator.wikimedia.org/T234826 (10Nuria) 05Open→03Resolved [17:30:46] 10Analytics-EventLogging, 10Analytics-Kanban: Sunset MySQL data store for eventlogging - https://phabricator.wikimedia.org/T159170 (10Nuria) [17:36:48] 10Analytics-Clusters, 10Operations, 10Traffic: varnishkafka 1.1.0 CPU usage increase - https://phabricator.wikimedia.org/T264074 (10ema) >>! In T264074#6536967, @elukey wrote: > I am wondering if we could quickly test how varnishncsa behaves when we pass `-q`, that seems to be the big difference between the... [18:16:05] nuria: hola! Yes we followed up a bit during standup, my only fear is having GPUs sitting on hadoop workers doing nothing because people train models on k8s (with CPU only).
In that case, in the future we'll just repurpose those nodes if needed [18:16:42] elukey: i think in the near term all training will happen on stats machines and serving will be on kubeflow /k8 [18:17:27] nuria: yes yes makes sense, I think that it is possible to work with Miriam to try tensorflow on yarn though, with node labeling [18:17:29] elukey: so , i think you are correct that gpu work in hadoop can take a second priority [18:17:51] elukey: as an experiment is totally worth it ya [18:18:23] elukey: the serving layer (even on k8) will need gpus and for that the work on debian is key [18:21:00] nuria: ack! [18:36:34] (03CR) 10Nuria: [C: 03+2] Add DesktopWebUIActionsTracking fields to eventlogging allowlist [analytics/refinery] - 10https://gerrit.wikimedia.org/r/631988 (https://phabricator.wikimedia.org/T263143) (owner: 10MNeisler) [18:36:47] (03CR) 10Nuria: [V: 03+2 C: 03+2] Add DesktopWebUIActionsTracking fields to eventlogging allowlist [analytics/refinery] - 10https://gerrit.wikimedia.org/r/631988 (https://phabricator.wikimedia.org/T263143) (owner: 10MNeisler) [18:44:40] 10Analytics-Clusters, 10Operations, 10Traffic: varnishkafka 1.1.0 CPU usage increase - https://phabricator.wikimedia.org/T264074 (10elukey) There is another test that we could do, namely grouping. As far as I can see in the varnishkafka change, the [[ https://github.com/wikimedia/varnishkafka/commit/b0675e80... [18:52:18] 10Analytics, 10Analytics-Kanban, 10Product-Analytics: Add data quality alarm for mobile-app data - https://phabricator.wikimedia.org/T257692 (10Nuria) [18:53:28] 10Analytics, 10Analytics-Kanban, 10Product-Analytics: Add data quality alarm for mobile-app data - https://phabricator.wikimedia.org/T257692 (10Nuria) Sum up: The timeseries of entropy of os_family per access_method works well as a data quality timeseries for 'mobile web' (see green line in plot above) an...
[19:11:51] 10Analytics-Radar, 10Product-Analytics: Add DesktopWebUIActionsTracking fields to the allowlist - https://phabricator.wikimedia.org/T263143 (10MNeisler) [21:33:57] (03PS1) 10Nuria: [WIP] Adding quality alarms for mobile app data [analytics/refinery] - 10https://gerrit.wikimedia.org/r/633579 (https://phabricator.wikimedia.org/T257692) [21:34:33] (03CR) 10Nuria: "Still testing but please take a look for naming." [analytics/refinery] - 10https://gerrit.wikimedia.org/r/633579 (https://phabricator.wikimedia.org/T257692) (owner: 10Nuria)