[00:54:59] PROBLEM - Check the last execution of monitor_refine_eventlogging_legacy_failure_flags on an-launcher1002 is CRITICAL: CRITICAL: Status of the systemd unit monitor_refine_eventlogging_legacy_failure_flags https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[08:09:27] * elukey bbiab
[08:31:39] I am starting https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Test
[08:55:00] Good morning
[08:55:24] bonjour! How are things?
[08:55:39] All good elukey - ready to stay home :)
[08:55:44] :(
[08:56:21] elukey: we have friends that were in the middle of a house-move - I went and helped yesterday in a rush, and we are caring for their kid today :)
[08:56:37] elukey: nothing major, but little adjustments on schedule
[08:56:55] elukey: AND enough stuff and tools to keep busy while at home :D
[08:57:27] :)
[09:12:24] elukey: how is it going for you?
[09:13:27] all good! not sure if you are busy but we can have a quick coffee :)
[09:21:38] Ready for coffee I am :)
[10:05:25] Morning!
[10:05:34] Hi klausman :)
[10:05:43] Plan: install 5.8 on 1005, reboot, push rocm38 to it, watch fireworks
[10:05:52] \o/ fireworks :)
[10:10:08] :)
[10:15:00] Rebooting
[10:19:00] stat1005 ~ $ uname -a
[10:19:02] Linux stat1005 5.8.0-0.bpo.2-amd64 #1 SMP Debian 5.8.10-1~bpo10+1 (2020-09-26) x86_64 GNU/Linux
[10:19:04] wheee
[10:19:07] yessss
[10:19:32] The 3.3 rocm-smi works fine and reports good numbers
[10:19:43] Now to get 3.8 onto the machine
[10:23:46] I wish there was a way to disable only *automatic* puppet runs, but still be able to trigger a manual/interactive one
[10:25:30] mmm I am not getting the use case
[10:25:51] is it to avoid typing --enable before a manual puppet run? :D
[10:26:11] Well, and disabling it immediately afterwards as well
[10:26:42] Or is the cmdline --enable only for that run?
[10:26:59] if you need to change other things, but usually (at least for me) it is: disable, test, enable+run
[10:27:05] nono it is global
[10:27:46] Fun fact: rock-dkms 3.8 fails with kernel 5.8 as well (though differently)
[10:33:13] elukey: I'll have to iteratively update the 3.8 packages grabbed by our apt reprepro, there are still missing deps from upstream. Patch incoming
[10:33:41] ack
[10:46:10] elukey: do we have docs somewhere on how to run a Tensorflow GPU hello-world from scratch?
[10:47:52] https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/AMD_GPU#Configure_your_Tensorflow_script
[10:48:27] mmmm no, sorry, the bottom of
[10:48:27] https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/AMD_GPU#Upgrade_the_Debian_packages
[10:48:42] this is what I usually use
[10:50:06] roger
[11:03:40] Hmm. Does virtualenv honor http_proxy et al?
[11:04:01] ah, it does
[11:08:02] AttributeError: module 'tensorflow' has no attribute 'Session'
[11:08:12] elukey: ^^^ Does that ring a bell?
[11:10:00] sort of, I think it may be a tf 2 vs tf 1 coding issue
[11:10:14] yeah, there is a tf.compat.v1.Session
[11:10:47] Same for ConfigProto
[11:11:02] can I try to run a script?
[11:11:02] But even with that, I get:
[11:11:04] RuntimeError: The Session graph is empty. Add operations to the graph before calling run().
[11:11:08] Sure
[11:12:48] Ah, one needs tf.compat.v1.disable_eager_execution() with the example script
[11:13:07] I have test_tf2.py in my home, that fails for
[11:13:14] ImportError: libhip_hcc.so.3: cannot open shared object file: No such file or directory
[11:13:29] Checking...
[11:13:44] trying also to upgrade tf-rocm
[11:14:53] yep works
[11:14:59] with tf-rocm 2.3.1
[11:15:00] the last one
[11:15:02] Excellent
[11:15:27] So should we tell the users to give the new setup a whirl and report any issues?
[11:15:55] Oh, and we need to change the puppet role to not install rock-dkms anymore for 3.8
[11:16:05] ah yes I was about to ask
[11:16:19] +1
[11:16:22] I think I can change the firmware special-casing to make that happen
[11:16:49] what I'd also highlight in docs +email is that tensorflow-rocm needed is 2.3.1 (the last upstream)
[11:16:55] Ack.
[11:17:00] I had 2.1.1 and it didn't work
[11:20:13] I also quickly tested spark2-shell and it doesn't break
[11:20:59] there are other things to test like jupyter etc.., just to make sure that the new kernel is ok, but we can do it later on
[11:21:39] let's also ping miriam
[11:21:59] helloo
[11:22:13] miriam: ciao! we have updated the AMD drivers on stat1005, plus the kernel, and it looks good, but we need tf-rocm 2.3.1
[11:22:52] Tobias is sending an email with all the details
[11:23:04] if you see anything weird let us know :)
[11:23:21] ciao elukey and klausman! Thanks for this! should I update this https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/AMD_GPU
[11:23:32] Lovely. I just lost v4 connectivity, but v6 still works :D
[11:23:46] elukey: do you want me to do tests?
[11:24:16] miriam: if you have a moment later on yes, but nothing urgent :)
[11:24:21] we'll also upgrade the page
[11:24:51] brilliant, thanks!! I'll do it later, thanks a lot, you guys rock!!
[11:24:57] thankssss
[11:25:39] fdans would say "we ROCm!"
[11:25:41] * elukey hides
[11:26:09] ahahaha
[11:27:53] elukey: Can you take a look at the CI failure in https://gerrit.wikimedia.org/r/c/operations/puppet/+/637682? I find the error very puzzling
[11:28:55] ah, nvm, I think I may have spotted the mistake
[11:29:52] (<< does not work the same way as + does when concatenating lists [a,b] + [c,d] is [a,b,c,d], but [a,b] << [c,d] is [a,b,[c,d]]; I had remembered it the other way around)
[11:30:14] aaand still the same error.
[11:30:36] oh. silly me.
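The `+` versus `<<` difference described at 11:29:52 can be reproduced in a few lines. This is only a sketch in plain Ruby: the log does not say whether the failing patch was Ruby or Puppet DSL, so the language choice is an assumption, and the helper names `concatenate` and `append_as_element` are hypothetical, not taken from the patch.

```ruby
# Hypothetical helper names; only the + and << operators come from the chat.

# + returns a new array holding the elements of both operands.
def concatenate(left, right)
  left + right
end

# << pushes its right operand as ONE element, so an array argument ends up nested.
# dup keeps the caller's array unmodified, since << mutates its receiver.
def append_as_element(left, right)
  left.dup << right
end

p concatenate(%w[a b], %w[c d])        # flat, four elements
p append_as_element(%w[a b], %w[c d])  # three elements, the last one an array
```

When the intent is to merge two lists, `+` (or `concat`) is the operator to reach for; `<<` is for adding a single item.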
[11:30:52] note to self: when you rename a variable, do it *everywhere*
[11:31:20] :)
[11:31:41] But the < And Jerkins is happy
[11:34:52] going afk for lunch + errand, ttl!
[11:35:19] later!
[12:24:18] Hi, I have a question around using mysql on stat-machines (stat1008). I would like to store some processed data as mysql-databases there. I saw documentation around accessing/reading the mariadb-replicas, but don't know how I would create my own database (or whether that's possible). Any pointers? Background: storing a few look-up tables for link-recommendation in mysql-tables instead of sqlite
[12:24:18] https://phabricator.wikimedia.org/T265610 thanks for any help
[13:33:37] mgerlach: I've heard great things about DuckDB https://duckdb.org/ you may want to check it out
[13:43:40] mgerlach: alternatively there's a 'staging' database that appears to be exactly what you're looking for (`stat1008$ analytics-mysql staging`). It looks like Morten/nettrom's got quite a few tables in there, you may want to ask him about the process.
[13:46:38] if you come across documentation for it on wikitech please share because I couldn't find it
[13:56:12] (CR) Joal: "One comment inline about perf, and the ask of moving this file in the refine folder please :)" (2 comments) [analytics/refinery/source] - https://gerrit.wikimedia.org/r/634328 (https://phabricator.wikimedia.org/T254332) (owner: Mforns)
[14:02:39] bearloga: hi! I think that there should be a reference on data access, the staging db is a scratchpad basically
[14:08:12] bearloga,mgerlach - https://wikitech.wikimedia.org/wiki/Analytics/Systems/MariaDB
[14:08:20] if something is missing let's clarify!
[14:10:43] elukey: ah, nevermind. kostajh just clarified that with multiple tables per wiki there would eventually be 100+ tables required for the link project mgerlach is working on so it'd be better for them to have their own databases
[14:11:55] yes yes staging is meant to be non-production stuff
[14:12:14] T266826 is also relevant to the discussion, fwiw
[14:12:14] T266826: Add Link engineering: Pipeline for moving MySQL database(s) from stats1008 to production MySQL server - https://phabricator.wikimedia.org/T266826
[14:16:06] bearloga: thanks for the pointers.
[15:04:56] who's feeling up for a social hour today?
[15:05:42] milimetric: sure :]
[15:05:49] milimetric: I'm up!
[15:07:16] ok :) does 16:00 UTC work for everyone (in about an hour) or is earlier better?
[15:07:25] (heh, as in now)
[15:07:49] milimetric: now would be fine for me, or later too
[15:08:06] a-team social hour in the batcave, all who want to join are welcome! :)
[15:37:38] Analytics-Radar, Event-Platform, Instrument-ClientError: Bot throwing large amount of errors - https://phabricator.wikimedia.org/T264453 (Jdlrobson)
[15:40:17] Analytics-Radar, Event-Platform, Instrument-ClientError: Bot throwing large amount of errors - https://phabricator.wikimedia.org/T264453 (Jdlrobson) p:Triage→High So this is now happening at an even higher frequency than before. We are dropping IPs from these soon so I'm not sure what the...
[16:33:59] PROBLEM - Check the last execution of produce_canary_events on an-launcher1002 is CRITICAL: CRITICAL: Status of the systemd unit produce_canary_events https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[16:40:07] ok so kafka-jumbo1006 is back up recovering, before getting 10g cards we need to install a package
[16:40:22] oh ya?
[16:41:54] Analytics-Clusters, Operations, ops-eqiad, Patch-For-Review, User-Elukey: replace onboard NIC in kafka-jumbo100[1-6] - https://phabricator.wikimedia.org/T236327 (elukey) We had to roll back the NIC on 1006, we need to install `firmware-bnx2x` on all nodes before doing any work (checked with Fa...
[16:43:47] ottomata: yes I commented in --^, basically the driver was not there and without network connectivity we can't really apt-get install :D
[16:44:16] I thought it was already in the kernel's drivers since we use those nics in other places, but IIUC the drivers are added at d-i time if needed
[16:44:38] next time it should be fine
[16:46:02] ha! good to know
[16:46:11] PROBLEM - Kafka MirrorMaker main-eqiad_to_jumbo-eqiad max lag in last 10 minutes on alert1001 is CRITICAL: 2.271e+05 gt 1e+05 https://wikitech.wikimedia.org/wiki/Kafka/Administration https://grafana.wikimedia.org/dashboard/db/kafka-mirrormaker?var-datasource=eqiad+prometheus/ops&var-lag_datasource=eqiad+prometheus/ops&var-mirror_name=main-eqiad_to_jumbo-eqiad
[16:48:03] razzi: if you want to watch metrics in the "Kafka" panel of https://grafana.wikimedia.org/d/000000027/kafka?orgId=1&from=now-3h&to=now
[16:48:08] there are interesting ones
[16:48:22] (kafka-jumbo1006 was down for a bit for hw maintenance, and now it is recovering)
[16:54:29] RECOVERY - Kafka MirrorMaker main-eqiad_to_jumbo-eqiad max lag in last 10 minutes on alert1001 is OK: (C)1e+05 gt (W)1e+04 gt 745 https://wikitech.wikimedia.org/wiki/Kafka/Administration https://grafana.wikimedia.org/dashboard/db/kafka-mirrormaker?var-datasource=eqiad+prometheus/ops&var-lag_datasource=eqiad+prometheus/ops&var-mirror_name=main-eqiad_to_jumbo-eqiad
[16:56:25] Ah yes, and I can see the usage is low on kafka-jumbo100[7,8,9] which makes it make sense for us to rebalance partitions with https://phabricator.wikimedia.org/T255973
[16:58:20] yep, also when a broker is down the others take the load
[16:59:45] the main issue atm is that kafka-jumbo1006 is still not ingesting the same share of bytes/messages as the other nodes
[17:00:16] in theory, kafka takes care of it by itself after some minutes, in practice it might need a little encouragement
[17:01:22] by encouragement you mean leader election?
[17:01:30] yep, I just did one
[17:01:41] !log kafka preferred-replica-election on jumbo1001
[17:01:43] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[17:02:17] there was an imbalance count for a broker
[17:02:53] Analytics, Analytics-Wikistats, Inuka-Team, Language-strategy, and 2 others: Have a way to show the most popular pages per country - https://phabricator.wikimedia.org/T207171 (Isaac) > I agree that threshold of 100s pageviews seems small for privacy. I agree if we're delivering raw data and using...
[17:04:04] yep it worked
[17:09:01] razzi: you can see how the graphs are changing, basically because there was a re-assignment of what brokers are leaders for partitions (the ones getting traffic from producers, so not simple replicas)
[17:09:06] Analytics, Operations: Augment NEL reports with a computed timestamp-of-generation - https://phabricator.wikimedia.org/T266886 (CDanis)
[17:12:48] razzi: if it is not clear (like it was for me when I started, kafka may be a little dense to grasp), we have a Kafka master in the team that knows all the details :) (Andrew)
[17:13:38] all metrics look good now, going to log off :)
[17:13:49] have a good weekend people!
[17:37:49] RECOVERY - Check the last execution of produce_canary_events on an-launcher1002 is OK: OK: Status of the systemd unit produce_canary_events https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[18:01:58] Analytics, Analytics-Wikistats, Inuka-Team, Language-strategy, and 2 others: Have a way to show the most popular pages per country - https://phabricator.wikimedia.org/T207171 (mforns) Awesome analysis @lexnasser! I was thinking about using the threshold on pageviews vs. on unique readers, as @Is...
[20:33:11] Analytics, Better Use Of Data, Event-Platform, Product-Infrastructure-Team-Backlog, Epic: Event Platform Client Libraries - https://phabricator.wikimedia.org/T228175 (Mholloway)
[20:35:49] PROBLEM - Check the last execution of produce_canary_events on an-launcher1002 is CRITICAL: CRITICAL: Status of the systemd unit produce_canary_events https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[21:16:57] Analytics-Radar, Wikipedia-iOS-App-Backlog, Reading Epics (Analytics), Spike: Research and define initial technical requirements for app analytics - https://phabricator.wikimedia.org/T164801 (Mholloway) For posterity, linking https://gerrit.wikimedia.org/r/c/apps/android/wikipedia/+/359008 which...
[21:37:01] RECOVERY - Check the last execution of produce_canary_events on an-launcher1002 is OK: OK: Status of the systemd unit produce_canary_events https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[22:23:03] Analytics, Operations, SRE-Access-Requests: Requesting access to Production Shell Access (analytics-privatedata-users) for Rmaung - https://phabricator.wikimedia.org/T266250 (Dzahn) a:Rmaung
[22:24:39] Analytics, Operations, SRE-Access-Requests: Requesting access to production shell groups for JAnstee - https://phabricator.wikimedia.org/T266249 (Dzahn) a:JAnstee_WMF