[00:54:59] <icinga-wm>	 PROBLEM - Check the last execution of monitor_refine_eventlogging_legacy_failure_flags on an-launcher1002 is CRITICAL: CRITICAL: Status of the systemd unit monitor_refine_eventlogging_legacy_failure_flags https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[08:09:27] * elukey bbiab
[08:31:39] <elukey>	 I am starting https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Test
[08:55:00] <joal>	 Good morning
[08:55:24] <elukey>	 bonjour! How are things?
[08:55:39] <joal>	 All good elukey - ready to stay home :)
[08:55:44] <elukey>	 :(
[08:56:21] <joal>	 elukey: we have friends that were in the middle of a house-move - I went and helped yesterday in a rush, and we are looking after their kid today :)
[08:56:37] <joal>	 elukey: nothing major, but little adjustments on schedule
[08:56:55] <joal>	 elukey: AND enough stuff and tools to keep busy while at home :D
[08:57:27] <elukey>	 :)
[09:12:24] <joal>	 elukey: how is it going for you?
[09:13:27] <elukey>	 all good! not sure if you are busy but we can have a quick coffee :)
[09:21:38] <joal>	 Ready for coffee I am :)
[10:05:25] <klausman>	 Morning!
[10:05:34] <joal>	 Hi klausman :)
[10:05:43] <klausman>	 Plan: install 5.8 on 1005, reboot, push rocm38 to it, watch fireworks
[10:05:52] <joal>	 \o/ fireworks :)
[10:10:08] <elukey>	 :)
[10:15:00] <klausman>	 Rebooting
[10:19:00] <klausman>	 stat1005 ~ $ uname -a
[10:19:02] <klausman>	 Linux stat1005 5.8.0-0.bpo.2-amd64 #1 SMP Debian 5.8.10-1~bpo10+1 (2020-09-26) x86_64 GNU/Linux
[10:19:04] <klausman>	 wheee
[10:19:07] <elukey>	 yessss
[10:19:32] <klausman>	 The 3.3 rocm-smi works fine and reports good numbers
[10:19:43] <klausman>	 Now to get 3.8 onto the machine
[10:23:46] <klausman>	 I wish there was a way to disable only *automatic* puppet runs, but still be able to trigger a manual/interactive one
[10:25:30] <elukey>	 mmm I am not getting the use case
[10:25:51] <elukey>	 is it to avoid typing --enable before a manual puppet run? :D
[10:26:11] <klausman>	 Well, and disabling it immediately afterwards as well
[10:26:42] <klausman>	 Or is the cmdline --enable only for that run?
[10:26:59] <elukey>	 if you need to change other things, but usually (at least for me) it is: disable, test, enable+run
[10:27:05] <elukey>	 nono it is global
[10:27:46] <klausman>	 Fun fact: rock-dkms 3.8 fails with kernel 5.8 as well (though differently)
[10:33:13] <klausman>	 elukey: I'll have to iteratively update the 3.8 packages grabbed by our apt reprepro, there are still missing deps from upstream. Patch incoming
[10:33:41] <elukey>	 ack
[10:46:10] <klausman>	 elukey: do we have docs somewhere on how to run a Tensorflow GPU hello-world from scratch?
[10:47:52] <elukey>	 https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/AMD_GPU#Configure_your_Tensorflow_script
[10:48:27] <elukey>	 mmmm no, sorry, the bottom of
[10:48:27] <elukey>	 https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/AMD_GPU#Upgrade_the_Debian_packages
[10:48:42] <elukey>	 this is what I usually use
[10:50:06] <klausman>	 roger
[11:03:40] <klausman>	 Hmm. Does virtualenv honor http_proxy et al?
[11:04:01] <klausman>	 ah, it does
[11:08:02] <klausman>	 AttributeError: module 'tensorflow' has no attribute 'Session'
[11:08:12] <klausman>	 elukey: ^^^ Does that ring a bell?
[11:10:00] <elukey>	 sort of, I think it may be a tf 2 vs tf 1 coding issue
[11:10:14] <klausman>	 yeah, there is a tf.compat.v1.Session
[11:10:47] <klausman>	 Same for ConfigProto
[11:11:02] <elukey>	 can I try to run a script ?
[11:11:02] <klausman>	 But even with that, I get:
[11:11:04] <klausman>	 RuntimeError: The Session graph is empty.  Add operations to the graph before calling run().
[11:11:08] <klausman>	 Sure
[11:12:48] <klausman>	 Ah, one needs tf.compat.v1.disable_eager_execution() with the example script
[11:13:07] <elukey>	 I have test_tf2.py in my home, that fails for
[11:13:14] <elukey>	 ImportError: libhip_hcc.so.3: cannot open shared object file: No such file or directory
[11:13:29] <klausman>	 Checking...
[11:13:44] <elukey>	 trying also to upgrade tf-rocm
[11:14:53] <elukey>	 yep works
[11:14:59] <elukey>	 with tf-rocm 2.3.1
[11:15:00] <elukey>	 the last one
[11:15:02] <klausman>	 Excellent
[11:15:27] <klausman>	 So should we tell the users to give the new setup a whirl and report any issues?
[11:15:55] <klausman>	 Oh, and we need to change the puppet role to not install rock-dkms anymore for 3.8
[11:16:05] <elukey>	 ah yes I was about to ask
[11:16:19] <elukey>	 +1
[11:16:22] <klausman>	 I think I can change the firmware special-casing to make that happen
[11:16:49] <elukey>	 what I'd also highlight in docs +email is that tensorflow-rocm needed is 2.3.1 (the last upstream)
[11:16:55] <klausman>	 Ack.
[11:17:00] <elukey>	 I had 2.1.1 and it didn't work
[11:20:13] <elukey>	 I also quickly tested spark2-shell and it doesn't break
[11:20:59] <elukey>	 there are other things to test like jupyter etc.., just to make sure that the new kernel is ok, but we can do it later on
[11:21:39] <elukey>	 let's also ping miriam 
[11:21:59] <miriam>	 helloo
[11:22:13] <elukey>	 miriam: ciao! we have updated the AMD drivers on stat1005, plus the kernel, and it looks good, but we need tf-rocm 2.3.1
[11:22:52] <elukey>	 Tobias is sending an email with all the details
[11:23:04] <elukey>	 if you see anything weird let us know :)
[11:23:21] <miriam>	 ciao elukey and klausman! Thanks for this! should I update this https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/AMD_GPU 
[11:23:32] <klausman>	 Lovely. I just lost v4 connectivity, but v6 still works :D
[11:23:46] <miriam>	 elukey: do you want me to do tests?
[11:24:16] <elukey>	 miriam: if you have a moment later on yes, but nothing urgent :)
[11:24:21] <elukey>	 we'll also upgrade the page
[11:24:51] <miriam>	 brilliant, thanks!! I'll do it later, thanks a lot, you guys rock!!
[11:24:57] <elukey>	 thankssss
[11:25:39] <elukey>	 fdans would say "we ROCm!" 
[11:25:41] * elukey hides
[11:26:09] <miriam>	 ahahaha
[11:27:53] <klausman>	 elukey: Can you take a look at the CI failure in https://gerrit.wikimedia.org/r/c/operations/puppet/+/637682? I find the error very puzzling
[11:28:55] <klausman>	 ah, nvm, I think I may have spotted the mistake
[11:29:52] <klausman>	 (<< does not work the same way as + does when concatenating lists: [a,b] + [c,d] is [a,b,c,d], but [a,b] << [c,d] is [a,b,[c,d]]; I had remembered it the other way around)
[11:30:14] <klausman>	 aaand still the same error.
[11:30:36] <klausman>	 oh. silly me.
[11:30:52] <klausman>	 note to self: when you rename a variable, do it *everywhere*
[11:31:20] <elukey>	 :)
[11:31:41] <klausman>	 But the <</+ change was needed anyway, so I won that
[11:32:02] <klausman>	 And Jerkins is happy
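The `<<` versus `+` distinction klausman describes at 11:29:52 can be sketched in Python. This is only an analogy: Python lists have no `<<` operator, so `list.append` stands in for it here (an assumption for illustration; the actual patch is Puppet code), and the helper names `concat` and `append_whole` are hypothetical.

```python
def concat(left, right):
    """Like `+`: return a new flat list with the elements of both lists."""
    return left + right


def append_whole(left, right):
    """Like `<<`: return a copy of `left` with `right` added as ONE element."""
    result = list(left)   # copy so the caller's list is not mutated
    result.append(right)  # `right` is nested, not flattened
    return result


flat = concat(["a", "b"], ["c", "d"])
nested = append_whole(["a", "b"], ["c", "d"])
print(flat)
print(nested)
```

The nested result has three elements rather than four, which is the kind of shape mismatch that is easy to miss when the two operators are swapped.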
[11:34:52] <elukey>	 going afk for lunch + errand, ttl!
[11:35:19] <klausman>	 later!
[12:24:18] <mgerlach>	 Hi, I have a question around using mysql on stat-machines (stat1008). I would like to store some processed data as mysql-databases there. I saw documentation around accessing/reading the mariadb-replicas, but don't know how I would create my own database (or whether that's possible). Any pointers? Background: storing a few look-up tables for link-recommendation in mysql-tables instead of sqlite
[12:24:18] <mgerlach>	 https://phabricator.wikimedia.org/T265610 thanks for any help
[13:33:37] <bearloga>	 mgerlach: I've heard great things about DuckDB https://duckdb.org/ you may want to check it out
[13:43:40] <bearloga>	 mgerlach: alternatively there's a 'staging' database that appears to be exactly what you're looking for (`stat1008$ analytics-mysql staging`). It looks like Morten/nettrom's got quite a few tables in there, you may want to ask him about the process.
[13:46:38] <bearloga>	 if you come across documentation for it on wikitech please share because I couldn't find it
[13:56:12] <wikibugs>	 (CR) Joal: "One comment inline about perf, and the ask of moving this file in the refine folder please :)" (2 comments) [analytics/refinery/source] - https://gerrit.wikimedia.org/r/634328 (https://phabricator.wikimedia.org/T254332) (owner: Mforns)
[14:02:39] <elukey>	 bearloga: hi! I think that there should be a reference on data access, the staging db is a scratchpad basically
[14:08:12] <elukey>	 bearloga,mgerlach - https://wikitech.wikimedia.org/wiki/Analytics/Systems/MariaDB
[14:08:20] <elukey>	 if something is missing let's clarify!
[14:10:43] <bearloga>	 elukey: ah, nevermind. kostajh just clarified that with multiple tables per wiki there would eventually be 100+ tables required for the link project mgerlach is working on so it'd be better for them to have their own databases
[14:11:55] <elukey>	 yes yes staging is meant to be non-production stuff
[14:12:14] <kostajh>	 T266826 is also relevant to the discussion, fwiw
[14:12:14] <stashbot>	 T266826: Add Link engineering: Pipeline for moving MySQL database(s) from stats1008 to production MySQL server - https://phabricator.wikimedia.org/T266826
[14:16:06] <mgerlach>	 bearloga: thanks for the pointers.
[15:04:56] <milimetric>	 who's feeling up for a social hour today?
[15:05:42] <mforns>	 milimetric: sure :]
[15:05:49] <fdans>	 milimetric: I'm up!
[15:07:16] <milimetric>	 ok :)  does 16:00 UTC work for everyone (in about an hour) or is earlier better?
[15:07:25] <milimetric>	 (heh, as in now)
[15:07:49] <mforns>	 milimetric: now would be fine for me, or later too
[15:08:06] <milimetric>	 a-team social hour in the batcave, all who want to join are welcome! :)
[15:37:38] <wikibugs>	 Analytics-Radar, Event-Platform, Instrument-ClientError: Bot throwing large amount of errors - https://phabricator.wikimedia.org/T264453 (Jdlrobson)
[15:40:17] <wikibugs>	 Analytics-Radar, Event-Platform, Instrument-ClientError: Bot throwing large amount of errors - https://phabricator.wikimedia.org/T264453 (Jdlrobson) p:Triage→High So this is now happening at an even higher frequency than before. We are dropping IPs from these soon so I'm not sure what the...
[16:33:59] <icinga-wm>	 PROBLEM - Check the last execution of produce_canary_events on an-launcher1002 is CRITICAL: CRITICAL: Status of the systemd unit produce_canary_events https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[16:40:07] <elukey>	 ok so kafka-jumbo1006 is back up recovering, before getting 10g cards we need to install a package
[16:40:22] <ottomata>	 oh ya?
[16:41:54] <wikibugs>	 Analytics-Clusters, Operations, ops-eqiad, Patch-For-Review, User-Elukey: replace onboard NIC in kafka-jumbo100[1-6] - https://phabricator.wikimedia.org/T236327 (elukey) We had to rollback the NIC on 1006, we need to install `firmware-bnx2x` on all nodes before doing any work (checked with Fa...
[16:43:47] <elukey>	 ottomata: yes I commented in --^, basically the driver was not there and without network connectivity we can't really apt-get install :D
[16:44:16] <elukey>	 I thought it was already in the kernel's drivers since we use those nics in other places, but IIUC the drivers are added at d-i time if needed
[16:44:38] <elukey>	 next time it should be fine
[16:46:02] <ottomata>	 ha!  good to know
[16:46:11] <icinga-wm>	 PROBLEM - Kafka MirrorMaker main-eqiad_to_jumbo-eqiad max lag in last 10 minutes on alert1001 is CRITICAL: 2.271e+05 gt 1e+05 https://wikitech.wikimedia.org/wiki/Kafka/Administration https://grafana.wikimedia.org/dashboard/db/kafka-mirrormaker?var-datasource=eqiad+prometheus/ops&var-lag_datasource=eqiad+prometheus/ops&var-mirror_name=main-eqiad_to_jumbo-eqiad
[16:48:03] <elukey>	 razzi: if you want to watch metrics in the "Kafka" panel of https://grafana.wikimedia.org/d/000000027/kafka?orgId=1&from=now-3h&to=now
[16:48:08] <elukey>	 there are interesting ones
[16:48:22] <elukey>	 (kafka-jumbo1006 was down for a bit for hw maintenance, and now it is recovering)
[16:54:29] <icinga-wm>	 RECOVERY - Kafka MirrorMaker main-eqiad_to_jumbo-eqiad max lag in last 10 minutes on alert1001 is OK: (C)1e+05 gt (W)1e+04 gt 745 https://wikitech.wikimedia.org/wiki/Kafka/Administration https://grafana.wikimedia.org/dashboard/db/kafka-mirrormaker?var-datasource=eqiad+prometheus/ops&var-lag_datasource=eqiad+prometheus/ops&var-mirror_name=main-eqiad_to_jumbo-eqiad
[16:56:25] <razzi>	 Ah yes, and I can see the usage is low on kafka-jumbo100[7,8,9] which makes it make sense for us to rebalance partitions with  https://phabricator.wikimedia.org/T255973
[16:58:20] <elukey>	 yep, also when a broker is down the others take the load
[16:59:45] <elukey>	 the main issue atm is that kafka-jumbo1006 is still not ingesting the same share of bytes/messages as the other nodes
[17:00:16] <elukey>	 in theory, kafka takes care of it by itself after some minutes, in practice it might need a little encouragement
[17:01:22] <ottomata>	 by encouragement you mean leader election?
[17:01:30] <elukey>	 yep, I just did one
[17:01:41] <elukey>	 !log kafka preferred-replica-election on jumbo1001
[17:01:43] <stashbot>	 Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[17:02:17] <elukey>	 there was an imbalance count for a broker
[17:02:53] <wikibugs>	 Analytics, Analytics-Wikistats, Inuka-Team, Language-strategy, and 2 others: Have a way to show the most popular pages per country - https://phabricator.wikimedia.org/T207171 (Isaac) > I agree that threshold of 100s pageviews seems small for privacy. I agree if we're delivering raw data and using...
[17:04:04] <elukey>	 yep it worked
[17:09:01] <elukey>	 razzi: you can see how the graphs are changing, basically because there was a re-assignment of what brokers are leaders for partitions (the ones getting traffic from producers, so not simple replicas)
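A rough way to picture what the preferred-replica election does to partition leadership, assuming the usual Kafka convention that the first broker in a partition's replica list is its preferred leader. The partition data and function names below are hypothetical, not taken from kafka-jumbo, and the sketch ignores that a real election only moves leadership to a replica that is in sync.

```python
from collections import Counter

# Hypothetical state: partition -> (replica list, current leader).
# Broker 1006 was down, so leadership for its partitions moved elsewhere.
partitions = {
    "topic-0": ([1006, 1001, 1002], 1001),
    "topic-1": ([1001, 1002, 1006], 1001),
    "topic-2": ([1006, 1002, 1001], 1002),
    "topic-3": ([1002, 1006, 1001], 1002),
}


def imbalanced(parts):
    """Partitions whose current leader is not the preferred (first) replica."""
    return sorted(p for p, (replicas, leader) in parts.items() if leader != replicas[0])


def elect_preferred(parts):
    """Move leadership of every partition back to its preferred replica."""
    return {p: (replicas, replicas[0]) for p, (replicas, _) in parts.items()}


def leader_counts(parts):
    """Number of partitions led by each broker."""
    return Counter(leader for _, leader in parts.values())


before = leader_counts(partitions)
after = leader_counts(elect_preferred(partitions))
print("imbalanced:", imbalanced(partitions))
print("leaders before:", dict(before))
print("leaders after:", dict(after))
```

In this toy state the recovered broker leads nothing until the election, which matches the symptom of it not ingesting the same share of producer traffic as the other nodes.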
[17:09:06] <wikibugs>	 Analytics, Operations: Augment NEL reports with a computed timestamp-of-generation - https://phabricator.wikimedia.org/T266886 (CDanis)
[17:12:48] <elukey>	 razzi: if it is not clear (like it was for me when I started, kafka may be a little dense to grasp), we have a Kafka master in the team that knows all the details :) (Andrew)
[17:13:38] <elukey>	 all metrics look good now, going to log off :)
[17:13:49] <elukey>	 have a good weekend people!
[17:37:49] <icinga-wm>	 RECOVERY - Check the last execution of produce_canary_events on an-launcher1002 is OK: OK: Status of the systemd unit produce_canary_events https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[18:01:58] <wikibugs>	 Analytics, Analytics-Wikistats, Inuka-Team, Language-strategy, and 2 others: Have a way to show the most popular pages per country - https://phabricator.wikimedia.org/T207171 (mforns) Awesome analysis @lexnasser!  I was thinking about using the threshold on pageviews vs. on unique readers, as @Is...
[20:33:11] <wikibugs>	 Analytics, Better Use Of Data, Event-Platform, Product-Infrastructure-Team-Backlog, Epic: Event Platform Client Libraries - https://phabricator.wikimedia.org/T228175 (Mholloway)
[20:35:49] <icinga-wm>	 PROBLEM - Check the last execution of produce_canary_events on an-launcher1002 is CRITICAL: CRITICAL: Status of the systemd unit produce_canary_events https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[21:16:57] <wikibugs>	 Analytics-Radar, Wikipedia-iOS-App-Backlog, Reading Epics (Analytics), Spike: Research and define initial technical requirements for app analytics - https://phabricator.wikimedia.org/T164801 (Mholloway) For posterity, linking https://gerrit.wikimedia.org/r/c/apps/android/wikipedia/+/359008 which...
[21:37:01] <icinga-wm>	 RECOVERY - Check the last execution of produce_canary_events on an-launcher1002 is OK: OK: Status of the systemd unit produce_canary_events https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[22:23:03] <wikibugs>	 Analytics, Operations, SRE-Access-Requests: Requesting access to Production Shell Access (analytics-privatedata-users) for Rmaung - https://phabricator.wikimedia.org/T266250 (Dzahn) a:Rmaung
[22:24:39] <wikibugs>	 Analytics, Operations, SRE-Access-Requests: Requesting access to production shell groups for JAnstee - https://phabricator.wikimedia.org/T266249 (Dzahn) a:JAnstee_WMF