[00:11:50] 10Quarry, 10cloud-services-team (Kanban): Do some checks of how many queries will break in a multiinstance environment - https://phabricator.wikimedia.org/T267989 (10Bstorm) p:05Triage→03Medium
[00:12:09] 10Quarry, 10cloud-services-team (Kanban): Do some checks of how many Quarry queries will break in a multiinstance environment - https://phabricator.wikimedia.org/T267989 (10Bstorm)
[04:11:24] PROBLEM - Check the last execution of produce_canary_events on an-launcher1002 is CRITICAL: CRITICAL: Status of the systemd unit produce_canary_events https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[04:22:10] RECOVERY - Check the last execution of produce_canary_events on an-launcher1002 is OK: OK: Status of the systemd unit produce_canary_events https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[07:54:36] 10Analytics-Radar, 10Platform Engineering Roadmap Decision Making, 10Epic, 10MW-1.35-notes (1.35.0-wmf.32; 2020-05-12), and 2 others: Remove revision_comment_temp and revision_actor_temp - https://phabricator.wikimedia.org/T215466 (10Marostegui)
[08:35:41] hello people
[08:35:49] little things that make SREs happy
[08:36:46] not sure since when, but if I file a puppet patch and then manually run a puppet compiler check, wikibugs posts the result of the run (with links) automagically in the gerrit CR
[09:08:57] * elukey afk for a bit!
[09:57:34] PROBLEM - analytics-meta MySQL instance on an-coord1002 is CRITICAL: NRPE: Command check_mysql_analytics-meta not defined https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Mysql_Meta
[09:58:54] PROBLEM - MySQL disk space for analytics-meta instance on an-coord1002 is CRITICAL: NRPE: Command check_mysql_analytics-meta_disk_space not defined https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Mysql_Meta
[09:59:10] this is me --^
[10:11:02] RECOVERY - MySQL disk space for analytics-meta instance on an-coord1002 is OK: DISK OK - free space: / 52344 MB (73% inode=93%): https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Mysql_Meta
[10:11:22] RECOVERY - analytics-meta MySQL instance on an-coord1002 is OK: PROCS OK: 1 process with command name mysqld https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Mysql_Meta
[10:11:57] gooood
[10:42:00] Taking a dump of the an-coord1001 dbs, hopefully ok, but if anything weird happens lemme know
[10:51:24] ok done, now I am moving it to an-coord1002
[10:51:30] then I'll try to bootstrap the replica
[11:28:52] !log set analytics meta instance on an-coord1002 as replica of an-coord1001
[11:28:59] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[11:29:18] \o/
[11:29:27] so we now have a replica on an-coord1002 too!
[11:47:55] * elukey lunch!
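
[editor's note] The replica bootstrap elukey describes above (dump the primary, copy and load the dump, point the replica at the primary) follows the standard MariaDB pattern. A minimal sketch; the replication user, password, binlog coordinates, and dump path below are placeholders, not values from the log:

    # On the primary (an-coord1001): dump all databases, recording the
    # binlog coordinates in the dump (--master-data=2 writes them as a comment).
    mysqldump --all-databases --single-transaction --master-data=2 > meta.sql

    # Copy the dump to the new replica and load it (path is illustrative).
    scp meta.sql an-coord1002.eqiad.wmnet:/srv/meta.sql
    mysql < /srv/meta.sql

    # On the replica (an-coord1002): point it at the primary using the
    # coordinates from the dump header, then start and verify replication.
    mysql -e "CHANGE MASTER TO MASTER_HOST='an-coord1001.eqiad.wmnet', \
      MASTER_USER='repl', MASTER_PASSWORD='********', \
      MASTER_LOG_FILE='mysql-bin.000123', MASTER_LOG_POS=4; START SLAVE;"
    mysql -e "SHOW SLAVE STATUS\G"

The MariaDB Replica IO/SQL Icinga checks that appear later in this log watch the same Slave_IO_Running / Slave_SQL_Running fields that SHOW SLAVE STATUS reports.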
[14:43:05] (03PS1) 10Fdans: Disable chart movement on scrolling when on table [analytics/wikistats2] - 10https://gerrit.wikimedia.org/r/641434 (https://phabricator.wikimedia.org/T267467)
[14:57:20] !log shutdown stat1008 for ram expansion
[14:57:22] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[15:06:52] 10Analytics, 10Event-Platform, 10Product-Infrastructure-Data: Automate EventGate validation error reporting - https://phabricator.wikimedia.org/T268027 (10Ottomata)
[15:07:52] 10Analytics, 10Event-Platform, 10Product-Infrastructure-Data: Automate EventGate validation error reporting - https://phabricator.wikimedia.org/T268027 (10Ottomata) Automated Phabricator tickets seem nice, but if that gets a little funky, emails to stream owners would suffice.
[15:09:23] !log drop 'dump' user from an-coord1001's analytics meta (related to dbprov hosts, previous attempts before db1108)
[15:09:28] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[15:11:06] !log drop backup@localhost user from an-coord1001's mariadb meta instance (not used anymore)
[15:11:09] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[15:16:19] 10Analytics-Kanban: Move oozie's hive2 actions to analytics-hive.eqiad.wmnet - https://phabricator.wikimedia.org/T268028 (10elukey)
[15:17:35] (03PS1) 10Mforns: Migrate browser general job to use cname credentials [analytics/refinery] - 10https://gerrit.wikimedia.org/r/641440 (https://phabricator.wikimedia.org/T268028)
[15:18:30] 10Analytics, 10Event-Platform: DesktopWebUIActionsTracking Event Platform Migration - https://phabricator.wikimedia.org/T267342 (10MNeisler) @Ottomata No we don't need them. Thank you for checking!
[15:19:04] 10Analytics, 10Event-Platform: MobileWebUIActionsTracking Event Platform Migration - https://phabricator.wikimedia.org/T267347 (10MNeisler) @Ottomata No we don't need them. Thank you for checking!
[15:21:12] (03CR) 10Elukey: [C: 03+1] Migrate browser general job to use cname credentials [analytics/refinery] - 10https://gerrit.wikimedia.org/r/641440 (https://phabricator.wikimedia.org/T268028) (owner: 10Mforns)
[15:22:04] stat1008 back with 512G of ram!
[15:22:04] (03CR) 10Mforns: [V: 03+2] "I tested this successfully with:" [analytics/refinery] - 10https://gerrit.wikimedia.org/r/641440 (https://phabricator.wikimedia.org/T268028) (owner: 10Mforns)
[15:22:07] joal: this is the job to load pageviews complete from pagecounts-raw, if you have a moment to take a look
[15:22:09] cc milimetric
[15:22:10] https://gerrit.wikimedia.org/r/c/analytics/refinery/+/640146
[15:22:44] elukey: niiiice \o/
[15:29:25] aaand we have a mariadb replica of an-coord1001 on 1002
[15:29:36] finishing the last things, but it seems to be working nicely
[15:34:55] 10Analytics-Clusters, 10DC-Ops, 10Operations, 10ops-eqiad: (Need By: 2020-09-15) upgrade/replace memory in stat100[58] - https://phabricator.wikimedia.org/T260448 (10Cmjohnson) 05Open→03Resolved added the new power supplies (will keep the older ones for spares). Added all the new memory sticks. resolv...
[15:44:26] elukey something is amiss with rocm on stat1008. I can't install rocm-dev, and as a result, tensorflow doesn't work
[15:44:32] Hi, after the update on 1008 I installed tensorflow-rocm 2.3.1, but when I check the GPU devices in the python console (gpu_devices = tf.config.experimental.list_physical_devices('GPU')) I get the following messages
[15:44:36] 2020-11-17 15:26:11.489241: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libamdhip64.so
[15:44:36] 2020-11-17 15:26:11.489863: E tensorflow/stream_executor/rocm/rocm_driver.cc:982] could not retrieve ROCM device count: HIP_ERROR_NoDevice
[15:44:36] 2020-11-17 15:26:11.489901: E tensorflow/stream_executor/rocm/rocm_driver.cc:982] could not retrieve ROCM device count: HIP_ERROR_NoDevice
[15:50:05] 10Analytics, 10Analytics-Kanban, 10Event-Platform, 10Privacy Engineering, and 4 others: Remove http.client_ip from EventGate default schema (again) - https://phabricator.wikimedia.org/T262626 (10DLynch) > @Nuria > Just like session id a browser re-start will clear error counters I just wanted to raise th...
[15:50:22] nvm, the package issue has been resolved
[15:50:42] But I can reproduce agaduran's error
[15:51:21] (03CR) 10Milimetric: [C: 04-1] Add historical_raw job to load data from pagecounts_raw (035 comments) [analytics/refinery] - 10https://gerrit.wikimedia.org/r/640146 (https://phabricator.wikimedia.org/T251777) (owner: 10Fdans)
[15:51:46] 10Analytics, 10Analytics-Kanban: pageviews complete have irregular lines - https://phabricator.wikimedia.org/T267575 (10Milimetric)
[16:00:53] klausman: mmm and I guess it was working right after your maintenance
[16:01:42] I compared packages containing "rocm" on stat1005 and 8 and the list is exactly the same, including versions
[16:01:49] I'm also the same user on both machines
[16:01:59] the SMI tool can see the GPU just fine.
[16:02:42] agaduran: hi! Is the code running fine on 1005?
[16:03:04] yes, it works fine on 1005
[16:03:06] It seems stat1008 is very unbusy at the moment, we could try a cold boot as a measure of last resort
[16:03:36] well I just rebooted it for the ram expansion, not sure if it'd make a diff
[16:03:43] agaduran: what is the path for the venv?
[16:03:45] Ah, good point
[16:04:16] FWIW, I used the standard "new env, source it, try the simple GPU test" approach
[16:04:37] with both 2.3.1 and 2.3.2 for tensorflow-rocm
[16:04:59] yes I am trying as well, weird
[16:05:14] klausman: was it working right after the upgrade?
[16:05:21] I mean, tf was running fine etc..
[16:05:30] I didn't try it myself, my bad.
[16:05:50] ahh okok good, it is a datapoint, I thought it could have been my reboot :)
[16:05:52] I figured if SMI worked, the rest would as well
[16:05:58] yes yes it makes sense
[16:06:03] path /home/agaduran/.conda/envs/keras/bin/python
[16:06:04] it must be a weird thing
[16:09:32] (venv) elukey@stat1008:~$ ls -l /dev/kfd
[16:09:33] crw-rw---- 1 root video 239, 0 Nov 17 15:19 /dev/kfd
[16:09:34] mmmmm
[16:09:49] so in theory it should be "render"
[16:10:06] elukey@stat1005:~$ ls -l /dev/kfd
[16:10:06] crw-rw---- 1 root render 243, 0 Oct 30 10:18 /dev/kfd
[16:10:08] ah!
[16:10:10] klausman: --^
[16:10:17] I saw some bug reports that people had this problem if tnice find
[16:10:26] oops, half and half a message, disregard
[16:10:37] now I am wondering why this happens only on 1008...
[16:10:52] agaduran: we might be close to the root cause, we'll keep you informed, thanks for the pointer!
[16:10:59] the perms come out of udev, I presume
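
[editor's note] A consolidated sketch of the checks being run in the exchange above — device-node group, the current user's groups, and the TensorFlow probe agaduran pasted. The expectations in the comments are assumptions based on the stat1005/stat1008 comparison in the log:

    # ROCm userspace opens the GPU through /dev/kfd; on the healthy host
    # (stat1005) the node is owned by group "render", not "video".
    ls -l /dev/kfd /dev/dri/renderD*
    id -nG    # is the current user in the group that owns /dev/kfd?

    # The TensorFlow-side probe from the report above; on a working host
    # it prints a non-empty device list, while HIP_ERROR_NoDevice usually
    # means the process cannot open /dev/kfd at all.
    python3 -c "import tensorflow as tf; print(tf.config.experimental.list_physical_devices('GPU'))"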
[16:11:25] haaang on...
[16:11:57] elukey@stat1008:~$ cat /etc/udev/rules.d/70-amdgpu.rules
[16:11:57] KERNEL=="kfd", GROUP=="video", MODE="0660"
[16:11:58] whatt
[16:12:06] Yes. rock-dkms is still there!
[16:12:17] ahhhh
[16:12:39] purge and reboot?
[16:12:46] yes +1
[16:14:15] 10Analytics-Clusters: Balance Kafka topic partitions on Kafka Jumbo to take advantage of the new brokers - https://phabricator.wikimedia.org/T255973 (10razzi) I tried that command you mentioned @Ottomata by copying the self-contained binary on to kafka-jumbo. Passing `--brokers -2` means to apply for all brokers...
[16:18:28] And it works!
[16:18:45] agaduran: I think we've fixed the machine, can you try your script?
[16:18:53] elukey: thank you!
[16:19:06] \o/
[16:19:21] let me try
[16:19:44] 10Analytics-Clusters: Balance Kafka topic partitions on Kafka Jumbo to take advantage of the new brokers - https://phabricator.wikimedia.org/T255973 (10Ottomata) Wow that is very cool! > some reassignments don't share anything with the previous state: Hm, I guess that's ok, data will be blasting around all over...
[16:21:14] works for me
[16:21:57] yes, seems to work! thank you!
[16:22:07] np :) Thanks for the bug report
[16:23:32] ottomata, razzi - we are not testing topicmappr in production, right? Can we do it in cloud/labs first?
[16:24:08] elukey: It is on production; I did test it on cloud first
[16:24:39] It is only creating a migration plan, not actually applying anything as of yet
[16:25:19] yep yep I get that it is only a migration plan, but I didn't see it running on cloud first (from reading the task) so I was asking
[16:25:42] how much data did you add to the cloud setup? Also, how many brokers/replicas/etc..?
[16:26:07] I don't want to block anything, just asking the usual pessimistic questions before going further :)
[16:27:45] Good questions :)
[16:27:45] We added 1 broker: deployment-kafka-jumbo-3.deployment-prep.eqiad1.wikimedia.cloud to the existing 2
[16:27:45] In terms of data, my understanding is that the beta environment has almost no throughput, so migrations complete instantaneously
[16:28:16] elukey: iiuc topicmappr doesn't actually do anything
[16:28:21] it just helps compute a plan
[16:29:24] ottomata: I got it, but I wanted to know if you were going to pull the trigger with something, that's it :)
[16:29:43] elukey: no triggers until we have a good plan and timeline :)
[16:30:01] I think that we should test the moves on a cluster with some data on it plus clients producing/consuming
[16:30:07] just to verify that nothing blows up
[16:30:31] we could even use some of the hadoop worker nodes due to be decommissioned for this
[16:30:45] (and load them with jumbo's data)
[16:33:02] elukey: ya could do
[16:33:10] could set up temp mirror maker too to get real data
[16:33:35] +1 nice idea
[16:33:55] I was testing decom with riccardo earlier today, we ended up at analytics1050
[16:34:22] I was planning to work on 51->57 with razzi today but maybe we can just wait and see if we want to use them as a temp kafka cluster
[16:35:50] ottomata: in other news, we have a replica of the meta db on an-coord1002!
[16:35:51] don't see why not
[16:35:55] elukey: that is awesome!
[16:36:11] elukey: would we need to do all the puppet stuff, or could we just set kafka up manually
[16:36:20] elukey: ...do we have a test zookeeper cluster for test hadoop?
[16:37:14] ottomata: even manual is fine, whatever is best (maybe the puppet way could be interesting for Razzi as a playground to bootstrap a cluster from scratch) - For zookeeper we could colocate if we want!
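
[editor's note] For context on the topicmappr thread: as ottomata says, the tool only computes a partition reassignment plan; nothing changes on the cluster until the resulting map is fed to Kafka's own reassignment tooling. A hedged sketch of the kind of invocation razzi describes — the topic name is illustrative, connection flags are omitted, and `--brokers -2` is taken from the task comment above, where it is described as meaning "all brokers":

    # Compute (but do not apply) a reassignment plan spanning all brokers.
    topicmappr rebuild --topics 'webrequest_text' --brokers -2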
[16:37:38] elukey: should we also maintain a test kafka jumbo cluster?
[16:37:48] not on decommed hardware, but on hardware we keep up
[16:37:54] could probably do it on ganeti if we had to
[16:38:10] ottomata: could be interesting yes, and it would be nice together with hadoop test
[16:39:32] and we usually colocate the mirrormaker processes with the destination kafka cluster
[16:39:37] so it could run on those too
[16:39:44] (razzi please let us know your thoughts, if you think it is a good idea or not etc..)
[16:40:47] It sounds like a lot of work :P but we don't want to mess up production kafka so it makes sense to do a realistic test run first
[16:41:41] yep it is a lot of work I agree, but if we cause an outage on Jumbo we may lose a ton of data :(
[16:42:15] if we do it for keeps (instead of a one-off on decommed hw), it's a lot of work once, and we can use it for future kafka work later
[16:42:20] like broker upgrades
[16:42:24] for which we are many years behind
[16:42:35] exactly!
[16:42:50] testing the migration to 2.x would be way better
[16:56:07] (03PS2) 10Fdans: Add historical_raw job to load data from pagecounts_raw [analytics/refinery] - 10https://gerrit.wikimedia.org/r/640146 (https://phabricator.wikimedia.org/T251777)
[16:56:11] (03CR) 10Fdans: Add historical_raw job to load data from pagecounts_raw (035 comments) [analytics/refinery] - 10https://gerrit.wikimedia.org/r/640146 (https://phabricator.wikimedia.org/T251777) (owner: 10Fdans)
[17:04:20] PROBLEM - MariaDB Replica IO: analytics-meta-replica on an-coord1002 is CRITICAL: CRITICAL slave_io_state could not connect https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[17:05:58] lovely
[17:06:43] (monitoring issue, will fix)
[17:09:02] PROBLEM - MariaDB Replica SQL: analytics-meta-replica on an-coord1002 is CRITICAL: CRITICAL slave_sql_state could not connect https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[17:17:34] mforns: check it out!
[17:17:34] https://stream-beta.wmflabs.org/?doc#/streams
[17:17:44] lookin
[17:18:31] ottomata: looks great :] I will point the ui to that, and test
[17:18:45] cool
[17:18:58] RECOVERY - MariaDB Replica SQL: analytics-meta-replica on an-coord1002 is OK: OK slave_sql_state Slave_SQL_Running: Yes https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[17:19:10] RECOVERY - MariaDB Replica IO: analytics-meta-replica on an-coord1002 is OK: OK slave_io_state Slave_IO_Running: Yes https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[17:19:12] \o/
[17:25:14] (03CR) 10Milimetric: [C: 03+2] Make compatible with Python 3 [analytics/statsv] - 10https://gerrit.wikimedia.org/r/639223 (https://phabricator.wikimedia.org/T267269) (owner: 10Dave Pifke)
[17:26:00] (03CR) 10Ottomata: [C: 03+1] Make compatible with Python 3 [analytics/statsv] - 10https://gerrit.wikimedia.org/r/639223 (https://phabricator.wikimedia.org/T267269) (owner: 10Dave Pifke)
[17:30:56] (03CR) 10Dave Pifke: "Looks like CI is not set up, so someone's going to have to hit merge." [analytics/statsv] - 10https://gerrit.wikimedia.org/r/639223 (https://phabricator.wikimedia.org/T267269) (owner: 10Dave Pifke)
[17:34:02] (03CR) 10Ottomata: [V: 03+2 C: 03+1] Make compatible with Python 3 [analytics/statsv] - 10https://gerrit.wikimedia.org/r/639223 (https://phabricator.wikimedia.org/T267269) (owner: 10Dave Pifke)
[18:14:23] a-team: we're in https://meet.google.com/gcj-bqgv-stk
[18:14:34] (not the cave)
[18:14:38] sorry my bad :)
[18:33:48] PROBLEM - Check the last execution of produce_canary_events on an-launcher1002 is CRITICAL: CRITICAL: Status of the systemd unit produce_canary_events https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[18:55:12] RECOVERY - Check the last execution of produce_canary_events on an-launcher1002 is OK: OK: Status of the systemd unit produce_canary_events https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[19:02:34] ah ottomata I forgot one thing
[19:02:37] naming question
[19:02:54] Rob asked us what name we should use for the superset/turnilo/etc.. dedicated node
[19:03:15] an-ui1001? Too short?
[19:04:08] no more an-tool?
[19:09:02] ottomata: yep we can also go for an-tool1010
[19:09:09] fine for me
[19:09:16] seems easier than making a new name :p
[19:09:27] +1 all right will comment in the task, thanks!
[19:11:10] there is also another point, namely the refresh for thorium
[19:39:00] going to log off folks :)
[19:39:04] see you tomorrow
[19:59:58] 10Analytics, 10Analytics-Kanban, 10Event-Platform, 10Operations: Reduce cache TTL of schema.wikimedia.org - https://phabricator.wikimedia.org/T267557 (10razzi) This has been deployed with a 60-second TTL.
[20:43:49] 10Analytics-Clusters: Create kafka-jumbo mirror cluster - https://phabricator.wikimedia.org/T268074 (10razzi)
[20:52:41] 10Analytics-Clusters: Create kafka-jumbo mirror cluster - https://phabricator.wikimedia.org/T268074 (10razzi)
[21:09:15] 10Analytics-Clusters: Create kafka-jumbo mirror cluster - https://phabricator.wikimedia.org/T268074 (10razzi) Questions: - new zookeeper cluster or reuse a zookeeper cluster?
[21:12:56] hey razzi, yt? wanna pair on deployment train?
[21:13:28] mforns: chatting with ottomata currently; you available later?
[21:13:50] razzi: yes, will do my pending CR now, and pair with you in a bit!
[21:49:11] mforns: meet in the bc?
[21:49:26] hey razzi yes!
[21:49:27] omw
[22:33:02] (03CR) 10Mforns: [V: 03+2 C: 03+2] "Merging for deployment train" [analytics/refinery] - 10https://gerrit.wikimedia.org/r/641440 (https://phabricator.wikimedia.org/T268028) (owner: 10Mforns)
[22:36:50] !log deploying refinery (regular weekly deployment train)
[22:36:51] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[22:38:43] Hi mforns - Can I disturb you while you deploy, or do you prefer to stay focused?
[23:00:07] arf sorry mforns and razzi - I didn't read that you were meeting for the deploy - logging off for today :)
[23:00:28] !log finished deploying refinery (regular weekly deployment train)
[23:00:30] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[23:00:32] hey joal!
[23:00:44] I wanted to ask you about the webrequest fix!
[23:00:48] Ah!
[23:00:53] What's up?
[23:01:05] sorry joal, didn't see your previous message
[23:01:15] np mforns - we missed each other :)
[23:01:20] cave?
[23:01:38] we saw that the webrequest comma fix was not yet deployed, but it seemed weird to us, because the code does not work without it??
[23:01:45] joal: we're in da cave
[23:01:48] :]
[23:09:44] !log restarted browser general oozie job
[23:09:45] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
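
[editor's note] To confirm a restart like the browser general oozie job above took effect, a minimal sketch — this assumes OOZIE_URL is set in the environment and that the coordinator name contains "browser":

    # List running coordinators and look for the restarted job.
    oozie jobs -jobtype coordinator -filter status=RUNNING | grep -i browser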