[01:21:48] RECOVERY - Check unit status of monitor_refine_event on an-launcher1002 is OK: OK: Status of the systemd unit monitor_refine_event https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[07:26:41] !log Rerun cassandra-daily-wf-local_group_default_T_pageviews_per_article_flat-2021-10-11
[07:26:43] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[11:46:38] Analytics, Analytics-Kanban: HDFS check topology alert is currently broken - https://phabricator.wikimedia.org/T292846 (BTullis) The alert has now been fixed and deploying the change above, followed by a restart of the Hadoop masters resulted in the check passing again. Marking this as done.
[12:56:44] Analytics-Radar, Patch-For-Review: Update ROCm version on GPU instances. - https://phabricator.wikimedia.org/T287267 (elukey) Open→Resolved a:elukey This is done! It seems that ROCm 4.2 is the only viable option for the moment, I'll keep working on https://github.com/ROCmSoftwarePlatform/tens...
[13:14:27] !log btullis@analytics1069:~$ sudo systemctl stop hadoop-yarn-nodemanager.service
[13:14:30] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[13:15:14] !log btullis@analytics1069:~$ sudo systemctl stop hadoop-hdfs-*
[13:15:16] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[13:16:41] Analytics, Analytics-Kanban, Data-Engineering, Growth-Team, and 6 others: Migrated Server-side EventLogging events recording http.client_ip as 127.0.0.1 - https://phabricator.wikimedia.org/T288853 (Ottomata) Okay, deployed, and I see HomepageVisit events with real client IPs. @nettrom_WMF, f it...
[13:17:05] !log btullis@analytics1069:~$ sudo shutdown -h now
[13:17:07] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[13:41:49] Analytics, Patch-For-Review: Conda's CPPFLAGS may not be correct when pip installing a package that needs c/cpp compilation - https://phabricator.wikimedia.org/T292699 (elukey) Merged the patches, next step is to build the new debian package and install it across our nodes. I can take care of it or leave...
[13:42:03] Analytics-Clusters, SRE, ops-eqiad: analytics1069 mgmt interface intermittently goes up and down - https://phabricator.wikimedia.org/T291732 (BTullis) Host is not booting cleanly. We get an error from /dev/sdc on boot and it required the root password for maintenance. `dmesg` shows this. ` [ 105.195...
[13:46:41] Analytics-Clusters, SRE, ops-eqiad: analytics1069 mgmt interface intermittently goes up and down - https://phabricator.wikimedia.org/T291732 (BTullis) I quit out of the maintenance prompt with Ctrl-D but it failed at fsck again. ` Reloading system manager configuration Starting default target [ 1167....
[13:48:41] Analytics, Patch-For-Review: Conda's CPPFLAGS may not be correct when pip installing a package that needs c/cpp compilation - https://phabricator.wikimedia.org/T292699 (Ottomata) We're doing 'offsite' this week so I don't think we'll get to it soon. Please proceed if you need it!
[13:49:24] Analytics-Clusters, SRE, ops-eqiad: analytics1069 mgmt interface intermittently goes up and down - https://phabricator.wikimedia.org/T291732 (elukey) One thing that I do when this happens is to enter the root password and comment the disk in /etc/fstab, and then powercycle. In theory the OS should bo...
[13:52:53] Analytics-Clusters, SRE, ops-eqiad: analytics1069 mgmt interface intermittently goes up and down - https://phabricator.wikimedia.org/T291732 (BTullis) Great, thanks @elukey - I had got as far as looking at various megacli commands, but as far as the RAID controller was concerned everything is fine. I...
[14:15:12] Analytics-Clusters, SRE, ops-eqiad: analytics1069 mgmt interface intermittently goes up and down - https://phabricator.wikimedia.org/T291732 (Jclark-ctr) preformed flea power drained, power off, remove power cables, unseat power supplies, hold the power button for 20-30 seconds and plug it all back i...
[14:51:30] joal: learned it by chance - https://www.featurestoresummit.com/agenda
[14:51:57] it is today/tomorrow, I hope that some youtube recordings will appear during the next days (all in Pacific time, not super friendly)
[14:52:00] a lot of big names
[14:52:35] elukey: will not be able to attend, we have our virtual-offsite :S I hope there'll be viseoa
[14:52:41] *videos
[14:54:41] me too :(
[15:43:18] !log btullis@aqs1008:~$ sudo nodetool-b clearsnapshot
[15:43:22] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[16:05:48] Interesting read! http://highscalability.com/blog/2021/10/11/scaling-indexing-and-search-algolia-new-search-architecture.html
[16:12:30] Analytics-Clusters, SRE, ops-eqiad: analytics1069 mgmt interface intermittently goes up and down - https://phabricator.wikimedia.org/T291732 (BTullis) Created follow-up task: {T293111}
[16:17:38] Analytics, WMDE-Analytics-Engineering, WMDE-FUN-Team, WMDE-Fundraising-Tech, User-GoranSMilovanovic: Sync with https://analytics.wikimedia.org/published/datasets/ from stat1008 - https://phabricator.wikimedia.org/T293112 (GoranSMilovanovic)
[16:18:54] Analytics, WMDE-Analytics-Engineering, WMDE-FUN-Team, WMDE-Fundraising-Tech, User-GoranSMilovanovic: Sync with https://analytics.wikimedia.org/published/datasets/ from stat1008 - https://phabricator.wikimedia.org/T293112 (GoranSMilovanovic)
[16:29:40] Analytics, WMDE-Analytics-Engineering, WMDE-FUN-Team, WMDE-Fundraising-Tech, User-GoranSMilovanovic: Sync with https://analytics.wikimedia.org/published/datasets/ from stat1008 - https://phabricator.wikimedia.org/T293112 (GoranSMilovanovic)
[16:47:14] Analytics, Analytics-Kanban, Data-Engineering, Growth-Team, and 6 others: Migrated Server-side EventLogging events recording http.client_ip as 127.0.0.1 - https://phabricator.wikimedia.org/T288853 (nettrom_WMF) Open→Resolved I've verified that there are now events in the Data Lake with cl...
[18:57:40] (Abandoned) ODimitrijevic: add exclusion comment [analytics/gobblin-wmf] - https://gerrit.wikimedia.org/r/730047 (owner: ODimitrijevic)
[18:57:50] (Abandoned) ODimitrijevic: remove redundant exclusion [analytics/gobblin-wmf] - https://gerrit.wikimedia.org/r/730045 (owner: ODimitrijevic)
[18:57:56] (Abandoned) ODimitrijevic: exclude conflicting dependencies [analytics/gobblin-wmf] - https://gerrit.wikimedia.org/r/728654 (owner: ODimitrijevic)
[19:06:16] (PS1) ODimitrijevic: exclude conflicting dependencies [analytics/gobblin-wmf] - https://gerrit.wikimedia.org/r/730282
[19:08:33] (CR) jerkins-bot: [V: -1] exclude conflicting dependencies [analytics/gobblin-wmf] - https://gerrit.wikimedia.org/r/730282 (owner: ODimitrijevic)
[19:17:59] (PS2) ODimitrijevic: exclude conflicting dependencies [analytics/gobblin-wmf] - https://gerrit.wikimedia.org/r/730282
[19:53:51] Analytics: Presto error in Superset - https://phabricator.wikimedia.org/T292879 (JAnstee_WMF)