[00:04:31] Analytics, Product-Analytics, Tracking-Neverending: Spark application UI shows data for different application - https://phabricator.wikimedia.org/T245892 (nshahquinn-wmf)
[00:05:38] Analytics, Product-Analytics, Tracking-Neverending: Analysts cannot reliably use Spark to run SQL queries against Hive databases - https://phabricator.wikimedia.org/T245891 (nshahquinn-wmf) a:nshahquinn-wmf
[00:14:52] Analytics, Product-Analytics, Epic: Analysts cannot reliably use Spark to run SQL queries against Hive databases - https://phabricator.wikimedia.org/T245891 (nshahquinn-wmf)
[00:28:56] Analytics, Epic, Product-Analytics (Kanban): Analysts cannot reliably use Spark to run SQL queries against Hive databases - https://phabricator.wikimedia.org/T245891 (nshahquinn-wmf) p:Triage→High @kzimmerman I'm making Phab reflect reality as this is having significant impacts on our work an...
[00:32:44] Analytics, Epic, Product-Analytics (Kanban): Spark applications crash when running large queries - https://phabricator.wikimedia.org/T245896 (nshahquinn-wmf)
[00:36:35] Analytics, Product-Analytics: Spark application UI shows data for different application - https://phabricator.wikimedia.org/T245892 (nshahquinn-wmf)
[00:40:23] Analytics, Product-Analytics: Give clear recommendations for Spark settings - https://phabricator.wikimedia.org/T245897 (nshahquinn-wmf)
[00:40:44] Analytics, Product-Analytics: Give clear recommendations for Spark settings - https://phabricator.wikimedia.org/T245897 (nshahquinn-wmf) a:nshahquinn-wmf→None
[00:48:26] PROBLEM - Check the last execution of reportupdater-reference-previews on stat1007 is CRITICAL: connect to address 10.64.21.118 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[00:48:58] PROBLEM - Check the last execution of wikimedia-discovery-golden on stat1007 is CRITICAL: connect to address 10.64.21.118 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[00:49:04] PROBLEM - Check the last execution of refinery-import-page-current-dumps on stat1007 is CRITICAL: connect to address 10.64.21.118 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[00:49:08] PROBLEM - Check the last execution of refinery-import-wikidata-all-ttl-dumps on stat1007 is CRITICAL: connect to address 10.64.21.118 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[00:49:12] PROBLEM - Check the last execution of refinery-import-page-history-dumps on stat1007 is CRITICAL: connect to address 10.64.21.118 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[00:50:48] PROBLEM - Check the last execution of reportupdater-published_cx2_translations on stat1007 is CRITICAL: connect to address 10.64.21.118 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[00:51:32] PROBLEM - Check the last execution of archive-maxmind-geoip-database on stat1007 is CRITICAL: connect to address 10.64.21.118 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[00:51:40] PROBLEM - Check the last execution of refinery-import-wikidata-all-json-dumps on stat1007 is CRITICAL: connect to address 10.64.21.118 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[00:54:48] PROBLEM - Check the last execution of reportupdater-pingback on stat1007 is CRITICAL: connect to address 10.64.21.118 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[00:55:08] PROBLEM - Check the last execution of reportupdater-wmcs on stat1007 is CRITICAL: connect to address 10.64.21.118 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[00:55:44] PROBLEM - Check the last execution of reportupdater-structured-data on stat1007 is CRITICAL: connect to address 10.64.21.118 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[00:57:14] PROBLEM - Check the last execution of reportupdater-interlanguage on stat1007 is CRITICAL: connect to address 10.64.21.118 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[00:59:08] PROBLEM - Check the last execution of refinery-import-siteinfo-dumps on stat1007 is CRITICAL: connect to address 10.64.21.118 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[01:14:25] PROBLEM - Check the last execution of reportupdater-browser on stat1007 is CRITICAL: connect to address 10.64.21.118 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[01:17:46] RECOVERY - Check the last execution of reportupdater-interlanguage on stat1007 is OK: OK: Status of the systemd unit reportupdater-interlanguage https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[01:19:16] RECOVERY - Check the last execution of reportupdater-reference-previews on stat1007 is OK: OK: Status of the systemd unit reportupdater-reference-previews https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[01:19:24] RECOVERY - Check the last execution of refinery-import-siteinfo-dumps on stat1007 is OK: OK: Status of the systemd unit refinery-import-siteinfo-dumps https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[01:19:50] RECOVERY - Check the last execution of wikimedia-discovery-golden on stat1007 is OK: OK: Status of the systemd unit wikimedia-discovery-golden https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[01:19:54] RECOVERY - Check the last execution of refinery-import-page-current-dumps on stat1007 is OK: OK: Status of the systemd unit refinery-import-page-current-dumps https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[01:19:58] RECOVERY - Check the last execution of refinery-import-wikidata-all-ttl-dumps on stat1007 is OK: OK: Status of the systemd unit refinery-import-wikidata-all-ttl-dumps https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[01:20:04] RECOVERY - Check the last execution of refinery-import-page-history-dumps on stat1007 is OK: OK: Status of the systemd unit refinery-import-page-history-dumps https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[01:22:12] RECOVERY - Check the last execution of reportupdater-published_cx2_translations on stat1007 is OK: OK: Status of the systemd unit reportupdater-published_cx2_translations https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[01:22:54] RECOVERY - Check the last execution of archive-maxmind-geoip-database on stat1007 is OK: OK: Status of the systemd unit archive-maxmind-geoip-database https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[01:23:02] RECOVERY - Check the last execution of refinery-import-wikidata-all-json-dumps on stat1007 is OK: OK: Status of the systemd unit refinery-import-wikidata-all-json-dumps https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[01:24:44] RECOVERY - Check the last execution of reportupdater-browser on stat1007 is OK: OK: Status of the systemd unit reportupdater-browser https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[01:26:00] RECOVERY - Check the last execution of reportupdater-pingback on stat1007 is OK: OK: Status of the systemd unit reportupdater-pingback https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[01:26:06] RECOVERY - Check the last execution of reportupdater-wmcs on stat1007 is OK: OK: Status of the systemd unit reportupdater-wmcs https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[01:26:50] RECOVERY - Check the last execution of reportupdater-structured-data on stat1007 is OK: OK: Status of the systemd unit reportupdater-structured-data https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[01:44:10] Analytics, Product-Analytics (Kanban): Update wmfdata to use sensible Spark settings - https://phabricator.wikimedia.org/T245097 (nshahquinn-wmf)
[01:50:29] Analytics, Product-Analytics: wmfdata cannot recover from a crashed Spark session - https://phabricator.wikimedia.org/T245713 (nshahquinn-wmf)
[01:53:30] nuria: o/
[01:53:33] nuria: welcome back.
[01:54:05] nuria: qq. I'll have to continue working at least for a couple of hours over the weekend on OKRs. Let me know if you will work, too, and we may be able to cowork and make progress faster.
[01:56:46] leila: hello! my internet will disappear in a bit, if you send me an e-mail i can probably answer tomorrow.
[01:58:16] nuria: thumbs up
[02:16:02] Analytics, Epic, Product-Analytics (Kanban): Spark applications crash when running large queries - https://phabricator.wikimedia.org/T245896 (nshahquinn-wmf) @Ottomata, I'm trying to break down these issues more clearly in Phab. This task is about the Spark crashes themselves; T245713 is about not be...
[11:21:14] Analytics, Operations, ops-eqiad: Degraded RAID on analytics1044 - https://phabricator.wikimedia.org/T245910 (Volans)
[11:59:55] PROBLEM - Hadoop NodeManager on analytics1073 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Administration
[12:27:40] Analytics, Analytics-Wikistats, translatewiki.net, Patch-For-Review: Add stats.wikimedia.org to translatewiki.net - https://phabricator.wikimedia.org/T240621 (MarcoAurelio)
[12:28:08] Analytics, Gerrit, Gerrit-Privilege-Requests, Patch-For-Review, User-MarcoAurelio: Give access to Wikistats 2 to l10n-bot - https://phabricator.wikimedia.org/T245805 (MarcoAurelio)
[12:28:13] Analytics, Analytics-Wikistats, translatewiki.net, Patch-For-Review: Add stats.wikimedia.org to translatewiki.net - https://phabricator.wikimedia.org/T240621 (MarcoAurelio)
[12:37:42] (CR) MarcoAurelio: "Jenkins bot apparently isn't picking this change for the gate-and-submit queue. Can you please re+2 so we can test if this is related to h" [analytics/wikistats2] - https://gerrit.wikimedia.org/r/573953 (https://phabricator.wikimedia.org/T240621) (owner: Fdans)
[14:11:23] RECOVERY - Hadoop NodeManager on analytics1073 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Administration
[14:11:26] !log restart hadoop-yarn-nodemanager on analytics1073 - process died, logs saved in /home/elukey
[14:11:27] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[14:11:29] weird
[14:12:16] a lot of broken pipe etc.. errors
[14:21:04] Analytics, Operations, ops-eqiad, Patch-For-Review: Degraded RAID on analytics1044 - https://phabricator.wikimedia.org/T245910 (elukey) The host is going to be refreshed this fiscal, I'd just avoid to use the disk for the moment.
[14:21:24] !log restart hadoop-yarn-nodemanager on analytics1044 - broken disk, apply hiera overrides to exclude it
[14:21:25] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[14:22:22] all right all good, afk again :)
[16:05:11] Ack elukey - Thanks a lot
[18:18:12] Analytics, Better Use Of Data, Performance-Team, Product-Analytics, and 2 others: Switch mw.user.sessionId back to session-cookie persistence - https://phabricator.wikimedia.org/T223931 (Krinkle) Open→Resolved