[08:14:21] Analytics, Analytics-Kanban, Patch-For-Review: Clean up issues with jobs after Hadoop Upgrade - https://phabricator.wikimedia.org/T274322 (elukey)
[08:15:16] Analytics, Analytics-Kanban, Patch-For-Review: Clean up issues with jobs after Hadoop Upgrade - https://phabricator.wikimedia.org/T274322 (elukey) While checking the hive-server2.log file on an-coord1001 I found a recurrence of the following, right after a virtualpageview hourly job starts: ` 2021-0...
[08:17:34] Analytics, Analytics-Kanban, Patch-For-Review: Clean up issues with jobs after Hadoop Upgrade - https://phabricator.wikimedia.org/T274322 (elukey)
[08:17:47] Analytics, Analytics-Kanban, Patch-For-Review: Clean up issues with jobs after Hadoop Upgrade - https://phabricator.wikimedia.org/T274322 (elukey) ` 2021-02-13T03:15:15,802 ERROR [2d5044e1-67a1-4c02-88e6-ebcb6352842b HiveServer2-Handler-Pool: Thread-171789] parse.CalcitePlanner: CBO failed, skipping...
[08:34:50] Analytics, Analytics-Kanban, Patch-For-Review: Clean up issues with jobs after Hadoop Upgrade - https://phabricator.wikimedia.org/T274322 (elukey)
[08:35:57] Analytics, Analytics-Kanban, Patch-For-Review: Clean up issues with jobs after Hadoop Upgrade - https://phabricator.wikimedia.org/T274322 (elukey) Every time pageview_actor hourly is run, I see the following in hive-server2.log: ` 2021-02-13T00:29:20,114 ERROR [c69ed050-16b1-47f3-9306-8740b72a3aaf H...
[08:59:16] Analytics, Analytics-Kanban, Patch-For-Review: Clean up issues with jobs after Hadoop Upgrade - https://phabricator.wikimedia.org/T274322 (JAllemandou)
[09:01:26] Analytics, Datasets-Archiving, Datasets-General-or-Unknown, Internet-Archive: Mediacounts dumps missing since February 9, 2021 - https://phabricator.wikimedia.org/T274617 (JAllemandou) Hi Hydriz, sorry for the late response. We did a major upgrade of our hadoop cluster last week and we're still s...
[09:10:32] (PS1) Joal: Fix mediacount and mediarequest hourly oozie jobs [analytics/refinery] - https://gerrit.wikimedia.org/r/664049 (https://phabricator.wikimedia.org/T274332)
[09:11:17] (CR) Joal: [V: +2 C: +2] "Merging for hotfix deploy" [analytics/refinery] - https://gerrit.wikimedia.org/r/664049 (https://phabricator.wikimedia.org/T274332) (owner: Joal)
[09:12:12] (PS2) Joal: Fix mediacount and mediarequest hourly oozie jobs [analytics/refinery] - https://gerrit.wikimedia.org/r/664049 (https://phabricator.wikimedia.org/T274332)
[09:12:55] (CR) Joal: [V: +2 C: +2] Fix mediacount and mediarequest hourly oozie jobs [analytics/refinery] - https://gerrit.wikimedia.org/r/664049 (https://phabricator.wikimedia.org/T274332) (owner: Joal)
[09:14:29] !log Deploy hotfix for mediarequest and mediacount
[09:14:31] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[09:38:14] !log deploy refinery onto hdfs
[09:38:18] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[09:38:40] !log Restart and backfill mediacount and mediarequest, and backfill mediarequest-AQS and mediacount archive
[09:38:42] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[10:57:44] joal: bonjour!
[10:59:46] very nice fix <3
[11:00:46] so if we had two concurrent mediacounts jobs they were working on the same table (like dropping it while the other job is running)
[11:00:59] ending up in the IO errors that I've seen in the logs
[11:01:05] nice one
[11:02:03] ah and you are backfilling all the past days too, right after the upgrade
[11:02:17] elukey: I had it on the verge of my mind yesterday, I checked folders and they were distinct - And this morning I realized it was not at folder level but at table-name level :)
[11:04:33] joal: you rock :)
[11:04:36] thanks a lot
[11:05:15] elukey: I have seen your notes on the CBO bugs - I think they affect only performance, will triple check on monday
[11:06:36] joal: yep yep I just reported them since they are spamming hive logs, if not important we can definitely drop them
[11:06:44] ack
[11:06:47] the null pointer is kinda weird
[11:07:04] but it may not be anything relevant, we have never checked hive logs in depth
[11:07:14] (I am pretty sure we'd have found horrible things :D)
[11:56:09] Analytics: Sync urbanecm's LDAP account to Hue - https://phabricator.wikimedia.org/T274732 (Urbanecm)
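
Note on the table-name collision discussed at 11:00-11:02: the log does not show what the Gerrit change actually did, so the following is only a minimal sketch of one common way to keep concurrent hourly runs from sharing (and dropping) the same intermediate table. The function `temp_table_name`, the `tmp_` prefix, and the job name `mediacounts-load` are hypothetical and are not taken from analytics/refinery.

```python
import re


def temp_table_name(job_name: str, year: int, month: int, day: int, hour: int) -> str:
    """Build a per-run intermediate table name (hypothetical naming scheme).

    Including the job name and the hour being processed means two runs that
    execute concurrently for different hours never point at the same table.
    """
    # Replace characters that are awkward in an unquoted table identifier.
    safe_job = re.sub(r"[^A-Za-z0-9_]", "_", job_name)
    return f"tmp_{safe_job}_{year:04d}_{month:02d}_{day:02d}_{hour:02d}"


# Two backfill runs for consecutive hours get distinct table names.
run_a = temp_table_name("mediacounts-load", 2021, 2, 9, 0)
run_b = temp_table_name("mediacounts-load", 2021, 2, 9, 1)
assert run_a != run_b
```

With a scheme like this, a run that drops its own intermediate table cannot affect another run that is still reading or writing a different hour, which is the failure mode described above.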