[00:05:04] !log deployed refinery and synced to hdfs, restarting cassandra jobs gently
[00:05:08] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[00:45:01] RECOVERY - Check the last execution of monitor_refine_eventlogging_legacy_failure_flags on an-launcher1002 is OK: OK: Status of the systemd unit monitor_refine_eventlogging_legacy_failure_flags https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[00:58:59] PROBLEM - Check the last execution of monitor_refine_eventlogging_legacy_failure_flags on an-launcher1002 is CRITICAL: CRITICAL: Status of the systemd unit monitor_refine_eventlogging_legacy_failure_flags https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[02:06:26] (PS1) Milimetric: Fix spelling error in mediacounts job [analytics/refinery] - https://gerrit.wikimedia.org/r/663364
[02:06:38] (CR) Milimetric: [V: +2 C: +2] Fix spelling error in mediacounts job [analytics/refinery] - https://gerrit.wikimedia.org/r/663364 (owner: Milimetric)
[02:19:35] !log deployed again to fix old spelling error :) referererererer
[02:19:38] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[02:43:17] (PS1) Milimetric: [WIP] The mediarequest per file job had a syntax error that I fix here, but it also has the UNION ALL syntax that I understand doesn't work. In this case, it would be prohibitively expensive to use UNION, it's a LOT of data. So I'm leaving this fix for tomorrow. [analytics/refinery] - https://gerrit.wikimedia.org/r/663372 (https://phabricator.wikimedia.org/T274322)
[02:49:12] Analytics, Analytics-Kanban, Patch-For-Review: Clean up issues with jobs after Hadoop Upgrade - https://phabricator.wikimedia.org/T274322 (Milimetric)
[02:52:41] Analytics, Analytics-Kanban, Patch-For-Review: Clean up issues with jobs after Hadoop Upgrade - https://phabricator.wikimedia.org/T274322 (Milimetric) Update: all cassandra jobs restarted and seem ok, except mediarequests per_file daily. Patch for that WIP above. See note in description, when figur...
[02:54:47] big kudos to Jo, the refinery-source patch he sent made all the cassandra jobs run. Update on the task above^. I'll be out tomorrow morning, call me on my cell if you need me.
[02:59:54] Analytics, Analytics-Kanban, Patch-For-Review: Clean up issues with jobs after Hadoop Upgrade - https://phabricator.wikimedia.org/T274322 (Milimetric)
[06:08:32] Analytics-Radar, Datasets-Archiving, Research: Make HTML dumps available - https://phabricator.wikimedia.org/T182351 (ArielGlenn) >>! In T182351#6820912, @fkaelin wrote: > @ArielGlenn, the dataset should contain the rendered html for all revisions, rendered with the mediawiki version at the time the...
[06:59:15] goood morning
[06:59:24] wow so refinery source without cdh deps!
[07:44:03] Good morning elukey :)
[07:44:11] let me know when you're ready for a sync
[07:44:36] bonjour :)
[07:48:06] * joal dances - cassandra jobs are back up
[07:48:14] nice!
[08:18:17] joal: we can sync if you want
[08:18:46] heya elukey - joining the caveb
[08:37:26] Analytics, PM: Fix Analytics workflow for #Analytics-EventLogging tasks - https://phabricator.wikimedia.org/T274490 (Aklapper)
[09:05:36] Analytics, SRE, SRE-Access-Requests, Patch-For-Review: Add kzeta to analytics-privatedata-users - https://phabricator.wikimedia.org/T272982 (elukey) Open→Resolved a:elukey @kzimmerman should be done! Let me know if you still have issues :)
[09:05:45] Analytics, SRE, SRE-Access-Requests, Patch-For-Review: Add kzeta to analytics-privatedata-users - https://phabricator.wikimedia.org/T272982 (elukey)
[09:15:24] going to be afk for a bit since I have some workers at home, if needed ping me on the phone :)
[10:07:34] I am running puppet on all hosts to move the cdh module to bigtop
[10:08:06] so so happy that we have another Apache project :)
[10:09:54] \o/
[11:12:20] Analytics-Clusters, Patch-For-Review: Create a temporary hadoop backup cluster - https://phabricator.wikimedia.org/T260411 (elukey)
[11:12:40] Analytics-Clusters, Analytics-Kanban, Patch-For-Review: Create a temporary hadoop backup cluster - https://phabricator.wikimedia.org/T260411 (elukey)
[11:28:27] Kids are at home because of snow - I'll be on-and-off this afternoon
[13:23:26] (PS1) Joal: [WIP] Move mediarequest oozie job to sparksql [analytics/refinery] - https://gerrit.wikimedia.org/r/663575 (https://phabricator.wikimedia.org/T274322)
[13:33:35] PROBLEM - At least one Hadoop HDFS NameNode is active on an-worker1118 is CRITICAL: Hadoop Active NameNode CRITICAL: no namenodes are active https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23No_active_HDFS_Namenode_running
[13:34:32] weird, I am testing some nn changes but didn't expect this
[13:34:39] I am doing changes on the test cluster only
[13:34:42] checking in a sec
[13:54:42] I suspect that the test and backup cluster shared the zk state for some reason
[13:54:44] heya elukey - Would you have a minute?
[13:54:54] wut? weird
[13:55:15] joal: if not urgent lemme check hadoop first
[13:57:18] mmm in theory no, so why the namenodes went down
[13:57:46] :(
[14:03:07] there was a thread dump on the namenodes, but it was right after I issued the formatzk for the test cluster
[14:03:12] so it cannot be a coincidence
[14:03:24] the znodes are split correctly for the 3 clusters though
[14:03:44] indeed elukey - coincidence seems unlikely :(
[14:04:56] but it is weird, the zk format was issued only for test
[14:09:47] I am very confused, but the alert should resolve soon in theory, the namenode on 1118 is up
[14:11:50] joal: the good news is that the procedure to add the service port on test seems to work fine
[14:11:59] Yay!
[14:12:07] MOAR GOOD NEWZ
[14:12:31] but I am really puzzled by the backup namenode issue
[14:12:43] !log Fix oozie sharelib for spark-2.4.4 by copying oozie-sharelib-spark-4.3.0.jar onto the spark folder
[14:12:45] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[14:12:52] elukey: I just posted a CR about that --^
[14:14:26] joal: ah did the path change on an-coord1001?
[14:14:27] elukey: could you please restart oozie?
[14:14:37] elukey: indeed it changed
[14:14:49] joal: feel free to restart it
[14:14:53] ack
[14:15:23] elukey: sudo systemctl oozie restart ?
[14:15:35] yep
[14:15:38] nope sorry
[14:15:40] restart oozie
[14:15:55] Okey :)
[14:16:19] !log Restart oozie after having fixed the spark-2.4.4 sharelib
[14:16:20] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[14:19:13] RECOVERY - At least one Hadoop HDFS NameNode is active on an-worker1118 is OK: Hadoop Active NameNode OKAY: an-worker1118-eqiad-wmnet https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23No_active_HDFS_Namenode_running
[14:20:05] !log Rerun failed clickstream instance 2021-01 after sharelib fix
[14:20:09] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[14:22:04] Analytics-Clusters: Upgrade the Hadoop Analytics cluster to BigTop - https://phabricator.wikimedia.org/T255142 (nshahquinn-wmf)
[14:22:15] Analytics, Patch-For-Review: Upgrade the Analytics Hadoop cluster to Apache Bigtop - https://phabricator.wikimedia.org/T273711 (nshahquinn-wmf)
[14:23:18] Analytics-Clusters, User-Elukey: Update to CDH 6 or other up-to-date Hadoop distribution - https://phabricator.wikimedia.org/T203693 (nshahquinn-wmf)
[14:23:24] Analytics, Patch-For-Review: Upgrade the Analytics Hadoop cluster to Apache Bigtop - https://phabricator.wikimedia.org/T273711 (nshahquinn-wmf)
[14:23:27] Analytics-Clusters: Upgrade the Hadoop Analytics cluster to BigTop - https://phabricator.wikimedia.org/T255142 (nshahquinn-wmf)
[14:24:27] Analytics, Patch-For-Review: Upgrade the Analytics Hadoop cluster to Apache Bigtop - https://phabricator.wikimedia.org/T273711 (nshahquinn-wmf)
[14:24:31] Analytics, Analytics-Kanban: Backup HDFS data before BigTop upgrade - https://phabricator.wikimedia.org/T272846 (nshahquinn-wmf)
[14:24:33] Analytics, Analytics-Kanban: Establish what data must be backed up before the HDFS upgrade - https://phabricator.wikimedia.org/T260409 (nshahquinn-wmf)
[14:24:35] Analytics-Clusters, Analytics-Kanban, Patch-For-Review: Create a temporary hadoop backup cluster - https://phabricator.wikimedia.org/T260411 (nshahquinn-wmf)
[14:24:55] Analytics-Clusters: Upgrade the Hadoop Analytics cluster to BigTop - https://phabricator.wikimedia.org/T255142 (nshahquinn-wmf)
[14:24:57] Analytics, Analytics-Kanban: Backup HDFS data before BigTop upgrade - https://phabricator.wikimedia.org/T272846 (nshahquinn-wmf)
[14:24:59] Analytics, Analytics-Kanban: Establish what data must be backed up before the HDFS upgrade - https://phabricator.wikimedia.org/T260409 (nshahquinn-wmf)
[14:25:03] Analytics-Clusters, Analytics-Kanban, Patch-For-Review: Create a temporary hadoop backup cluster - https://phabricator.wikimedia.org/T260411 (nshahquinn-wmf)
[14:26:33] Analytics, Patch-For-Review: Upgrade the Analytics Hadoop cluster to Apache Bigtop - https://phabricator.wikimedia.org/T273711 (nshahquinn-wmf)
[14:26:36] Analytics-Clusters, Analytics-Kanban, Patch-For-Review, User-Elukey: Upgrade the Hadoop test cluster to BigTop - https://phabricator.wikimedia.org/T244499 (nshahquinn-wmf)
[14:26:39] !log Restart oozie API job after spark sharelib fix (start: 2021-02-10T18:00)
[14:26:40] Analytics-Clusters, User-Elukey: Update to CDH 6 or other up-to-date Hadoop distribution - https://phabricator.wikimedia.org/T203693 (nshahquinn-wmf)
[14:26:42] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[14:26:42] Analytics, Patch-For-Review: Upgrade the Analytics Hadoop cluster to Apache Bigtop - https://phabricator.wikimedia.org/T273711 (nshahquinn-wmf)
[14:26:44] Analytics-Clusters, Analytics-Kanban, Patch-For-Review, User-Elukey: Upgrade the Hadoop test cluster to BigTop - https://phabricator.wikimedia.org/T244499 (nshahquinn-wmf)
[14:27:13] Analytics-Clusters, User-Elukey: Update to CDH 6 or other up-to-date Hadoop distribution - https://phabricator.wikimedia.org/T203693 (nshahquinn-wmf)
[14:27:15] Analytics, Patch-For-Review: Upgrade the Analytics Hadoop cluster to Apache Bigtop - https://phabricator.wikimedia.org/T273711 (nshahquinn-wmf)
[14:39:41] Analytics, Analytics-Kanban, Patch-For-Review: Clean up issues with jobs after Hadoop Upgrade - https://phabricator.wikimedia.org/T274322 (JAllemandou)
[14:47:19] (CR) Elukey: "Should we merge this? :)" [analytics/refinery] - https://gerrit.wikimedia.org/r/663245 (https://phabricator.wikimedia.org/T274322) (owner: Mforns)
[14:48:03] (PS2) Joal: [WIP] Move mediarequest oozie job to sparksql [analytics/refinery] - https://gerrit.wikimedia.org/r/663575 (https://phabricator.wikimedia.org/T274322)
[14:49:42] I cannot find a reason why the backup cluster's namenodes were down
[14:55:40] ok I am trying to apply the same procedure to the backup cluster to split the rpc queues
[15:01:56] heya teammmm
[15:02:25] elukey: https://gerrit.wikimedia.org/r/c/analytics/refinery/+/663245/ yes please :]
[15:02:55] PROBLEM - At least one Hadoop HDFS NameNode is active on an-worker1118 is CRITICAL: Hadoop Active NameNode CRITICAL: no namenodes are active https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23No_active_HDFS_Namenode_running
[15:03:13] elukey: are we planning to deploy refinery soon? otherwise, when merged, I will copy these to hdfs to get rid of the alarms
[15:04:15] mforns: as you prefer!
[15:04:30] elukey: ok, will copy to HDFS
[15:04:36] when merged
[15:05:13] RECOVERY - At least one Hadoop HDFS NameNode is active on an-worker1118 is OK: Hadoop Active NameNode OKAY: an-worker1124-eqiad-wmnet https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23No_active_HDFS_Namenode_running
[15:05:22] this is me --^
[15:05:37] PROBLEM - Hadoop Namenode - Primary on an-worker1118 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.hdfs.server.namenode.NameNode https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Administration
[15:06:07] PROBLEM - Hadoop HDFS Zookeeper failover controller on an-worker1118 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.hdfs.tools.DFSZKFailoverController https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Administration
[15:08:06] it will be up in a sec
[15:08:27] RECOVERY - Hadoop HDFS Zookeeper failover controller on an-worker1118 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.hdfs.tools.DFSZKFailoverController https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Administration
[15:10:15] RECOVERY - Hadoop Namenode - Primary on an-worker1118 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.hdfs.server.namenode.NameNode https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Administration
[15:13:08] I am very puzzled
[15:13:43] the only suspicion that I have is that the backup cluster's namenodes were in a strange state after the recent disk partition filled up, and some change in zookeeper caused them to shake
[15:13:51] but it is not a good answer
[15:13:58] from the logs I cannot really find anything
[15:23:54] * elukey bbiabi
[15:59:59] mforns: fdans: running a minute late
[16:04:57] a-team are we skipping standup/grooming today for tech dept update?
[16:05:47] ottomata: o/ fine for me! Maybe let's send e-scrum in case?
[16:07:55] Analytics, SRE, SRE-Access-Requests: Add kzeta to analytics-privatedata-users - https://phabricator.wikimedia.org/T272982 (kzimmerman) Thanks @elukey ! I'm able to access the data that I couldn't earlier :)
[16:28:38] Ugh. I can't brain today. I have the dumb.
[16:28:59] Trying to write YAML from scratch with my editor settings hosed doesn't help.
[16:29:15] +1 ottomata
[16:31:00] klausman: how many spaces indent? :/
[16:31:57] tabs. it was tabs
[16:32:14] Or more precisely (and worse) mixed spaces and tabs.
[16:32:26] /facepalm
[16:34:28] It's one of those afternoons where you question everything you think you know about computers
[17:03:00] a-team: so no standup right?
[17:03:06] yep
[17:03:09] I'm in the dept meeting
[17:03:13] perfect
[17:21:18] joal: very nice https://grafana.wikimedia.org/d/000000585/hadoop?viewPanel=57&orgId=1&var-datasource=eqiad%20prometheus%2Fanalytics&var-hadoop_cluster=analytics-backup-hadoop&var-worker=All
[17:21:58] Wow!
[17:22:01] How come?
[17:22:30] Ah! I get it
[17:22:42] sorry elukey - I read that incorrectly at first :)
[17:23:02] well - Looks like you have nailed it elukey :)
[17:23:10] * joal claps to elukey - again :)
[17:23:24] not really sure why the backup cluster did that weird thing
[17:23:36] but joal we can upgrade the main cluster next week if you are ok
[17:23:49] Yessir!
[17:24:01] ack :)
[17:29:53] (CR) Lex Nasser: Fix unit tests that ensure certain requests fail and clean up all unit tests (1 comment) [analytics/aqs] - https://gerrit.wikimedia.org/r/662821 (https://phabricator.wikimedia.org/T273404) (owner: Lex Nasser)
[17:48:56] lexnasser: o/ I am still trying to create the aqs cluster to test, puppet is not really flexible for this use case so it may take a little more :(
[17:53:52] * elukey afk!
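Note on the YAML exchange at [16:28:59] to [16:32:14]: YAML does not permit tab characters in indentation, so mixed tabs and spaces of the kind described there will typically fail to parse or parse unexpectedly. A minimal sketch of a pre-check in Python, standard library only; the function name `find_tab_indented_lines` and the sample text are hypothetical illustrations, not anything used in the channel:

```python
def find_tab_indented_lines(text):
    """Return 1-based line numbers whose leading whitespace contains a tab."""
    bad = []
    for number, line in enumerate(text.splitlines(), start=1):
        # Leading whitespace is everything before the first non-blank character.
        indent = line[: len(line) - len(line.lstrip(" \t"))]
        if "\t" in indent:
            bad.append(number)
    return bad


# Hypothetical YAML-like sample: line 3 starts with a tab, line 4 mixes spaces and a tab.
sample = "jobs:\n  - name: refine\n\t  schedule: hourly\n  \tretries: 3\n"
print(find_tab_indented_lines(sample))
```

The check only looks at leading whitespace, so tabs inside values are not flagged; it is a quick sanity pass before handing the file to a real YAML parser, not a replacement for one.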
[18:00:48] a-team not for grooming or anything but if anyone wants some hangout time I'll be in the batcave right after this
[18:01:40] I've got a screaming baby so I can join but not talk
[18:01:49] will join in 3 mins
[18:50:42] Quarry: Add a possibility to delete a draft - https://phabricator.wikimedia.org/T135908 (CristianCantoro) My 2cents: I created a new query by mistake, it is a draft and the fact that I cannot delete it is super annoying. I am ok with the idea of not deleting published queries. When you click publish you kno...
[19:34:41] Analytics-Radar, Datasets-Archiving, Research: Make HTML dumps available - https://phabricator.wikimedia.org/T182351 (RBrounley_WMF) Hi @fkaelin - it's nice to meet you, sounds like there are a lot of overlaps in your thinking and ours. On Okapi, in general, we are working on some things that may be...
[20:13:04] Analytics, Code-Health-Objective, Epic, Platform Engineering Roadmap, Platform Team Initiatives (API Gateway): AQS 2.0 - https://phabricator.wikimedia.org/T263489 (Milimetric) Ping @Pchelolo, @lexnasser was looking at this as the next thing he might focus on. I hesitated to ping before becau...
[21:39:23] (PS2) Ebernhardson: refinery-drop-hive-partitions: Ensure verbose logging goes somewhere [analytics/refinery] - https://gerrit.wikimedia.org/r/661799
[21:54:41] Analytics-Radar, Datasets-Archiving, Research: Make HTML dumps available - https://phabricator.wikimedia.org/T182351 (fkaelin) To summarize my understanding: - for research, the html history is interesting because it expands templates and lua modules - for a revision of page p created at time t, we p...
[23:48:21] Analytics-Radar, Datasets-Archiving, Research: Make HTML dumps available - https://phabricator.wikimedia.org/T182351 (Ottomata) Templates are stored in wikitext (right...are they?). If so, I wonder if [[ https://wikitech.wikimedia.org/wiki/Analytics/Data_Lake/Edits/MediaWiki_history | mediawiki hist...
[23:53:48] (PS1) Eric Gardner: Update schema to handle quickview playback events [schemas/event/secondary] - https://gerrit.wikimedia.org/r/663703 (https://phabricator.wikimedia.org/T263154)