[00:05:04] <milimetric>	 !log deployed refinery and synced to hdfs, restarting cassandra jobs gently
[00:05:08] <stashbot>	 Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[00:45:01] <icinga-wm>	 RECOVERY - Check the last execution of monitor_refine_eventlogging_legacy_failure_flags on an-launcher1002 is OK: OK: Status of the systemd unit monitor_refine_eventlogging_legacy_failure_flags https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[00:58:59] <icinga-wm>	 PROBLEM - Check the last execution of monitor_refine_eventlogging_legacy_failure_flags on an-launcher1002 is CRITICAL: CRITICAL: Status of the systemd unit monitor_refine_eventlogging_legacy_failure_flags https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[02:06:26] <wikibugs>	 (PS1) Milimetric: Fix spelling error in mediacounts job [analytics/refinery] - https://gerrit.wikimedia.org/r/663364
[02:06:38] <wikibugs>	 (CR) Milimetric: [V: +2 C: +2] Fix spelling error in mediacounts job [analytics/refinery] - https://gerrit.wikimedia.org/r/663364 (owner: Milimetric)
[02:19:35] <milimetric>	 !log deployed again to fix old spelling error :) referererererer
[02:19:38] <stashbot>	 Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[02:43:17] <wikibugs>	 (PS1) Milimetric: [WIP] The mediarequest per file job had a syntax error that I fix here, but it also has the UNION ALL syntax that I understand doesn't work.  In this case, it would be prohibitively expensive to use UNION, it's a LOT of data.  So I'm leaving this fix for tomorrow. [analytics/refinery] - https://gerrit.wikimedia.org/r/663372 (https://phabricator.wikimedia.org/T274322)
[02:49:12] <wikibugs>	 Analytics, Analytics-Kanban, Patch-For-Review: Clean up issues with jobs after Hadoop Upgrade - https://phabricator.wikimedia.org/T274322 (Milimetric)
[02:52:41] <wikibugs>	 Analytics, Analytics-Kanban, Patch-For-Review: Clean up issues with jobs after Hadoop Upgrade - https://phabricator.wikimedia.org/T274322 (Milimetric) Update: all cassandra jobs restarted and seem ok, except mediarequests per_file daily.  Patch for that WIP above.  See note in description, when figur...
[02:54:47] <milimetric>	 big kudos to Jo, the refinery-source patch he sent made all the cassandra jobs run.  Update on the task above^.  I'll be out tomorrow morning, call me on my cell if you need me.
[02:59:54] <wikibugs>	 Analytics, Analytics-Kanban, Patch-For-Review: Clean up issues with jobs after Hadoop Upgrade - https://phabricator.wikimedia.org/T274322 (Milimetric)
[06:08:32] <wikibugs>	 Analytics-Radar, Datasets-Archiving, Research: Make HTML dumps available - https://phabricator.wikimedia.org/T182351 (ArielGlenn) >>! In T182351#6820912, @fkaelin wrote: > @ArielGlenn, the dataset should contain the rendered html for all revisions, rendered with the mediawiki version at the time the...
[06:59:15] <elukey>	 goood morning
[06:59:24] <elukey>	 wow so refinery source without cdh deps!
[07:44:03] <joal>	 Good morning elukey :)
[07:44:11] <joal>	 let me know when you're ready for a sync
[07:44:36] <elukey>	 bonjour :)
[07:48:06] * joal dances - cassandra jobs are back up
[07:48:14] <elukey>	 nice!
[08:18:17] <elukey>	 joal: we can sync if you want
[08:18:46] <joal>	 heya elukey - joining the cave
[08:37:26] <wikibugs>	 Analytics, PM: Fix Analytics workflow for #Analytics-EventLogging tasks - https://phabricator.wikimedia.org/T274490 (Aklapper)
[09:05:36] <wikibugs>	 Analytics, SRE, SRE-Access-Requests, Patch-For-Review: Add kzeta to analytics-privatedata-users - https://phabricator.wikimedia.org/T272982 (elukey) Open→Resolved a:elukey @kzimmerman should be done! Let me know if you still have issues :)
[09:05:45] <wikibugs>	 Analytics, SRE, SRE-Access-Requests, Patch-For-Review: Add kzeta to analytics-privatedata-users - https://phabricator.wikimedia.org/T272982 (elukey)
[09:15:24] <elukey>	 going to be afk for a bit since I have some workers at home, if needed ping me on the phone :)
[10:07:34] <elukey>	 I am running puppet on all hosts to move the cdh module to bigtop
[10:08:06] <elukey>	 so so happy that we have another Apache project :)
[10:09:54] <joal>	 \o/
[11:12:20] <wikibugs>	 Analytics-Clusters, Patch-For-Review: Create a temporary hadoop backup cluster - https://phabricator.wikimedia.org/T260411 (elukey)
[11:12:40] <wikibugs>	 Analytics-Clusters, Analytics-Kanban, Patch-For-Review: Create a temporary hadoop backup cluster - https://phabricator.wikimedia.org/T260411 (elukey)
[11:28:27] <joal>	 Kids are at home because of snow - I'll be on-and-off this afternoon
[13:23:26] <wikibugs>	 (PS1) Joal: [WIP] Move mediarequest oozie job to sparksql [analytics/refinery] - https://gerrit.wikimedia.org/r/663575 (https://phabricator.wikimedia.org/T274322)
[13:33:35] <icinga-wm>	 PROBLEM - At least one Hadoop HDFS NameNode is active on an-worker1118 is CRITICAL: Hadoop Active NameNode CRITICAL: no namenodes are active https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23No_active_HDFS_Namenode_running
[13:34:32] <elukey>	 weird, I am testing some nn changes but didn't expect this
[13:34:39] <elukey>	 I am doing changes on the test cluster only
[13:34:42] <elukey>	 checking in a sec
[13:54:42] <elukey>	 I suspect that the test and backup cluster shared the zk state for some reason
[13:54:44] <joal>	 heya elukey - Would you have a minute?
[13:54:54] <joal>	 wut? weird
[13:55:15] <elukey>	 joal: if not urgent lemme check hadoop first
[13:57:18] <elukey>	 mmm in theory no, so why the namenodes went down
[13:57:46] <joal>	 :(
[14:03:07] <elukey>	 there was a thread dump on the namenodes, but it was right after I issued the formatzk for the test cluster
[14:03:12] <elukey>	 so it cannot be a coincidence
[14:03:24] <elukey>	 the znodes are split correctly for the 3 clusters though
[14:03:44] <joal>	 indeed elukey - coincidence seems unlikely :(
[14:04:56] <elukey>	 but it is weird, the zk format was issued only for test
[14:09:47] <elukey>	 I am very confused, but the alert should resolve soon in theory, the namenode on 1118 is up
[14:11:50] <elukey>	 joal: the good news is that the procedure to add the service port on test seems to work fine
[14:11:59] <joal>	 Yay!
[14:12:07] <joal>	 MOAR GOOD NEWZ
[14:12:31] <elukey>	 but I am really puzzled by the backup namenode issue
[14:12:43] <joal>	 !log Fix oozie sharelib for spark-2.4.4 by copying oozie-sharelib-spark-4.3.0.jar onto the spark folder
[14:12:45] <stashbot>	 Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[14:12:52] <joal>	 elukey: I just posted a CR about that --^
[14:14:26] <elukey>	 joal: ah did the path change on an-coord1001?
[14:14:27] <joal>	 elukey: could you please restart oozie?
[14:14:37] <joal>	 elukey: indeed it changed
[14:14:49] <elukey>	 joal: feel free to restart it
[14:14:53] <joal>	 ack
[14:15:23] <joal>	 elukey: sudo systemctl oozie restart ?
[14:15:35] <elukey>	 yep
[14:15:38] <elukey>	 nope sorry
[14:15:40] <elukey>	 restart oozie
[14:15:55] <joal>	 Okey :)
[14:16:19] <joal>	 !log Restart oozie after having fixed the spark-2.4.4 sharelib
[14:16:20] <stashbot>	 Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[14:19:13] <icinga-wm>	 RECOVERY - At least one Hadoop HDFS NameNode is active on an-worker1118 is OK: Hadoop Active NameNode OKAY: an-worker1118-eqiad-wmnet https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23No_active_HDFS_Namenode_running
[14:20:05] <joal>	 !log Rerun failed clickstream instance 2021-01 after sharelib fix
[14:20:09] <stashbot>	 Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[14:22:04] <wikibugs>	 Analytics-Clusters: Upgrade the Hadoop Analytics cluster to BigTop - https://phabricator.wikimedia.org/T255142 (nshahquinn-wmf)
[14:22:15] <wikibugs>	 Analytics, Patch-For-Review: Upgrade the Analytics Hadoop cluster to Apache Bigtop - https://phabricator.wikimedia.org/T273711 (nshahquinn-wmf)
[14:23:18] <wikibugs>	 Analytics-Clusters, User-Elukey: Update to CDH 6 or other up-to-date Hadoop distribution - https://phabricator.wikimedia.org/T203693 (nshahquinn-wmf)
[14:23:24] <wikibugs>	 Analytics, Patch-For-Review: Upgrade the Analytics Hadoop cluster to Apache Bigtop - https://phabricator.wikimedia.org/T273711 (nshahquinn-wmf)
[14:23:27] <wikibugs>	 Analytics-Clusters: Upgrade the Hadoop Analytics cluster to BigTop - https://phabricator.wikimedia.org/T255142 (nshahquinn-wmf)
[14:24:27] <wikibugs>	 Analytics, Patch-For-Review: Upgrade the Analytics Hadoop cluster to Apache Bigtop - https://phabricator.wikimedia.org/T273711 (nshahquinn-wmf)
[14:24:31] <wikibugs>	 Analytics, Analytics-Kanban: Backup HDFS data before BigTop upgrade - https://phabricator.wikimedia.org/T272846 (nshahquinn-wmf)
[14:24:33] <wikibugs>	 Analytics, Analytics-Kanban: Establish what data must be backed up before the HDFS upgrade - https://phabricator.wikimedia.org/T260409 (nshahquinn-wmf)
[14:24:35] <wikibugs>	 Analytics-Clusters, Analytics-Kanban, Patch-For-Review: Create a temporary hadoop backup cluster - https://phabricator.wikimedia.org/T260411 (nshahquinn-wmf)
[14:24:55] <wikibugs>	 Analytics-Clusters: Upgrade the Hadoop Analytics cluster to BigTop - https://phabricator.wikimedia.org/T255142 (nshahquinn-wmf)
[14:24:57] <wikibugs>	 Analytics, Analytics-Kanban: Backup HDFS data before BigTop upgrade - https://phabricator.wikimedia.org/T272846 (nshahquinn-wmf)
[14:24:59] <wikibugs>	 Analytics, Analytics-Kanban: Establish what data must be backed up before the HDFS upgrade - https://phabricator.wikimedia.org/T260409 (nshahquinn-wmf)
[14:25:03] <wikibugs>	 Analytics-Clusters, Analytics-Kanban, Patch-For-Review: Create a temporary hadoop backup cluster - https://phabricator.wikimedia.org/T260411 (nshahquinn-wmf)
[14:26:33] <wikibugs>	 Analytics, Patch-For-Review: Upgrade the Analytics Hadoop cluster to Apache Bigtop - https://phabricator.wikimedia.org/T273711 (nshahquinn-wmf)
[14:26:36] <wikibugs>	 Analytics-Clusters, Analytics-Kanban, Patch-For-Review, User-Elukey: Upgrade the Hadoop test cluster to BigTop - https://phabricator.wikimedia.org/T244499 (nshahquinn-wmf)
[14:26:39] <joal>	 !log Restart oozie API job after spark sharelib fix (start: 2021-02-10T18:00)
[14:26:40] <wikibugs>	 Analytics-Clusters, User-Elukey: Update to CDH 6 or other up-to-date Hadoop distribution - https://phabricator.wikimedia.org/T203693 (nshahquinn-wmf)
[14:26:42] <stashbot>	 Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[14:26:42] <wikibugs>	 Analytics, Patch-For-Review: Upgrade the Analytics Hadoop cluster to Apache Bigtop - https://phabricator.wikimedia.org/T273711 (nshahquinn-wmf)
[14:26:44] <wikibugs>	 Analytics-Clusters, Analytics-Kanban, Patch-For-Review, User-Elukey: Upgrade the Hadoop test cluster to BigTop - https://phabricator.wikimedia.org/T244499 (nshahquinn-wmf)
[14:27:13] <wikibugs>	 Analytics-Clusters, User-Elukey: Update to CDH 6 or other up-to-date Hadoop distribution - https://phabricator.wikimedia.org/T203693 (nshahquinn-wmf)
[14:27:15] <wikibugs>	 Analytics, Patch-For-Review: Upgrade the Analytics Hadoop cluster to Apache Bigtop - https://phabricator.wikimedia.org/T273711 (nshahquinn-wmf)
[14:39:41] <wikibugs>	 Analytics, Analytics-Kanban, Patch-For-Review: Clean up issues with jobs after Hadoop Upgrade - https://phabricator.wikimedia.org/T274322 (JAllemandou)
[14:47:19] <wikibugs>	 (CR) Elukey: "Should we merge this? :)" [analytics/refinery] - https://gerrit.wikimedia.org/r/663245 (https://phabricator.wikimedia.org/T274322) (owner: Mforns)
[14:48:03] <wikibugs>	 (PS2) Joal: [WIP] Move mediarequest oozie job to sparksql [analytics/refinery] - https://gerrit.wikimedia.org/r/663575 (https://phabricator.wikimedia.org/T274322)
[14:49:42] <elukey>	 I cannot find a reason why the backup cluster's namenodes were down
[14:55:40] <elukey>	 ok I am trying to apply the same procedure to the backup cluster to split the rpc queues
[15:01:56] <mforns>	 heya teammmm
[15:02:25] <mforns>	 elukey: https://gerrit.wikimedia.org/r/c/analytics/refinery/+/663245/ yes please :]
[15:02:55] <icinga-wm>	 PROBLEM - At least one Hadoop HDFS NameNode is active on an-worker1118 is CRITICAL: Hadoop Active NameNode CRITICAL: no namenodes are active https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23No_active_HDFS_Namenode_running
[15:03:13] <mforns>	 elukey: are we planning to deploy refinery soon? otherwise, when merged, I will copy these to hdfs to get rid of the alarms
[15:04:15] <elukey>	 mforns: as you prefer!
[15:04:30] <mforns>	 elukey: ok, will copy to HDFS
[15:04:36] <mforns>	 when merged
[15:05:13] <icinga-wm>	 RECOVERY - At least one Hadoop HDFS NameNode is active on an-worker1118 is OK: Hadoop Active NameNode OKAY: an-worker1124-eqiad-wmnet https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23No_active_HDFS_Namenode_running
[15:05:22] <elukey>	 this is me --^
[15:05:37] <icinga-wm>	 PROBLEM - Hadoop Namenode - Primary on an-worker1118 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.hdfs.server.namenode.NameNode https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Administration
[15:06:07] <icinga-wm>	 PROBLEM - Hadoop HDFS Zookeeper failover controller on an-worker1118 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.hdfs.tools.DFSZKFailoverController https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Administration
[15:08:06] <elukey>	 it will be up in a sec
[15:08:27] <icinga-wm>	 RECOVERY - Hadoop HDFS Zookeeper failover controller on an-worker1118 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.hdfs.tools.DFSZKFailoverController https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Administration
[15:10:15] <icinga-wm>	 RECOVERY - Hadoop Namenode - Primary on an-worker1118 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.hdfs.server.namenode.NameNode https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Administration
[15:13:08] <elukey>	 I am very puzzled
[15:13:43] <elukey>	 the only suspicion that I have is that the backup cluster's namenodes were in a strange state after the recent disk partition filled up, and some change in zookeeper caused them to shake
[15:13:51] <elukey>	 but it is not a good answer
[15:13:58] <elukey>	 from the logs I cannot really find anything
[15:23:54] * elukey bbiabi
[15:59:59] <sukhe>	 mforns: fdans: running a minute late 
[16:04:57] <ottomata>	 a-team are we skipping standup/grooming today for tech dept update?
[16:05:47] <elukey>	 ottomata: o/ fine for me! Maybe let's send e-scrum in case?
[16:07:55] <wikibugs>	 Analytics, SRE, SRE-Access-Requests: Add kzeta to analytics-privatedata-users - https://phabricator.wikimedia.org/T272982 (kzimmerman) Thanks @elukey ! I'm able to access the data that I couldn't earlier :)
[16:28:38] <klausman>	 Ugh. I can't brain today. I have the dumb. 
[16:28:59] <klausman>	 Trying to write YAML from scratch with my editor settings hosed doesn't help.
[16:29:15] <joal>	 +1 ottomata 
[16:31:00] <joal>	 klausman: how many spaces indent? :/
[16:31:57] <klausman>	 tabs. it was tabs
[16:32:14] <klausman>	 Or more precisely (and worse) mixed spaces and tabs.
[16:32:26] <joal>	  /facepalm
[16:34:28] <klausman>	 It's one of those afternoons where you question everything you think you know about computers
[17:03:00] <elukey>	 a-team: so no standup right?
[17:03:06] <razzi>	 yep
[17:03:09] <joal>	 I'm in the dept meeting
[17:03:13] <elukey>	 perfect
[17:21:18] <elukey>	 joal: very nice https://grafana.wikimedia.org/d/000000585/hadoop?viewPanel=57&orgId=1&var-datasource=eqiad%20prometheus%2Fanalytics&var-hadoop_cluster=analytics-backup-hadoop&var-worker=All
[17:21:58] <joal>	 Wow!
[17:22:01] <joal>	 How come?
[17:22:30] <joal>	 Ah! I get it
[17:22:42] <joal>	 sorry elukey - I read that incorrectly at first :)
[17:23:02] <joal>	 well - Looks like you have nailed it elukey :)
[17:23:10] * joal claps to elukey  - again :)
[17:23:24] <elukey>	 not really sure why the backup cluster did that weird thing
[17:23:36] <elukey>	 but joal we can upgrade the main cluster next week if you are ok
[17:23:49] <joal>	 Yessir!
[17:24:01] <elukey>	 ack :)
[17:29:53] <wikibugs>	 (CR) Lex Nasser: Fix unit tests that ensure certain requests fail and clean up all unit tests (1 comment) [analytics/aqs] - https://gerrit.wikimedia.org/r/662821 (https://phabricator.wikimedia.org/T273404) (owner: Lex Nasser)
[17:48:56] <elukey>	 lexnasser: o/ I am still trying to create the aqs cluster to test, puppet is not really flexible for this use case so it may take a little more :(
[17:53:52] * elukey afk!
[18:00:48] <fdans>	 a-team not for grooming or anything but if anyone wants some hangout time I'll be in the batcave right after this
[18:01:40] <milimetric>	 I've got a screaming baby so I can join but not talk
[18:01:49] <joal>	 will join in 3 mins
[18:50:42] <wikibugs>	 Quarry: Add a possibility to delete a draft - https://phabricator.wikimedia.org/T135908 (CristianCantoro) My 2cents: I created a new query by mistake, it is a draft and the fact that I cannot delete it is super annoying. I am ok with the idea of not deleting published queries.  When you click publish you kno...
[19:34:41] <wikibugs>	 Analytics-Radar, Datasets-Archiving, Research: Make HTML dumps available - https://phabricator.wikimedia.org/T182351 (RBrounley_WMF) Hi @fkaelin - it's nice to meet you, sounds like there are a lot of overlaps in your thinking and ours. On Okapi, in general, we are working on some things that may be...
[20:13:04] <wikibugs>	 Analytics, Code-Health-Objective, Epic, Platform Engineering Roadmap, Platform Team Initiatives (API Gateway): AQS 2.0 - https://phabricator.wikimedia.org/T263489 (Milimetric) Ping @Pchelolo, @lexnasser was looking at this as the next thing he might focus on.  I hesitated to ping before becau...
[21:39:23] <wikibugs>	 (PS2) Ebernhardson: refinery-drop-hive-partitions: Ensure verbose logging goes somewhere [analytics/refinery] - https://gerrit.wikimedia.org/r/661799
[21:54:41] <wikibugs>	 Analytics-Radar, Datasets-Archiving, Research: Make HTML dumps available - https://phabricator.wikimedia.org/T182351 (fkaelin) To summarize my understanding: - for research, the html history is interesting because it expands templates and lua modules - for a revision of page p created at time t, we p...
[23:48:21] <wikibugs>	 Analytics-Radar, Datasets-Archiving, Research: Make HTML dumps available - https://phabricator.wikimedia.org/T182351 (Ottomata) Templates are stored in wikitext (right...are they?).  If so, I wonder if [[ https://wikitech.wikimedia.org/wiki/Analytics/Data_Lake/Edits/MediaWiki_history | mediawiki hist...
[23:53:48] <wikibugs>	 (PS1) Eric Gardner: Update schema to handle quickview playback events [schemas/event/secondary] - https://gerrit.wikimedia.org/r/663703 (https://phabricator.wikimedia.org/T263154)