[00:02:59] <icinga-wm>	 RECOVERY - Check the last execution of eventlogging_to_druid_editattemptstep_hourly on an-launcher1002 is OK: OK: Status of the systemd unit eventlogging_to_druid_editattemptstep_hourly https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[00:05:47] <icinga-wm>	 RECOVERY - Check the last execution of eventlogging_to_druid_navigationtiming_hourly on an-launcher1002 is OK: OK: Status of the systemd unit eventlogging_to_druid_navigationtiming_hourly https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[00:09:35] <wikibugs>	 10Analytics, 10Patch-For-Review: Upgrade the Analytics Hadoop cluster to Apache Bigtop - https://phabricator.wikimedia.org/T273711 (10Ottomata) > refine job has an issue with URI being told relative while it seems absolute (Andrew is still working on it)  I think i fixed it.  There is some very strange problem...
[00:12:43] <icinga-wm>	 RECOVERY - Check the last execution of eventlogging_to_druid_prefupdate_hourly on an-launcher1002 is OK: OK: Status of the systemd unit eventlogging_to_druid_prefupdate_hourly https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[00:23:44] <icinga-wm>	 PROBLEM - Check the last execution of eventlogging_to_druid_prefupdate_hourly on an-launcher1002 is CRITICAL: CRITICAL: Status of the systemd unit eventlogging_to_druid_prefupdate_hourly https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[00:25:06] <icinga-wm>	 PROBLEM - Check the last execution of eventlogging_to_druid_editattemptstep_hourly on an-launcher1002 is CRITICAL: CRITICAL: Status of the systemd unit eventlogging_to_druid_editattemptstep_hourly https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[00:28:10] <icinga-wm>	 PROBLEM - Check the last execution of eventlogging_to_druid_navigationtiming_hourly on an-launcher1002 is CRITICAL: CRITICAL: Status of the systemd unit eventlogging_to_druid_navigationtiming_hourly https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[00:40:11] <wikibugs>	 10Analytics, 10Analytics-Kanban: Clean up issues with oozie jobs after Hadoop Upgrade - https://phabricator.wikimedia.org/T274322 (10Milimetric)
[00:40:28] <wikibugs>	 (03PS1) 10Milimetric: Clean up jobs after Hadoop upgrade [analytics/refinery] - 10https://gerrit.wikimedia.org/r/663065 (https://phabricator.wikimedia.org/T274322)
[00:41:24] <icinga-wm>	 PROBLEM - Check the last execution of monitor_refine_eventlogging_legacy on an-launcher1002 is CRITICAL: CRITICAL: Status of the systemd unit monitor_refine_eventlogging_legacy https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[00:52:16] <icinga-wm>	 PROBLEM - Check the last execution of monitor_refine_eventlogging_legacy_failure_flags on an-launcher1002 is CRITICAL: CRITICAL: Status of the systemd unit monitor_refine_eventlogging_legacy_failure_flags https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[00:53:51] <wikibugs>	 (03PS2) 10Milimetric: Clean up jobs after Hadoop upgrade [analytics/refinery] - 10https://gerrit.wikimedia.org/r/663065 (https://phabricator.wikimedia.org/T274322)
[00:54:01] <wikibugs>	 (03CR) 10Milimetric: [V: 03+2 C: 03+2] Clean up jobs after Hadoop upgrade [analytics/refinery] - 10https://gerrit.wikimedia.org/r/663065 (https://phabricator.wikimedia.org/T274322) (owner: 10Milimetric)
[01:01:36] <icinga-wm>	 RECOVERY - Check the last execution of eventlogging_to_druid_navigationtiming_hourly on an-launcher1002 is OK: OK: Status of the systemd unit eventlogging_to_druid_navigationtiming_hourly https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[01:07:05] <milimetric>	 !log deployed refinery with some fixes after BigTop upgrade, will restart three coordinators right now
[01:07:07] <stashbot>	 Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[01:08:24] <icinga-wm>	 RECOVERY - Check the last execution of eventlogging_to_druid_prefupdate_hourly on an-launcher1002 is OK: OK: Status of the systemd unit eventlogging_to_druid_prefupdate_hourly https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[01:09:50] <icinga-wm>	 RECOVERY - Check the last execution of eventlogging_to_druid_editattemptstep_hourly on an-launcher1002 is OK: OK: Status of the systemd unit eventlogging_to_druid_editattemptstep_hourly https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[01:18:42] <icinga-wm>	 PROBLEM - Check the last execution of monitor_refine_event on an-launcher1002 is CRITICAL: CRITICAL: Status of the systemd unit monitor_refine_event https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[01:23:48] <icinga-wm>	 PROBLEM - Check the last execution of eventlogging_to_druid_editattemptstep_hourly on an-launcher1002 is CRITICAL: CRITICAL: Status of the systemd unit eventlogging_to_druid_editattemptstep_hourly https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[01:34:00] <icinga-wm>	 PROBLEM - Check the last execution of eventlogging_to_druid_prefupdate_hourly on an-launcher1002 is CRITICAL: CRITICAL: Status of the systemd unit eventlogging_to_druid_prefupdate_hourly https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[01:58:30] <icinga-wm>	 PROBLEM - Check the last execution of monitor_refine_event_failure_flags on an-launcher1002 is CRITICAL: CRITICAL: Status of the systemd unit monitor_refine_event_failure_flags https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[02:02:58] <wikibugs>	 (03PS1) 10Milimetric: Clean up jobs after Hadoop Upgrade [analytics/refinery] - 10https://gerrit.wikimedia.org/r/663096 (https://phabricator.wikimedia.org/T274322)
[02:08:46] <icinga-wm>	 RECOVERY - Check the last execution of eventlogging_to_druid_prefupdate_hourly on an-launcher1002 is OK: OK: Status of the systemd unit eventlogging_to_druid_prefupdate_hourly https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[02:10:08] <icinga-wm>	 RECOVERY - Check the last execution of eventlogging_to_druid_editattemptstep_hourly on an-launcher1002 is OK: OK: Status of the systemd unit eventlogging_to_druid_editattemptstep_hourly https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[02:23:16] <wikibugs>	 10Analytics, 10Analytics-Kanban, 10Patch-For-Review: Clean up issues with oozie jobs after Hadoop Upgrade - https://phabricator.wikimedia.org/T274322 (10Milimetric)
[02:25:25] <wikibugs>	 10Analytics, 10Analytics-Kanban, 10Patch-For-Review: Clean up issues with oozie jobs after Hadoop Upgrade - https://phabricator.wikimedia.org/T274322 (10Milimetric) mediacounts and mediarequest have the same problem now that the syntax was worked out.  Some hours, but not all hours, fail because of the way U...
[02:26:11] <wikibugs>	 (03CR) 10Milimetric: [V: 03+2 C: 03+2] "pushing this as it fixes the mechanics of the job, but Hive fails now with the same reason as mediarequest, that weird struct UDF problem." [analytics/refinery] - 10https://gerrit.wikimedia.org/r/663096 (https://phabricator.wikimedia.org/T274322) (owner: 10Milimetric)
[02:37:13] <milimetric>	 all looks relatively ok on the cluster, left notes about what I see outstanding on the dedicated task.  I think it all boils down to packaging, poms, etc. and deploying/restarting some jobs.
[07:20:52] <elukey>	 good morning :)
[07:27:42] <icinga-wm>	 RECOVERY - Check the last execution of monitor_refine_event_failure_flags on an-launcher1002 is OK: OK: Status of the systemd unit monitor_refine_event_failure_flags https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[07:30:14] <icinga-wm>	 RECOVERY - Check the last execution of monitor_refine_event on an-launcher1002 is OK: OK: Status of the systemd unit monitor_refine_event https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[07:38:18] <icinga-wm>	 RECOVERY - Check the last execution of monitor_refine_eventlogging_legacy on an-launcher1002 is OK: OK: Status of the systemd unit monitor_refine_eventlogging_legacy https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[07:54:05] <dcausse>	 congrats for the upgrade!
[07:55:41] <elukey>	 thanks David! It took just 12 hours instead of 4, sigh :D
[07:55:59] <elukey>	 I am going to open an incident report later on to explain what happened
[07:56:07] <elukey>	 very sneaky issue
[07:56:16] <elukey>	 all good with Airflow?
[07:56:24] <dcausse>	 looking, was restarting flink
[07:56:57] <elukey>	 perfect
[07:57:14] <dcausse>	 I see mjolnir-feature_vectors-norm_query-20180215-query_explorer-20210203-spark  in yarn so I suppose airflow is running fine
[07:58:22] <elukey>	 hopefully if it doesn't scream for hive or similar we are good :)
[08:03:56] <dcausse>	 I see some jobs waiting on some event partition that I think should have populated using canary events
[08:07:40] <dcausse>	 ignore that, my bad they are waiting on recent partitions 
[09:17:32] <wikibugs>	 10Analytics, 10Analytics-Kanban: Clean up issues with oozie jobs after Hadoop Upgrade - https://phabricator.wikimedia.org/T274322 (10elukey)
[09:20:38] <wikibugs>	 10Analytics, 10Analytics-Kanban: Clean up issues with oozie jobs after Hadoop Upgrade - https://phabricator.wikimedia.org/T274322 (10elukey) @Ottomata I noticed that eventlogging_legacy refine had some REFINE_FAILED flags for some events, together with a REFINED flag. Since there was a lot of hours to re-refin...
[09:34:26] <elukey>	 brb
[10:07:10] <wikibugs>	 10Analytics, 10Event-Platform, 10Wikidata, 10Wikidata-Query-Service, 10Discovery-Search (Current work): wikimedia-event-utilities should provide tools for JVM based apps producing directly to kafka - https://phabricator.wikimedia.org/T270371 (10Gehel) 05Open→03Resolved
[10:26:58] <wikibugs>	 10Analytics: Move the puppet codebase from cdh to bigtop - https://phabricator.wikimedia.org/T274345 (10elukey)
[10:31:56] <mforns>	 hey team, looking at the data quality issue
[10:35:01] <elukey>	 mforns: hola! I wanted to have a chat with you about those, can't really understand where hive fails :(
[10:38:24] <mforns>	 elukey: looking, however I see now that hue-next is completely broken
[10:38:54] <mforns>	 it doesn't show bundles or coordinators... only workflows...
[10:39:47] <mforns>	 and even if the workflow fails, it tags it as succeeded??
[10:44:27] <elukey>	 mforns: it works for me, I can see the bundles and coords
[10:44:37] <elukey>	 did you remove your username in the search bar? 
[10:44:43] <elukey>	 they automatically add it, it is very annoying
[10:44:58] <elukey>	 hue.wikimedia.org should work more or less if next doesn't
[10:46:25] <milimetric>	 both hues work for some jobs and not for others, CLI works well regardless
[10:47:01] <elukey>	 it works fine for me
[10:47:12] <elukey>	 except the usual hue bugs
[10:47:32] <elukey>	 hello :)
[11:20:20] <wikibugs>	 10Analytics, 10Patch-For-Review: Upgrade the Analytics Hadoop cluster to Apache Bigtop - https://phabricator.wikimedia.org/T273711 (10elukey) The upgrade was completed in way more hours than expected, so I think it is good to explain exactly what happened.  ==== High level background about Hadoop and the upgra...
[11:20:38] <elukey>	 I made a summary to send to people --^
[11:23:18] <elukey>	 mforns: if you have time to proof-read it to see if I a missing anything later on I'd be super grateful :)
[11:25:33] * elukey bbiab
[12:04:01] <mforns>	 elukey: sorry I'm on and off until eu-afternoon again...
[12:04:19] <elukey>	 mforns: ahhh okok sorry I thought you were already in, np! 
[12:05:02] <mforns>	 elukey, milimetric: I did remove my username filter in hue-next, and when I click on bundles or schedules, I see none, only workflows...
[12:07:35] <elukey>	 really weird, let's check it later
[12:08:51] <mforns>	 elukey: ok, I clicked on "log out" and it works now..
[12:09:08] <mforns>	 actually "sign out"
[12:10:55] <elukey>	 ahahhahah
[12:11:01] <elukey>	 hue is the best
[12:14:18] <mforns>	 elukey: the write up looks great to me, thanks for doing that!
[12:14:55] <elukey>	 mforns: <3
[12:24:00] <elukey>	 cannot really find a good info about why the data quality hive job fails
[12:32:51] <mforns>	 elukey: looking into it, running the query just now by hand
[12:34:15] <mforns>	 elukey: I ran the query without the insert into part, and it works fine....
[12:35:47] <elukey>	 mforns: I checked in the hive logs but I don't see anything related to the specific yarn app id that fails
[12:36:17] <mforns>	 running it now with insert overwrite to mforns.data_quality_stats_incoming
[12:36:23] <elukey>	 mforns: can you also try with the insert part? 
[12:36:25] <elukey>	 ah perfect :)
[12:37:11] <mforns>	 the error I saw in the logs, is non-descriptive, but seems to happen after the query has succeeded, it could be write permits
[12:38:54] <mforns>	 it failed
[12:41:13] <elukey>	 anything useful?
[12:42:44] <mforns>	 running again
[12:43:36] <mforns>	 no, error looks cryptic to me, but I think it's permssions, the mforns.data_quality_stats_incoming base directory was owned by analytics, and I was running the query as myself
[12:43:57] <mforns>	 trying again now with correct permissions
[12:45:48] <fdans>	 good morning team
[12:47:28] <elukey>	 hola hola
[12:48:01] <mforns>	 heya fdans 
[12:48:03] <mforns>	 :]
[12:48:13] <joal>	 Hi team - siesta time, I'm online!
[12:48:20] <mforns>	 elukey: no, it failed again, was not permissions
[12:48:23] <mforns>	 hey joal :]
[12:48:47] <joal>	 I'm planning now on upgrading refinery to use hadoop and hive nex versions - Anything else I can help with?
[12:49:41] <mforns>	 joal: I think that is important, it might rid us of problems we're having no?
[12:49:50] <elukey>	 fdans: I made a summary of why it took so long to upgrade in https://phabricator.wikimedia.org/T273711#6818136 and posted it on slack and announce@, so people know
[12:50:02] <joal>	 I hope mforns - I hope this might solve the mediarequest problem
[12:50:12] <fdans>	 elukey: yea reading now!
[12:50:13] <elukey>	 joal: gogogo
[12:50:14] <mforns>	 joal: and the log4j?
[12:50:22] <elukey>	 let's nuke cdh deps :D
[12:50:45] <joal>	 mforns: I don't think that one will be solved in that way, I can elaborate, but possibly not on IRC, too long :)
[12:50:54] <mforns>	 ok, no worries
[12:51:19] <mforns>	 nevertheless, +1 on updating libs!
[12:51:55] <mforns>	 elukey: the error I'm getting with data quality is at insert time (job 11 of 11): java.io.IOException: java.lang.reflect.InvocationTargetException
[12:52:17] <mforns>	 I can not find anything in the stacktrace that rings a bell
[12:52:33] <mforns>	 but it's not permissions
[12:53:33] <joal>	 mforns: This could very well be related to version issues in using hive UDF/UDAFs
[12:54:35] <mforns>	 joal: but the query without insert overwrite works perfectly
[12:54:47] <mforns>	 the problem must be in the insert/write
[12:54:49] <joal>	 elukey: if you have some time and agree, could we work together in releasing a new spark version that would uses a more recent hadoop?
[12:54:50] <mforns>	 but not permits
[12:55:13] <elukey>	 joal: I am going to step away for a couple of hours now, but I am available later on if you are
[12:55:20] <joal>	 mforns: I hear you
[12:55:38] <joal>	 elukey: No prblem - I'll be on late, if not today, tomorrow :)
[12:56:10] <elukey>	 joal: ah wait spark version, so the most recent upstream one ships with a 2.7+ client, that may work better than what we have
[12:56:21] <elukey>	 buuut I'd defer to Andrew, I think he knows best
[12:56:34] <elukey>	 I had issues with Refine eventlogging legacy this morning, but the rest works
[12:56:42] <elukey>	 anyway, bbl! 
[12:56:50] <joal>	 Bye elukey 
[12:56:54] <mforns>	 byee
[13:08:11] <icinga-wm>	 PROBLEM - Check the last execution of produce_canary_events on an-launcher1002 is CRITICAL: CRITICAL: Status of the systemd unit produce_canary_events https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[13:08:46] <joal>	 actually mforns, tests fail for refinery-cassandra with logging error - You were right!
[13:12:04] <joal>	 Now, I have no idea how the thing worked before! Anyway...
[13:12:50] <mforns>	 joal: tests fail? I compiled source yesterday several times
[13:13:01] <joal>	 mforns: with new hadoop/hive dependencies
[13:13:08] <mforns>	 ah ah
[13:13:18] * joal bow to mforns intuitions :)
[13:13:43] <mforns>	 huhu, more luck than that
[13:22:00] <wikibugs>	 (03PS1) 10Joal: Update hadoop and hive dependdencies versions [analytics/refinery/source] - 10https://gerrit.wikimedia.org/r/663191 (https://phabricator.wikimedia.org/T274322)
[13:22:23] <wikibugs>	 (03PS2) 10Joal: Update hadoop and hive dependencies versions [analytics/refinery/source] - 10https://gerrit.wikimedia.org/r/663191 (https://phabricator.wikimedia.org/T274322)
[13:55:24] <mforns>	 oh! if I remove the union all, it does not fail
[13:57:06] <joal>	 mforns: Interesting!
[14:16:28] <wikibugs>	 (03PS3) 10Joal: Update hadoop and hive dependencies versions [analytics/refinery/source] - 10https://gerrit.wikimedia.org/r/663191 (https://phabricator.wikimedia.org/T274322)
[14:34:04] <ottomata>	 hello team!
[14:34:08] <ottomata>	 checking email etc.
[14:37:30] <wikibugs>	 (03CR) 10Elukey: [C: 03+1] "hive/hadoop version LGTM, the rest looks ok but I didn't look much into it!" [analytics/refinery/source] - 10https://gerrit.wikimedia.org/r/663191 (https://phabricator.wikimedia.org/T274322) (owner: 10Joal)
[14:44:52] <ottomata>	 hehe elukey 
[14:44:56] <ottomata>	 how did you get the whitelist for
[14:44:56] <ottomata>	 https://phabricator.wikimedia.org/T274322#6817770
[14:44:57] <ottomata>	  ?
[14:45:08] <ottomata>	 that looks like you are using the 'legacy' job to refine non migrated event tables
[14:45:37] <ottomata>	 https://github.com/wikimedia/puppet/blob/production/modules/profile/manifests/analytics/refinery/job/refine.pp#L6
[14:46:31] <ottomata>	 ok yeah i'm going to re-run the same command but ith refine_eventloggign_analytics
[14:48:07] <wikibugs>	 10Analytics, 10Analytics-Kanban, 10Patch-For-Review: Clean up issues with jobs after Hadoop Upgrade - https://phabricator.wikimedia.org/T274322 (10Ottomata)
[14:49:49] <wikibugs>	 10Analytics, 10Analytics-Kanban, 10Patch-For-Review: Clean up issues with jobs after Hadoop Upgrade - https://phabricator.wikimedia.org/T274322 (10Ottomata) > eventlogging_legacy refine had some REFINE_FAILED flags for some events, together with a REFINED flag  There were hours with both?  Hm, I guess that m...
[14:55:43] <elukey>	 hey ottomata, I got it from the monitor refine etc.. that was failing
[14:56:05] <elukey>	 ahhh snap yes I might have used the wrong one sorry :(
[14:56:13] <ottomata>	 yaa elukey and it is confusinig right now because of the migration
[14:56:20] <ottomata>	 the docs in puppet about which is which is good
[14:56:36] <elukey>	 I saw a failure for monitor_refine_eventlogging_legacy_failure_flags.service and used the wrong refine :(
[14:56:44] <ottomata>	 maybe i should add thta in wikitech, but it felt so 'temporary' and volitille that is code docs seemed better
[14:56:53] <elukey>	 nono I'll double check next time, sorry :(
[14:56:55] <ottomata>	 hm interesting.  they should use the same whitelists
[14:56:56] <ottomata>	 lemme see
[14:57:58] <ottomata>	 oh
[14:57:58] <ottomata>	 monitor_refine_eventlogging_legacy_failure_flags
[14:57:59] <ottomata>	 oops
[14:58:05] <ottomata>	 https://gerrit.wikimedia.org/r/c/operations/puppet/+/616198/
[14:58:05] <ottomata>	 hm
[14:59:39] <ottomata>	 huh
[14:59:44] <ottomata>	 ok yeah elukey interesting
[14:59:55] <ottomata>	 we use the same monitor_refine failure job for both!
[15:00:01] <ottomata>	 which is extra confusing
[15:00:21] <ottomata>	 i guess that makes sense, as we don't really want to use the table whitelist there, since there could be a problem withh the whitelist?
[15:00:23] <ottomata>	 or...do we?
[15:00:37] <ottomata>	 maybe the failure checker job should use the exact same config as the refine...
[15:00:51] <ottomata>	 although maybe its purpose is to catch tihngs that bugs in refine would miss
[15:06:57] <wikibugs>	 10Analytics: Make RefineFailuresChecker checker jobs use the same parameters as Refine jobs - https://phabricator.wikimedia.org/T274376 (10Ottomata)
[15:08:08] <ottomata>	 lookinginto camus failure reports
[15:11:34] <elukey>	 (brb)
[15:16:20] <ottomata>	 dcausse: somethign i hadn't thought of!  
[15:16:27] <ottomata>	 productin canary events  uses eventgate
[15:16:32] <ottomata>	 producing*
[15:16:45] <ottomata>	 if we don't set a destination_event_service, the job will fail
[15:16:46] <ottomata>	 hm
[15:17:02] <ottomata>	 i guess we need to set it after all, i'll add a comment to the streram config
[15:18:09] <dcausse>	 ottomata: ah, I can live without canary events if you want?
[15:18:41] <milimetric>	 hey yall
[15:18:59] <ottomata>	 dcausse: i dunno what is better.  yeahhh maybe not since you are producing directly anyway?
[15:19:00] <ottomata>	 hm
[15:19:03] <ottomata>	 yeah let's do that dcausse 
[15:19:07] <ottomata>	 we can figure it out later if we need to
[15:19:10] <dcausse>	 sure
[15:19:12] <ottomata>	 making patch
[15:19:15] <dcausse>	 thanks
[15:28:29] <joal>	 My tests with updated jars are not positive: new UDFs don't help with the mediarequest problem, and dependency resolution for cassandra leads to no log but still failing job (without logs ... hum)
[15:28:54] <joal>	 Will be back in ~2h for meetings then fix-work
[15:34:53] <fdans>	 mforns are you on the mediarequests job? do you want to pair on it?
[15:46:37] <gmodena>	 hey all! It looks like SparkSession is not automatically initialised in managed jupyter notebooks anymore. 
[15:46:45] <gmodena>	 No biggie , but I was wondering if something might be borked on my end or if it Is a known post-upgrade issue.
[15:49:18] <elukey>	 gmodena: hey! On what stat100x node are you?
[15:49:43] <mforns>	 fdans: hi! no, I'm on data quality one, but if you want to pair on any of those, I'm in
[15:51:27] <icinga-wm>	 RECOVERY - Check the last execution of produce_canary_events on an-launcher1002 is OK: OK: Status of the systemd unit produce_canary_events https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[15:53:29] <gmodena>	 elukey stat1005
[15:53:59] <gmodena>	 elukey with a fresh kerberos ticket (if relevant)
[15:55:30] <ottomata>	 nice summary elukey :)
[15:55:32] <elukey>	 gmodena: weird I don't see your unit active, did you turn it down?
[15:55:33] <ottomata>	 of problems yesterday
[15:55:39] <elukey>	 thanks :)
[15:56:09] <ottomata>	 gmodena:  do you mean the python spark notebookes?
[15:57:30] <elukey>	 from the ps it seems as if Gabriele uses a vevn not managed by our jupyterhub, possible?
[15:57:53] <elukey>	 ottomata: ignorant question - does newpyter still use the ephemeral systemd units or not?
[15:57:59] <wikibugs>	 10Analytics, 10Analytics-Kanban, 10Patch-For-Review: Clean up issues with jobs after Hadoop Upgrade - https://phabricator.wikimedia.org/T274322 (10Ottomata)
[15:59:06] <gmodena>	 elukey now I am, to test some custom init. Previously I used a managed one.
[15:59:15] <elukey>	 ahhh
[15:59:52] <gmodena>	 elukey tbh it could very much be an issue on my end
[16:01:07] <gmodena>	 i just wanted to exclude known/wip issues before digging into it 
[16:02:04] <wikibugs>	 10Analytics: Repackage spark without hadoop, use provided hadoop jars - https://phabricator.wikimedia.org/T274384 (10Ottomata)
[16:02:31] <ottomata>	 elukey:  yes it still uses systemdd in the same way
[16:02:48] <ottomata>	 gmodena: FYi, we are going to soon recommend to not use pre-packaged pyspark notebook kernels
[16:02:54] <ottomata>	 and instead just use regular python notebook
[16:02:58] <ottomata>	 and instantiate SparkSession yourself
[16:02:59] <ottomata>	 e.g.
[16:03:03] <ottomata>	 https://wikitech.wikimedia.org/wiki/User:Ottomata/Jupyter#PySpark_and_wmfdata
[16:08:19] <elukey>	 gmodena: please let us know if you are blocked, we can help :)
[16:13:16] <gmodena>	 ottomata elukey ack! Not a blocker :)
[16:21:30] <wikibugs>	 10Analytics, 10Analytics-Kanban, 10Patch-For-Review: Clean up issues with jobs after Hadoop Upgrade - https://phabricator.wikimedia.org/T274322 (10Ottomata)
[16:22:50] <wikibugs>	 10Analytics-Clusters, 10Product-Analytics: Upgrade Hive to ≥ 2.0 - https://phabricator.wikimedia.org/T203498 (10elukey)
[16:22:52] <wikibugs>	 10Analytics: Default hive table creation to parquet - needs hive 2.3.0 - https://phabricator.wikimedia.org/T168554 (10elukey)
[16:23:01] <wikibugs>	 10Analytics-Clusters, 10User-Elukey: Update to CDH 6 or other up-to-date Hadoop distribution - https://phabricator.wikimedia.org/T203693 (10elukey) 05Open→03Resolved a:03elukey We did it in https://phabricator.wikimedia.org/T273711
[16:23:35] <wikibugs>	 10Analytics-Clusters, 10Product-Analytics: Upgrade Hive to ≥ 2.0 - https://phabricator.wikimedia.org/T203498 (10elukey) 05Open→03Resolved a:03elukey This was finally done in https://phabricator.wikimedia.org/T273711, we have hive 2.3.6 now :)
[16:28:13] <milimetric>	 fdans / mforns: you all pairing on the mediarequest / mediacounts thing?
[16:28:37] <mforns>	 milimetric: no, I'm on data_quality_stats failures rigth now
[16:28:39] <fdans>	 not right now, I'm working on it
[16:28:59] <milimetric>	 oh, ok, lemme join you then fdans
[16:29:06] <milimetric>	 pick a venue, any venue
[16:29:34] <fdans>	 milimetric:  I'm in the cave
[16:35:06] * elukey errand before standup
[16:42:19] <ottomata>	 elukey:  razzi,  got a minute for a spark / anaconda / hadoop / debian complexity?  i need a brain bounce :)
[16:42:25] <ottomata>	 oh errand proceed!
[16:42:56] <razzi>	 ottomata: doing a new employee checkin myself
[16:43:00] <ottomata>	 ok
[16:43:02] <ottomata>	 no worrieds!
[16:48:17] <wikibugs>	 (03Abandoned) 10Mholloway: Update session_tick `test` defn to take string values [schemas/event/secondary] - 10https://gerrit.wikimedia.org/r/661963 (owner: 10Mholloway)
[16:56:53] <ottomata>	 razzi: elukey  phewf, complexity averted.  I had a conda env that had python 3.8, and for a minute i thought that was the default that anaconda-wmf had, which would be very bad since we don't have python 3.8 spark dependencies installed or packaged
[16:57:03] <ottomata>	 and packaging them woudl be hard without a debian provided python 3.8 
[16:57:16] <ottomata>	 but!  i was wrong, anaconda-wmf has python 3.7, which debian also has so we are ok.
[16:57:18] <ottomata>	 phewf
[17:05:05] <wikibugs>	 (03PS1) 10Mforns: Replace UNION ALL with UNION to unbreak data_quality_stats job [analytics/refinery] - 10https://gerrit.wikimedia.org/r/663245 (https://phabricator.wikimedia.org/T274322)
[17:12:31] <Lofhi>	 Hey, I'm curious: does the pageview complete dumps were generated a bit later than usual because of the Hadoop upgrade?
[17:12:56] <wikibugs>	 (03CR) 10Joal: Replace UNION ALL with UNION to unbreak data_quality_stats job (031 comment) [analytics/refinery] - 10https://gerrit.wikimedia.org/r/663245 (https://phabricator.wikimedia.org/T274322) (owner: 10Mforns)
[17:15:29] <joal>	 Lofhi: Yes - the cluster has not processed data for a bit less than 12h - It then needed to catch up
[17:22:39] <Lofhi>	 Thanks for the answer! Still curious, because I'm reading the documentation and diagrams and didn't find a good answer: why the dumps are available but the same data is not available through the Pageview API? Is it only because the job(s) that fill Cassandra is (are) broken?
[17:23:21] <joal>	 you're absolutely correct Lofhi :)
[17:23:54] <Lofhi>	 I don't have any more questions than, thans for your time joal!
[17:24:10] <joal>	 Data gets computed on HDFS, dumps are ok, but we still have problems with cassandra jobs (we also have problems with mediarequest/mediacounts) - We are actively working on that :)
[17:24:16] <joal>	 Lofhi: --^
[17:24:38] <Lofhi>	 Thanks for details <:-)
[17:26:23] <wikibugs>	 10Analytics, 10Better Use Of Data, 10Event-Platform, 10Patch-For-Review: Stream cc map should not be generated on every pageload - https://phabricator.wikimedia.org/T256169 (10kzimmerman) 05Open→03Declined Closed as abandoned
[17:26:25] <wikibugs>	 10Analytics, 10Better Use Of Data, 10Event-Platform, 10Product-Analytics, and 2 others: EventLogging MEP Upgrade Phase 3 (Stream cc-ing) - https://phabricator.wikimedia.org/T256165 (10kzimmerman)
[17:26:38] <joal>	 You're welcome!
[17:28:01] <wikibugs>	 10Analytics-Radar, 10Better Use Of Data, 10Product-Analytics, 10Product-Data-Infrastructure, and 4 others: Develop a new schema for MediaSearch analytics or adapt an existing one - https://phabricator.wikimedia.org/T263875 (10kzimmerman) a:05jlinehan→03CBogen @CBogen can you verify with your team that...
[17:56:01] <wikibugs>	 10Analytics-Clusters: Balance Kafka topic partitions on Kafka Jumbo to take advantage of the new brokers - https://phabricator.wikimedia.org/T255973 (10razzi)
[17:56:20] <razzi>	 !log rebalance kafka partitions for eventlogging-client-side
[17:56:21] <stashbot>	 Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[18:29:51] <elukey>	 razzi: 10 mins and we meet? Would it be ok?
[18:29:57] <razzi>	 elukey: sounds good
[18:42:27] <razzi>	 elukey: in bc
[19:07:15] <wikibugs>	 10Analytics, 10Better Use Of Data, 10Product-Data-Infrastructure: Define acceptable usage of the `meta` object in event schemas - https://phabricator.wikimedia.org/T273293 (10Mholloway)
[19:07:17] <wikibugs>	 10Analytics, 10Better Use Of Data, 10Product-Analytics, 10Product-Data-Infrastructure: Define event stream configuration syntax - https://phabricator.wikimedia.org/T273235 (10Mholloway)
[19:11:34] <elukey>	 !log drop /user/oozie/share + chown o+rx -R /user/oozie/share + restart oozie
[19:11:36] <stashbot>	 Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[19:30:52] <wikibugs>	 10Analytics, 10Patch-For-Review: Upgrade the Analytics Hadoop cluster to Apache Bigtop - https://phabricator.wikimedia.org/T273711 (10Ottomata)
[19:30:56] <wikibugs>	 10Analytics, 10Analytics-Kanban, 10Patch-For-Review: Clean up issues with jobs after Hadoop Upgrade - https://phabricator.wikimedia.org/T274322 (10Ottomata)
[19:34:23] <wikibugs>	 10Analytics-Radar, 10Datasets-Archiving, 10Research-Backlog: Make HTML dumps available - https://phabricator.wikimedia.org/T182351 (10fkaelin) a:03fkaelin
[19:35:31] <wikibugs>	 10Analytics-Radar, 10Datasets-Archiving, 10Research-Backlog: Make HTML dumps available - https://phabricator.wikimedia.org/T182351 (10leila) For context: I asked @fkaelin to pick this task up as it remains a high priority request from the research community to our team and we seem to have enough in place to...
[19:36:24] <wikibugs>	 10Analytics-Radar, 10Datasets-Archiving, 10Research: Make HTML dumps available - https://phabricator.wikimedia.org/T182351 (10leila)
[19:54:01] <ottomata>	 razzi: am free in 40 mins and will work on spark stuff
[19:56:18] <wikibugs>	 (03PS4) 10Joal: Update hadoop and hive dependencies versions [analytics/refinery/source] - 10https://gerrit.wikimedia.org/r/663191 (https://phabricator.wikimedia.org/T274322)
[20:07:12] <wikibugs>	 10Analytics-Radar, 10Datasets-Archiving, 10Research: Make HTML dumps available - https://phabricator.wikimedia.org/T182351 (10Ottomata) Related: {T254275}
[20:17:59] <wikibugs>	 (03PS5) 10Joal: Update hadoop and hive dependencies versions [analytics/refinery/source] - 10https://gerrit.wikimedia.org/r/663191 (https://phabricator.wikimedia.org/T274322)
[20:28:23] <wikibugs>	 10Analytics-Radar, 10Datasets-Archiving, 10Research: Make HTML dumps available - https://phabricator.wikimedia.org/T182351 (10bd808) >>! In T182351#6820212, @Ottomata wrote: > Related: {T254275}  Even more relevant {T273585}
[20:31:44] <wikibugs>	 (03PS2) 10Eric Gardner: Update schema to handle quickview copy events [schemas/event/secondary] - 10https://gerrit.wikimedia.org/r/661273 (https://phabricator.wikimedia.org/T263663)
[20:31:58] <wikibugs>	 (03PS3) 10Eric Gardner: Update schema to handle quickview copy events [schemas/event/secondary] - 10https://gerrit.wikimedia.org/r/661273 (https://phabricator.wikimedia.org/T263663)
[20:32:22] <wikibugs>	 10Analytics-Radar, 10Datasets-Archiving, 10Research: Make HTML dumps available - https://phabricator.wikimedia.org/T182351 (10ArielGlenn) @leila for clarification, do you need HTML of all revisions of all pages, or only the current version of page content?
[20:32:24] <wikibugs>	 (03PS4) 10Eric Gardner: Update schema to handle quickview copy events [schemas/event/secondary] - 10https://gerrit.wikimedia.org/r/661273 (https://phabricator.wikimedia.org/T263663)
[20:33:48] <wikibugs>	 (03PS6) 10Joal: Update hadoop and hive dependencies versions [analytics/refinery/source] - 10https://gerrit.wikimedia.org/r/663191 (https://phabricator.wikimedia.org/T274322)
[20:39:32] <wikibugs>	 (03CR) 10Nettrom: [C: 03+2] Update schema to handle quickview copy events [schemas/event/secondary] - 10https://gerrit.wikimedia.org/r/661273 (https://phabricator.wikimedia.org/T263663) (owner: 10Eric Gardner)
[20:40:28] <ottomata>	 phewf ok razzi  staring to work finally
[20:43:02] <razzi>	 ottomata: cool, shall we hop on a call? bc occupied
[20:43:52] <ottomata>	 gimme 5
[20:45:48] <elukey>	 joal: any improvement with the new shlib?
[20:45:54] <joal>	 nope :(
[20:45:59] <joal>	 I am sadness
[20:46:12] <wikibugs>	 (03PS1) 10Milimetric: Make null result same shape as normal result [analytics/refinery/source] - 10https://gerrit.wikimedia.org/r/663300
[20:47:32] <elukey>	 :(
[20:48:30] <wikibugs>	 (03PS1) 10Milimetric: [WIP] Reverting the temp table hack still doesn't work with our new hadoop environment, not sure why. [analytics/refinery] - 10https://gerrit.wikimedia.org/r/663301
[20:52:56] <elukey>	 ottomata: qq - did you re-run the refine failed hours? I still see REFINE_FAILED flags from the monitor refine etc..
[20:53:11] <elukey>	 is it because REFINED flags are there, and FAILED ones are not automatically removed?
[20:54:01] <ottomata>	 elukey:  i did yes, 
[20:54:16] <ottomata>	 if _REFINED is there then all is well, i can't remember but i dont think refine removed _REFINED_FAILED flags
[20:54:32] <ottomata>	 razzi:  am here
[20:54:32] <ottomata>	 https://meet.google.com/xxa-zvzw-gvi
[20:54:36] <elukey>	 ahhh okok the I'll double check tomorrow morning, thanks :)
[20:54:40] * elukey afk again
[21:04:51] <wikibugs>	 (03PS7) 10Joal: Update hadoop and hive dependencies versions [analytics/refinery/source] - 10https://gerrit.wikimedia.org/r/663191 (https://phabricator.wikimedia.org/T274322)
[21:08:40] <wikibugs>	 (03PS1) 10Milimetric: Use temp directory hack on mediacounts [analytics/refinery] - 10https://gerrit.wikimedia.org/r/663305 (https://phabricator.wikimedia.org/T274322)
[21:10:43] <razzi>	 !log rebalance kafka partitions for codfw.mediawiki.cirrussearch-request
[21:10:44] <stashbot>	 Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[21:25:01] <milimetric>	 status update: the biggest fire right now I think is the pageview API not being loaded, people waiting on the data.  Joseph tried to figure out the jar dependency hell but no luck so far.  We decided to try and load the data with spark, so I'm writing the spark job to do that now.  Wish me luck or come pair with me if you like
[21:35:33] <joal>	 milimetric: hold on your spark!!!
[21:35:52] <milimetric>	 hahaha, ok...
[21:36:06] <joal>	 milimetric: to the cave!
[21:43:13] <milimetric>	 joal: it's https://gerrit.wikimedia.org/r/c/analytics/refinery/source/+/663191 ?  I didn't see anything else in gerrit
[21:43:18] <milimetric>	 (you were saying PS4?
[21:43:42] <joal>	 Correct milimetric - PS4
[21:44:01] <milimetric>	 ok, I'll overwrite the change with that version
[21:44:03] <joal>	 hm - I can't push ps4 again :(
[21:44:11] <joal>	 Ok great - thanks for that :)
[21:44:37] <joal>	 Then it's my leave! Thanks milimetric for taking up from there - see you tomorrow team
[21:44:41] <milimetric>	 o/
[21:45:21] <wikibugs>	 (03PS8) 10Milimetric: Update hadoop and hive dependencies versions [analytics/refinery/source] - 10https://gerrit.wikimedia.org/r/663191 (https://phabricator.wikimedia.org/T274322) (owner: 10Joal)
[21:46:42] <razzi>	 !log rebalance kafka partitions for eqiad.mediawiki.cirrussearch-request
[21:46:44] <wikibugs>	 (03CR) 10Milimetric: [C: 03+2] Update hadoop and hive dependencies versions [analytics/refinery/source] - 10https://gerrit.wikimedia.org/r/663191 (https://phabricator.wikimedia.org/T274322) (owner: 10Joal)
[21:46:45] <stashbot>	 Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[21:52:55] <wikibugs>	 (03Merged) 10jenkins-bot: Update hadoop and hive dependencies versions [analytics/refinery/source] - 10https://gerrit.wikimedia.org/r/663191 (https://phabricator.wikimedia.org/T274322) (owner: 10Joal)
[21:56:35] <wikibugs>	 (03PS1) 10Milimetric: Update changelog.md for v0.1.1 [analytics/refinery/source] - 10https://gerrit.wikimedia.org/r/663309
[21:56:54] <wikibugs>	 (03CR) 10Milimetric: [C: 03+2] Update changelog.md for v0.1.1 [analytics/refinery/source] - 10https://gerrit.wikimedia.org/r/663309 (owner: 10Milimetric)
[21:57:10] <milimetric>	 (deploying refinery-source and preparing to restart most of the jobs that have been failing)
[22:02:02] <wikibugs>	 (03PS1) 10Milimetric: Fix interlanguage job syntax [analytics/refinery] - 10https://gerrit.wikimedia.org/r/663312 (https://phabricator.wikimedia.org/T274322)
[22:02:17] <wikibugs>	 (03CR) 10Milimetric: [V: 03+2 C: 03+2] Fix interlanguage job syntax [analytics/refinery] - 10https://gerrit.wikimedia.org/r/663312 (https://phabricator.wikimedia.org/T274322) (owner: 10Milimetric)
[22:02:23] <wikibugs>	 (03CR) 10Milimetric: [V: 03+2 C: 03+2] Use temp directory hack on mediacounts [analytics/refinery] - 10https://gerrit.wikimedia.org/r/663305 (https://phabricator.wikimedia.org/T274322) (owner: 10Milimetric)
[22:03:11] <wikibugs>	 (03Merged) 10jenkins-bot: Update changelog.md for v0.1.1 [analytics/refinery/source] - 10https://gerrit.wikimedia.org/r/663309 (owner: 10Milimetric)
[22:03:25] <mw-jenkinsbot>	 Starting build #74 for job analytics-refinery-maven-release-docker
[22:15:12] <mw-jenkinsbot>	 Project analytics-refinery-maven-release-docker build #74: 09SUCCESS in 11 min: https://integration.wikimedia.org/ci/job/analytics-refinery-maven-release-docker/74/
[23:04:17] <wikibugs>	 10Analytics-Clusters: Balance Kafka topic partitions on Kafka Jumbo to take advantage of the new brokers - https://phabricator.wikimedia.org/T255973 (10razzi)
[23:16:44] <mw-jenkinsbot>	 Starting build #38 for job analytics-refinery-update-jars-docker
[23:17:14] <wikibugs>	 (03PS1) 10Maven-release-user: Add refinery-source jars for v0.1.1 to artifacts [analytics/refinery] - 10https://gerrit.wikimedia.org/r/663320
[23:17:15] <mw-jenkinsbot>	 Project analytics-refinery-update-jars-docker build #38: 09SUCCESS in 31 sec: https://integration.wikimedia.org/ci/job/analytics-refinery-update-jars-docker/38/
[23:17:45] <wikibugs>	 (03CR) 10Milimetric: [V: 03+2 C: 03+2] Add refinery-source jars for v0.1.1 to artifacts [analytics/refinery] - 10https://gerrit.wikimedia.org/r/663320 (owner: 10Maven-release-user)
[23:38:13] <wikibugs>	 (03PS1) 10Milimetric: Bump jar versions to 0.1.1 [analytics/refinery] - 10https://gerrit.wikimedia.org/r/663329
[23:38:26] <wikibugs>	 (03CR) 10Milimetric: [V: 03+2 C: 03+2] Bump jar versions to 0.1.1 [analytics/refinery] - 10https://gerrit.wikimedia.org/r/663329 (owner: 10Milimetric)
[23:47:10] <wikibugs>	 10Analytics, 10SRE, 10SRE-Access-Requests: Add kzeta to analytics-privatedata-users - https://phabricator.wikimedia.org/T272982 (10kzimmerman) @Vgutierrez (I saw you were listed on [[ https://wikitech.wikimedia.org/wiki/SRE_Clinic_Duty | Clinic Duty ]]) - I ran into access problems again today; do you need a...
[23:54:12] <wikibugs>	 10Analytics-Radar, 10Datasets-Archiving, 10Research: Make HTML dumps available - https://phabricator.wikimedia.org/T182351 (10fkaelin) @ArielGlenn, the dataset should contain the rendered html for all revisions, rendered with the mediawiki version at the time the revision was created. The motivation for this...