[07:04:58] <joal>	 Good morning team
[07:25:56] <joal>	 There seem to be an issue with Oozie this morning
[07:26:20] <joal>	 Starting yesterday at hour 20 UTC
[07:28:04] <joal>	 Issue is traceable from oozie SLA alerts emails: oozie jobs don't finish anymore
[07:28:19] <joal>	 However, the cluster is almost empty, no oozie job is there
[07:28:48] <joal>	 !log Suspend/resume stalled coordinators in hue
[07:28:50] <stashbot>	 Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[07:31:12] <joal>	 Didn't work
[07:32:09] <joal>	 !log Rerun webrequest-load for text and upload, hours 21 and 22
[07:32:11] <stashbot>	 Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[07:35:34] <wikibugs>	 10Analytics, 10Product-Analytics, 10Reading Depth, 10Patch-For-Review, 10Readers-Web-Backlog (Readers-Web-Kanbanana-2019-20-Q1): Reading_depth: remove  eventlogging instrumentation - https://phabricator.wikimedia.org/T229042 (10Groceryheist) Sorry I lost track of this bug until today.  I think it is real...
[07:57:35] <joal>	 Looks like oozie is a worse state than expected - Killing restarting jobs has not helped
[08:04:03] <joal>	 I'd like to give oozie a bump (service oozie restart), but sudo is needed
[08:04:08] <joal>	 Will wait for Luca
[08:06:37] <joal>	 Or maybe I can an ops hanging around to help - moritzm - Would you be here by any chance?
[08:11:21] <moritzm>	 sure, which host?
[08:11:37] <joal>	 Hi moritzm - host is an-coord1001.eqiad.wmnet
[08:11:50] <joal>	 moritzm: and command would be 'sudo service oozie restart'
[08:16:37] <moritzm>	 doing that now
[08:17:12] <joal>	 Thanks a lot moritzm - May I ask you to log onto the chan once done please? I can also do it if you prefer
[08:17:39] <moritzm>	 journalctl states that it's back up
[08:17:55] <moritzm>	 I logged in -operations, but can also copy here
[08:18:05] <moritzm>	 !log restarted oozie on an-coord1001
[08:18:06] <stashbot>	 Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[08:19:16] <joal>	 moritzm: Thanks again - Will continue to try and make it work
[08:19:42] <joal>	 moritzm: it looks a lot better :) \o/
[08:19:50] <moritzm>	 cool, ping me if you need other intervention :-)
[08:28:33] <joal>	 webrequest is still stuck, always same place :(
[08:28:40] <joal>	 grumble grumble
[08:28:50] <joal>	 I'm gonna kill the bundle and restart it
[08:29:42] <joal>	 !log Kill webrequest-load bundle
[08:29:44] <stashbot>	 Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[08:31:11] <joal>	 ok - looks like the loops from dumps test job are not finished
[08:32:34] <joal>	 !log Manually kill all leftover workflows from mediawiki-history-dumps
[08:32:35] <stashbot>	 Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[08:35:55] <joal>	 !log Restart webrequest bundle
[08:35:57] <stashbot>	 Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[08:54:41] * joal not understand :(
[08:57:05] <joal>	 From what I read, actions are needed in the mysql-db
[08:57:27] <joal>	 I'm going to wait for more people to be there, in order not to take actions alone
[09:02:37] <elukey>	 joal: morning :)
[09:02:46] <joal>	 elukey: Hi !
[09:02:57] <joal>	 elukey: the jinx has worked pretty well :)
[09:02:57] <elukey>	 I noticed the alarms :
[09:02:58] <elukey>	 :(
[09:03:00] <elukey>	 yeah
[09:03:13] <joal>	 elukey: here is a summary
[09:03:48] <joal>	 elukey: When I started I noticed the SLA alarms, and looked at oozie: jobs were stall from yesterday hour 20/21, without any oozie worker in hadoop scheduler
[09:05:15] <joal>	 I tried with webrequest to suspend/resume workflow, kill/rerun workflow, kill/restart coordinator, reboot oozie then kill restart --> Still stuck at check_sequence_statistics step
[09:05:37] <joal>	 Other problematic thing: there are 3 workflow that I can't kill
[09:06:15] <joal>	 My reading have told me that the solution is to remove them from DB, which I can do (I am connected, I have checked the data in DB, I can easily do it)
[09:06:31] <elukey>	 ah super brutal
[09:06:34] <joal>	 But I'd rather have an acknowledgement from you and/or nuria 
[09:06:41] <elukey>	 lemme take a dump of the db first
[09:06:47] <joal>	 Yessir :)
[09:07:46] <elukey>	 what is currently stuck? webrequest seems proceeding affaics
[09:07:50] <fdans>	 haha elukey "lemme take a dump"
[09:07:57] <joal>	 3 workflows that I don't manage to kill: 0000000-190822081651905-oozie-oozi-W, 0000001-190822081651905-oozie-oozi-W, 0000004-190822081651905-oozie-oozi-W
[09:08:11] <joal>	 webrequest is not proceeding from what I see
[09:08:19] <elukey>	 fdans: that is extracting informations without context :D
[09:08:33] <fdans>	 in a masterful way
[09:09:13] <elukey>	 joal: it is running check_sequence_statistics but it started 20 mins ago, weird
[09:09:25] <joal>	 elukey: stuck at check_sequence_statistics step
[09:09:33] <joal>	 in a state I don't understand
[09:10:18] <joal>	 elukey: My thinking is to remove unclean stuff (the 3 workflows I can't kill) make sure oozie is clean, maybe even restart it, and then try to restart jobs
[09:11:05] <elukey>	 okok, what are the 3 worflows that are stopped? (not questioning just doing sanity check :)
[09:11:15] <joal>	 no problem asking :)
[09:11:33] <joal>	 those 3 workflows are loop-steps from mforns testing on mediawiki-history-dumps
[09:12:19] <joal>	 the coordinator is already done (don't know if succeeeded or failed), but those workflows are here, and unkillable - When oozie got restarted, the workflows restarted straight away
[09:13:00] <elukey>	 lovely
[09:18:47] <elukey>	 (I am trying to see if via oozie cli I can kill, probably already done, but want to triple check before cleaning up the db)
[09:19:22] <joal>	 elukey: Here is what I plan to execute on the db https://gist.github.com/jobar/d97c728403f4dd8c0834c44831ca753d
[09:21:45] <elukey>	 stopped/started oozie again after trying to kill all marcel's jobs, same thing
[09:22:08] <joal>	 elukey: we should have known that: marcel stuff is unkillable
[09:22:20] <joal>	 :)
[09:23:06] <elukey>	 sql looks good!
[09:23:24] <elukey>	 gimme 2 mins
[09:24:24] <joal>	 sure
[09:25:27] <joal>	 Ah - I need to change the ids :)
[09:26:18] <elukey>	 joal: yeah I was about to say, was able to kill 0000001-190822081651905-oozie-oozi-W but then it got recreated
[09:26:25] <joal>	 right
[09:26:37] <joal>	 it got killed when oozie died, then recreated
[09:26:52] <elukey>	 ok so I'll stop oozie, you can clean up the db, and then we restart
[09:26:55] <elukey>	 would it be ok?
[09:27:27] <joal>	 elukey: I'm afraid stoping oozie won't do - job will be killed, then recreated
[09:27:30] <joal>	 I think
[09:29:10] <elukey>	 we can try first with stop clean start, then we can do on the fly 
[09:29:16] <elukey>	 I'd prefer to be cautious
[09:29:16] <joal>	 works for me
[09:29:20] <elukey>	 ack thanks
[09:29:30] <joal>	 ok, I let you take your dump :-P
[09:29:35] <elukey>	 already done
[09:29:41] <joal>	 ok
[09:29:44] <elukey>	 are you ready with the sql?
[09:30:09] <joal>	 yes
[09:30:22] <elukey>	 oozie stopped
[09:30:23] <joal>	 updated it changing the oozie-restat date in ids
[09:31:10] <elukey>	 when you are done I'll start
[09:31:59] <joal>	 done
[09:32:16] <wikibugs>	 10Analytics, 10Anti-Harassment (The Letter Song), 10MW-1.34-notes (1.34.0-wmf.20; 2019-08-27): Instrument Special:Mute - https://phabricator.wikimedia.org/T224958 (10dom_walden) When I submit the Special:Mute page on beta, on deployment-eventlog05.deployment-prep.eqiad.wmflabs I see the event logged in /srv/...
[09:32:26] <elukey>	 oozie up
[09:32:57] <joal>	 Ok - No more marcel's jobs :)
[09:33:01] <joal>	 good
[09:33:24] <joal>	 elukey: I'll try kill restart the actions that are stuck, see if changes anything first
[09:33:39] <elukey>	 joal: I can see webrequest_load in refine
[09:33:43] <joal>	 (webrequest upload and text, hours 21 and 22)
[09:34:05] <joal>	 \o/ !!!!
[09:34:10] <elukey>	 yeah it seems working :)
[09:34:15] <elukey>	 let's wait a bit
[09:34:26] <joal>	 Sure, I'm gonna monitor this you can be sure
[09:34:28] <joal>	 Thanks a mil
[09:34:52] <elukey>	 !log clean up on the oozie db loop_* workflows (oozie stuck for some reason, most of the coords not processing anything since hours)
[09:34:54] <stashbot>	 Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[09:38:07] <elukey>	 all right going to check later :)
[09:38:43] <elukey>	 joal: ok if I go afk for a bit?
[09:38:49] <elukey>	 (ping me on the phone if needed)
[09:39:00] <joal>	 it is elukey - Thanks a lot
[09:40:05] <elukey>	 o/
[09:41:25] <joal>	 Heya fdans - do we pend a minute on the table format for mediarequest?
[09:41:55] <fdans>	 joal: you want to batcave now?
[09:42:06] <joal>	 sure!
[10:08:18] <icinga-wm>	 PROBLEM - Check the last execution of check_webrequest_partitions on an-coord1001 is CRITICAL: CRITICAL: Status of the systemd unit check_webrequest_partitions https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[10:17:13] <joal>	 fdans: just confirming - Can you please check we get accurate mediacount out of mediarequests?
[10:17:30] <fdans>	 joal: yea, on it
[10:17:36] <joal>	 <3
[10:17:55] <joal>	 we talked about many things, I thought I'd rather confirm :)
[11:30:22] <elukey>	 joal: everything seems good right?
[11:36:14] <joal>	 so far so good :)
[11:40:47] <wikibugs>	 10Analytics: Refactor quenename into HQL hive2 action oozie jobs - https://phabricator.wikimedia.org/T231002 (10JAllemandou)
[11:41:05] <wikibugs>	 (03PS1) 10Joal: Update webrequest oozie job for yarn queue to work [analytics/refinery] - 10https://gerrit.wikimedia.org/r/531682 (https://phabricator.wikimedia.org/T231002)
[11:42:48] <elukey>	 ah snap joal, does --^ require a roll restart of all the coords/bundles?
[11:43:07] <joal>	 elukey: it'll require a patch on all, plus restarts
[11:43:12] <elukey>	 :(
[11:44:22] <elukey>	 I didn't realize it, let me know if I can help with this painful process
[11:44:41] <elukey>	 we basically just completed the second (analytics user, hive -> hive2 actions) roll restart and we need the third :(
[11:45:13] <joal>	 elukey: The thing that I'm still not sure is, why do the config param not work?
[11:45:51] <joal>	 elukey: I tested manually and it works fine, the way passing params is the same for hive and beeline- I'm a bit in wonder
[11:47:18] <elukey>	 joal: maybe sending the query directly to the hive2 server, rather than using hdfs/metastore, requires the parameter to be set in a different way
[11:47:46] <joal>	 elukey: as you were saying, the action runs beeline the same it ran hive I guess
[11:48:51] <elukey>	 joal: it must be something related to oozie launching the beeline query vs query execution on the hive2 server
[11:50:18] <joal>	 hm
[11:50:54] <joal>	 elukey: other unrelated question: is the patch from Andrew on kerb principal for refine tested on non-kerb cluster?
[11:51:12] <joal>	 elukey: the task is in ready to deploy and I wonder if I should do it or not
[11:52:22] <elukey>	 joal: as far as I know the patch was tested in both clusters, and it is already merged/deploy (but only used in the kerb cluster's puppet code)
[11:52:32] <elukey>	 *deployed
[11:52:46] <joal>	 Oh !!
[11:52:57] <joal>	 sorry, didn't even checked if the code was merged
[11:52:59] <joal>	 :S
[11:53:02] <joal>	 My bad
[11:58:24] <wikibugs>	 (03PS1) 10Joal: Update changelog.md for v0.0.98 [analytics/refinery/source] - 10https://gerrit.wikimedia.org/r/531684
[11:58:31] <joal>	 elukey, fdans --^ please
[11:59:21] <elukey>	 lgtm but I'll leave fran to comment since I didn't follow closely his work :)
[11:59:55] <wikibugs>	 (03CR) 10Elukey: [C: 03+1] "lgtm" [analytics/refinery/source] - 10https://gerrit.wikimedia.org/r/531684 (owner: 10Joal)
[12:01:39] <joal>	 elukey: I'll take that as a yes :)
[12:01:43] <joal>	 Thanks !
[12:01:54] <joal>	 elukey: you were supposed to rest this afternoon :(
[12:01:59] <joal>	 I'm sorry to trouble
[12:02:28] <elukey>	 don't even say that, I am good :) took some rest this morning, will take it easy this afternoon, I have some tasks to complete before holidays
[12:02:43] <elukey>	 will try not to do any work that can harm our infra :D
[12:02:58] <elukey>	 (currently installing the os etc.. on the new analytics zookeeper nodes)
[12:10:26] <joal>	 !log Release refinery-source v0.0.98 to jenkins
[12:10:28] <stashbot>	 Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[12:10:42] <joal>	 !log Release refinery-source v0.0.98 to archiva (correction)
[12:10:44] <stashbot>	 Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[12:16:31] <joal>	 Aouch - Went too fast into deploy, didn't even realize the changelog.md wasn't merge :(
[12:16:34] <joal>	 pffff
[12:17:05] <wikibugs>	 (03CR) 10Joal: [C: 03+2] Update changelog.md for v0.0.98 [analytics/refinery/source] - 10https://gerrit.wikimedia.org/r/531684 (owner: 10Joal)
[12:22:00] <wikibugs>	 (03Merged) 10jenkins-bot: Update changelog.md for v0.0.98 [analytics/refinery/source] - 10https://gerrit.wikimedia.org/r/531684 (owner: 10Joal)
[12:31:28] <fdans>	 thank you for doing this joal 
[12:34:34] <wikibugs>	 (03PS1) 10Joal: Revert pom changes from erroneous deploy [analytics/refinery/source] - 10https://gerrit.wikimedia.org/r/531689
[12:34:47] <joal>	 fdans, elukey --^ fixing my mess with dpeloy (sorry)
[12:38:54] <elukey>	 joal: (ignorant qs) will this allow you to rebuild 0.98 with the correct changes?
[12:39:34] <joal>	 elukey: yes - here is my process: drop artifacts from archiva, correct code (the above), drop tag
[12:39:59] <wikibugs>	 (03CR) 10Elukey: [C: 03+1] Revert pom changes from erroneous deploy [analytics/refinery/source] - 10https://gerrit.wikimedia.org/r/531689 (owner: 10Joal)
[12:40:20] <joal>	 elukey: by doing so, I'm basically puting the codebase in 0.0.97 state (except for my mess in history), ready to be deployed with jenkins
[12:40:41] <elukey>	 yep yep makes sense
[12:41:02] <joal>	 Thanks elukey - Will +2 and wait for merge before starting jenkins (not twice in a day)
[12:41:30] <wikibugs>	 (03CR) 10Joal: [C: 03+2] "Merging for deploy" [analytics/refinery/source] - 10https://gerrit.wikimedia.org/r/531689 (owner: 10Joal)
[12:45:58] <wikibugs>	 (03Merged) 10jenkins-bot: Revert pom changes from erroneous deploy [analytics/refinery/source] - 10https://gerrit.wikimedia.org/r/531689 (owner: 10Joal)
[12:48:45] <joal>	 !log Releasing refinery v0.0.98 on archiva from jenkins after correction
[12:48:47] <stashbot>	 Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[12:52:38] * elukey afk for a bit
[13:03:08] <joal>	 fdans: Hola :)
[13:03:30] <fdans>	 heloo
[13:04:06] <joal>	 fdans: from https://github.com/wikimedia/analytics-refinery/blob/master/oozie/mediarequests/hourly/coordinator.properties#L31, it looks like nothing need to be changed before deploying - Can you confirm?
[13:04:57] <fdans>	 joal: if that’s the version that’s been deployed, that’s correct yeah :)
[13:05:15] <joal>	 good
[13:18:11] <joal>	 !log Deploying refinery with scap
[13:18:13] <stashbot>	 Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[13:41:12] <joal>	 !log Deploying refinery onto hdfs
[13:41:14] <stashbot>	 Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[13:51:54] <wikibugs>	 10Analytics, 10Analytics-Wikistats, 10Performance-Team: Piwik JS isn't cached - https://phabricator.wikimedia.org/T230772 (10ema) >>! In T230772#5423807, @elukey wrote: > @ema hi :) Are response headers like Cache-Control used by Varnish in case `caching: 'pass'` is configured?   They are not. The meaning of...
[13:52:10] <wikibugs>	 10Analytics, 10Analytics-Wikistats, 10Operations, 10Performance-Team, 10Traffic: Piwik JS isn't cached - https://phabricator.wikimedia.org/T230772 (10ema) p:05Triage→03Normal
[14:03:58] <joal>	 cluster is back in track :) \o/
[14:05:49] <joal>	 fdans: just noticing as I'm looking at data already loaded in mediarequest
[14:06:07] <joal>	 Format for partitionining is different
[14:06:18] <fdans>	 joal: yes, by snapshot
[14:06:21] <fdans>	 sorry
[14:06:28] <fdans>	 by timestamp
[14:06:52] <fdans>	 joal: dan suggested it as that's what we agreed on going forward right?
[14:08:33] <joal>	 I'm kinda inconfortable, as yes we said the format would probably be better for various reasons, but we also said we would change/backfill other tables
[14:10:23] <joal>	 hm
[14:10:31] <mforns>	 hi team!
[14:10:50] <joal>	 fdans: As other people in the team have all agreed, I will join then :)
[14:11:30] <joal>	 fdans: Shall I start the job from 2019-08-14T12:00 ?
[14:11:59] <fdans>	 joal: sounds good to me, thank you :)
[14:15:11] <joal>	 hmActually, some relatively important point about the timestamp field: we should have a timestamp field inside the data itself, facilitating having timetamp without knowing the folder - And about the format, using an ISO-correct format (yyyy-mm-ddTHH:MM for instance) would be a good idea to facilitate hive/spark time computationg
[14:15:57] <joal>	 fdans: --^
[14:16:11] <joal>	 Let's discuss this in standup - I'll start the job
[14:18:53] <joal>	 Question on the name as well - we have so far mostly not used plural - Why has it been prefered to use it for that one?
[14:20:09] <joal>	 !log Start mediarequests oozie coordinator from 2019-08-14T12:00
[14:20:11] <stashbot>	 Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[14:37:12] <nuria>	 joal: did we figured out what was going on with cluster?
[14:39:16] <nuria>	 fdans: hello, can you add to ticket the cross checking you have done with mediacounts to make sure  that data is being backfilled correctly? That way i can also look but not repeat your work. 
[14:39:55] <wikibugs>	 10Analytics, 10Cleanup, 10Editing-team: Deletion of limn-edit-data repository - https://phabricator.wikimedia.org/T228982 (10fdans)
[14:41:00] <wikibugs>	 10Analytics, 10Analytics-Kanban, 10Patch-For-Review: Move reportupdater queries from limn-* repositories to reportupdater-queries - https://phabricator.wikimedia.org/T222739 (10fdans) @Nuria just the removal of repositories, but since there's tasks open for each of them, I think we can mark this one as done?
[14:41:20] <elukey>	 nuria: o/ it seems that the loop_* workflows that Marcel launched yesterday caused some issue to Oozie, and all our attempts to kill them failed (the workflow respawned for some reason). Then Joseph used the Oozie hammer, removing the workflows from the database directly, and after a start/stop oozie recovered
[14:41:24] <elukey>	 super weird
[14:41:34] <fdans>	 nuria: I'm not completely done with that, will update the task as soon as I am
[14:42:31] <wikibugs>	 10Analytics, 10Analytics-Kanban, 10Patch-For-Review: Move reportupdater queries from limn-* repositories to reportupdater-queries - https://phabricator.wikimedia.org/T222739 (10Nuria) @fdans sounds good, closing.
[14:42:40] <wikibugs>	 10Analytics, 10Analytics-Kanban, 10Patch-For-Review: Move reportupdater queries from limn-* repositories to reportupdater-queries - https://phabricator.wikimedia.org/T222739 (10Nuria) 05Open→03Resolved
[14:42:43] <wikibugs>	 10Analytics, 10Continuous-Integration-Config, 10Patch-For-Review, 10Release-Engineering-Team (Watching / External): Status of analytics/limn-*-data git repositories? - https://phabricator.wikimedia.org/T221064 (10Nuria)
[14:42:45] <wikibugs>	 10Analytics, 10Analytics-EventLogging, 10Analytics-Kanban: Sunset MySQL data store for eventlogging - https://phabricator.wikimedia.org/T159170 (10Nuria)
[14:44:08] <wikibugs>	 10Analytics, 10Analytics-EventLogging, 10Analytics-Kanban: Sunset MySQL data store for eventlogging - https://phabricator.wikimedia.org/T159170 (10Nuria) Ok, one step closer to removing mysql-consumer as all queries have been moved out of limn repositories into the reportupdater one, the next step is https:/...
[14:46:12] <wikibugs>	 10Analytics, 10Analytics-EventLogging: Move reportupdater reports that pull data from eventlogging mysql to pull data from hadoop - https://phabricator.wikimedia.org/T223414 (10Nuria) Note to team that @kaldari has confirmed that we do not need the page-creation dashboards now that that page creation data is o...
[14:46:29] <wikibugs>	 10Analytics, 10Analytics-EventLogging, 10Analytics-Kanban: Move reportupdater reports that pull data from eventlogging mysql to pull data from hadoop - https://phabricator.wikimedia.org/T223414 (10Nuria)
[14:47:12] <nuria>	 fdans: i have moved the next step for reportupdater to kanban, once this one is done and otto does magic for beta we can deprecate the mysql consumer, see: https://phabricator.wikimedia.org/T223414
[14:47:51] <fdans>	 nuria: yea we talked about it before he left, we're going to port all the things together when he's back
[14:52:51] <elukey>	 just created the new analytics zookeeper cluster
[14:53:05] <nuria>	 elukey: wow
[14:53:47] <elukey>	 running buster and java 11
[14:53:55] <nuria>	 fdans: to port jobs to hive andrews help is not needed is just busywork, i think ( unless i'm missing something), the work we need to do that now does not exist is to have a way to log events in beta that is not mysql
[14:54:05] <nuria>	 elukey: more wow
[14:54:34] <fdans>	 nuria: oh right, I see
[14:55:44] <wikibugs>	 (03CR) 10Nuria: "Change s look good, virtual +2 if we have tested the job itself" [analytics/refinery] - 10https://gerrit.wikimedia.org/r/531682 (https://phabricator.wikimedia.org/T231002) (owner: 10Joal)
[14:59:34] <wikibugs>	 10Analytics, 10User-Elukey: decouple analytics zookeeper cluster from kafka zookeeper cluster [2019-2020] - https://phabricator.wikimedia.org/T217057 (10elukey)
[14:59:52] <wikibugs>	 10Analytics, 10User-Elukey: Decouple analytics zookeeper cluster from kafka zookeeper cluster [2019-2020] - https://phabricator.wikimedia.org/T217057 (10elukey)
[15:01:48] <wikibugs>	 10Analytics, 10User-Elukey: Decouple analytics zookeeper cluster from kafka zookeeper cluster [2019-2020] - https://phabricator.wikimedia.org/T217057 (10elukey) The zookeeper analytics-eqiad cluster has been created in T227025. The remaining steps are:  1) test the new cluster properly (it runs java11 and bust...
[15:01:58] <wikibugs>	 10Analytics, 10Analytics-Kanban, 10User-Elukey: Decouple analytics zookeeper cluster from kafka zookeeper cluster [2019-2020] - https://phabricator.wikimedia.org/T217057 (10elukey)
[15:06:44] <wikibugs>	 10Analytics: Geoeditors_private deletion scripts scheduled day conflicts with retention period - https://phabricator.wikimedia.org/T231017 (10mforns)
[15:13:40] <elukey>	 !log remove reading_depth druid load job from an-coord1001
[15:13:42] <stashbot>	 Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[15:13:43] <elukey>	 nuria: --^
[15:13:54] <nuria>	 elukey: yessir
[15:17:01] <wikibugs>	 (03CR) 10Mforns: [C: 03+1] "LGTM!" [analytics/refinery] - 10https://gerrit.wikimedia.org/r/531682 (https://phabricator.wikimedia.org/T231002) (owner: 10Joal)
[15:17:11] <icinga-wm>	 PROBLEM - Check the last execution of eventlogging_to_druid_readingdepth_hourly on an-coord1001 is CRITICAL: NRPE: Command check_check_eventlogging_to_druid_readingdepth_hourly_status not defined https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[15:18:15] <elukey>	 yes yes :)
[15:18:25] <elukey>	 will go away when puppet runs on the icinga host
[15:21:28] <nuria>	 elukey: i see, and we need another patch to remove the thingy now, right?
[15:21:56] <elukey>	 nuria: exactly, already filed
[15:22:23] <elukey>	 and merged now :)
[15:23:46] <nuria>	 elukey: ooohh, super thanks, will do likewise in teh future
[15:24:21] <icinga-wm>	 PROBLEM - Check the last execution of eventlogging_to_druid_readingdepth_daily on an-coord1001 is CRITICAL: NRPE: Command check_check_eventlogging_to_druid_readingdepth_daily_status not defined https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[15:24:54] <elukey>	 nuria: it is a bit cumbersome but puppet cleans up all the files etc.. for us, otherwise it is a mess :(
[15:25:18] <nuria>	 elukey: totally
[15:27:51] <wikibugs>	 10Analytics, 10Operations, 10vm-requests: Decommission analytics-tool1002 (old turnilo vm) - https://phabricator.wikimedia.org/T231021 (10elukey)
[15:31:46] <leila>	 elukey: o/
[15:32:17] <wikibugs>	 (03PS8) 10Mforns: [WIP] Add Oozie job for mediawiki history dumps [analytics/refinery] - 10https://gerrit.wikimedia.org/r/530002 (https://phabricator.wikimedia.org/T208612)
[15:32:18] <leila>	 elukey: heads up that you will receive an invitation from Toby/Linh for a meeting about machine vision. I asked you and Filippo to be added.
[15:32:27] <leila>	 elukey: this is a one time meeting.
[15:32:44] <wikibugs>	 (03PS4) 10Mforns: [WIP] Add spark job to create mediawiki history dumps [analytics/refinery/source] - 10https://gerrit.wikimedia.org/r/528504 (https://phabricator.wikimedia.org/T208612)
[15:32:48] <elukey>	 leila: o/ more than happy to join, but I'll be on holidays from next monday to Sept 6th
[15:33:48] <leila>	 elukey: yeah. no worries. as long as your calendar is marked, it will be fine. The meeting is not urgent and can wait for when you're back.
[15:34:12] <elukey>	 super
[15:34:39] <elukey>	 nuria: I am about to kill analytics-tool1002, the old turnilo vm. Ok to proceed?
[16:39:05] <wikibugs>	 10Analytics, 10Analytics-SWAP, 10Product-Analytics: Provide Python 3.6 on SWAP - https://phabricator.wikimedia.org/T212591 (10elukey) As FYI we have now Python3.7 + libpython3.7 on notebooks:  ` elukey@notebook1003:~$ python3.7 Python 3.7.1 (default, Dec 16 2018, 12:33:36) [GCC 6.3.0 20170516] on linux Type...
[16:47:51] <wikibugs>	 10Analytics, 10Analytics-Kanban, 10StructuredDataOnCommons, 10Tool-Pageviews: Change name and format of partition - https://phabricator.wikimedia.org/T231030 (10Nuria)
[16:48:19] <wikibugs>	 10Analytics, 10Analytics-Kanban, 10StructuredDataOnCommons, 10Tool-Pageviews: Change name and format of partition - https://phabricator.wikimedia.org/T231030 (10Nuria)
[16:50:09] <wikibugs>	 10Analytics, 10Analytics-Kanban, 10StructuredDataOnCommons, 10Tool-Pageviews: Change name and format of partition - https://phabricator.wikimedia.org/T231030 (10Nuria)
[16:55:42] <wikibugs>	 10Analytics, 10Analytics-Kanban: Add more dimensions to netflow's druid ingestion specs - https://phabricator.wikimedia.org/T229682 (10elukey) We added the following kafka supervisor in druid:  ` curl -L -X POST -H 'Content-Type: application/json' -d '{   "type": "kafka",   "dataSchema": {     "dataSource": "w...
[16:57:51] <wikibugs>	 10Analytics, 10Product-Analytics, 10Reading Depth, 10Readers-Web-Backlog (Readers-Web-Kanbanana-2019-20-Q1): Reading_depth: remove  eventlogging instrumentation - https://phabricator.wikimedia.org/T229042 (10ovasileva) Discussed with @kzimmerman today and decided the best option forward would be to transfe...
[17:03:47] <wikibugs>	 10Analytics, 10Analytics-Kanban, 10StructuredDataOnCommons, 10Tool-Pageviews: Change name and format of partition  column in mediarequest table - https://phabricator.wikimedia.org/T231030 (10Nuria)
[17:03:51] <wikibugs>	 10Analytics, 10Analytics-Kanban, 10StructuredDataOnCommons, 10Tool-Pageviews: Change name and format of partition  column in mediarequest table - https://phabricator.wikimedia.org/T231030 (10fdans) p:05Triage→03High
[17:04:31] <wikibugs>	 10Analytics, 10Operations, 10vm-requests, 10Patch-For-Review: Decommission analytics-tool1002 (old turnilo vm) - https://phabricator.wikimedia.org/T231021 (10fdans) p:05Triage→03High
[17:04:56] <wikibugs>	 10Analytics: Geoeditors_private deletion scripts scheduled day conflicts with retention period - https://phabricator.wikimedia.org/T231017 (10fdans) p:05Triage→03High
[17:04:58] <wikibugs>	 10Analytics, 10ChangeProp, 10Discovery-Search, 10EventBus, and 3 others: Better way to pause writes on elasticsearch - https://phabricator.wikimedia.org/T230730 (10Gehel) >>! In T230730#5422294, @mobrovac wrote: > There already is a mechanism in change propagation to back off and wait / retry later.  Does...
[17:07:32] <wikibugs>	 10Analytics: Geoeditors_private deletion scripts scheduled day conflicts with retention period - https://phabricator.wikimedia.org/T231017 (10mforns) We should ensure that at least we keep last 90 days. And delete the data as soon as possible after that.
[17:07:49] <wikibugs>	 10Analytics, 10Patch-For-Review: Refactor quenename into HQL hive2 action oozie jobs - https://phabricator.wikimedia.org/T231002 (10fdans) p:05Triage→03High
[17:11:12] <wikibugs>	 10Analytics: Turnilo: Remove count metric for edit_hourly data cube - https://phabricator.wikimedia.org/T230963 (10fdans) Let's try to remove this changing Turnilo's configuration.
[17:12:44] <wikibugs>	 10Analytics, 10Analytics-Data-Quality: Import of MediaWiki tables into the Data Lakes mangles usernames - https://phabricator.wikimedia.org/T230915 (10fdans) p:05Triage→03Normal
[17:12:56] <wikibugs>	 10Analytics, 10Operations, 10vm-requests, 10Patch-For-Review: Decommission analytics-tool1002 (old turnilo vm) - https://phabricator.wikimedia.org/T231021 (10ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by elukey@cumin1001 for hosts: `analytics-tool1002.eqiad.wmnet` -  analytics-tool1002.e...
[17:20:05] <wikibugs>	 10Analytics, 10Product-Analytics: Ensure Wikitech page about custom jupyter notebooks exists and is up to date - https://phabricator.wikimedia.org/T230742 (10fdans) p:05Triage→03Normal
[17:20:16] <wikibugs>	 10Analytics, 10Product-Analytics: Ensure Wikitech page about custom jupyter notebooks exists and is up to date - https://phabricator.wikimedia.org/T230742 (10fdans) a:03JAllemandou
[17:27:28] <wikibugs>	 10Analytics, 10Analytics-Wikistats, 10Operations, 10Performance-Team, 10Traffic: Piwik JS isn't cached - https://phabricator.wikimedia.org/T230772 (10fdans) a:05Nuria→03elukey
[17:29:03] <wikibugs>	 10Analytics, 10Analytics-Wikistats, 10Operations, 10Performance-Team, 10Traffic: Piwik JS isn't cached - https://phabricator.wikimedia.org/T230772 (10fdans) a:05elukey→03Nuria
[17:30:38] <mforns>	 joal, !
[17:30:49] <mforns>	 :[
[17:30:55] <joal>	 yes?
[17:30:57] <mforns>	 :D
[17:31:02] <mforns>	 I thought I'd lost you
[17:31:09] <joal>	 Here I am :)
[17:31:17] <mforns>	 do I have permission to try the dumps job with sequential loop?
[17:31:47] <joal>	 mforns: if you don't mind let's start that tomorrow morning, so that we can react relatively fast if stuff break
[17:31:59] <mforns>	 ok, makes sense
[17:32:12] <joal>	 if it runs now and fail in the middle of the night it'll be a few hours to catch (as today)
[17:32:20] <joal>	 Thanks :)
[17:32:26] <wikibugs>	 10Analytics: Upgrade Turnilo to its latest upstream - https://phabricator.wikimedia.org/T230709 (10elukey)
[17:32:29] <mforns>	 joal, I can launch it at 9:30am, but then I have to leave, is that OK for you?
[17:32:30] <wikibugs>	 10Analytics, 10Operations, 10vm-requests, 10Patch-For-Review: Decommission analytics-tool1002 (old turnilo vm) - https://phabricator.wikimedia.org/T231021 (10elukey) 05Open→03Resolved
[17:32:33] <wikibugs>	 10Analytics: Upgrade Turnilo to its latest upstream - https://phabricator.wikimedia.org/T230709 (10elukey)
[17:32:43] <wikibugs>	 10Analytics, 10Analytics-Kanban: Upgrade Turnilo to its latest upstream - https://phabricator.wikimedia.org/T230709 (10elukey)
[17:32:47] <joal>	 Super great, please ping me when you start so that I know :)
[17:32:50] <joal>	 mforns: --^
[17:32:56] <joal>	 I'll monitor it
[17:32:56] <elukey>	 analytics-tool1002 removed people
[17:32:59] <mforns>	 joal, actually, I can launch it pointing at the already existing data in /tmp
[17:33:01] <elukey>	 turnilo upgrade completed
[17:33:14] <mforns>	 so it will run pretty quickly (in theory)
[17:33:22] <mforns>	 wowowo
[17:33:23] <joal>	 soun
[17:33:31] <joal>	 sounds good to me
[17:33:48] <joal>	 tomorrow morning is still preferable if you're ok, so that I can stop that already long day :D
[17:33:50] <mforns>	 cool, tomorrow around 9am, thanks!
[17:33:57] <joal>	 \o/
[17:34:00] <mforns>	 ooof course!
[17:34:30] <joal>	 <3 elukey - I'll see you tomorrow for moar hive fun ;)
[17:34:37] <joal>	 Have a good night team
[18:02:56] * elukey off!
[18:08:57] <mforns>	 byeeee
[19:20:34] <wikibugs>	 (03CR) 10Mforns: "I'm going to test now." (036 comments) [analytics/wikistats2] - 10https://gerrit.wikimedia.org/r/531148 (https://phabricator.wikimedia.org/T230514) (owner: 10Fdans)
[20:46:14] <wikibugs>	 (03CR) 10Mforns: "I tested, and could not find any problem :]" [analytics/wikistats2] - 10https://gerrit.wikimedia.org/r/531148 (https://phabricator.wikimedia.org/T230514) (owner: 10Fdans)
[22:08:22] <wikibugs>	 10Analytics, 10Analytics-Kanban: Add more dimensions to netflow's druid ingestion specs - https://phabricator.wikimedia.org/T229682 (10ayounsi) CR to add source and dest country code: https://gerrit.wikimedia.org/r/c/operations/puppet/+/531752 Dimensions are: ` "country_ip_src": "US", "country_ip_dst": "US", `