[03:15:13] <wikibugs>	 10Analytics-Radar, 10Core Platform Team, 10Dumps-Generation: HTML Dumps - June/2020 - https://phabricator.wikimedia.org/T254275 (10RBrounley_WMF)
[03:16:14] <wikibugs>	 10Analytics-Radar, 10Core Platform Team, 10Dumps-Generation: HTML Dumps - June/2020 - https://phabricator.wikimedia.org/T254275 (10RBrounley_WMF) Great, thanks @CDanis - cited you here on the task related to the 429 errors we're getting. https://phabricator.wikimedia.org/T255524
[04:49:46] <wikibugs>	 10Analytics, 10DBA, 10Patch-For-Review: Upgrade analytics dbstore databases to Buster and Mariadb 10.4 - https://phabricator.wikimedia.org/T254870 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by marostegui on cumin1001.eqiad.wmnet for hosts: ` ['dbstore1004.eqiad.wmnet'] ` The log can be found...
[05:10:19] <wikibugs>	 10Analytics, 10DBA, 10Patch-For-Review: Upgrade analytics dbstore databases to Buster and Mariadb 10.4 - https://phabricator.wikimedia.org/T254870 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['dbstore1004.eqiad.wmnet'] `  and were **ALL** successful.
[05:12:47] <wikibugs>	 10Analytics, 10DBA: Upgrade analytics dbstore databases to Buster and Mariadb 10.4 - https://phabricator.wikimedia.org/T254870 (10Marostegui)
[07:15:50] <wikibugs>	 10Analytics, 10Analytics-Cluster, 10Analytics-Kanban, 10Patch-For-Review, 10User-Elukey: Repurpose db1108 as generic Analytics db replica - https://phabricator.wikimedia.org/T234826 (10elukey) 05Stalled→03Open
[07:15:52] <wikibugs>	 10Analytics, 10Analytics-Kanban: Analytics Ops Technical Debt - https://phabricator.wikimedia.org/T240437 (10elukey)
[07:15:54] <wikibugs>	 10Analytics-EventLogging, 10Analytics-Kanban: Sunset MySQL data store for eventlogging - https://phabricator.wikimedia.org/T159170 (10elukey)
[07:18:47] <wikibugs>	 10Analytics, 10Analytics-Cluster, 10Analytics-Kanban, 10Patch-For-Review, 10User-Elukey: Repurpose db1108 as generic Analytics db replica - https://phabricator.wikimedia.org/T234826 (10elukey) In T250709 Dan was able to pull all data from db1108's `log` database on HDFS, and my team is currently vetting...
[07:18:56] <wikibugs>	 10Analytics-Radar, 10CPT Initiatives (Revision Storage Schema Improvements), 10Epic, 10MW-1.35-notes (1.35.0-wmf.32; 2020-05-12), 10Technical-Debt: Remove revision_comment_temp and revision_actor_temp - https://phabricator.wikimedia.org/T215466 (10Marostegui)
[07:19:54] <wikibugs>	 10Analytics, 10Analytics-Kanban: Spike, see how easy/hard is to scoop all tables from Eventlogging log database - https://phabricator.wikimedia.org/T250709 (10elukey) Once we agree on proceeding, I'll probably reimage db1108 to Buster wiping all the data to start fresh, keep it in mind before giving me the gre...
[07:20:55] <wikibugs>	 10Analytics-Cluster, 10Analytics-Radar, 10User-Elukey: Monitoring GPU Usage on stat Machines - https://phabricator.wikimedia.org/T251938 (10elukey) a:05Milimetric→03None
[07:27:37] <wikibugs>	 10Analytics, 10Analytics-Cluster, 10Analytics-Kanban, 10Patch-For-Review, 10User-Elukey: Repurpose db1108 as generic Analytics db replica - https://phabricator.wikimedia.org/T234826 (10Marostegui) That sounds good to me. Reminder, use: `echo partman/custom/db.cfg ;; \` recipe when reimaging so everything...
[07:48:41] <joal>	 Hi team
[08:45:55] <wikibugs>	 10Analytics: Establish if Camus can support TLS encryption + Authentication to Kafka with a minimal code change - https://phabricator.wikimedia.org/T250148 (10elukey) 05Open→03Declined This seems not the road to follow, declining the task. We can re-open if we feel it is needed.
[08:45:59] <wikibugs>	 10Analytics: Add Authentication/Encryption to Kafka Jumbo's clients - https://phabricator.wikimedia.org/T250146 (10elukey)
[08:48:15] <wikibugs>	 10Analytics, 10Analytics-Cluster: Enforce authentication for Kafka Jumbo Topics - https://phabricator.wikimedia.org/T255543 (10elukey)
[08:48:55] <wikibugs>	 10Analytics, 10Analytics-Cluster, 10Operations, 10decommission-hardware, 10ops-eqiad: Decommission analytics10[28-31,33-41] - https://phabricator.wikimedia.org/T227485 (10elukey)
[08:51:30] <wikibugs>	 10Analytics, 10Analytics-Cluster: Verify if Superset can authenticate to Druid via TLS/Kerberos - https://phabricator.wikimedia.org/T250487 (10elukey)
[08:51:48] <wikibugs>	 10Analytics, 10Analytics-Cluster: Verify if Turnilo can pull data from Druid using Kerberos/TLS - https://phabricator.wikimedia.org/T250485 (10elukey)
[09:23:36] <wikibugs>	 10Analytics, 10Analytics-Cluster: Enforce authentication for Druid datasources - https://phabricator.wikimedia.org/T255545 (10elukey)
[09:51:54] <wikibugs>	 10Analytics, 10Analytics-Kanban: Update skewed-join strategy in Mediawiki-history to prevent errors in case of task-retry - https://phabricator.wikimedia.org/T255548 (10JAllemandou)
[10:16:45] <wikibugs>	 10Analytics, 10Analytics-Kanban: Update skewed-join strategy in Mediawiki-history to prevent errors in case of task-retry - https://phabricator.wikimedia.org/T255548 (10JAllemandou)
[10:42:25] <wikibugs>	 10Analytics, 10Analytics-Cluster, 10Analytics-Kanban, 10Operations: Create a profile to standardize the deployment of JVM packages and configurations - https://phabricator.wikimedia.org/T253553 (10elukey) In T252913 Keith is working on moving ES and Kafka to profile::java, so the one missing is Cassandra p...
[10:55:08] <milimetric>	 elukey: just making sure you see https://phabricator.wikimedia.org/T255485 relatively early, the history dumps didn’t sync yet this month
[10:55:50] <milimetric>	 (and I think I don’t have access to help)
[10:58:44] <elukey>	 milimetric: I wanted to check them after the fix, I thought they were working but apparently not, checking
[11:03:27] <joal>	 hi milimetric - Would you mind triple checking that T255548 makes sense?
[11:03:28] <stashbot>	 T255548: Update skewed-join strategy in Mediawiki-history to prevent errors in case of task-retry - https://phabricator.wikimedia.org/T255548
[11:15:42] <elukey>	 milimetric: what is the  condition to trigger the rsync?
[11:15:53] <elukey>	 the hdfs dfs -test -d seems to take only one dir path
[11:16:07] <elukey>	 and it doesn't like the {etc..,etc..} format
[11:16:33] <elukey>	 the condition is
[11:16:33] <elukey>	 hdfs dfs -test -d hdfs:///wmf/data/archive/mediawiki/history/{$(/bin/date --date="$(/bin/date +%Y-%m-15) -1 month" +"%Y-%m"),$(/bin/date --date="$(/bin/date +%Y-%m-15) -2 month" +"%Y-%m")}
[11:17:31] <elukey>	 see
[11:17:31] <elukey>	 elukey@labstore1007:~$ echo /wmf/data/archive/mediawiki/history/{$(/bin/date --date="$(/bin/date +%Y-%m-15) -1 month" +"%Y-%m"),$(/bin/date --date="$(/bin/date +%Y-%m-15) -2 month" +"%Y-%m")}
[11:17:35] <elukey>	 /wmf/data/archive/mediawiki/history/2020-05 /wmf/data/archive/mediawiki/history/2020-04
[11:17:38] <elukey>	 elukey@labstore1007:~$ echo /wmf/data/archive/mediawiki/history/\{$(/bin/date --date="$(/bin/date +%Y-%m-15) -1 month" +"%Y-%m"),$(/bin/date --date="$(/bin/date +%Y-%m-15) -2 month" +"%Y-%m")\}
[11:17:42] <elukey>	 /wmf/data/archive/mediawiki/history/{2020-05,2020-04}
[11:17:45] <elukey>	 (the latter has extra \{ \})
[11:17:58] <elukey>	 both cases are not accepted by hdfs dfs -test -d
[11:18:26] <elukey>	 the first one doesn't because it needs only one dir, not two, the latter since it seems that the {..} is not good
[11:19:09] <elukey>	 I don't recall exactly but is it either one or the other dirs are present, then rsync, or both needs to be there?
[11:22:47] <wikibugs>	 10Analytics, 10Analytics-Kanban: mediawiki history dumps sync not working - https://phabricator.wikimedia.org/T255485 (10elukey) The test condition seems not right:  ` elukey@labstore1007:~$ echo /wmf/data/archive/mediawiki/history/{$(/bin/date --date="$(/bin/date +%Y-%m-15) -1 month" +"%Y-%m"),$(/bin/date --d...
[11:36:07] <elukey>	 !log reboot an-druid100[1,2] for kernel upgrades
[11:36:09] <stashbot>	 Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[11:44:13] <milimetric>	 I’ll check in a bit, sorry
[11:45:31] <milimetric>	 I think the intention was to make sure there were three before deleting one, but that shouldn’t be a condition to sync in the first place
[11:46:34] <elukey>	 milimetric: it is needed since the rsync fails returning non-zero if no data is present
[11:47:01] <elukey>	 so since we try to rsync both, they need to be there 
[11:48:08] <elukey>	 yeah so I'd say both
[11:48:27] <elukey>	 I'll fix the script after lunch
[11:51:06] <elukey>	 !log re-run webrequest-druid-hourly-coord 16/06T10
[11:51:08] <stashbot>	 Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[12:13:57] * elukey lunch!
[13:33:05] <wikibugs>	 10Analytics-Radar, 10Performance-Team: Invalid navigation timing events - https://phabricator.wikimedia.org/T254606 (10Ottomata) I dunno why clients would send this, but we could avoid this on our side in two ways:  A. Set `minimum` and `maximum` values for this (and other?) numeric fields. B. Make EventGate a...
[13:33:54] <ottomata>	 `test1
[13:33:58] <ottomata>	 `test`
[13:34:09] <ottomata>	 huh!  i didn't know irclcoud formatted `    `
[13:34:11] <ottomata>	 cool
[13:59:11] <wikibugs>	 10Analytics, 10Better Use Of Data, 10Event-Platform, 10Product-Analytics: Augment Hive event data with normalized host info from meta.domain - https://phabricator.wikimedia.org/T251320 (10Ottomata) Ya we just need to add a Refine transform function for this.
[14:06:59] <wikibugs>	 10Analytics-Radar, 10Performance-Team: Invalid navigation timing events - https://phabricator.wikimedia.org/T254606 (10Milimetric) I think it's expected.  If something is typed `int` it should expect to reject values larger than maxint.  And if clients need to set something bigger, they should use a string or...
[14:11:56] <wikibugs>	 10Analytics-Radar, 10Core Platform Team, 10Dumps-Generation: HTML Dumps - June/2020 - https://phabricator.wikimedia.org/T254275 (10Ottomata) >  we're thinking about listening to Kafka through this endpoint below or something similar The [[ https://wikitech.wikimedia.org/wiki/Event_Platform/EventStreams | Eve...
[14:16:54] <wikibugs>	 10Analytics-Radar, 10Performance-Team: Invalid navigation timing events - https://phabricator.wikimedia.org/T254606 (10Ottomata) Since JSONSchema only has numeric and integer types that have no min or max values, we'll have to choose.  For integer types in Hive, we assume long, since assuming integer might los...
[14:23:15] <elukey>	 !log reboot druid100[7,8] for kernel upgrades
[14:23:17] <stashbot>	 Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[14:23:20] <ottomata>	 hmm hello a-team!
[14:23:28] <ottomata>	 did the refinery train not happen last week?
[14:23:33] <joal>	 Hi ottomata
[14:23:46] <joal>	 nope ottomata, blocked because of jar-versions in refinery
[14:23:51] <joal>	 Dhould happen this week
[14:24:07] <ottomata>	 because of jar versions?
[14:24:14] <elukey>	 ottomata: hello! My fault, archiva clean up
[14:24:18] <ottomata>	 ahhh
[14:24:21] <elukey>	 yeah :(
[14:24:25] <elukey>	 should be fixed now
[14:24:34] <ottomata>	 hm, can I do the deploy now then?
[14:24:41] <ottomata>	 i was going to work on the search satisfaction migration today and this week
[14:25:12] <elukey>	 ottomata: source was published IIRC, so yes in theory you can deploy
[14:25:24] <joal>	 ottomata: let's ask milimetric when he plans on deploing, could be today
[14:25:26] <elukey>	 there are some follow ups to do, jobs etc.. maybe Dan can do them later on?
[14:25:35] <ottomata>	 oh, the refinery-source deploy got done, but just not refinery scap?
[14:25:49] <joal>	 hm, I htink so yeah
[14:25:55] <milimetric>	 I'm at your service
[14:26:04] <elukey>	 ottomata: yep, because jars missing
[14:26:06] <ottomata>	 ok
[14:26:06] <milimetric>	 I'll deploy whenever, but I want the rsync to be fixed
[14:27:02] <elukey>	 !log stop timers on an-launcher1001, prep before rebooting an-coord1001
[14:27:05] <stashbot>	 Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[14:27:05] <elukey>	 milimetric: --^
[14:27:18] <elukey>	 let's coordinate when we do maintenance/deploy
[14:27:27] <elukey>	 i need to reboot some hosts for kernel upgrades
[14:27:28] <milimetric>	 ok, but did you figure out what you needed?  I'm still confused about that command
[14:27:44] <elukey>	 milimetric: for the rsync? 
[14:28:01] <milimetric>	 yes, you said it failed but we didn't get any alarms, right?
[14:28:26] <milimetric>	 I just knew to look because one of the users told me data was missing
[14:28:36] <elukey>	 milimetric: we didn't because  the script doesn't return non zero, i have to fix it, after that it should be fine
[14:28:39] <elukey>	 sneaky failure
[14:28:44] <milimetric>	 oh ok
[14:28:57] <elukey>	 should be done by EOD
[14:28:59] <elukey>	 is it ok?
[14:29:39] <milimetric>	 totally fine, just making sure we don't need to deploy anything, by the script you mean the stuff in puppet, right?
[14:29:55] <wikibugs>	 10Analytics-Radar, 10Core Platform Team, 10Dumps-Generation: HTML Dumps - June/2020 - https://phabricator.wikimedia.org/T254275 (10Nuria) >The EventStreams external endpoint will work, but I don't think it should really be used to build reliable and scalable production upserts This is an important point, eve...
[14:30:07] <elukey>	 milimetric: yep correct
[14:32:12] <milimetric>	 ok, and I didn't fully understand when you said "<@elukey> I don't recall exactly but is it either one or the other dirs are present, then rsync, or both needs to be there?", sorry my brain is not working today.  But did you resolve that?
[14:33:56] <elukey>	 milimetric: I think so, we require rsync to check both dirs so I'd say that we need to run it when both are present
[14:33:59] <elukey>	 to avoid failures
[14:34:14] <elukey>	 or break down the rsync in two
[14:34:31] <elukey>	 will file a code review and add you to it so we can discuss
[14:34:33] <ottomata>	 hmmm milimetric:  iiuc thte rsync is a shell script problem in puppet, so deploy won't make a difference either way, if we could deploy I could get unblocked on my stuff! :D 
[14:34:34] <milimetric>	 elukey: but right now 2020-04 and 2020-03 are present, so it shouldn't check 2020-05 right?  That's the one it's trying to sync
[14:35:08] <elukey>	 ottomata: yes you can go ahead
[14:35:46] <milimetric>	 ottomata: sorry, yes, if you wanna go real fast you can deploy, I'll do my slowpoke checks and restart everything after
[14:35:54] <milimetric>	 if I go fast I 100% break something
[14:35:57] <ottomata>	 no no don't need real fast
[14:35:58] <elukey>	 milimetric: the script checks, in the if, /wmf/data/archive/mediawiki/history/2020-04 and /wmf/data/archive/mediawiki/history/2020-05 now
[14:36:05] <ottomata>	 just was hoping to work on it today
[14:36:48] <milimetric>	 elukey: yeah that seems wrong... no?
[14:38:02] <elukey>	 milimetric: I didn't do it originally, but I assume there was a reason for this
[14:38:11] <milimetric>	 oh!!!
[14:38:17] <milimetric>	 doh sorry, it's checking the SOURCE for those dirs
[14:38:21] <milimetric>	 wait... why...
[14:38:38] <milimetric>	 shouldn't it just check the one it's synching?  Like 2020-05?
[14:39:12] <milimetric>	 mforns: why's is the history rsync checking that 2020-04 exists when it's trying to sync 2020-05?
[14:39:31] <mforns>	 hmmmmmmm milimetric 
[14:40:02] <mforns>	 milimetric: lemme look at the code
[14:40:14] <milimetric>	 mforns: https://github.com/wikimedia/puppet/blob/d921b7a69aa94d664db4c7d4cedad5131fe2c48d/modules/dumps/manifests/web/fetches/stats.pp
[14:40:31] <mforns>	 thx
[14:41:17] <mforns>	 milimetric: it does both the last month and the one before the last no?
[14:41:58] <elukey>	 joal did it IIRC
[14:41:58] <milimetric>	 mforns: in case there were updates to the last month or something?  Shouldn't we just do that manually?
[14:42:08] <mforns>	 I guess... it launches rsync for the prior month, in case we did some correction to the last month's data
[14:42:17] <elukey>	 "# Copying only the last 2 dumps explicitely"
[14:42:27] <elukey>	 yeah I think so
[14:42:46] <mforns>	 I think it doesn't hurt, given that if files are unchanged, there will be no actual copying right?
[14:42:50] <milimetric>	 this seems confusing, we should at least add a comment explaining why.  We don't do that with any of the other syncs, and it would make just as much sense for any of them
[14:42:55] <joal>	 It's actually about deletion more than copying - I should have been more clear elukey, mforns, milimetric 
[14:42:59] <mforns>	 aha
[14:43:15] <elukey>	 never doubt what Joseph did, this is something I learned 
[14:43:20] <mforns>	 xD
[14:43:21] <mforns>	 yea
[14:43:23] <elukey>	 there is always a reason :D
[14:43:26] <milimetric>	 I doubt everything :)
[14:43:32] <elukey>	 how dare you!
[14:43:38] <mforns>	 heheh
[14:43:49] <milimetric>	 my life is just a spiral of doubt, I'm barely hanging on to physical things I can touch and I half-distrust those too
[14:44:09] <joal>	 The hdfs-rsync command contains `--delete`, so stuff present in destination but not in source is deleted, therefore all dumps but the last 2 - And, normally sync for the previous is very cheap as already done
[14:44:12] <milimetric>	 https://www.ted.com/talks/the_ted_interview_donald_hoffman_has_a_radical_new_theory_on_how_we_experience_reality
[14:44:37] <joal>	 And I agree with milimetric - Please don't trust me :)
[14:45:08] <milimetric>	 got it, ok, so we'll slap a comment on there and everything will make sense.  Luca's adding me to the code review, I'll take care of it then
[14:45:22] <milimetric>	 ok then, deploying
[14:45:44] <milimetric>	 thanks all!  I am the BEST at ops weeks, instead of solving something I just BOTHER EVERYONE.  Sorry :/
[14:45:53] <joal>	 Thanks a lot folks for the checks - sorry for the non-explicit enough :S!
[14:46:39] <ottomata>	 milimetric:  that is a really great way to solve ops week
[14:47:04] <elukey>	 ottomata: I am waiting for the cluster to drain before rebooting an-coord1001, tell me something before deployng
[14:47:05] <milimetric>	 be so annoying that I get kicked out of rotation?  I know! shhhh
[14:47:26] <milimetric>	 wait elukey / ottomata: I'm deploying right now
[14:47:48] <elukey>	 okok, but please don't restart anything etc.. after it
[14:47:58] <elukey>	 I need to reboot the coordinator
[14:48:06] <elukey>	 and all the timers have been disabled
[14:48:23] <milimetric>	 yep, I'm coordinating
[14:52:22] <elukey>	 milimetric: but have you already started the scap deploy?
[14:52:31] <milimetric>	 no, I'm doing checks now elukey 
[14:52:48] <elukey>	 okok please don't run it then, I am about to reboot
[14:53:32] <milimetric>	 mforns: the deploy you did last week, you didn't do any of the refinery stuff, right?
[14:53:58] <milimetric>	 you just built v0.0.126, right?
[14:54:15] <mforns>	 milimetric: I built 126, yes and did not deploy refinery
[14:54:28] <mforns>	 we needed to take care of jars first
[14:54:29] <milimetric>	 ok, so I'll just copy that train up to this one,k
[14:54:32] <milimetric>	 yep
[14:54:36] <mforns>	 ok, thanks!
[15:03:25] <milimetric>	 ottomata: you want to do the jar version bump to 0.0.126 yourself?  What's this about?
[15:03:29] <milimetric>	 (for refine)
[15:04:02] <milimetric>	 "There's some pending changes for Refinery to be activated by bumping up refinery_jar_version in puppet (ask ottomata)"
[15:04:10] <milimetric>	 in puppet?!
[15:05:53] <elukey>	 yes it is in puppet
[15:06:21] <elukey>	 it is the jar used when launching spark via timers
[15:06:23] <elukey>	 IIRC
[15:06:35] <elukey>	 rebooting coord!
[15:06:43] <elukey>	 !log reboot an-coord1001 for kernel upgrades
[15:06:46] <stashbot>	 Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[15:07:05] <wikibugs>	 10Analytics, 10Analytics-EventLogging, 10Analytics-Kanban, 10Event-Platform, and 2 others: Decommission EventLogging backend components by migrating to MEP - https://phabricator.wikimedia.org/T238230 (10Ottomata) Migration plan:   0. Switch all refine jobs to refinery 0.0.126 and make eventlogging_analytic...
[15:07:20] <ottomata>	 milimetric: ya i want to do the refine puppet part
[15:07:41] <ottomata>	 once you've deployed i can verify some things with that jar version before I acutally do it
[15:07:53] <ottomata>	 described what i'll be doing in ^^^
[15:08:49] <milimetric>	 ottomata: but refinery-source is already deployed, with 0.0.126
[15:09:09] <milimetric>	 so are you saying I should deploy refinery as is, then you'll bump the bundle.properties and I'll deploy again?
[15:09:32] <milimetric>	 I'm not sure what yall mean about "refine in puppet", these are the refine properties: https://github.com/wikimedia/analytics-refinery/blob/master/oozie/webrequest/load/bundle.properties
[15:09:38] <milimetric>	 and they're currently set to 0.0.122
[15:09:51] <ottomata>	 milimetric: ?
[15:10:01] <ottomata>	 refine is what is doing event data
[15:10:04] <ottomata>	 nothing in oozie
[15:10:11] <ottomata>	 its a puppet change
[15:10:16] <milimetric>	 oh we're overloading terms
[15:10:20] <joal>	 Hi ebernhardson or dcausse - Would you please rate-limit our airflow jobs? there currently are 3 spark jobs running, each taking quite some resource
[15:10:26] <milimetric>	 when we talked about "refine" so far we talked about webrequest refine
[15:10:57] <milimetric>	 you're talking event refine, ok
[15:11:04] <ottomata>	 hehe ok milimetric  afaik we called that webrequest load, but it is the sam ething
[15:11:08] <ottomata>	 just done via hive and oozie
[15:11:37] <ottomata>	 milimetric:  i need refinery to be deployed to get the new version on e.g. stat boxes and an-launcher
[15:11:40] <ottomata>	 via scap
[15:11:50] <ebernhardson>	 joal: i just started a backfill for may, each job is only ~5 minutes but will take a minute.  I'll see if i can have it spread them a bit
[15:12:05] <milimetric>	 right, so but webrequest load isn't affected, right?  No bump needed there, fine to keep going with 0.0.122
[15:12:08] <ottomata>	 right
[15:12:13] <joal>	 ack ebernhardson - gentle backfillin
[15:12:16] <joal>	 please :)
[15:12:35] <milimetric>	 ok, then I'll do scap deploy as soon as elukey tells me the reboot is done
[15:13:00] <elukey>	 !log re-enabling timers on launcher after maintenance
[15:13:01] <stashbot>	 Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[15:13:03] <elukey>	 milimetric: green light
[15:18:09] <elukey>	 ottomata: ok if I reboot jumbo 1007/8?
[15:18:18] <ottomata>	 sure
[15:18:22] <elukey>	 ack thanks 
[15:19:37] <milimetric>	 mforns: in https://gerrit.wikimedia.org/r/#/c/analytics/refinery/+/601773/ you only change the hourly bundle but in the etherpad you say restart hourly and daily ones, making sure I'm not missing anything
[15:20:11] <elukey>	 !log reboot kafka-jumbo1007 for kernel upgrades
[15:20:14] <stashbot>	 Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[15:20:22] <mforns>	 milimetric: lookin
[15:21:36] <mforns>	 milimetric: the changes in the bundle.xml should only be hourly, that's fine. we still want to receive alerts for the daily metrics.
[15:22:18] <milimetric>	 ok mforns thx I'll restart just the hourly then
[15:22:38] <mforns>	 milimetric: but both bundles should be restarted because of the changes in refinery-sourcve
[15:23:01] <mforns>	 the fix for timeseries with holes (sparse)
[15:27:08] * elukey afk a bit before standup
[15:28:10] <milimetric>	 ooh, scap was a lot quicker this time, thx
[15:28:27] <milimetric>	 oh ok mforns, thx, got it
[15:29:11] <milimetric>	 oh wait, ottomata then this fix won't make it in: https://gerrit.wikimedia.org/r/#/c/analytics/refinery/source/+/602463/
[15:29:32] <ottomata>	 milimetric:  why not?  its ok though, that one isn't important
[15:29:40] <milimetric>	 'cause refinery was built last week
[15:29:50] <milimetric>	 I can build again and deploy again after you get unblocked
[15:29:54] <ottomata>	 before june 10?
[15:29:54] <ottomata>	 its ok
[15:29:56] <ottomata>	 i don't need that fix
[15:30:28] <ottomata>	 i need oh
[15:30:35] <ottomata>	 i need two others that were merged june 10-
[15:30:40] <ottomata>	 milimetric:  those should make it in, no?
[15:30:54] <ottomata>	 they were merged before refinery-source was released?
[15:31:04] <ottomata>	 specifcall
[15:31:05] <ottomata>	 i need
[15:31:05] <ottomata>	 https://gerrit.wikimedia.org/r/c/analytics/refinery/source/+/602475
[15:31:12] <ottomata>	 and https://gerrit.wikimedia.org/r/c/analytics/refinery/source/+/601865
[15:31:20] <milimetric>	 (checking)
[15:32:51] <milimetric>	 yes
[15:33:00] <milimetric>	 those are in 0.0.126
[15:33:05] <ottomata>	 great
[15:33:57] <milimetric>	 !log refinery deployed and synced to hdfs, with refinery-source at 0.0.126
[15:33:58] <stashbot>	 Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[15:34:16] <milimetric>	 ottomata: k done ^
[15:34:26] <ottomata>	 yeehaw ty
[15:35:23] <milimetric>	 there's a bunch of jobs to restart, with the archive cleanup, I'll look at those more carefully and do them after standup
[15:36:30] <joal>	 milimetric: there is no real need to restart them - They are working fine as they are, and code is ready in case of restart- let's discuss that option in standup
[15:37:39] <joal>	 Hey ebernhardson, I din't meant to fully stop the backfilling :-S
[15:38:40] <ebernhardson>	 joal: i'll start it up again in a moment, i wrote a patch for our airflow to adjust resourcing and limit concurrency
[15:38:46] <joal>	 \o/
[15:38:49] <joal>	 thanks ebernhardson :)
[15:39:51] <ebernhardson>	 we also only have 1 job left in oozie, quite happy with it (at least, compared to oozie) :)
[15:41:25] <fdans>	 joal: I caved in and moved the business logic of the UDFs to core :)
[15:41:43] <joal>	 Thanks a lot fdans :)
[15:41:49] <joal>	 I'll review!
[15:41:58] <fdans>	 joal: haven't sent patches yet
[15:42:14] <fdans>	 they'll be there just before or after meetings
[15:42:15] <joal>	 ack
[16:01:03] <nuria>	 ping fdans 
[16:01:12] <fdans>	 wo sorry
[16:02:35] <icinga-wm>	 PROBLEM - Check the last execution of eventlogging_to_druid_netflow_hourly on an-launcher1001 is CRITICAL: CRITICAL: Status of the systemd unit eventlogging_to_druid_netflow_hourly https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[16:02:47] <elukey>	 !log reboot kafka-jumbo1008 for kernel upgrades
[16:02:50] <stashbot>	 Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[16:04:17] <elukey>	 Exception in thread "main" java.io.FileNotFoundException: File file:/srv/deployment/analytics/refinery/artifacts/org/wikimedia/analytics/refinery/refinery-job-0.0.105.jar does not exist
[16:04:22] <elukey>	 :P
[16:04:55] <elukey>	 downtimed otherwise it will be a mes
[16:04:58] <elukey>	 *mess
[16:51:31] <wikibugs>	 10Analytics-Radar, 10Better Use Of Data, 10Desktop Improvements, 10Product-Infrastructure-Team-Backlog, and 7 others: Client side error logging production launch - https://phabricator.wikimedia.org/T226986 (10jlinehan)
[17:03:58] <wikibugs>	 10Analytics-Radar, 10Better Use Of Data, 10Product-Analytics, 10Epic, and 2 others: Session Length Metric. Web implementation - https://phabricator.wikimedia.org/T248987 (10mpopov)
[17:06:11] * milimetric lunching and then restarting
[17:06:31] <milimetric>	 joal: we didn't get to talk after standup, but should I basically just hold off restarting jobs except for the ones explicitly mentioned in the train?
[17:08:08] <elukey>	 milimetric: I think that what Joseph was trying to say is that we don't really need to do it now since the HDFS refinery version that every job uses is what counts, so we can restart at the pace that we want to avoid super annoying ops weeks
[17:08:20] <elukey>	 and I feel guilty of this since I caused this problem :(
[17:08:44] <elukey>	 also, milimetric, ok if I merge https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/605959/ ?
[17:09:09] <elukey>	 some timers are failing after the deployment (rightfully, didn't think about them)
[17:09:24] <milimetric>	 elukey: sure, ok to merge
[17:09:56] <milimetric>	 I'm not annoyed, I'm happy to remove annoyance from the team, for a change, instead of adding it.  So I'll do that then, slowly over the next few days
[17:10:06] <wikibugs>	 10Analytics, 10Growth-Team, 10Product-Analytics (Kanban): Newcomer tasks: update schema whitelist for Guidance - https://phabricator.wikimedia.org/T255501 (10LGoto) p:05Triage→03Medium
[17:13:41] <wikibugs>	 10Analytics-Radar, 10Performance-Team: Invalid navigation timing events - https://phabricator.wikimedia.org/T254606 (10Nuria) > B. is probably the best, but perhaps that could also lead to unexpected behavior?  I think B is the best option, but min/max values should not "match" the clients but rather (i agree...
[17:20:26] <elukey>	 going afk for a bit to have a run until there is light :)
[17:44:43] <ottomata>	 OHHH mforns  i remember why
[17:44:59] <ottomata>	 refine uses event data to find the schema url
[17:45:21] <ottomata>	 for eventlogging metawiki, it uses the 'schema' field and constructs a metawiki api url
[17:45:50] <ottomata>	 for MEP schema repo, it uses the $schema field and looks in configured schema repos for that uri
[17:46:10] <ottomata>	 EventLogging client always sets 'schema' field, in both cases
[17:46:41] <ottomata>	 so eventlogging metawiki refine can continue to find the schema on metawiki for both formats of event data
[17:47:04] <ottomata>	 but mep type refine job can't, as $schema is only set for migrated events
[17:54:21] <ottomata>	 actually, after the event ingestion stuff i'm working on, we probably can make refine use stream config to find the schemas! then we don't have to read event data at all
[17:54:23] <ottomata>	 that will be nice!
[17:56:42] <wikibugs>	 10Analytics-Radar, 10Performance-Team: Invalid navigation timing events - https://phabricator.wikimedia.org/T254606 (10Ottomata) I think this would just make us more opinionated about valid JSON data than JSON is, which is a good thing.  Basically, numeric JSON values outside of the normal long or double range...
[18:00:54] <ottomata>	 hmmmmmmm
[18:00:56] <ottomata>	 nuria:  yt?
[18:10:42] <icinga-wm>	 RECOVERY - Check the last execution of eventlogging_to_druid_netflow_hourly on an-launcher1001 is OK: OK: Status of the systemd unit eventlogging_to_druid_netflow_hourly https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[18:17:47] <joal>	 ack milimetric - restart over next days is good (you can also not do it if you prefer :)
[18:23:02] <wikibugs>	 10Analytics-Radar, 10Operations, 10Traffic: Spammy events coming our way for sites such us https://ru.wikipedia.kim - https://phabricator.wikimedia.org/T190843 (10Astonmalie) I thought ru stand for Russia, this can just be a Russia version of wikipedia  This is my 2 cent though, i only assist students with [...
[18:33:40] <wikibugs>	 (03PS1) 10Ottomata: event_transforms - Set legacy eventlogging `ip` field if it exists [analytics/refinery/source] - 10https://gerrit.wikimedia.org/r/605989 (https://phabricator.wikimedia.org/T238230)
[18:33:48] <ottomata>	 yargh
[18:33:50] <ottomata>	 joal:  i missed one
[18:33:51] <ottomata>	 https://gerrit.wikimedia.org/r/c/analytics/refinery/source/+/605989
[18:34:03] <ottomata>	 also need to set ip field for  backwards compaible with EL tables
[18:34:07] <ottomata>	 (unless no one uses it?)
[18:35:16] <joal>	 ottomata: Ah - reason is for keeping old fields having a value
[18:36:27] <joal>	 ottomata: I guess this is similar to keeping the UA fields updated
[18:36:51] <ottomata>	 yup
[18:36:53] <ottomata>	 exactly
[18:39:17] <wikibugs>	 (03CR) 10Joal: [C: 03+1] "Works for me - The number of cases to maintain backward compatibility grows ... Should we consider not actually doing it?" [analytics/refinery/source] - 10https://gerrit.wikimedia.org/r/605989 (https://phabricator.wikimedia.org/T238230) (owner: 10Ottomata)
[18:41:00] <ottomata>	 hahah
[18:41:02] <ottomata>	 i wish joal
[18:41:03] <ottomata>	 i wish.
[18:41:16] <ottomata>	 that was my orignal plan a year ago, nuria and jason convinced me otherwise
[18:41:29] <joal>	 ack ottomata :)
[18:42:37] <wikibugs>	 (03PS2) 10Ottomata: event_transforms - Set legacy eventlogging `ip` field if it exists [analytics/refinery/source] - 10https://gerrit.wikimedia.org/r/605989 (https://phabricator.wikimedia.org/T238230)
[18:45:37] <ottomata>	 milimetric:  joal , testing that now.  mind if I do a non-train deployment to get it out?  
[18:46:04] <joal>	 no problem for me
[18:46:12] <milimetric>	 likewise
[18:46:39] <milimetric>	 train is just supposed to be a convenience not a bottleneck
[18:46:48] <ottomata>	 ok
[18:52:39] <wikibugs>	 10Analytics, 10Analytics-EventLogging: NewcomerTask EventLogging schema has invalid array items type specification - https://phabricator.wikimedia.org/T255597 (10Ottomata)
[18:53:45] <wikibugs>	 (03CR) 10Ottomata: [C: 03+2] event_transforms - Set legacy eventlogging `ip` field if it exists [analytics/refinery/source] - 10https://gerrit.wikimedia.org/r/605989 (https://phabricator.wikimedia.org/T238230) (owner: 10Ottomata)
[18:54:41] <wmf-insecte>	 Starting build #45 for job analytics-refinery-maven-release-docker
[19:04:37] <wmf-insecte>	 Project analytics-refinery-maven-release-docker build #45: 09SUCCESS in 9 min 56 sec: https://integration.wikimedia.org/ci/job/analytics-refinery-maven-release-docker/45/
[19:09:00] <ottomata>	 joal:  to do the symlink update
[19:09:05] <ottomata>	 i just need to set RELEASE_VERSION right?
[19:09:17] <ottomata>	 i should leave the ZUUL_* fields alone?
[19:10:15] <joal>	 yes ottomata - Please double check https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Deploy/Refinery-source for formats and procedure :)
[19:10:49] <ottomata>	 yes it just doesn't mention the field
[19:10:50] <ottomata>	 will fix
[19:11:13] <joal>	 Thanks !
[19:11:32] <wmf-insecte>	 Starting build #17 for job analytics-refinery-update-jars-docker
[19:11:51] <wikibugs>	 (03PS1) 10Maven-release-user: Add refinery-source jars for v0.0.127 to artifacts [analytics/refinery] - 10https://gerrit.wikimedia.org/r/605994
[19:11:51] <wmf-insecte>	 Project analytics-refinery-update-jars-docker build #17: 09SUCCESS in 18 sec: https://integration.wikimedia.org/ci/job/analytics-refinery-update-jars-docker/17/
[19:17:01] <wikibugs>	 (03CR) 10Ottomata: [V: 03+2 C: 03+2] Add refinery-source jars for v0.0.127 to artifacts [analytics/refinery] - 10https://gerrit.wikimedia.org/r/605994 (owner: 10Maven-release-user)
[19:17:28] <ottomata>	 !log deploying refinery source 0.0.127 for eventlogging -> eventgate migration - T249261
[19:17:30] <stashbot>	 Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[19:17:30] <stashbot>	 T249261:  Vertical: Migrate SearchSatisfaction EventLogging event stream to Event Platform - https://phabricator.wikimedia.org/T249261
[19:41:43] <ottomata>	 !log bumping Refine refinery jar version to 0.0.127  - T238230
[19:41:45] <stashbot>	 Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[19:41:45] <stashbot>	 T238230: Decommission EventLogging backend components by migrating to MEP - https://phabricator.wikimedia.org/T238230
[19:50:47] <nuria>	 joal: question if I may
[19:51:09] <nuria>	 or milimetric 
[19:51:30] <milimetric>	 sure nuria
[19:51:39] <nuria>	 milimetric: for files on commons
[19:52:00] <nuria>	 milimetric: do all of them have a page such querying the page table on mediawiki
[19:52:11] <nuria>	 milimetric: will tell you the number of distinct files ?
[19:52:44] <milimetric>	 yeah, if you look at the File: namespace you’d get all the stuff addressed as such
[19:52:54] <milimetric>	 in the page table, yes
[19:53:26] <milimetric>	 nuria: also I guess the archive table for deleted files
[19:53:33] <nuria>	 milimetric: so all files auploaded to commons will have a file:BLAH
[19:54:01] <milimetric>	 I think that might have changed with history, but that’s how it is now
[19:54:16] <milimetric>	 I think there was some Image: namespace
[19:54:26] <nuria>	 milimetric: i see, do you know what namespace is the File: one? 
[19:54:32] <milimetric>	 but that those other ways were deprecated and migrated?
[19:54:45] <nuria>	 milimetric: if not , no worries, i will check it out 
[19:54:50] <milimetric>	 no but it’s easy to find it in project_namespace_map
[19:54:56] <nuria>	 milimetric: k, will do
[19:55:27] <milimetric>	 would be interesting to select count * grouped by namespace from the page table
[19:58:47] <ottomata>	 !log evolving event.SearchSatisfaction Hive table using  /analytics/legacy/searchsatisfaction/latest schema
[19:58:48] <stashbot>	 Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[20:00:24] <wikibugs>	 10Analytics, 10Analytics-EventLogging, 10Analytics-Kanban, 10Event-Platform, and 2 others: Vertical: Migrate SearchSatisfaction EventLogging event stream to Event Platform - https://phabricator.wikimedia.org/T249261 (10Ottomata) Evolved event.searchsatisfaction:  ` 20/06/16 19:59:36 INFO DataFrameToHive: R...
[20:06:46] <wikibugs>	 10Analytics, 10Analytics-EventLogging: NewcomerTask EventLogging schema has invalid array items type specification - https://phabricator.wikimedia.org/T255597 (10Tgr)
[20:07:16] <wikibugs>	 10Analytics, 10Analytics-EventLogging, 10NewcomerTasks 1.2, 10Product-Analytics: NewcomerTask EventLogging schema has invalid array items type specification - https://phabricator.wikimedia.org/T255597 (10Tgr)
[20:11:15] <wikibugs>	 10Analytics, 10Analytics-EventLogging, 10NewcomerTasks 1.2, 10Product-Analytics: NewcomerTask EventLogging schema has invalid array items type specification - https://phabricator.wikimedia.org/T255597 (10Tgr) This is actually [[https://tools.ietf.org/html/draft-zyp-json-schema-03#section-5.5|valid syntax]]...
[20:34:17] <wikibugs>	 10Analytics, 10Analytics-EventLogging, 10NewcomerTasks 1.2, 10Product-Analytics, and 2 others: NewcomerTask EventLogging schema has invalid array items type specification - https://phabricator.wikimedia.org/T255597 (10Tgr)