[00:03:52] <icinga-wm>	 PROBLEM - Check unit status of drop_event on an-launcher1002 is CRITICAL: CRITICAL: Status of the systemd unit drop_event https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[02:20:05] <ottomata>	 !log manually running drop_event with --verbose flag
[02:20:07] <stashbot>	 Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[02:29:10] <wikibugs>	 Analytics, Analytics-EventLogging, Analytics-Kanban, Wikimedia-Developer-Portal, Documentation: Clean up EventLogging Schema: pages on meta - https://phabricator.wikimedia.org/T282584 (Ottomata) > replace their contents with {}  This will be ok once fully migrated, but the metawiki based pipe...
[06:03:11] <wikibugs>	 Analytics-Clusters, Analytics-Kanban, Patch-For-Review: Upgrade the Hadoop masters to Debian Buster - https://phabricator.wikimedia.org/T278423 (razzi) Thanks for the reviews @elukey and @JAllemandou!  Based on your comments, my plan is to reimage an-master1002 on Tuesday May 25 before standup (14:30...
[06:30:12] <elukey>	 good morning folks, I'll be afk for ~2h, ttl!
[06:45:01] <joal>	 Good morning
[06:50:51] <joal>	 !log run manual unique-devices cassandra job for one day with debug logging
[06:50:53] <stashbot>	 Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[07:23:16] <wikibugs>	 Analytics-Radar, Dumps-Generation: Temp files left around in wikistats_1/ ? - https://phabricator.wikimedia.org/T280311 (ArielGlenn) Open→Resolved a:ArielGlenn Moar data!  OK, closing :-)
[07:28:10] <wikibugs>	 Analytics, Analytics-Kanban, WMDE-TechWish: Deployment access request for some analytics repos - https://phabricator.wikimedia.org/T274880 (awight) >>! In T274880#7094199, @Milimetric wrote: > Ok, I think this is all you need, let me know if you need "Owner": https://gerrit.wikimedia.org/r/admin/repo...
[08:28:59] <joal>	 hnowlan: I have debug logs (verbooooooooose)
[08:29:26] <joal>	 They show: DEBUG [client-[/10.64.32.128, aqs1014-b.eqiad.wmnet/10.64.48.67, aqs1013-b.eqiad.wmnet/10.64.32.147]] com.datastax.driver.core.ControlConnection: [Control connection] error on aqs1012-a.eqiad.wmnet/10.64.32.128:9042 connection, no more host to try
[08:31:50] <wikibugs>	 Analytics-Kanban, Patch-For-Review: Update refinery-cassandra dependencies to have support for Cassandra 3 - https://phabricator.wikimedia.org/T280649 (JAllemandou) @hnowlan : Logs at `DEBUG` level (prepare our eyes, those are verbose!)  From `an-launcher1002`: ` sudo -u analytics kerberos-run-command an...
[08:32:11] <joal>	 hnowlan: I added the log-viewing command to the task --^
[08:36:36] <elukey>	 there is also com.datastax.driver.core.SystemProperties: com.datastax.driver.USE_NATIVE_CLOCK is undefined, using default value true
[08:36:39] <elukey>	 that is interesting
[08:41:48] <elukey>	 can we try with false?
[08:41:55] <elukey>	 joal: --^
[08:42:15] <elukey>	 (if we haven't tried yet)
[08:42:47] <elukey>	 the other interesting thing is
[08:42:48] <elukey>	 2021-05-18 07:18:47,108 DEBUG [cluster6-nio-worker-0] com.datastax.driver.core.Connection: Connection[aqs1012-a.eqiad.wmnet/10.64.32.128:9042-1, inFlight=0, closed=false] Connection established, initializing transport
[08:43:00] <elukey>	 so sometimes it happens
[08:44:39] <joal>	 elukey: trying to understand how I can set: com.datastax.driver.USE_NATIVE_CLOCK=false :)
[08:46:12] <elukey>	 it may be nothing but worth trying
[08:46:45] <joal>	 elukey: for sure!
[08:54:45] <hnowlan>	 joal: nice, thank you!
[09:37:50] <joal>	 elukey, hnowlan - I managed to pass com.datastax.driver.USE_NATIVE_CLOCK=false - the log says: 
[09:37:53] <joal>	 com.datastax.driver.core.SystemProperties: com.datastax.driver.USE_NATIVE_CLOCK is defined, using value false
[09:37:56] <joal>	 com.datastax.driver.core.ClockFactory: Using java.lang.System clock to generate timestamps.
[09:38:32] <joal>	 But, it still fails with the same error :(
[09:42:41] <elukey>	 good, we removed one thing :)
[09:47:22] <hnowlan>	 "com.datastax.driver.core.exceptions.DriverException: Connection thread interrupted" is closer to a smoking gun I guess 
[09:47:27] <hnowlan>	 but not very useful ;/
[09:47:43] <joal>	 ok - from the logs: we're stuck in a loop of the driver not managing to connect to 10.64.32.128 (com.datastax.driver.core.exceptions.DriverException: Connection thread interrupted) - And guess what? https://github.com/apache/cassandra/blob/cassandra-3.11.10/src/java/org/apache/cassandra/hadoop/cql3/CqlRecordWriter.java#L311
[09:48:09] <hnowlan>	 ohoo
[09:49:06] <elukey>	 joal: I was about to say, from the logs it seems as if we ran out of max files due to sockets opened in a loop :D
[09:49:15] <joal>	 I assume that behind the scenes the driver creates threads, and we leak them until failure
[09:49:31] <joal>	 Ah - could be sockets
[09:49:55] <hnowlan>	 I think reimaging aqs1012 is a good choice at this point
[09:50:04] <elukey>	 +1 it is a quick one to test
[09:50:05] <joal>	 I have no clue though as to why we get a "Connection thread interrupted" every time on that job and not on others :S
[09:50:19] <joal>	 ack - waiting for your signal to test anew :)
[09:50:29] <joal>	 Thanks a lot for the help elukey and hnowlan :)
[09:50:51] <elukey>	 is there a way to tune how aggressively the driver retries to connect?
[09:51:14] <elukey>	 I suspect that there may be too many sockets stuck in TIME_WAIT or similar on the client side
[09:56:43] <elukey>	 joal: https://github.com/apache/cassandra/blob/181a4969290f1c756089b2993a638fe403bc1314/src/java/org/apache/cassandra/hadoop/cql3/CqlRecordWriter.java#L362
[09:57:07] <elukey>	 so the driver gets
[09:57:08] <elukey>	 com.datastax.driver.core.exceptions.DriverException: Connection thread interrupted
[09:57:12] <hnowlan>	 ohhh heh
[09:57:15] <elukey>	 and it just retries
[09:58:37] <wikibugs>	 Analytics, Cassandra: Cassandra3 migration for Analytics AQS - https://phabricator.wikimedia.org/T249755 (ops-monitoring-bot) Script wmf-auto-reimage was launched by hnowlan on cumin1001.eqiad.wmnet for hosts: ` ['aqs1012.eqiad.wmnet'] ` The log can be found in `/var/log/wmf-auto-reimage/202105180958_hno...
[09:59:18] <hnowlan>	 So it just loops back and forth: the connection keeps failing, but the host state is never marked as failed 
[10:00:58] <elukey>	 it is weird, and it also seems not to try other nodes
[10:01:55] <hnowlan>	 I assume it's already succeeded with other hosts at a certain point so that iterator pool would be empty if it tried it 
[10:02:23] <elukey>	 could be a good theory sigh
[10:03:30] <wikibugs>	 Analytics, Analytics-Kanban: Article missing from the Clickstream dataset - https://phabricator.wikimedia.org/T282178 (JAllemandou) @diego:  page-titles in clickstream use `_` as separator,  not space!  ` ## check with all redirects: for r in redirects:     title = r['title'].replace(' ', '_')     count...
[10:08:13] <hnowlan>	 knowing what we know now (or maybe knowing how little more we know now?) I feel like nuking aqs1012 a long time ago was probably a good move but I guess that's hindsight
[10:09:56] <elukey>	 it may also be that reimaging will not fix it, we are really flying almost blind, but it is indeed a good quick test to do to remove any doubt
[10:10:03] <elukey>	 at some point we'll get the right one
[10:10:05] <elukey>	 :D
[10:10:51] <hnowlan>	 we'll rebuild and then it'll turn out that the bug has migrated to another host 
[10:12:58] <elukey>	 of course
[10:13:12] <elukey>	 at that point we'll be entitled to throw the laptop outside the window
[10:13:28] <hnowlan>	 :D
[10:15:20] <joal>	 cassandra being a ring, there is no reason for the bug not to move from one host to another!
[10:30:46] <wikibugs>	 Analytics, Cassandra: Cassandra3 migration for Analytics AQS - https://phabricator.wikimedia.org/T249755 (ops-monitoring-bot) Completed auto-reimage of hosts: ` ['aqs1012.eqiad.wmnet'] `  and were **ALL** successful.
[10:47:41] * elukey lunch
[11:37:41] <hnowlan>	 cluster repairs needed for the new node rejoining, might be a little bit 
[11:42:17] <wikibugs>	 (CR) Addshore: [C: +2] Add script to send ratio to max auto_increment value of tables (1 comment) [analytics/wmde/scripts] - https://gerrit.wikimedia.org/r/691242 (https://phabricator.wikimedia.org/T274419) (owner: Ladsgroup)
[11:43:34] <wikibugs>	 (Merged) jenkins-bot: Add script to send ratio to max auto_increment value of tables [analytics/wmde/scripts] - https://gerrit.wikimedia.org/r/691242 (https://phabricator.wikimedia.org/T274419) (owner: Ladsgroup)
[11:43:46] <wikibugs>	 (CR) Ladsgroup: Add script to send ratio to max auto_increment value of tables (1 comment) [analytics/wmde/scripts] - https://gerrit.wikimedia.org/r/691242 (https://phabricator.wikimedia.org/T274419) (owner: Ladsgroup)
[11:44:04] <wikibugs>	 (PS1) Ladsgroup: Add script to send ratio to max auto_increment value of tables [analytics/wmde/scripts] - https://gerrit.wikimedia.org/r/692561 (https://phabricator.wikimedia.org/T274419)
[11:44:12] <wikibugs>	 Analytics, Analytics-Kanban: Article missing from the Clickstream dataset - https://phabricator.wikimedia.org/T282178 (diego) Thanks @JAllemandou , I'm curious about the redirect and how views are assigned.  From [[ https://en.wikipedia.org/w/index.php?title=Coronavirus_disease_2019&action=history | here...
[11:44:21] <wikibugs>	 (CR) Ladsgroup: [C: +2] Add script to send ratio to max auto_increment value of tables [analytics/wmde/scripts] - https://gerrit.wikimedia.org/r/692561 (https://phabricator.wikimedia.org/T274419) (owner: Ladsgroup)
[11:45:23] <wikibugs>	 (Merged) jenkins-bot: Add script to send ratio to max auto_increment value of tables [analytics/wmde/scripts] - https://gerrit.wikimedia.org/r/692561 (https://phabricator.wikimedia.org/T274419) (owner: Ladsgroup)
[12:16:57] <hnowlan>	 joal: readded aqs1012, truncated all tables and did a full repair. we're in as good a state as we can be :) 
[12:17:03] <hnowlan>	 bbiab 
[12:18:27] <icinga-wm>	 PROBLEM - aqs endpoints health on aqs1012 is CRITICAL: /analytics.wikimedia.org/v1/pageviews/per-article/{project}/{access}/{agent}/{article}/{granularity}/{start}/{end} (Get per article page views) is CRITICAL: Test Get per article page views returned the unexpected status 500 (expecting: 200): /analytics.wikimedia.org/v1/pageviews/aggregate/{project}/{access}/{agent}/{granularity}/{start}/{end} (Get aggregate page views) is CRIT
[12:18:27] <icinga-wm>	 ggregate page views returned the unexpected status 500 (expecting: 200): /analytics.wikimedia.org/v1/pageviews/top/{project}/{access}/{year}/{month}/{day} (Get top page views) is CRITICAL: Test Get top page views returned the unexpected status 500 (expecting: 200): /analytics.wikimedia.org/v1/pageviews/top-by-country/{project}/{access}/{year}/{month} (Get top countries by page views) is CRITICAL: Test Get top countries by page vie
[12:18:27] <icinga-wm>	 unexpected status 500 (expecting: 200): /analytics.wikimedia.org/v1/legacy/pagecounts/aggregate/{project}/{access-site}/{granularity}/{start}/{end} (Get pagecounts) is CRITICAL: https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs
[12:21:06] <elukey>	 hnowlan: lol --^ :D
[12:35:41] <wikibugs>	 Analytics-Radar, ChangeProp, Community-Tech, Event-Platform, and 4 others: RFC: Provide the ability to have time-delayed or time-offset jobs in the job queue - https://phabricator.wikimedia.org/T218812 (Addshore)
[12:38:42] <joal>	 of course :)
[12:40:32] <joal>	 !log Add monitoring data in cassandra-3
[12:40:36] <stashbot>	 Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[12:40:44] <joal>	 ok, relaunching test job
[12:41:40] <hnowlan>	 lmao ;[
[12:42:12] <hnowlan>	 note to self - we need to change the replication factor on system_auth
[12:42:31] <icinga-wm>	 PROBLEM - AQS root url on aqs1012 is CRITICAL: connect to address 10.64.32.16 and port 7232: Connection refused https://wikitech.wikimedia.org/wiki/Analytics/Systems/AQS%23Monitoring
[12:42:49] <joal>	 aouch :(
[12:43:04] <joal>	 This one I don't know how to--^ :(
[12:43:13] <hnowlan>	 looking
[12:44:22] <elukey>	 I restarted aqs a little bit ago but not sure if it is related to the last alarm :)
[12:47:56] <wikibugs>	 Analytics, Analytics-Kanban: Article missing from the Clickstream dataset - https://phabricator.wikimedia.org/T282178 (JAllemandou) The clickstream algorithm reduces one step of redirects, meaning that if page A redirects to page B, views for page A are counted for page B. Multiple steps redirects are no...
[12:50:57] <joal>	 hnowlan: job rerun - same exact problem with same exact host :(
[12:51:15] <icinga-wm>	 RECOVERY - aqs endpoints health on aqs1012 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs
[12:51:45] <icinga-wm>	 RECOVERY - AQS root url on aqs1012 is OK: HTTP OK: HTTP/1.1 200 - 295 bytes in 0.008 second response time https://wikitech.wikimedia.org/wiki/Analytics/Systems/AQS%23Monitoring
[12:53:35] <hnowlan>	 joal: :( 
[12:53:52] <hnowlan>	 agh, we should have retried running when aqs1012 was fully absent from the ring 
[12:54:06] <joal>	 true, we could have done that
[13:02:19] <hnowlan>	 will try digging some more post-meeting 
[13:03:25] <joal>	 hnowlan: the spark-connector seems to be a solution - it's a shame we haven't managed to understand what happens, but we can also say we've spent enough time and invest in the other
[13:03:29] <joal>	 let me know what you think
[13:04:34] <hnowlan>	 Do we want to try adjusting the consistency on the current job? Not sure if that's worth the effort
[13:04:41] <hnowlan>	 Has the spark-connector tried writing to the new cluster? 
[13:05:19] <joal>	 I think milimetric did exactly that
[13:13:02] <elukey>	 joal: where is the source code that we use for the cassandra driver?
[13:13:06] <elukey>	 I am curious about a thing
[13:14:43] <joal>	 elukey: code is in refinery-cassandra (cassandra-reducer), and we use CqlOutputFormat from cassandra-core
[13:14:47] <joal>	 https://github.com/apache/cassandra/blob/bf96367f4d55692017e144980cf17963e31df127/src/java/org/apache/cassandra/hadoop/cql3/CqlOutputFormat.java
[13:16:29] <joal>	 elukey: want to talk about that?
[13:16:48] <elukey>	 yes sorry I keep finding this entry in the logs
[13:17:01] <elukey>	 2021-05-18 07:18:53,250 DEBUG [client-[/10.64.32.128, aqs1014-b.eqiad.wmnet/10.64.48.67, aqs1013-b.eqiad.wmnet/10.64.32.147]] com.datastax.driver.core.Cluster: Starting new cluster with contact points [aqs1012-a.eqiad.wmnet/10.64.32.128:9042]
[13:17:19] <elukey>	 that is weird, I'd expect more than aqs1012
[13:17:40] <elukey>	 and I saw that CqlSession.builder() offers the possibility to add contact points
[13:17:51] <elukey>	 so I was wondering how the driver was initialized
[13:18:20] <joal>	 elukey: this is done completely out of our scope, in the CqlRecordWriter
[13:19:16] <elukey>	 joal: what do we pass as endpoint to the CqlRecordWriter?
[13:19:32] <joal>	 aqs1010-a
[13:20:41] <elukey>	 so 1012 is picked up because of how the ring is laid out? 
[13:20:46] <elukey>	 (asking to brain bounce)
[13:21:54] <joal>	 I think the job first gets hosts, and then has issues with 1012, leading to having a lot of lines with that host
[13:22:06] <joal>	 Quick meet to share screen?
[13:22:29] <elukey>	 sure gimme 2 mins
[13:30:42] <elukey>	 joal: ready bc?
[13:30:48] <joal>	 yessir
[13:39:48] <wikibugs>	 Analytics-Clusters, Analytics-Kanban, Patch-For-Review: Upgrade the Hadoop masters to Debian Buster - https://phabricator.wikimedia.org/T278423 (Ottomata) > PS: Wow this comment thread is getting huge! Put on wiki! Or etherpad! Or edit the task description? :)
[13:47:25] <wikibugs>	 (CR) Milimetric: [V: +2 C: +2] Add public_cloud info to webrequest in druid [analytics/refinery] - https://gerrit.wikimedia.org/r/692310 (https://phabricator.wikimedia.org/T279380) (owner: Joal)
[13:53:21] <mforns>	 holaaa
[14:03:49] <hnowlan>	 back from meeting - joal, elukey: any joys/discoveries/pains? 
[14:10:26] <wikibugs>	 Analytics-Clusters, Analytics-Kanban, Patch-For-Review: Upgrade the Hadoop masters to Debian Buster - https://phabricator.wikimedia.org/T278423 (elukey) One nit, otherwise LGTM: I would stop the timers first, let the cluster drain and finally apply the Yarn patch + refresh queues, since IIRC that que...
[14:11:57] <elukey>	 hnowlan: not much, I was checking the code that raises the Interrupted Exception in https://github.com/datastax/java-driver/blob/3.x/driver-core/src/main/java/com/datastax/driver/core/ControlConnection.java#L263 to get some clue, but nothing so far
[14:12:27] <elukey>	 at this point it is probably not aqs1012, but only the fact that the driver picks it up since it holds some part of the ring?
[14:12:31] <elukey>	 super weird
[14:12:32] <ottomata>	 mforns:  o/
[14:12:35] <ottomata>	 VPV all wikis?
[14:12:48] <mforns>	 ottomata: suuuuureee
[14:12:54] <mforns>	 patch on the way
[14:12:59] <ottomata>	 k!
[14:13:30] <ottomata>	 oh, mforns  train is happening now.
[14:13:49] <ottomata>	 actually could be done lemme ask
[14:13:53] <mforns>	 ok
[14:17:03] <mforns>	 https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/692611
[14:32:14] <wikibugs>	 Analytics, Analytics-Kanban, WMDE-TechWish: Deployment access request for some analytics repos - https://phabricator.wikimedia.org/T274880 (mforns) Agree that logs should be accessible. Will bring this to our standup meeting today.
[14:37:27] <mforns>	 cool ottomata, monitoring
[14:42:37] <ottomata>	 watching https://grafana.wikimedia.org/goto/AFJWsiqMk
[14:57:27] <mforns>	 ottomata: are EP events lighter than old ones? It seems while #messages is stable, #bytes in decreased with the deployment https://grafana.wikimedia.org/goto/xjVCsm3Gk
[15:05:13] <mforns>	 by looking at the events I'd say no, rather the opposite
[15:05:19] <wikibugs>	 Analytics, Analytics-Kanban: [Newpyter] Conda stacked environment overwrites TAR environment variable - https://phabricator.wikimedia.org/T282491 (Ottomata)
[15:05:22] <wikibugs>	 Analytics, Analytics-Kanban: [Newpyter] Can't install 'haven' package with conda R but can with system R - https://phabricator.wikimedia.org/T282262 (Ottomata)
[15:07:53] <wikibugs>	 Analytics, Analytics-Kanban: [Newpyter] Can't install 'haven' package with conda R but can with system R - https://phabricator.wikimedia.org/T282262 (Ottomata) Ok!  thanks for the report.  I wrote up some reasons this might happen and the proper solution at https://wikitech.wikimedia.org/wiki/Analytics/S...
[15:10:26] <mforns>	 ottomata: I also see that old events are rendered with spaces by kafkacat, and new ones aren't. maybe that has sth to do?
[15:11:06] <ottomata>	 hmm, mforns  i betcha the batches are compressed better
[15:11:25] <mforns>	 ah, ok
[15:18:06] <wikibugs>	 Analytics-Clusters, Analytics-Kanban, SRE: Switch kafka/Hadoop away from java::security - https://phabricator.wikimedia.org/T282454 (Ottomata) Can we add defaults for the profile::java parameters?  I see some duplicate values copy and pasted in quite a few role yamls already.
[15:21:45] <wikibugs>	 Analytics-Clusters, Analytics-Kanban, SRE: Switch kafka/Hadoop away from java::security - https://phabricator.wikimedia.org/T282454 (Ottomata) Oh, you put the defaults in the common/profile/java.yaml class parameter hiera? Huh.  I had thought that wasn't allowed: https://wikitech.wikimedia.org/wiki/P...
[15:22:31] <wikibugs>	 Analytics-Clusters, Analytics-Kanban, SRE: Switch kafka/Hadoop away from java::security - https://phabricator.wikimedia.org/T282454 (MoritzMuehlenhoff) >>! In T282454#7095851, @Ottomata wrote: > Can we add defaults for the profile::java parameters?  I see some duplicate values copy and pasted in quit...
[15:32:05] <mforns>	 ottomata: all seems good with virtualpageviews so far, no? How long do you think old-formatted events can still be produced?
[15:32:25] <ottomata>	 mforns:  at least a day would be good to wait
[15:32:28] <ottomata>	 for good measure
[15:32:35] <ottomata>	 but you can prep all the other patches i think
[15:32:47] <ottomata>	 esp the one that changes the Schema URL in the extension.json
[15:33:01] <mforns>	 ottomata: because I just checked and I still saw old-formatted events for cawiki, which we migrated yesterday...
[15:33:02] <ottomata>	 since that one takes a week or two to go out
[15:33:21] <ottomata>	 yeah, old clients can take a while, and sometimes they never stop because of some copy/pasted code running somewhere
[15:33:22] <mforns>	 k, will do
[15:33:41] <ottomata>	 for this one to be sure we could wait until monday before proceeding?
[15:35:57] <mforns>	 ottomata: I checked in turnilo, and yesterday's virtualpageview data looks good both in proportions and breakdowns... I think we're good! Let's wait until tomorrow at least, but maybe no need to wait until monday..
[15:36:05] <ottomata>	 ok great!
[15:36:16] <mforns>	 k
[15:36:41] <wikibugs>	 Analytics-Clusters: Missing hourly partition for event.mediawiki_revision_recommandation_create - https://phabricator.wikimedia.org/T283084 (EBernhardson)
[15:36:58] <wikibugs>	 Analytics-Clusters, Discovery-Search (Current work): Missing hourly partition for event.mediawiki_revision_recommandation_create - https://phabricator.wikimedia.org/T283084 (EBernhardson)
[16:53:10] <wikibugs>	 Analytics, Analytics-Kanban, Event-Platform, MW-1.37-notes (1.37.0-wmf.5; 2021-05-11), Patch-For-Review: WikidataCompletionSearchClicks Event Platform Migration - https://phabricator.wikimedia.org/T282140 (Ottomata) Open→Resolved
[16:53:14] <wikibugs>	 Analytics, Analytics-EventLogging, Analytics-Kanban, Better Use Of Data, and 5 others: Migrate legacy metawiki schemas to Event Platform - https://phabricator.wikimedia.org/T259163 (Ottomata)
[16:53:18] <wikibugs>	 Analytics, Analytics-Kanban: [Newpyter] Can't install 'haven' package with conda R but can with system R - https://phabricator.wikimedia.org/T282262 (Ottomata) Open→Resolved
[16:53:20] <wikibugs>	 Analytics, Analytics-Kanban, Patch-For-Review: Newpyter - SWAP Juypter Rewrite - https://phabricator.wikimedia.org/T224658 (Ottomata)
[16:53:23] <wikibugs>	 Analytics, Analytics-Kanban: Stop Refining mediawiki_job events in Hive - https://phabricator.wikimedia.org/T281605 (Ottomata) Open→Resolved
[16:53:34] <wikibugs>	 Analytics, Analytics-Kanban, Event-Platform, Patch-For-Review: Sanitize and ingest all event tables into the event_sanitized database - https://phabricator.wikimedia.org/T273789 (Ottomata) Open→Resolved
[16:53:40] <wikibugs>	 Analytics, Analytics-Kanban: Refine + EventLoggingSchemaLoader should use api.svc instead of meta.wikimedia.org directly. - https://phabricator.wikimedia.org/T247510 (Ottomata) Open→Resolved
[16:53:52] <wikibugs>	 Analytics, Analytics-Kanban, Event-Platform, Product-Data-Infrastructure: MEP: Schema fragments shouldn't require fields - https://phabricator.wikimedia.org/T275674 (Ottomata) Open→Resolved
[16:54:02] <wikibugs>	 Analytics, Analytics-Kanban, Patch-For-Review: Newpyter - SWAP Juypter Rewrite - https://phabricator.wikimedia.org/T224658 (Ottomata) Open→Resolved
[16:54:15] <wikibugs>	 Analytics, Analytics-Kanban, Patch-For-Review: Decomission SWAP - https://phabricator.wikimedia.org/T262847 (Ottomata) Open→Resolved
[16:54:17] <wikibugs>	 Analytics, Analytics-Kanban, Patch-For-Review: Newpyter - SWAP Juypter Rewrite - https://phabricator.wikimedia.org/T224658 (Ottomata)
[16:54:22] <wikibugs>	 Analytics, Analytics-Kanban, Event-Platform: WikimediaEventUtilities and produce_canary_events job should use api-ro.discovery.wmnet instead of meta.wikimedia.,org to get stream config - https://phabricator.wikimedia.org/T274951 (Ottomata) Open→Resolved
[16:58:56] <wikibugs>	 Analytics, Analytics-Kanban, Packaging: Create a debian package for Apache Airflow - https://phabricator.wikimedia.org/T277012 (Ottomata) a:razzi→Ottomata
[17:47:33] <wikibugs>	 Analytics, Analytics-Kanban: [Newpyter] Can't install 'haven' package with conda R but can with system R - https://phabricator.wikimedia.org/T282262 (mpopov) Thanks so much for figuring out the solution and documenting it so well, @Ottomata!
[17:51:44] <razzi>	 I'm going to start on the deployment train. Looks like a refinery deployment and some oozie job restarts are all that's needed
[18:11:14] <ottomata>	 k go for it :)
[18:11:17] <razzi>	 joal: I see "Add Traffic's notion of "from public cloud" to Analytics webrequest data" (https://phabricator.wikimedia.org/T279380) in ready to deploy, is there anything to be done there?
[18:11:53] <ottomata>	 milimetric:  might know ^ is for druid i think
[18:12:25] <joal>	 Heya - Here I am
[18:13:07] <milimetric>	 razzi: I put it on the etherpad, and mentioned there that I don't think a restart of the job is needed?  But I may be wrong
[18:13:21] <joal>	 razzi: the procedure (not detailed) for deploy is documented in the train etherpad
[18:13:46] <milimetric>	 I can't remember if oozie makes a copy of the scripts when you submit or not
[18:13:52] <joal>	 razzi: the change impacts webrequest-druid jobs (hourly and daily), and both should be restarted, please :)
[18:14:09] <milimetric>	 joal: so even if the xml doesn't change?
[18:14:34] <joal>	 milimetric: by convention we run oozie jobs using: -Drefinery_directory=hdfs://analytics-hadoop$(hdfs dfs -ls -d /wmf/refinery/$(date +%Y)* | tail -n 1 | awk '{print $NF}')
[18:15:00] <joal>	 milimetric: This makes oozie use a hard-coded dated deployed refinery folder as base, not the current one
[18:15:03] <joal>	 on purpose
[18:15:36] <milimetric>	 Ah!  Didn't think about how that affects resolving scripts, thank you, that's what I was missing
[18:15:44] <milimetric>	 I always restarted anyway
[18:16:00] <joal>	 :)
[18:17:23] <joal>	 milimetric: it also depends on how scripts are defined in jobs: we usually use the relative hql file-name, not absolute, and oozie expects to find the file in the same folder as the workflow.xml file
[18:17:54] <joal>	 this being said: if you don't change the path for your workflow.xml, you still resolve relative to the dated folder - restart needed :)
[18:23:03] <joal>	 razzi: let me know if you need help with the restarts and all
[18:25:08] <razzi>	 Thanks joal, just waiting on the refinery deploy, restarts are next, and that'll be it for the train
[18:25:28] <joal>	 ack razzi - I'm available as needed
[18:38:30] <razzi>	 joal: I'm looking at the jobs restart checklist: https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Oozie/Administration#Job_Restart_Checklist
[18:38:30] <razzi>	 can you confirm the checklist is all set?
[18:39:15] <joal>	 razzi: all good for the webrequest jobs!
[18:44:20] <wikibugs>	 Analytics-Clusters, Discovery-Search (Current work): Missing hourly partition for event.mediawiki_revision_recommandation_create - https://phabricator.wikimedia.org/T283084 (JAllemandou) Heya @EBernhardson, not having canary events in the refined data is expected: https://github.com/wikimedia/analytics-r...
[18:46:17] <ottomata>	 !log removing extraneous python-kafka and python-confluent-kafka deb packages from analytics cluster - T275786
[18:46:20] <stashbot>	 Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[18:46:20] <stashbot>	 T275786: Remove all debian python-* packages installed for analytics clients, use conda instead - https://phabricator.wikimedia.org/T275786
[18:57:52] <razzi>	 !log deployed refinery via scap, then deployed to hdfs
[18:57:54] <stashbot>	 Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[19:06:13] <ottomata>	 milimetric: druid?
[19:29:30] <wikibugs>	 Analytics-Clusters, Discovery-Search (Current work): Missing hourly partition for event.mediawiki_revision_recommandation_create - https://phabricator.wikimedia.org/T283084 (EBernhardson)
[19:33:08] <wikibugs>	 Analytics-Clusters, Discovery-Search (Current work): Missing hourly partition for event.mediawiki_revision_recommandation_create - https://phabricator.wikimedia.org/T283084 (EBernhardson) Hmm, is this a new limitation? We first deployed this stream in february and no events came through for a few months,...
[20:02:56] <ottomata>	 razzi:  am in meet if you wanna sync!
[20:03:08] <razzi>	 ottomata: ok, give me a few minutes break and I will!
[20:06:29] <ottomata>	 k!
[20:12:13] <wikibugs>	 Analytics-Clusters, Discovery-Search (Current work): Missing hourly partition for event.mediawiki_revision_recommandation_create - https://phabricator.wikimedia.org/T283084 (Ottomata) Canary events are enabled for this stream.  They are used to make create the partitions, but are filtered out of the refi...
[20:22:40] <razzi>	 !log restart oozie virtualpageview hourly, virtualpageview druid daily, virtualpageview druid monthly
[20:22:42] <stashbot>	 Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[20:23:27] <ottomata>	 mforns:  we are missing some virtualpageview hourly SLAs
[20:23:29] <ottomata>	 looking...
[20:24:51] <mforns>	 hm
[20:25:48] <mforns>	 ottomata: where did you get that?
[20:27:06] <ottomata>	 oh oh sorry, this is probably razzi's oozie restarts
[20:27:15] <ottomata>	 mforns in analytics-alerts email 
[20:27:37] <mforns>	 ottomata: uou! just received them now
[20:27:56] <mforns>	 ottomata: oh, the deployment train, right? yea, makes sense
[20:28:08] <ottomata>	 yeah
[20:28:11] <ottomata>	 false alarm!
[20:28:17] <ottomata>	 just saw them and thought maybe related to our deploy today
[20:33:05] <razzi>	 ok ottomata want to sync for a minute?
[20:33:16] <ottomata>	 ya
[20:33:36] <razzi>	 bc!
[21:01:24] <wikibugs>	 Analytics, Analytics-Kanban, Packaging: Create a debian package for Apache Airflow - https://phabricator.wikimedia.org/T277012 (Ottomata) @Joe @akosiaris q:  I know using docker images in prod outside of k8s is not really done, but...could we?  I also know we don't allow users to run docker images fo...
[22:14:39] <wikibugs>	 Analytics, Analytics-Kanban: Generalize the current Airflow puppet/scap code to deploy a dedicated Analytics instance - https://phabricator.wikimedia.org/T272973 (Ottomata) Airflow 2 supports HA scheduler: https://airflow.apache.org/docs/apache-airflow/stable/scheduler.html#running-more-than-one-schedule...
[22:25:10] <wikibugs>	 Analytics, Analytics-Kanban, Packaging: Create a debian package for Apache Airflow - https://phabricator.wikimedia.org/T277012 (Ottomata) I ask because it'd be slick to use Deployment Pipeline and Blubber's [[ https://wikitech.wikimedia.org/wiki/Blubber/User_Guide#Variant_Config | python variant conf...
[22:27:57] <wikibugs>	 Analytics, Analytics-Kanban: Generalize the current Airflow puppet/scap code to deploy a dedicated Analytics instance - https://phabricator.wikimedia.org/T272973 (Ottomata) Oh, perhaps MariaDB will work for Airflow HA now that https://jira.mariadb.org/browse/MDEV-13115 is resolved.  Would need a pretty r...
[23:03:11] <wikibugs>	 Analytics-Clusters, Analytics-Kanban, Patch-For-Review: Upgrade the Hadoop masters to Debian Buster - https://phabricator.wikimedia.org/T278423 (razzi) It should be ok to merge the yarn patch first; from the hadoop docs:   > yarn.scheduler.capacity.<queue-path>.state > The state of the queue. Can be...
[23:07:09] <wikibugs>	 Analytics, Analytics-Kanban, Product-Analytics: Aggregate table not working after superset upgrade - https://phabricator.wikimedia.org/T280784 (razzi) I think we're all set here; let me know if anything else needs to be done.