[00:03:52] PROBLEM - Check unit status of drop_event on an-launcher1002 is CRITICAL: CRITICAL: Status of the systemd unit drop_event https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers [02:20:05] !log manually running drop_event with --verbose flag [02:20:07] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [02:29:10] Analytics, Analytics-EventLogging, Analytics-Kanban, Wikimedia-Developer-Portal, Documentation: Clean up EventLogging Schema: pages on meta - https://phabricator.wikimedia.org/T282584 (Ottomata) > replace their contents with {} This will be ok once fully migrated, but the metawiki based pipe... [06:03:11] Analytics-Clusters, Analytics-Kanban, Patch-For-Review: Upgrade the Hadoop masters to Debian Buster - https://phabricator.wikimedia.org/T278423 (razzi) Thanks for the reviews @elukey and @JAllemandou! Based on your comments, my plan is to reimage an-master1002 on Tuesday May 25 before standup (14:30... [06:30:12] good morning folks, I'll be afk for ~2h, ttl! [06:45:01] Good morning [06:50:51] !log run manual unique-devices cassandra job for one day with debug logging [06:50:53] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [07:23:16] Analytics-Radar, Dumps-Generation: Temp files left around in wikistats_1/ ? - https://phabricator.wikimedia.org/T280311 (ArielGlenn) Open→Resolved a:ArielGlenn Moar data! OK, closing :-) [07:28:10] Analytics, Analytics-Kanban, WMDE-TechWish: Deployment access request for some analytics repos - https://phabricator.wikimedia.org/T274880 (awight) >>! In T274880#7094199, @Milimetric wrote: > Ok, I think this is all you need, let me know if you need "Owner": https://gerrit.wikimedia.org/r/admin/repo... 
[08:28:59] hnowlan: I have debug logs (verbooooooooose) [08:29:26] They show: DEBUG [client-[/10.64.32.128, aqs1014-b.eqiad.wmnet/10.64.48.67, aqs1013-b.eqiad.wmnet/10.64.32.147]] com.datastax.driver.core.ControlConnection: [Control connection] error on aqs1012-a.e [08:29:30] qiad.wmnet/10.64.32.128:9042 connection, no more host to try [08:31:50] Analytics-Kanban, Patch-For-Review: Update refinery-cassandra dependencies to have support for Cassandra 3 - https://phabricator.wikimedia.org/T280649 (JAllemandou) @hnowlan : Logs at `DEBUG` level (prepare our eyes, those are verbose!) From `an-launcher1002`: ` sudo -u analytics kerberos-run-command an... [08:32:11] hnowlan: I added the log-viewing command to the task --^ [08:36:36] there is also com.datastax.driver.core.SystemProperties: com.datastax.driver.USE_NATIVE_CLOCK is und [08:36:39] efined, using default value true [08:36:39] that is interesting [08:41:48] can we try with false? [08:41:55] joal: --^ [08:42:15] (if we haven't tried yet) [08:42:47] the other interesting thing is [08:42:48] 2021-05-18 07:18:47,108 DEBUG [cluster6-nio-worker-0] com.datastax.driver.core.Connection: Connection[aqs1 [08:42:51] 012-a.eqiad.wmnet/10.64.32.128:9042-1, inFlight=0, closed=false] Connection established, initializing tran [08:42:54] sport [08:43:00] so sometimes it happens [08:44:39] elukey: trying to understand how I can set: com.datastax.driver.USE_NATIVE_CLOCK=false :) [08:46:12] it may be nothing but worth a try [08:46:45] elukey: for sure! [08:54:45] joal: nice, thank you! [09:37:50] elukey, hnowlan - I managed to pass com.datastax.driver.USE_NATIVE_CLOCK=false - the log says: [09:37:53] com.datastax.driver.core.SystemProperties: com.datastax.driver.USE_NATIVE_CLOCK is defined, using value false [09:37:56] com.datastax.driver.core.ClockFactory: Using java.lang.System clock to generate timestamps. 
[09:38:32] But, it still fails with the same error :( [09:42:41] good, we removed one thing :) [09:47:22] "com.datastax.driver.core.exceptions.DriverException: Connection thread interrupted" is closer to a smoking gun I guess [09:47:27] but not very useful ;/ [09:47:43] ok - from the logs: we're stuck in a loop of the driver not managing to connect to 10.64.32.128 (com.datastax.driver.core.exceptions.DriverException: Connection thread interrupted) - And guess what? https://github.com/apache/cassandra/blob/cassandra-3.11.10/src/java/org/apache/cassandra/hadoop/cql3/CqlRecordWriter.java#L311 [09:48:09] ohoo [09:49:06] joal: I was about to say, from the logs it seems as if we ran out of max files due to sockets opened in a loop :D [09:49:15] I assume that behind the scenes the driver creates threads, and we leak them until failure [09:49:31] Ah - could be sockets [09:49:55] I think reimaging aqs1012 is a good choice at this point [09:50:04] +1 it is a quick one to test [09:50:05] I have no clue though as to why we get a "Connection thread interrupted" every time on that job and not on others :S [09:50:19] ack - waiting for your signal to test anew :) [09:50:29] Thanks a lot for the help elukey and hnowlan :) [09:50:51] is there a way to tune how aggressively the driver retries to connect? 
[09:51:14] I suspect that there may be too many sockets stuck in TIME_WAIT or similar on the client side [09:56:43] joal: https://github.com/apache/cassandra/blob/181a4969290f1c756089b2993a638fe403bc1314/src/java/org/apache/cassandra/hadoop/cql3/CqlRecordWriter.java#L362 [09:57:07] so the driver gets [09:57:08] com.datastax.driver.core.exceptions.DriverException: Connection thread interrupted [09:57:12] ohhh heh [09:57:15] and it just retries [09:58:37] Analytics, Cassandra: Cassandra3 migration for Analytics AQS - https://phabricator.wikimedia.org/T249755 (ops-monitoring-bot) Script wmf-auto-reimage was launched by hnowlan on cumin1001.eqiad.wmnet for hosts: ` ['aqs1012.eqiad.wmnet'] ` The log can be found in `/var/log/wmf-auto-reimage/202105180958_hno... [09:59:18] So it just loops back and forth between failing but the host state not being failed [10:00:58] it is weird, and it also seems not to be trying other nodes [10:01:55] I assume it's already succeeded with other hosts at a certain point so that iterator pool would be empty if it tried it [10:02:23] could be a good theory sigh [10:03:30] Analytics, Analytics-Kanban: Article missing from the Clickstream dataset - https://phabricator.wikimedia.org/T282178 (JAllemandou) @diego: page-titles in clickstream use `_` as separator, not space! ` ## check with all redirects: for r in redirects: title = r['title'].replace(' ', '_') count... [10:08:13] knowing what we know now (or maybe knowing how little more we know now?) 
I feel like nuking aqs1012 a long time ago was probably a good move but I guess that's hindsight [10:09:56] it may also be that reimaging will not fix it, we are really flying almost blind, but it is indeed a good quick test to do to remove any doubt [10:10:03] at some point we'll get the right one [10:10:05] :D [10:10:51] we'll rebuild and then it'll turn out that the bug has migrated to another host [10:12:58] of course [10:13:12] at that point we'll be entitled to throw the laptop outside the window [10:13:28] :D [10:15:20] cassandra being a ring, there is no reason for the bug not to move from one host to another! [10:30:46] Analytics, Cassandra: Cassandra3 migration for Analytics AQS - https://phabricator.wikimedia.org/T249755 (ops-monitoring-bot) Completed auto-reimage of hosts: ` ['aqs1012.eqiad.wmnet'] ` and were **ALL** successful. [10:47:41] * elukey lunch [11:37:41] cluster repairs needed for the new node rejoining, might be a little bit [11:42:17] (CR) Addshore: [C: +2] Add script to send ratio to max auto_increment value of tables (1 comment) [analytics/wmde/scripts] - https://gerrit.wikimedia.org/r/691242 (https://phabricator.wikimedia.org/T274419) (owner: Ladsgroup) [11:43:34] (Merged) jenkins-bot: Add script to send ratio to max auto_increment value of tables [analytics/wmde/scripts] - https://gerrit.wikimedia.org/r/691242 (https://phabricator.wikimedia.org/T274419) (owner: Ladsgroup) [11:43:46] (CR) Ladsgroup: Add script to send ratio to max auto_increment value of tables (1 comment) [analytics/wmde/scripts] - https://gerrit.wikimedia.org/r/691242 (https://phabricator.wikimedia.org/T274419) (owner: Ladsgroup) [11:44:04] (PS1) Ladsgroup: Add script to send ratio to max auto_increment value of tables [analytics/wmde/scripts] - https://gerrit.wikimedia.org/r/692561 (https://phabricator.wikimedia.org/T274419) [11:44:12] Analytics, Analytics-Kanban: Article missing from the Clickstream dataset - 
https://phabricator.wikimedia.org/T282178 (diego) Thanks @JAllemandou , I'm curious about the redirect and how views are assigned. From [[ https://en.wikipedia.org/w/index.php?title=Coronavirus_disease_2019&action=history | here... [11:44:21] (CR) Ladsgroup: [C: +2] Add script to send ratio to max auto_increment value of tables [analytics/wmde/scripts] - https://gerrit.wikimedia.org/r/692561 (https://phabricator.wikimedia.org/T274419) (owner: Ladsgroup) [11:45:23] (Merged) jenkins-bot: Add script to send ratio to max auto_increment value of tables [analytics/wmde/scripts] - https://gerrit.wikimedia.org/r/692561 (https://phabricator.wikimedia.org/T274419) (owner: Ladsgroup) [12:16:57] joal: readded aqs1012, truncated all tables and did a full repair. we're in as good a state as we can be :) [12:17:03] bbiab [12:18:27] PROBLEM - aqs endpoints health on aqs1012 is CRITICAL: /analytics.wikimedia.org/v1/pageviews/per-article/{project}/{access}/{agent}/{article}/{granularity}/{start}/{end} (Get per article page views) is CRITICAL: Test Get per article page views returned the unexpected status 500 (expecting: 200): /analytics.wikimedia.org/v1/pageviews/aggregate/{project}/{access}/{agent}/{granularity}/{start}/{end} (Get aggregate page views) is CRIT [12:18:27] ggregate page views returned the unexpected status 500 (expecting: 200): /analytics.wikimedia.org/v1/pageviews/top/{project}/{access}/{year}/{month}/{day} (Get top page views) is CRITICAL: Test Get top page views returned the unexpected status 500 (expecting: 200): /analytics.wikimedia.org/v1/pageviews/top-by-country/{project}/{access}/{year}/{month} (Get top countries by page views) is CRITICAL: Test Get top countries by page vie [12:18:27] unexpected status 500 (expecting: 200): /analytics.wikimedia.org/v1/legacy/pagecounts/aggregate/{project}/{access-site}/{granularity}/{start}/{end} (Get pagecounts) is CRITICAL: https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs [12:21:06] hnowlan: 
lol --^ :D [12:35:41] Analytics-Radar, ChangeProp, Community-Tech, Event-Platform, and 4 others: RFC: Provide the ability to have time-delayed or time-offset jobs in the job queue - https://phabricator.wikimedia.org/T218812 (Addshore) [12:38:42] of course :) [12:40:32] !log Add monitoring data in cassandra-3 [12:40:36] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [12:40:44] ok, relaunching test job [12:41:40] lmao ;[ [12:42:12] note to self - we need to change the replication factor on system_auth [12:42:31] PROBLEM - AQS root url on aqs1012 is CRITICAL: connect to address 10.64.32.16 and port 7232: Connection refused https://wikitech.wikimedia.org/wiki/Analytics/Systems/AQS%23Monitoring [12:42:49] aouch :( [12:43:04] This one I don't know how to--^ :( [12:43:13] looking [12:44:22] I restarted aqs a little bit ago but not sure if it is related to the last alarm :) [12:47:56] Analytics, Analytics-Kanban: Article missing from the Clickstream dataset - https://phabricator.wikimedia.org/T282178 (JAllemandou) The clickstream algorithm reduces one step of redirects, meaning that if page A redirects to page B, views for page A are counted for page B. Multiple steps redirects are no... 
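Editor's note on the two clickstream comments from JAllemandou (T282178): titles use `_` rather than spaces, and a single redirect step is folded into its target. The Python sketch below illustrates only those two stated rules. The function names, the `redirects` mapping, and the sample data are hypothetical and are not taken from the actual clickstream job; how multi-step redirects are handled is cut off in the log and is not modelled here.

```python
from collections import Counter


def to_clickstream_title(title):
    """Normalize a page title to the underscore form used in clickstream."""
    return title.replace(" ", "_")


def fold_one_redirect_step(views, redirects):
    """Credit each page's views to its direct redirect target, if it has one.

    Only a single redirect step is followed: the target is looked up once
    and is never resolved again. `redirects` is assumed to already use
    underscore titles for both keys and values.
    """
    totals = Counter()
    for title, count in views.items():
        source = to_clickstream_title(title)
        target = redirects.get(source, source)
        totals[target] += count
    return dict(totals)


# Hypothetical data, for illustration only.
redirects = {"A": "B"}
views = {"A": 3, "B": 10, "Some page": 2}
result = fold_one_redirect_step(views, redirects)
```

With this sample data, the views for A are added to B, and a lookup for "Some page" only succeeds under the key `Some_page`, which is the kind of mismatch the task comment points out.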
[12:50:57] hnowlan: job rerun - same exact problem with same exact host :( [12:51:15] RECOVERY - aqs endpoints health on aqs1012 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs [12:51:45] RECOVERY - AQS root url on aqs1012 is OK: HTTP OK: HTTP/1.1 200 - 295 bytes in 0.008 second response time https://wikitech.wikimedia.org/wiki/Analytics/Systems/AQS%23Monitoring [12:53:35] joal: :( [12:53:52] agh, we should have retried running when aqs1012 was fully absent from the ring [12:54:06] true, we could have done that [13:02:19] will try digging some more post-meeting [13:03:25] hnowlan: the spark-connector seems to be a solution - it's a shame we haven't managed to understand what happens, but we can also say we've spent enough time and invest in the other [13:03:29] let me know what you think [13:04:34] Do we want to try adjusting the consistency on the current job? Not sure if that's worth the effort [13:04:41] Has the spark-connector tried writing to the new cluster? [13:05:19] I think milimetric did exactly that [13:13:02] joal: where is the source code that we use for the cassandra driver? [13:13:06] I am curious about a thing [13:14:43] elukey: code is in refinery-cassandra (cassandra-reducer), and we use CqlOutputFormat from cassandra-core [13:14:47] https://github.com/apache/cassandra/blob/bf96367f4d55692017e144980cf17963e31df127/src/java/org/apache/cassandra/hadoop/cql3/CqlOutputFormat.java [13:16:29] elukey: want to talk about that? 
[13:16:48] yes sorry I keep finding this entry in the lgos [13:16:52] *logs [13:17:01] 2021-05-18 07:18:53,250 DEBUG [client-[/10.64.32.128, aqs1014-b.eqiad.wmnet/10.64.48.67, aqs1013-b.eqiad.wmnet/10.64.32.147]] com.datastax.driver.core.Cluster: Starting new cluster with contact points [aqs1012-a.eqiad.wmnet/10.64.32.128:9042] [13:17:19] that is weird, I'd expect more than aqs1012 [13:17:40] and I saw that CqlSession.builder() offers the possibility to add contact points [13:17:51] so I was wondering how the driver was initialized [13:18:20] elukey: this is done completely out of our scope, in the CqlRecordWriter [13:19:16] joal: what do we pass as endpoint to the CqlRecordWriter? [13:19:32] aqs1010-a [13:20:41] so 1012 is picked up because of how the ring is laid out? [13:20:46] (asking to brain bounce) [13:21:54] I think the job first gets hosts, and then has issues with 1012, leading to having a lot of lines with that host [13:22:06] Quick meet to share screen? [13:22:29] sure gimme 2 mins [13:30:42] joal: ready bc? [13:30:48] yessir [13:39:48] Analytics-Clusters, Analytics-Kanban, Patch-For-Review: Upgrade the Hadoop masters to Debian Buster - https://phabricator.wikimedia.org/T278423 (Ottomata) > PS: Wow this comment thread is getting huge! Put on wiki! Or etherpad! Or edit the task description? :) [13:47:25] (CR) Milimetric: [V: +2 C: +2] Add public_cloud info to webrequest in druid [analytics/refinery] - https://gerrit.wikimedia.org/r/692310 (https://phabricator.wikimedia.org/T279380) (owner: Joal) [13:53:21] holaaa [14:03:49] back from meeting - joal, elukey: any joys/discoveries/pains? [14:10:26] Analytics-Clusters, Analytics-Kanban, Patch-For-Review: Upgrade the Hadoop masters to Debian Buster - https://phabricator.wikimedia.org/T278423 (elukey) One nit, otherwise LGTM: I would stop the timers first, let the cluster drain and finally apply the Yarn patch + refresh queues, since IIRC that que... 
[14:11:57] hnowlan: not much, I was checking the code that raises the Interrupted Exception in https://github.com/datastax/java-driver/blob/3.x/driver-core/src/main/java/com/datastax/driver/core/ControlConnection.java#L263 to get some clue, but nothing so far [14:12:27] at this point it is probably not aqs1012, but only the fact that the driver picks it up since it holds some part of the ring? [14:12:31] super weird [14:12:32] mforns: o/ [14:12:35] VPV all wikis? [14:12:48] ottomata: suuuuureee [14:12:54] patch on the way [14:12:59] k! [14:13:30] oh, mforns train is happening now. [14:13:49] actually could be done lemme ask [14:13:53] ok [14:17:03] https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/692611 [14:32:14] Analytics, Analytics-Kanban, WMDE-TechWish: Deployment access request for some analytics repos - https://phabricator.wikimedia.org/T274880 (mforns) Agree that logs should be accessible. Will bring this to our standup meeting today. [14:37:27] cool ottomata, monitoring [14:42:37] watching https://grafana.wikimedia.org/goto/AFJWsiqMk [14:57:27] ottomata: are EP events lighter than old ones? It seems while #messages is stable, #bytes in decreased with the deployment https://grafana.wikimedia.org/goto/xjVCsm3Gk [15:05:13] by looking at the events I'd say no, rather the opposite [15:05:19] Analytics, Analytics-Kanban: [Newpyter] Conda stacked environment overwrites TAR environment variable - https://phabricator.wikimedia.org/T282491 (Ottomata) [15:05:22] Analytics, Analytics-Kanban: [Newpyter] Can't install 'haven' package with conda R but can with system R - https://phabricator.wikimedia.org/T282262 (Ottomata) [15:07:53] Analytics, Analytics-Kanban: [Newpyter] Can't install 'haven' package with conda R but can with system R - https://phabricator.wikimedia.org/T282262 (Ottomata) Ok! thanks for the report. 
I wrote up some reasons this might happen and the proper solution at https://wikitech.wikimedia.org/wiki/Analytics/S... [15:10:26] ottomata: I also see that old events are rendered with spaces by kafkacat, and new ones aren't. maybe that has something to do with it? [15:11:06] hmm, mforns i betcha the batches are compressed better [15:11:25] ah, ok [15:18:06] Analytics-Clusters, Analytics-Kanban, SRE: Switch kafka/Hadoop away from java::security - https://phabricator.wikimedia.org/T282454 (Ottomata) Can we add defaults for the profile::java parameters? I see some duplicate values copy and pasted in quite a few role yamls already. [15:21:45] Analytics-Clusters, Analytics-Kanban, SRE: Switch kafka/Hadoop away from java::security - https://phabricator.wikimedia.org/T282454 (Ottomata) Oh, you put the defaults in the common/profile/java.yaml class parameter hiera? Huh. I had thought that wasn't allowed: https://wikitech.wikimedia.org/wiki/P... [15:22:31] Analytics-Clusters, Analytics-Kanban, SRE: Switch kafka/Hadoop away from java::security - https://phabricator.wikimedia.org/T282454 (MoritzMuehlenhoff) >>! In T282454#7095851, @Ottomata wrote: > Can we add defaults for the profile::java parameters? I see some duplicate values copy and pasted in quit... [15:32:05] ottomata: all seems good with virtualpageviews so far no? How long do you think old-formatted events can still be produced? [15:32:25] mforns: at least a day would be good to wait [15:32:28] for good measure [15:32:35] but you can prep all the other patches i think [15:32:47] esp the one that changes the Schema URL in the extension.json [15:33:01] ottomata: because I just checked and I still saw old-formatted events for cawiki, which we migrated yesterday... 
[15:33:02] since that one takes a week or two to go out [15:33:21] yeah, old clients can take a while, and sometimes they never stop because of some copy/pasted code running somewhere [15:33:22] k, will do [15:33:41] for this one to be sure we could wait until monday before proceeding? [15:35:57] ottomata: I checked in turnilo, and yesterday's virtualpageview data looks good both in proportions and breakdowns... I think we're good! Let's wait until tomorrow at least, but maybe no need to wait until monday.. [15:36:05] ok great! [15:36:16] k [15:36:41] Analytics-Clusters: Missing hourly partition for event.mediawiki_revision_recommandation_create - https://phabricator.wikimedia.org/T283084 (EBernhardson) [15:36:58] Analytics-Clusters, Discovery-Search (Current work): Missing hourly partition for event.mediawiki_revision_recommandation_create - https://phabricator.wikimedia.org/T283084 (EBernhardson) [16:53:10] Analytics, Analytics-Kanban, Event-Platform, MW-1.37-notes (1.37.0-wmf.5; 2021-05-11), Patch-For-Review: WikidataCompletionSearchClicks Event Platform Migration - https://phabricator.wikimedia.org/T282140 (Ottomata) Open→Resolved [16:53:14] Analytics, Analytics-EventLogging, Analytics-Kanban, Better Use Of Data, and 5 others: Migrate legacy metawiki schemas to Event Platform - https://phabricator.wikimedia.org/T259163 (Ottomata) [16:53:18] Analytics, Analytics-Kanban: [Newpyter] Can't install 'haven' package with conda R but can with system R - https://phabricator.wikimedia.org/T282262 (Ottomata) Open→Resolved [16:53:20] Analytics, Analytics-Kanban, Patch-For-Review: Newpyter - SWAP Juypter Rewrite - https://phabricator.wikimedia.org/T224658 (Ottomata) [16:53:23] Analytics, Analytics-Kanban: Stop Refining mediawiki_job events in Hive - https://phabricator.wikimedia.org/T281605 (Ottomata) Open→Resolved [16:53:34] Analytics, Analytics-Kanban, Event-Platform, Patch-For-Review: Sanitize and 
ingest all event tables into the event_sanitized database - https://phabricator.wikimedia.org/T273789 (Ottomata) Open→Resolved [16:53:40] Analytics, Analytics-Kanban: Refine + EventLoggingSchemaLoader should use api.svc instead of meta.wikimedia.org directly. - https://phabricator.wikimedia.org/T247510 (Ottomata) Open→Resolved [16:53:52] Analytics, Analytics-Kanban, Event-Platform, Product-Data-Infrastructure: MEP: Schema fragments shouldn't require fields - https://phabricator.wikimedia.org/T275674 (Ottomata) Open→Resolved [16:54:02] Analytics, Analytics-Kanban, Patch-For-Review: Newpyter - SWAP Juypter Rewrite - https://phabricator.wikimedia.org/T224658 (Ottomata) Open→Resolved [16:54:15] Analytics, Analytics-Kanban, Patch-For-Review: Decomission SWAP - https://phabricator.wikimedia.org/T262847 (Ottomata) Open→Resolved [16:54:17] Analytics, Analytics-Kanban, Patch-For-Review: Newpyter - SWAP Juypter Rewrite - https://phabricator.wikimedia.org/T224658 (Ottomata) [16:54:22] Analytics, Analytics-Kanban, Event-Platform: WikimediaEventUtilities and produce_canary_events job should use api-ro.discovery.wmnet instead of meta.wikimedia.,org to get stream config - https://phabricator.wikimedia.org/T274951 (Ottomata) Open→Resolved [16:58:56] Analytics, Analytics-Kanban, Packaging: Create a debian package for Apache Airflow - https://phabricator.wikimedia.org/T277012 (Ottomata) a:razzi→Ottomata [17:47:33] Analytics, Analytics-Kanban: [Newpyter] Can't install 'haven' package with conda R but can with system R - https://phabricator.wikimedia.org/T282262 (mpopov) Thanks so much for figuring out the solution and documenting it so well, @Ottomata! [17:51:44] I'm going to start on the deployment train. 
Looks like a refinery deployment is needed and some oozie job restarts is all [18:11:14] k go for it :) [18:11:17] joal: I see "Add Traffic's notion of "from public cloud" to Analytics webrequest data" (https://phabricator.wikimedia.org/T279380) in ready to deploy, is there anything to be done there? [18:11:53] milimetric: might know ^ is for druid i think [18:12:25] Heya - Here I am [18:13:07] razzi: I put it on the etherpad, and mentioned there that I don't think a restart of the job is needed? But I may be wrong [18:13:21] razzi: the procedure (not detailed) for deploy is documented in the train etherpad [18:13:46] I can't remember if oozie makes a copy of the scripts when you submit or not [18:13:52] razzi: the change impacts webrequest-druid jobs (hourly and daily), and both should be restarted, please :) [18:14:09] joal: so even if the xml doesn't change? [18:14:34] milimetric: by convention we run oozie jobs using: -Drefinery_directory=hdfs://analytics-hadoop$(hdfs dfs -ls -d /wmf/refinery/$(date +%Y)* | tail -n 1 | awk '{print $NF}') [18:15:00] milimetric: This makes oozie use a hard-coded dated deployed refinery folder as base, not the current one [18:15:03] on purpose [18:15:36] Ah! 
Didn't think about how that affects resolving scripts, thank you, that's what I was missing [18:15:44] I always restarted anyway [18:16:00] :) [18:17:23] milimetric: it also depends on how scripts are defined in jobs: we usually use the relative hql file-name, not absolute, and oozie expects to find the file in same folder the workflow.xml file is [18:17:54] this being said: if you don't change the path for your workflow.xml, you still resolve relative to the dated folder - restart needed :) [18:23:03] razzi: let me know if you need help with the restarts and all [18:25:08] Thanks joal, just waiting on the refinery deploy, restarts are next, and that'll be it for the train [18:25:28] ack razzi - I'm available as needed [18:38:30] joal: I'm looking at the jobs restart checklist: https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Oozie/Administration#Job_Restart_Checklist [18:38:30] can you confirm the checklist is all set? [18:39:15] razzi: all good for the webrequest jobs! [18:44:20] Analytics-Clusters, Discovery-Search (Current work): Missing hourly partition for event.mediawiki_revision_recommandation_create - https://phabricator.wikimedia.org/T283084 (JAllemandou) Heya @EBernhardson, not having canary events in the refined data is expected: https://github.com/wikimedia/analytics-r... [18:46:17] !log removing extraneous python-kafka and python-confluent-kafka deb packages from analytics cluster - T275786 [18:46:20] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [18:46:20] T275786: Remove all debian python-* packages installed for analytics clients, use conda instead - https://phabricator.wikimedia.org/T275786 [18:57:52] !log deployed refinery via scap, then deployed to hdfs [18:57:54] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [19:06:13] milimetric: druid? 
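Editor's note on joal's explanation of why a restart is needed after a refinery deploy: jobs are submitted with `refinery_directory` pinned to the newest dated folder at submission time, and relative script names resolve next to `workflow.xml` inside that folder. The Python sketch below models just that reasoning. The directory names, job path, and script name are hypothetical, and `sorted(...)[-1]` only approximates the `hdfs dfs -ls ... | tail -n 1` step quoted above.

```python
import posixpath


def latest_refinery_dir(listing):
    """Pick the last entry of a name-sorted listing, like `... | tail -n 1`."""
    return sorted(listing)[-1]


def resolve_script(workflow_path, script_name):
    """Resolve a relative script name next to the workflow.xml that uses it."""
    return posixpath.join(posixpath.dirname(workflow_path), script_name)


# Hypothetical dated deployment folders, for illustration only.
before_deploy = [
    "/wmf/refinery/2021.04.27-example",
    "/wmf/refinery/2021.05.11-example",
]
after_deploy = before_deploy + ["/wmf/refinery/2021.05.18-example"]

# A job submitted before the deploy stays pinned to the folder it was given.
running_job = resolve_script(
    posixpath.join(latest_refinery_dir(before_deploy), "oozie/some_job/workflow.xml"),
    "some_query.hql",
)

# Only a restart re-evaluates the listing and picks up the new folder.
restarted_job = resolve_script(
    posixpath.join(latest_refinery_dir(after_deploy), "oozie/some_job/workflow.xml"),
    "some_query.hql",
)
```

In this model `running_job` keeps pointing into the 2021.05.11 folder after the deploy, which matches the conclusion in the chat that a changed script needs a job restart even when the XML itself did not change.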
[19:29:30] Analytics-Clusters, Discovery-Search (Current work): Missing hourly partition for event.mediawiki_revision_recommandation_create - https://phabricator.wikimedia.org/T283084 (EBernhardson) [19:33:08] Analytics-Clusters, Discovery-Search (Current work): Missing hourly partition for event.mediawiki_revision_recommandation_create - https://phabricator.wikimedia.org/T283084 (EBernhardson) Hmm, is this a new limitation? We first deployed this stream in february and no events came through for a few months,... [20:02:56] razzi: am in meet if you wanna sync! [20:03:08] ottomata: ok, give me a few minutes break and I will! [20:06:29] k! [20:12:13] Analytics-Clusters, Discovery-Search (Current work): Missing hourly partition for event.mediawiki_revision_recommandation_create - https://phabricator.wikimedia.org/T283084 (Ottomata) Canary events are enabled for this stream. They are used to make create the partitions, but are filtered out of the refi... [20:22:40] !log restart oozie virtualpageview hourly, virtualpageview druid daily, virtualpageview druid monthly [20:22:42] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [20:23:27] mforns: we are missing some virtualpageview hourly SLAs [20:23:29] looking... [20:24:51] hm [20:25:48] ottomata: where did you get that? [20:27:06] oh oh sorry, this is probably razzis oozie restarts [20:27:15] mforns in analytics-alerts email [20:27:37] ottomata: uou! just received them now [20:27:56] ottomata: oh, the deployment train, right? yea, makes sense [20:28:08] yeah [20:28:11] false alarm! [20:28:17] just saw them and thought maybe related to our deploy today [20:33:05] ok ottomata want to sync for a minute? [20:33:16] ya [20:33:36] bc! 
[21:01:24] Analytics, Analytics-Kanban, Packaging: Create a debian package for Apache Airflow - https://phabricator.wikimedia.org/T277012 (Ottomata) @Joe @akosiaris q: I know using docker images in prod outside of k8s is not really done, but...could we? I also know we don't allow users to run docker images fo... [22:14:39] Analytics, Analytics-Kanban: Generalize the current Airflow puppet/scap code to deploy a dedicated Analytics instance - https://phabricator.wikimedia.org/T272973 (Ottomata) Airflow 2 supports HA scheduler: https://airflow.apache.org/docs/apache-airflow/stable/scheduler.html#running-more-than-one-schedule... [22:25:10] Analytics, Analytics-Kanban, Packaging: Create a debian package for Apache Airflow - https://phabricator.wikimedia.org/T277012 (Ottomata) I ask because it'd be slick to use Deployment Pipeline and Blubber's [[ https://wikitech.wikimedia.org/wiki/Blubber/User_Guide#Variant_Config | python variant conf... [22:27:57] Analytics, Analytics-Kanban: Generalize the current Airflow puppet/scap code to deploy a dedicated Analytics instance - https://phabricator.wikimedia.org/T272973 (Ottomata) Oh, perhaps MariaDB will work for Airflow HA now that https://jira.mariadb.org/browse/MDEV-13115 is resolved. Would need a pretty r... [23:03:11] Analytics-Clusters, Analytics-Kanban, Patch-For-Review: Upgrade the Hadoop masters to Debian Buster - https://phabricator.wikimedia.org/T278423 (razzi) It should be ok to merge the yarn patch first; from the hadoop docs: > yarn.scheduler.capacity..state > The state of the queue. Can be... [23:07:09] Analytics, Analytics-Kanban, Product-Analytics: Aggregate table not working after superset upgrade - https://phabricator.wikimedia.org/T280784 (razzi) I think we're all set here; let me know if anything else needs to be done.