[02:56:54] <wikibugs>	 (03PS1) 10Ladsgroup: Add scala job for reliability metrics of Wikidata [analytics/refinery/source] - 10https://gerrit.wikimedia.org/r/686066 (https://phabricator.wikimedia.org/T274413)
[03:02:39] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] Add scala job for reliability metrics of Wikidata [analytics/refinery/source] - 10https://gerrit.wikimedia.org/r/686066 (https://phabricator.wikimedia.org/T274413) (owner: 10Ladsgroup)
[03:11:43] <wikibugs>	 (03PS2) 10Ladsgroup: Add scala job for reliability metrics of Wikidata [analytics/refinery/source] - 10https://gerrit.wikimedia.org/r/686066 (https://phabricator.wikimedia.org/T274413)
[03:16:05] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] Add scala job for reliability metrics of Wikidata [analytics/refinery/source] - 10https://gerrit.wikimedia.org/r/686066 (https://phabricator.wikimedia.org/T274413) (owner: 10Ladsgroup)
[03:19:00] <wikibugs>	 (03PS3) 10Ladsgroup: Add scala job for reliability metrics of Wikidata [analytics/refinery/source] - 10https://gerrit.wikimedia.org/r/686066 (https://phabricator.wikimedia.org/T274413)
[03:23:13] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] Add scala job for reliability metrics of Wikidata [analytics/refinery/source] - 10https://gerrit.wikimedia.org/r/686066 (https://phabricator.wikimedia.org/T274413) (owner: 10Ladsgroup)
[03:35:06] <wikibugs>	 (03PS4) 10Ladsgroup: Add scala job for reliability metrics of Wikidata [analytics/refinery/source] - 10https://gerrit.wikimedia.org/r/686066 (https://phabricator.wikimedia.org/T274413)
[03:39:24] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] Add scala job for reliability metrics of Wikidata [analytics/refinery/source] - 10https://gerrit.wikimedia.org/r/686066 (https://phabricator.wikimedia.org/T274413) (owner: 10Ladsgroup)
[03:57:05] <wikibugs>	 (03PS5) 10Ladsgroup: Add scala job for reliability metrics of Wikidata [analytics/refinery/source] - 10https://gerrit.wikimedia.org/r/686066 (https://phabricator.wikimedia.org/T274413)
[04:01:53] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] Add scala job for reliability metrics of Wikidata [analytics/refinery/source] - 10https://gerrit.wikimedia.org/r/686066 (https://phabricator.wikimedia.org/T274413) (owner: 10Ladsgroup)
[04:10:25] <wikibugs>	 (03PS6) 10Ladsgroup: Add scala job for reliability metrics of Wikidata [analytics/refinery/source] - 10https://gerrit.wikimedia.org/r/686066 (https://phabricator.wikimedia.org/T274413)
[05:57:30] <elukey>	 good morning!
[05:57:41] <elukey>	 joal: no hadoop yarn failures in alerts@ \o/
[05:57:51] <elukey>	 (I mean nm failures)
[05:58:35] <wikibugs>	 10Analytics: Reset Kerberos password for nahidunlimited - https://phabricator.wikimedia.org/T282077 (10Nahid) Hey @razzi, I got the password and it's working. Thanks.
[06:25:02] <joal>	 \o/
[06:25:08] <joal>	 Good morning elukey :)
[06:30:27] <elukey>	 bonjour!
[06:43:48] <wikibugs>	 10Analytics: Yarn NM stopping due to failures while creating native threads - https://phabricator.wikimedia.org/T281792 (10elukey) No errors for native threads registered in the past hours, it looks that we are out of the woods, but I'll wait until Monday before declaring victory.
[06:44:04] <wikibugs>	 10Analytics, 10Analytics-Kanban: Yarn NM stopping due to failures while creating native threads - https://phabricator.wikimedia.org/T281792 (10elukey) p:05Triage→03High a:03elukey
[07:07:07] <joal>	 elukey: unique-devices cassandra job succeeded today
[07:07:12] <joal>	  /o\
[07:07:19] * joal understand even less
[07:08:19] <elukey>	 the one that was failing for max open files??
[07:23:04] <joal>	 yeah
[07:24:24] <elukey>	 joal: have we tried to re-run the older ones?
[07:24:29] <elukey>	 just to see if they keep failing
[07:25:11] <joal>	 So today's status: there still is a failing job (top_percountry) - the uniques one has succeeded
[07:26:03] <elukey>	 what I mean is the two failed ones in
[07:26:03] <elukey>	 https://hue.wikimedia.org/hue/jobbrowser/#!id=0011324-210426062240701-oozie-oozi-C
[07:27:15] <joal>	 From the logs: top_percountry has failed due to too many HashedWheelTimer instances - The uniques job also had failing reducer for the same problem, but, after 2 failed reducer, another succeeded without the log-lines of HashedWheelTimer instanciation
[07:27:21] <joal>	 MEHHHHH --^ !
[07:27:52] <joal>	 elukey: I'm gonna try to rerun, but it failed yesterday
[07:28:22] <elukey>	 what it would be interesting to know is if they keep doing it, then it may be due something to the specific data
[07:28:31] <elukey>	 otherwise it wouldn't make sense
[07:28:57] <joal>	 elukey: could be related to classpath as well
[07:29:22] <joal>	 elukey: if jars are loaded in different order over different workers (wouldn't know why though)
[07:30:12] <joal>	 from the successfull uniques job: failed reducers on 1090 and 1121, successful on 1125
[07:45:52] <joal>	 elukey: we're not the only ones - https://issues.apache.org/jira/browse/FLINK-9009
[07:57:03] <wikibugs>	 (03PS1) 10Ladsgroup: oozie: Add oozie job for gather wikidata reliability metrics [analytics/refinery] - 10https://gerrit.wikimedia.org/r/686383 (https://phabricator.wikimedia.org/T274413)
[07:57:21] <elukey>	 joal: sorry you lost me, those people are having the same problem but ingesting to cassandra via Flink?
[07:57:30] <joal>	 correct sir
[07:57:36] <elukey>	 ahhh
[08:03:58] <joal>	 I think we're hitting some kind of race-condition problem at nodes-communication initialisation
[08:32:37] <wikibugs>	 10Analytics, 10Services (watching): Consider converting AQS to TypeScript - https://phabricator.wikimedia.org/T206269 (10Aklapper)
[09:14:51] <wikibugs>	 10Analytics, 10WMDE-Analytics-Engineering, 10WMDE-New-Editors-Banner-Campaigns: Drop old WMDEBanner events from Hive - https://phabricator.wikimedia.org/T281300 (10Merle_von_Wittich_WMDE) >>! In T281300#7067393, @Ottomata wrote: >> Can you confirm, then, that we can delete data older than 90 days? :-) >  >...
[09:35:37] <wikibugs>	 10Analytics, 10WMDE-Analytics-Engineering, 10WMDE-New-Editors-Banner-Campaigns: Drop old WMDEBanner events from Hive - https://phabricator.wikimedia.org/T281300 (10GoranSMilovanovic) @Merle_von_Wittich_WMDE @Ottomata @mforns   For our New Editors campaigns I have been using `event.wmdebannerinteractions` for...
[09:45:20] <joal>	 heya team - got a power cut this morning - sorry for that
[10:38:07] * elukey lunch!
[13:51:53] <mforns>	 hi teammm
[13:52:11] <wikibugs>	 10Analytics, 10DBA, 10Event-Platform, 10WMF-Architecture-Team, 10Services (later): Consistent MediaWiki state change events | MediaWiki events as source of truth - https://phabricator.wikimedia.org/T120242 (10Marostegui) >>! In T120242#6925040, @Ottomata wrote: >> How would it handle if the replica goes...
[13:53:57] <milimetric>	 joal: me too!  How weird... would be cool if it was a neutrino shooting through both our towns wreacking havok
[14:00:34] <joal>	 :)
[14:08:01] <wikibugs>	 10Analytics, 10Analytics-EventLogging, 10QuickSurveys, 10WMDE-TechWish, 10Readers-Web-Backlog (Tracking): QuickSurveys should show an error when response is blocked - https://phabricator.wikimedia.org/T256463 (10Milimetric) So I'm not sure if you're talking about other problems but I hear two:  1. async...
[14:57:43] <wikibugs>	 10Analytics-Clusters, 10Analytics-Kanban, 10Technical-blog-posts: Story idea for Blog: Migration of the Analytics Hadoop infrastructure to Apache Bigtop - https://phabricator.wikimedia.org/T277133 (10srodlund) @elukey I noticed that some other folks left comments on the document after I did my pass.  I resol...
[14:59:58] <wikibugs>	 10Analytics-Clusters, 10Analytics-Kanban, 10Technical-blog-posts: Story idea for Blog: Migration of the Analytics Hadoop infrastructure to Apache Bigtop - https://phabricator.wikimedia.org/T277133 (10elukey) @srodlund thanks a lot for the extra pass, I'll try to resolve the last open comments today so you'll...
[15:22:52] <hnowlan>	 what's the distinction between an-druidNNNN and druidNNNN hosts? 
[15:23:34] <elukey>	 so an-druid100x is the new naming for the analytics cluster
[15:23:52] <elukey>	 that is actually composed by druid100[1-3] and an-druid100[1-3]
[15:24:14] <elukey>	 err an-druid100[1,2]
[15:24:36] <elukey>	 the first three are the ones to replace with an-druid100[3-5] so we'll have the same naming
[15:25:04] <elukey>	 druid100[4-9] are the nodes that AQS calls for the mw edit api (that we don't need to refresh now)
[15:25:10] <elukey>	 (they are behind a LVS VIP)
[15:30:11] <milimetric>	 mforns: the cassandra 3 load job failures are expected, it's what Joseph was talking about yesterday, the weird errors
[15:30:16] <milimetric>	 sorry I just saw your email
[15:30:36] <milimetric>	 so he's working on figuring out what the problem is and I'm close (I hope) to getting a POC spark-cassandra-connector version
[15:31:56] <milimetric>	 joal: success!
[15:31:57] <milimetric>	 aqs@cqlsh> select * from "local_group_default_T_unique_devices_test".data;
[15:32:03] <milimetric>	 (on aqs1010-a)
[15:32:11] <milimetric>	 submitting super hacky POC patch now
[15:32:12] <mforns>	 milimetric: thanks! I realized that, once I saw the "too many files" error in the logs
[15:32:17] <elukey>	 hnowlan: the relevant info about Druid is that it needs Zookeeper, so at the time we added dedicated clusters on the same nodes (we didn't have an-conf100[1-3] yet)
[15:33:00] <mforns>	 milimetric: awesome!
[15:33:08] <elukey>	 hnowlan: druid100[1-3] are running zookeeper, but there is a procedure to move the same node elsewhere
[15:33:12] <wikibugs>	 (03PS1) 10Milimetric: [WIP] POC: loading cassandra directly from spark [analytics/refinery/source] - 10https://gerrit.wikimedia.org/r/686629
[15:33:38] <mforns>	 BTW a-team, anyone for friday fun?
[15:33:40] <milimetric>	 mforns: the code is so clean it'll make you cry (some details to work out still, I'm sure, but it definitely proves that loading works for this particular query
[15:33:43] <milimetric>	 I'm in mforns 
[15:33:44] <mforns>	 hehehe
[15:34:49] <elukey>	 milimetric: qq - the spark code doesn't optimize where to send data to right? IIRC the current code sends data directly to the nodes in the ring that handles it, to avoid being proxied etc.. (curious about the new code)
[15:35:07] <wikibugs>	 (03CR) 10Milimetric: "To test:" [analytics/refinery/source] - 10https://gerrit.wikimedia.org/r/686629 (owner: 10Milimetric)
[15:35:10] <elukey>	 (current loading code I mean, the one failing)
[15:35:49] <elukey>	 or is the saveToCassandra() doing the magic?
[15:36:10] <milimetric>	 elukey: I'm not sure what it does under the hood yet, it's using the datastax spark-cassandra-connector, I'm not sure if it detects the cluster and does fancy stuff, or if that's configurable, or even if it's an option at all.  I know all it asks for is one of the hosts in the cluster
[15:36:42] <milimetric>	 another thing I notice is that it logs a line for every single row it inserts... which seems insane and silly
[15:36:47] <milimetric>	 so that should be turned off
[15:37:45] <elukey>	 ahahahah
[15:37:49] <elukey>	 yes yes probably
[15:38:04] <elukey>	 olja: thanks a lot for the suggestions and the review, really appreciated!
[15:38:10] <elukey>	 I think that we are now ready to publish :)
[15:38:15] <milimetric>	 olja's on IRC!  Welcome :)
[15:39:19] <wikibugs>	 10Analytics-Clusters, 10Analytics-Kanban, 10Technical-blog-posts: Story idea for Blog: Migration of the Analytics Hadoop infrastructure to Apache Bigtop - https://phabricator.wikimedia.org/T277133 (10elukey) @srodlund it seems that there is nothing outstanding anymore, we can publish!!
[15:42:24] <wikibugs>	 10Analytics-Clusters, 10Analytics-Kanban, 10Technical-blog-posts: Story idea for Blog: Migration of the Analytics Hadoop infrastructure to Apache Bigtop - https://phabricator.wikimedia.org/T277133 (10srodlund) Oh awesome! Sounds good! I will move this over!
[15:42:57] * elukey dances
[15:48:12] <joal>	 milimetric: The spark-cassandra is indeed super-great :)
[15:48:19] <joal>	 milimetric: let's test that early next week
[15:48:55] <milimetric>	 joal: sure, so I used the POC to insert into a test table on cassandra 3, and I am looking forward to testing together, should be fun
[15:49:15] <joal>	 yay :)
[15:49:49] <joal>	 milimetric: the thing to make is that we can turn off the heavy logging, and that loading big datasets works correctly
[15:50:03] <joal>	 After that it's a go :)
[15:50:15] <milimetric>	 yeah, makes sense
[16:14:23] <wikibugs>	 10Analytics-Radar, 10Dumps-Generation: Temp files left around in wikistats_1/ ? - https://phabricator.wikimedia.org/T280311 (10ArielGlenn) >>! In T280311#7067087, @Milimetric wrote: > So it looks like the https://dumps.wikimedia.org/other/wikistats_1.0/ folder is empty, so that can be deleted. >  > The https:/...
[16:16:24] <mforns>	 joal: sorry I didn't understand, you say that we can turn off logging, and that will fix jobs that are failing currently?
[16:23:48] <wikibugs>	 10Analytics-Clusters, 10Analytics-Kanban, 10Technical-blog-posts: Story idea for Blog: Migration of the Analytics Hadoop infrastructure to Apache Bigtop - https://phabricator.wikimedia.org/T277133 (10srodlund) @elukey do you have a credit for the image of all the logos that is in the doc? I can just put you...
[16:24:41] <wikibugs>	 10Analytics, 10WMDE-Analytics-Engineering, 10WMDE-New-Editors-Banner-Campaigns: Drop old WMDEBanner events from Hive - https://phabricator.wikimedia.org/T281300 (10Ottomata) Ok let me try to rephrase, there are actually 2 distinct questions here.  1. Do you need historical data (older than 90 days) in the `e...
[16:44:14] <wikibugs>	 10Analytics-Clusters, 10Analytics-Kanban, 10Technical-blog-posts: Story idea for Blog: Migration of the Analytics Hadoop infrastructure to Apache Bigtop - https://phabricator.wikimedia.org/T277133 (10srodlund) @elukey I decided to just use that credit and publish the post. Before I announce it widely, will y...
[16:47:56] <wikibugs>	 10Analytics, 10DBA, 10Event-Platform, 10WMF-Architecture-Team, 10Services (later): Consistent MediaWiki state change events | MediaWiki events as source of truth - https://phabricator.wikimedia.org/T120242 (10Ottomata) Interesting thanks!  So brainstorming how that would work for Debezium, since Debezium...
[16:58:14] <wikibugs>	 10Analytics-Clusters, 10Analytics-Kanban, 10Technical-blog-posts: Story idea for Blog: Migration of the Analytics Hadoop infrastructure to Apache Bigtop - https://phabricator.wikimedia.org/T277133 (10elukey) @srodlund +1 for the image credit thanks!  I've quickly read the blogpost and the only thing that I f...
[17:00:14] * elukey afk!
[17:12:33] <wikibugs>	 10Analytics-Clusters, 10Analytics-Kanban, 10Technical-blog-posts: Story idea for Blog: Migration of the Analytics Hadoop infrastructure to Apache Bigtop - https://phabricator.wikimedia.org/T277133 (10srodlund) @elukey Great! I fixed this and have announced the post on Twitter and Slack! Have a great weekend!
[17:12:39] <wikibugs>	 10Analytics: Reset Kerberos password for nahidunlimited - https://phabricator.wikimedia.org/T282077 (10razzi) 05Open→03Resolved Great!
[17:34:36] <wikibugs>	 10Analytics, 10Analytics-Kanban: [Newpyter] Can't install 'nloptr' R package on stat100[467], but can on stat100[58] - https://phabricator.wikimedia.org/T282260 (10mpopov)
[17:34:56] <wikibugs>	 10Analytics, 10Analytics-Kanban: [Newpyter] Can't install 'haven' package with conda R but can with system R - https://phabricator.wikimedia.org/T282262 (10mpopov)
[17:39:04] <wikibugs>	 10Analytics, 10Analytics-Kanban: [Newpyter] Can't install 'nloptr' R package on stat100[467], but can on stat100[58] - https://phabricator.wikimedia.org/T282260 (10mpopov) Similar `/bin/gtar` issue when installing [[ https://mc-stan.org/cmdstanr/ | cmdstanr ]] package in R:  ` $ install.packages("cmdstanr") .....
[18:20:27] <Krinkle>	 Question about kafka and kafkacat: Do I need to specify all brokers, or is specifying one enough so long as that one is not down? E.g. will it discover the other shards or will I get only a subset of the data if I don't specify them all?
[18:22:21] <ottomata>	 one is enough!
[18:22:26] <ottomata>	 its just used for bootstrapping
[18:22:36] <ottomata>	 if that broker went down you'd need to provide a different one
[18:22:41] <addshore>	 good to know!
[18:22:45] <ottomata>	 if you have the full list and a broker went down it would just try another
[18:22:53] <ottomata>	 so for prod services, you should list them all
[18:22:56] <ottomata>	 for just CLI usage, one is fine
[18:36:27] <ebernhardson>	 is anyone doing full integration testing of browser based eventlogging, essentially clicking around in the browser and then comparing against a set of expected events, that i could look over?
[18:37:06] <ebernhardson>	 i wrote something to do that in 2016, but before i go back to making that work again thought i should look around
[18:56:55] <wikibugs>	 10Analytics, 10Analytics-Kanban: [Newpyter] Can't install 'nloptr' R package on stat100[467], but can on stat100[58] - https://phabricator.wikimedia.org/T282260 (10nettrom_WMF) I ssh'ed to stat1006 and created a new stacked conda environment and activated it. Running R (`/usr/lib/anaconda-wmf/bin/R` specifical...
[19:03:11] <wikibugs>	 10Analytics, 10Analytics-Kanban: [Newpyter] Can't install 'haven' package with conda R but can with system R - https://phabricator.wikimedia.org/T282262 (10nettrom_WMF) Similarly as in T282260, I tested this out with R 3.6.1 too. The installation fails with the same error message that's reported.
[19:06:35] <bearloga>	 ottomata: are there any special setup scripts that run on stat* hosts the first time a user ssh's in? (besides creating their homedir) Wondering if I can just rm -rf everything in my homedir or if I there's special bashrcs or something and I should request an sre to completely nuke my homedir to reinitialize.
[19:13:14] <ottomata>	 ebernhardson:  not from our team, but maybe from product data infrastructure a bit?  
[19:13:18] <bearloga>	 ebernhardson: you may want to check in with jason linehan and mholloway who are working with metrics platform (set of client libraries built on top of event platform) to replace eventlogging -- cf. T276378
[19:13:19] <stashbot>	 T276378: Release Metrics Platform v1 - https://phabricator.wikimedia.org/T276378
[19:13:38] <ottomata>	 bearloga:  your ssh keys :)
[19:13:45] <ottomata>	 but puppet will put everything else back in there
[19:13:50] <ottomata>	 actually...maybe puppet doesn't put those there
[19:14:22] <ottomata>	 bearloga: should be relatively safe to do, if do you break something that sre puts there puppet should bring it back within about 30 mins
[19:16:02] <ebernhardson>	 thanks! i'll take a look around
[19:16:09] <bearloga>	 ottomata: thanks! because I'm running into an issue where installing something doesn't work for me but works for Nettrom and I keep trying to figure out differences but my brain is coming back with null so I wanna try with a clean slate
[19:25:22] <bearloga>	 ebernhardson: minor correction: metrics platform = set of client libraries + a bunch of infrastructure to enable faster, easier instrumentation and reusable instruments & composable event data
[19:25:29] <bearloga>	 :P
[19:36:30] <ebernhardson>	 so nothing about testing correctness :P
[19:52:20] <wikibugs>	 10Analytics, 10Analytics-Kanban: [Newpyter] Can't install 'nloptr' R package on stat100[467], but can on stat100[58] - https://phabricator.wikimedia.org/T282260 (10mpopov) 05Open→03Resolved a:03mpopov I deleted EVERYTHING from my homedir on stat1006, including all the hidden files & directories. After st...
[19:52:23] <wikibugs>	 10Analytics, 10Analytics-Kanban, 10Patch-For-Review: Newpyter - SWAP Juypter Rewrite - https://phabricator.wikimedia.org/T224658 (10mpopov)
[20:20:26] <Krinkle>	 ottomata: thx, that helps a lot. I've been inconsistent about doing it, and always felt guilty/unsure about not doing it
[20:23:58] <addshore>	 Is there something I should know about grepping the output of kafkacat twice? I seem to not get matches if I pipe it through 2 greps? https://phabricator.wikimedia.org/T280627#7070564
[20:39:18] <ebernhardson>	 addshore: i'd guess something about flushing, try --line-buffered on the first grep
[21:15:15] <addshore>	 ebernhardson: yup, that solves it!