[00:28:07] (PS1) Srishakatux: Improve README [analytics/dashiki] - https://gerrit.wikimedia.org/r/538722
[05:40:56] hello people
[05:41:17] so I noticed that the eventlogging consumer was not pushing any data to the m4 master
[05:41:20] https://grafana.wikimedia.org/d/000000505/eventlogging?panelId=12&fullscreen&orgId=1&from=now-7d&to=now-5m
[05:41:28] eventbus data is of course stopped
[05:41:44] but apparently on the 21st the other one stopped as well?
[05:42:56] Analytics, Analytics-Kanban: Move the Analytics Refinery to Python 3 - https://phabricator.wikimedia.org/T204735 (elukey)
[05:43:57] https://phabricator.wikimedia.org/T232349 - seems not ready yet
[05:43:58] mmmmm
[05:47:14] Analytics, LDAP-Access-Requests: log-in credential confusion for Hive - https://phabricator.wikimedia.org/T233648 (dchen) worked. tx @Ottomata
[05:57:00] Analytics, Analytics-EventLogging: Disable production EventLogging analytics MySQL consumers - https://phabricator.wikimedia.org/T232349 (elukey) Something strange happened today: I noticed in icinga that the eventlogging mysql insertion rate alarm was in UNKNOWN state, so I checked the graphs and the m4...
[06:07:33] !log update Druid Kafka supervisor for netflow to index new dimensions
[06:07:35] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[06:07:46] works!!
[06:08:27] so if you re-send a supervisor json spec and Druid sees that there is a version already running, it will take care of swapping the two automatically
[06:10:02] joal: bonjour :)
[06:10:29] Do you think that adding a "druid" directory in refinery to store supervisors (atm netflow and banner impressions) could be good?
[06:10:42] because the oozie dir doesn't fit well for netflow
[06:44:34] (PS4) Elukey: Move codebase to python3 [analytics/reportupdater] - https://gerrit.wikimedia.org/r/537268 (https://phabricator.wikimedia.org/T204736)
[06:47:57] E0738: The following 1 parameters are required but were not defined and no default values are available: hive2_jdbc_url
[06:48:01] buuuuu
[06:48:14] this is one of the leftovers from Luca
[06:48:21] that guy is really terrible
[06:50:08] (CR) Elukey: "Dan/Marcel: the code is ready for a first review pass, let me know your thoughts :)" [analytics/reportupdater] - https://gerrit.wikimedia.org/r/537268 (https://phabricator.wikimedia.org/T204736) (owner: Elukey)
[06:54:22] (PS1) Elukey: oozie: add jdbc_url to mediawiki history wikitext's coordinator [analytics/refinery] - https://gerrit.wikimedia.org/r/538744 (https://phabricator.wikimedia.org/T227257)
[06:58:17] Analytics, Patch-For-Review: Move Analytics Report Updater to Python 3 - https://phabricator.wikimedia.org/T204736 (elukey) https://gerrit.wikimedia.org/r/#/c/analytics/reportupdater/+/537268/ is ready for a first review, testing with real data has not been done yet. Jenkins already supports python3, it...
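A quick illustration of the supervisor swap elukey describes at [06:08:27]: updating a running Kafka ingestion amounts to re-POSTing the spec to the Overlord's supervisor endpoint, and Druid gracefully stops the old supervisor for that datasource and starts the new one. A minimal sketch; the Overlord host/port and the spec file name are placeholders, not the actual WMF setup.

```python
import json
import requests

OVERLORD = "http://druid-overlord.example.org:8090"  # placeholder host

# e.g. the netflow supervisor spec kept under refinery's druid/ directory
with open("netflow_supervisor.json") as f:
    spec = json.load(f)

# POSTing to the same endpoint used for the initial submission is also
# how you update: Druid detects a running supervisor for the datasource
# and swaps it out for the new spec.
resp = requests.post(f"{OVERLORD}/druid/indexer/v1/supervisor", json=spec)
resp.raise_for_status()
print(resp.json())  # {'id': '<datasource name>'} on success
```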
[06:58:34] Analytics, Analytics-Kanban, Patch-For-Review: Move Analytics Report Updater to Python 3 - https://phabricator.wikimedia.org/T204736 (elukey) a: elukey
[07:16:04] Analytics, Analytics-EventLogging, EventBus, Core Platform Team Legacy (Watching / External), and 2 others: RFC: Modern Event Platform: Schema Registry - https://phabricator.wikimedia.org/T201643 (awight)
[07:16:49] Analytics, Analytics-EventLogging, EventBus, Core Platform Team Legacy (Watching / External), and 2 others: RFC: Modern Event Platform: Schema Registry - https://phabricator.wikimedia.org/T201643 (awight) I added a negative, that multiple extensions cannot use the same schema even when at the sam...
[07:29:59] Hi elukey - I'm sorry I won't have time before this afternoon I think - Naé is still home
[07:30:27] Analytics: Install Debian Buster on Hadoop - https://phabricator.wikimedia.org/T231067 (MoritzMuehlenhoff)
[07:30:44] elukey: Will ping you when I have time - sorry for that :(
[07:31:58] joal: no no don't worry, I left messages for whenever you have time, nothing urgent :)
[08:39:49] also, great news - we have openjdk-8 for bustet!
[08:39:54] *buster \o/
[08:56:00] Analytics, Analytics-Cluster, Product-Analytics: Improve Hue user management - https://phabricator.wikimedia.org/T127850 (Neil_P._Quinn_WMF)
[09:11:07] hi team, starting a lil late today
[09:11:20] o/
[09:11:50] joal: my bad, in yesterday's patch to aqs I accidentally git-added the change of the field to int, when it should stay as a long
[09:11:57] I'll post a new patch now
[09:12:38] fdans: Joseph will start working later in the evening I think
[09:12:44] (if you need him now)
[09:13:23] thank youuu elukey
[09:17:21] (PS1) Fdans: Set mediarequests per referer requests field as long [analytics/aqs] - https://gerrit.wikimedia.org/r/538846 (https://phabricator.wikimedia.org/T233622)
[09:31:07] (CR) Fdans: Cast mediarequests value as int before submitting the response (1 comment) [analytics/aqs] - https://gerrit.wikimedia.org/r/538611 (https://phabricator.wikimedia.org/T233622) (owner: Fdans)
[09:46:26] need to go to the doctor, will take an early lunch break :)
[11:23:54] (CR) Joal: [C: +2] Cast mediarequests value as int before submitting the response (1 comment) [analytics/aqs] - https://gerrit.wikimedia.org/r/538611 (https://phabricator.wikimedia.org/T233622) (owner: Fdans)
[11:27:00] (CR) Joal: [C: +2] "Thanks for catching this nuria and fdans. Sorry to have missed it :(" [analytics/aqs] - https://gerrit.wikimedia.org/r/538846 (https://phabricator.wikimedia.org/T233622) (owner: Fdans)
[11:27:39] (Merged) jenkins-bot: Set mediarequests per referer requests field as long [analytics/aqs] - https://gerrit.wikimedia.org/r/538846 (https://phabricator.wikimedia.org/T233622) (owner: Fdans)
[11:53:02] thank you for merging joal :)
[11:53:22] np fdans - Sorry for not having seen the typing error :(
[11:53:58] joal: nono, the change didn't provide enough context for you to see it as a problem :) it was totally my bad
[11:54:24] fdans: I should actually have caught it in the previous review ;) anyway - Fixed for good :)
[11:54:30] Thanks for that
[12:12:53] joal: now here is a question, do we really need the top *1000* of all files?
[12:12:59] seems like a whole lot of files
[12:13:20] fdans: this number was decided long ago, for pageviews
[12:13:53] This seemed like a good compromise between people willing to read a few top values, and others interested in analysing more data
[12:14:04] And storage-size, obviously :)
[12:14:48] joal: hmmm I don't see it as super practical to have the top 1000 files because the aqs response would be huge
[12:15:03] file names are generally way longer than article names
[12:15:04] fdans: This is what we already do for pageviews though :)
[12:15:11] hm
[12:15:55] joal: i mean look at this:
[12:16:12] lol irccloud doesn't let me paste it
[12:16:26] https://pastebin.com/55vz7fYS
[12:17:22] joal: that's about 100kb for a single result set
[12:17:31] fdans: Interestingly this is because we don't extract the file name out of the URL as we do for pageviews :)
[12:18:03] joal: because the url of the pageview can be inferred from the project name
[12:18:07] this doesn't apply here
[12:18:31] hm
[12:19:02] joal: maybe we can keep the top 1000 in cassandra but only give the top 100 in aqs?
[12:19:18] btw
[12:19:49] the load query takes a ridiculous amount of time, I don't know if this happens with pageviews
[12:19:50] Analytics, LDAP-Access-Requests: log-in credential confusion for Hive - https://phabricator.wikimedia.org/T233648 (Neil_P._Quinn_WMF) Open→Resolved a: Neil_P._Quinn_WMF Thanks, @Ottomata! I improved the documentation so that information is now on [wikitech:Analytics/Cluster/Hue](https://wikite...
[12:21:07] fdans: this is very much expected yes
[12:22:08] fdans: I checked one day of top for en.wikipedia pageviews, result is 56kb - Doesn't seem an issue to me to be in the same size range for medias - We can discuss this at standup if you want
[12:22:41] fdans: loading a few hundred million rows in a db takes some time, even if the thing is fast as hell :)
[12:23:01] joal: I'm ok with it if you're ok :)
[12:23:17] won't be a performance problem in wikistats
[12:25:01] I'm going to test the whole job and see how we do
[12:25:20] k
[12:27:39] joal: oh wait the query took a million years because I tested the monthly one lol, nvm
[12:27:51] ?
[12:28:00] Ah - Monthly top?
[12:28:03] yes
[12:28:11] indeed, this one is longer :)
[12:30:06] Oh so I didn't properly get your previous point - top values are slow at the computation stage - And per-file values are slow at the load stage
[12:31:14] yesyes, for sure :)
[12:31:57] fdans: IIRC we were counting around 3 or 4h per loaded day when we backfilled pageviews
[12:33:25] joal: openjdk-8 deployed on stat1005 (buster) :)
[12:33:33] \o/ !!!
[12:33:49] elukey: I can also play with that for my tests with janusgraph
[12:34:18] that would be great thanks :)
[12:35:00] elukey: I'll need to bump a labs-instance to buster - not sure how to do that :(
[12:35:30] fdans: I guess you can guess the backfilling period: https://grafana.wikimedia.org/d/000000417/cassandra-system?panelId=14&fullscreen&orgId=1&var-datasource=eqiad%20prometheus%2Fops&var-cluster=aqs&var-server=All&var-disk=sda&var-disk=sdb&var-disk=sdc&var-disk=sdd&var-disk=sde&from=now-30d&to=now
[12:35:48] joal: here's a question I have for you: do you think that the total number of files in upload.wikimedia.org, which is a metric we don't have, would more or less coincide with the Pages to Date metric for commons.wikimedia?
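A hedged aside on the response-size numbers above (fdans' ~100kb paste, joal's 56kb for a day of en.wikipedia top pageviews): the existing public pageviews/top endpoint returns the same top-1000 shape being proposed, so the baseline can be measured directly. The mediarequests/top endpoint under discussion did not exist yet, which is why pageviews stands in for it here.

```python
import requests

# One day of top-1000 pageviews for en.wikipedia, via the public AQS API.
url = ("https://wikimedia.org/api/rest_v1/metrics/pageviews/top/"
       "en.wikipedia/all-access/2019/09/01")
resp = requests.get(url, headers={"User-Agent": "payload-size-check"})
resp.raise_for_status()

articles = resp.json()["items"][0]["articles"]
print(f"{len(articles)} entries, {len(resp.content) / 1024:.1f} kB raw")
# File names are typically longer than article titles, which is why the
# equivalent mediarequests payload lands closer to fdans' ~100 kB figure.
```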
[12:36:25] this feels like a good approximation to me fdans
[12:36:39] But there might be tricky aspects I don't know about
[12:36:50] like files created on their own wikis for instance?
[12:38:03] yea you're right, I wasn't thinking about that
[12:38:16] But I don't know much about those
[12:38:23] joal: probably a better approximation would be to count the number of pages in the File namespace
[12:38:41] Yes !
[12:50:11] tried a basic spark2 sql query against webrequest from stat1005, seems to work fine
[12:50:30] in theory if this works as expected we'll be able to release stat1005 to everybody
[12:50:36] together with the GPU
[12:50:39] \o/
[12:50:47] Hooray :)
[13:08:55] Analytics, Analytics-EventLogging: Disable production EventLogging analytics MySQL consumers - https://phabricator.wikimedia.org/T232349 (Ottomata) Very strange! Joseph and I did end up bouncing eventloggingctl stuff last Wednesday for deployment of ua-parser, but that doesn't seem related.
[13:09:53] Analytics, Services (watching): Create mediarequests top files AQS endpoint - https://phabricator.wikimedia.org/T233716 (fdans)
[13:10:43] Analytics, Services (watching): Add cassandra loading job for top mediarequests - https://phabricator.wikimedia.org/T233717 (fdans)
[13:10:45] mgerlach: hello!
[13:10:52] just checking in, how are you, how goes the system exploration?
[13:11:53] Oh by the way elukey - How do we handle the mediawiki-history-text stuff?
[13:12:02] (PS1) Fdans: (wip) Add oozie job to load top mediarequests data [analytics/refinery] - https://gerrit.wikimedia.org/r/538880 (https://phabricator.wikimedia.org/T233717)
[13:12:23] ottomata: hey! was gonna write you with a bunch of questions on the jupyter notebooks and swap and queries with spark.
[13:12:35] elukey: I suggest manually fixing this month (spark job ran successfully, which is the main thing) - and merging the patch
[13:12:53] joal: I am doing some tests on stat1005 so now it is borked, lemme know when you need it :) for mediawiki, I'd say that we can manually upload the coordinator's properties?
[13:13:05] ah okok then please do
[13:13:25] elukey: a manual fix should be better: no need to rerun the huge spark job
[13:13:35] And then deploy the patch (clean way :)
[13:13:37] mgerlach: cool ok!
[13:13:44] ottomata: 2 main points: 1] custom kernels on swap. 2] some help with memory issues in my spark-sql queries
[13:13:44] mgerlach: o/
[13:14:12] mgerlach: sorry for not being responsive :)
[13:14:32] joal: no worries, I saw you were busy :)
[13:15:48] ottomata: if you have a few minutes, we could also discuss briefly on hangouts
[13:16:10] joal: ack!
[13:16:25] joal: shall I merge the patch?
[13:16:34] (CR) Joal: [C: -1] "Let's also add the parameter in the coordinator.xml and workflow.xml parameters list, so that the thing fails fast if not set please." [analytics/refinery] - https://gerrit.wikimedia.org/r/538744 (https://phabricator.wikimedia.org/T227257) (owner: Elukey)
[13:16:40] nope :)
[13:16:58] elukey: let me know if my comment makes sense :)
[13:17:01] ahhh the parameters, adding them
[13:17:18] in theory it is required by a subworkflow, but I see the point
[13:17:51] elukey: if we only put the parameter as needed in the workflow, running the coordinator will succeed - and we don't want that
[13:18:35] joal: how can the coordinator succeed if one required subworkflow requires a parameter?
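On the approximation fdans and joal converge on at [12:38:23] (count File-namespace pages rather than using Pages to Date): a sketch of what that count might look like with Spark SQL. The wmf_raw.mediawiki_page table, the snapshot value, and the boolean page_is_redirect column are assumptions about the Data Lake layout, not details taken from the discussion.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("commons-file-count").getOrCreate()

# Namespace 6 is the File: namespace in MediaWiki; redirects are
# excluded so each row approximates one actual file description page.
count = spark.sql("""
    SELECT COUNT(*) AS file_pages
    FROM wmf_raw.mediawiki_page            -- assumed sqooped table name
    WHERE snapshot = '2019-08'             -- illustrative snapshot
      AND wiki_db = 'commonswiki'
      AND page_namespace = 6
      AND NOT page_is_redirect             -- assumed boolean column
""").first().file_pages

print(count)
```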
[13:18:38] elukey@stat1005:~$ java -version
[13:18:39] ottomata: --^
[13:18:40] :)
[13:18:43] Picked up JAVA_TOOL_OPTIONS: -Dfile.encoding=UTF-8
[13:18:45] openjdk version "1.8.0_222"
[13:18:53] !log Manually repairing wmf.mediawiki_wikitext_history
[13:18:55] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[13:19:16] Analytics, Analytics-Kanban: High volume mediawiki analytics events camus import is lagging - https://phabricator.wikimedia.org/T233718 (Ottomata)
[13:20:10] elukey: the coordinator doesn't miss any parameter, so it works and tries to run workflow instances - the workflow doesn't miss any parameter, so it waits for data availability, then starts, and finally fails when the missing parameter is actually detected - Exactly what happened this month
[13:20:30] Putting the parameters as mandatory at all levels enforces fail-fast
[13:21:19] makes sense elukey ?
[13:21:22] (PS2) Elukey: oozie: add jdbc_url to mediawiki history wikitext's coordinator [analytics/refinery] - https://gerrit.wikimedia.org/r/538744 (https://phabricator.wikimedia.org/T227257)
[13:21:37] joal: ah okok yes now it is clearer
[13:21:56] joal: hangout with martin or for something else?
[13:22:07] elukey: We would have found the missing parameter at launch time with th
[13:22:11] the patch you provide
[13:22:20] ottomata: sure
[13:22:23] * elukey nodes
[13:22:25] elukey: nice!
[13:22:27] *nods
[13:22:50] elukey: this is a graph movement :)
[13:22:54] * joal nodes as well
[13:22:56] all credits to Saint Moritz protector of the Analytics team (saint due to his patience :D)
[13:23:13] Ah men! -^
[13:23:34] cave ottomata / mgerlach, or somewhere else?
[13:24:02] i can come in 10 mins, let me get some camus fix stuff going
[13:24:09] sure
[13:24:18] (CR) Joal: [V: +2 C: +2] "Thanks elukey" [analytics/refinery] - https://gerrit.wikimedia.org/r/538744 (https://phabricator.wikimedia.org/T227257) (owner: Elukey)
[13:24:27] but i'm not sure if mgerlach was asking for a hangout now, if so i'm in but if not that's ok too!
[13:24:38] sure!
[13:24:47] ottomata joal: I can do now or in 10 mins
[13:24:51] :)
[13:24:58] looks like in 10 mins is the time
[13:25:04] ottomata: Please give the go
[13:25:17] k
[13:28:25] ottomata: about camus, I really think more topic-partitions is the solution
[13:28:56] ottomata: just sayin'
[13:28:57] :)
[13:29:40] yup
[13:29:43] except
[13:29:49] webrequest runs with as much data in one partition
[13:30:02] (verifying...)
[13:30:05] :)
[13:30:35] Analytics, EventBus, WMF-JobQueue, CPT Initiatives (Modern Event Platform (TEC2)), good first bug: EventBus extension must not send batches that are too large - https://phabricator.wikimedia.org/T232392 (Johan) I suspect this is the reason why MassMessage isn't delivering Tech News to most su...
[13:30:44] hmm no i think i'm wrong
[13:30:48] ottomata: the 10 mins thing also helps i guess (more frequent runs, more in-memory)
[13:30:53] it looks like webrequest maxes around 2.5K per partition
[13:31:02] and cirrus?
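To make joal's fail-fast point at [13:20:30] concrete: the E0738 error quoted earlier ([06:47:57]) is what Oozie raises at submission time when a property declared in the parameters block has neither a value nor a default, so declaring hive2_jdbc_url at both the coordinator and workflow level turns a slow runtime failure into an immediate one. A hypothetical excerpt:

```xml
<!-- Hypothetical excerpt from a coordinator.xml / workflow.xml.
     With no <value> default, Oozie rejects a submission that does not
     define hive2_jdbc_url (error E0738), instead of letting workflow
     instances wait for data and fail hours later in a sub-workflow. -->
<parameters>
    <property>
        <name>hive2_jdbc_url</name>
    </property>
</parameters>
```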
[13:31:11] api-request is 7.5K
[13:31:19] cirrus is 2.5K
[13:31:36] ottomata: we bumped webrequest partitions a while back because of something similar IIRC
[13:31:45] aye
[13:31:51] it's weird that camus does speculative execution
[13:31:57] we should probably turn that off
[13:31:59] https://yarn.wikimedia.org/proxy/application_1564562750409_217973/mapreduce/attempts/job_1564562750409_217973/m/RUNNING
[13:32:03] ottomata: the number of mappers won't actually change anything here, as a single mapper per part is assigned
[13:32:07] it is just reading the same data from kafka twice!
[13:32:12] yes
[13:32:17] but before yesterday
[13:32:17] PROBLEM - Check the last execution of camus-mediawiki_analytics_events on an-coord1001 is CRITICAL: CRITICAL: Status of the systemd unit camus-mediawiki_analytics_events https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[13:32:27] the one mediawiki_events job was configured with 10 mappers only
[13:32:33] and it was doing all the other topics too
[13:32:36] yeah yeah I have read that
[13:32:38] that was about 34 partitions
[13:32:40] aye
[13:32:47] but yea here increasing mappers won't help until we add partitions ya
[13:32:47] bumping for those is needed,
[13:33:01] for cirrus+api, we need more parts and more mappers
[13:33:06] ^^^ is me, i'm running it manually
[13:33:09] aye
[13:33:11] k
[13:33:20] ottomata: I don't get your point about the yarn job
[13:33:34] joal: see how there are 4 mappers active?
[13:33:40] there are 2 for the exact same partition
[13:33:42] e.g.
[13:33:49] eqiad.mediawiki.api-request:1002:0
[13:33:56] eqiad.mediawiki.api-request:1002:0
[13:33:56] Ok I get it
[13:33:58] because
[13:34:01] speculative execution!
[13:34:12] so those are both consuming the same partition data from kafka
[13:34:15] writing to a temp dir
[13:34:19] and whichever finishes first gets kept
[13:34:42] ottomata: I was about to tell that, but preferred to check first :)
[13:34:51] We should configure camus not to speculate ;)
[13:34:54] hm actually, i'm going to kill this job and try
[13:34:58] k
[13:35:31] ottomata: In job conf: mapreduce.map.speculative = true
[13:35:37] We should change that
[13:37:29] ok, that's better
[13:37:30] https://yarn.wikimedia.org/proxy/application_1564562750409_218010/mapreduce/attempts/job_1564562750409_218010/m/RUNNING
[13:37:36] will do that for all camus jobs
[13:38:19] ACKNOWLEDGEMENT - Check the last execution of camus-mediawiki_analytics_events on an-coord1001 is CRITICAL: CRITICAL: Status of the systemd unit camus-mediawiki_analytics_events ottomata manually running a camus job https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[13:38:23] ok mgerlach joal
[13:38:31] https://meet.google.com/rxb-bjxn-nip
[14:00:43] heads up: I am merging ferm rules for matomo and analytics meta
[14:00:46] to allow backups
[14:00:53] let me know if you see any weird error :)
[14:05:55] joal: does this error ring a bell at all? It's why my loading job is failing and it isn't very descriptive.
Seems like a classpath situation:
[14:05:58] https://www.irccloud.com/pastebin/YTqtd8lM/
[14:12:18] hey all :]
[14:13:27] (PS2) Fdans: (wip) Add oozie job to load top mediarequests data [analytics/refinery] - https://gerrit.wikimedia.org/r/538880 (https://phabricator.wikimedia.org/T233717)
[14:14:55] a-team: there will be a test of db backup for analytics meta and matomo, it shouldn't create impact but let me know if you see anything weird
[14:15:30] elukey: yeehaw
[14:15:33] k
[14:16:28] Analytics, Analytics-Kanban, Research-Backlog, Patch-For-Review: Release edit data lake data as a public json dump / mysql dump, other? - https://phabricator.wikimedia.org/T208612 (mforns) Cool! Thanks @Bstorm
[14:29:24] thanks team for taking care of the alarms on my ops week
[14:39:05] fdans: nope, I don't know that
[14:39:16] fdans: feels like a misconfiguration
[14:39:43] joal: I suspect it's something about using the latest cassandra jar, I'm trying with the same one as top pageviews
[14:40:02] feels bizarre as nothing changed normally
[14:40:57] joal: who knows, top pageviews is using 0.0.35
[14:45:16] ok the backup didn't work
[14:45:21] ottomata, mgerlach - I'm back - anything I can help with?
[14:45:34] since a consistent backup needs a ton of locking and it can't be done in our use cases
[14:46:12] !log temporarily disabled camus-mediawiki_analytics_events systemd timer on an-coord1001 - T233718
[14:46:16] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[14:46:16] T233718: High volume mediawiki analytics events camus import is lagging - https://phabricator.wikimedia.org/T233718
[14:46:27] elukey: joal, any objections to me adding partitions to those high volume topics soon?
[14:46:31] i'll do it on both main and jumbo
[14:46:38] to keep the partitions the same on both clusters
[14:46:40] works for me ottomata
[14:46:51] Pchelolo: ?
[14:46:54] ottomata: this should allow faster recovery
[14:46:59] if need be i could just do it on jumbo!
[14:47:08] actually i think Pchelolo is out for a bit, will ask on task
[14:47:35] non from me
[14:47:38] *none
[14:47:57] OH
[14:47:59] these aren't on kafka main
[14:48:00] duh
[14:48:09] just on jumbo, they are produced via eventgate-analytics
[14:48:11] nm Pchelolo !
[14:48:33] joal: still trying to figure out the memory issue. perhaps I try the things you suggested and see if the problem persists. if yes, I will ping you
[14:48:55] joal: or do you already have a hunch what the problem could be?
[14:50:33] mgerlach: I think it's related to parquet reading big files - But I'd rather triple check that
[14:50:43] mgerlach: I'll try your code and see if I can replicate
[14:52:43] Analytics, Analytics-Kanban: High volume mediawiki analytics events camus import is lagging - https://phabricator.wikimedia.org/T233718 (Ottomata) Increased partitions to 12 for api-request and cirrussearch-request topics: ` kafka topics --alter --topic codfw.mediawiki.cirrussearch-request --partitions
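Back to the Camus speculative-execution fix from earlier ([13:35:31], "In job conf: mapreduce.map.speculative = true ... will do that for all camus jobs"): since each Camus mapper owns exactly one Kafka topic-partition, a speculative duplicate attempt just re-reads the same partition into a temp dir. A hypothetical sketch of the change, assuming the job's .properties file is passed through to the Hadoop job configuration:

```properties
# Hypothetical excerpt from a camus job .properties file.
# Each mapper already owns a single topic-partition, so a speculative
# duplicate attempt only re-consumes identical data from Kafka.
mapreduce.map.speculative=false
mapreduce.reduce.speculative=false
```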
[14:52:48] !Increased partitions to 12 for api-request and cirrussearch-request topics - T233718
[14:52:48] T233718: High volume mediawiki analytics events camus import is lagging - https://phabricator.wikimedia.org/T233718
[14:55:16] (PS3) Fdans: (wip) Add oozie job to load top mediarequests data [analytics/refinery] - https://gerrit.wikimedia.org/r/538880 (https://phabricator.wikimedia.org/T233717)
[15:14:18] Analytics, Analytics-Kanban: Add more dimensions to netflow's druid ingestion specs - https://phabricator.wikimedia.org/T229682 (ayounsi) Confirmed working as expected. I manually checked the "null" country's IPs and they match "Anonymous proxies".
[15:15:26] (CR) Mforns: [C: +1] "I went over the code, and all looks good!" (13 comments) [analytics/reportupdater] - https://gerrit.wikimedia.org/r/537268 (https://phabricator.wikimedia.org/T204736) (owner: Elukey)
[15:16:50] Analytics, Analytics-Kanban: Add more dimensions to netflow's druid ingestion specs - https://phabricator.wikimedia.org/T229682 (mforns) Great!
[15:18:50] (CR) Elukey: Move codebase to python3 (1 comment) [analytics/reportupdater] - https://gerrit.wikimedia.org/r/537268 (https://phabricator.wikimedia.org/T204736) (owner: Elukey)
[15:20:48] (PS5) Elukey: Move codebase to python3 [analytics/reportupdater] - https://gerrit.wikimedia.org/r/537268 (https://phabricator.wikimedia.org/T204736)
[15:20:59] Analytics, Discovery-Search (Current work): Recreate mjolnir.msearch-prod-request topic in kafka-jumbo - https://phabricator.wikimedia.org/T233731 (EBernhardson)
[15:21:37] ebernhardson: you want it just dropped and recreated?
[15:21:40] mforns: thanks for the review! No rush on merging etc., just wanted to kick off the migration review/process/test to catch any surprise early :)
[15:21:47] ottomata: or changed? either way. The historical info there isn't important
[15:22:22] i don't think we can decrease so easily, so it'd have to be dropped i guess
[15:22:32] you don't want to just increase the # of consumers :)
[15:22:32] ?
[15:22:45] ottomata: we run one consumer per elasticsearch host
[15:22:51] ottomata: so, 30 hosts in codfw, 30 consumers
[15:22:55] k
[15:25:44] stepping afk for ~10 mins
[15:25:46] ebernhardson: you want me to do that now?
[15:25:51] ottomata: sure
[15:25:55] oook!
[15:25:58] ottomata: thanks
[15:27:01] elukey, no no, thanks for pushing this forward
[15:27:20] ebernhardson: you want 25?
[15:27:36] you just want it slightly less than the total # of consumers?
[15:28:14] ottomata: i'm randomly guessing with 25 :) it seems like we probably won't lose 5 servers from the cluster for any period of time
[15:29:06] ottomata: so, basically yes. I might have to re-tune the thread pools on the consumer side as well, will see.
[15:29:11] Analytics, Discovery-Search (Current work): Recreate mjolnir.msearch-prod-request topic in kafka-jumbo - https://phabricator.wikimedia.org/T233731 (Ottomata) Done. ` kafka topics --delete --topic mjolnir.msearch-prod-request kafka topics --create --topic mjolnir.msearch-prod-request --partitions 25...
[15:29:27] Analytics, Discovery-Search (Current work): Recreate mjolnir.msearch-prod-request topic in kafka-jumbo - https://phabricator.wikimedia.org/T233731 (Ottomata) Open→Resolved a: Ottomata
[15:35:45] Analytics, Analytics-EventLogging, Analytics-Kanban, EventBus, and 2 others: Figure out how to $ref common schema across schema repositories - https://phabricator.wikimedia.org/T233432 (Ottomata)
[15:35:48] (PS4) Fdans: (wip) Add oozie job to load top mediarequests data [analytics/refinery] - https://gerrit.wikimedia.org/r/538880 (https://phabricator.wikimedia.org/T233717)
[15:39:30] Analytics, Analytics-EventLogging, Analytics-Kanban, EventBus, and 2 others: Figure out how to $ref common schema across schema repositories - https://phabricator.wikimedia.org/T233432 (Ottomata) Added another idea to the task description: **npm dependency**. This would be effectively the same a...
[15:51:05] (CR) Mforns: Move codebase to python3 (1 comment) [analytics/reportupdater] - https://gerrit.wikimedia.org/r/537268 (https://phabricator.wikimedia.org/T204736) (owner: Elukey)
[15:56:04] (PS5) Fdans: (wip) Add oozie job to load top mediarequests data [analytics/refinery] - https://gerrit.wikimedia.org/r/538880 (https://phabricator.wikimedia.org/T233717)
[15:58:27] Analytics, Research: Parse wikidumps and extract redirect information for 1 small wiki, romanian - https://phabricator.wikimedia.org/T232123 (MGerlach) @JAllemandou Main problem: memory error for large (and even not super large) wikis such as frwiki. I implemented some of your suggestions from the discu...
[16:00:34] joal: memory error persists after implementing your suggestions. see the updated notebook for a minimal example. maybe you have any thoughts on the text processing
[16:00:56] mgerlach: will do
[16:01:18] (CR) Elukey: Move codebase to python3 (1 comment) [analytics/reportupdater] - https://gerrit.wikimedia.org/r/537268 (https://phabricator.wikimedia.org/T204736) (owner: Elukey)
[16:02:06] (PS6) Elukey: Move codebase to python3 [analytics/reportupdater] - https://gerrit.wikimedia.org/r/537268 (https://phabricator.wikimedia.org/T204736)
[16:17:37] Analytics, Analytics-Kanban: Alarming scripts for entropy alarms. Anomaly detection and reporting. - https://phabricator.wikimedia.org/T227357 (Ottomata)
[16:26:37] Analytics, Analytics-EventLogging, Analytics-Kanban: Move reportupdater reports that pull data from eventlogging mysql to pull data from hadoop - https://phabricator.wikimedia.org/T223414 (Ottomata) a: fdans→mforns
[16:29:05] (PS1) Elukey: Add the druid directory and store Druid Kafka supervisor json in it [analytics/refinery] - https://gerrit.wikimedia.org/r/538916 (https://phabricator.wikimedia.org/T229682)
[16:30:45] (PS2) Elukey: Add the druid directory and store Druid Kafka supervisor json in it [analytics/refinery] - https://gerrit.wikimedia.org/r/538916 (https://phabricator.wikimedia.org/T229682)
[16:38:24] created druid/kafka/etc..
[16:38:29] --^
[16:39:27] (CR) Joal: [C: +1] "Works for me! Thanks Luca :)" [analytics/refinery] - https://gerrit.wikimedia.org/r/538916 (https://phabricator.wikimedia.org/T229682) (owner: Elukey)
[16:39:35] \o/
[16:40:04] (CR) Elukey: [V: +2 C: +2] Add the druid directory and store Druid Kafka supervisor json in it [analytics/refinery] - https://gerrit.wikimedia.org/r/538916 (https://phabricator.wikimedia.org/T229682) (owner: Elukey)
[17:15:55] * elukey off!
[17:15:56] o/
[17:16:18] (PS6) Fdans: Add oozie job to load top mediarequests data [analytics/refinery] - https://gerrit.wikimedia.org/r/538880 (https://phabricator.wikimedia.org/T233717)
[17:17:12] (CR) Fdans: [V: +1] "Test data loaded correctly in cassandra. This change is ready for reviewing." [analytics/refinery] - https://gerrit.wikimedia.org/r/538880 (https://phabricator.wikimedia.org/T233717) (owner: Fdans)
[17:30:00] ottomata, I'm testing reportupdater in stat1007, and one dependency needs an upgrade (PyYAML), how should I do that?
[17:30:32] hmm
[17:30:49] i think we just use the default deb package
[17:30:52] why does it need an upgrade?
[17:31:05] it is the same on stat1006, ya?
[17:43:35] ottomata, it needs the upgrade because of the python3 migration
[17:43:51] ah you just need the python3 version?
[17:43:52] well, I'm not sure there's no way of doing it without changing...
[17:44:08] python3-yaml is installed
[17:44:11] version 3.12-1
[17:44:20] yea, the new code needs 5.1
[17:44:25] say whaaa!
[17:44:26] why?
[17:44:33] but maybe there's a way to do it without the upgrade
[17:44:51] why do packages need upgrading for the python3 upgrade?
[17:44:59] if we have the python3 versions of those packages already?
[17:45:12] because in python3 yaml.load() without a loader is deprecated, and yaml.FullLoader is only present in 5.1
[17:45:55] harhum
[17:45:56] weird.
[17:46:12] we don't have our own pyyaml deb
[17:46:13] sigh
[17:47:27] ottomata, maybe we can be "deprecatious" and not use yaml.FullLoader
[17:47:55] if there is some way to make 3.12 work it'd be nice. but if not i can try and create a .deb
[17:57:28] ottomata, ok, trying
[18:00:27] Analytics, Operations, hardware-requests, ops-eqiad, User-Elukey: Upgrade kafka-jumbo100[1-6] to 10G NICs (if possible) - https://phabricator.wikimedia.org/T220700 (ayounsi)
[18:04:40] mforns: heh, this does not look like an easy package to build...
[18:04:51] ottomata, wait I'm trying
[18:09:59] ottomata, if I go back to PyYAML 3.10 and remove the yaml.FullLoader (go back to the deprecated version) it works. The deprecation warning seems important, but maybe does not apply to our context: https://github.com/yaml/pyyaml/wiki/PyYAML-yaml.load(input)-Deprecation
[18:10:42] because we're parsing version-controlled code, no?
[18:11:04] mforns: can we just use safe_load
[18:11:05] ?
[18:11:10] no, same error
[18:11:25] PyYAML 3.10 does not implement safe_load
[18:14:30] hold on...i may be having success....
[18:14:32] building
[18:15:14] ack maybe not...
[18:16:40] RECOVERY - Check the last execution of camus-mediawiki_analytics_events on an-coord1001 is OK: OK: Status of the systemd unit camus-mediawiki_analytics_events https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[18:27:38] mforns: i think i got it....
[18:27:44] oh
[18:28:41] going to install on stat1007 for ya and you try?
[18:29:25] ok!
[18:29:26] mforns: wait a minute
[18:29:29] yaml.safe_load
[18:29:34] exists on pyyaml 3.12
[18:29:42] >>> import yaml
[18:29:42] >>> yaml.safe_load
[18:29:42] <function safe_load at 0x...>
[18:29:59] in 3.10?
[18:30:14] I'm running tests right now with safe_load
[18:30:16] stat1007
[18:30:18] dpkg -l python3-yaml
[18:30:20] 3.12-1
[18:30:25] oh ok
[18:30:36] >>> yaml.safe_load('\N{PILE OF POO}')
[18:30:36] '💩'
[18:30:48] I guess I just tried yaml.full_load()
[18:30:53] and it threw an error
[18:31:00] * joal is always pleased with ottomata's examples :)
[18:31:08] ok, using safe_load
[18:31:09] haha joal i got that from the debian source package!!!!
[18:31:10] testing
[18:31:13] hehe
[18:33:05] k brb
[18:39:43] Analytics, EventBus, WMF-JobQueue, Core Platform Team (Needs Cleaning - Security, stability, performance, and scalability (TEC1)), Wikimedia-production-error: EventBus error "Unable to deliver all events: (curl error: 28) Timeout was reached" - https://phabricator.wikimedia.org/T204183 (Pchelo...
[19:04:05] (back btw)
[19:30:17] ebernhardson: i think cirrussearch-request is catching up...but it will take a while...
[19:30:19] i'm watching it closely
[19:30:32] api-requests on the other hand...i don't think it will make it!
[19:30:35] good thing no one uses it
[19:30:40] ottomata: heh
[19:30:51] ottomata: does that also suggest cirrussearch-request is nearing some limit?
[19:31:08] well, before adding partitions, yes
[19:31:12] i upped it to 12 partitions today
[19:31:19] so the camus job can parallelize better
[19:31:31] so, basically the newly added partitions are great, but the old partitions still have to catch up?
[19:31:46] yeah, partition 0 still has days' worth of data in it
[19:32:06] a problem is that camus doesn't let me specify when to start from, and I had to make a new job to change the settings for this stuff
[19:32:15] so, it's either: skip it all and start from latest
[19:32:21] or reconsume from earliest and overwrite
[19:32:26] so, that's what's happening now
[19:32:30] PROBLEM - Check the last execution of camus-mediawiki_analytics_events on an-coord1001 is CRITICAL: CRITICAL: Status of the systemd unit camus-mediawiki_analytics_events https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[19:32:31] it's consuming the whole last week's worth of data
[19:33:06] ottomata: silly idea...but you could pause the consumer, and run a script that uses their group_id to read the partition and emit all the messages back into new partitions.
[19:33:08] it's up to midday sept 19 now, and seems to consume at about 2x run time, so if i let it run for 15 mins, it'll consume ~30 mins worth of data from partition 0
[19:33:15] the problem is...
[19:33:16] but if you don't need the api one, prob not worth the effort
[19:33:20] camus stores the offsets in hdfs
[19:33:21] not in kafka
[19:33:26] oh. heh
[19:33:32] in binary files
[19:33:39] so yes i could probably write some code to update it all
[19:33:46] but it would be pretty hacky!
:)
[19:34:29] i can live with the data being behind a few days
[19:34:39] camus is using a 0.7 or 0.8 kafka client; from before the time zookeeper was abstracted away from clients
[19:34:50] so they chose to keep the offsets in hdfs instead of zookeeper
[19:35:18] it's so old, but we haven't figured out how to replace it yet...mainly due to confluent licensing issues on their hdfs/hive connector for kafka connect
[19:35:41] sounds fun :(
[20:16:10] (PS7) Mforns: Move codebase to python3 [analytics/reportupdater] - https://gerrit.wikimedia.org/r/537268 (https://phabricator.wikimedia.org/T204736) (owner: Elukey)
[20:25:44] (CR) Mforns: [V: +2 C: +1] "After some testing with Andrew, we downgraded PyYAML back to >=3.10," [analytics/reportupdater] - https://gerrit.wikimedia.org/r/537268 (https://phabricator.wikimedia.org/T204736) (owner: Elukey)
[20:26:33] ok andrew, I tested that in stat1007 with real config and data, and all seems to run well. Pushed latest changes to patch. thanks for the help!
[20:26:58] I meant ottomata, sorry
[20:27:40] great!
[20:28:18] (CR) Nuria: [C: +2] Improve README (1 comment) [analytics/dashiki] - https://gerrit.wikimedia.org/r/538722 (owner: Srishakatux)
[20:36:12] Analytics, Analytics-Kanban, Patch-For-Review: High volume mediawiki analytics events camus import is lagging - https://phabricator.wikimedia.org/T233718 (Ottomata) I've temporarily stopped importing api-request data to help cirrussearch-request catch up. AFAIK, no one uses api-request data in Hive.
[20:40:46] Analytics, EventBus, Product-Analytics: Review draft Modern Event Platform schema guidelines - https://phabricator.wikimedia.org/T233329 (Ottomata) > So maybe that section could say something like [...] Nice I like it! Modified. > I don't think it's necessary to relegate the part on referencing to...
[21:49:04] RECOVERY - Check the last execution of camus-mediawiki_analytics_events on an-coord1001 is OK: OK: Status of the systemd unit camus-mediawiki_analytics_events https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
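A closing note on the PyYAML thread: the loader the team settles on, yaml.safe_load, exists both in the Debian-packaged 3.12 and in 5.x, and it sidesteps the yaml.load() deprecation along with the arbitrary-object construction it warns about. A minimal sketch of the resulting pattern; the config file name is hypothetical.

```python
import yaml

# safe_load behaves the same on the Debian-packaged PyYAML 3.12 and on
# >=5.1, and it only builds plain Python objects (dicts, lists, strings,
# numbers), which is all a version-controlled reportupdater config needs.
with open("config.yaml") as f:  # hypothetical config path
    config = yaml.safe_load(f)

print(config)
```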