[00:28:07] (PS1) Srishakatux: Improve README [analytics/dashiki] - https://gerrit.wikimedia.org/r/538722
[05:40:56] hello people
[05:41:17] so I noticed that the eventlogging consumer was not pushing any data to the m4 master
[05:41:20] https://grafana.wikimedia.org/d/000000505/eventlogging?panelId=12&fullscreen&orgId=1&from=now-7d&to=now-5m
[05:41:28] eventbus data is of course stopped
[05:41:44] but apparently on the 21st the other one stopped as well?
[05:42:56] Analytics, Analytics-Kanban: Move the Analytics Refinery to Python 3 - https://phabricator.wikimedia.org/T204735 (elukey)
[05:43:57] https://phabricator.wikimedia.org/T232349 - seems not ready yet
[05:43:58] mmmmm
[05:47:14] Analytics, LDAP-Access-Requests: log-in credential confusion for Hive - https://phabricator.wikimedia.org/T233648 (dchen) worked. tx @Ottomata
[05:57:00] Analytics, Analytics-EventLogging: Disable production EventLogging analytics MySQL consumers - https://phabricator.wikimedia.org/T232349 (elukey) Something strange happened today: I noticed in icinga that the eventlogging mysql insertion rate alarm was in UNKNOWN state, so I checked the graphs and the m4...
[06:07:33] !log update Druid Kafka supervisor for netflow to index new dimensions
[06:07:35] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[06:07:46] works!!
[06:08:27] so if you re-send a supervisor json spec and Druid sees that there is a version already running, it will take care of swapping the two automatically
[06:10:02] joal: bonjour :)
[06:10:29] Do you think that adding a "druid" directory in refinery to store supervisors (atm netflow and banner impressions) could be good?
[06:10:42] because the oozie dir doesn't fit well for netflow
[06:44:34] (PS4) Elukey: Move codebase to python3 [analytics/reportupdater] - https://gerrit.wikimedia.org/r/537268 (https://phabricator.wikimedia.org/T204736)
[06:47:57] E0738: The following 1 parameters are required but were not defined and no default values are available: hive2_jdbc_url
[06:48:01] buuuuu
[06:48:14] this is one of the leftovers from Luca
[06:48:21] that guy is really terrible
[06:50:08] (CR) Elukey: "Dan/Marcel: the code is ready for a first review pass, let me know your thoughts :)" [analytics/reportupdater] - https://gerrit.wikimedia.org/r/537268 (https://phabricator.wikimedia.org/T204736) (owner: Elukey)
[06:54:22] (PS1) Elukey: oozie: add jdbc_url to mediawiki history wikitext's coordinator [analytics/refinery] - https://gerrit.wikimedia.org/r/538744 (https://phabricator.wikimedia.org/T227257)
[06:58:17] Analytics, Patch-For-Review: Move Analytics Report Updater to Python 3 - https://phabricator.wikimedia.org/T204736 (elukey) https://gerrit.wikimedia.org/r/#/c/analytics/reportupdater/+/537268/ is ready for a first review, testing with real data has not been done yet. Jenkins already supports python3, it...
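A quick illustration of the supervisor swap elukey describes at [06:08:27]: updating a running Kafka ingestion amounts to re-POSTing the spec to the Overlord's supervisor endpoint, and Druid gracefully stops the old supervisor for that datasource and starts the new one. A minimal sketch; the Overlord host/port and the spec file name are placeholders, not the actual WMF setup.

```python
import json
import requests

OVERLORD = "http://druid-overlord.example.org:8090"  # placeholder host

# e.g. the netflow supervisor spec kept under refinery's druid/ directory
with open("netflow_supervisor.json") as f:
    spec = json.load(f)

# POSTing to the same endpoint used for the initial submission is also
# how you update: Druid detects a running supervisor for the datasource
# and swaps it out for the new spec.
resp = requests.post(f"{OVERLORD}/druid/indexer/v1/supervisor", json=spec)
resp.raise_for_status()
print(resp.json())  # {'id': '<datasource name>'} on success
```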
[06:58:34] Analytics, Analytics-Kanban, Patch-For-Review: Move Analytics Report Updater to Python 3 - https://phabricator.wikimedia.org/T204736 (elukey) a: elukey
[07:16:04] Analytics, Analytics-EventLogging, EventBus, Core Platform Team Legacy (Watching / External), and 2 others: RFC: Modern Event Platform: Schema Registry - https://phabricator.wikimedia.org/T201643 (awight)
[07:16:49] Analytics, Analytics-EventLogging, EventBus, Core Platform Team Legacy (Watching / External), and 2 others: RFC: Modern Event Platform: Schema Registry - https://phabricator.wikimedia.org/T201643 (awight) I added a negative, that multiple extensions cannot use the same schema even when at the sam...
[07:29:59] Hi elukey - I'm sorry I won't have time before this afternoon I think - Naé is still home
[07:30:27] Analytics: Install Debian Buster on Hadoop - https://phabricator.wikimedia.org/T231067 (MoritzMuehlenhoff)
[07:30:44] elukey: Will ping you when I have time - sorry for that :(
[07:31:58] joal: no no don't worry, I left messages for whenever you have time, nothing urgent :)
[08:39:49] also, great news - we have openjdk-8 for bustet!
[08:39:54] *buster \o/
[08:56:00] Analytics, Analytics-Cluster, Product-Analytics: Improve Hue user management - https://phabricator.wikimedia.org/T127850 (Neil_P._Quinn_WMF)
[09:11:07] hi team, starting a lil late today
[09:11:20] o/
[09:11:50] joal: my bad, in yesterday's patch to aqs I accidentally git-added the change of the field to int, when it should stay as a long
[09:11:57] I'll post a new patch now
[09:12:38] fdans: Joseph will start working later in the evening I think
[09:12:44] (if you need him now)
[09:13:23] thank youuu elukey
[09:17:21] (PS1) Fdans: Set mediarequests per referer requests field as long [analytics/aqs] - https://gerrit.wikimedia.org/r/538846 (https://phabricator.wikimedia.org/T233622)
[09:31:07] (CR) Fdans: Cast mediarequests value as int before submitting the response (1 comment) [analytics/aqs] - https://gerrit.wikimedia.org/r/538611 (https://phabricator.wikimedia.org/T233622) (owner: Fdans)
[09:46:26] need to go to the doctor, will take an early lunch break :)
[11:23:54] (CR) Joal: [C: +2] Cast mediarequests value as int before submitting the response (1 comment) [analytics/aqs] - https://gerrit.wikimedia.org/r/538611 (https://phabricator.wikimedia.org/T233622) (owner: Fdans)
[11:27:00] (CR) Joal: [C: +2] "Thanks for catching this nuria and fdans. Sorry to have missed it :(" [analytics/aqs] - https://gerrit.wikimedia.org/r/538846 (https://phabricator.wikimedia.org/T233622) (owner: Fdans)
[11:27:39] (Merged) jenkins-bot: Set mediarequests per referer requests field as long [analytics/aqs] - https://gerrit.wikimedia.org/r/538846 (https://phabricator.wikimedia.org/T233622) (owner: Fdans)
[11:53:02] thank you for merging joal :)
[11:53:22] np fdans - Sorry for not having seen the typing error :(
[11:53:58] joal: nono, the change didn't provide enough context for you to see it as a problem :) it was totally my bad
[11:54:24] fdans: I should actually have caught it in the previous review ;) anyway - Fixed for good :)
[11:54:30] Thanks for that
[12:12:53] joal: now here is a question, do we really need the top *1000* of all files?
[12:12:59] seems like a whole lot of files
[12:13:20] fdans: this number was decided long ago, for pageviews
[12:13:53] This seemed like a good compromise between people willing to read a few top values, and others interested in analysing more data
[12:14:04] And storage-size, obviously :)
[12:14:48] joal: hmmm I don't see it as super practical to have the top 1000 files because the aqs response would be huge
[12:15:03] file names are generally way longer than article names
[12:15:04] fdans: This is what we already do for pageviews though :)
[12:15:11] hm
[12:15:55] joal: i mean look at this:
[12:16:12] lol irccloud doesn't let me paste it
[12:16:26] https://pastebin.com/55vz7fYS
[12:17:22] joal: that's about 100kb for a single result set
[12:17:31] fdans: Interestingly this is because we don't extract the file name out of the URL as we do for pageviews :)
[12:18:03] joal: because the url of the pageview can be inferred from the project name
[12:18:07] this doesn't apply here
[12:18:31] hm
[12:19:02] joal: maybe we can keep the top 1000 in cassandra but only give the top 100 in aqs?
[12:19:18] btw
[12:19:49] the load query takes a ridiculous amount of time, I don't know if this happens with pageviews
[12:19:50] Analytics, LDAP-Access-Requests: log-in credential confusion for Hive - https://phabricator.wikimedia.org/T233648 (Neil_P._Quinn_WMF) Open→Resolved a: Neil_P._Quinn_WMF Thanks, @Ottomata! I improved the documentation so that information is now on [wikitech:Analytics/Cluster/Hue](https://wikite...
[12:21:07] fdans: this is very much expected yes
[12:22:08] fdans: I checked one day of top for en.wikipedia pageviews, result is 56kb - Doesn't seem an issue to me to be in the same size range for medias - We can discuss this at standup if you want
[12:22:41] fdans: loading a few hundred million rows in a db takes some time, even if the thing is fast as hell :)
[12:23:01] joal: I'm ok with it if you're ok :)
[12:23:17] won't be a performance problem in wikistats
[12:25:01] I'm going to test the whole job and see how we do
[12:25:20] k
[12:27:39] joal: oh wait the query took a million years because I tested the monthly one lol, nvm
[12:27:51] ?
[12:28:00] Ah - Monthly top?
[12:28:03] yes
[12:28:11] indeed, this one is longer :)
[12:30:06] Oh so I didn't properly get your previous point - top values are slow at the computation stage - And per-file values are slow at the load stage
[12:31:14] yesyes, for sure :)
[12:31:57] fdans: IIRC we were counting around 3 or 4h per loaded day when we backfilled pageviews
[12:33:25] joal: openjdk-8 deployed on stat1005 (buster) :)
[12:33:33] \o/ !!!
[12:33:49] elukey: I can also play with that for my tests with janusgraph
[12:34:18] that would be great thanks :)
[12:35:00] elukey: I'll need to bump a labs-instance to buster - not sure how to do that :(
[12:35:30] fdans: I guess you can guess the backfilling period: https://grafana.wikimedia.org/d/000000417/cassandra-system?panelId=14&fullscreen&orgId=1&var-datasource=eqiad%20prometheus%2Fops&var-cluster=aqs&var-server=All&var-disk=sda&var-disk=sdb&var-disk=sdc&var-disk=sdd&var-disk=sde&from=now-30d&to=now
[12:35:48] joal: here's a question I have for you: do you think that the total number of files in upload.wikimedia.org, which is a metric we don't have, would more or less coincide with the Pages to Date metric for commons.wikimedia?
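A hedged aside on the response-size numbers above (fdans' ~100kb paste, joal's 56kb for a day of en.wikipedia top pageviews): the existing public pageviews/top endpoint returns the same top-1000 shape being proposed, so the baseline can be measured directly. The mediarequests/top endpoint under discussion did not exist yet, which is why pageviews stands in for it here.

```python
import requests

# One day of top-1000 pageviews for en.wikipedia, via the public AQS API.
url = ("https://wikimedia.org/api/rest_v1/metrics/pageviews/top/"
       "en.wikipedia/all-access/2019/09/01")
resp = requests.get(url, headers={"User-Agent": "payload-size-check"})
resp.raise_for_status()

articles = resp.json()["items"][0]["articles"]
print(f"{len(articles)} entries, {len(resp.content) / 1024:.1f} kB raw")
# File names are typically longer than article titles, which is why the
# equivalent mediarequests payload lands closer to fdans' ~100 kB figure.
```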
[12:36:25] this feels like a good approximation to me fdans
[12:36:39] But there might be tricky aspects I don't know about
[12:36:50] like files created on their own wikis for instance?
[12:38:03] yea you're right, I wasn't thinking about that
[12:38:16] But I don't know much about those
[12:38:23] joal: probably a better approximation would be to count the number of pages in the File namespace
[12:38:41] Yes !
[12:50:11] tried a basic spark2 sql query against webrequest from stat1005, seems to work fine
[12:50:30] in theory if this works as expected we'll be able to release stat1005 to everybody
[12:50:36] together with the GPU
[12:50:39] \o/
[12:50:47] Hooray :)
[13:08:55] Analytics, Analytics-EventLogging: Disable production EventLogging analytics MySQL consumers - https://phabricator.wikimedia.org/T232349 (Ottomata) Very strange! Joseph and I did end up bouncing eventloggingctl stuff last Wednesday for deployment of ua-parser, but that doesn't seem related.
[13:09:53] Analytics, Services (watching): Create mediarequests top files AQS endpoint - https://phabricator.wikimedia.org/T233716 (fdans)
[13:10:43] Analytics, Services (watching): Add cassandra loading job for top mediarequests - https://phabricator.wikimedia.org/T233717 (fdans)
[13:10:45] mgerlach: hello!
[13:10:52] just checking in, how are you, how goes the system exploration?
[13:11:53] Oh by the way elukey - How do we handle the mediawiki-history-text stuff?
[13:12:02] (PS1) Fdans: (wip) Add oozie job to load top mediarequests data [analytics/refinery] - https://gerrit.wikimedia.org/r/538880 (https://phabricator.wikimedia.org/T233717)
[13:12:23] ottomata: hey! was gonna write you with a bunch of questions on the jupyter notebooks and swap and queries with spark.
[13:12:35] elukey: I suggest manually fixing this month (spark job ran successfully, which is the main thing) - and merging the patch
[13:12:53] joal: I am doing some tests on stat1005 so now it is borked, lemme know when you need it :) for mediawiki, I'd say that we can manually upload the coordinator's properties?
[13:13:05] ah okok then please do
[13:13:25] elukey: a manual fix should be better: no need to rerun the huge spark job
[13:13:35] And then deploy the patch (clean way :)
[13:13:37] mgerlach: cool ok!
[13:13:44] ottomata: 2 main points: 1] custom kernels on swap. 2] some help with memory issues in my spark-sql queries
[13:13:44] mgerlach: o/
[13:14:12] mgerlach: sorry for not being responsive :)
[13:14:32] joal: no worries, I saw you were busy :)
[13:15:48] ottomata: if you have a few minutes, we could also discuss briefly on hangouts
[13:16:10] joal: ack!
[13:16:25] joal: shall I merge the patch?
[13:16:34] (CR) Joal: [C: -1] "Let's also add the parameter in the coordinator.xml and workflow.xml parameters list, so that the thing fails fast if not set please." [analytics/refinery] - https://gerrit.wikimedia.org/r/538744 (https://phabricator.wikimedia.org/T227257) (owner: Elukey)
[13:16:40] nope :)
[13:16:58] elukey: let me know if my comment makes sense :)
[13:17:01] ahhh the parameters, adding them
[13:17:18] in theory it is required by a subworkflow, but I see the point
[13:17:51] elukey: if we only put the parameter as needed in the workflow, running the coordinator will succeed - and we don't want that
[13:18:35] joal: how can the coordinator succeed if one required subworkflow requires a parameter?
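On the approximation fdans and joal converge on at [12:38:23] (count File-namespace pages rather than using Pages to Date): a sketch of what that count might look like with Spark SQL. The wmf_raw.mediawiki_page table, the snapshot value, and the boolean page_is_redirect column are assumptions about the Data Lake layout, not details taken from the discussion.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("commons-file-count").getOrCreate()

# Namespace 6 is the File: namespace in MediaWiki; redirects are
# excluded so each row approximates one actual file description page.
count = spark.sql("""
    SELECT COUNT(*) AS file_pages
    FROM wmf_raw.mediawiki_page            -- assumed sqooped table name
    WHERE snapshot = '2019-08'             -- illustrative snapshot
      AND wiki_db = 'commonswiki'
      AND page_namespace = 6
      AND NOT page_is_redirect             -- assumed boolean column
""").first().file_pages

print(count)
```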
[13:18:38] elukey@stat1005:~$ java -version
[13:18:39] ottomata: --^
[13:18:40] :)
[13:18:43] Picked up JAVA_TOOL_OPTIONS: -Dfile.encoding=UTF-8
[13:18:45] openjdk version "1.8.0_222"
[13:18:53] !log Manually repairing wmf.mediawiki_wikitext_history
[13:18:55] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[13:19:16] Analytics, Analytics-Kanban: High volume mediawiki analytics events camus import is lagging - https://phabricator.wikimedia.org/T233718 (Ottomata)
[13:20:10] elukey: the coordinator doesn't miss any parameter, so it works and tries to run workflow instances - the workflow doesn't miss any parameter, so it waits for data availability, then starts, and finally fails when the missing parameter is actually detected - Exactly what happened this month
[13:20:30] Putting the parameters as mandatory at all levels enforces fail-fast
[13:21:19] makes sense elukey ?
[13:21:22] (PS2) Elukey: oozie: add jdbc_url to mediawiki history wikitext's coordinator [analytics/refinery] - https://gerrit.wikimedia.org/r/538744 (https://phabricator.wikimedia.org/T227257)
[13:21:37] joal: ah okok yes now it is clearer
[13:21:56] joal: hangout with martin or for something else?
[13:22:07] elukey: We would have found the missing parameter at launch time with th
[13:22:11] the patch you provide
[13:22:20] ottomata: sure
[13:22:23] * elukey nodes
[13:22:25] elukey: nice!
[13:22:27] *nods
[13:22:50] elukey: this is a graph movement :)
[13:22:54] * joal nodes as well
[13:22:56] all credits to Saint Moritz protector of the Analytics team (saint due to his patience :D)
[13:23:13] Ah men! -^
[13:23:34] cave ottomata / mgerlach, or somewhere else?
[13:24:02] i can come in 10 mins, let me get some camus fix stuff going
[13:24:09] sure
[13:24:18] (CR) Joal: [V: +2 C: +2] "Thanks elukey" [analytics/refinery] - https://gerrit.wikimedia.org/r/538744 (https://phabricator.wikimedia.org/T227257) (owner: Elukey)
[13:24:27] but i'm not sure if mgerlach was asking for a hangout now, if so i'm in but if not that's ok too!
[13:24:38] sure!
[13:24:47] ottomata joal: I can do now or in 10 mins
[13:24:51] :)
[13:24:58] looks like in 10 mins is the time
[13:25:04] ottomata: Please give the go
[13:25:17] k
[13:28:25] ottomata: about camus, I really think more topic-partitions is the solution
[13:28:56] ottomata: just sayin'
[13:28:57] :)
[13:29:40] yup
[13:29:43] except
[13:29:49] webrequest runs with as much data in one partition
[13:30:02] (verifying...)
[13:30:05] :)
[13:30:35] Analytics, EventBus, WMF-JobQueue, CPT Initiatives (Modern Event Platform (TEC2)), good first bug: EventBus extension must not send batches that are too large - https://phabricator.wikimedia.org/T232392 (Johan) I suspect this is the reason why MassMessage isn't delivering Tech News to most su...
[13:30:44] hmm no i think i'm wrong
[13:30:48] ottomata: the 10 mins thing also helps i guess (more frequent runs, more in-memory)
[13:30:53] it looks like webrequest maxes around 2.5K per partition
[13:31:02] and cirrus?
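To make joal's fail-fast point at [13:20:30] concrete: the E0738 error quoted earlier ([06:47:57]) is what Oozie raises at submission time when a property declared in the parameters block has neither a value nor a default, so declaring hive2_jdbc_url at both the coordinator and workflow level turns a slow runtime failure into an immediate one. A hypothetical excerpt:

```xml
<!-- Hypothetical excerpt from a coordinator.xml / workflow.xml.
     With no <value> default, Oozie rejects a submission that does not
     define hive2_jdbc_url (error E0738), instead of letting workflow
     instances wait for data and fail hours later in a sub-workflow. -->
<parameters>
    <property>
        <name>hive2_jdbc_url</name>
    </property>
</parameters>
```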
[13:31:11] api-request is 7.5K
[13:31:19] cirrus is 2.5K
[13:31:36] ottomata: we bumped webrequest partitions a while back because of something similar IIRC
[13:31:45] aye
[13:31:51] it's weird that camus does speculative execution
[13:31:57] we should probably turn that off
[13:31:59] https://yarn.wikimedia.org/proxy/application_1564562750409_217973/mapreduce/attempts/job_1564562750409_217973/m/RUNNING
[13:32:03] ottomata: the number of mappers won't actually change anything here, as a single mapper per part is assigned
[13:32:07] it is just reading the same data from kafka twice!
[13:32:12] yes
[13:32:17] but before yesterday
[13:32:17] PROBLEM - Check the last execution of camus-mediawiki_analytics_events on an-coord1001 is CRITICAL: CRITICAL: Status of the systemd unit camus-mediawiki_analytics_events https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[13:32:27] the one mediawiki_events job was configured with 10 mappers only
[13:32:33] and it was doing all the other topics too
[13:32:36] yeah yeah I have read that
[13:32:38] that was about 34 partitions
[13:32:40] aye
[13:32:47] but yea here increasing mappers won't help until we add partitions ya
[13:32:47] bumping for those is needed,
[13:33:01] for cirrus+api, we need more parts and more mappers
[13:33:06] ^^^ is me, i'm running it manually
[13:33:09] aye
[13:33:11] k
[13:33:20] ottomata: I don't get your point about the yarn job
[13:33:34] joal: see how there are 4 mappers active?
[13:33:40] there are 2 for the exact same partition
[13:33:42] e.g.
[13:33:49] eqiad.mediawiki.api-request:1002:0
[13:33:56] eqiad.mediawiki.api-request:1002:0
[13:33:56] Ok I get it
[13:33:58] because
[13:34:01] speculative execution!
[13:34:12] so those are both consuming the same partition data from kafka
[13:34:15] writing to a temp dir
[13:34:19] and whichever finishes first gets kept
[13:34:42] ottomata: I was about to tell that, but preferred to check first :)
[13:34:51] We should configure camus not to speculate ;)
[13:34:54] hm actually, i'm going to kill this job and try
[13:34:58] k
[13:35:31] ottomata: In job conf: mapreduce.map.speculative = true
[13:35:37] We should change that
[13:37:29] ok, that's better
[13:37:30] https://yarn.wikimedia.org/proxy/application_1564562750409_218010/mapreduce/attempts/job_1564562750409_218010/m/RUNNING
[13:37:36] will do that for all camus jobs
[13:38:19] ACKNOWLEDGEMENT - Check the last execution of camus-mediawiki_analytics_events on an-coord1001 is CRITICAL: CRITICAL: Status of the systemd unit camus-mediawiki_analytics_events ottomata manually running a camus job https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[13:38:23] ok mgerlach joal
[13:38:31] https://meet.google.com/rxb-bjxn-nip
[14:00:43] heads up: I am merging ferm rules for matomo and analytics meta
[14:00:46] to allow backups
[14:00:53] let me know if you see any weird error :)
[14:05:55] joal: does this error ring a bell at all? It's why my loading job is failing and it isn't very descriptive.
Seems like a classpath situation:
[14:05:58] https://www.irccloud.com/pastebin/YTqtd8lM/
[14:12:18] hey all :]
[14:13:27] (PS2) Fdans: (wip) Add oozie job to load top mediarequests data [analytics/refinery] - https://gerrit.wikimedia.org/r/538880 (https://phabricator.wikimedia.org/T233717)
[14:14:55] a-team: there will be a test of db backup for analytics meta and matomo, it shouldn't create impact but let me know if you see anything weird
[14:15:30] elukey: yeehaw
[14:15:33] k
[14:16:28] Analytics, Analytics-Kanban, Research-Backlog, Patch-For-Review: Release edit data lake data as a public json dump / mysql dump, other? - https://phabricator.wikimedia.org/T208612 (mforns) Cool! Thanks @Bstorm
[14:29:24] thanks team for taking care of the alarms on my ops week
[14:39:05] fdans: nope, I don't know that
[14:39:16] fdans: feels like a misconfiguration
[14:39:43] joal: I suspect it's something about using the latest cassandra jar, I'm trying with the same one as top pageviews
[14:40:02] feels bizarre as nothing changed normally
[14:40:57] joal: who knows, top pageviews is using 0.0.35
[14:45:16] ok the backup didn't work
[14:45:21] ottomata, mgerlach - I'm back - anything I can help with?
[14:45:34] since a consistent backup needs a ton of locking and it can't be done in our use cases
[14:46:12] !log temporarily disabled camus-mediawiki_analytics_events systemd timer on an-coord1001 - T233718
[14:46:16] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[14:46:16] T233718: High volume mediawiki analytics events camus import is lagging - https://phabricator.wikimedia.org/T233718
[14:46:27] elukey: joal, any objections to me adding partitions to those high volume topics soon?
[14:46:31] i'll do it on both main and jumbo
[14:46:38] to keep the partitions the same on both clusters
[14:46:40] works for me ottomata
[14:46:51] Pchelolo: ?
[14:46:54] ottomata: this should allow faster recovery
[14:46:59] if need be i could just do it on jumbo!
[14:47:08] actually i think Pchelolo is out for a bit, will ask on task
[14:47:35] non from me
[14:47:38] *none
[14:47:57] OH
[14:47:59] these aren't on kafka main
[14:48:00] duh
[14:48:09] just on jumbo, they are produced via eventgate-analytics
[14:48:11] nm Pchelolo !
[14:48:33] joal: still trying to figure out the memory issue. perhaps I try the things you suggested and see if the problem persists. if yes, I will ping you
[14:48:55] joal: or do you already have a hunch what the problem could be?
[14:50:33] mgerlach: I think it's related to parquet reading big files - But I'd rather triple check that
[14:50:43] mgerlach: I'll try your code and see if I can replicate
[14:52:43] Analytics, Analytics-Kanban: High volume mediawiki analytics events camus import is lagging - https://phabricator.wikimedia.org/T233718 (Ottomata) Increased partitions to 12 for api-request and cirrussearch-request topics: ` kafka topics --alter --topic codfw.mediawiki.cirrussearch-request --partitions
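Back to the Camus speculative-execution fix from earlier ([13:35:31], "In job conf: mapreduce.map.speculative = true ... will do that for all camus jobs"): since each Camus mapper owns exactly one Kafka topic-partition, a speculative duplicate attempt just re-reads the same partition into a temp dir. A hypothetical sketch of the change, assuming the job's .properties file is passed through to the Hadoop job configuration:

```properties
# Hypothetical excerpt from a camus job .properties file.
# Each mapper already owns a single topic-partition, so a speculative
# duplicate attempt only re-consumes identical data from Kafka.
mapreduce.map.speculative=false
mapreduce.reduce.speculative=false
```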
[14:52:48] !Increased partitions to 12 for api-request and cirrussearch-request topics - T233718
[14:52:48] T233718: High volume mediawiki analytics events camus import is lagging - https://phabricator.wikimedia.org/T233718
[14:55:16] (PS3) Fdans: (wip) Add oozie job to load top mediarequests data [analytics/refinery] - https://gerrit.wikimedia.org/r/538880 (https://phabricator.wikimedia.org/T233717)
[15:14:18] Analytics, Analytics-Kanban: Add more dimensions to netflow's druid ingestion specs - https://phabricator.wikimedia.org/T229682 (ayounsi) Confirmed working as expected. I manually checked the "null" country's IPs and they match "Anonymous proxies".
[15:15:26] (CR) Mforns: [C: +1] "I went over the code, and all looks good!" (13 comments) [analytics/reportupdater] - https://gerrit.wikimedia.org/r/537268 (https://phabricator.wikimedia.org/T204736) (owner: Elukey)
[15:16:50] Analytics, Analytics-Kanban: Add more dimensions to netflow's druid ingestion specs - https://phabricator.wikimedia.org/T229682 (mforns) Great!
[15:18:50] (CR) Elukey: Move codebase to python3 (1 comment) [analytics/reportupdater] - https://gerrit.wikimedia.org/r/537268 (https://phabricator.wikimedia.org/T204736) (owner: Elukey)
[15:20:48] (PS5) Elukey: Move codebase to python3 [analytics/reportupdater] - https://gerrit.wikimedia.org/r/537268 (https://phabricator.wikimedia.org/T204736)
[15:20:59] Analytics, Discovery-Search (Current work): Recreate mjolnir.msearch-prod-request topic in kafka-jumbo - https://phabricator.wikimedia.org/T233731 (EBernhardson)
[15:21:37] ebernhardson: you want it just dropped and recreated?
[15:21:40] mforns: thanks for the review! No rush on merging etc., just wanted to kick off the migration review/process/test to catch any surprise early :)
[15:21:47] ottomata: or changed? either way. The historical info there isn't important
[15:22:22] i don't think we can decrease so easily, so it'd have to be dropped i guess
[15:22:32] you don't want to just increase the # of consumers :)
[15:22:32] ?
[15:22:45] ottomata: we run one consumer per elasticsearch host
[15:22:51] ottomata: so, 30 hosts in codfw, 30 consumers
[15:22:55] k
[15:25:44] stepping afk for ~10 mins
[15:25:46] ebernhardson: you want me to do that now?
[15:25:51] ottomata: sure
[15:25:55] oook!
[15:25:58] ottomata: thanks
[15:27:01] elukey, no no, thanks for pushing this forward
[15:27:20] ebernhardson: you want 25?
[15:27:36] you just want it slightly less than the total # of consumers?
[15:28:14] ottomata: i'm randomly guessing with 25 :) it seems like we probably won't lose 5 servers from the cluster for any period of time
[15:29:06] ottomata: so, basically yes. I might have to re-tune the thread pools on the consumer side as well, will see.
[15:29:11] Analytics, Discovery-Search (Current work): Recreate mjolnir.msearch-prod-request topic in kafka-jumbo - https://phabricator.wikimedia.org/T233731 (Ottomata) Done. ` kafka topics --delete --topic mjolnir.msearch-prod-request kafka topics --create --topic mjolnir.msearch-prod-request --partitions 25...
[15:29:27] Analytics, Discovery-Search (Current work): Recreate mjolnir.msearch-prod-request topic in kafka-jumbo - https://phabricator.wikimedia.org/T233731 (Ottomata) Open→Resolved a: Ottomata
[15:35:45] Analytics, Analytics-EventLogging, Analytics-Kanban, EventBus, and 2 others: Figure out how to $ref common schema across schema repositories - https://phabricator.wikimedia.org/T233432 (Ottomata)
[15:35:48] (PS4) Fdans: (wip) Add oozie job to load top mediarequests data [analytics/refinery] - https://gerrit.wikimedia.org/r/538880 (https://phabricator.wikimedia.org/T233717)
[15:39:30] Analytics, Analytics-EventLogging, Analytics-Kanban, EventBus, and 2 others: Figure out how to $ref common schema across schema repositories - https://phabricator.wikimedia.org/T233432 (Ottomata) Added another idea to the task description: **npm dependency**. This would be effectively the same a...
[15:51:05] (CR) Mforns: Move codebase to python3 (1 comment) [analytics/reportupdater] - https://gerrit.wikimedia.org/r/537268 (https://phabricator.wikimedia.org/T204736) (owner: Elukey)
[15:56:04] (PS5) Fdans: (wip) Add oozie job to load top mediarequests data [analytics/refinery] - https://gerrit.wikimedia.org/r/538880 (https://phabricator.wikimedia.org/T233717)
[15:58:27] Analytics, Research: Parse wikidumps and extract redirect information for 1 small wiki, romanian - https://phabricator.wikimedia.org/T232123 (MGerlach) @JAllemandou Main problem: memory error for large (and even not super large) wikis such as frwiki. I implemented some of your suggestions from the discu...
[16:00:34] joal: memory error persists after implementing your suggestions. see the updated notebook for a minimal example. maybe you have any thoughts on the text processing
[16:00:56] mgerlach: will do
[16:01:18] (CR) Elukey: Move codebase to python3 (1 comment) [analytics/reportupdater] - https://gerrit.wikimedia.org/r/537268 (https://phabricator.wikimedia.org/T204736) (owner: Elukey)
[16:02:06] (PS6) Elukey: Move codebase to python3 [analytics/reportupdater] - https://gerrit.wikimedia.org/r/537268 (https://phabricator.wikimedia.org/T204736)
[16:17:37] Analytics, Analytics-Kanban: Alarming scripts for entropy alarms. Anomaly detection and reporting. - https://phabricator.wikimedia.org/T227357 (Ottomata)
[16:26:37] Analytics, Analytics-EventLogging, Analytics-Kanban: Move reportupdater reports that pull data from eventlogging mysql to pull data from hadoop - https://phabricator.wikimedia.org/T223414 (Ottomata) a: fdans→mforns
[16:29:05] (PS1) Elukey: Add the druid directory and store Druid Kafka supervisor json in it [analytics/refinery] - https://gerrit.wikimedia.org/r/538916 (https://phabricator.wikimedia.org/T229682)
[16:30:45] (PS2) Elukey: Add the druid directory and store Druid Kafka supervisor json in it [analytics/refinery] - https://gerrit.wikimedia.org/r/538916 (https://phabricator.wikimedia.org/T229682)
[16:38:24] created druid/kafka/etc..
[16:38:29] --^
[16:39:27] (CR) Joal: [C: +1] "Works for me! Thanks Luca :)" [analytics/refinery] - https://gerrit.wikimedia.org/r/538916 (https://phabricator.wikimedia.org/T229682) (owner: Elukey)
[16:39:35] \o/
[16:40:04] (CR) Elukey: [V: +2 C: +2] Add the druid directory and store Druid Kafka supervisor json in it [analytics/refinery] - https://gerrit.wikimedia.org/r/538916 (https://phabricator.wikimedia.org/T229682) (owner: Elukey)
[17:15:55] * elukey off!
[17:15:56] o/
[17:16:18] (PS6) Fdans: Add oozie job to load top mediarequests data [analytics/refinery] - https://gerrit.wikimedia.org/r/538880 (https://phabricator.wikimedia.org/T233717)
[17:17:12] (CR) Fdans: [V: +1] "Test data loaded correctly in cassandra. This change is ready for reviewing." [analytics/refinery] - https://gerrit.wikimedia.org/r/538880 (https://phabricator.wikimedia.org/T233717) (owner: Fdans)
[17:30:00] ottomata, I'm testing reportupdater in stat1007, and one dependency needs an upgrade (PyYAML), how should I do that?
[17:30:32] hmm
[17:30:49] i think we just use the default deb package
[17:30:52] why does it need an upgrade?
[17:31:05] it is the same on stat1006, ya?
[17:43:35] ottomata, it needs the upgrade because of the python3 migration
[17:43:51] ah you just need the python3 version?
[17:43:52] well, I'm not sure there's no way of doing it without changing...
[17:44:08] python3-yaml is installed
[17:44:11] version 3.12-1
[17:44:20] yea, the new code needs 5.1
[17:44:25] say whaaa!
[17:44:26] why?
[17:44:33] but maybe there's a way to do it without the upgrade
[17:44:51] why do packages need upgrading for the python3 upgrade?
[17:44:59] if we have the python3 versions of those packages already?
[17:45:12] because in python3 yaml.load() without a loader is deprecated, and yaml.FullLoader is only present in 5.1
[17:45:55] harhum
[17:45:56] weird.
[17:46:12] we don't have our own pyyaml deb
[17:46:13] sigh
[17:47:27] ottomata, maybe we can be "deprecatious" and not use yaml.FullLoader
[17:47:55] if there is some way to make 3.12 work it'd be nice. but if not i can try and create a .deb
[17:57:28] ottomata, ok, trying
[18:00:27] Analytics, Operations, hardware-requests, ops-eqiad, User-Elukey: Upgrade kafka-jumbo100[1-6] to 10G NICs (if possible) - https://phabricator.wikimedia.org/T220700 (ayounsi)
[18:04:40] mforns: heh, this does not look like an easy package to build...
[18:04:51] ottomata, wait I'm trying
[18:09:59] ottomata, if I go back to PyYAML 3.10 and remove the yaml.FullLoader (go back to the deprecated version) it works. The deprecation warning seems important, but maybe does not apply to our context: https://github.com/yaml/pyyaml/wiki/PyYAML-yaml.load(input)-Deprecation
[18:10:42] because we're parsing version-controlled code, no?
[18:11:04] mforns: can we just use safe_load
[18:11:05] ?
[18:11:10] no, same error
[18:11:25] PyYAML 3.10 does not implement safe_load
[18:14:30] hold on...i may be having success....
[18:14:32] building
[18:15:14] ack maybe not...
[18:16:40] RECOVERY - Check the last execution of camus-mediawiki_analytics_events on an-coord1001 is OK: OK: Status of the systemd unit camus-mediawiki_analytics_events https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[18:27:38] mforns: i think i got it....
[18:27:44] oh
[18:28:41] going to install on stat1007 for ya and you try?
[18:29:25] ok!
[18:29:26] mforns: wait a minute
[18:29:29] yaml.safe_load
[18:29:34] exists on pyyaml 3.12
[18:29:42] >>> import yaml
[18:29:42] >>> yaml.safe_load
[18:29:42] <function safe_load at 0x...>
[18:29:59] in 3.10?
[18:30:14] I'm running tests right now with safe_load
[18:30:16] stat1007
[18:30:18] dpkg -l python3-yaml
[18:30:20] 3.12-1
[18:30:25] oh ok
[18:30:36] >>> yaml.safe_load('\N{PILE OF POO}')
[18:30:36] '💩'
[18:30:48] I guess I just tried yaml.full_load()
[18:30:53] and it threw an error
[18:31:00] * joal is always pleased with ottomata's examples :)
[18:31:08] ok, using safe_load
[18:31:09] haha joal i got that from the debian source package!!!!
[18:31:10] testing
[18:31:13] hehe
[18:33:05] k brb
[18:39:43] Analytics, EventBus, WMF-JobQueue, Core Platform Team (Needs Cleaning - Security, stability, performance, and scalability (TEC1)), Wikimedia-production-error: EventBus error "Unable to deliver all events: (curl error: 28) Timeout was reached" - https://phabricator.wikimedia.org/T204183 (Pchelo...
[19:04:05] (back btw)
[19:30:17] ebernhardson: i think cirrussearch-request is catching up...but it will take a while...
[19:30:19] i'm watching it closely
[19:30:32] api-requests on the other hand...i don't think it will make it!
[19:30:35] good thing no one uses it
[19:30:40] ottomata: heh
[19:30:51] ottomata: does that also suggest cirrussearch-request is nearing some limit?
[19:31:08] well, before adding partitions, yes
[19:31:12] i upped it to 12 partitions today
[19:31:19] so the camus job can parallelize better
[19:31:31] so, basically the newly added partitions are great, but the old partitions still have to catch up?
[19:31:46] yeah, partition 0 still has days' worth of data in it
[19:32:06] a problem is that camus doesn't let me specify when to start from, and I had to make a new job to change the settings for this stuff
[19:32:15] so, it's either: skip it all and start from latest
[19:32:21] or reconsume from earliest and overwrite
[19:32:26] so, that's what's happening now
[19:32:30] PROBLEM - Check the last execution of camus-mediawiki_analytics_events on an-coord1001 is CRITICAL: CRITICAL: Status of the systemd unit camus-mediawiki_analytics_events https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[19:32:31] it's consuming the whole last week's worth of data
[19:33:06] ottomata: silly idea...but you could pause the consumer, and run a script that uses their group_id to read the partition and emit all the messages back into new partitions.
[19:33:08] it's up to midday sept 19 now, and seems to consume at about 2x run time, so if i let it run for 15 mins, it'll consume ~30 mins worth of data from partition 0
[19:33:15] the problem is...
[19:33:16] but if you don't need the api one, prob not worth the effort
[19:33:20] camus stores the offsets in hdfs
[19:33:21] not in kafka
[19:33:26] oh. heh
[19:33:32] in binary files
[19:33:39] so yes i could probably write some code to update it all
[19:33:46] but it would be pretty hacky!
:)
[19:34:29] i can live with the data being behind a few days
[19:34:39] camus is using a 0.7 or 0.8 kafka client; from before the time zookeeper was abstracted away from clients
[19:34:50] so they chose to keep the offsets in hdfs instead of zookeeper
[19:35:18] it's so old, but we haven't figured out how to replace it yet...mainly due to confluent licensing issues on their hdfs/hive connector for kafka connect
[19:35:41] sounds fun :(
[20:16:10] (PS7) Mforns: Move codebase to python3 [analytics/reportupdater] - https://gerrit.wikimedia.org/r/537268 (https://phabricator.wikimedia.org/T204736) (owner: Elukey)
[20:25:44] (CR) Mforns: [V: +2 C: +1] "After some testing with Andrew, we downgraded PyYAML back to >=3.10," [analytics/reportupdater] - https://gerrit.wikimedia.org/r/537268 (https://phabricator.wikimedia.org/T204736) (owner: Elukey)
[20:26:33] ok andrew, I tested that in stat1007 with real config and data, and all seems to run well. Pushed latest changes to patch. thanks for the help!
[20:26:58] I meant ottomata, sorry
[20:27:40] great!
[20:28:18] (CR) Nuria: [C: +2] Improve README (1 comment) [analytics/dashiki] - https://gerrit.wikimedia.org/r/538722 (owner: Srishakatux)
[20:36:12] Analytics, Analytics-Kanban, Patch-For-Review: High volume mediawiki analytics events camus import is lagging - https://phabricator.wikimedia.org/T233718 (Ottomata) I've temporarily stopped importing api-request data to help cirrussearch-request catch up. AFAIK, no one uses api-request data in Hive.
[20:40:46] Analytics, EventBus, Product-Analytics: Review draft Modern Event Platform schema guidelines - https://phabricator.wikimedia.org/T233329 (Ottomata) > So maybe that section could say something like [...] Nice I like it! Modified. > I don't think it's necessary to relegate the part on referencing to...
[21:49:04] RECOVERY - Check the last execution of camus-mediawiki_analytics_events on an-coord1001 is OK: OK: Status of the systemd unit camus-mediawiki_analytics_events https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
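A closing note on the PyYAML thread: the loader the team settles on, yaml.safe_load, exists both in the Debian-packaged 3.12 and in 5.x, and it sidesteps the yaml.load() deprecation along with the arbitrary-object construction it warns about. A minimal sketch of the resulting pattern; the config file name is hypothetical.

```python
import yaml

# safe_load behaves the same on the Debian-packaged PyYAML 3.12 and on
# >=5.1, and it only builds plain Python objects (dicts, lists, strings,
# numbers), which is all a version-controlled reportupdater config needs.
with open("config.yaml") as f:  # hypothetical config path
    config = yaml.safe_load(f)

print(config)
```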