[01:09:04] (03PS8) 10Nuria: Memoizing results of state functions [analytics/wikistats2] - 10https://gerrit.wikimedia.org/r/468205 (https://phabricator.wikimedia.org/T207352) [01:10:35] (03CR) 10Nuria: "Sorry about patch #7. Note to self: do not push to gerrit when you are sick." [analytics/wikistats2] - 10https://gerrit.wikimedia.org/r/468205 (https://phabricator.wikimedia.org/T207352) (owner: 10Nuria) [07:02:00] 10Analytics, 10Analytics-Kanban, 10Patch-For-Review, 10User-Elukey: JVM pauses cause Yarn master to failover - https://phabricator.wikimedia.org/T206943 (10elukey) Logs for the minute 11:26 on zookeeper nodes: conf1004: ``` elukey@conf1004:~$ sudo grep "11:26" /var/log/zookeeper/zookeeper.log 2018-10-28... [07:31:07] morning! [07:31:14] so ---^ is kind of a mistery [07:31:35] I am leaning towards something related to kernel + io + disk controller [07:32:17] but it is difficult to prove, no breadcrumbs in syslog/kern.log/etc.. and nothing in metrics [07:36:01] wild [08:06:47] Hi elukey [08:07:09] Bonjour! [08:07:13] groceryheist: I wonder about your usage of HDFS fuse mount point [08:07:42] elukey: Please let me know if you want any help on trying to tame the hadoop-master hiccup [08:08:32] joal: I have no idea now, trying to see if any breadcrumb is there among metrics etc.. but it doesn't seem so [08:08:57] mwarf - a hidden hiccup [08:09:46] I started a tmux with sar -d -p 2 to have a very granular view of what happens to io latency, but I need to wait until another stall happens [08:11:33] the theory about the os causing the stall seems something plausible, but usually (from what I can read) it is caused by other processes hogging disk io [08:11:37] causing starvation [08:12:06] but there's nothing in syslog that can explain this [08:12:09] or in kern.log [08:12:22] the stall is relatively "small", like 8s [08:15:05] I am inclined to bump yarn.resourcemanager.zk-timeout-ms to something like 15/20s [08:15:46] joal: curious, why? I'm using spark from swap. [08:16:43] groceryheist: IMO direct access through hdfs libs are less error prone [08:17:02] particularly if you use spark: direct hdfs access is usually a lot more efficient [08:17:43] elukey: I'm fine with bumpling the timeout [08:18:42] elukey: Could it an explicit lock due to for instance a log roate compaction or something similar (sorry for silly question, trying to think out of the box) [08:19:39] joal: nono please, more thinking usually brings more ideas.. :) [08:21:06] I think that yarn manages its logs via log4j or similar (we don't deploy logrotate configs) [08:23:02] ah. I didn't mount HDFS on the notebook machine. The notebook kernels just set up the connection. I assume they are doing the right thing. [08:29:29] groceryheist: I was actually talking about usage, not the mounting :) [08:37:05] ah. this is going a bit over my head. how do you recommend writing data to hdfs through spark? 
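A minimal sketch of what "direct HDFS access from Spark" looks like in practice, for the question just above. It uses the PySpark DataFrame API; the database, table, and output path are placeholders, not the actual notebook code:
```
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("hdfs-write-example").getOrCreate()

# Build or load a DataFrame however you normally would; this query is only a placeholder.
df = spark.sql("SELECT page_id, view_count FROM some_db.some_table WHERE snapshot = '2018-10'")

# The DataFrame writer talks to HDFS through the Hadoop client libraries directly,
# so nothing has to pass through a local or fuse-mounted filesystem.
df.write.mode("overwrite").parquet("hdfs:///user/nathante/example_output")

# Reading it back is symmetric:
spark.read.parquet("hdfs:///user/nathante/example_output").show(5)
```
Because the writer goes through the HDFS client, this is the "direct hdfs access" joal recommends above as both less error prone and more efficient than the fuse mount point.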
[08:41:09] ok I have to sleep [08:53:51] groceryheist: Let's talk either tonight (my time) or tomorrow night (morning for you I think) [08:56:14] !log bounce yarn resource managers to pick up new zookeeper session timeout settings [08:56:15] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [08:56:27] Thanks for that elukey --^ [08:56:34] elukey: I hope it'll help :( [08:57:05] I hope so, it will not solve the problem though :( [09:00:54] 2018-10-29 09:00:09,335 INFO org.apache.zookeeper.ClientCnxn: Session establishment complete on server conf1004.eqiad.wmnet/2620:0:861:101:10:64:0:23:2181, sessionid = 0x50664009c3b103bc, negotiated timeout = 20000 [09:01:04] good :) [09:03:53] done! [09:53:31] fdans: you there? [09:58:04] (03PS1) 10John Erling Blad: Wikistats2: Added headless testing [analytics/wikistats2] - 10https://gerrit.wikimedia.org/r/470344 [10:06:54] PROBLEM - Check the last execution of check_webrequest_partitions on an-coord1001 is CRITICAL: CRITICAL: Status of the systemd unit check_webrequest_partitions [10:11:15] joal: doing anything with webrequest? [10:12:02] I've just re-run hour 5 in text, failed but no logs available from hue, let'see [10:13:05] maybe we are just a bit behind the schedule [10:13:58] still processing 2018-10-29T04 [10:14:56] hour 4 is running refine now [10:15:14] (for upload) [10:15:15] https://yarn.wikimedia.org/proxy/application_1540747790951_1500 [10:15:39] Elapsed:4hrs, 48mins, 41sec [10:15:47] and 1 mapper running [10:15:48] whatt [10:17:46] ah joal now I get what you were saying before [10:18:01] what a huge create table [10:20:31] I would be inclined to kill it [10:20:45] and follow up via email [10:21:35] +1 elukey [10:22:02] I didn't kill it earlier to try to let it finish - but it's too long and not making progress [10:22:06] Killing it [10:23:11] ack [10:23:25] !log Kill yarn application application_1540747790951_1429 to prevent more cluster errors (eating too many resources) [10:23:26] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [10:23:37] done [10:23:48] nice [10:24:00] webrequest refine running with 1 mapper is kinda crazy [10:24:01] :D [10:24:10] elukey: And the job was only at stage-1 of a complicated query :( [10:24:15] :) [10:24:38] I am pretty sure it wasn't intended, but our jobs have the priority if they are heavily impacted [10:25:10] Hello, I'll give 1 CPU and 2gb ram to work 100gb [10:25:20] elukey: indeed [10:25:35] elukey: I actually think we should allow preemption for prod [10:26:00] it would almost never be used, but in that example it would have prevented the issue [10:26:12] makes sense joal, let's open a task? 
[10:31:35] elukey: sorryyy just seen the notification [10:31:47] I'll take a luca at your patch right now :) [10:32:40] hey Fran no hurry :) [10:32:43] whenever you have time [10:33:00] I was reviewing the code and I didn't recall why /srv/geoip was not chosen [10:33:27] puppet is a bit stupid and does a full scan of /usr/share/GeoIP when it has to ensure it [10:33:33] including the archive dirs [10:33:47] and now it takes ~5 mins for a puppet run to complete [10:33:52] but it'll get worse over time [10:34:09] so my idea is to move 'archive' away fron /usr/share/GeoIP [10:34:29] (03PS7) 10Joal: Add WebrequestSubsetPartitioner spark job [analytics/refinery/source] - 10https://gerrit.wikimedia.org/r/468322 (https://phabricator.wikimedia.org/T164020) [10:35:09] fdans: I love that one: I'll take a luca't your patch :) [10:35:17] awesome [10:35:31] elukey: Opening a task for preemption :) [10:36:16] * elukey likes the lucat's [10:36:25] err luca't [10:36:26] :P [10:38:38] 10Analytics: Allow hadoop prod jobs to preempt resource over default queue - https://phabricator.wikimedia.org/T208208 (10JAllemandou) [10:39:11] elukey: --^ [10:39:34] ack! Sent an email as follow up for the team [10:43:40] Thanksel [10:43:47] Thanks elukey - sorry [10:51:23] 10Analytics, 10Contributors-Analysis, 10Product-Analytics: Set up automated email to report completion of mediawiki_history snapshot and Druid loading - https://phabricator.wikimedia.org/T206894 (10JAllemandou) If we want an email to be sent once data is available on the cluster, no need to create a new oozi... [11:05:43] fdans: stat1007 -> Notice: Applied catalog in 40.39 seconds [11:05:45] \o/ [11:06:02] wooo blaaaazing elukey :D [11:06:41] 10Analytics, 10Analytics-Kanban, 10Patch-For-Review: Geoip data archive repository cause puppet to run for minutes - https://phabricator.wikimedia.org/T208028 (10elukey) ``` elukey@stat1007:~$ sudo puppet agent -tv Info: Using configured environment 'production' Info: Retrieving pluginfacts Info: Retrieving... [11:06:48] 10Analytics, 10Analytics-Kanban, 10Patch-For-Review: Geoip data archive repository cause puppet to run for minutes - https://phabricator.wikimedia.org/T208028 (10elukey) [11:07:40] addshore: o/ so any issue in moving ::statistics::wmde to stat1007? [11:11:45] Hey team - going for special lunch with family - will be off for a few hours [11:41:12] going out for lunch + errand! [12:16:19] (03PS4) 10Fdans: Handle null name values in top metrics from UI [analytics/wikistats2] - 10https://gerrit.wikimedia.org/r/468964 (https://phabricator.wikimedia.org/T206968) [13:20:13] o/ Is there an API for getting the list of most edited 100 (enwiki) articles in the past month? [13:47:16] (03CR) 10Ottomata: [C: 031] "I like" (031 comment) [analytics/refinery/source] - 10https://gerrit.wikimedia.org/r/468322 (https://phabricator.wikimedia.org/T164020) (owner: 10Joal) [13:55:48] elukey: hiiii o/ [13:55:57] wanna rebalance some webrequest text partition leadership with me? [13:56:04] it should be easy [13:56:10] just wanna do it with a second pair of eyes [13:59:22] ottomata: just got back! [13:59:31] if you can wait 5 mins I'll be there [13:59:50] ya not quite ready still getting commands etc. ready [13:59:52] but shoudl be soon [14:10:16] bmansurov: https://stats.wikimedia.org/v2/#/en.wikipedia.org/contributing/top-edited-pages/normal|table|1-Month|~total [14:10:26] fdans: thanks! [14:16:02] ottomata: ack I am ready [14:16:08] ready too [14:16:17] do you want to bc? 
[14:16:19] bc [14:16:19] ua [14:16:20] ya [14:27:31] !log ran kafka-preferred-replica-election on kafka jumbo-eqiad cluster (this successfully rebalanced webrequest_text partition leadership) T207768 [14:27:34] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [14:27:34] T207768: Make sure webrequest_text preferred partition leadership is balanced - https://phabricator.wikimedia.org/T207768 [14:28:10] hey team! [14:34:15] helloooo mforns [14:34:24] heya fdans :] [14:46:06] elukey: thought: how long do you usually wait between rebooting kafka nodes? [14:46:12] when you have to do a cluster reboot? [14:46:47] 10Analytics: Make sure webrequest_text preferred partition leadership is balanced - https://phabricator.wikimedia.org/T207768 (10Ottomata) Interesting! Today Luca and I were about to move partition leadership using `kafka reassign-partitions`, but we noticed that the replica assignment actually looked correct,... [14:47:00] 10Analytics, 10Analytics-Kanban: Make sure webrequest_text preferred partition leadership is balanced - https://phabricator.wikimedia.org/T207768 (10Ottomata) [14:47:16] ottomata: usually when all metrics recover I start with the next host [14:47:27] say 5/10 minutes or more [14:48:01] recover meaning what? do you see data flowing back into the broker? [14:48:14] if the auto rebalance is only considered once every 5 minutes [14:48:38] maybe somehow a premature broker restart could cause balanced leadership to get slightly out of whack? [14:48:52] when I see traffic reshaping and other metrics like in-sync-replicas etc.. getting to flat zero agin [14:48:55] *again [14:48:56] aye [14:49:07] i wonder if ISRs are all back, but election has happened [14:49:17] hasn't* [14:49:19] has not happened yet [14:49:30] but $something must happen since I clearly see traffic getting back to the broker rebooted [14:49:38] right [14:49:45] some but not all? [14:49:58] maybe an auto rebalance was triggered before ALL of the ISRs are back? [14:49:59] and also we have a graph related to partitions assigned to brokers [14:50:14] hm, and it was 1006 that was missing leaders [14:50:17] lemme find it [14:50:18] which is the last broker you restart right? [14:50:20] so, what if [14:50:41] what if you restart 1006, and as it is coming back up, some partitions are back in the ISR, but not all, probably not webrequest text ones yet [14:50:44] because those ones take the longest [14:50:53] then, 300 seconds pass since the last auto rebalance consideration [14:50:58] https://grafana.wikimedia.org/dashboard/db/kafka?refresh=5m&panelId=20&fullscreen&orgId=1 [14:51:03] and kafka sees that > 10% of leadership is unbalanced [14:51:07] so it triggers an election. 
[14:51:12] it then rebalances everything that is in ISR [14:51:25] but, not all of webrequest_text is yet in sync, beacuse it is still replicating from the last reboot [14:51:38] then 1006 gets fully synced back into ISR [14:51:40] but at that point [14:51:47] the unbalanced % is < 10% [14:51:52] so no future elections are triggered [14:51:56] that would explain this pretty well [14:52:54] 10Analytics, 10Operations, 10ops-eqiad: Degraded RAID on aqs1006 - https://phabricator.wikimedia.org/T206915 (10Cmjohnson) The disk is being sent and should arrive today or tomorrow [14:53:00] ah so webrequest gets late to the party due to the time taken by the replication [14:53:08] right [14:53:18] it could be a good explanation [14:53:32] ya elukey cool, this graph is helpful [14:53:39] so in this case, a round of reboots should always complete (to be sure) with a preferred replica election [14:53:42] i can see that before and after election today [14:53:52] 3 leaderships moved to 1006 [14:53:54] and we know which ones [14:54:04] VirtualPageView, and webrequest_text 0 and 6 [14:54:07] yep [14:54:17] 2 from 1002 and 1 from 1005 [14:54:50] so elukey i think all we should do is just add a step to the reboot procedure [14:54:54] once all nodes have rebooted [14:55:11] wait til all partitions are in ISR [14:55:15] all replicas * [14:55:20] and then do an election [14:55:40] yes +2 [14:55:53] (03PS1) 10Fdans: Release 2.4.6 [analytics/wikistats2] - 10https://gerrit.wikimedia.org/r/470409 [14:56:05] I was used to do it after every reboot before auto-rebalance [14:57:11] 10Analytics, 10Analytics-Kanban: Update pageview_hourly to include timestamp for better druid indexation - https://phabricator.wikimedia.org/T208230 (10fdans) p:05Triage>03High [14:58:24] (03CR) 10Fdans: [V: 032 C: 032] Release 2.4.6 [analytics/wikistats2] - 10https://gerrit.wikimedia.org/r/470409 (owner: 10Fdans) [14:59:19] 10Analytics, 10Analytics-Kanban: Make sure webrequest_text preferred partition leadership is balanced - https://phabricator.wikimedia.org/T207768 (10Ottomata) Oo, here's a plausible explanation. kafka-jumbo1006 was the only broker that was missing some of its leaders. It is usually the last nod e to be reboo... [15:01:02] a-team: standdup [15:02:00] ping joal [15:02:16] ping mforns [15:03:40] 10Analytics, 10Analytics-Kanban: Make sure webrequest_text preferred partition leadership is balanced - https://phabricator.wikimedia.org/T207768 (10Ottomata) I updated wikitech docs: https://wikitech.wikimedia.org/w/index.php?title=Kafka%2FAdministration&action=historysubmit&type=revision&diff=1807237&oldid=1... [15:19:30] aaaaagh! dst change [15:37:54] haha [15:53:13] 10Analytics: Allow hadoop prod jobs to preempt resource over default queue - https://phabricator.wikimedia.org/T208208 (10fdans) p:05Triage>03Normal [15:53:22] 10Analytics: Allow hadoop prod jobs to preempt resource over default queue - https://phabricator.wikimedia.org/T208208 (10fdans) a:03Ottomata [15:55:21] 10Analytics, 10Analytics-Kanban, 10Patch-For-Review: Allow hadoop prod jobs to preempt resource over default queue - https://phabricator.wikimedia.org/T208208 (10Ottomata) [16:00:19] 10Analytics, 10Analytics-Wikistats: Restore WikiStats features disabled for mere performance reasons - https://phabricator.wikimedia.org/T44318 (10fdans) @Nemo_bis hey, is there a list of metrics you have that we could maybe develop with the current infrastructure? 
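A rough sketch of the extra step for the reboot procedure agreed on above (around 14:55): after the last broker of a rolling reboot, wait until every replica is back in the ISR, and only then trigger a preferred-replica election. It shells out to the stock Kafka tools; the zookeeper connect string is an assumption, and WMF hosts wrap these scripts in a `kafka` helper command:
```
import subprocess
import time

# Zookeeper connect string (chroot path in particular) is an assumption.
ZOOKEEPER = "conf1004.eqiad.wmnet:2181/kafka/jumbo-eqiad"

def under_replicated():
    """List partitions whose ISR is smaller than the replica set; empty output means fully replicated."""
    out = subprocess.run(
        ["kafka-topics.sh", "--zookeeper", ZOOKEEPER,
         "--describe", "--under-replicated-partitions"],
        capture_output=True, text=True, check=True,
    )
    return out.stdout.strip()

# After the last broker of the rolling reboot is back, wait for replication to catch up...
while under_replicated():
    print("still under-replicated, waiting 60s...")
    time.sleep(60)

# ...and only then move leadership back to the preferred replicas.
subprocess.run(["kafka-preferred-replica-election.sh", "--zookeeper", ZOOKEEPER], check=True)
```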
[16:05:58] 10Analytics, 10Analytics-Kanban: Make sure webrequest_text preferred partition leadership is balanced - https://phabricator.wikimedia.org/T207768 (10Milimetric) p:05Triage>03High [16:09:36] 10Analytics, 10Contributors-Analysis, 10Product-Analytics: Set up automated email to report completion of mediawiki_history snapshot and Druid loading - https://phabricator.wikimedia.org/T206894 (10Milimetric) p:05Normal>03High [16:13:18] 10Analytics: Investigate AQS cassandra schema hash warninga - https://phabricator.wikimedia.org/T178832 (10Milimetric) @JAllemandou what's up with this task, that's what we wanted to know in grosking. [16:15:28] 10Analytics: Investigate AQS cassandra schema hash warninga - https://phabricator.wikimedia.org/T178832 (10Milimetric) p:05Low>03Normal [16:22:08] (03CR) 10Joal: [C: 032] "Merging to deploy tomorrow" [analytics/refinery/source] - 10https://gerrit.wikimedia.org/r/468322 (https://phabricator.wikimedia.org/T164020) (owner: 10Joal) [16:23:44] 10Analytics-Kanban: Rename insertion_ts to insertion_dt in pageview_whitelist tabler (convention) - https://phabricator.wikimedia.org/T208237 (10JAllemandou) [16:24:00] 10Analytics-Kanban: Rename insertion_ts to insertion_dt in pageview_whitelist tabler (convention) - https://phabricator.wikimedia.org/T208237 (10JAllemandou) a:03JAllemandou [16:27:32] (03PS2) 10Joal: Update pageview_whitelist fieldname for convention [analytics/refinery] - 10https://gerrit.wikimedia.org/r/469924 (https://phabricator.wikimedia.org/T208237) [16:28:06] (03Merged) 10jenkins-bot: Add WebrequestSubsetPartitioner spark job [analytics/refinery/source] - 10https://gerrit.wikimedia.org/r/468322 (https://phabricator.wikimedia.org/T164020) (owner: 10Joal) [16:28:12] (03PS9) 10Nuria: Memoizing results of state functions [analytics/wikistats2] - 10https://gerrit.wikimedia.org/r/468205 (https://phabricator.wikimedia.org/T207352) [16:30:10] joal elukey interesting, when we added partitions to the eventlogging_ReadingDepth topic, the preferred leaders were not balanced! [16:30:17] this time for real, not just needing an election [16:30:28] (03CR) 10Joal: [V: 032 C: 032] "Merging to deploy tomorrow" [analytics/refinery] - 10https://gerrit.wikimedia.org/r/469924 (https://phabricator.wikimedia.org/T208237) (owner: 10Joal) [16:30:38] i wonder if that was because the leadership of all some partitions were unbalanced when we added the partitions [16:30:47] anyway, there is one extra partition on 1002 that needs moved to 1003 [16:31:14] mmmmm [16:31:27] (03CR) 10Joal: [V: 032 C: 032] "Merging to deploy tomorrow" [analytics/refinery] - 10https://gerrit.wikimedia.org/r/357814 (https://phabricator.wikimedia.org/T164020) (owner: 10Joal) [16:31:53] (03CR) 10Nuria: "Francisco, let's make a PR to correct restbase documentation regarding nulls/anaonymous users, once that is done this can be deployed." [analytics/aqs] - 10https://gerrit.wikimedia.org/r/468927 (https://phabricator.wikimedia.org/T206968) (owner: 10Fdans) [16:32:08] (03CR) 10Joal: [V: 032 C: 032] "Merging to deploy tomorrow" [analytics/refinery] - 10https://gerrit.wikimedia.org/r/456654 (https://phabricator.wikimedia.org/T202489) (owner: 10Joal) [16:33:53] Hi [16:34:24] I was expecting that the chart at https://language-reportcard.wmflabs.org/cx2/#translations will be updated today, but it wasn't. [16:34:32] Is there a problem with report updater, or with the database? [16:35:35] ottomata: can we bump partition number on VirtualPageview? 
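For the question just above about bumping the partition count on VirtualPageView: increasing partitions is normally done with the stock kafka-topics tool, roughly as sketched below. The topic name, target count, and zookeeper string are assumptions:
```
import subprocess

# Increasing a topic's partition count (it can only go up, never down);
# topic name, target count, and zookeeper string are assumptions.
subprocess.run(
    ["kafka-topics.sh",
     "--zookeeper", "conf1004.eqiad.wmnet:2181/kafka/jumbo-eqiad",
     "--alter",
     "--topic", "eventlogging_VirtualPageView",
     "--partitions", "12"],
    check=True,
)
# Note: keyed producers will see their key-to-partition mapping change
# once the partition count changes.
```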
[16:36:46] yes [16:37:00] gonna move a ReadingDepth one first [16:37:35] !log reassigning eventlogging_ReadingDepth partition 0 from 1002,1004,1006 to 1003,1001,1005 to move preferred leadership from 1002 to 1003 [16:37:36] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [16:41:21] aharoni: the files read were last updated on teh 21st: https://analytics.wikimedia.org/datasets/periodic/reports/metrics/published_cx2_translations/published_cx2_translations.tsv [16:42:06] aharoni: so that is the most recent data [16:42:39] (03CR) 10Amire80: "OK, it's clear, this can be reviewed." [analytics/limn-language-data] - 10https://gerrit.wikimedia.org/r/469390 (https://phabricator.wikimedia.org/T207765) (owner: 10Amire80) [16:44:27] nuria: Thanks. I also spoke to milimetric , and he says that it should run later today, and if it's true, it's OK. Maybe it's related to a configuration change we did some time ago. [16:46:59] aharoni: configuration change? [16:52:48] nuria: https://gerrit.wikimedia.org/r/#/c/analytics/limn-language-data/+/465152/ [16:53:16] Pau the product manager asked to change the dates by which it works [16:54:21] joal, mforns - can you please take a look at https://phabricator.wikimedia.org/T189475 some time? [16:54:30] sure aharoni [16:56:08] (03CR) 10Nuria: [C: 032] Memoizing results of state functions (032 comments) [analytics/wikistats2] - 10https://gerrit.wikimedia.org/r/468205 (https://phabricator.wikimedia.org/T207352) (owner: 10Nuria) [16:56:12] (03CR) 10Nuria: Memoizing results of state functions [analytics/wikistats2] - 10https://gerrit.wikimedia.org/r/468205 (https://phabricator.wikimedia.org/T207352) (owner: 10Nuria) [16:59:16] a-team - I have merged a bunch of patches on refinery-source and refinery and plan to deploy them tomorrow - Please let me know if you have special patches for me to care [16:59:31] ok joal [17:00:58] joal, you said 10 billion web requests per day, right? Was that US billions or EU billions? [17:01:36] mforns: at peak we process 150.000 reqs oer sec [17:01:40] *pe [17:01:42] *per [17:02:23] sounds like 10 US billion [17:03:13] ya, to avoid issues you can put number per sec [17:04:30] mforns: i did that on my last talk [17:04:48] ok, makes sense [17:16:18] mforns: correct 10x10^12 [17:16:25] k :] [17:19:09] mforns: to contradict nuria, I like numbers over a period of time people can feel (a day is nice) [17:19:40] joal, nuria, the thing is, the sentence goes "logs that are ingested daily" [17:19:53] well... that would work as well [17:20:14] I just didn't want to use the same time unit measure to refer to EL events and web requests [17:21:10] I left it like: "10 billion (US) web request logs that are ingested daily" [17:21:15] mforns: no big deal though, whethere per minute or day or second [17:21:20] nuria, please re-ping if you do not like it [17:21:51] mforns: you could also add the peak rps if you want(maybe not in the abstract) [17:22:05] it's ok [17:22:35] joal, regarding "in summary, all data containing personal identifyers..." [17:22:49] how about "practically all data containing..." 
[17:24:20] mforns: I don't know about the meaning of 'practically' in english [17:24:36] joal, I'd say it means: in practical terms [17:24:37] In french youcould say 'pratiquement' to mean almost [17:24:54] or in practice [17:24:59] yea [17:25:16] but here in practical terms we store IPs since 15 years :)= [17:25:30] it's another way to say almost, hehehehe [17:25:42] So in the almost sense, I agree ;) [17:25:48] ok [17:25:52] xD [17:26:39] mforns: Thanks :) I don't mean to be overly precise, I just like when it's oiverly correct :) [17:27:31] joal, yes, if the talk is accepted, I can mention that in detail, but for the point of that sentence I think this is OK [17:27:41] For sure [17:30:14] mforns: given that you are in EU when giving that conference most attendees are not going to know about the billion (us) versus rest of the world but i do not think is a big deal either way, just be aware [17:32:19] nuria, ok [17:37:06] 10Analytics, 10Analytics-Wikistats: Wikistats Bug - Monthly overview's "Top editors" box links to mainspace instead of userspace - https://phabricator.wikimedia.org/T208247 (10Quiddity) [17:43:48] am surprised that moving this one ReadingDepth partition is taking so long! [17:43:52] its still doing the right thing [17:43:56] oh RIGHT there is a throttle.... [17:44:25] orr maybe the default is no throttle... [17:46:11] euh - question for you ottomata [17:46:22] ya [17:46:36] Why moving when talking about ReadingDepth, while it was not needed for Webreauest? [17:46:51] ottomata: a broker had 2 partitions? [17:46:52] the webrequest preferred leadership assignment was actually correct [17:47:03] ottomata: Ah? [17:47:09] it just wasn't balanced; 1002 has more actual leaders [17:47:17] but a rebalance caused the preferred leaders to take over [17:47:51] ok I think I get it [17:47:59] whereas when we created ReadingDepth, for some reason (maybe because some actual leadership wasn't balanced at the time?) the preferred leaders were not correctly balanced [17:48:12] Makes sense I understand [17:48:14] ok [17:48:14] aye [17:48:17] Thanks :) [17:48:47] Interestingly also, the leader for VirtualPageView also changed when you rebalanced :) [17:49:41] yes! 
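The partition move !logged at 16:37 above boils down to a small reassignment plan fed to kafka-reassign-partitions, where the first broker listed becomes the preferred leader. A hedged sketch (broker ids, file name, and zookeeper path are assumptions):
```
import json

# Reassignment plan: the first broker in the replica list becomes the preferred leader.
plan = {
    "version": 1,
    "partitions": [
        {"topic": "eventlogging_ReadingDepth", "partition": 0, "replicas": [1003, 1001, 1005]},
    ],
}

with open("reassign-readingdepth.json", "w") as f:
    json.dump(plan, f)

# Then, roughly:
#   kafka-reassign-partitions.sh --zookeeper <zk> \
#       --reassignment-json-file reassign-readingdepth.json --execute
# followed later by --verify; an optional --throttle limits replication bandwidth
# (the throttle wondered about in the conversation above).
```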
[17:49:44] interesting eh [18:00:03] 10Analytics, 10Analytics-EventLogging, 10EventBus, 10Services (watching): T206785: Modern Event Platform: Stream Intake Service: AJV usage security review - https://phabricator.wikimedia.org/T208251 (10Ottomata) p:05Triage>03Normal [18:12:10] milimetric: you are totally right that wrappers to teh functions works just fine [18:13:08] heh, now only if we knew why :) [18:30:20] * elukey off [18:54:26] 10Analytics, 10EventBus, 10Operations, 10Wikidata, and 7 others: WDQS Updater ran into issue and stopped working - https://phabricator.wikimedia.org/T207817 (10Smalyshev) [18:56:16] 10Analytics, 10Analytics-EventLogging, 10Wikipedia-iOS-App-Backlog: MobileWikiAppiOSSearch validation errors adding noise to EventLogging error - https://phabricator.wikimedia.org/T205910 (10chelsyx) [18:56:19] 10Analytics, 10Wikipedia-iOS-App-Backlog, 10iOS-app-Bugs, 10iOS-app-feature-Analytics, 10iOS-app-v6.1-Narwhal-On-A-Bumper-Car: Many errors on "MobileWikiAppiOSSearch" and "MobileWikiAppiOSUserHistory" - https://phabricator.wikimedia.org/T207424 (10chelsyx) [18:56:46] 10Analytics, 10Analytics-EventLogging, 10Wikipedia-iOS-App-Backlog: MobileWikiAppiOSSearch validation errors adding noise to EventLogging error - https://phabricator.wikimedia.org/T205910 (10chelsyx) Hi all, I closed this one as it has been fixed in T207424 [19:08:39] cool finished moving paritiotns in reading depth [19:08:39] https://grafana.wikimedia.org/dashboard/db/kafka-by-topic?orgId=1&var-datasource=eqiad%20prometheus%2Fops&var-kafka_cluster=jumbo-eqiad&var-kafka_broker=All&var-topic=eventlogging_ReadingDepth&from=1540834936497&to=1540840098160 [19:09:23] joal: i'm all for adding virtualpageview partitions [19:09:38] buuut, iwonder if it will really make a difference here [19:15:59] milimetric: all calls we do to those functions externally are prefixed with Router.blah and thus they work just fine [19:16:32] milimetric: do you want me to rework the code to add the decorator? it is really only 1 function that gets called externally [19:17:15] nuria: it's up to you, I reviewed the way you approached it too, and it was mostly fine as well [19:17:38] milimetric: ok, will rework the one function exposed one sec [19:30:35] 10Analytics, 10Analytics-Wikistats: Restore WikiStats features disabled for mere performance reasons - https://phabricator.wikimedia.org/T44318 (10Nemo_bis) >>! In T44318#4703281, @fdans wrote: > @Nemo_bis hey, is there a list of metrics you have that we could maybe develop with the current infrastructure? So... [19:43:26] (03PS10) 10Nuria: Memoizing results of state functions [analytics/wikistats2] - 10https://gerrit.wikimedia.org/r/468205 (https://phabricator.wikimedia.org/T207352) [19:56:25] (03PS11) 10Nuria: Memoizing results of state functions [analytics/wikistats2] - 10https://gerrit.wikimedia.org/r/468205 (https://phabricator.wikimedia.org/T207352) [19:56:29] 10Analytics, 10Analytics-Kanban: Clickstream dataset for Persian Wikipedia only includes external values - https://phabricator.wikimedia.org/T191964 (10Ladsgroup) Hello? 
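The memoization patch above is wikistats2 JavaScript, but the idea being discussed — wrap only the state functions whose results "deserve" caching, rather than decorating everything — can be sketched like this (Python is used only for consistency with the other examples here; all names are hypothetical):
```
import functools

def memoize(fn):
    """Cache results per argument tuple; a plain wrapper, not a blanket decorator."""
    cache = {}
    @functools.wraps(fn)
    def wrapper(*args):
        if args not in cache:
            cache[args] = fn(*args)
        return cache[args]
    return wrapper

def expensive_state_lookup(metric, project):
    # placeholder for a function whose result is worth caching
    return {"metric": metric, "project": project}

def cheap_state_lookup(metric):
    # deliberately left unwrapped: not every response should be cached
    return metric.upper()

# Only the function that "deserves" caching gets wrapped; external callers keep
# using the same name, so call sites do not change.
expensive_state_lookup = memoize(expensive_state_lookup)
```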
[19:57:18] milimetric: on 2nd thought i removed the decorator as i do not want to cache any possible response from the functions, just the ones that 'deserve' caching [20:01:08] k, sounds good nuria, I think it’s mostly a matter of style [20:23:42] 10Analytics, 10Analytics-EventLogging, 10EventBus, 10Security-Reviews, and 3 others: T206785: Modern Event Platform: Stream Intake Service: AJV usage security review - https://phabricator.wikimedia.org/T208251 (10mobrovac) Adding the security folks. I agree that code generation based on schemae can potent... [20:34:13] Thought you all might like https://www.dataengineeringpodcast.com/using-notebooks-as-the-unifying-layer-for-data-roles-at-netflix-with-matthew-seal-episode-54/ [20:43:30] ottomata: isn't this job kind of eating the whole capacity of the cluster? https://yarn.wikimedia.org/cluster/app/application_1540803787856_1567 [20:43:39] ottomata: i cannot even do selects ... [20:44:24] ottomata: it has 1000 reducers [20:46:27] groceryheist: hello, i think one of your selects is using a lot of capacity, is your user name nathante on hadoop? [20:46:47] nuria: hmmm, it is in the nice queue though [20:47:23] looking [20:48:46] nuria i would think that if you have a job submitted (hive query, whatever) [20:48:57] you might have to wait until a few of those reducers have finished [20:49:00] but you should get in [20:50:03] i see your hive query has been accepted and is running [20:50:06] but the mappers haven't started [20:51:43] ottomata: ok looking [20:53:50] ottomata: so queries with 1000 reducers on nice queue we think are tenable? (asking for real, not ironic question) [20:55:55] i am not sure, i would hope that something in the nice queue would allow your default job to be scheduled [20:56:11] maybe we need to make the nice queue preemptable too [20:56:32] that job doesn't seem to be making much progress [20:56:46] running reducers just sitting there? [20:56:57] so it is taking up space and not allowing new stuff in [20:58:25] ottomata: also I am a bit lost on how to see the code of the create table [20:58:34] ottomata: this https://yarn.wikimedia.org/proxy/application_1540803787856_1567/ [20:58:44] nuria there are prob a bunch of fair scheduler queue tweaks we could/should do there [20:58:45] here [20:58:47] ottomata: does not have the code for "create table" that is running [20:58:48] nuria: that i don't know either [20:58:50] oh [20:58:52] your code? [20:58:54] your query [20:59:02] ottomata: no, the one consuming resources [20:59:07] oh [20:59:12] yeah don't know how to get that either [20:59:22] maybe in hue...hm [21:00:11] ottomata: also these are all queries accepted taht are not running [21:00:13] *that [21:00:20] yes found it! [21:00:21] https://hue.wikimedia.org/jobbrowser/jobs/application_1540803787856_1567 [21:00:25] search page for [21:00:26] 'hive.query.string' [21:00:46] nuria: ? [21:01:10] ottomata: wow [21:02:33] ottomata: https://yarn.wikimedia.org/cluster/apps/ACCEPTED [21:05:02] yeah it's holding a lot of other stuff up too [21:05:16] accepted and RUNNING ones [21:05:24] RUNNING ones aren't getting assigned many resources to launch containers [21:05:37] ottomata: i seriously cannot see code for select, me feeling a bit handicapped [21:05:55] nuria: the offending jobs? [21:06:01] ottomata: yes [21:06:02] did you click on the hue link above^ [21:06:02] ?
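Besides the Hue job browser used in this conversation, one hedged way to see which running application is holding the cluster is the YARN ResourceManager REST API; the ResourceManager address below is a placeholder:
```
import requests

RM = "http://an-master1001.eqiad.wmnet:8088"  # placeholder ResourceManager address

resp = requests.get(f"{RM}/ws/v1/cluster/apps", params={"states": "RUNNING"}).json()
apps = (resp.get("apps") or {}).get("app") or []

# Sort by allocated memory to spot the applications holding the most resources.
for app in sorted(apps, key=lambda a: a.get("allocatedMB", 0), reverse=True)[:10]:
    print(app["id"], app["user"], app["queue"],
          app.get("allocatedMB"), app.get("allocatedVCores"), app["name"][:80])
```
This shows who is holding containers and in which queue; the full Hive query text still has to come from Hue's Metadata tab (hive.query.string) or from the application logs after the job finishes, as described below.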
[21:06:09] https://hue.wikimedia.org/jobbrowser/jobs/application_1540803787856_1567 [21:06:19] ah sorry [21:06:25] then go to the Metadata tab [21:06:30] and find on page for 'hive.query.string' [21:06:56] nuria: for prod/essential jobs [21:07:02] i think the change that joal suggested today would help [21:07:18] ottomata: ok, now i see it sorry, i was thinking it will be in the logs! [21:07:28] nuria it will be in the application logs after it finishes [21:08:47] ottomata: right, ok, that query is a complex query over mw snapshot that I think has been killed several times, i can see many attempts [21:08:52] ya [21:08:59] i think the fair scheduler stuff dynamically updates [21:08:59] i [21:09:06] i'm going to merge this patch we talked about this morning [21:09:14] it won't help your job [21:09:17] but it will at least help the prod ones [21:09:56] ottomata: ok [21:10:22] groceryheist: please ping us if you are running this select CREATE TABLE nathante.readingDataModel_Stage2... [21:10:56] groceryheist: with no bounds of time or wikis is really consuming quite a bit of resources, it will likely get killed and we can probably work together to make it more efficient [21:26:43] nuria: hi [21:29:59] 10Analytics, 10Analytics-Kanban, 10Performance-Team (Radar): Possible statsv corruption? - https://phabricator.wikimedia.org/T189530 (10Krinkle) 05Open>03Resolved Looking at graphite1001 now: ``` krinkle at graphite1001.eqiad.wmnet in /var/lib/carbon/whisper $ ls -d -1 *_* eventlogging_client_errors_Na... [21:30:06] yeah I have been trying to optimize this job [21:30:12] had a chat with joal earlier [21:30:30] here's the code https://gist.github.com/groceryheist/46a634ed79ca8b1c9afae3ce6beb0c64 [21:30:48] sorry that it isn't preemptable. The reducers are getting pretty stuck [21:30:55] i'm going to have to go for a while [22:12:57] nuria: I think adding a bounds of time could work so that we take the most recent revision for pages that were last edited before the period where reading data was collected. [22:13:44] since we know that those edits will be correct. this might save us a bit. [22:14:16] i know this query is super huge, but we want the revision of the page that was viewed --- and this wasn't recorded in the event log. [22:14:24] we should probably file a bug for that. [22:50:32] groceryheist: yeah i have to kill this job yarrr [22:51:03] ok [22:51:05] we need to do work on our side to allow your job to run without starving the rest of the cluster [22:51:11] so this isn't totally your fault, sorry [22:51:16] makes sense [22:51:22] i know it's going to be a long job [22:52:36] oh did you just kill it? [22:52:40] yes [22:52:48] i'll add the time filter and then see [22:52:50] ok [22:52:52] thanks groceryheist [22:53:03] yeah you can now see https://yarn.wikimedia.org/cluster/scheduler [22:53:06] it's pretty important to add recording revisions to the user logging [22:53:08] that production jobs are filling up the cluster now [22:53:10] they were waiting to run [22:53:13] wow [22:53:30] that shouldn't happen though [22:53:32] yeah I assumed that running in the niceq would be enough [22:53:36] yeah it should be [22:53:39] but clearly it isn't!
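A hedged sketch of the kind of bounding discussed above (restricting the scan by snapshot, wiki, and time window before the expensive join). The table and field names follow the wmf.mediawiki_history conventions but should be treated as assumptions, and this is not the query from the gist:
```
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("bounded-history-example").getOrCreate()

bounded = (
    spark.table("wmf.mediawiki_history")
    .where(F.col("snapshot") == "2018-09")              # partition pruning: one snapshot only
    .where(F.col("wiki_db").isin("enwiki", "fawiki"))   # only the wikis actually needed
    .where(F.col("event_entity") == "revision")
    .where(F.col("event_timestamp") >= "2018-09-01")    # only the study window
)

# Materialising a small, bounded intermediate table keeps the later join
# (and its reducers) far smaller than scanning all wikis and all history.
(bounded
 .select("wiki_db", "page_id", "revision_id", "event_timestamp")
 .write.mode("overwrite")
 .saveAsTable("nathante.bounded_revisions_example"))
```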
[22:53:41] but not if my reducers never finish [22:53:45] that's why it isn't totally your fault [22:53:58] i'm not 100% when or how fairscheduler queue config updates get applied [22:54:05] it's possible they don't get applied to already submitted jobs, i don't know [22:54:11] so maybe the tweaks I merged earlier would help [22:54:16] but it's a little hard to say [22:54:24] they should make production jobs a little more aggressive at preempting [22:54:28] but it didn't seem to do much [22:54:28] I have a question (again, sorry ;) : what's the connection between EventLogging and the d [22:54:34] darn, crap [22:55:02] !log groceryheist killed a long running hive query that is now allowing backlogged production yarn jobs to finally execute [22:55:03] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [22:55:28] I'll try that again… what's the connection between EventLogging and Hadoop when it comes to testing on say betalabs or testwiki? I see some beta sites for some schemas, but not all [22:56:22] I'm asking because I'm wondering if there's a way for us to get data for Schema:EditorJourney so I can write queries against it, without having to have it deployed [22:56:28] No connection, since Hadoop doesn't run in Cloud VPS [22:56:37] hm [22:56:53] Nettrom: it might be possible for us to have a variable whitelist for beta vs prod [22:57:04] so that your events would go into the MySQL db in beta, but not in prod [22:57:07] and you could query there, but [22:57:15] the format of the table would be slightly different [22:57:26] so your queries would have to change for hive in prod [22:58:59] Ah, I see. We're looking to grab data from multiple schemas and such, so I'm not sure how feasible it'll be for us. [22:59:17] And that's alright, I was mainly trying to figure out what the landscape looked like, the documentation didn't seem to cover this scenario [22:59:33] Nettrom: the data still comes in, so you can consume it from kafka or log files [23:02:49] ottomata: ah, so the documentation related to grabbing the Hadoop raw data applies, regardless of whether the schema is used in prod or not? [23:04:14] so for beta, if the instrumentation is deployed to a beta site, then events come into the EventLogging system in beta [23:04:27] the message backend for EventLogging in both beta and in prod is Kafka [23:04:43] (different Kafka clusters, but still Kafka) [23:05:02] it's just the destination data stores that are different [23:05:04] ottomata: got it! thanks, I'll dig into that a bit [23:05:09] the same MySQL event schema whitelist is used in both beta and prod [23:05:27] so only schemas in that whitelist make it into the EventLogging MySQL DB in either place [23:05:40] but all events in both places go to Kafka
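A minimal sketch of the "consume it from kafka" option mentioned above, using kafka-python. The broker address is a placeholder (beta and prod use different Kafka clusters), and the topic follows the eventlogging_<SchemaName> naming convention:
```
import json
from kafka import KafkaConsumer  # kafka-python

consumer = KafkaConsumer(
    "eventlogging_EditorJourney",                        # eventlogging_<SchemaName> convention
    bootstrap_servers=["kafka-placeholder.example:9092"],  # placeholder; beta and prod brokers differ
    auto_offset_reset="latest",
    value_deserializer=lambda v: json.loads(v.decode("utf-8")),
)

for message in consumer:
    capsule = message.value
    # The EventLogging capsule wraps the schema-specific fields under "event".
    print(capsule.get("schema"), capsule.get("wiki"), capsule.get("event"))
```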