[01:09:04] (03PS8) 10Nuria: Memoizing results of state functions [analytics/wikistats2] - 10https://gerrit.wikimedia.org/r/468205 (https://phabricator.wikimedia.org/T207352) [01:10:35] (03CR) 10Nuria: "Sorry about patch #7. Note to self: do not push to gerrit when you are sick." [analytics/wikistats2] - 10https://gerrit.wikimedia.org/r/468205 (https://phabricator.wikimedia.org/T207352) (owner: 10Nuria) [07:02:00] 10Analytics, 10Analytics-Kanban, 10Patch-For-Review, 10User-Elukey: JVM pauses cause Yarn master to failover - https://phabricator.wikimedia.org/T206943 (10elukey) Logs for the minute 11:26 on zookeeper nodes: conf1004: ``` elukey@conf1004:~$ sudo grep "11:26" /var/log/zookeeper/zookeeper.log 2018-10-28... [07:31:07] morning! [07:31:14] so ---^ is kind of a mistery [07:31:35] I am leaning towards something related to kernel + io + disk controller [07:32:17] but it is difficult to prove, no breadcrumbs in syslog/kern.log/etc.. and nothing in metrics [07:36:01] wild [08:06:47] Hi elukey [08:07:09] Bonjour! [08:07:13] groceryheist: I wonder about your usage of HDFS fuse mount point [08:07:42] elukey: Please let me know if you want any help on trying to tame the hadoop-master hiccup [08:08:32] joal: I have no idea now, trying to see if any breadcrumb is there among metrics etc.. but it doesn't seem so [08:08:57] mwarf - a hidden hiccup [08:09:46] I started a tmux with sar -d -p 2 to have a very granular view of what happens to io latency, but I need to wait until another stall happens [08:11:33] the theory about the os causing the stall seems something plausible, but usually (from what I can read) it is caused by other processes hogging disk io [08:11:37] causing starvation [08:12:06] but there's nothing in syslog that can explain this [08:12:09] or in kern.log [08:12:22] the stall is relatively "small", like 8s [08:15:05] I am inclined to bump yarn.resourcemanager.zk-timeout-ms to something like 15/20s [08:15:46] joal: curious, why? I'm using spark from swap. [08:16:43] groceryheist: IMO direct access through hdfs libs are less error prone [08:17:02] particularly if you use spark: direct hdfs access is usually a lot more efficient [08:17:43] elukey: I'm fine with bumpling the timeout [08:18:42] elukey: Could it an explicit lock due to for instance a log roate compaction or something similar (sorry for silly question, trying to think out of the box) [08:19:39] joal: nono please, more thinking usually brings more ideas.. :) [08:21:06] I think that yarn manages its logs via log4j or similar (we don't deploy logrotate configs) [08:23:02] ah. I didn't mount HDFS on the notebook machine. The notebook kernels just set up the connection. I assume they are doing the right thing. [08:29:29] groceryheist: I was actually talking about usage, not the mounting :) [08:37:05] ah. this is going a bit over my head. how do you recommend writing data to hdfs through spark? 
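A minimal sketch of what "direct HDFS access from Spark" looks like in practice, for the question just above. It uses the PySpark DataFrame API; the database, table, and output path are placeholders, not the actual notebook code:
```
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("hdfs-write-example").getOrCreate()

# Build or load a DataFrame however you normally would; this query is only a placeholder.
df = spark.sql("SELECT page_id, view_count FROM some_db.some_table WHERE snapshot = '2018-10'")

# The DataFrame writer talks to HDFS through the Hadoop client libraries directly,
# so nothing has to pass through a local or fuse-mounted filesystem.
df.write.mode("overwrite").parquet("hdfs:///user/nathante/example_output")

# Reading it back is symmetric:
spark.read.parquet("hdfs:///user/nathante/example_output").show(5)
```
Because the writer goes through the HDFS client, this is the "direct hdfs access" joal recommends above as both less error prone and more efficient than the fuse mount point.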
[08:41:09] ok I have to sleep [08:53:51] groceryheist: Let's talk either tonight (my time) or tomorrow night (morning for you I think) [08:56:14] !log bounce yarn resource managers to pick up new zookeeper session timeout settings [08:56:15] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [08:56:27] Thanks for that elukey --^ [08:56:34] elukey: I hope it'll help :( [08:57:05] I hope so, it will not solve the problem though :( [09:00:54] 2018-10-29 09:00:09,335 INFO org.apache.zookeeper.ClientCnxn: Session establishment complete on server conf1004.eqiad.wmnet/2620:0:861:101:10:64:0:23:2181, sessionid = 0x50664009c3b103bc, negotiated timeout = 20000 [09:01:04] good :) [09:03:53] done! [09:53:31] fdans: you there? [09:58:04] (03PS1) 10John Erling Blad: Wikistats2: Added headless testing [analytics/wikistats2] - 10https://gerrit.wikimedia.org/r/470344 [10:06:54] PROBLEM - Check the last execution of check_webrequest_partitions on an-coord1001 is CRITICAL: CRITICAL: Status of the systemd unit check_webrequest_partitions [10:11:15] joal: doing anything with webrequest? [10:12:02] I've just re-run hour 5 in text, failed but no logs available from hue, let'see [10:13:05] maybe we are just a bit behind the schedule [10:13:58] still processing 2018-10-29T04 [10:14:56] hour 4 is running refine now [10:15:14] (for upload) [10:15:15] https://yarn.wikimedia.org/proxy/application_1540747790951_1500 [10:15:39] Elapsed:4hrs, 48mins, 41sec [10:15:47] and 1 mapper running [10:15:48] whatt [10:17:46] ah joal now I get what you were saying before [10:18:01] what a huge create table [10:20:31] I would be inclined to kill it [10:20:45] and follow up via email [10:21:35] +1 elukey [10:22:02] I didn't kill it earlier to try to let it finish - but it's too long and not making progress [10:22:06] Killing it [10:23:11] ack [10:23:25] !log Kill yarn application application_1540747790951_1429 to prevent more cluster errors (eating too many resources) [10:23:26] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [10:23:37] done [10:23:48] nice [10:24:00] webrequest refine running with 1 mapper is kinda crazy [10:24:01] :D [10:24:10] elukey: And the job was only at stage-1 of a complicated query :( [10:24:15] :) [10:24:38] I am pretty sure it wasn't intended, but our jobs have the priority if they are heavily impacted [10:25:10] Hello, I'll give 1 CPU and 2gb ram to work 100gb [10:25:20] elukey: indeed [10:25:35] elukey: I actually think we should allow preemption for prod [10:26:00] it would almost never be used, but in that example it would have prevented the issue [10:26:12] makes sense joal, let's open a task? 
[10:31:35] elukey: sorryyy just seen the notification [10:31:47] I'll take a luca at your patch right now :) [10:32:40] hey Fran no hurry :) [10:32:43] whenever you have time [10:33:00] I was reviewing the code and I didn't recall why /srv/geoip was not chosen [10:33:27] puppet is a bit stupid and does a full scan of /usr/share/GeoIP when it has to ensure it [10:33:33] including the archive dirs [10:33:47] and now it takes ~5 mins for a puppet run to complete [10:33:52] but it'll get worse over time [10:34:09] so my idea is to move 'archive' away fron /usr/share/GeoIP [10:34:29] (03PS7) 10Joal: Add WebrequestSubsetPartitioner spark job [analytics/refinery/source] - 10https://gerrit.wikimedia.org/r/468322 (https://phabricator.wikimedia.org/T164020) [10:35:09] fdans: I love that one: I'll take a luca't your patch :) [10:35:17] awesome [10:35:31] elukey: Opening a task for preemption :) [10:36:16] * elukey likes the lucat's [10:36:25] err luca't [10:36:26] :P [10:38:38] 10Analytics: Allow hadoop prod jobs to preempt resource over default queue - https://phabricator.wikimedia.org/T208208 (10JAllemandou) [10:39:11] elukey: --^ [10:39:34] ack! Sent an email as follow up for the team [10:43:40] Thanksel [10:43:47] Thanks elukey - sorry [10:51:23] 10Analytics, 10Contributors-Analysis, 10Product-Analytics: Set up automated email to report completion of mediawiki_history snapshot and Druid loading - https://phabricator.wikimedia.org/T206894 (10JAllemandou) If we want an email to be sent once data is available on the cluster, no need to create a new oozi... [11:05:43] fdans: stat1007 -> Notice: Applied catalog in 40.39 seconds [11:05:45] \o/ [11:06:02] wooo blaaaazing elukey :D [11:06:41] 10Analytics, 10Analytics-Kanban, 10Patch-For-Review: Geoip data archive repository cause puppet to run for minutes - https://phabricator.wikimedia.org/T208028 (10elukey) ``` elukey@stat1007:~$ sudo puppet agent -tv Info: Using configured environment 'production' Info: Retrieving pluginfacts Info: Retrieving... [11:06:48] 10Analytics, 10Analytics-Kanban, 10Patch-For-Review: Geoip data archive repository cause puppet to run for minutes - https://phabricator.wikimedia.org/T208028 (10elukey) [11:07:40] addshore: o/ so any issue in moving ::statistics::wmde to stat1007? [11:11:45] Hey team - going for special lunch with family - will be off for a few hours [11:41:12] going out for lunch + errand! [12:16:19] (03PS4) 10Fdans: Handle null name values in top metrics from UI [analytics/wikistats2] - 10https://gerrit.wikimedia.org/r/468964 (https://phabricator.wikimedia.org/T206968) [13:20:13] o/ Is there an API for getting the list of most edited 100 (enwiki) articles in the past month? [13:47:16] (03CR) 10Ottomata: [C: 031] "I like" (031 comment) [analytics/refinery/source] - 10https://gerrit.wikimedia.org/r/468322 (https://phabricator.wikimedia.org/T164020) (owner: 10Joal) [13:55:48] elukey: hiiii o/ [13:55:57] wanna rebalance some webrequest text partition leadership with me? [13:56:04] it should be easy [13:56:10] just wanna do it with a second pair of eyes [13:59:22] ottomata: just got back! [13:59:31] if you can wait 5 mins I'll be there [13:59:50] ya not quite ready still getting commands etc. ready [13:59:52] but shoudl be soon [14:10:16] bmansurov: https://stats.wikimedia.org/v2/#/en.wikipedia.org/contributing/top-edited-pages/normal|table|1-Month|~total [14:10:26] fdans: thanks! [14:16:02] ottomata: ack I am ready [14:16:08] ready too [14:16:17] do you want to bc? 
[14:16:19] bc [14:16:19] ua [14:16:20] ya [14:27:31] !log ran kafka-preferred-replica-election on kafka jumbo-eqiad cluster (this successfully rebalanced webrequest_text partition leadership) T207768 [14:27:34] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [14:27:34] T207768: Make sure webrequest_text preferred partition leadership is balanced - https://phabricator.wikimedia.org/T207768 [14:28:10] hey team! [14:34:15] helloooo mforns [14:34:24] heya fdans :] [14:46:06] elukey: thought: how long do you usually wait between rebooting kafka nodes? [14:46:12] when you have to do a cluster reboot? [14:46:47] 10Analytics: Make sure webrequest_text preferred partition leadership is balanced - https://phabricator.wikimedia.org/T207768 (10Ottomata) Interesting! Today Luca and I were about to move partition leadership using `kafka reassign-partitions`, but we noticed that the replica assignment actually looked correct,... [14:47:00] 10Analytics, 10Analytics-Kanban: Make sure webrequest_text preferred partition leadership is balanced - https://phabricator.wikimedia.org/T207768 (10Ottomata) [14:47:16] ottomata: usually when all metrics recover I start with the next host [14:47:27] say 5/10 minutes or more [14:48:01] recover meaning what? do you see data flowing back into the broker? [14:48:14] if the auto rebalance is only considered once every 5 minutes [14:48:38] maybe somehow a premature broker restart could cause balanced leadership to get slightly out of whack? [14:48:52] when I see traffic reshaping and other metrics like in-sync-replicas etc.. getting to flat zero agin [14:48:55] *again [14:48:56] aye [14:49:07] i wonder if ISRs are all back, but election has happened [14:49:17] hasn't* [14:49:19] has not happened yet [14:49:30] but $something must happen since I clearly see traffic getting back to the broker rebooted [14:49:38] right [14:49:45] some but not all? [14:49:58] maybe an auto rebalance was triggered before ALL of the ISRs are back? [14:49:59] and also we have a graph related to partitions assigned to brokers [14:50:14] hm, and it was 1006 that was missing leaders [14:50:17] lemme find it [14:50:18] which is the last broker you restart right? [14:50:20] so, what if [14:50:41] what if you restart 1006, and as it is coming back up, some partitions are back in the ISR, but not all, probably not webrequest text ones yet [14:50:44] because those ones take the longest [14:50:53] then, 300 seconds pass since the last auto rebalance consideration [14:50:58] https://grafana.wikimedia.org/dashboard/db/kafka?refresh=5m&panelId=20&fullscreen&orgId=1 [14:51:03] and kafka sees that > 10% of leadership is unbalanced [14:51:07] so it triggers an election. 
[14:51:12] it then rebalances everything that is in ISR [14:51:25] but, not all of webrequest_text is yet in sync, beacuse it is still replicating from the last reboot [14:51:38] then 1006 gets fully synced back into ISR [14:51:40] but at that point [14:51:47] the unbalanced % is < 10% [14:51:52] so no future elections are triggered [14:51:56] that would explain this pretty well [14:52:54] 10Analytics, 10Operations, 10ops-eqiad: Degraded RAID on aqs1006 - https://phabricator.wikimedia.org/T206915 (10Cmjohnson) The disk is being sent and should arrive today or tomorrow [14:53:00] ah so webrequest gets late to the party due to the time taken by the replication [14:53:08] right [14:53:18] it could be a good explanation [14:53:32] ya elukey cool, this graph is helpful [14:53:39] so in this case, a round of reboots should always complete (to be sure) with a preferred replica election [14:53:42] i can see that before and after election today [14:53:52] 3 leaderships moved to 1006 [14:53:54] and we know which ones [14:54:04] VirtualPageView, and webrequest_text 0 and 6 [14:54:07] yep [14:54:17] 2 from 1002 and 1 from 1005 [14:54:50] so elukey i think all we should do is just add a step to the reboot procedure [14:54:54] once all nodes have rebooted [14:55:11] wait til all partitions are in ISR [14:55:15] all replicas * [14:55:20] and then do an election [14:55:40] yes +2 [14:55:53] (03PS1) 10Fdans: Release 2.4.6 [analytics/wikistats2] - 10https://gerrit.wikimedia.org/r/470409 [14:56:05] I was used to do it after every reboot before auto-rebalance [14:57:11] 10Analytics, 10Analytics-Kanban: Update pageview_hourly to include timestamp for better druid indexation - https://phabricator.wikimedia.org/T208230 (10fdans) p:05Triage>03High [14:58:24] (03CR) 10Fdans: [V: 032 C: 032] Release 2.4.6 [analytics/wikistats2] - 10https://gerrit.wikimedia.org/r/470409 (owner: 10Fdans) [14:59:19] 10Analytics, 10Analytics-Kanban: Make sure webrequest_text preferred partition leadership is balanced - https://phabricator.wikimedia.org/T207768 (10Ottomata) Oo, here's a plausible explanation. kafka-jumbo1006 was the only broker that was missing some of its leaders. It is usually the last nod e to be reboo... [15:01:02] a-team: standdup [15:02:00] ping joal [15:02:16] ping mforns [15:03:40] 10Analytics, 10Analytics-Kanban: Make sure webrequest_text preferred partition leadership is balanced - https://phabricator.wikimedia.org/T207768 (10Ottomata) I updated wikitech docs: https://wikitech.wikimedia.org/w/index.php?title=Kafka%2FAdministration&action=historysubmit&type=revision&diff=1807237&oldid=1... [15:19:30] aaaaagh! dst change [15:37:54] haha [15:53:13] 10Analytics: Allow hadoop prod jobs to preempt resource over default queue - https://phabricator.wikimedia.org/T208208 (10fdans) p:05Triage>03Normal [15:53:22] 10Analytics: Allow hadoop prod jobs to preempt resource over default queue - https://phabricator.wikimedia.org/T208208 (10fdans) a:03Ottomata [15:55:21] 10Analytics, 10Analytics-Kanban, 10Patch-For-Review: Allow hadoop prod jobs to preempt resource over default queue - https://phabricator.wikimedia.org/T208208 (10Ottomata) [16:00:19] 10Analytics, 10Analytics-Wikistats: Restore WikiStats features disabled for mere performance reasons - https://phabricator.wikimedia.org/T44318 (10fdans) @Nemo_bis hey, is there a list of metrics you have that we could maybe develop with the current infrastructure? 
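A rough sketch of the extra step for the reboot procedure agreed on above (around 14:55): after the last broker of a rolling reboot, wait until every replica is back in the ISR, and only then trigger a preferred-replica election. It shells out to the stock Kafka tools; the zookeeper connect string is an assumption, and WMF hosts wrap these scripts in a `kafka` helper command:
```
import subprocess
import time

# Zookeeper connect string (chroot path in particular) is an assumption.
ZOOKEEPER = "conf1004.eqiad.wmnet:2181/kafka/jumbo-eqiad"

def under_replicated():
    """List partitions whose ISR is smaller than the replica set; empty output means fully replicated."""
    out = subprocess.run(
        ["kafka-topics.sh", "--zookeeper", ZOOKEEPER,
         "--describe", "--under-replicated-partitions"],
        capture_output=True, text=True, check=True,
    )
    return out.stdout.strip()

# After the last broker of the rolling reboot is back, wait for replication to catch up...
while under_replicated():
    print("still under-replicated, waiting 60s...")
    time.sleep(60)

# ...and only then move leadership back to the preferred replicas.
subprocess.run(["kafka-preferred-replica-election.sh", "--zookeeper", ZOOKEEPER], check=True)
```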
[16:05:58] 10Analytics, 10Analytics-Kanban: Make sure webrequest_text preferred partition leadership is balanced - https://phabricator.wikimedia.org/T207768 (10Milimetric) p:05Triage>03High [16:09:36] 10Analytics, 10Contributors-Analysis, 10Product-Analytics: Set up automated email to report completion of mediawiki_history snapshot and Druid loading - https://phabricator.wikimedia.org/T206894 (10Milimetric) p:05Normal>03High [16:13:18] 10Analytics: Investigate AQS cassandra schema hash warninga - https://phabricator.wikimedia.org/T178832 (10Milimetric) @JAllemandou what's up with this task, that's what we wanted to know in grosking. [16:15:28] 10Analytics: Investigate AQS cassandra schema hash warninga - https://phabricator.wikimedia.org/T178832 (10Milimetric) p:05Low>03Normal [16:22:08] (03CR) 10Joal: [C: 032] "Merging to deploy tomorrow" [analytics/refinery/source] - 10https://gerrit.wikimedia.org/r/468322 (https://phabricator.wikimedia.org/T164020) (owner: 10Joal) [16:23:44] 10Analytics-Kanban: Rename insertion_ts to insertion_dt in pageview_whitelist tabler (convention) - https://phabricator.wikimedia.org/T208237 (10JAllemandou) [16:24:00] 10Analytics-Kanban: Rename insertion_ts to insertion_dt in pageview_whitelist tabler (convention) - https://phabricator.wikimedia.org/T208237 (10JAllemandou) a:03JAllemandou [16:27:32] (03PS2) 10Joal: Update pageview_whitelist fieldname for convention [analytics/refinery] - 10https://gerrit.wikimedia.org/r/469924 (https://phabricator.wikimedia.org/T208237) [16:28:06] (03Merged) 10jenkins-bot: Add WebrequestSubsetPartitioner spark job [analytics/refinery/source] - 10https://gerrit.wikimedia.org/r/468322 (https://phabricator.wikimedia.org/T164020) (owner: 10Joal) [16:28:12] (03PS9) 10Nuria: Memoizing results of state functions [analytics/wikistats2] - 10https://gerrit.wikimedia.org/r/468205 (https://phabricator.wikimedia.org/T207352) [16:30:10] joal elukey interesting, when we added partitions to the eventlogging_ReadingDepth topic, the preferred leaders were not balanced! [16:30:17] this time for real, not just needing an election [16:30:28] (03CR) 10Joal: [V: 032 C: 032] "Merging to deploy tomorrow" [analytics/refinery] - 10https://gerrit.wikimedia.org/r/469924 (https://phabricator.wikimedia.org/T208237) (owner: 10Joal) [16:30:38] i wonder if that was because the leadership of all some partitions were unbalanced when we added the partitions [16:30:47] anyway, there is one extra partition on 1002 that needs moved to 1003 [16:31:14] mmmmm [16:31:27] (03CR) 10Joal: [V: 032 C: 032] "Merging to deploy tomorrow" [analytics/refinery] - 10https://gerrit.wikimedia.org/r/357814 (https://phabricator.wikimedia.org/T164020) (owner: 10Joal) [16:31:53] (03CR) 10Nuria: "Francisco, let's make a PR to correct restbase documentation regarding nulls/anaonymous users, once that is done this can be deployed." [analytics/aqs] - 10https://gerrit.wikimedia.org/r/468927 (https://phabricator.wikimedia.org/T206968) (owner: 10Fdans) [16:32:08] (03CR) 10Joal: [V: 032 C: 032] "Merging to deploy tomorrow" [analytics/refinery] - 10https://gerrit.wikimedia.org/r/456654 (https://phabricator.wikimedia.org/T202489) (owner: 10Joal) [16:33:53] Hi [16:34:24] I was expecting that the chart at https://language-reportcard.wmflabs.org/cx2/#translations will be updated today, but it wasn't. [16:34:32] Is there a problem with report updater, or with the database? [16:35:35] ottomata: can we bump partition number on VirtualPageview? 
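For the question just above about bumping the partition count on VirtualPageView: increasing partitions is normally done with the stock kafka-topics tool, roughly as sketched below. The topic name, target count, and zookeeper string are assumptions:
```
import subprocess

# Increasing a topic's partition count (it can only go up, never down);
# topic name, target count, and zookeeper string are assumptions.
subprocess.run(
    ["kafka-topics.sh",
     "--zookeeper", "conf1004.eqiad.wmnet:2181/kafka/jumbo-eqiad",
     "--alter",
     "--topic", "eventlogging_VirtualPageView",
     "--partitions", "12"],
    check=True,
)
# Note: keyed producers will see their key-to-partition mapping change
# once the partition count changes.
```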
[16:36:46] yes [16:37:00] gonna move a ReadingDepth one first [16:37:35] !log reassigning eventlogging_ReadingDepth partition 0 from 1002,1004,1006 to 1003,1001,1005 to move preferred leadership from 1002 to 1003 [16:37:36] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [16:41:21] aharoni: the files read were last updated on teh 21st: https://analytics.wikimedia.org/datasets/periodic/reports/metrics/published_cx2_translations/published_cx2_translations.tsv [16:42:06] aharoni: so that is the most recent data [16:42:39] (03CR) 10Amire80: "OK, it's clear, this can be reviewed." [analytics/limn-language-data] - 10https://gerrit.wikimedia.org/r/469390 (https://phabricator.wikimedia.org/T207765) (owner: 10Amire80) [16:44:27] nuria: Thanks. I also spoke to milimetric , and he says that it should run later today, and if it's true, it's OK. Maybe it's related to a configuration change we did some time ago. [16:46:59] aharoni: configuration change? [16:52:48] nuria: https://gerrit.wikimedia.org/r/#/c/analytics/limn-language-data/+/465152/ [16:53:16] Pau the product manager asked to change the dates by which it works [16:54:21] joal, mforns - can you please take a look at https://phabricator.wikimedia.org/T189475 some time? [16:54:30] sure aharoni [16:56:08] (03CR) 10Nuria: [C: 032] Memoizing results of state functions (032 comments) [analytics/wikistats2] - 10https://gerrit.wikimedia.org/r/468205 (https://phabricator.wikimedia.org/T207352) (owner: 10Nuria) [16:56:12] (03CR) 10Nuria: Memoizing results of state functions [analytics/wikistats2] - 10https://gerrit.wikimedia.org/r/468205 (https://phabricator.wikimedia.org/T207352) (owner: 10Nuria) [16:59:16] a-team - I have merged a bunch of patches on refinery-source and refinery and plan to deploy them tomorrow - Please let me know if you have special patches for me to care [16:59:31] ok joal [17:00:58] joal, you said 10 billion web requests per day, right? Was that US billions or EU billions? [17:01:36] mforns: at peak we process 150.000 reqs oer sec [17:01:40] *pe [17:01:42] *per [17:02:23] sounds like 10 US billion [17:03:13] ya, to avoid issues you can put number per sec [17:04:30] mforns: i did that on my last talk [17:04:48] ok, makes sense [17:16:18] mforns: correct 10x10^12 [17:16:25] k :] [17:19:09] mforns: to contradict nuria, I like numbers over a period of time people can feel (a day is nice) [17:19:40] joal, nuria, the thing is, the sentence goes "logs that are ingested daily" [17:19:53] well... that would work as well [17:20:14] I just didn't want to use the same time unit measure to refer to EL events and web requests [17:21:10] I left it like: "10 billion (US) web request logs that are ingested daily" [17:21:15] mforns: no big deal though, whethere per minute or day or second [17:21:20] nuria, please re-ping if you do not like it [17:21:51] mforns: you could also add the peak rps if you want(maybe not in the abstract) [17:22:05] it's ok [17:22:35] joal, regarding "in summary, all data containing personal identifyers..." [17:22:49] how about "practically all data containing..." 
[17:24:20] mforns: I don't know about the meaning of 'practically' in english [17:24:36] joal, I'd say it means: in practical terms [17:24:37] In french youcould say 'pratiquement' to mean almost [17:24:54] or in practice [17:24:59] yea [17:25:16] but here in practical terms we store IPs since 15 years :)= [17:25:30] it's another way to say almost, hehehehe [17:25:42] So in the almost sense, I agree ;) [17:25:48] ok [17:25:52] xD [17:26:39] mforns: Thanks :) I don't mean to be overly precise, I just like when it's oiverly correct :) [17:27:31] joal, yes, if the talk is accepted, I can mention that in detail, but for the point of that sentence I think this is OK [17:27:41] For sure [17:30:14] mforns: given that you are in EU when giving that conference most attendees are not going to know about the billion (us) versus rest of the world but i do not think is a big deal either way, just be aware [17:32:19] nuria, ok [17:37:06] 10Analytics, 10Analytics-Wikistats: Wikistats Bug - Monthly overview's "Top editors" box links to mainspace instead of userspace - https://phabricator.wikimedia.org/T208247 (10Quiddity) [17:43:48] am surprised that moving this one ReadingDepth partition is taking so long! [17:43:52] its still doing the right thing [17:43:56] oh RIGHT there is a throttle.... [17:44:25] orr maybe the default is no throttle... [17:46:11] euh - question for you ottomata [17:46:22] ya [17:46:36] Why moving when talking about ReadingDepth, while it was not needed for Webreauest? [17:46:51] ottomata: a broker had 2 partitions? [17:46:52] the webrequest preferred leadership assignment was actually correct [17:47:03] ottomata: Ah? [17:47:09] it just wasn't balanced; 1002 has more actual leaders [17:47:17] but a rebalance caused the preferred leaders to take over [17:47:51] ok I think I get it [17:47:59] whereas when we created ReadingDepth, for some reason (maybe because some actual leadership wasn't balanced at the time?) the preferred leaders were not correctly balanced [17:48:12] Makes sense I understand [17:48:14] ok [17:48:14] aye [17:48:17] Thanks :) [17:48:47] Interestingly also, the leader for VirtualPageView also changed when you rebalanced :) [17:49:41] yes! 
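The partition move !logged at 16:37 above boils down to a small reassignment plan fed to kafka-reassign-partitions, where the first broker listed becomes the preferred leader. A hedged sketch (broker ids, file name, and zookeeper path are assumptions):
```
import json

# Reassignment plan: the first broker in the replica list becomes the preferred leader.
plan = {
    "version": 1,
    "partitions": [
        {"topic": "eventlogging_ReadingDepth", "partition": 0, "replicas": [1003, 1001, 1005]},
    ],
}

with open("reassign-readingdepth.json", "w") as f:
    json.dump(plan, f)

# Then, roughly:
#   kafka-reassign-partitions.sh --zookeeper <zk> \
#       --reassignment-json-file reassign-readingdepth.json --execute
# followed later by --verify; an optional --throttle limits replication bandwidth
# (the throttle wondered about in the conversation above).
```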
[17:49:44] interesting eh [18:00:03] 10Analytics, 10Analytics-EventLogging, 10EventBus, 10Services (watching): T206785: Modern Event Platform: Stream Intake Service: AJV usage security review - https://phabricator.wikimedia.org/T208251 (10Ottomata) p:05Triage>03Normal [18:12:10] milimetric: you are totally right that wrappers to teh functions works just fine [18:13:08] heh, now only if we knew why :) [18:30:20] * elukey off [18:54:26] 10Analytics, 10EventBus, 10Operations, 10Wikidata, and 7 others: WDQS Updater ran into issue and stopped working - https://phabricator.wikimedia.org/T207817 (10Smalyshev) [18:56:16] 10Analytics, 10Analytics-EventLogging, 10Wikipedia-iOS-App-Backlog: MobileWikiAppiOSSearch validation errors adding noise to EventLogging error - https://phabricator.wikimedia.org/T205910 (10chelsyx) [18:56:19] 10Analytics, 10Wikipedia-iOS-App-Backlog, 10iOS-app-Bugs, 10iOS-app-feature-Analytics, 10iOS-app-v6.1-Narwhal-On-A-Bumper-Car: Many errors on "MobileWikiAppiOSSearch" and "MobileWikiAppiOSUserHistory" - https://phabricator.wikimedia.org/T207424 (10chelsyx) [18:56:46] 10Analytics, 10Analytics-EventLogging, 10Wikipedia-iOS-App-Backlog: MobileWikiAppiOSSearch validation errors adding noise to EventLogging error - https://phabricator.wikimedia.org/T205910 (10chelsyx) Hi all, I closed this one as it has been fixed in T207424 [19:08:39] cool finished moving paritiotns in reading depth [19:08:39] https://grafana.wikimedia.org/dashboard/db/kafka-by-topic?orgId=1&var-datasource=eqiad%20prometheus%2Fops&var-kafka_cluster=jumbo-eqiad&var-kafka_broker=All&var-topic=eventlogging_ReadingDepth&from=1540834936497&to=1540840098160 [19:09:23] joal: i'm all for adding virtualpageview partitions [19:09:38] buuut, iwonder if it will really make a difference here [19:15:59] milimetric: all calls we do to those functions externally are prefixed with Router.blah and thus they work just fine [19:16:32] milimetric: do you want me to rework the code to add the decorator? it is really only 1 function that gets called externally [19:17:15] nuria: it's up to you, I reviewed the way you approached it too, and it was mostly fine as well [19:17:38] milimetric: ok, will rework the one function exposed one sec [19:30:35] 10Analytics, 10Analytics-Wikistats: Restore WikiStats features disabled for mere performance reasons - https://phabricator.wikimedia.org/T44318 (10Nemo_bis) >>! In T44318#4703281, @fdans wrote: > @Nemo_bis hey, is there a list of metrics you have that we could maybe develop with the current infrastructure? So... [19:43:26] (03PS10) 10Nuria: Memoizing results of state functions [analytics/wikistats2] - 10https://gerrit.wikimedia.org/r/468205 (https://phabricator.wikimedia.org/T207352) [19:56:25] (03PS11) 10Nuria: Memoizing results of state functions [analytics/wikistats2] - 10https://gerrit.wikimedia.org/r/468205 (https://phabricator.wikimedia.org/T207352) [19:56:29] 10Analytics, 10Analytics-Kanban: Clickstream dataset for Persian Wikipedia only includes external values - https://phabricator.wikimedia.org/T191964 (10Ladsgroup) Hello? 
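The memoization patch above is wikistats2 JavaScript, but the idea being discussed — wrap only the state functions whose results "deserve" caching, rather than decorating everything — can be sketched like this (Python is used only for consistency with the other examples here; all names are hypothetical):
```
import functools

def memoize(fn):
    """Cache results per argument tuple; a plain wrapper, not a blanket decorator."""
    cache = {}
    @functools.wraps(fn)
    def wrapper(*args):
        if args not in cache:
            cache[args] = fn(*args)
        return cache[args]
    return wrapper

def expensive_state_lookup(metric, project):
    # placeholder for a function whose result is worth caching
    return {"metric": metric, "project": project}

def cheap_state_lookup(metric):
    # deliberately left unwrapped: not every response should be cached
    return metric.upper()

# Only the function that "deserves" caching gets wrapped; external callers keep
# using the same name, so call sites do not change.
expensive_state_lookup = memoize(expensive_state_lookup)
```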
[19:57:18] milimetric: on 2nd thought i removed the decorator as i do not want to cache any possible response from the functions, just the ones that 'deserve' caching [20:01:08] k, sounds good nuria, I think it’s mostly a matter of style [20:23:42] 10Analytics, 10Analytics-EventLogging, 10EventBus, 10Security-Reviews, and 3 others: T206785: Modern Event Platform: Stream Intake Service: AJV usage security review - https://phabricator.wikimedia.org/T208251 (10mobrovac) Adding the security folks. I agree that code generation based on schemae can potent... [20:34:13] Thought you all might like https://www.dataengineeringpodcast.com/using-notebooks-as-the-unifying-layer-for-data-roles-at-netflix-with-matthew-seal-episode-54/ [20:43:30] ottomata: isn't this job kind of eating the whole capacity of the cluster? https://yarn.wikimedia.org/cluster/app/application_1540803787856_1567 [20:43:39] ottomata: i cannot even do selects ... [20:44:24] ottomata: it has 1000 reducers [20:46:27] groceryheist: hello, i think one of your selects is using a lot of capacity, is your user name nathante on hadoop? [20:46:47] nuria: hmmm, it is in the nice queue though [20:47:23] looking [20:48:46] nuria i would think that if you have a job submitted (hive query, whatever) [20:48:57] you might have to wait until a few of those reducers have finished [20:49:00] but you should get in [20:50:03] i see your hive query has been accepted and is running [20:50:06] but the mappers haven't started [20:51:43] ottomata: ok looking [20:53:50] ottomata: so queries with 1000 reducers on nice queue we think are tenable? (asking for real, not ironic question) [20:55:55] i am not sure, i would hope that something in the nice queue would allow your default job to be scheduled [20:56:11] maybe we need to make the nice queue preemptable too [20:56:32] that job doesn't seem to be making much progress [20:56:46] running reducers just sitting there? [20:56:57] so it is taking up space and not allowing new stuff in [20:58:25] ottomata: also I am a bit lost on how to see the code of the create table [20:58:34] ottomata: this https://yarn.wikimedia.org/proxy/application_1540803787856_1567/ [20:58:44] nuria there are prob a bunch of fair scheduler queue tweaks we could/should do there [20:58:45] here [20:58:47] ottomata: does not have the code for "create table" that is running [20:58:48] nuria: that i don't know either [20:58:50] oh [20:58:52] your code? [20:58:54] your query [20:59:02] ottomata: no, the one consuming resources [20:59:07] oh [20:59:12] yeah don't know how to get that either [20:59:22] maybe in hue...hm [21:00:11] ottomata: also these are all queries accepted taht are not running [21:00:13] *that [21:00:20] yes found it! [21:00:21] https://hue.wikimedia.org/jobbrowser/jobs/application_1540803787856_1567 [21:00:25] search page for [21:00:26] 'hive.query.string' [21:00:46] nuria: ? [21:01:10] ottomata: wow [21:02:33] ottomata: https://yarn.wikimedia.org/cluster/apps/ACCEPTED [21:05:02] yeah it's holding a lot of other stuff up too [21:05:16] accepted and RUNNING ones [21:05:24] RUNNING ones aren't getting assigned many resources to launch containers [21:05:37] ottomata: i seriously cannot see code for select, me feeling a bit handicapped [21:05:55] nuria: the offending jobs? [21:06:01] ottomata: yes [21:06:02] did you click on the hue link above^ [21:06:02] ?
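Besides the Hue job browser used in this conversation, one hedged way to see which running application is holding the cluster is the YARN ResourceManager REST API; the ResourceManager address below is a placeholder:
```
import requests

RM = "http://an-master1001.eqiad.wmnet:8088"  # placeholder ResourceManager address

resp = requests.get(f"{RM}/ws/v1/cluster/apps", params={"states": "RUNNING"}).json()
apps = (resp.get("apps") or {}).get("app") or []

# Sort by allocated memory to spot the applications holding the most resources.
for app in sorted(apps, key=lambda a: a.get("allocatedMB", 0), reverse=True)[:10]:
    print(app["id"], app["user"], app["queue"],
          app.get("allocatedMB"), app.get("allocatedVCores"), app["name"][:80])
```
This shows who is holding containers and in which queue; the full Hive query text still has to come from Hue's Metadata tab (hive.query.string) or from the application logs after the job finishes, as described below.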
[21:06:09] https://hue.wikimedia.org/jobbrowser/jobs/application_1540803787856_1567 [21:06:19] ah sorry [21:06:25] then go to the Metadata tab [21:06:30] and find on page for 'hive.query.string' [21:06:56] nuria: for prod/essential jobs [21:07:02] i think the change that joal suggested today would help [21:07:18] ottomata: ok, now i see it sorry, i was thinking it will be in the logs! [21:07:28] nuria it will be in the application logs after it finishes [21:08:47] ottomata: right, ok, that query is a complex query over mw snapshot that I think has been killed several times, i can see many attempts [21:08:52] ya [21:08:59] i think the fair scheduler stuff dynamically updates [21:08:59] i [21:09:06] i'm going to merge this patch we talked about this morning [21:09:14] it won't help your job [21:09:17] but it will at least help the prod ones [21:09:56] ottomata: ok [21:10:22] groceryheist: please ping us if you are running this select CREATE TABLE nathante.readingDataModel_Stage2... [21:10:56] groceryheist: with no bounds of time or wikis is really consuming quite a bit of resources, it will likely get killed and we can probably work together to make it more efficient [21:26:43] nuria: hi [21:29:59] 10Analytics, 10Analytics-Kanban, 10Performance-Team (Radar): Possible statsv corruption? - https://phabricator.wikimedia.org/T189530 (10Krinkle) 05Open>03Resolved Looking at graphite1001 now: ``` krinkle at graphite1001.eqiad.wmnet in /var/lib/carbon/whisper $ ls -d -1 *_* eventlogging_client_errors_Na... [21:30:06] yeah I have been trying to optimize this job [21:30:12] had a chat with joal earlier [21:30:30] here's the code https://gist.github.com/groceryheist/46a634ed79ca8b1c9afae3ce6beb0c64 [21:30:48] sorry that it isn't preemptable. The reducers are getting pretty stuck [21:30:55] i'm going to have to go for a while [22:12:57] nuria: I think adding a bounds of time could work so that we take the most recent revision for pages that were last edited before the period where reading data was collected. [22:13:44] since we know that those edits will be correct. this might save us a bit. [22:14:16] i know this query is super huge, but we want the revision of the page that was viewed --- and this wasn't recorded in the event log. [22:14:24] we should probably file a bug for that. [22:50:32] groceryheist: yeah i have to kill this job yarrr [22:51:03] ok [22:51:05] we need to do work on our side to allow your job to run without starving the rest of the cluster [22:51:11] so this isn't totally your fault, sorry [22:51:16] makes sense [22:51:22] i know it's going to be a long job [22:52:36] oh did you just kill it? [22:52:40] yes [22:52:48] i'll add the time filter and then see [22:52:50] ok [22:52:52] thanks groceryheist [22:53:03] yeah you can now see https://yarn.wikimedia.org/cluster/scheduler [22:53:06] it's pretty important to add recording revisions to the user logging [22:53:08] that production jobs are filling up the cluster now [22:53:10] they were waiting to run [22:53:13] wow [22:53:30] that shouldn't happen though [22:53:32] yeah I assumed that running in the niceq would be enough [22:53:36] yeah it should be [22:53:39] but clearly it isn't!
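A hedged sketch of the kind of bounding discussed above (restricting the scan by snapshot, wiki, and time window before the expensive join). The table and field names follow the wmf.mediawiki_history conventions but should be treated as assumptions, and this is not the query from the gist:
```
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("bounded-history-example").getOrCreate()

bounded = (
    spark.table("wmf.mediawiki_history")
    .where(F.col("snapshot") == "2018-09")              # partition pruning: one snapshot only
    .where(F.col("wiki_db").isin("enwiki", "fawiki"))   # only the wikis actually needed
    .where(F.col("event_entity") == "revision")
    .where(F.col("event_timestamp") >= "2018-09-01")    # only the study window
)

# Materialising a small, bounded intermediate table keeps the later join
# (and its reducers) far smaller than scanning all wikis and all history.
(bounded
 .select("wiki_db", "page_id", "revision_id", "event_timestamp")
 .write.mode("overwrite")
 .saveAsTable("nathante.bounded_revisions_example"))
```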
[22:53:41] but not if my reducers never finish [22:53:45] that's why it isn't totally your fault [22:53:58] i'm not 100% when or how fairscheduler queue config updates get applied [22:54:05] it's possible they don't get applied to already submitted jobs, i don't know [22:54:11] so maybe the tweaks I merged earlier would help [22:54:16] but it's a little hard to say [22:54:24] they should make production jobs a little more aggressive at preempting [22:54:28] but it didn't seem to do much [22:54:28] I have a question (again, sorry ;) : what's the connection between EventLogging and the d [22:54:34] darn, crap [22:55:02] !log groceryheist killed a long running hive query that is now allowing backlogged production yarn jobs to finally execute [22:55:03] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [22:55:28] I'll try that again… what's the connection between EventLogging and Hadoop when it comes to testing on say betalabs or testwiki? I see some beta sites for some schemas, but not all [22:56:22] I'm asking because I'm wondering if there's a way for us to get data for Schema:EditorJourney so I can write queries against it, without having to have it deployed [22:56:28] No connection, since Hadoop doesn't run in Cloud VPS [22:56:37] hm [22:56:53] Nettrom: it might be possible for us to have a variable whitelist for beta vs prod [22:57:04] so that your events would go into the MySQL db in beta, but not in prod [22:57:07] and you could query there, but [22:57:15] the format of the table would be slightly different [22:57:26] so your queries would have to change for hive in prod [22:58:59] Ah, I see. We're looking to grab data from multiple schemas and such, so I'm not sure how feasible it'll be for us. [22:59:17] And that's alright, I was mainly trying to figure out what the landscape looked like, the documentation didn't seem to cover this scenario [22:59:33] Nettrom: the data still comes in, so you can consume it from kafka or log files [23:02:49] ottomata: ah, so the documentation related to grabbing the Hadoop raw data applies, regardless of whether the schema is used in prod or not? [23:04:14] so for beta, if the instrumentation is deployed to a beta site, then events come into the EventLogging system in beta [23:04:27] the message backend for EventLogging in both beta and in prod is Kafka [23:04:43] (different Kafka clusters, but still Kafka) [23:05:02] it's just the destination data stores that are different [23:05:04] ottomata: got it! thanks, I'll dig into that a bit [23:05:09] the same MySQL event schema whitelist is used in both beta and prod [23:05:27] so only schemas in that whitelist make it into the EventLogging MySQL DB in either place [23:05:40] but all events in both places go to Kafka
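A minimal sketch of the "consume it from kafka" option mentioned above, using kafka-python. The broker address is a placeholder (beta and prod use different Kafka clusters), and the topic follows the eventlogging_<SchemaName> naming convention:
```
import json
from kafka import KafkaConsumer  # kafka-python

consumer = KafkaConsumer(
    "eventlogging_EditorJourney",                        # eventlogging_<SchemaName> convention
    bootstrap_servers=["kafka-placeholder.example:9092"],  # placeholder; beta and prod brokers differ
    auto_offset_reset="latest",
    value_deserializer=lambda v: json.loads(v.decode("utf-8")),
)

for message in consumer:
    capsule = message.value
    # The EventLogging capsule wraps the schema-specific fields under "event".
    print(capsule.get("schema"), capsule.get("wiki"), capsule.get("event"))
```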