[01:43:34] <wikibugs>	 (03CR) 10BryanDavis: [C: 03+1] database field: if the field is NULL, don't fill it in with None [analytics/quarry/web] - 10https://gerrit.wikimedia.org/r/674697 (https://phabricator.wikimedia.org/T264254) (owner: 10Bstorm)
[05:30:19] <wikibugs>	 10Analytics, 10Analytics-Wikistats, 10Inuka-Team, 10Language-strategy, and 2 others: Have a way to show the most popular pages per country - https://phabricator.wikimedia.org/T207171 (10lexnasser) The pageviews/top-per-country endpoint is now public! Take a look at the documentation [here](https://wikimedi...
[07:00:25] <elukey>	 good morning
[07:00:30] <elukey>	 interesting (hadoop tesT)
[07:00:31] <elukey>	 Application is added to the scheduler and is not yet activated. User's AM resource limit exceeded.  Details : AM Partition = <DEFAULT_PARTITION>; AM Resource Request = <memory:9011, vCores:1>; Queue Resource Limit for AM = <memory:34863, vCores:1>; Us
[07:00:37] <elukey>	 this is spark refine
[07:01:22] <elukey>	 it may be yarn.scheduler.capacity.maximum-am-resource-percent
[07:15:31] <elukey>	 ok better
[07:20:38] <elukey>	 I just re-ran (in hadoop test) refine to backfill a couple of days, and I noticed this in the yarn ui (for the analytics queue)
[07:20:41] <elukey>	 Queue State: 	RUNNING
[07:20:44] <elukey>	 Used Capacity: 	103.3%
[07:20:46] <elukey>	 Configured Capacity: 	40.0%
[07:20:49] <elukey>	 Configured Max Capacity: 	100.0% 
[07:27:00] <elukey>	 so maybe we could, initially, use a more generous user-limit-factor (now it is 2 IIUC)
[07:27:17] <elukey>	 and tune it later on
[07:27:55] <elukey>	 anyway, I am starting to like the capacity scheduler
[07:28:11] <elukey>	 it is more difficult to manage at the beginning, but way more powerful than fair for sure
[08:01:46] <joal>	 I fully agree elukey on the more powerfull :)
[08:01:52] <joal>	 Good morning :)
[08:04:37] <elukey>	 bonjour :)
[08:05:06] <elukey>	 joal: if you are ok I'd bump the maximum-am-res-percent to 0.5, and possibly a little bit the user-limit-factor?
[08:05:22] <elukey>	 otherwise if you have other ideas I am all ears
[08:06:03] <joal>	 elukey: ok for am, I however don't understand the reason for user-limit-factor
[08:06:14] <joal>	 elukey: how have we set that limit?
[08:07:48] <wikibugs>	 10Analytics, 10Analytics-Wikistats, 10Inuka-Team, 10Language-strategy, and 2 others: Have a way to show the most popular pages per country - https://phabricator.wikimedia.org/T207171 (10JAllemandou) Awesome work @lexnasser :) This new endpoint is worth a blog post IMO :)
[08:08:16] <elukey>	 joal: I think that I added 2 as starting value but we didn't follow up in code reviews etc..
[08:08:28] <elukey>	 I don't find a trace of me and you deciding "2"
[08:08:35] <elukey>	 so it was more a placeholder
[08:10:54] <joal>	 elukey: From the task, we said we needed to set it, you suggested 2 in our first draft, and I thought it was ok as a start and we haven't changed it
[08:11:02] <joal>	 elukey: let's think anew :)
[08:12:09] <elukey>	 joal: IIUC a user can take max-percent-queue * user-limit
[08:12:32] <elukey>	 !log upgrade hive packages in thirdparty/bigtop15 to 2.3.6-2 for buster-wikimedia
[08:12:36] <stashbot>	 Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[08:12:44] <joal>	 elukey: min-percent * user-limit I think
[08:14:04] <elukey>	 yes sorry
[08:14:32] <elukey>	 !log upgrade hive packages on stat100x to 2.6.3-2 - T276121
[08:14:35] <stashbot>	 Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[08:14:36] <stashbot>	 T276121: asoranking timer failed on stat1007 - https://phabricator.wikimedia.org/T276121
[08:15:01] <joal>	 elukey: from this https://blog.cloudera.com/yarn-capacity-scheduler/, it is explained that user-limit is about resource sharing when many users compete for resources
[08:15:25] <joal>	 elukey: In our case, we're more about making sure we can use elasticity
[08:16:02] <joal>	 elukey: For prod queues, we should have a high user-limit (small number of users, maximum elasticity)
[08:16:25] <joal>	 elukey: however for users queues we may wish to have the number not too high, so that resources are shared
[08:16:44] <joal>	 elukey: let me know if I'm nonsensical :)
[08:17:24] <elukey>	 joal: I mean user-limit-factor
[08:17:41] <elukey>	 The multiple of the queue capacity which can be configured to allow a single user to acquire more resources. By default this is set to 1 which ensures that a single user can never take more than the queue’s configured capacity irrespective of how idle the cluster is. Value is specified as a float. 
[08:18:04] <elukey>	 it also limits elasticity IIUC
[08:18:09] <joal>	 elukey: also, about am-resource-percente, the test cluster needs a higher value as there is scarce resource, leading to higher percentage used for AM
[08:18:30] <joal>	 elukey: indeed, user-limit is about limiting elasitcity
[08:18:46] <joal>	 elukey: BUT, our aim is to try to allow it within acceptable ranges
[08:20:09] <elukey>	 joal: sure makes sense, but we'd also need to make sure that the cluster gets used, this is why I was  trying to propose a less conservative value in the beginning, but we can do anything
[08:20:26] <joal>	 elukey: I completely agree
[08:20:39] <elukey>	 joal: what values do you propose for user/prod? 
[08:21:37] <elukey>	 also for the am-percent you are absolutely right, maybe we can use 0.5 for test and 0.2/0.3 for prod, but maybe we can leave 0.5 just in case for both
[08:22:04] <joal>	 elukey: I think 0.1 is ok for prod
[08:22:09] <joal>	 am-percent sorry --^
[08:22:22] <joal>	 about user-limit - for prod we can go to 5h
[08:22:42] <joal>	 elukey: 5 for analytics means that jobs in that queue can take the entire cluster
[08:23:15] <joal>	 elukey: for product however it means they can take 1/3 of the cluster
[08:25:17] <elukey>	 joal: I'd keep the am-% to 0.5 in the beginning, we can refine later on, just to avoid weird issues
[08:25:31] <joal>	 elukey: for users, a factor 2 is enough IMO - User-default queue capacity is Tb (10Tb * 0.4 * 0.8 = 3.2Tb) - With a factor 2 a user can get 6Tb which is way more than we wish to accept as a spark job for instance :)
[08:25:52] <elukey>	 okok makes sense
[08:28:26] <joal>	 elukey: for prod, let's make 5 for analytics, 5 for serarch, 10 for product and 10 for ingest
[08:28:46] <joal>	 elukey: higher elasticity allowed for smaller queues --^
[08:29:13] <elukey>	 ok so I'll split the value in multiple ones (Rather than only for "production")
[08:29:28] <joal>	 elukey: arf - sorry fior that
[08:29:39] <joal>	 elukey: we can also make it 5 for production overall, it should be ok
[08:29:54] <elukey>	 nono it takes a min, better to be granular!
[08:32:03] <elukey>	 https://gerrit.wikimedia.org/r/c/operations/puppet/+/674806
[08:32:04] <elukey>	 :)
[08:32:09] <joal>	 cool :)
[08:33:52] <elukey>	 joal: green light from you? 
[08:33:57] <elukey>	 I'll deploy to hadoop test if so
[08:34:15] <joal>	 elukey: About AM, I think we have the 0.1 limit set in prod with FAIr (just pushing my point :-P)
[08:35:25] <joal>	 elukey: green-light from me - maybe a comment to explain why different values?
[08:35:26] <elukey>	 joal: if we get problems for this I'll blame you, just to go on records :D
[08:35:37] * elukey changes max am percent :D
[08:36:22] <elukey>	 I am going to add some comments as well
[08:36:34] <joal>	 elukey: for am in test, 0.5 is great (just to confirm)
[08:40:08] <elukey>	 nonono I am adding 0.1 now :D
[08:40:15] <elukey>	 and override 0.5 for test
[08:42:19] <joal>	 ack elukey - thanks for that :)
[08:45:07] <elukey>	 can't say no to Joseph
[08:46:54] <elukey>	 joal: ready for the review
[08:47:01] <elukey>	 I added some comments + the override
[08:47:07] <elukey>	 (pcc confirms that it works)
[08:47:18] <joal>	 ack - reviewing
[09:00:59] <elukey>	 joal: ok applied!
[09:04:21] <elukey>	 joal: yesterday evening we missed to chat, lemme know if I can help
[09:07:43] <joal>	 elukey: indeed, I'll need your help (after meetings :)
[09:42:53] <wikibugs>	 10Analytics: Mention QRank in “Analytics Datasets” - https://phabricator.wikimedia.org/T278416 (10Sascha)
[09:58:46] <wikibugs>	 10Analytics, 10WMDE-Analytics-Engineering, 10User-GoranSMilovanovic: Wikidata Usage and Coverage ETL failing from stat1004 - https://phabricator.wikimedia.org/T278299 (10GoranSMilovanovic) - The [[ https://wikidata-analytics.wmcloud.org/app/WD_percentUsageDashboard | Wikidata Usage & Covarage Dashboard ]] is...
[09:59:31] <wikibugs>	 (03CR) 10David Caro: [C: 03+1] "LGTM" (031 comment) [analytics/quarry/web] - 10https://gerrit.wikimedia.org/r/674723 (https://phabricator.wikimedia.org/T264254) (owner: 10Bstorm)
[10:13:18] <awight>	 Does superset include the mariadb wiki replicas?
[10:15:10] <joal>	 I don't think so awight 
[10:19:26] <awight>	 Thanks!  I'm running my query from the commandline but was imagining something more self-serve for the person who requested.
[10:28:55] <elukey>	 awight: the main problem for Superset is that we would not be able to hide the multi-db-host complexity like we do with analytics-mysql
[10:29:20] <elukey>	 (I assume that this is what you are using on stat100x right?)
[10:29:59] <elukey>	 the tool connects to 3 database hosts (dbstore100[3-5]) transparently to the user, meanwhile we'd need custom code to allow Superset to do the same
[10:30:46] <elukey>	 the alternative is to add all 3 db hosts to superset, but the user would need to know the mapping db -> host
[10:31:10] <elukey>	 like
[10:31:11] <elukey>	 elukey@stat1004:~$ analytics-mysql enwiki --print-target
[10:31:11] <elukey>	 dbstore1003.eqiad.wmnet:3311
[10:35:32] <awight>	 <3 TIL ^
[10:36:09] <awight>	 I've been editing .my.conf for every query, or going through a notebook with wmfdata
[10:38:29] <elukey>	 ahhh really??
[10:38:58] <elukey>	 so without the --print-target it just connects to the right host
[10:42:29] <wikibugs>	 10Analytics-Clusters, 10Patch-For-Review: Install Debian Buster on Hadoop - https://phabricator.wikimedia.org/T231067 (10elukey) Next steps:  - Complete the hadoop test upgrade (one worker remaining + masters) - Upgrade furud/flerovium - Upgrade hadoop masters - Upgrade hadoop coordinators (complicated, requir...
[10:43:54] <wikibugs>	 (03PS1) 10GoranSMilovanovic: T259105 [analytics/wmde/WD/WikidataAnalytics] - 10https://gerrit.wikimedia.org/r/674853
[10:44:08] <wikibugs>	 (03CR) 10GoranSMilovanovic: [V: 03+2 C: 03+2] T259105 [analytics/wmde/WD/WikidataAnalytics] - 10https://gerrit.wikimedia.org/r/674853 (owner: 10GoranSMilovanovic)
[10:53:56] <wikibugs>	 10Analytics-Clusters: Upgrade furud/flerovium to Debian Buster - https://phabricator.wikimedia.org/T278421 (10elukey)
[10:58:01] <wikibugs>	 10Analytics-Clusters: Upgrade the rest of the Hadoop test cluster to Buster - https://phabricator.wikimedia.org/T278422 (10elukey)
[11:05:13] <wikibugs>	 10Analytics-Clusters: Upgrade the Hadoop masters to Debian Buster - https://phabricator.wikimedia.org/T278423 (10elukey)
[11:14:11] <wikibugs>	 10Analytics-Clusters: Upgrade the Hadoop coordinators to Debian Buster - https://phabricator.wikimedia.org/T278424 (10elukey)
[11:17:30] <wikibugs>	 10Analytics, 10Datasets-General-or-Unknown: [Legal] Downloads license should mention CC0 for Analytics datasets - https://phabricator.wikimedia.org/T278409 (10Peachey88)
[11:22:34] <wikibugs>	 10Analytics: Cleanup cassandra keyspaces and host - https://phabricator.wikimedia.org/T278231 (10hnowlan) Keyspaces that aren't in use:  ` 434M    test_fdans_pageviews_per_project_v2 628K    test_joal 798M    test_lgc_pagecounts_per_project_test 35M     test_lgc_pageviews_per_project_test 144K    TEST_local_grou...
[11:26:48] * elukey lunch
[11:38:42] <icinga-wm>	 PROBLEM - AQS root url on aqs1012 is CRITICAL: connect to address 10.64.32.16 and port 7232: Connection refused https://wikitech.wikimedia.org/wiki/Analytics/Systems/AQS%23Monitoring
[11:39:16] <icinga-wm>	 PROBLEM - AQS root url on aqs1014 is CRITICAL: connect to address 10.64.48.62 and port 7232: Connection refused https://wikitech.wikimedia.org/wiki/Analytics/Systems/AQS%23Monitoring
[11:39:18] <icinga-wm>	 PROBLEM - AQS root url on aqs1015 is CRITICAL: connect to address 10.64.48.63 and port 7232: Connection refused https://wikitech.wikimedia.org/wiki/Analytics/Systems/AQS%23Monitoring
[11:39:32] <joal>	 WUT?
[11:39:56] <joal>	 hnowlan: Hellooooooo - would have ou changed stuff on cassandra-AQS
[11:39:57] <joal>	 ?
[11:41:29] <joal>	 AHHHH! New nodes
[11:41:43] <joal>	 excuse me for the invalid ping hnowlan :S
[11:42:36] <hnowlan>	 heh no worries, downtime expiring
[11:43:34] <icinga-wm>	 ACKNOWLEDGEMENT - AQS root url on aqs1012 is CRITICAL: connect to address 10.64.32.16 and port 7232: Connection refused Hnowlan New hosts, not pooled. https://wikitech.wikimedia.org/wiki/Analytics/Systems/AQS%23Monitoring
[11:43:34] <icinga-wm>	 ACKNOWLEDGEMENT - AQS root url on aqs1013 is CRITICAL: connect to address 10.64.32.136 and port 7232: Connection refused Hnowlan New hosts, not pooled. https://wikitech.wikimedia.org/wiki/Analytics/Systems/AQS%23Monitoring
[11:43:35] <icinga-wm>	 ACKNOWLEDGEMENT - AQS root url on aqs1014 is CRITICAL: connect to address 10.64.48.62 and port 7232: Connection refused Hnowlan New hosts, not pooled. https://wikitech.wikimedia.org/wiki/Analytics/Systems/AQS%23Monitoring
[11:43:35] <icinga-wm>	 ACKNOWLEDGEMENT - AQS root url on aqs1015 is CRITICAL: connect to address 10.64.48.63 and port 7232: Connection refused Hnowlan New hosts, not pooled. https://wikitech.wikimedia.org/wiki/Analytics/Systems/AQS%23Monitoring
[12:20:29] <hnowlan>	 joal: I'm hoping the cleanup will rectify this but just related to the conversation we had about disk usage and reporting - aqs1009-a.eqiad.wmnet's instance claims to have a load of 4.05 TB on an array of 2.9 TB
[12:20:57] <joal>	 hnowlan: those machines are VERY powerful :)
[12:23:37] <wikibugs>	 10Analytics: AQS Cassandra storage: Investigate incorrect storage report on Grafana - https://phabricator.wikimedia.org/T278234 (10JAllemandou)
[12:23:44] <joal>	 hnowlan: I have updated the task --^
[12:23:48] <hnowlan>	 nice
[12:44:13] <elukey>	 Just replied to analytics-alerts@ so whoever is in ops week will not freak out :D
[12:48:14] <hnowlan>	 oh did I cause spam to that? I'm not on it 
[12:51:18] <elukey>	 hnowlan: no problem it is a little mess, but some people read emails first without checking in here, so we reply to that list in case something has been handled
[12:51:54] <hnowlan>	 aha - were the messages the nagios alerts above? 
[12:52:09] <elukey>	 exactly yes
[12:59:41] <hnowlan>	 the nodetool cleanup has finished on aqs1004 - no change to space used either on disk or on cluster. Next thing is to see what a restart does to reported space usage over time I guess? There's also the question of whether the cluster needs a repair 
[13:00:58] <hnowlan>	 *aqs1004-a to be specific 
[13:02:05] <hnowlan>	 doing a drain and restart on a single instance shouldn't be risky - aqs1007-b is probably the best candidate because it's reporting a load of 5TB (!)
[13:12:38] <elukey>	 hnowlan: +1, if nodetool-b status is clean you can proceed
[13:13:07] <ottomata>	 dcausse: o/  got a sec to discuss event-utilities http stuff?
[13:13:22] <dcausse>	 ottomata: https://meet.google.com/ugw-nsih-qyw
[13:13:40] <elukey>	 hnowlan: but wouldn't it make more sense to restart aqs1004-a to see if clean + restart leads to any improvement?
[13:13:51] * elukey bbiab
[13:17:01] <hnowlan>	 elukey: ime a cleanup will have an immediate effect but it couldn't hurt to try 1004-a I think, it's also loaded with more data than it has on disk 
[13:27:06] <elukey>	 hnowlan: I am asking since we do (once every x months) a roll restart of cassandra for jvm upgrades, so I'd expect graphs to show a difference in usage over time if a restart clears the metrics out
[13:31:13] <wikibugs>	 10Analytics-Radar, 10Discovery-Search, 10MediaWiki-General: Proposal: drop avro dependency from mediawiki - https://phabricator.wikimedia.org/T265967 (10Mainframe98)
[13:34:42] <hnowlan>	 elukey: ohhh of course, we do the same for restbase. I'll see if there was an impact during those restarts 
[13:45:50] <wikibugs>	 10Analytics, 10Analytics-Wikistats, 10Inuka-Team, 10Language-strategy, and 2 others: Have a way to show the most popular pages per country - https://phabricator.wikimedia.org/T207171 (10Isaac) Thank you so much @lexnasser! This has been many years in the making and is truly excellent work doing the enginee...
[13:53:25] <elukey>	 !log systemctl restart performance-asotranking on stat1007 for T276121
[13:53:30] <stashbot>	 Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[13:53:30] <stashbot>	 T276121: asoranking timer failed on stat1007 - https://phabricator.wikimedia.org/T276121
[13:58:13] <isaacj>	 o/ I have a Pyspark notebook that was running fine before Hadoop upgrade but now crashes due to memory consumption. it's processing current wikitext for all of english wikipedia so big but certainly doable (it runs fine with simplewiki) and i'm running with the yarn-large setting from wmfdata. specifically i get this message: `Reason: Container killed by YARN for exceeding memory limits.  55.3 GB of 45.9 GB virtual memory used. 
[13:58:13] <isaacj>	 Consider boosting spark.yarn.executor.memoryOverhead or disabling yarn.nodemanager.vmem-check-enabled because of YARN-4714.` I'm wondering a) if this has been seen in other jobs? b) suggested fixes? some internet searching suggested that sometimes too much virtual memory is allocated by YARN even when its not being used but i'm hesitant to just throw resources at the problem as it wasn't happening before (and ideally i'd like to 
[13:58:13] <isaacj>	 run this job on all wikis simultaneously not just enwiki)
[14:06:05] <elukey>	 isaacj: o/ I think it is worth opening a task, with all the details so we can investigate properly
[14:08:10] <elukey>	 we do set yarn.scheduler.maximum-allocation-mb = ~49G in the yarn config, since we have some workers with 64G of ram (and we reserve some space for the os)
[14:09:28] <elukey>	 now I also recall that we had to set yarn.nodemanager.vmem-pmem-ratio = 5.1 for bigtop
[14:13:28] <elukey>	 ah ok  yarn.nodemanager.vmem-check-enabled was added as "true" in hadoop 2.10.1, now I recall
[14:13:44] <elukey>	 and by default the ratio is 2.1
[14:13:46] <fkaelin>	 There is def a problem, good timing isaacj - Martin and I were just discussing how to put together a  simple job that fails.  
[14:14:29] <elukey>	 well a container of that size is a little bit unusual to start with, but I can understand the use case
[14:15:52] <fkaelin>	 i first noticed job that eventually succeed have tasks that fail with memory errors, which I never noticed before the upgrade
[14:16:00] <fkaelin>	 e.g https://yarn.wikimedia.org/proxy/application_1615988861843_54885/jobs/job/?id=11
[14:18:27] <elukey>	 so an easy test could be to set yarn.nodemanager.vmem-check-enabled=false in yarn's config, or try to bump up the vmem-pmem ratio
[14:19:04] <elukey>	 isaacj: do we know why those containers are so big?
[14:19:24] <isaacj>	 elukey: sorry in meeting but will get back to this in ~10 minutes. thanks for looking into it!!!
[14:19:30] <elukey>	 np!
[14:19:45] <elukey>	 fkaelin: please chime in with any ideas if you have, I am all ears
[14:21:00] <elukey>	 with the new hadoop (2.10.1) we got extra protection on memory consumption, that seems good, not sure if buggy or not
[14:23:06] <elukey>	 something must have changed in the way yarn calculates the maximum-allocation-mb for sure, I suspect it was more lenient before
[14:24:03] <fkaelin>	 it does seem like a config issue, but it is somewhat transient. i will create a phab to investigate.
[14:25:26] <elukey>	 +1 thanks!
[14:26:38] <elukey>	 it could even be that, in Isaac's case, the container is in reality allocating ~10G on the heap but the virtual memory used is more than 5x, hence the kill 
[14:28:08] <elukey>	 the other risk that I see is that the linux OOM killer may interveen if we bump those limits too much
[14:28:25] <elukey>	 (so sigkill from the OS rather than from the nodemanager)
[14:34:36] <joal>	 elukey: IIRC we were having yarn.nodemanager.vmem-check-enabled=false in previous config
[14:37:00] <joal>	 about big containers: The spark executors of yarn-large use multi-cores executors, and there big ram settings per container
[14:39:32] <elukey>	 joal: I think it got set to true by default in 2.10.1 (at least, it is true in yarn defaults)
[14:39:50] <joal>	 elukey: I think it was true in previous versions as well, no?
[14:40:13] <isaacj>	 thanks fkaelin for taking on the phab task. i can add details from my job to that. if there's additional log output etc. that would be useful too elukey just let me know and i'll try to grab it
[14:41:34] <elukey>	 joal: yes you are right, just checked
[14:41:47] <isaacj>	 this was the job: https://yarn.wikimedia.org/cluster/app/application_1615988861843_51161
[14:41:51] <joal>	 elukey: I think we explicitely set it to false with andrew
[14:44:53] <elukey>	 joal: I recall that I had to bump up the ratio to 5.1 to overcome some spark issues though, weird
[14:45:05] <joal>	 :(
[14:45:25] <elukey>	 and also, it is not a bad check in my opinion
[14:49:42] <joal>	 hm
[14:49:53] <wikibugs>	 10Analytics, 10Analytics-Wikistats, 10Inuka-Team, 10Language-strategy, and 2 others: Have a way to show the most popular pages per country - https://phabricator.wikimedia.org/T207171 (10AMuigai) Piling on the thanks @lexnasser! Our team has been patiently waiting for this and glad you led it through to its...
[14:53:28] <elukey>	 joal: my fear is that at some point the oom killer kicks in, but it may not happen
[14:53:35] <elukey>	 we can try to disable it and see how it goes
[14:54:26] <joal>	 elukey: I don't know how OOMkiller works, if it uses physical memory or virtual memory - I do recall however that there was some issues with sdpark and virtual-mem while physical was still in boundaries
[14:54:47] <joal>	 elukey: given the setting was off, keeping it off might be ok :D
[14:54:53] * joal runs away fast
[14:56:18] <elukey>	 joal: I didn't find any proof that the setting was off, nor that it was removed
[14:56:32] <elukey>	 do you have a task?
[14:56:57] <joal>	 elukey: I on't
[14:56:59] <joal>	 elukey: I don't
[14:57:09] <elukey>	 the OOM may kick in when the OS needs to reclaim memory, because the overcommit was too much and eventually it reached a danger zone (too much memory actually used)
[15:04:51] <elukey>	 https://issues.apache.org/jira/browse/YARN-4714 doesn't really give any solid answer
[15:07:28] <ottomata>	 dcausse: they are a little hand wavy in parts of that uber article that left me guessing
[15:07:32] <ottomata>	 https://eng.uber.com/kafka/
[15:07:51] <ottomata>	 "Next, an all-active service coordinates the update services in the regions and assigns one primary region to update."
[15:08:00] <ottomata>	 what is the 'all-active' service in the diagram?
[15:08:31] <wikibugs>	 10Analytics: Memory errors in Spark - https://phabricator.wikimedia.org/T278441 (10fkaelin)
[15:10:17] <dcausse>	 ottomata: I don't really know
[15:10:19] <fkaelin>	 elukey created the phab ^ with a minimum example, i will leave the linked spark application running
[15:11:01] <elukey>	 fkaelin: so I have a patch ready, maybe we can quickly test it (if my team agrees with the change)
[15:11:12] <elukey>	 https://gerrit.wikimedia.org/r/c/operations/puppet/+/674910/
[15:11:47] <fkaelin>	 testing it 
[15:12:02] <elukey>	 if it doesn't work then we can just disable the vmem check
[15:12:37] <elukey>	 ottomata: o/ joal - anything against me testing the ratio 10.1?
[15:12:40] <elukey>	 (see the above change)
[15:13:00] <ottomata>	 Test away elukey  :)
[15:14:18] <wikibugs>	 10Analytics, 10Patch-For-Review: Memory errors in Spark - https://phabricator.wikimedia.org/T278441 (10Isaac) Thanks @fkaelin for creating this. My quick example that also has this issue (running on Newpyter on stat1008):  ` import wmfdata spark = wmfdata.spark.get_session(type='yarn-large')   spark.sql("SET h...
[15:17:20] <fkaelin>	 elukey this setting should be applied when passing it as a spark conf?
[15:18:07] <elukey>	 fkaelin: nono it is a global yarn config, I am rolling it out now (ETA 10 mins :)
[15:18:38] <fkaelin>	 ah that makes sense, thanks!
[15:19:51] <wikibugs>	 10Analytics, 10DBA, 10Event-Platform, 10WMF-Architecture-Team, 10Services (later): Consistent MediaWiki state change events | MediaWiki events as source of truth - https://phabricator.wikimedia.org/T120242 (10Ottomata) Another idea that may not be feasible:  Would it be possible to move the event produce...
[15:27:08] <elukey>	 ottomata: --^ the infinite saga
[15:28:34] <elukey>	 fkaelin: green light!
[15:31:27] <fkaelin>	  ```ExecutorLostFailure (executor 17 exited caused by one of the running tasks) Reason: Container killed by YARN for exceeding memory limits. 91.2 GB of 90.9 GB virtual memory used. Consider boosting spark.yarn.executor.memoryOverhead or disabling yarn.nodemanager.vmem-check-enabled because of YARN-4714. )```
[15:33:38] <fkaelin>	 the job itself is more likely to succeed, i ran it three times successfully. so memory error is less likely to kill the job, but they still happen
[15:33:43] <fkaelin>	 https://yarn.wikimedia.org/proxy/application_1615988861843_55351/
[15:40:13] <wikibugs>	 10Analytics, 10Better Use Of Data, 10Event-Platform, 10Product-Analytics, and 2 others: [MEP] [BUG] dt field in migrated client-side EventLogging schemas is not set to meta.dt - https://phabricator.wikimedia.org/T277330 (10nettrom_WMF) 05Open→03Resolved I've rerun my queries to check the timestamps in...
[15:41:32] <elukey>	 fkaelin: at this point I'll just disable the check, like Joseph mentioned, it doesn't make any sense to keep bumping the limit. Let's see how it works for some days, in case we can revisit
[15:41:36] <elukey>	 ETA 10 mins
[15:41:37] <elukey>	 :)
[15:42:54] <ottomata>	 elukey:  the consistency thing?  indeed atm it is just discussion and exploration.  iwas talkinig with david about it some and thought to add that idea
[15:43:30] <elukey>	 ottomata: yes yes I mean that loong task, I tried to read it and I saw all the attempts to find a shared solution
[15:43:35] <elukey>	 maybe a meeting could help?
[15:44:10] <ottomata>	 elukey:  indeed it just isn't time yet I think?  kate C thinks it should be a thing we put to the technical decision board, and i think i agree
[15:44:48] <ottomata>	 dcausse: qq
[15:44:49] <ottomata>	 in https://gerrit.wikimedia.org/r/c/wikimedia-event-utilities/+/672363/2/eventutilities/src/main/java/org/wikimedia/eventutilities/core/http/BasicHttpClient.java
[15:44:52] <ottomata>	 line 108
[15:45:05] <ottomata>	 instead of passing again target.getHostName, should it be customRoute?
[15:46:23] <dcausse>	 ottomata: hmm.. yes I think... :) 
[15:47:08] <fkaelin>	 Great, thanks elukey. The failing example job is basically a no-op, so if there are memory issues the cause seems to be spark internal issues- so if this is a sanity check that isn't working as intended disabling it seems ok 
[15:55:00] <elukey>	 fkaelin: ready for the second test
[15:58:08] <fkaelin>	 elukey looks much better, don't see the task errors in that toy job
[15:58:45] <elukey>	 !log disable vmemory checks in Yarn nodemanagers on Hadoop
[15:58:49] <elukey>	 fkaelin: perfect :)
[15:58:49] <stashbot>	 Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[15:59:07] <elukey>	 let's see if there will be any side effects during the next days
[15:59:15] <fkaelin>	 there is a single task failure with an NPE I noticed this morning, seems unrelated though  ? http://an-worker1085.eqiad.wmnet:8042/node/containerlogs/container_e03_1615988861843_55410_01_000023/fab/stderr/?start=0
[16:02:11] <elukey>	 fkaelin: yeah if it is a NPE I think it is unrelated
[16:03:24] <elukey>	 yes yes the stacktrace is unrelated (but not nice of course :D)
[16:07:40] <wikibugs>	 10Analytics: NullPointerException at beginning of spark job - https://phabricator.wikimedia.org/T278451 (10fkaelin)
[16:30:34] <wikibugs>	 10Analytics-Clusters: Upgrade Druid to 0.20.1 (latest upstream) - https://phabricator.wikimedia.org/T278056 (10Ottomata)
[16:33:21] <elukey>	 addshore: o/
[16:33:26] <elukey>	 around by any chance?
[16:33:29] <elukey>	 I see on stat1007
[16:33:30] <elukey>	 elukey@stat1007:~$ sudo journalctl -u wmde-toolkit-analyzer-build.service | grep Error
[16:33:33] <elukey>	 Mar 25 12:00:00 stat1007 java[23843]: Error: Invalid or corrupt jarfile /srv/analytics-wmde/graphite/src/toolkit-analyzer-build/toolkit-analyzer.jar
[16:34:53] <wikibugs>	 10Analytics, 10Better Use Of Data, 10Event-Platform, 10Product-Analytics, and 2 others: [MEP] [BUG] dt field in migrated client-side EventLogging schemas is not set to meta.dt - https://phabricator.wikimedia.org/T277330 (10kzimmerman) Thanks so much, @nettrom_WMF !
[16:38:20] <wikibugs>	 10Analytics: Add "did edit" field to pageview_actor - https://phabricator.wikimedia.org/T277785 (10Ottomata) p:05Triage→03High
[16:39:59] <wikibugs>	 10Analytics: Cleanup cassandra keyspaces and host - https://phabricator.wikimedia.org/T278231 (10Ottomata) p:05Triage→03Medium a:03hnowlan
[16:40:44] <wikibugs>	 10Analytics: Add "did edit" field to pageview_actor - https://phabricator.wikimedia.org/T277785 (10mforns) I think this would be super useful! We could use this information to filter out rows in reading data sets that we inted to make public, like pageviews per article per country. Not reporting on sessions that...
[16:40:52] <wikibugs>	 10Analytics-Clusters: AQS Cassandra storage: Investigate incorrect storage report on Grafana - https://phabricator.wikimedia.org/T278234 (10Ottomata)
[16:41:27] <addshore>	 Hi elukey 
[16:42:08] <addshore>	 Could you make a phab ticket and tag "wdwb-tech"? I just wrapped up for the day
[16:42:50] <wikibugs>	 10Analytics: Add "did edit" field to pageview_actor - https://phabricator.wikimedia.org/T277785 (10JAllemandou) Discussed today with the team: the change is cheap, we can implement it soon. One thing we would like to see happening before is a coverage analysis of how many edits we get  by using the described fil...
[16:44:29] <wikibugs>	 10Analytics: Review the usage of dns_canonicalize=false for Kerberos - https://phabricator.wikimedia.org/T278353 (10Ottomata) p:05Triage→03Medium
[16:47:48] <wikibugs>	 10Analytics, 10SRE: wmf-auto-restart.py + lsof + /mnt/hdfs may need to be tuned - https://phabricator.wikimedia.org/T278371 (10Ottomata) a:03elukey
[16:50:57] <wikibugs>	 10Analytics, 10Analytics-Kanban, 10Datasets-General-or-Unknown: [Legal] Downloads license should mention CC0 for Analytics datasets - https://phabricator.wikimedia.org/T278409 (10Ottomata) a:03Ottomata
[16:52:23] <elukey>	 addshore: ack!
[16:54:59] <wikibugs>	 10Analytics: Mention QRank in “Analytics Datasets” - https://phabricator.wikimedia.org/T278416 (10Ottomata) @ArielGlenn thoughts?    I suppose we could just add a link to this dataset on toolforge from  https://dumps.wikimedia.org/other/analytics/?  Should we?  Do we do that anywhere else on dumps (i.e. link to...
[16:55:35] <wikibugs>	 10Analytics, 10Analytics-Kanban, 10Patch-For-Review: Memory errors in Spark - https://phabricator.wikimedia.org/T278441 (10Ottomata)
[16:56:15] <wikibugs>	 10Analytics, 10Analytics-Kanban, 10Patch-For-Review: Memory errors in Spark - https://phabricator.wikimedia.org/T278441 (10Ottomata) a:03elukey
[17:00:37] <wikibugs>	 10Analytics: NullPointerException at beginning of spark job - https://phabricator.wikimedia.org/T278451 (10Ottomata) @fkaelin can you provide some code to repro? @JAllemandou will give this a quick look.
[17:01:12] <wikibugs>	 10Analytics-Radar, 10WMDE-Analytics-Engineering, 10User-GoranSMilovanovic: Wikidata Usage and Coverage ETL failing from stat1004 - https://phabricator.wikimedia.org/T278299 (10Ottomata)
[17:01:24] <wikibugs>	 10Analytics-Radar, 10Product-Analytics (Kanban): Hive table neilpquinn.toledo_pageviews missing almost all data - https://phabricator.wikimedia.org/T277781 (10Ottomata)
[17:29:16] <mforns>	 ottomata: looking at your code review, now
[17:35:15] <wikibugs>	 10Analytics, 10Analytics-Wikistats, 10Inuka-Team, 10Language-strategy, and 2 others: Have a way to show the most popular pages per country - https://phabricator.wikimedia.org/T207171 (10lexnasser)
[17:36:20] <wikibugs>	 10Analytics, 10Analytics-Wikistats, 10Inuka-Team, 10Language-strategy, and 2 others: Add more popular articles per country data to AQS - https://phabricator.wikimedia.org/T263697 (10lexnasser) 05Open→03Resolved Per the parent task, the pageviews/top-per-country endpoint is now public! Take a look at th...
[17:41:23] <hnowlan>	 I'm gonna drain and restart cassandra on aqs1004-a now if there are no objections
[17:49:44] <hnowlan>	 aha, bingo
[17:49:55] <hnowlan>	 before: UN  aqs1004-a.eqiad.wmnet  2.89 TB    256          25.7%             a6c7480a-7f94-4488-a925-0cff98c5841a  rack1
[17:50:05] <hnowlan>	 after: UN  aqs1004-a.eqiad.wmnet  1.76 TB    256          25.7%             a6c7480a-7f94-4488-a925-0cff98c5841a  rack1
[17:51:45] <wikibugs>	 10Analytics, 10Analytics-Kanban: Memory errors in Spark - https://phabricator.wikimedia.org/T278441 (10fkaelin)
[17:53:08] <hnowlan>	 gonna do aqs1004-b to see it happen on a non-cleanup'd node 
[18:08:33] <wikibugs>	 10Analytics, 10Product-Analytics, 10Research: Use Hive/Spark timestamps in Refined event data - https://phabricator.wikimedia.org/T278467 (10Ottomata)
[18:14:34] <wikibugs>	 10Analytics-Clusters: AQS Cassandra storage: Investigate incorrect storage report on Grafana - https://phabricator.wikimedia.org/T278234 (10hnowlan) This is at least partially a reporting bug (possibly [[ https://issues.apache.org/jira/browse/CASSANDRA-13738 | this one! ]]) Before restarts: ` root@aqs1004:~# nod...
[18:19:52] <wikibugs>	 10Analytics, 10Analytics-Kanban, 10Better Use Of Data: Optimize intermediate session length data set and dashboard - https://phabricator.wikimedia.org/T277512 (10mforns)
[18:24:30] <wikibugs>	 10Analytics, 10Analytics-Kanban, 10Better Use Of Data: Optimize intermediate session length data set and dashboard - https://phabricator.wikimedia.org/T277512 (10mforns) As the main problem that we had in the session length dashboard has been significantly mitigated by https://gerrit.wikimedia.org/r/672541,...
[18:56:29] <wikibugs>	 10Analytics, 10Product-Analytics: Hive Runtime Error - Query on event.MobileWikiAppDailyStats failing with errors - https://phabricator.wikimedia.org/T277348 (10mforns) a:05mforns→03JAllemandou
[19:24:35] <wikibugs>	 10Analytics, 10Pageviews-API: Pageviews API should allow specifying a country - https://phabricator.wikimedia.org/T245968 (10lexnasser) Hi @Yair_rand, a pageviews `top-per-country` AQS endpoint was just released ([docs](https://wikimedia.org/api/rest_v1/#/Pageviews%20data/get_metrics_pageviews_top_per_country_...
[19:45:35] <wikibugs>	 10Analytics: Add "did edit" field to pageview_actor - https://phabricator.wikimedia.org/T277785 (10Isaac) > I think this would be super useful! Yay! Glad to hear!  > May we let you sync with your team if this could be done on your end? Yeah, I'm happy to take that on as I have queries lying around that already d...
[20:10:26] <joal>	 ottomata: reviewed all patches, not everything in deep detail :)
[20:38:36] <ottomata>	 joal:  thank you!
[20:43:04] <wikibugs>	 (03CR) 10Mforns: Improve Refine failure report email (031 comment) [analytics/refinery/source] - 10https://gerrit.wikimedia.org/r/674304 (owner: 10Ottomata)
[20:48:31] <wikibugs>	 (03CR) 10Ottomata: "> Patch Set 3: Code-Review+1" [analytics/refinery/source] - 10https://gerrit.wikimedia.org/r/670269 (https://phabricator.wikimedia.org/T273789) (owner: 10Ottomata)
[21:19:05] <wikibugs>	 10Analytics-Radar, 10Editing-team, 10MediaWiki-Page-editing, 10Platform Engineering, and 2 others: EditPage save hooks pass an entire `EditPage` object - https://phabricator.wikimedia.org/T251588 (10daniel)
[21:21:08] <wikibugs>	 10Analytics-Radar, 10Editing-team, 10MediaWiki-Page-editing, 10Platform Engineering, and 2 others: EditPage save hooks pass an entire `EditPage` object - https://phabricator.wikimedia.org/T251588 (10daniel) I expect that we will resolve this as part of T275847. Moving to "watching" for now, for lack of a b...
[21:48:42] <wikibugs>	 (03CR) 10Bstorm: [C: 03+2] database field: if the field is NULL, don't fill it in with None [analytics/quarry/web] - 10https://gerrit.wikimedia.org/r/674697 (https://phabricator.wikimedia.org/T264254) (owner: 10Bstorm)
[21:49:21] <wikibugs>	 (03Merged) 10jenkins-bot: database field: if the field is NULL, don't fill it in with None [analytics/quarry/web] - 10https://gerrit.wikimedia.org/r/674697 (https://phabricator.wikimedia.org/T264254) (owner: 10Bstorm)
[21:52:43] <wikibugs>	 (03PS2) 10Bstorm: worker: update some worker code for errors and connections [analytics/quarry/web] - 10https://gerrit.wikimedia.org/r/674723 (https://phabricator.wikimedia.org/T264254)
[21:58:34] <wikibugs>	 (03CR) 10Bstorm: [C: 03+2] worker: update some worker code for errors and connections [analytics/quarry/web] - 10https://gerrit.wikimedia.org/r/674723 (https://phabricator.wikimedia.org/T264254) (owner: 10Bstorm)
[21:59:18] <wikibugs>	 (03Merged) 10jenkins-bot: worker: update some worker code for errors and connections [analytics/quarry/web] - 10https://gerrit.wikimedia.org/r/674723 (https://phabricator.wikimedia.org/T264254) (owner: 10Bstorm)
[23:23:48] <wikibugs>	 10Analytics, 10DBA, 10Event-Platform, 10WMF-Architecture-Team, 10Services (later): Consistent MediaWiki state change events | MediaWiki events as source of truth - https://phabricator.wikimedia.org/T120242 (10aaron) Keep in mind that, strictly speaking, some of these problems are not even solved in Media...