[00:09:34] 10Analytics, 10Analytics-Kanban, 10serviceops, 10Patch-For-Review, 10User-jijiki: Mechanism to flag webrequests as "debug" - https://phabricator.wikimedia.org/T263683 (10Milimetric) Ok, some of these things are easy and some are a bit harder. Instead of filtering out the requests at "refine" time, we wi... [00:50:29] 10Analytics-Radar, 10Better Use Of Data, 10Instrument-ClientError, 10Wikimedia-Logstash, and 4 others: Documentation of client side error logging capabilities on mediawiki - https://phabricator.wikimedia.org/T248884 (10Tgr) [01:00:56] 10Analytics, 10Analytics-Wikistats: pagecounts-ez uploads stopped after 9/24 - https://phabricator.wikimedia.org/T265378 (10Milimetric) The monthly totals are not available yet, see T265732. A quick status update: * we found some malformed rows in pageview_complete, and we're fixing them ** some had 5 column... [01:10:57] 10Analytics-Radar, 10MediaWiki-API, 10Patch-For-Review, 10Platform Team Initiatives (Modern Event Platform (TEC2)), 10User-Addshore: Run ETL for wmf_raw.ActionApi into wmf.action_* aggregate tables - https://phabricator.wikimedia.org/T137321 (10Milimetric) It's basically stalled on resourcing. There was... [04:41:14] 10Analytics, 10Analytics-Wikistats, 10Inuka-Team, 10Language-strategy, and 2 others: Have a way to show the most popular pages per country - https://phabricator.wikimedia.org/T207171 (10lexnasser) @JAllemandou @Isaac Thanks to both of you for going into deeper detail about `page_title` vs. `page_id`. I'm... [06:42:18] good morning [07:17:41] the patch for hive seems to work for bigtop! [07:40:05] created https://issues.apache.org/jira/browse/BIGTOP-3455 to get it included in bigtop 1.5 [07:40:47] 10Analytics-Clusters, 10Patch-For-Review: Review an-coord1001's usage and failover plans - https://phabricator.wikimedia.org/T257412 (10elukey) It seems to work! Created https://issues.apache.org/jira/browse/BIGTOP-3455 to ask if the patch can be included for Bigtop 1.5. 
[07:46:22] ok I am going to rebuild the hive packages with a -2 at the end, to differentiate them from the upstream ones [08:54:39] 10Analytics: Avro Deserializer logging set to DEBUG in pyspark lead to huge yarn stderr container files (causing disk usage alerts) - https://phabricator.wikimedia.org/T268376 (10elukey) @GoranSMilovanovic I think I may have found the issue, I fixed stat1005. Can you retry a normal execution of your script? >>!... [10:16:40] elukey: hello :D https://gerrit.wikimedia.org/r/c/operations/puppet/+/643448 [10:17:13] only 420 replacements left, don't worry [10:21:11] Thanks! [10:21:44] Amir1: well thank you :) just completed the puppet runs via cumin, no op as expected [10:22:23] Awesome [10:44:37] 10Analytics: Alter table for navigation timing errors out in Hadoop test - https://phabricator.wikimedia.org/T268733 (10elukey) [10:45:11] 10Analytics: Alter table for navigation timing errors out in Hadoop test - https://phabricator.wikimedia.org/T268733 (10elukey) [10:53:42] * elukey brb [11:02:03] PROBLEM - aqs endpoints health on aqs1008 is CRITICAL: /analytics.wikimedia.org/v1/pageviews/per-article/{project}/{access}/{agent}/{article}/{granularity}/{start}/{end} (Get per article page views) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs [11:02:17] PROBLEM - aqs endpoints health on aqs1007 is CRITICAL: /analytics.wikimedia.org/v1/pageviews/per-article/{project}/{access}/{agent}/{article}/{granularity}/{start}/{end} (Get per article page views) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs [11:03:27] PROBLEM - aqs endpoints health on aqs1009 is CRITICAL: /analytics.wikimedia.org/v1/pageviews/top-by-country/{project}/{access}/{year}/{month} (Get top countries by page views) timed out before a response was received: /analytics.wikimedia.org/v1/mediarequests/per-file/{referer}/{agent}/{file_path}/{granularity}/{start}/{end} (Get per file
requests) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs [11:08:35] RECOVERY - aqs endpoints health on aqs1009 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs [11:14:13] RECOVERY - aqs endpoints health on aqs1008 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs [11:14:29] RECOVERY - aqs endpoints health on aqs1007 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs [11:16:49] ah lovely datasource drops? [11:19:12] ah no it is pageviews [11:19:17] so not druid [11:19:59] https://grafana.wikimedia.org/d/000000526/aqs?orgId=1 wow [11:21:19] PROBLEM - aqs endpoints health on aqs1008 is CRITICAL: /analytics.wikimedia.org/v1/pageviews/per-article/{project}/{access}/{agent}/{article}/{granularity}/{start}/{end} (Get per article page views) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs [11:21:35] PROBLEM - aqs endpoints health on aqs1007 is CRITICAL: /analytics.wikimedia.org/v1/pageviews/per-article/{project}/{access}/{agent}/{article}/{granularity}/{start}/{end} (Get per article page views) timed out before a response was received: /analytics.wikimedia.org/v1/mediarequests/per-file/{referer}/{agent}/{file_path}/{granularity}/{start}/{end} (Get per file requests) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs [11:22:43] PROBLEM - aqs endpoints health on aqs1009 is CRITICAL: /analytics.wikimedia.org/v1/pageviews/top-by-country/{project}/{access}/{year}/{month} (Get top countries by page views) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs [11:29:35] RECOVERY - aqs endpoints health on aqs1009 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs [11:29:57] RECOVERY - aqs endpoints health on aqs1008 is
OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs [11:30:11] RECOVERY - aqs endpoints health on aqs1007 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs [11:40:07] PROBLEM - aqs endpoints health on aqs1009 is CRITICAL: /analytics.wikimedia.org/v1/mediarequests/per-file/{referer}/{agent}/{file_path}/{granularity}/{start}/{end} (Get per file requests) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs [11:40:33] PROBLEM - aqs endpoints health on aqs1008 is CRITICAL: /analytics.wikimedia.org/v1/mediarequests/per-file/{referer}/{agent}/{file_path}/{granularity}/{start}/{end} (Get per file requests) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs [11:45:59] PROBLEM - aqs endpoints health on aqs1007 is CRITICAL: /analytics.wikimedia.org/v1/pageviews/top-by-country/{project}/{access}/{year}/{month} (Get top countries by page views) timed out before a response was received: /analytics.wikimedia.org/v1/pageviews/per-article/{project}/{access}/{agent}/{article}/{granularity}/{start}/{end} (Get per article page views) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs [11:46:57] RECOVERY - aqs endpoints health on aqs1009 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs [11:47:21] RECOVERY - aqs endpoints health on aqs1008 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs [11:47:35] RECOVERY - aqs endpoints health on aqs1007 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs [12:04:33] PROBLEM - aqs endpoints health on aqs1009 is CRITICAL: /analytics.wikimedia.org/v1/mediarequests/per-file/{referer}/{agent}/{file_path}/{granularity}/{start}/{end} (Get per file requests) timed out before a response was received:
/analytics.wikimedia.org/v1/pageviews/per-article/{project}/{access}/{agent}/{article}/{granularity}/{start}/{end} (Get per article page views) timed out before a response was received: /analytics.wikimedia.org/v1/pageviews/top-by-country/{project}/{access}/{year}/{month} (Get top countries by page views) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs [12:05:01] PROBLEM - aqs endpoints health on aqs1008 is CRITICAL: /analytics.wikimedia.org/v1/pageviews/per-article/{project}/{access}/{agent}/{article}/{granularity}/{start}/{end} (Get per article page views) timed out before a response was received: /analytics.wikimedia.org/v1/mediarequests/per-file/{referer}/{agent}/{file_path}/{granularity}/{start}/{end} (Get per file requests) timed out before a response was received: /analytics.wikimedia.org/v1/pageviews/top-by-country/{project}/{access}/{year}/{month} (Get top countries by page views) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs [12:05:15] PROBLEM - aqs endpoints health on aqs1007 is CRITICAL: /analytics.wikimedia.org/v1/pageviews/top-by-country/{project}/{access}/{year}/{month} (Get top countries by page views) timed out before a response was received: /analytics.wikimedia.org/v1/mediarequests/per-file/{referer}/{agent}/{file_path}/{granularity}/{start}/{end} (Get per file requests) timed out before a response was received: /analytics.wikimedia.org/v1/pageviews/per-article/{project}/{access}/{agent}/{article}/{granularity}/{start}/{end} (Get per article page views) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs [12:06:15] RECOVERY - aqs endpoints health on aqs1009 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs [12:06:41] RECOVERY - aqs endpoints health on aqs1008 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs [12:06:57]
RECOVERY - aqs endpoints health on aqs1007 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs [12:22:13] PROBLEM - aqs endpoints health on aqs1009 is CRITICAL: /analytics.wikimedia.org/v1/mediarequests/per-file/{referer}/{agent}/{file_path}/{granularity}/{start}/{end} (Get per file requests) timed out before a response was received: /analytics.wikimedia.org/v1/pageviews/per-article/{project}/{access}/{agent}/{article}/{granularity}/{start}/{end} (Get per article page views) timed out before a response was received: /analytics.wikimedia.org/v1/pageviews/top-by-country/{project}/{access}/{year}/{month} (Get top countries by page views) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs [12:22:41] PROBLEM - aqs endpoints health on aqs1008 is CRITICAL: /analytics.wikimedia.org/v1/pageviews/per-article/{project}/{access}/{agent}/{article}/{granularity}/{start}/{end} (Get per article page views) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs [12:22:55] PROBLEM - aqs endpoints health on aqs1007 is CRITICAL: /analytics.wikimedia.org/v1/pageviews/top-by-country/{project}/{access}/{year}/{month} (Get top countries by page views) timed out before a response was received: /analytics.wikimedia.org/v1/mediarequests/per-file/{referer}/{agent}/{file_path}/{granularity}/{start}/{end} (Get per file requests) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs [12:25:39] RECOVERY - aqs endpoints health on aqs1009 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs [12:26:03] RECOVERY - aqs endpoints health on aqs1008 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs [12:26:17] RECOVERY - aqs endpoints health on aqs1007 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs
[12:31:38] ~ [12:32:36] wow - AQS got some peak [12:32:51] elukey: would you be around by any chance? [12:33:10] I am yes, working with SRE to fix this [12:36:02] ack - Can I help elukey ? [12:36:13] PROBLEM - aqs endpoints health on aqs1009 is CRITICAL: /analytics.wikimedia.org/v1/pageviews/top-by-country/{project}/{access}/{year}/{month} (Get top countries by page views) timed out before a response was received: /analytics.wikimedia.org/v1/pageviews/per-article/{project}/{access}/{agent}/{article}/{granularity}/{start}/{end} (Get per article page views) timed out before a response was received: /analytics.wikimedia.org/v1/mediarequests/per-file/{referer}/{agent}/{file_path}/{granularity}/{start}/{end} (Get per file requests) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs [12:36:39] PROBLEM - aqs endpoints health on aqs1008 is CRITICAL: /analytics.wikimedia.org/v1/pageviews/per-article/{project}/{access}/{agent}/{article}/{granularity}/{start}/{end} (Get per article page views) timed out before a response was received: /analytics.wikimedia.org/v1/pageviews/top-by-country/{project}/{access}/{year}/{month} (Get top countries by page views) timed out before a response was received: /analytics.wikimedia.org/v1/mediarequests/per-file/{referer}/{agent}/{file_path}/{granularity}/{start}/{end} (Get per file requests) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs [12:36:53] PROBLEM - aqs endpoints health on aqs1007 is CRITICAL: /analytics.wikimedia.org/v1/pageviews/top-by-country/{project}/{access}/{year}/{month} (Get top countries by page views) timed out before a response was received: /analytics.wikimedia.org/v1/pageviews/per-article/{project}/{access}/{agent}/{article}/{granularity}/{start}/{end} (Get per article page views) timed out before a response was received: /analytics.wikimedia.org/v1/mediarequests/per-file/{referer}/{agent}/{file_path}/{granularity}/{start}/{end}
(Get per file requests) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs [12:37:30] joal: no thanks, I need traffic to apply a ban :( [12:37:37] right [12:37:52] elukey: I'm looking at AQS reqs seeing if I can see a pattern [12:38:33] RECOVERY - aqs endpoints health on aqs1007 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs [12:43:07] RECOVERY - aqs endpoints health on aqs1009 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs [12:43:31] RECOVERY - aqs endpoints health on aqs1008 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs [12:54:05] PROBLEM - aqs endpoints health on aqs1008 is CRITICAL: /analytics.wikimedia.org/v1/pageviews/top-by-country/{project}/{access}/{year}/{month} (Get top countries by page views) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs [12:56:34] 10Analytics-Clusters, 10WMDE-Analytics-Engineering, 10User-GoranSMilovanovic: Downscale Pyspark scripts spark2-submit configs to manageable proportions - https://phabricator.wikimedia.org/T268684 (10JAllemandou) Hi @GoranSMilovanovic - Thank you for the changes. However it's not feasible for me to check them... 
[12:58:20] 10Analytics-Clusters, 10WMDE-Analytics-Engineering, 10User-GoranSMilovanovic: Downscale Wikidata-analysis pyspark scripts to analytics limits - https://phabricator.wikimedia.org/T268684 (10JAllemandou) [12:58:53] PROBLEM - aqs endpoints health on aqs1009 is CRITICAL: /analytics.wikimedia.org/v1/pageviews/per-article/{project}/{access}/{agent}/{article}/{granularity}/{start}/{end} (Get per article page views) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs [12:59:11] RECOVERY - aqs endpoints health on aqs1008 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs [13:00:31] RECOVERY - aqs endpoints health on aqs1009 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs [13:14:37] PROBLEM - aqs endpoints health on aqs1009 is CRITICAL: /analytics.wikimedia.org/v1/pageviews/top-by-country/{project}/{access}/{year}/{month} (Get top countries by page views) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs [13:15:01] PROBLEM - aqs endpoints health on aqs1008 is CRITICAL: /analytics.wikimedia.org/v1/mediarequests/per-file/{referer}/{agent}/{file_path}/{granularity}/{start}/{end} (Get per file requests) timed out before a response was received: /analytics.wikimedia.org/v1/pageviews/per-article/{project}/{access}/{agent}/{article}/{granularity}/{start}/{end} (Get per article page views) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs [13:15:19] PROBLEM - aqs endpoints health on aqs1007 is CRITICAL: /analytics.wikimedia.org/v1/pageviews/top-by-country/{project}/{access}/{year}/{month} (Get top countries by page views) timed out before a response was received: /analytics.wikimedia.org/v1/mediarequests/per-file/{referer}/{agent}/{file_path}/{granularity}/{start}/{end} (Get per file requests) timed out before a response was received
https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs [13:17:59] RECOVERY - aqs endpoints health on aqs1009 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs [13:18:23] RECOVERY - aqs endpoints health on aqs1008 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs [13:18:41] RECOVERY - aqs endpoints health on aqs1007 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs [13:25:06] joal: blocked :) [13:25:15] o/ [13:25:17] we should be good now, I am going afk for a bit, please call if needed! [13:25:20] You rock elukey [13:25:27] please go elukey [13:25:29] Bye! [14:08:06] 10Analytics: Alter table for navigation timing errors out in Hadoop test - https://phabricator.wikimedia.org/T268733 (10Ottomata) a:03Ottomata Hm interesting. I'll try to look into this next week. [15:38:16] !log move stat1004 to A5 [15:38:22] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [16:11:37] !log move analytics1065 to C3 [16:11:43] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [16:33:45] ottomata: we can hear you [16:36:01] fdans: I am skipping the sync meeting since I am moving hosts with dcops, ping me if I am needed :) [16:36:26] elukey: no problem! [16:46:12] !log move analytics1066 to C3 [16:46:18] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [17:21:55] 10Analytics-Clusters, 10WMDE-Analytics-Engineering, 10User-GoranSMilovanovic: Downscale Wikidata-analysis pyspark scripts to analytics limits - https://phabricator.wikimedia.org/T268684 (10GoranSMilovanovic) @JAllemandou > Can you please put your configurations in a repo and share so that we can discuss/re... [17:50:07] ottomata / mforns / joal: any idea if/where data for the mediawiki.recentchange stream lands in HDFS? I know I can see it in Kafka and EventStreams, but does it touch down anywhere?
Couldn't find it in event/event_sanitized [17:58:38] hmm... [18:02:02] milimetric: maybe in raw? /wmf/data/raw/event/*_mediawiki_recentchange [18:04:59] milimetric: it isn't event.mediawiki_recentchange [18:05:00] ? [18:05:09] OHHH [18:05:09] no its not [18:05:14] it doesn't have a consistent schema [18:05:16] we can't import it [18:05:49] https://github.com/wikimedia/puppet/blob/production/modules/profile/manifests/analytics/refinery/job/refine.pp#L71-L77 [18:06:30] see the comment at the bottom of https://schema.wikimedia.org/repositories//primary/jsonschema/mediawiki/recentchange/current.yaml [18:08:10] (03PS1) 10GoranSMilovanovic: configs Deployment 20201125 [analytics/wmde/WDCM] - 10https://gerrit.wikimedia.org/r/643530 [18:08:14] milimetric: one qs - is there an npm version that we use to buil wikistats? [18:08:18] *build [18:09:26] (back in a few, will read asap) [18:10:18] (03CR) 10GoranSMilovanovic: [V: 03+2 C: 03+2] configs Deployment 20201125 [analytics/wmde/WDCM] - 10https://gerrit.wikimedia.org/r/643530 (owner: 10GoranSMilovanovic) [18:19:55] 10Analytics-Clusters, 10WMDE-Analytics-Engineering, 10User-GoranSMilovanovic: Downscale Wikidata-analysis pyspark scripts to analytics limits - https://phabricator.wikimedia.org/T268684 (10GoranSMilovanovic) @JAllemandou The changes as per your request are now enforced across the WDCM system. More changes... [18:20:17] 10Analytics, 10Patch-For-Review: Avro Deserializer logging set to DEBUG in pyspark lead to huge yarn stderr container files (causing disk usage alerts) - https://phabricator.wikimedia.org/T268376 (10GoranSMilovanovic) @elukey Testing now as per your suggestion in T268376#6647717. Thank you. [18:24:16] elukey: this is the change I mentioned in standup: https://gerrit.wikimedia.org/r/c/operations/puppet/+/643531 can you please review? 
:] [18:26:23] 10Analytics, 10Operations, 10SRE-Access-Requests: Requesting access to production shell groups for JAnstee - https://phabricator.wikimedia.org/T266249 (10Dzahn) [18:31:47] milimetric: you might want to include seve and jason in any governance discussions [18:40:36] !log restart turnilo to pick up new netflow config changes [18:40:38] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [18:41:09] mforns: deployed, can you check on turnilo?? [18:43:07] also, is there anybody that can help me with wikistats deploy ? :) [18:45:37] 10Analytics, 10Operations, 10SRE-Access-Requests, 10Patch-For-Review: Requesting access to production shell groups for JAnstee - https://phabricator.wikimedia.org/T266249 (10Dzahn) https://wikimediafoundation.org/profile/sumeet-bodington/ shows how Dumisani (T266791) and @JAnstee_WMF are both in Sumeet's... [18:48:13] jenkins is not happy in https://gerrit.wikimedia.org/r/c/analytics/wikistats2/+/643329 [18:48:25] it seems to complain that package.json vs package-lock.json are not in sync [18:48:32] but I'd need help [18:50:28] it says npm ERR! Missing: wmui-base@wikimedia/wikimedia-ui-base [18:50:37] that is indeed in package.json but not in -lock [18:58:28] elukey: the test as requested in https://phabricator.wikimedia.org/T268376#6647717 is now running (driver is stat1005). I will be afk for some time - thus far, the logging seems not to include a ton of DEBUG messages. [18:58:45] GoranSM: nice! [18:59:08] I'll check the yarn app id once done, thanks for pinging me :) [18:59:29] elukey: Ok.
[19:04:52] !log Killing job application_1605880843685_18336 as it consumes too many resources [19:04:53] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [19:14:57] joal: I downscaled the resources for the Pyspark program that runs the job https://yarn.wikimedia.org/proxy/application_1605880843685_18336/ as you have requested in https://phabricator.wikimedia.org/T268684#6649444 and you still needed to kill the job. Should I downscale it even more? Please advise. [19:15:41] Hi GoranSM [19:16:10] This is very bizarre - The job I killed was consuming 3.5Tb RAM and 1600+ CPU cores [19:16:13] GoranSM: --^ [19:16:22] elukey: The job just killed by joal was the one that should have been monitored for the logging problem. So, I will have to see if it needs additional downscaling (it obviously does) and then I will let you know when I start testing again. [19:16:36] Hi Joal [19:16:41] GoranSM: I have checked the logs and indeed the problem of logs is solved :) [19:17:01] GoranSM: you've been faster than me, I planned on pinging you about that exact matter :) [19:17:22] GoranSM: log problem solved (thanks a million elukey for finding the root cause) [19:17:24] joal: Ok, at least the logs are now fine. Let's see about the resources then. I followed your suggestion in https://phabricator.wikimedia.org/T268684#6648373 but it seems that that is not enough. [19:17:38] elukey: thanks!! [19:17:39] elukey: Thank you for fixing the logging thing. [19:17:51] GoranSM: to me it's as if nothing had been done [19:18:04] GoranSM: in terms of downscaling I mean [19:18:15] Can you show me the command you used please?
[19:18:20] GoranSM: --^ [19:18:22] joal: Of course, just a sec [19:18:46] joal: sudo -u analytics-privatedata spark2-submit --master yarn --deploy-mode client --num-executors 100 --driver-memory 20000m --executor-memory 20000m --executor-cores 8 /home/goransm/Analytics/WDCM/WDCM_Scripts/wdcmModule_ETL.py [19:19:07] joal: You requested to reduce the number of cores from 16 to 8. [19:19:29] GoranSM: `--num-executors 100` --> `--conf spark.dynamicAllocation.maxExecutors=100` [19:19:35] Please :) [19:19:48] GoranSM: the number of executors was unlimited [19:19:57] joal: And the problem is that my intuition on the number of executors/cores that I need for something to work efficiently in Spark is very bad. [19:20:20] joal: How do you mean unlimited (100 is a finite number, right?) :) [19:20:21] the setting you use is only active when dynamic allocation is off, which is NOT our case [19:21:10] joal: Interesting. Every day I learn something new. What do you suggest that I should do with this (and many other) Pyspark program, then? Reduce --num-executors to..? [19:21:14] GoranSM: I would suggest to go for small first, then grow if needed :) [19:21:38] joal: Makes sense. Start with - I don't know, ten, twenty, or more? [19:21:56] GoranSM: As stated above - Don't use --num-executors please - use --conf spark.dynamicAllocation.maxExecutors=X instead [19:22:15] joal: Oh oh oh I understand now [19:22:18] GoranSM: If you use --num-executors the setting you give is ignored [19:23:08] joal: So: (1) do not use --num-executors and (2) use --conf spark.dynamicAllocation.maxExecutors=100 ? [19:23:17] Correct sir [19:23:22] GoranSM: --^ [19:23:39] joal: Thank you very much sir - testing again in ten minutes or so. [19:23:42] GoranSM: The second is the replacement for the first when dynamic-allocation is on, which is our case [19:24:05] Great GoranSM - Thanks for that [19:24:31] elukey: netflow in Druid looks good!
[19:24:36] joal: Ok, so the --conf spark.dynamicAllocation.maxExecutors=100 will not allow the cluster to use more than 100, but the cluster will also try to figure how many executors are really needed, right? [19:24:51] mforns: super :) [19:24:55] Also GoranSM, about how to scale - I'd advise doing it by steps starting small and moving up, monitoring execution time and gaining insights [19:25:15] GoranSM: Correct, the cluster will use as many executors as needed, up to 100 [19:25:43] GoranSM: With the previous setting it was using as many executors as needed, up to UNBOUNDED :) [19:26:09] joal: I somehow remember doing that once I've migrated the WDCM etl from HiveQL to Pyspark, but... sincerely I barely remember what were my conclusions back then. Ok, I am up to introduce the changes as suggested. Thank you again! :) [19:26:21] GoranSM: There is no easy rule of thumb for scaling computation in Spark - it depends a lot on the computation, datasize, etc [19:26:44] No prob GoranSM - Thank you for making this happen :) [19:27:30] 10Analytics, 10Product-Analytics: Configure superset cache - https://phabricator.wikimedia.org/T268784 (10JAllemandou) [19:27:59] 10Analytics, 10Product-Analytics: Configure superset cache - https://phabricator.wikimedia.org/T268784 (10JAllemandou) [19:28:37] all right going to log off people, see you tomorrow! happy holidays in the US :) [19:28:45] Bye elukey [19:37:40] 10Analytics, 10Analytics-Kanban, 10Operations, 10netops, 10Patch-For-Review: Add more dimensions in the netflow/pmacct/Druid pipeline - https://phabricator.wikimedia.org/T254332 (10mforns) @ayounsi New fields are in Druid (starting 2020-11-24T03:00:00) :] I've checked that all looks OK, but please do che...
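[Editor's note: the spark2-submit exchange above boils down to one rule - with dynamic allocation enabled, `--num-executors` is ignored and the cap must go through `spark.dynamicAllocation.maxExecutors`. A minimal sketch of the corrected invocation, assuming the helper name and the argument values from GoranSM's command above (this is illustrative, not the actual WDCM script):]

```python
# Illustrative helper (hypothetical name) that builds a spark2-submit
# command line with a bounded executor count. Because the cluster runs
# with dynamic allocation ON, the cap is expressed via
# spark.dynamicAllocation.maxExecutors; --num-executors would be ignored.

def build_spark_submit(script, max_executors=100,
                       driver_memory="20000m",
                       executor_memory="20000m",
                       executor_cores=8):
    """Return spark2-submit args with a dynamic-allocation executor cap."""
    return [
        "spark2-submit",
        "--master", "yarn",
        "--deploy-mode", "client",
        # Cap executors under dynamic allocation (joal's recommendation);
        # do NOT pass --num-executors here, it only applies when
        # dynamic allocation is off.
        "--conf", "spark.dynamicAllocation.maxExecutors=%d" % max_executors,
        "--driver-memory", driver_memory,
        "--executor-memory", executor_memory,
        "--executor-cores", str(executor_cores),
        script,
    ]

args = build_spark_submit("wdcmModule_ETL.py", max_executors=100)
print(" ".join(args))
```

[Starting with a small `max_executors` and growing it while watching execution time, as joal advises, is the safer path than guessing a large cap up front.]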
[19:39:17] 10Analytics-Clusters, 10WMDE-Analytics-Engineering, 10User-GoranSMilovanovic: Downscale Wikidata-analysis pyspark scripts to analytics limits - https://phabricator.wikimedia.org/T268684 (10JAllemandou) Thanks @GoranSMilovanovic for the explanation and addition of config to the repo. As per our chat on IRC,... [19:40:45] mforns: New netflow fields in druid is an awesome end-of-day for me :) Thanks for that! [19:42:26] mforns: I have a question related to that for when you have time: Have you moved netflow to the event database? [19:43:03] joal: cool! no, I haven't yet moved it, why? [19:43:13] mforns: just asking :) [19:43:17] I wondered [19:43:25] joal: that was my next step [19:43:34] this is great :) [19:44:04] joal: do you have any tips on how to move? [19:44:25] mforns: hm, nope - I have not done that ever [19:44:28] we could just stop the refine job for a bit, right [19:44:38] I think that's a right approach [19:44:39] we have the fresh data in Druid by streaming [19:45:15] Stop refine, drop table, move data, recreate table, repair table, update refine conf and here you go :) [19:45:31] yea [19:45:31] and obviously restart refine after that [19:46:34] yes, shouldn't take too long [19:47:01] maybe pairing on that would be a good idea [19:47:19] mforns: +1 to pairing- We'll need our beloved ops around as well :) [19:47:28] yea [19:47:48] maybe next week then [19:48:14] mforns: can be done tomorrow if you wish [19:48:14] until then I'll look at the new data size in Druid [19:48:49] oh, of course, for some reason I thought all SREs were on holiday tomorrow [19:48:57] yea, tomorrow is good for me! [19:49:07] will prepare a patch [19:49:37] \o/ [19:49:50] would you be the one to pair with me? :D [19:49:55] mforns: no pressure though - it can also wait next week :) [19:49:59] For sure :) [19:50:08] great! no, tomorrow is cool! [19:50:28] 10Analytics, 10Product-Analytics: Configure superset cache - https://phabricator.wikimedia.org/T268784 (10Ottomata) OH!
Cool! [19:50:59] elukey: are you OK if we do this tomorrow? maybe before standup? we'll need you to merge stuff :] [19:51:13] he's gone mforns :) [19:51:17] oh ok! [19:55:30] ottomata: I see https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/639600 has been merged, shouldn't it stop the navigationtiming alerts? [20:05:09] 10Analytics, 10Analytics-Kanban, 10Operations, 10netops, 10Patch-For-Review: Add more dimensions in the netflow/pmacct/Druid pipeline - https://phabricator.wikimedia.org/T254332 (10mforns) I guess this can be removed now, although it could stay if you wish! https://github.com/wikimedia/puppet/blob/produc... [20:05:11] 10Analytics, 10Data-Services, 10cloud-services-team (Kanban): Plan a replacement for wiki replicas that is better suited to typical OLAP use cases than the MediaWiki OLTP schema - https://phabricator.wikimedia.org/T215858 (10Jhernandez) A couple of comments from a subtask that I think should continue here:... [20:06:19] (03PS1) 10GoranSMilovanovic: T268684 [analytics/wmde/WDCM] - 10https://gerrit.wikimedia.org/r/643544 [20:06:50] (03CR) 10GoranSMilovanovic: [V: 03+2 C: 03+2] T268684 [analytics/wmde/WDCM] - 10https://gerrit.wikimedia.org/r/643544 (owner: 10GoranSMilovanovic) [20:06:57] (03Merged) 10jenkins-bot: T268684 [analytics/wmde/WDCM] - 10https://gerrit.wikimedia.org/r/643544 (owner: 10GoranSMilovanovic) [20:08:15] joal: testing from stat1005 now as per https://phabricator.wikimedia.org/T268684#6649705 [20:15:49] 10Analytics, 10Patch-For-Review: Avro Deserializer logging set to DEBUG in pyspark lead to huge yarn stderr container files (causing disk usage alerts) - https://phabricator.wikimedia.org/T268376 (10GoranSMilovanovic) @elukey Testing again since I had to incorporate some new changes in respect to T268684#6649705. 
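[Editor's note: joal's table-move checklist above ("Stop refine, drop table, move data, recreate table, repair table, update refine conf") can be sketched as an ordered plan. Everything concrete here - the systemd unit name, database names, and HDFS paths - is a hypothetical stand-in, not the real netflow configuration; the `event` database and paths are only guesses based on the conversation:]

```python
# Rough sketch of the steps mforns and joal agree on for moving a
# Refine-managed table to another database. All names/paths below are
# illustrative assumptions, not the production netflow locations.

def table_move_plan(table, src_db, dst_db, src_path, dst_path):
    """Return the ordered shell/Hive commands to relocate a refined table."""
    return [
        # 1. stop the Refine timer so nothing writes mid-move
        "sudo systemctl stop refine_%s.timer" % table,
        # 2. drop the old table (data stays on HDFS for an external table)
        "hive -e 'DROP TABLE %s.%s'" % (src_db, table),
        # 3. move the underlying data to the new location
        "hdfs dfs -mv %s %s" % (src_path, dst_path),
        # 4. recreate the table over the new location (DDL elided)
        "hive -e 'CREATE TABLE %s.%s ... LOCATION \"%s\"'" % (dst_db, table, dst_path),
        # 5. rediscover the partitions
        "hive -e 'MSCK REPAIR TABLE %s.%s'" % (dst_db, table),
        # 6. after updating the Refine config (puppet), restart the timer
        "sudo systemctl start refine_%s.timer" % table,
    ]

plan = table_move_plan("netflow", "wmf", "event",
                       "/wmf/data/wmf/netflow", "/wmf/data/event/netflow")
```

[The ordering matters: Refine must be stopped first so no job lands data between the drop and the repair, and the repair must come last so the metastore sees every moved partition.]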
[20:22:47] joal: it seems that netflow size in Druid has increased about 50% with new additions [20:22:54] more or less expected [20:24:43] 90 days of new data (raw) should be around 2.8TB [20:25:08] 1 year of new data (sanitized) should be around 0.6TB [21:04:07] (03PS1) 10Milimetric: Release 2.8.3 [analytics/wikistats2] - 10https://gerrit.wikimedia.org/r/643557 [21:06:21] (03CR) 10Milimetric: [C: 03+2] Release 2.8.3 [analytics/wikistats2] - 10https://gerrit.wikimedia.org/r/643557 (owner: 10Milimetric) [21:06:38] (03Abandoned) 10Milimetric: Release 2.8.3 [analytics/wikistats2] - 10https://gerrit.wikimedia.org/r/643329 (owner: 10Elukey) [21:07:50] (03Merged) 10jenkins-bot: Release 2.8.3 [analytics/wikistats2] - 10https://gerrit.wikimedia.org/r/643557 (owner: 10Milimetric) [21:10:19] 10Analytics, 10Operations, 10SRE-Access-Requests, 10Patch-For-Review: Requesting access to production shell groups for JAnstee - https://phabricator.wikimedia.org/T266249 (10Dzahn) 05Open→03Resolved a:05Dzahn→03JAnstee_WMF Hi @janstee_WMF your shell account has been created. You have been upgraded... 
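[Editor's note: a quick back-of-envelope check of mforns's size estimates above. The per-day rates below are simply derived from the quoted totals (2.8 TB over 90 days of raw data, 0.6 TB over a year of sanitized data), using 1 TB = 1024 GB; they are not measured values:]

```python
# Derive implied daily ingestion rates from the totals quoted in channel.

raw_total_tb, raw_days = 2.8, 90    # 90 days of new raw netflow data
san_total_tb, san_days = 0.6, 365   # 1 year of new sanitized netflow data

raw_per_day_gb = raw_total_tb * 1024 / raw_days   # ~31.9 GB/day raw
san_per_day_gb = san_total_tb * 1024 / san_days   # ~1.7 GB/day sanitized

print("raw: %.1f GB/day, sanitized: %.1f GB/day"
      % (raw_per_day_gb, san_per_day_gb))
```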
[21:12:59] Analytics, Operations, SRE-Access-Requests: Requesting access to production shell groups for JAnstee - https://phabricator.wikimedia.org/T266249 (Dzahn)
[21:23:23] (PS1) GoranSMilovanovic: T268684 [analytics/wmde/WD/WD_HumanEdits] - https://gerrit.wikimedia.org/r/643561
[21:23:44] (CR) GoranSMilovanovic: [V: +2 C: +2] T268684 [analytics/wmde/WD/WD_HumanEdits] - https://gerrit.wikimedia.org/r/643561 (owner: GoranSMilovanovic)
[21:25:53] (PS1) GoranSMilovanovic: T268684 [analytics/wmde/WD/WD_HumanEdits] - https://gerrit.wikimedia.org/r/643564
[21:26:20] (CR) GoranSMilovanovic: [V: +2 C: +2] T268684 [analytics/wmde/WD/WD_HumanEdits] - https://gerrit.wikimedia.org/r/643564 (owner: GoranSMilovanovic)
[21:28:31] Analytics-Clusters, WMDE-Analytics-Engineering, Patch-For-Review, User-GoranSMilovanovic: Downscale Wikidata-analysis pyspark scripts to analytics limits - https://phabricator.wikimedia.org/T268684 (GoranSMilovanovic) @Joal Wikidata Human vs Bot Edits system is updated, repo: https://github.com/w...
[21:34:37] PROBLEM - Check the last execution of produce_canary_events on an-launcher1002 is CRITICAL: CRITICAL: Status of the systemd unit produce_canary_events https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[21:40:57] (PS1) GoranSMilovanovic: T268684 [analytics/wmde/WD/WD_languagesLandscape] - https://gerrit.wikimedia.org/r/643567
[21:41:12] (CR) GoranSMilovanovic: [V: +2 C: +2] T268684 [analytics/wmde/WD/WD_languagesLandscape] - https://gerrit.wikimedia.org/r/643567 (owner: GoranSMilovanovic)
[21:44:44] Analytics-Clusters, WMDE-Analytics-Engineering, Patch-For-Review, User-GoranSMilovanovic: Downscale Wikidata-analysis pyspark scripts to analytics limits - https://phabricator.wikimedia.org/T268684 (GoranSMilovanovic) @Joal Wikidata Human vs Bot Edits system is updated, repo: https://github.com/w...
[21:45:17] RECOVERY - Check the last execution of produce_canary_events on an-launcher1002 is OK: OK: Status of the systemd unit produce_canary_events https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[22:07:08] (PS1) GoranSMilovanovic: T268684 [analytics/wmde/WDCM] - https://gerrit.wikimedia.org/r/643570
[22:07:20] (CR) GoranSMilovanovic: [V: +2 C: +2] T268684 [analytics/wmde/WDCM] - https://gerrit.wikimedia.org/r/643570 (owner: GoranSMilovanovic)
[22:37:12] (PS1) GoranSMilovanovic: T268684 [analytics/wmde/WD/WD_percentUsageDashboard] - https://gerrit.wikimedia.org/r/643574
[22:37:46] (CR) GoranSMilovanovic: [V: +2 C: +2] T268684 [analytics/wmde/WD/WD_percentUsageDashboard] - https://gerrit.wikimedia.org/r/643574 (owner: GoranSMilovanovic)
[22:41:03] Analytics-Clusters, WMDE-Analytics-Engineering, Patch-For-Review, User-GoranSMilovanovic: Downscale Wikidata-analysis pyspark scripts to analytics limits - https://phabricator.wikimedia.org/T268684 (GoranSMilovanovic) @Joal Wikidata Usage & Coverage system is updated, repo: https://github.com/wik...
[22:42:28] (PS1) GoranSMilovanovic: minor [analytics/wmde/WD/WD_percentUsageDashboard] - https://gerrit.wikimedia.org/r/643577
[23:16:47] (PS1) GoranSMilovanovic: T268684 [analytics/wmde/WD/WD_identifierLandscape] - https://gerrit.wikimedia.org/r/643586
[23:17:01] (CR) GoranSMilovanovic: [V: +2 C: +2] T268684 [analytics/wmde/WD/WD_identifierLandscape] - https://gerrit.wikimedia.org/r/643586 (owner: GoranSMilovanovic)
[23:17:14] (CR) GoranSMilovanovic: [V: +2 C: +2] minor [analytics/wmde/WD/WD_percentUsageDashboard] - https://gerrit.wikimedia.org/r/643577 (owner: GoranSMilovanovic)
[23:21:49] Analytics-Clusters, WMDE-Analytics-Engineering, Patch-For-Review, User-GoranSMilovanovic: Downscale Wikidata-analysis pyspark scripts to analytics limits - https://phabricator.wikimedia.org/T268684 (GoranSMilovanovic) @Joal Wikidata External Identifiers Landscape system is updated, repo: https://...