[00:09:34] 10Analytics, 10Analytics-Kanban, 10serviceops, 10Patch-For-Review, 10User-jijiki: Mechanism to flag webrequests as "debug" - https://phabricator.wikimedia.org/T263683 (10Milimetric) Ok, some of these things are easy and some are a bit harder. Instead of filtering out the requests at "refine" time, we wi... [00:50:29] 10Analytics-Radar, 10Better Use Of Data, 10Instrument-ClientError, 10Wikimedia-Logstash, and 4 others: Documentation of client side error logging capabilities on mediawiki - https://phabricator.wikimedia.org/T248884 (10Tgr) [01:00:56] 10Analytics, 10Analytics-Wikistats: pagecounts-ez uploads stopped after 9/24 - https://phabricator.wikimedia.org/T265378 (10Milimetric) The monthly totals are not available yet, see T265732. A quick status update: * we found some malformed rows in pageview_complete, and we're fixing them ** some had 5 column... [01:10:57] 10Analytics-Radar, 10MediaWiki-API, 10Patch-For-Review, 10Platform Team Initiatives (Modern Event Platform (TEC2)), 10User-Addshore: Run ETL for wmf_raw.ActionApi into wmf.action_* aggregate tables - https://phabricator.wikimedia.org/T137321 (10Milimetric) It's basically stalled on resourcing. There was... [04:41:14] 10Analytics, 10Analytics-Wikistats, 10Inuka-Team, 10Language-strategy, and 2 others: Have a way to show the most popular pages per country - https://phabricator.wikimedia.org/T207171 (10lexnasser) @JAllemandou @Isaac Thanks to both of you for going into deeper detail about `page_title` vs. `page_id`. I'm... [06:42:18] good morning [07:17:41] the patch for hive seems to work for bigtop! [07:40:05] created https://issues.apache.org/jira/browse/BIGTOP-3455 to get it included in bigtop 1.5 [07:40:47] 10Analytics-Clusters, 10Patch-For-Review: Review an-coord1001's usage and failover plans - https://phabricator.wikimedia.org/T257412 (10elukey) It seems to work! Created https://issues.apache.org/jira/browse/BIGTOP-3455 to ask if the patch can be included for Bigtop 1.5. 
[07:46:22] ok I am going to rebuild the hive packages with a -2 at the end, to differentiate them from the upstream ones [08:54:39] 10Analytics: Avro Deserializer logging set to DEBUG in pyspark lead to huge yarn stderr container files (causing disk usage alerts) - https://phabricator.wikimedia.org/T268376 (10elukey) @GoranSMilovanovic I think I may have found the issue, I fixed stat1005. Can you retry a normal execution of your script? >>!... [10:16:40] elukey: hello :D https://gerrit.wikimedia.org/r/c/operations/puppet/+/643448 [10:17:13] only 420 replacements left, don't worry [10:21:11] Thanks! [10:21:44] Amir1: well thank you :) just completed the puppet runs via cumin, no op as expected [10:22:23] Awesome [10:44:37] 10Analytics: Alter table for navigation timing errors out in Hadoop test - https://phabricator.wikimedia.org/T268733 (10elukey) [10:45:11] 10Analytics: Alter table for navigation timing errors out in Hadoop test - https://phabricator.wikimedia.org/T268733 (10elukey) [10:53:42] * elukey brb [11:02:03] PROBLEM - aqs endpoints health on aqs1008 is CRITICAL: /analytics.wikimedia.org/v1/pageviews/per-article/{project}/{access}/{agent}/{article}/{granularity}/{start}/{end} (Get per article page views) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs [11:02:17] PROBLEM - aqs endpoints health on aqs1007 is CRITICAL: /analytics.wikimedia.org/v1/pageviews/per-article/{project}/{access}/{agent}/{article}/{granularity}/{start}/{end} (Get per article page views) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs [11:03:27] PROBLEM - aqs endpoints health on aqs1009 is CRITICAL: /analytics.wikimedia.org/v1/pageviews/top-by-country/{project}/{access}/{year}/{month} (Get top countries by page views) timed out before a response was received: /analytics.wikimedia.org/v1/mediarequests/per-file/{referer}/{agent}/{file_path}/{granularity}/{start}/{end} (Get per file
requests) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs [11:08:35] RECOVERY - aqs endpoints health on aqs1009 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs [11:14:13] RECOVERY - aqs endpoints health on aqs1008 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs [11:14:29] RECOVERY - aqs endpoints health on aqs1007 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs [11:16:49] ah lovely datasource drops? [11:19:12] ah no it is pageviews [11:19:17] so not druid [11:19:59] https://grafana.wikimedia.org/d/000000526/aqs?orgId=1 wow [11:21:19] PROBLEM - aqs endpoints health on aqs1008 is CRITICAL: /analytics.wikimedia.org/v1/pageviews/per-article/{project}/{access}/{agent}/{article}/{granularity}/{start}/{end} (Get per article page views) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs [11:21:35] PROBLEM - aqs endpoints health on aqs1007 is CRITICAL: /analytics.wikimedia.org/v1/pageviews/per-article/{project}/{access}/{agent}/{article}/{granularity}/{start}/{end} (Get per article page views) timed out before a response was received: /analytics.wikimedia.org/v1/mediarequests/per-file/{referer}/{agent}/{file_path}/{granularity}/{start}/{end} (Get per file requests) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs [11:22:43] PROBLEM - aqs endpoints health on aqs1009 is CRITICAL: /analytics.wikimedia.org/v1/pageviews/top-by-country/{project}/{access}/{year}/{month} (Get top countries by page views) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs [11:29:35] RECOVERY - aqs endpoints health on aqs1009 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs [11:29:57] RECOVERY - aqs endpoints health on aqs1008 is
OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs [11:30:11] RECOVERY - aqs endpoints health on aqs1007 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs [11:40:07] PROBLEM - aqs endpoints health on aqs1009 is CRITICAL: /analytics.wikimedia.org/v1/mediarequests/per-file/{referer}/{agent}/{file_path}/{granularity}/{start}/{end} (Get per file requests) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs [11:40:33] PROBLEM - aqs endpoints health on aqs1008 is CRITICAL: /analytics.wikimedia.org/v1/mediarequests/per-file/{referer}/{agent}/{file_path}/{granularity}/{start}/{end} (Get per file requests) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs [11:45:59] PROBLEM - aqs endpoints health on aqs1007 is CRITICAL: /analytics.wikimedia.org/v1/pageviews/top-by-country/{project}/{access}/{year}/{month} (Get top countries by page views) timed out before a response was received: /analytics.wikimedia.org/v1/pageviews/per-article/{project}/{access}/{agent}/{article}/{granularity}/{start}/{end} (Get per article page views) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs [11:46:57] RECOVERY - aqs endpoints health on aqs1009 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs [11:47:21] RECOVERY - aqs endpoints health on aqs1008 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs [11:47:35] RECOVERY - aqs endpoints health on aqs1007 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs [12:04:33] PROBLEM - aqs endpoints health on aqs1009 is CRITICAL: /analytics.wikimedia.org/v1/mediarequests/per-file/{referer}/{agent}/{file_path}/{granularity}/{start}/{end} (Get per file requests) timed out before a response was received:
/analytics.wikimedia.org/v1/pageviews/per-article/{project}/{access}/{agent}/{article}/{granularity}/{start}/{end} (Get per article page views) timed out before a response was received: /analytics.wikimedia.org/v1/pageviews/top-by-country/{project}/{access}/{year}/{month} (Get top countries by page views) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs [12:05:01] PROBLEM - aqs endpoints health on aqs1008 is CRITICAL: /analytics.wikimedia.org/v1/pageviews/per-article/{project}/{access}/{agent}/{article}/{granularity}/{start}/{end} (Get per article page views) timed out before a response was received: /analytics.wikimedia.org/v1/mediarequests/per-file/{referer}/{agent}/{file_path}/{granularity}/{start}/{end} (Get per file requests) timed out before a response was received: /analytics.wikimedia.org/v1/pageviews/top-by-country/{project}/{access}/{year}/{month} (Get top countries by page views) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs [12:05:15] PROBLEM - aqs endpoints health on aqs1007 is CRITICAL: /analytics.wikimedia.org/v1/pageviews/top-by-country/{project}/{access}/{year}/{month} (Get top countries by page views) timed out before a response was received: /analytics.wikimedia.org/v1/mediarequests/per-file/{referer}/{agent}/{file_path}/{granularity}/{start}/{end} (Get per file requests) timed out before a response was received: /analytics.wikimedia.org/v1/pageviews/per-article/{project}/{access}/{agent}/{article}/{granularity}/{start}/{end} (Get per article page views) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs [12:06:15] RECOVERY - aqs endpoints health on aqs1009 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs [12:06:41] RECOVERY - aqs endpoints health on aqs1008 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs [12:06:57]
RECOVERY - aqs endpoints health on aqs1007 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs [12:22:13] PROBLEM - aqs endpoints health on aqs1009 is CRITICAL: /analytics.wikimedia.org/v1/mediarequests/per-file/{referer}/{agent}/{file_path}/{granularity}/{start}/{end} (Get per file requests) timed out before a response was received: /analytics.wikimedia.org/v1/pageviews/per-article/{project}/{access}/{agent}/{article}/{granularity}/{start}/{end} (Get per article page views) timed out before a response was received: /analytics.wikimedia.org/v1/pageviews/top-by-country/{project}/{access}/{year}/{month} (Get top countries by page views) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs [12:22:41] PROBLEM - aqs endpoints health on aqs1008 is CRITICAL: /analytics.wikimedia.org/v1/pageviews/per-article/{project}/{access}/{agent}/{article}/{granularity}/{start}/{end} (Get per article page views) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs [12:22:55] PROBLEM - aqs endpoints health on aqs1007 is CRITICAL: /analytics.wikimedia.org/v1/pageviews/top-by-country/{project}/{access}/{year}/{month} (Get top countries by page views) timed out before a response was received: /analytics.wikimedia.org/v1/mediarequests/per-file/{referer}/{agent}/{file_path}/{granularity}/{start}/{end} (Get per file requests) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs [12:25:39] RECOVERY - aqs endpoints health on aqs1009 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs [12:26:03] RECOVERY - aqs endpoints health on aqs1008 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs [12:26:17] RECOVERY - aqs endpoints health on aqs1007 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs
[12:31:38] ~ [12:32:36] wow - AQS got some peak [12:32:51] elukey: would you be around by any chance? [12:33:10] I am yes, working with SRE to fix this [12:36:02] ack - Can I help elukey ? [12:36:13] PROBLEM - aqs endpoints health on aqs1009 is CRITICAL: /analytics.wikimedia.org/v1/pageviews/top-by-country/{project}/{access}/{year}/{month} (Get top countries by page views) timed out before a response was received: /analytics.wikimedia.org/v1/pageviews/per-article/{project}/{access}/{agent}/{article}/{granularity}/{start}/{end} (Get per article page views) timed out before a response was received: /analytics.wikimedia.org/v1/mediarequests/per-file/{referer}/{agent}/{file_path}/{granularity}/{start}/{end} (Get per file requests) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs [12:36:39] PROBLEM - aqs endpoints health on aqs1008 is CRITICAL: /analytics.wikimedia.org/v1/pageviews/per-article/{project}/{access}/{agent}/{article}/{granularity}/{start}/{end} (Get per article page views) timed out before a response was received: /analytics.wikimedia.org/v1/pageviews/top-by-country/{project}/{access}/{year}/{month} (Get top countries by page views) timed out before a response was received: /analytics.wikimedia.org/v1/mediarequests/per-file/{referer}/{agent}/{file_path}/{granularity}/{start}/{end} (Get per file requests) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs [12:36:53] PROBLEM - aqs endpoints health on aqs1007 is CRITICAL: /analytics.wikimedia.org/v1/pageviews/top-by-country/{project}/{access}/{year}/{month} (Get top countries by page views) timed out before a response was received: /analytics.wikimedia.org/v1/pageviews/per-article/{project}/{access}/{agent}/{article}/{granularity}/{start}/{end} (Get per article page views) timed out before a response was received: /analytics.wikimedia.org/v1/mediarequests/per-file/{referer}/{agent}/{file_path}/{granularity}/{start}/{end}
(Get per file requests) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs [12:37:30] joal: no thanks, I need traffic to apply a ban :( [12:37:37] right [12:37:52] elukey: I'm looking at AQS reqs seeing if I can see a pattern [12:38:33] RECOVERY - aqs endpoints health on aqs1007 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs [12:43:07] RECOVERY - aqs endpoints health on aqs1009 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs [12:43:31] RECOVERY - aqs endpoints health on aqs1008 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs [12:54:05] PROBLEM - aqs endpoints health on aqs1008 is CRITICAL: /analytics.wikimedia.org/v1/pageviews/top-by-country/{project}/{access}/{year}/{month} (Get top countries by page views) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs [12:56:34] 10Analytics-Clusters, 10WMDE-Analytics-Engineering, 10User-GoranSMilovanovic: Downscale Pyspark scripts spark2-submit configs to manageable proportions - https://phabricator.wikimedia.org/T268684 (10JAllemandou) Hi @GoranSMilovanovic - Thank you for the changes. However it's not feasible for me to check them... 
[12:58:20] 10Analytics-Clusters, 10WMDE-Analytics-Engineering, 10User-GoranSMilovanovic: Downscale Wikidata-analysis pyspark scripts to analytics limits - https://phabricator.wikimedia.org/T268684 (10JAllemandou) [12:58:53] PROBLEM - aqs endpoints health on aqs1009 is CRITICAL: /analytics.wikimedia.org/v1/pageviews/per-article/{project}/{access}/{agent}/{article}/{granularity}/{start}/{end} (Get per article page views) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs [12:59:11] RECOVERY - aqs endpoints health on aqs1008 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs [13:00:31] RECOVERY - aqs endpoints health on aqs1009 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs [13:14:37] PROBLEM - aqs endpoints health on aqs1009 is CRITICAL: /analytics.wikimedia.org/v1/pageviews/top-by-country/{project}/{access}/{year}/{month} (Get top countries by page views) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs [13:15:01] PROBLEM - aqs endpoints health on aqs1008 is CRITICAL: /analytics.wikimedia.org/v1/mediarequests/per-file/{referer}/{agent}/{file_path}/{granularity}/{start}/{end} (Get per file requests) timed out before a response was received: /analytics.wikimedia.org/v1/pageviews/per-article/{project}/{access}/{agent}/{article}/{granularity}/{start}/{end} (Get per article page views) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs [13:15:19] PROBLEM - aqs endpoints health on aqs1007 is CRITICAL: /analytics.wikimedia.org/v1/pageviews/top-by-country/{project}/{access}/{year}/{month} (Get top countries by page views) timed out before a response was received: /analytics.wikimedia.org/v1/mediarequests/per-file/{referer}/{agent}/{file_path}/{granularity}/{start}/{end} (Get per file requests) timed out before a response was received
https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs [13:17:59] RECOVERY - aqs endpoints health on aqs1009 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs [13:18:23] RECOVERY - aqs endpoints health on aqs1008 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs [13:18:41] RECOVERY - aqs endpoints health on aqs1007 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs [13:25:06] joal: blocked :) [13:25:15] o/ [13:25:17] we should be good now, I am going afk for a bit, please call if needed! [13:25:20] You rock elukey [13:25:27] please go elukey [13:25:29] Bye! [14:08:06] 10Analytics: Alter table for navigation timing errors out in Hadoop test - https://phabricator.wikimedia.org/T268733 (10Ottomata) a:03Ottomata Hm interesting. I'll try to look into this next week. [15:38:16] !log move stat1004 to A5 [15:38:22] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [16:11:37] !log move analytics1065 to C3 [16:11:43] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [16:33:45] ottomata: we can hear you [16:36:01] fdans: I am skipping the sync meeting since I am moving hosts with dcops, ping me if I am needed :) [16:36:26] elukey: no problem! [16:46:12] !log move analytics1066 to C3 [16:46:18] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [17:21:55] 10Analytics-Clusters, 10WMDE-Analytics-Engineering, 10User-GoranSMilovanovic: Downscale Wikidata-analysis pyspark scripts to analytics limits - https://phabricator.wikimedia.org/T268684 (10GoranSMilovanovic) @JAllemandou > Can you please put your configurations in a repo and share so that we can discuss/re... [17:50:07] ottomata / mforns / joal: any idea if/where data for the mediawiki.recentchange stream lands in HDFS? I know I can see it in Kafka and EventStreams, but does it touch down anywhere?
Couldn't find it in event/event_sanitized [17:58:38] hmm... [18:02:02] milimetric: maybe in raw? /wmf/data/raw/event/*_mediawiki_recentchange [18:04:59] milimetric: it isn't event.mediawiki_recentchange [18:05:00] ? [18:05:09] OHHH [18:05:09] no its not [18:05:14] it doesn't have a consistent schema [18:05:16] we can't import it [18:05:49] https://github.com/wikimedia/puppet/blob/production/modules/profile/manifests/analytics/refinery/job/refine.pp#L71-L77 [18:06:30] see the comment at the bottom of https://schema.wikimedia.org/repositories//primary/jsonschema/mediawiki/recentchange/current.yaml [18:08:10] (03PS1) 10GoranSMilovanovic: configs Deployment 20201125 [analytics/wmde/WDCM] - 10https://gerrit.wikimedia.org/r/643530 [18:08:14] milimetric: one qs - is there an npm version that we use to buil wikistats? [18:08:18] *build [18:09:26] (back in a few, will read asap) [18:10:18] (03CR) 10GoranSMilovanovic: [V: 03+2 C: 03+2] configs Deployment 20201125 [analytics/wmde/WDCM] - 10https://gerrit.wikimedia.org/r/643530 (owner: 10GoranSMilovanovic) [18:19:55] 10Analytics-Clusters, 10WMDE-Analytics-Engineering, 10User-GoranSMilovanovic: Downscale Wikidata-analysis pyspark scripts to analytics limits - https://phabricator.wikimedia.org/T268684 (10GoranSMilovanovic) @JAllemandou The changes as per your request are now enforced across the WDCM system. More changes... [18:20:17] 10Analytics, 10Patch-For-Review: Avro Deserializer logging set to DEBUG in pyspark lead to huge yarn stderr container files (causing disk usage alerts) - https://phabricator.wikimedia.org/T268376 (10GoranSMilovanovic) @elukey Testing now as per your suggestion in T268376#6647717. Thank you. [18:24:16] elukey: this is the change I mentioned in standup: https://gerrit.wikimedia.org/r/c/operations/puppet/+/643531 can you please review? 
:] [18:26:23] 10Analytics, 10Operations, 10SRE-Access-Requests: Requesting access to production shell groups for JAnstee - https://phabricator.wikimedia.org/T266249 (10Dzahn) [18:31:47] milimetric: you might want to include seve and jason in any governance discussions [18:40:36] !log restart turnilo to pick up new netflow config changes [18:40:38] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [18:41:09] mforns: deployed, can you check on turnilo?? [18:43:07] also, is there anybody that can help me with wikistats deploy ? :) [18:45:37] 10Analytics, 10Operations, 10SRE-Access-Requests, 10Patch-For-Review: Requesting access to production shell groups for JAnstee - https://phabricator.wikimedia.org/T266249 (10Dzahn) https://wikimediafoundation.org/profile/sumeet-bodington/ shows how Dumisani (T266791) and @JAnstee_WMF are both in Sumeet's... [18:48:13] jenkins is not happy in https://gerrit.wikimedia.org/r/c/analytics/wikistats2/+/643329 [18:48:25] it seems to complain that package.json vs package-lock.json are not in sync [18:48:32] but I'd need help [18:50:28] it says npm ERR! Missing: wmui-base@wikimedia/wikimedia-ui-base [18:50:37] that is indeed in package.json but not in -lock [18:58:28] elukey: the test as requested in https://phabricator.wikimedia.org/T268376#6647717 is now running (driver is stat1005). I will be afk for some time - thus far, the logging seems not to include a ton of DEBUG messages. [18:58:45] GoranSM: nice! [18:59:08] I'll check the yarn app id once done, thanks for pinging me :) [18:59:29] elukey: Ok.
[19:04:52] !log Killing job application_1605880843685_18336 as it consumes too many resources [19:04:53] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [19:14:57] joal: I downscaled the resources for the Pyspark program that runs the job https://yarn.wikimedia.org/proxy/application_1605880843685_18336/ as you have requested in https://phabricator.wikimedia.org/T268684#6649444 and you still needed to kill the job. Should I downscale it even more? Please advise. [19:15:41] Hi GoranSM [19:16:10] This is very bizarre - The job I killed was consuming 3.5Tb RAM and 1600+ CPU cores [19:16:13] GoranSM: --^ [19:16:22] elukey: The job just killed by joal was the one that should have been monitored for the logging problem. So, I will have to see if it needs additional downscaling (it obviously does) and then I will let you know when I start testing again. [19:16:36] Hi Joal [19:16:41] GoranSM: I have checked the logs and indeed the problem of logs is solved :) [19:17:01] GoranSM: you've been faster than me, I planned on pinging you about that exact matter :) [19:17:22] GoranSM: log problem solved (thanks a million elukey for finding the root cause) [19:17:24] joal: Ok, at least the logs are now fine. Let's see about the resources then. I followed your suggestion in https://phabricator.wikimedia.org/T268684#6648373 but it seems that that is not enough. [19:17:38] elukey: thanks!! [19:17:39] elukey: Thank you for fixing the logging thing. [19:17:51] GoranSM: to me it's as if nothing had been done [19:18:04] GoranSM: in terms of downscaling I mean [19:18:15] Can you show me the command you used please?
[19:18:20] GoranSM: --^ [19:18:22] joal: Of course, just a sec [19:18:46] joal: sudo -u analytics-privatedata spark2-submit --master yarn --deploy-mode client --num-executors 100 --driver-memory 20000m --executor-memory 20000m --executor-cores 8 /home/goransm/Analytics/WDCM/WDCM_Scripts/wdcmModule_ETL.py [19:19:07] joal: You requested to reduce the number of cores from 16 to 8. [19:19:29] GoranSM: `--num-executors 100` --> `--conf spark.dynamicAllocation.maxExecutors=100` [19:19:35] Please :) [19:19:48] GoranSM: the number of executors was unlimited [19:19:57] joal: And the problem is that my intuition on the number of executors/cores that I need for something to work efficiently in Spark is very bad. [19:20:20] joal: How do you mean unlimited (100 is a finite number, right?) :) [19:20:21] the setting you use is only active when dynamic allocation is off, which is NOT our case [19:21:10] joal: Interesting. Every day I learn something new. What do you suggest that I should do with this (and many other) Pyspark program, then? Reduce --num-executors to..? [19:21:14] GoranSM: I would suggest to go for small first, then grow if needed :) [19:21:38] joal: Makes sense. Start with - I don't know, ten, twenty, or more? [19:21:56] GoranSM: As stated above - Don't use --num-executors please - use --conf spark.dynamicAllocation.maxExecutors=X instead [19:22:15] joal: Oh oh oh I understand now [19:22:18] GoranSM: If you use --num-executors the setting you give is ignored [19:23:08] joal: So: (1) do not use --num-executors and (2) use --conf spark.dynamicAllocation.maxExecutors=100 ? [19:23:17] Correct sir [19:23:22] GoranSM: --^ [19:23:39] joal: Thank you very much sir - testing again in ten minutes or so. [19:23:42] GoranSM: The second is the replacement for the first when dynamic-allocation is on, which is our case [19:24:05] Great GoranSM - Thanks for that [19:24:31] elukey: netflow in Druid looks good!
[19:24:36] joal: Ok, so the --conf spark.dynamicAllocation.maxExecutors=100 will not allow the cluster to use more than 100, but the cluster will also try to figure how many executors are really needed, right? [19:24:51] mforns: super :) [19:24:55] Also GoranSM, about how to scale - I'd advise doing it by steps starting small and moving up, monitoring execution time and gaining insights [19:25:15] GoranSM: Correct, the cluster will use as many executors as needed, up to 100 [19:25:43] GoranSM: With the previous setting it was using as many executors as needed, up to UNBOUNDED :) [19:26:09] joal: I somehow remember doing that once I've migrated the WDCM etl from HiveQL to Pyspark, but... sincerely I barely remember what were my conclusions back then. Ok, I am up to introduce the changes as suggested. Thank you again! :) [19:26:21] GoranSM: There is no easy rule of thumb for scaling computation in Spark - it depends a lot on the computation, datasize, etc [19:26:44] No prob GoranSM - Thank you for making this happen :) [19:27:30] 10Analytics, 10Product-Analytics: Configure superset cache - https://phabricator.wikimedia.org/T268784 (10JAllemandou) [19:27:59] 10Analytics, 10Product-Analytics: Configure superset cache - https://phabricator.wikimedia.org/T268784 (10JAllemandou) [19:28:37] all right going to log off people, see you tomorrow! happy holidays in the US :) [19:28:45] Bye elukey [19:37:40] 10Analytics, 10Analytics-Kanban, 10Operations, 10netops, 10Patch-For-Review: Add more dimensions in the netflow/pmacct/Druid pipeline - https://phabricator.wikimedia.org/T254332 (10mforns) @ayounsi New fields are in Druid (starting 2020-11-24T03:00:00) :] I've checked that all looks OK, but please do che...
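[Editor's note: the spark2-submit exchange above boils down to one rule - with dynamic allocation enabled, `--num-executors` is ignored and the cap must go through `spark.dynamicAllocation.maxExecutors`. A minimal sketch of the corrected invocation, assuming the helper name and the argument values from GoranSM's command above (this is illustrative, not the actual WDCM script):]

```python
# Illustrative helper (hypothetical name) that builds a spark2-submit
# command line with a bounded executor count. Because the cluster runs
# with dynamic allocation ON, the cap is expressed via
# spark.dynamicAllocation.maxExecutors; --num-executors would be ignored.

def build_spark_submit(script, max_executors=100,
                       driver_memory="20000m",
                       executor_memory="20000m",
                       executor_cores=8):
    """Return spark2-submit args with a dynamic-allocation executor cap."""
    return [
        "spark2-submit",
        "--master", "yarn",
        "--deploy-mode", "client",
        # Cap executors under dynamic allocation (joal's recommendation);
        # do NOT pass --num-executors here, it only applies when
        # dynamic allocation is off.
        "--conf", "spark.dynamicAllocation.maxExecutors=%d" % max_executors,
        "--driver-memory", driver_memory,
        "--executor-memory", executor_memory,
        "--executor-cores", str(executor_cores),
        script,
    ]

args = build_spark_submit("wdcmModule_ETL.py", max_executors=100)
print(" ".join(args))
```

[Starting with a small `max_executors` and growing it while watching execution time, as joal advises, is the safer path than guessing a large cap up front.]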
[19:39:17] 10Analytics-Clusters, 10WMDE-Analytics-Engineering, 10User-GoranSMilovanovic: Downscale Wikidata-analysis pyspark scripts to analytics limits - https://phabricator.wikimedia.org/T268684 (10JAllemandou) Thanks @GoranSMilovanovic for the explanation and addition of config to the repo. As per our chat on IRC,... [19:40:45] mforns: New netflow fields in druid is an awesome end-of-day for me :) Thanks for that! [19:42:26] mforns: I have a question related to that for when you have time: Have you moved netflow to the event database? [19:43:03] joal: cool! no, I haven't yet moved it, why? [19:43:13] mforns: just asking :) [19:43:17] I wondered [19:43:25] joal: that was my next step [19:43:34] this is great :) [19:44:04] joal: do you have any tips on how to move? [19:44:25] mforns: hm, nope - I have not done that ever [19:44:28] we could just stop the refine job for a bit, right [19:44:38] I think that's a right approach [19:44:39] we have the fresh data in Druid by streaming [19:45:15] Stop refine, drop table, move data, recreate table, repair table, update refine conf and here you go :) [19:45:31] yea [19:45:31] and obviously restart refine after that [19:46:34] yes, shouldn't take too long [19:47:01] maybe pairing on that would be a good idea [19:47:19] mforns: +1 to pairing- We'll need our beloved ops around as well :) [19:47:28] yea [19:47:48] maybe next week then [19:48:14] mforns: can be done tomorrow if you wish [19:48:14] until then I'll look at the new data size in Druid [19:48:49] oh, of course, for some reason I thought all SREs were on holiday tomorrow [19:48:57] yea, tomorrow is good for me! [19:49:07] will prepare a patch [19:49:37] \o/ [19:49:50] would you be the one to pair with me? :D [19:49:55] mforns: no pressure though - it can also wait next week :) [19:49:59] For sure :) [19:50:08] great! no, tomorrow is cool! [19:50:28] 10Analytics, 10Product-Analytics: Configure superset cache - https://phabricator.wikimedia.org/T268784 (10Ottomata) OH!
Cool! [19:50:59] elukey: are you OK if we do this tomorrow? maybe before standup? we'll need you to merge stuff :] [19:51:13] he's gone mforns :) [19:51:17] oh ok! [19:55:30] ottomata: I see https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/639600 has been merged, shouldn't it stop the navigationtiming alerts? [20:05:09] 10Analytics, 10Analytics-Kanban, 10Operations, 10netops, 10Patch-For-Review: Add more dimensions in the netflow/pmacct/Druid pipeline - https://phabricator.wikimedia.org/T254332 (10mforns) I guess this can be removed now, although it could stay if you wish! https://github.com/wikimedia/puppet/blob/produc... [20:05:11] 10Analytics, 10Data-Services, 10cloud-services-team (Kanban): Plan a replacement for wiki replicas that is better suited to typical OLAP use cases than the MediaWiki OLTP schema - https://phabricator.wikimedia.org/T215858 (10Jhernandez) A couple of comments from a subtask that I think should continue here:... [20:06:19] (03PS1) 10GoranSMilovanovic: T268684 [analytics/wmde/WDCM] - 10https://gerrit.wikimedia.org/r/643544 [20:06:50] (03CR) 10GoranSMilovanovic: [V: 03+2 C: 03+2] T268684 [analytics/wmde/WDCM] - 10https://gerrit.wikimedia.org/r/643544 (owner: 10GoranSMilovanovic) [20:06:57] (03Merged) 10jenkins-bot: T268684 [analytics/wmde/WDCM] - 10https://gerrit.wikimedia.org/r/643544 (owner: 10GoranSMilovanovic) [20:08:15] joal: testing from stat1005 now as per https://phabricator.wikimedia.org/T268684#6649705 [20:15:49] 10Analytics, 10Patch-For-Review: Avro Deserializer logging set to DEBUG in pyspark lead to huge yarn stderr container files (causing disk usage alerts) - https://phabricator.wikimedia.org/T268376 (10GoranSMilovanovic) @elukey Testing again since I had to incorporate some new changes in respect to T268684#6649705. 
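[Editor's note: joal's table-move checklist above ("Stop refine, drop table, move data, recreate table, repair table, update refine conf") can be sketched as an ordered plan. Everything concrete here - the systemd unit name, database names, and HDFS paths - is a hypothetical stand-in, not the real netflow configuration; the `event` database and paths are only guesses based on the conversation:]

```python
# Rough sketch of the steps mforns and joal agree on for moving a
# Refine-managed table to another database. All names/paths below are
# illustrative assumptions, not the production netflow locations.

def table_move_plan(table, src_db, dst_db, src_path, dst_path):
    """Return the ordered shell/Hive commands to relocate a refined table."""
    return [
        # 1. stop the Refine timer so nothing writes mid-move
        "sudo systemctl stop refine_%s.timer" % table,
        # 2. drop the old table (data stays on HDFS for an external table)
        "hive -e 'DROP TABLE %s.%s'" % (src_db, table),
        # 3. move the underlying data to the new location
        "hdfs dfs -mv %s %s" % (src_path, dst_path),
        # 4. recreate the table over the new location (DDL elided)
        "hive -e 'CREATE TABLE %s.%s ... LOCATION \"%s\"'" % (dst_db, table, dst_path),
        # 5. rediscover the partitions
        "hive -e 'MSCK REPAIR TABLE %s.%s'" % (dst_db, table),
        # 6. after updating the Refine config (puppet), restart the timer
        "sudo systemctl start refine_%s.timer" % table,
    ]

plan = table_move_plan("netflow", "wmf", "event",
                       "/wmf/data/wmf/netflow", "/wmf/data/event/netflow")
```

[The ordering matters: Refine must be stopped first so no job lands data between the drop and the repair, and the repair must come last so the metastore sees every moved partition.]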
[20:22:47] joal: it seems that netflow size in Druid has increased about 50% with new additions [20:22:54] more or less expected [20:24:43] 90 days of new data (raw) should be around 2.8TB [20:25:08] 1 year of new data (sanitized) should be around 0.6TB [21:04:07] (03PS1) 10Milimetric: Release 2.8.3 [analytics/wikistats2] - 10https://gerrit.wikimedia.org/r/643557 [21:06:21] (03CR) 10Milimetric: [C: 03+2] Release 2.8.3 [analytics/wikistats2] - 10https://gerrit.wikimedia.org/r/643557 (owner: 10Milimetric) [21:06:38] (03Abandoned) 10Milimetric: Release 2.8.3 [analytics/wikistats2] - 10https://gerrit.wikimedia.org/r/643329 (owner: 10Elukey) [21:07:50] (03Merged) 10jenkins-bot: Release 2.8.3 [analytics/wikistats2] - 10https://gerrit.wikimedia.org/r/643557 (owner: 10Milimetric) [21:10:19] 10Analytics, 10Operations, 10SRE-Access-Requests, 10Patch-For-Review: Requesting access to production shell groups for JAnstee - https://phabricator.wikimedia.org/T266249 (10Dzahn) 05Open→03Resolved a:05Dzahn→03JAnstee_WMF Hi @janstee_WMF your shell account has been created. You have been upgraded... 
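[Editor's note: a quick back-of-envelope check of mforns's size estimates above. The per-day rates below are simply derived from the quoted totals (2.8 TB over 90 days of raw data, 0.6 TB over a year of sanitized data), using 1 TB = 1024 GB; they are not measured values:]

```python
# Derive implied daily ingestion rates from the totals quoted in channel.

raw_total_tb, raw_days = 2.8, 90    # 90 days of new raw netflow data
san_total_tb, san_days = 0.6, 365   # 1 year of new sanitized netflow data

raw_per_day_gb = raw_total_tb * 1024 / raw_days   # ~31.9 GB/day raw
san_per_day_gb = san_total_tb * 1024 / san_days   # ~1.7 GB/day sanitized

print("raw: %.1f GB/day, sanitized: %.1f GB/day"
      % (raw_per_day_gb, san_per_day_gb))
```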
[21:12:59] Analytics, Operations, SRE-Access-Requests: Requesting access to production shell groups for JAnstee - https://phabricator.wikimedia.org/T266249 (Dzahn)
[21:23:23] (PS1) GoranSMilovanovic: T268684 [analytics/wmde/WD/WD_HumanEdits] - https://gerrit.wikimedia.org/r/643561
[21:23:44] (CR) GoranSMilovanovic: [V: +2 C: +2] T268684 [analytics/wmde/WD/WD_HumanEdits] - https://gerrit.wikimedia.org/r/643561 (owner: GoranSMilovanovic)
[21:25:53] (PS1) GoranSMilovanovic: T268684 [analytics/wmde/WD/WD_HumanEdits] - https://gerrit.wikimedia.org/r/643564
[21:26:20] (CR) GoranSMilovanovic: [V: +2 C: +2] T268684 [analytics/wmde/WD/WD_HumanEdits] - https://gerrit.wikimedia.org/r/643564 (owner: GoranSMilovanovic)
[21:28:31] Analytics-Clusters, WMDE-Analytics-Engineering, Patch-For-Review, User-GoranSMilovanovic: Downscale Wikidata-analysis pyspark scripts to analytics limits - https://phabricator.wikimedia.org/T268684 (GoranSMilovanovic) @Joal Wikidata Human vs Bot Edits system is updated, repo: https://github.com/w...
[21:34:37] PROBLEM - Check the last execution of produce_canary_events on an-launcher1002 is CRITICAL: CRITICAL: Status of the systemd unit produce_canary_events https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[21:40:57] (PS1) GoranSMilovanovic: T268684 [analytics/wmde/WD/WD_languagesLandscape] - https://gerrit.wikimedia.org/r/643567
[21:41:12] (CR) GoranSMilovanovic: [V: +2 C: +2] T268684 [analytics/wmde/WD/WD_languagesLandscape] - https://gerrit.wikimedia.org/r/643567 (owner: GoranSMilovanovic)
[21:44:44] Analytics-Clusters, WMDE-Analytics-Engineering, Patch-For-Review, User-GoranSMilovanovic: Downscale Wikidata-analysis pyspark scripts to analytics limits - https://phabricator.wikimedia.org/T268684 (GoranSMilovanovic) @Joal Wikidata Human vs Bot Edits system is updated, repo: https://github.com/w...
[21:45:17] RECOVERY - Check the last execution of produce_canary_events on an-launcher1002 is OK: OK: Status of the systemd unit produce_canary_events https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[22:07:08] (PS1) GoranSMilovanovic: T268684 [analytics/wmde/WDCM] - https://gerrit.wikimedia.org/r/643570
[22:07:20] (CR) GoranSMilovanovic: [V: +2 C: +2] T268684 [analytics/wmde/WDCM] - https://gerrit.wikimedia.org/r/643570 (owner: GoranSMilovanovic)
[22:37:12] (PS1) GoranSMilovanovic: T268684 [analytics/wmde/WD/WD_percentUsageDashboard] - https://gerrit.wikimedia.org/r/643574
[22:37:46] (CR) GoranSMilovanovic: [V: +2 C: +2] T268684 [analytics/wmde/WD/WD_percentUsageDashboard] - https://gerrit.wikimedia.org/r/643574 (owner: GoranSMilovanovic)
[22:41:03] Analytics-Clusters, WMDE-Analytics-Engineering, Patch-For-Review, User-GoranSMilovanovic: Downscale Wikidata-analysis pyspark scripts to analytics limits - https://phabricator.wikimedia.org/T268684 (GoranSMilovanovic) @Joal Wikidata Usage & Coverage system is updated, repo: https://github.com/wik...
[22:42:28] (PS1) GoranSMilovanovic: minor [analytics/wmde/WD/WD_percentUsageDashboard] - https://gerrit.wikimedia.org/r/643577
[23:16:47] (PS1) GoranSMilovanovic: T268684 [analytics/wmde/WD/WD_identifierLandscape] - https://gerrit.wikimedia.org/r/643586
[23:17:01] (CR) GoranSMilovanovic: [V: +2 C: +2] T268684 [analytics/wmde/WD/WD_identifierLandscape] - https://gerrit.wikimedia.org/r/643586 (owner: GoranSMilovanovic)
[23:17:14] (CR) GoranSMilovanovic: [V: +2 C: +2] minor [analytics/wmde/WD/WD_percentUsageDashboard] - https://gerrit.wikimedia.org/r/643577 (owner: GoranSMilovanovic)
[23:21:49] Analytics-Clusters, WMDE-Analytics-Engineering, Patch-For-Review, User-GoranSMilovanovic: Downscale Wikidata-analysis pyspark scripts to analytics limits - https://phabricator.wikimedia.org/T268684 (GoranSMilovanovic) @Joal Wikidata External Identifiers Landscape system is updated, repo: https://...