[07:55:17] Good morning [08:11:37] (03CR) 10Joal: [V: 03+2 C: 03+2] "Merging" [analytics/refinery] - 10https://gerrit.wikimedia.org/r/645344 (https://phabricator.wikimedia.org/T269437) (owner: 10Gerrit maintenance bot) [08:13:03] (03CR) 10Joal: [V: 03+2 C: 03+2] "Merging" [analytics/refinery] - 10https://gerrit.wikimedia.org/r/645330 (https://phabricator.wikimedia.org/T269426) (owner: 10Gerrit maintenance bot) [08:14:08] (03CR) 10Joal: [V: 03+2 C: 03+2] "Merging" [analytics/refinery] - 10https://gerrit.wikimedia.org/r/645332 (https://phabricator.wikimedia.org/T269431) (owner: 10Gerrit maintenance bot) [11:27:16] [VOTE] Release Bigtop version 1.5.0 - \o/ [11:28:03] just received the email from the dev mailing list, there is a 72h window to vote/comment/etc., which is unfortunate since I'd like to upgrade the test cluster to see if everything works [11:28:09] I'll ask for a bit more time [11:28:58] (native packages for buster too!) [11:56:16] \o/ [12:03:35] 10Analytics, 10Operations, 10SRE-Access-Requests: Kerberos Password - https://phabricator.wikimedia.org/T269472 (10ssingh) a:03ssingh [12:04:02] 10Analytics, 10Operations, 10SRE-Access-Requests: Kerberos Password - https://phabricator.wikimedia.org/T269472 (10ssingh) (Additional context: T267314). [12:09:49] 10Analytics, 10Operations, 10SRE-Access-Requests, 10Patch-For-Review: Kerberos Password - https://phabricator.wikimedia.org/T269472 (10ssingh) 05Open→03Resolved Hi @Swagoel: You should have received an email with the Kerberos password. Please let us know if there are any issues, thanks!
[13:41:54] (03PS1) 10Awight: Sanitize and keep TemplateDataEditor events [analytics/refinery] - 10https://gerrit.wikimedia.org/r/646670 (https://phabricator.wikimedia.org/T260343) [14:29:08] (03CR) 10Ottomata: [C: 03+1] Add netflow to eventlogging sanitization include-list [analytics/refinery] - 10https://gerrit.wikimedia.org/r/645419 (https://phabricator.wikimedia.org/T231339) (owner: 10Mforns) [15:24:05] (03CR) 10Awight: [C: 03+1] "I love how these reports are broken down into small pieces. But there might be some small efficiency gain if we collect all the preferenc" (034 comments) [analytics/reportupdater-queries] - 10https://gerrit.wikimedia.org/r/645345 (https://phabricator.wikimedia.org/T260138) (owner: 10Andrew-WMDE) [15:35:52] ottomata: hello! whenever you have some time today, can I ask for a (very) quick review for https://gerrit.wikimedia.org/r/c/operations/puppet/+/645372/? it's about an access request for which we have your +1 already, but this is for the CR [15:35:56] not urgent, thanks! [15:36:45] +1 APPROVED [15:36:53] THANK YOU! 
[15:36:55] :P [15:52:44] a-team I'm running late for standup, sorry [15:53:34] 10Analytics, 10Event-Platform: Should Webrequest.isWikimediaHost return true for www.wikipedia.org - https://phabricator.wikimedia.org/T269597 (10Ottomata) [16:09:55] 10Analytics, 10Analytics-Kanban, 10Event-Platform: Refine should add field to indicate if event is from wikimedia domain instead of filtering - https://phabricator.wikimedia.org/T256677 (10Ottomata) [16:10:04] 10Analytics, 10Analytics-Kanban, 10Event-Platform: Refine should add field to indicate if event is from wikimedia domain instead of filtering - https://phabricator.wikimedia.org/T256677 (10Ottomata) a:03Ottomata [16:11:50] 10Analytics: Refine should always DROPMALFORMED but alert if records are dropped - https://phabricator.wikimedia.org/T266872 (10Ottomata) [16:12:10] 10Analytics, 10Analytics-Kanban: Refine should always DROPMALFORMED but alert if records are dropped - https://phabricator.wikimedia.org/T266872 (10Ottomata) [16:13:16] (03CR) 10Thiemo Kreuz (WMDE): [C: 03+1] Sanitize and keep TemplateDataEditor events [analytics/refinery] - 10https://gerrit.wikimedia.org/r/646670 (https://phabricator.wikimedia.org/T260343) (owner: 10Awight) [16:15:10] 10Analytics: Update refinery-core Webrequest.isWikimediaHost - https://phabricator.wikimedia.org/T256674 (10Ottomata) [16:15:13] 10Analytics, 10Event-Platform: Should Webrequest.isWikimediaHost return true for www.wikipedia.org - https://phabricator.wikimedia.org/T269597 (10Ottomata) [16:15:48] 10Analytics: Update refinery-core Webrequest.isWikimediaHost - https://phabricator.wikimedia.org/T256674 (10Ottomata) a:03Ottomata [16:15:57] 10Analytics, 10Analytics-Kanban: Update refinery-core Webrequest.isWikimediaHost - https://phabricator.wikimedia.org/T256674 (10Ottomata) [16:20:47] 10Analytics-Radar, 10WMDE-Analytics-Engineering, 10Patch-For-Review, 10User-GoranSMilovanovic: Downscale Wikidata-analysis pyspark scripts to analytics limits - 
https://phabricator.wikimedia.org/T268684 (10Milimetric) [16:23:59] 10Analytics, 10Design: Broken icons on https://analytics.wikimedia.org/ - https://phabricator.wikimedia.org/T255840 (10Milimetric) [16:24:02] 10Analytics, 10Analytics-Kanban, 10Product-Analytics, 10Epic: Revamp analytics.wikimedia.org data portal & landing page - https://phabricator.wikimedia.org/T253393 (10Milimetric) [16:24:21] 10Analytics, 10Analytics-Kanban, 10Product-Analytics, 10Epic: Revamp analytics.wikimedia.org data portal & landing page - https://phabricator.wikimedia.org/T253393 (10Milimetric) [16:28:10] 10Analytics, 10Product-Analytics, 10Product-Infrastructure-Data, 10Wikipedia-iOS-App-Backlog: [Bug] Metrics API missing November and December data - https://phabricator.wikimedia.org/T269360 (10Milimetric) 05Open→03Declined > For daily, there is still data for 12/1 or 12/2 even now that monthly data is... [16:28:31] 10Analytics-Clusters: Deprecate the anaytics-users POSIX group - https://phabricator.wikimedia.org/T269150 (10Ottomata) [16:28:53] 10Analytics-Clusters, 10Analytics-Kanban, 10Patch-For-Review: Deprecate the 'researchers' posix group - https://phabricator.wikimedia.org/T268801 (10Ottomata) [16:29:54] 10Analytics, 10Analytics-Kanban, 10Patch-For-Review: mediawiki-wikitext-history-2020-10 failed - https://phabricator.wikimedia.org/T269032 (10Milimetric) p:05Triage→03High a:03JAllemandou [16:32:27] 10Analytics, 10Analytics-Kanban, 10Chinese-Sites, 10Pageviews-Anomaly: Unusual high page view on Chinese Wikipedia - https://phabricator.wikimedia.org/T269065 (10Milimetric) p:05Triage→03High a:03Milimetric [16:33:43] 10Analytics-Clusters, 10Operations: Backport kafkacat 1.6.0 from bullseye to buster-backports or buster-wikimedia - https://phabricator.wikimedia.org/T268936 (10Ottomata) a:03klausman [16:33:52] 10Analytics-Clusters, 10Operations: Backport kafkacat 1.6.0 from bullseye to buster-backports or buster-wikimedia - https://phabricator.wikimedia.org/T268936 (10Ottomata) 
p:05Triage→03Medium [17:21:27] 10Analytics, 10CheckUser, 10Patch-For-Review, 10Platform Team Workboards (Clinic Duty Team), 10Schema-change: Schema changes for `cu_changes` and `cu_log` table - https://phabricator.wikimedia.org/T233004 (10kaldari) @WDoranWMF @AMooney - This was being worked on by a volunteer a year ago, but never made... [17:47:20] Hi, I am having issues starting a scala kernel on stat1008 using Newpyter. Starting the kernel fails silently, i.e. there are no errors and I don't have the perms to see logs mentioned on the wiki (`sudo journalctl --since "1 hour ago" -u jupyterhub`). This worked for me as of Saturday, and a different scala kernel that doesn't create a spark session during startup still succeeds. Any ideas what could be going [17:47:21] wrong? [17:49:21] fkaelin: Have you kinit? [17:50:42] fkaelin: and, hi ! sorry :) [17:54:49] joal yes i have [17:56:14] fkaelin: ack! other questions around this (kerberos) idea - when you say newpyter, you're talking about the existing jupyter server running on every stat host and that you can access using ssh tunnel, not your own jupyter server - right? [17:57:12] fkaelin: If that is the case, you need to kinit 'inside' the venv, meaning through a terminal opened via the jupyterhub webapp [17:57:46] ah- sorry about that, i think it might be kerberos after all. i am using this for ref https://wikitech.wikimedia.org/wiki/Analytics/Systems/Jupyter#Newpyter [17:58:15] right fkaelin - start a terminal within your notebook env and kinit in there [17:58:19] so i see this [17:58:30] https://www.irccloud.com/pastebin/HRT34ASR/ [17:58:37] Restart your kernel, and problem should be solved (I forget about doing that every time I run a new notebook :( [17:59:48] i did do kinit right before that, but look at the duration of validity. i reran kinit after your comment, and now it expires only on 12/09/2020 17:55:34 [18:00:36] No clue fkaelin :( [18:00:55] fkaelin: I assume you kinit within notebook terminal?
[18:01:10] maybe i ran kinit right before the token actually expired. [18:01:16] ah no, it was in a separate terminal [18:03:03] well now it works in any case. thanks joal [18:03:35] np fkaelin - kerberos errors in scala notebooks materialize for me as silent errors - That's why I asked about that first [18:11:42] ya fkaelin annoyingly, the kinit has to happen inside the jupyter terminal, not an ssh terminal. the jupyter one and your ssh one on the box are separate tickets [18:13:04] 10Analytics-Radar, 10ChangeProp, 10Discovery-Search, 10Event-Platform, and 3 others: Better way to pause writes on elasticsearch - https://phabricator.wikimedia.org/T230730 (10CBogen) [18:19:40] PROBLEM - Disk space on Hadoop worker on an-worker1101 is CRITICAL: DISK CRITICAL - free space: /var/lib/hadoop/data/h 16 GB (0% inode=99%): https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Administration [18:20:46] 10Analytics-Radar, 10Product-Analytics: Content for analytics.wikimedia.org - https://phabricator.wikimedia.org/T267254 (10mpopov) [18:27:40] ottomata, razzi -- Could any of you have a look at an-worker1101 please? [18:28:09] joal: I'll take a look [18:28:35] thanks razzi - I'm interested if a particular app-id is filling up the space [18:28:46] joal: i'm looking at it too [18:28:55] am actually not sure how to tell that [18:29:02] PROBLEM - Disk space on Hadoop worker on an-worker1101 is CRITICAL: DISK CRITICAL - free space: /var/lib/hadoop/data/h 16 GB (0% inode=99%): https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Administration [18:29:05] there is def one disk being filled [18:32:54] hm actually [18:32:57] the cluster is not very busy right now [18:33:04] and there is only one app running on 1101 [18:33:04] application_1605880843685_61049 [18:33:21] Right ottomata - a flink [18:33:24] yeah [18:33:30] kinda expected :) [18:33:34] zpapierski: ping?
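The "Disk space on Hadoop worker" alert above boils down to comparing free space against a threshold. A minimal sketch of that kind of check is below; the function name, path, and threshold are illustrative (the real check is an Icinga/NRPE plugin, not this script):

```shell
# Sketch of a free-space check like the one behind the alert above.
check_disk() {
  path=$1; min_gb=$2
  # df --output=avail prints a header line then the free KiB for the mount
  free_kb=$(df --output=avail -k "$path" | tail -n 1)
  free_gb=$((free_kb / 1024 / 1024))
  if [ "$free_gb" -lt "$min_gb" ]; then
    echo "DISK CRITICAL - free space: $path $free_gb GB"
  else
    echo "DISK OK"
  fi
}
check_disk / 0   # prints DISK OK (free space can never be below a 0 GB threshold)
```

The production alert fired because the data disk on an-worker1101 dropped to 16 GB free, below whatever threshold the real plugin uses.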
[18:34:00] OH [18:34:01] it is yarn [18:34:05] sorry i was reading my du wrong [18:34:07] so yes [18:34:09] It is logs [18:34:12] ok i thought it was actual hdfs blocks [18:34:12] yeah [18:34:28] Long running app in yarn generating too much logs without intermediate aggregation [18:34:33] hmmmmmm [18:34:42] riiight [18:34:42] because [18:34:48] we don't usually do long running apps like this! [18:34:55] is there such a thing as intermediate aggregation? [18:34:57] correct ottomata [18:35:04] Not that I know, ottomata [18:35:07] hmmm [18:35:47] maybe there is [18:35:49] <property> [18:35:49] <name>yarn.nodemanager.log-aggregation.roll-monitoring-interval-seconds</name> [18:35:49] <value>3600</value> [18:35:49] </property> [18:36:03] The logs for running applications are aggregated at the specified interval. The minimum monitoring interval value is 3600 seconds (one hour). [18:37:15] why have we never set this! [18:37:19] Defines how often NMs wake up to upload log files. The default value is -1. By default, the logs will be uploaded when the application is finished. By setting this configuration, logs can be uploaded periodically when the application is running. The minimum rolling-interval-seconds that can be set is 3600. [18:37:51] ottomata: because we never needed that :) [18:38:16] yeah but even for long-but-not-forever type jobs, it would be nice to have access to logs more easily periodically [18:38:17] looks like we do now~ [18:38:29] right [18:38:31] making task and then patch [18:39:17] joal: Zbyzsko is on vacation, feel free to kill anything he's running if it's causing [18:39:25] issues [18:39:28] ack dcausse - thanks for that [18:39:46] ottomata: shall we kill the app or wait for your patch?
[18:40:27] 10Analytics: Set yarn.nodemanager.log-aggregation.roll-monitoring-interval-seconds - https://phabricator.wikimedia.org/T269616 (10Ottomata) a:03Ottomata [18:40:28] hm joal [18:40:29] https://phabricator.wikimedia.org/T269616 [18:40:34] i'm going to have to restart the nodemanagers to apply [18:40:39] i guess we should just kill [18:40:42] and then we can wait for review too [18:40:46] Ack ottomata [18:40:48] ottomata: joal We can truncate the log to free up space: [18:40:48] `316G /var/lib/hadoop/data/h/yarn/logs/application_1605880843685_61049/container_e27_1605880843685_61049_01_000002/taskmanager.log` [18:40:58] razzi: that's a good idea too [18:41:03] maybe that is nicer [18:41:09] it isn't writing a lot of logs [18:41:12] it's just running forever [18:41:33] Alright I'll truncate it [18:41:33] dcausse: is there a reason zbyskos flink app should keep running? [18:41:36] 316G of logs in 5 days is not negligible :) [18:41:42] i think we can keep it running [18:42:08] !log truncate /var/lib/hadoop/data/h/yarn/logs/application_1605880843685_61049/container_e27_1605880843685_61049_01_000002/taskmanager.log on an-worker1101 [18:42:10] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [18:42:21] ottomata: I don't think it's being used, he was just testing swift I guess [18:42:26] ok [18:42:41] the one flink app that matters (for us) is the one run by analytics-search [18:42:48] ahh i see [18:42:51] got it [18:42:56] ok then let's kill too [18:43:07] razzi want to kill it?
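Truncating the log in place, as razzi just did, is the right move here rather than deleting it: the NodeManager keeps the file open, and deleting an open file frees no space until the writing process exits, while `truncate -s 0` reclaims the blocks immediately. A small demonstration on a throwaway temp file (the real path was the taskmanager.log above):

```shell
# Demonstrate freeing space from a large, still-open log file in place.
log=$(mktemp)
head -c 1048576 /dev/zero > "$log"   # simulate a 1 MiB log file
truncate -s 0 "$log"                 # same operation as run on an-worker1101
stat -c %s "$log"                    # prints 0: space reclaimed immediately
rm -f "$log"
```

One caveat: if the writer keeps its old file offset (no O_APPEND), subsequent writes can leave a sparse hole, but disk usage still drops.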
[18:43:13] ack ottomata - doing that (except if you wish to do it razzi ) [18:43:18] sudo -u hdfs yarn application -kill application_1605880843685_61049 [18:43:21] * joal holds the killing :) [18:43:28] (and log it too :) ) [18:43:50] !log kill testing flink job: sudo -u hdfs yarn application -kill application_1605880843685_61049 [18:43:52] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [18:43:55] ty [18:44:07] ok, i'm not going to hurry with a puppet patch then [18:44:09] i've filed a ticket [18:44:11] awesome thanks ottomata and razzi [18:45:16] Hmm, running into a kerberos error, even after kinit [18:45:43] razzi: sudo -u hdfs kerberos-run-command hdfs yarn application -kill application_1605880843685_61049 [18:46:00] Turns out I was able to do it without sudo even [18:46:13] it is indeed feasible razzi :) [18:51:18] !log Test mediawiki-wikitext-history new sizing settings [18:51:20] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [19:04:54] 10Analytics, 10Analytics-Kanban, 10Product-Analytics, 10Structured-Data-Backlog, 10Patch-For-Review: Add image table to monthly sqoop list - https://phabricator.wikimedia.org/T266077 (10nettrom_WMF) I checked the Data Lake and noticed we now have `wmf_raw.mediawiki_image` there. From a quick query of it... [19:06:58] 10Analytics, 10Analytics-Kanban, 10Product-Analytics, 10Structured-Data-Backlog, 10Patch-For-Review: Add image table to monthly sqoop list - https://phabricator.wikimedia.org/T266077 (10JAllemandou) Ah! Forgot to mention in the task. The `image` is now present indeed for almost all wikis - One however ha... [19:13:17] RECOVERY - Disk space on Hadoop worker on an-worker1101 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Administration [19:15:52] milimetric: I managed to make the commons image sqoop work using an interesting trick: reducing the number of fetched rows! [19:16:08] joal: woah! [19:16:17] interesting...
so somehow the size was too much? [19:16:18] milimetric: I think that for some objects the rows are so big that fetching too many of them is just too much at once [19:16:34] joal: maybe they base-64 encode some of the images inline? [19:16:51] milimetric: I also checked the read_timeout conf and it is at its default value [19:16:56] (03PS1) 10Ottomata: Move pageview filters to PageviewDefinition; add Webrequest.isWMFDomain [analytics/refinery/source] - 10https://gerrit.wikimedia.org/r/646808 (https://phabricator.wikimedia.org/T256674) [19:17:21] maybe it's because there's the metadata mediumblob in the table? not sure how big that can be? [19:17:21] milimetric: I think image_metadata might get big - will triple-check once the data has finished uploading [19:17:32] Nettrom: you read my mind :) [19:18:54] milimetric: with your approval I'll bring back my change for sqoop tasks fluidification - I have a base patch :) [19:18:55] joal: also, thanks for making this happen! I'm already running queries against wmf_raw.mediawiki_image :) [19:19:04] \o/ Nettrom [19:19:21] joal: sounds good [19:19:38] Nettrom: just fyi: there's no commons data yet, we faked the "SUCCESS" flag [19:19:52] (03PS2) 10Ottomata: Move pageview filters to PageviewDefinition; add Webrequest.isWMFDomain [analytics/refinery/source] - 10https://gerrit.wikimedia.org/r/646808 (https://phabricator.wikimedia.org/T256674) [19:20:21] milimetric: Thanks! I saw joal's comment on the phab task about that.
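The "reduce the number of fetched rows" trick joal describes maps to Sqoop's `--fetch-size` option, which caps how many rows the JDBC driver buffers per round trip, so very wide rows (such as large img_metadata blobs) don't blow up memory in a single fetch. A hypothetical sketch follows; the host, table, and target directory are placeholders, and the command is wrapped in `echo` so the sketch is safe to run without Sqoop installed:

```shell
# Hypothetical sqoop invocation illustrating the fetch-size trick.
# All connection details below are placeholders, not production values.
sqoop_image_import() {
  echo sqoop import \
    --connect "jdbc:mysql://db-host/commonswiki" \
    --table image \
    --fetch-size 100 \
    --target-dir "/wmf/data/raw/placeholder/image"
}
sqoop_image_import
```

The default fetch size is usually far larger; shrinking it trades more round trips for a bounded per-fetch memory footprint, which is exactly the trade you want when individual rows are megabytes wide.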
In my case I should be able to work around it [19:20:26] Nettrom: I have not advertised in the task but there is more than just the image (category, categorylinks, externallinks, image, iwlinks, langlinks, templatelinks and in private data watchlist) [19:21:21] joal: I may have noticed that in your patch ;) that's also awesome, it's great to be able to query that in the Data Lake [19:23:39] (03CR) 10jerkins-bot: [V: 04-1] Move pageview filters to PageviewDefinition; add Webrequest.isWMFDomain [analytics/refinery/source] - 10https://gerrit.wikimedia.org/r/646808 (https://phabricator.wikimedia.org/T256674) (owner: 10Ottomata) [19:35:26] Gone for tonight folks - see you tomorrow [20:01:42] (03PS3) 10Ottomata: Move pageview filters to PageviewDefinition; add Webrequest.isWMFHostname [analytics/refinery/source] - 10https://gerrit.wikimedia.org/r/646808 (https://phabricator.wikimedia.org/T256674) [20:34:44] (03PS1) 10Ottomata: Refine - Add is_wmf_domain transform function [analytics/refinery/source] - 10https://gerrit.wikimedia.org/r/646828 (https://phabricator.wikimedia.org/T256677) [20:40:08] (03CR) 10jerkins-bot: [V: 04-1] Refine - Add is_wmf_domain transform function [analytics/refinery/source] - 10https://gerrit.wikimedia.org/r/646828 (https://phabricator.wikimedia.org/T256677) (owner: 10Ottomata) [20:48:45] (03PS2) 10Ottomata: Refine - Add is_wmf_domain transform function [analytics/refinery/source] - 10https://gerrit.wikimedia.org/r/646828 (https://phabricator.wikimedia.org/T256677) [20:53:11] (03CR) 10jerkins-bot: [V: 04-1] Refine - Add is_wmf_domain transform function [analytics/refinery/source] - 10https://gerrit.wikimedia.org/r/646828 (https://phabricator.wikimedia.org/T256677) (owner: 10Ottomata) [20:58:08] (03PS3) 10Ottomata: Refine - Add is_wmf_domain transform function [analytics/refinery/source] - 10https://gerrit.wikimedia.org/r/646828 (https://phabricator.wikimedia.org/T256677) [21:04:46] (03CR) 10jerkins-bot: [V: 04-1] Refine - Add is_wmf_domain
transform function [analytics/refinery/source] - 10https://gerrit.wikimedia.org/r/646828 (https://phabricator.wikimedia.org/T256677) (owner: 10Ottomata) [21:08:51] (03PS4) 10Ottomata: Refine - Add is_wmf_domain transform function [analytics/refinery/source] - 10https://gerrit.wikimedia.org/r/646828 (https://phabricator.wikimedia.org/T256677) [21:36:47] 10Analytics, 10Analytics-Kanban: Refine should always DROPMALFORMED but alert if records are dropped - https://phabricator.wikimedia.org/T266872 (10Ottomata) > If the job fails altogether, try it again in X minutes (X = 30?). If it fails again, alert. Retrying after dropped records wouldn't change the result.... [21:58:35] 10Analytics, 10Analytics-Kanban: Refine should always DROPMALFORMED but alert if records are dropped - https://phabricator.wikimedia.org/T266872 (10Ottomata) I'm not sure this is possible! We'd need to do multiple passes on the data. One pass to read the data in PERMISSIVE mode, to know the number of records tha...