[01:56:26] PROBLEM - Check the last execution of monitor_refine_event_failure_flags on an-launcher1002 is CRITICAL: CRITICAL: Status of the systemd unit monitor_refine_event_failure_flags https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers [05:04:28] PROBLEM - Check the last execution of analytics-dumps-fetch-pageview_complete_dumps on labstore1006 is CRITICAL: CRITICAL: Status of the systemd unit analytics-dumps-fetch-pageview_complete_dumps https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers [05:04:38] PROBLEM - Check the last execution of analytics-dumps-fetch-pageview_complete_dumps on labstore1007 is CRITICAL: CRITICAL: Status of the systemd unit analytics-dumps-fetch-pageview_complete_dumps https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers [06:03:50] PROBLEM - Check the last execution of analytics-dumps-fetch-geoeditors_dumps on labstore1006 is CRITICAL: CRITICAL: Status of the systemd unit analytics-dumps-fetch-geoeditors_dumps https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers [06:10:10] PROBLEM - Check the last execution of analytics-dumps-fetch-geoeditors_dumps on labstore1007 is CRITICAL: CRITICAL: Status of the systemd unit analytics-dumps-fetch-geoeditors_dumps https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers [07:01:45] 10Analytics: dumps::web::fetches::stats job should use a user to pull from HDFS that exists in Hadoop cluster - https://phabricator.wikimedia.org/T271362 (10ArielGlenn) I'd like that not to be root; what are our choices? Looping in @Bstorm who will have thoughts on this I am sure, since those are WMCS boxes doin... [07:24:52] 10Analytics: dumps::web::fetches::stats job should use a user to pull from HDFS that exists in Hadoop cluster - https://phabricator.wikimedia.org/T271362 (10elukey) I'll try to follow up on this today, we tightened up the file permission settings on hdfs and some use cases like this one popped up. The alternativ... [07:25:26] Good morning [07:25:33] We haz new permz :) [07:25:54] bonjor joal ! [07:26:07] it is also good that some corner cases popped up as consequence [07:26:14] I really like the new scheme [07:26:16] How are you elukey ? Have you met the kings yesterday? [07:26:54] Agreed - changes of read rights not making some stuff fail is weird :) [07:32:48] joal: it is funny since in It we don't celebrate the kings but https://en.wikipedia.org/wiki/Befana [07:32:58] Ah! [07:33:02] that it is more like Santa [07:33:18] Please excuse my ignorance :) [07:34:03] nono please, it is related to the three kings more or less, but nobody really think about it anymore [07:34:30] it is nice to see how things change across catholic countries :D [07:34:34] elukey: I like very much (trying) to understand better :) [07:36:47] Viva, viva La Befana! [07:37:51] asked to Marcel if they have anything similar, really interested :D [07:37:59] indeed! [07:38:22] I am reading the alerts@ dir, no bueno [07:38:30] all those refine errors are so weird [07:38:36] hm [07:38:49] elukey: Maybe we should have restarted oozie? [07:39:19] elukey: there is also a message from Mortten in the task about failures [07:39:28] I have not started the alerts folder yet [07:39:34] will go for it in a minute [07:39:59] no idea, it seems that a lot of coordinators are waiting, we paused oozie for a while so It may be due to a backlog causing delays? [07:40:09] hm [07:40:32] what is the task about failures? 
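For anyone triaging systemd-timer alerts like the ones at the top of this log, the usual first step is to look at the failing unit on the host named in the alert and read its journal; once the underlying issue is fixed, re-running the unit clears the Icinga check (a rough sketch using the unit and host from the first alert; the wikitech page linked in the alerts documents the full procedure):

    # on an-launcher1002
    sudo systemctl status monitor_refine_event_failure_flags.service
    sudo journalctl -u monitor_refine_event_failure_flags.service --since '12 hours ago'
    # after fixing the root cause, a manual run clears the CRITICAL state
    sudo systemctl start monitor_refine_event_failure_flags.service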
[07:40:57] (the coords waiting are all the daily ones) [07:42:28] elukey: https://phabricator.wikimedia.org/T270629#6727202 [07:42:41] Ok diving into erriors [07:44:11] wow - indeed all coords are suspended [07:44:16] This is weird [07:44:42] elukey: could it be related to the 'oozie' user not being part of the needed group for HDFS readability? [07:45:44] joal: but oozie doesn't really need to create/read/etc.. files, the mapred jobs are ran as other users [07:45:55] elukey: not so sure! [07:46:06] elukey: for instance oozie reads its config from hdfs [07:47:00] joal: sure but it has its own directory that it can read [07:48:14] Morten's error should be temporary in theory, I wonder if there were chmods in progress [07:48:21] elukey: new subfolder will be unreadable for all though [07:48:30] org.apache.hadoop.security.AccessControlException: Permission denied: user=nettrom, access=READ_EXECUTE, inode="/wmf/data/event/mediawiki_revision_tags_change/datacenter=eqiad/year=2021/month=1/day=6/hour=13":analytics:analytics:drwxr-x--- [07:48:43] but the dir is now analytics:analytics-privatedata-users [07:49:03] joal: what do you mean "unreadable for all" ? [07:49:32] elukey: it'll be rw-r----- [07:49:45] actually rwxr-x--- [07:49:48] since it;s a folder [07:49:59] Therefore not readable by ALL [07:50:04] by OTHER sorry [07:50:05] which one joal ? [07:50:10] anyone :) [07:50:23] Since we use the new umask [07:50:24] yes it is the goal :D [07:50:30] indeed :) [07:50:49] this means that for instance newly dpeloyed refinery folder will not be accessible to OTHERS [07:51:21] So if oozie is not in the correct group, it'll not be able to read the config folder [07:51:36] This is not the current case, but I prefered to mention, in case [07:51:59] ahhh okok now I get it [07:52:14] Also, I have checked the folder Mortten mentioned and it is indeed -rw-r----- 3 analytics analytics-privatedata-users [07:52:22] So all good for that IMO [07:52:32] I think it happened when the chmods were in progress [07:52:38] agreed --^ [07:52:49] ok, so back to unlocking oozie [07:53:04] elukey: now that it is all stalled, shall we restrart it, just in case| [07:53:07] ? [07:53:32] And actually, maybe we want to do a new deploy and a job restart, to make sure it works with the new perms? [07:54:07] !log restart oozie on an-coord1001 [07:54:09] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [07:57:51] mmm joal browser-general-coord for example still shows up as "SUSPENDED" [07:58:11] elukey: webrequest as well [07:58:14] a lot of coords are suspended :D [07:58:20] we didn't re-enabled them [07:58:21] I think most of them are [07:58:30] Have they been disabled? [07:58:56] yeah it was something that I suggested as precautionary measure to avoid files being created when the new settings were applied [07:59:08] Ok - this explains then :) [07:59:20] !log re-enabling all oozie jobs previously suspended [07:59:22] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [08:01:01] ok we are back in business [08:04:55] also we should alert the discovery people as well [08:05:32] doing so [08:05:33] elukey: also about refine jobs failing: https://gerrit.wikimedia.org/r/c/operations/puppet/+/654650 [08:07:58] I saw it yes [08:08:27] now - shall we test a refinery-deploy elukey? [08:08:46] or do you wish we fix the dumpsgen stuff first? 
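The suspend/resume dance a bit further up is all doable from the oozie CLI; a minimal sketch, assuming OOZIE_URL is exported in the environment and a valid Kerberos ticket is held, as on the analytics client hosts (the coordinator id below is a placeholder):

    # list coordinators still in SUSPENDED state
    oozie jobs -jobtype coordinator -filter status=SUSPENDED -len 300
    # resume one of them
    oozie job -resume 0000123-210101000000000-oozie-oozi-C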
[08:09:56] dumpsgen could be the next inline if possible [08:10:00] ack [08:10:12] also the refine one is important [08:10:30] do you prefer a lower parallelism? Is there a safer value? [08:10:48] I agree on your point from yesterday: we should try to minimize the right of dumpgen given it resides on labstore hosts [08:11:23] about refine: we can try with 64 and see [08:11:37] elukey: worst case, some big jobs fail (they currently already do) [08:12:30] joal: all right so ok to merge? [08:12:42] yes please [08:12:47] super doing it [08:13:06] elukey: in the meantime I'll review failed refine and rerun if needed [08:13:17] <3 [08:13:45] mmm puppet on an-launcher1002 is still disabled [08:14:09] ok re-enabling [08:14:25] !log re-enable puppet on an-launcher1002 to apply new refine memory settings [08:14:26] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [08:26:02] !log Rerunning 4 failed refine jobs (mediawiki_cirrussearch_request, day=6/hour=20|21, day=7/hour=0|2) [08:26:04] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [08:42:52] elukey: webrequest is gently catching up - Today I'm happy we improved the job performance :) [08:45:03] :) [08:50:02] all right so now dumpsgen [08:51:49] elukey: refine jobs reran successfully [08:51:52] emails sent [08:52:35] goood [08:53:22] ok so in theory dirs like /wmf/data/archive/geoeditors/public/ should have the other bits set [08:53:58] for the directories it should be fine, since they pick the parent's settings (so in this case a chmod will suffice in theory [08:54:18] but for the files we'd need to apply a new umask [08:54:27] :22:40 INFO Refine: Beginning refinement of hdfs://analytics-hadoop/wmf/data/raw/event/eqiad_mediawiki_cirrussearch-request/hourly/2021/01/05/16 -> `event`.`mediawiki_cirrussearch_request` (datacenter="eqiad",year=2021,month=1,day=5,hour=16)... [08:54:31] 21/01/05 19:29:03 INFO Refine: Finished refinement of dataset hdfs://analytics-hadoop/wmf/data/raw/event/eqiad_mediawiki_cirrussearch-request/hourly/2021/01/05/16 -> `event`.`mediawiki_cirrussearch_request` (datacenter="eqiad",year=2021,month=1,day=5,hour=16). (# refined records: 12742802) [08:54:36] woops sorry [08:55:48] elukey: what I think we should do is explictely add a chmod action in oozie jobs for archived stuff [09:00:28] joal: is it the only option? We used to have specific umask settings for camus before yestarday, maybe we could apply the same to those oozie jobs? [09:02:30] Is it an accident that schemas/event/secondary doesn't automatically submit? If my team is merging a schema update, can we go ahead and submit, or is that best left up to analytics or ops engineers? [09:03:59] awight: hi! Better to ask Andrew later on if you can wait, just to be sure [09:04:33] elukey: we probably can do it the same way it's been done for camus, but I'd prefer an explicit step I think (very similar, except for explicuit oozie step vs job setting0 [09:05:02] elukey: Thanks, yes we're fine with waiting. His comment in https://gerrit.wikimedia.org/r/c/schemas/event/secondary/+/649599 makes it sound safe to merge, but it's a surprising workflow so I'll ask him for the full story. [09:08:19] awight: ah sorry you already have the rubberstamp from Andrew, then I think it is fine to submit.. Probably CI hasn't been set up or wasn't part of the plan, so manual actions are needed [09:09:21] joal: I prefer the opposite since it seems more clear and less prone to failures (for example, why do we need an extra step etc..) 
but you are always wiser than me so if you prefer that option let's do it, I don't have any opposition [09:09:42] elukey: let's talk with the team [09:20:04] joal: if you feel strongly about the new step let's create a patch so people can review etc.., I'd prefer to fix this issue today [09:20:17] (if you have time of course) [09:25:27] 10Analytics: dumps::web::fetches::stats job should use a user to pull from HDFS that exists in Hadoop cluster - https://phabricator.wikimedia.org/T271362 (10ArielGlenn) Sounds good to me! [09:37:30] !log forced re-run of monitor_refine_event_failure_flags.service on an-launcher1002 to clear alerts [09:37:32] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [09:39:20] RECOVERY - Check the last execution of monitor_refine_event_failure_flags on an-launcher1002 is OK: OK: Status of the systemd unit monitor_refine_event_failure_flags https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers [09:41:10] joal: wow FastIngest with Gobblin seems really nice [09:43:49] elukey: it does! [09:43:57] elukey: wanna cave for a minute? [09:46:02] 1 min and I'll join [09:46:06] kack! [10:10:08] PROBLEM - Check the last execution of check_webrequest_partitions on an-launcher1002 is CRITICAL: CRITICAL: Status of the systemd unit check_webrequest_partitions https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers [10:14:55] this might be legit since we have stopped webrequest for hours --^ [10:15:07] I think it is elukey [10:21:55] joal: I'd execute "sudo -u hdfs kerberos-run-command hdfs hdfs dfs -chmod -R o-rwx /wmf/data/event" if you are ok [10:34:54] +1 elukey - sorry for the delay [10:35:10] ack! [10:36:51] !log execute "sudo -u hdfs kerberos-run-command hdfs hdfs dfs -chmod -R o-rwx /wmf/data/event" on an-master1001 to fix some file permissions (an-launcher executed timers during the past hours without the new umask) - T270629 [10:36:52] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [10:40:13] elukey: can you please also do it for /wmf/data/event_sanitized [10:44:07] joal: I will yes [10:44:12] <3 [10:45:39] (I just checked and indeed it has the same problem) [10:46:10] elukey: I think the rest should be ok [10:46:33] famous last words joal, you should have learnt the lesson of not saying them out loud :D [10:46:40] :D [10:47:03] Sorry, I tend to be optimnistic sometimes and forget about The Rule(TM) [10:47:14] elukey: I meant other datasets ;) [10:48:14] ahahaha [11:24:52] !log execute "sudo -u hdfs kerberos-run-command hdfs hdfs dfs -chmod -R o-rwx /wmf/data/event_sanitized" to fix some file permissions as well [11:24:54] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [11:40:00] going to lunch! Will keep working on the perm stuff later on :) [13:33:57] 10Analytics, 10Product-Analytics: Set up a system for team-managed command-line jobs - https://phabricator.wikimedia.org/T271420 (10nshahquinn-wmf) [13:35:38] 10Analytics-Radar, 10Product-Analytics: Set up a system for team-managed command-line jobs - https://phabricator.wikimedia.org/T271420 (10nshahquinn-wmf) [13:36:21] 10Analytics, 10Analytics-Kanban, 10Product-Analytics: Set up environment for Product Analytics system user - https://phabricator.wikimedia.org/T258970 (10nshahquinn-wmf) 05Open→03Resolved This environment is working well; I'm currently using it to run the `wikipediapreview_stats` Oozie job. We still don'... [13:42:56] joal: here I am! 
[13:43:01] Heya :) [13:45:30] elukey: I actually managed to find the stuff by myself [13:45:35] sorry for thep ing :) [13:46:00] super :) [14:04:02] elukey@stat1004:~$ hdfs dfs -D fs.permissions.umask-mode=022 -touchz test4 [14:04:05] elukey@stat1004:~$ hdfs dfs -touchz test5 [14:04:08] elukey@stat1004:~$ hdfs dfs -ls test4 test5 [14:04:10] -rw-r--r-- 3 elukey elukey 0 2021-01-07 14:03 test4 [14:04:13] -rw-r----- 3 elukey elukey 0 2021-01-07 14:03 test5 [14:04:15] joal: --^ [14:04:34] Yes [14:05:02] works very nicely, I am going to add it to the deploy script [14:05:13] awesome elukey [14:05:33] elukey: I'm reviewing special cases for my patch, then will update for them and send for review [14:05:48] <3 [14:21:16] elukey: there are 3 archive paths that use the archive-job but not synced externally - I wonder if we should keep them 640 or use the default 644 that other archive jobs use [14:22:20] hm - I'm gonna make them 640, making explicit the fact that we don't use those folders externally through perms [14:22:50] ok - I'll do that after kids - sorry elukey I've been slower than expected - You'll have the CR later today [14:23:46] (03PS1) 10Elukey: Add a parameter to refinery-deploy-to-hdfs to force umask to 022 [analytics/refinery] - 10https://gerrit.wikimedia.org/r/654833 (https://phabricator.wikimedia.org/T270629) [14:24:10] ack! [14:25:16] (03PS2) 10Elukey: Add a parameter to refinery-deploy-to-hdfs to force umask to 022 [analytics/refinery] - 10https://gerrit.wikimedia.org/r/654833 (https://phabricator.wikimedia.org/T270629) [14:28:13] (03CR) 10Elukey: "Some thoughts:" [analytics/refinery] - 10https://gerrit.wikimedia.org/r/654833 (https://phabricator.wikimedia.org/T270629) (owner: 10Elukey) [14:30:45] (03PS3) 10Elukey: Add a parameter to refinery-deploy-to-hdfs to force umask to 022 [analytics/refinery] - 10https://gerrit.wikimedia.org/r/654833 (https://phabricator.wikimedia.org/T270629) [14:31:20] (03CR) 10Elukey: "I like option 3) more, updated the code review, lemme know :)" [analytics/refinery] - 10https://gerrit.wikimedia.org/r/654833 (https://phabricator.wikimedia.org/T270629) (owner: 10Elukey) [14:33:21] * elukey afk for 10 mins! [14:56:34] o/ [14:57:55] goood morning [14:58:02] 10Analytics: dumps::web::fetches::stats job should use a user to pull from HDFS that exists in Hadoop cluster - https://phabricator.wikimedia.org/T271362 (10Bstorm) Sounds sensible. [14:59:00] elukey thanks for merging that patch [14:59:22] np! Thank you for filing it :) [14:59:24] i wasn't 100% sure that the --conf spark.executor.memoryOverhead=1024 would work exactly as specified [14:59:41] and i didn't feel confident merging it last night without being able to watch it [15:00:22] i'm rerunning the latest failed SearchSatisfaction one now [15:01:08] <3 [15:01:24] I checked the option and it seemed sound [15:26:19] ottomata: there is a horrible chain of patches starting from https://gerrit.wikimedia.org/r/c/operations/puppet/+/654871/1 to remove all the remaining users of the 'researchers' posix group [15:26:50] Moritz is reviewing, if you want to check them please do otherwise I'll merge and nuke the group for good :) [15:36:19] ok I am merging :) [15:40:52] !log deprecate the 'reseachers' posix group for good [15:40:54] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [15:41:45] (03CR) 10Ottomata: [C: 03+1] "I'm fine with any of the 3 options, and 3) is nice so sure!" 
[analytics/refinery] - 10https://gerrit.wikimedia.org/r/654833 (https://phabricator.wikimedia.org/T270629) (owner: 10Elukey) [15:43:01] 10Analytics, 10Product-Analytics: Replace Oozie with better workflow scheduler - https://phabricator.wikimedia.org/T271429 (10nshahquinn-wmf) [15:47:33] 10Analytics, 10Product-Analytics: Replace Oozie with better workflow scheduler - https://phabricator.wikimedia.org/T271429 (10nshahquinn-wmf) [15:47:36] 10Analytics, 10Analytics-Kanban, 10Patch-For-Review: Spike: POC of refine with airflow - https://phabricator.wikimedia.org/T241246 (10nshahquinn-wmf) [15:47:38] 10Analytics, 10User-ArielGlenn: Spike [2019-2020 work] Oozie Replacement. Airflow Study / Argo Study - https://phabricator.wikimedia.org/T217059 (10nshahquinn-wmf) [15:49:37] 10Analytics, 10Product-Analytics, 10Epic: Replace Oozie with better workflow scheduler - https://phabricator.wikimedia.org/T271429 (10nshahquinn-wmf) [15:54:50] 10Analytics-Radar, 10Product-Analytics: Set up a system for team-managed command-line jobs - https://phabricator.wikimedia.org/T271420 (10nshahquinn-wmf) [16:00:24] fdans: (nerd-snipe attempt) https://analytics.wikimedia.org/ shows a weird like at the bottom [16:00:36] "Code for this page can be seen here" [16:08:03] elukey: well we should fix it before the next visitor to analytics.wikimedia.org comes, in about two weeks [16:09:15] elukey: will open a task [16:14:38] fdans: I'll try to fix the bug to show up my html skills [16:21:30] there's something broken with the anaconda-wmf package on an-launcher1002, apt currently proposes a downgrade there [16:24:07] ottomata: --^ [16:31:27] !log remove /etc/mysql/conf.d/research-client.cnf from stat100x nodes [16:31:28] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [16:35:06] 10Analytics-Clusters, 10Analytics-Kanban, 10Patch-For-Review: Deprecate the 'researchers' posix group - https://phabricator.wikimedia.org/T268801 (10elukey) p:05Triage→03Medium [16:35:25] 10Analytics-Clusters, 10Analytics-Kanban: Deprecate the anaytics-users POSIX group - https://phabricator.wikimedia.org/T269150 (10elukey) [16:40:11] (03PS1) 10Joal: Update permissions of oozie jobs writing archives [analytics/refinery] - 10https://gerrit.wikimedia.org/r/654899 (https://phabricator.wikimedia.org/T270629) [16:40:17] elukey, ottomata --^ [16:42:07] (03PS2) 10Joal: Update permissions of oozie jobs writing archives [analytics/refinery] - 10https://gerrit.wikimedia.org/r/654899 (https://phabricator.wikimedia.org/T270629) [16:49:12] moritzm: looking [16:50:56] (03CR) 10Ottomata: "Am fine with this approach, but, would just setting fs.permissions.umask=022 in .properties files be enough?" [analytics/refinery] - 10https://gerrit.wikimedia.org/r/654899 (https://phabricator.wikimedia.org/T270629) (owner: 10Joal) [16:53:44] ah joal we are commenting both on ticket and patch [16:53:49] joal i'm fine with anything [16:53:50] :) [16:53:54] Let's do it here [16:54:30] ottomata: if we add the setting globall yto the job, we'll endup with more than expected files being readable by all [16:54:43] oh hm [16:54:44] right. 
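For context on the trade-off just discussed: a job-wide umask override (what setting it in the .properties file would amount to) relaxes permissions on everything the job writes, while a per-command override or a targeted chmod only touches the intended output. A sketch with illustrative paths, not the actual refinery job definitions:

    # job-wide: every file the launched job creates ends up world-readable
    #   fs.permissions.umask-mode=022   (set in the oozie .properties / job configuration)
    # per-command: only the specific operation is relaxed, e.g.
    hdfs dfs -D fs.permissions.umask-mode=022 -touchz /wmf/data/archive/example/file.gz
    # or fix up just the final archive output afterwards with an explicit chmod
    hdfs dfs -chmod 644 /wmf/data/archive/example/file.gz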
[16:54:54] But, we could probably add the setting to the hive query generating the files to be archived - I can try that [16:55:24] i hink if we can do it with just the umask setting, that is cleaner (doesn't have extra chmod command step) [16:55:32] but, if the chmod is easier and makes more sense, that's fine of course [16:56:26] I told elukey that I'd rather have an explicit oozie step, but after having written the thing, I think I actually don't mind that much - A few jobs need explicit settings NOT to have default in my current approach, probably better to actually have a few jobs having explicit settings to make OTHER accessible [16:56:39] ottomata: Will try [16:57:30] ok! [17:17:19] https://fosdem.org/2021/schedule/event/amd_gpus/ [17:17:20] (03PS1) 10Joal: Update perms of oozie jobs writing public archives [analytics/refinery] - 10https://gerrit.wikimedia.org/r/654904 (https://phabricator.wikimedia.org/T270629) [17:17:46] elukey, ottomata: here is the new patch- I will test a job (both spark and hive) [17:18:02] also elukey: sorry for changing my mind :) [17:18:37] cool abstract elukey :) [17:20:47] elukey, joal heya! I think the new permits scheme is preventing druid ingestion [17:20:56] wowo [17:21:07] mforns: good catch!!! [17:21:13] joal: I know that you prefer ottomata :D [17:21:21] elukey: :-P [17:21:38] mforns: without any alarms? [17:21:42] * elukey sighs [17:21:57] elukey: indexing jobs succeeded - historical can't load data [17:22:06] I'm trying to backfill netflow and I get the following error: org.apache.hadoop.security.AccessControlException: Permission denied: user=druid, access=EXECUTE, inode=\"/tmp/DataFrameToDruid/wmf_netflow/20210107163550\":analytics:hdfs:drwxr-x---\n at org.apache.hadoop.hdfs.server.namenode.DefaultAuthorizationProvider.checkFsPermission(DefaultAuthorizationProvider.java:279) [17:22:12] oh my [17:22:30] elukey: I'm wrong [17:22:42] joal: I lost a year of life :D [17:22:52] elukey: data is available for webrequest_128 up to now [17:23:41] maybe it's just the way I'm executing the backfilling [17:23:44] mforns: where are you running the script from? [17:23:47] it is very weird [17:23:50] an-launcher1002 [17:23:51] it is!!! [17:24:09] elukey: does druid belong to analytics-privatedata? [17:24:13] nope [17:24:20] WEIRDOH! [17:25:18] how is that even possible that druid job can read hive data? [17:25:24] Ah - maybe [17:26:33] nope [17:27:00] I am wondering if it is due to spark running in client mode [17:27:04] webrequest data looks good (no ALL perm) - I don't explain myself how druid can run a successful indexation job [17:27:27] and joal, elukey: I also have related problems with superset: https://superset.wikimedia.org/superset/dashboard/73/ sorry for being the bearer of bad news... :[ [17:28:10] nono thanks for sharing, we are patching holes now.. [17:28:19] we have relied for too much on wrong perms :( [17:29:01] funny elukey: files in refined webrequest folders have +x ??? [17:29:42] that's probably because of hive [17:30:10] there is probably a parent dir dictating the +x or similar [17:33:53] so let's try to organize the work to check [17:35:12] joal: what are you looking at currently? So we can parallelize [17:35:32] elukey: I'm testing my oozie patch [17:35:40] ack perfect [17:35:41] elukey: ok for you? [17:35:44] ack [17:35:48] can I help elukey [17:35:50] ? 
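A quick way to confirm that an AccessControlException like the netflow one above is a group-membership problem rather than a mode problem is to check the inode from the error message and the user's resolved groups (sketch; the path and user come from the error quoted above, run on a host with an HDFS client and a Kerberos ticket):

    hdfs dfs -ls -d /tmp/DataFrameToDruid/wmf_netflow
    hdfs groups druid
    # here the directory was analytics:hdfs with mode drwxr-x---, and druid is neither
    # the owning user nor in the owning group, hence the EXECUTE denial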
[17:35:54] exactly I was about to ask [17:36:21] for some reason druid indexations are now working but only manually, I don't see any errors from the an-launcher1002 point of view [17:36:25] can you double check? [17:36:46] elukey: should I re-run my backfilling? [17:37:04] nono I mean if data has been indexed during the past hours in druid [17:37:30] I see, will check for indexation errors [17:38:08] I just checked the sql lab functionality in superset, the tables owned by analytics-privatedata etc.. work [17:38:21] (03PS11) 10Fdans: Add Active Editors per Country metric to Wikistats [analytics/wikistats2] - 10https://gerrit.wikimedia.org/r/647792 (https://phabricator.wikimedia.org/T188859) [17:42:19] elukey: can you visualize the data quality dashboard in superset? https://superset.wikimedia.org/superset/dashboard/73/ [17:43:13] mforns: nope it fails, but it makes sense given the actual permissions for the data quality stats table [17:43:20] can we jump on bc? [17:45:55] elukey: sure! [17:45:57] omw [17:48:39] 10Analytics-Radar, 10Product-Analytics: Set up a system for team-managed command-line jobs - https://phabricator.wikimedia.org/T271420 (10nshahquinn-wmf) [18:04:41] ottomata, elukey: one reason for which we need an explicit chmod on archive is for folder creation - I'm currently testing the first patch [18:08:35] !log chown -R /tmp/DataFrameToDruid analytics:druid (was: analytics:hdfs) on hdfs to temporarily unblock Hive2Druid jobs [18:08:38] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [18:10:08] !log disable temporarily hdfs-cleaner.timer to prevent /tmp/DataFrameToDruid to be dropped [18:10:09] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [18:21:59] !log "sudo -u hdfs kerberos-run-command hdfs hdfs dfs -chown -R analytics:analytics-privatedata-users /wmf/data/wmf/data_quality_stats" [18:22:00] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [18:22:36] !log chown -R /tmp/analytics analytics:analytics-privatedata-users (tmp dir for data quality stats tables) [18:22:38] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [18:26:59] elukey: I'm fighting with oozie - I hope I'll have results soon [18:27:11] joal: why for folder creation? [18:27:13] oh [18:27:19] because you'd also have to set the umask for that [18:27:28] which maybe is possible but might be annoying to do it for every operation eh? [18:27:52] ottomata: because the archive job moves data, and explicitely creates the folder (if needed) before [18:28:43] joal: I found two quick hacks to solve the outstanding issues, but the long term fix might require more time :( [18:28:53] elukey: about? [18:28:57] aye you couuuuld probably provide -Dfs.permissions.umask to folder creation but that's getting a wee annoying eh? 
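For reference, the "umask per operation" alternative being dismissed here would mean decorating each HDFS operation of the archive step individually, roughly like this (illustrative paths; the first patch being tested takes the explicit-chmod route instead):

    # directory creation honours the per-command umask override
    hdfs dfs -D fs.permissions.umask-mode=022 -mkdir -p /wmf/data/archive/example/2021/01
    # but a move keeps the source file's existing mode, so the files would still need a chmod
    hdfs dfs -mv /wmf/tmp/example-output/part-00000.gz /wmf/data/archive/example/2021/01/
    hdfs dfs -chmod 644 /wmf/data/archive/example/2021/01/part-00000.gz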
[18:28:58] data frame 2 druid and data quality stats create files in /tmp [18:29:08] so if correct perms are applied to the parent dirs, all good [18:29:10] :( [18:29:14] ok [18:29:16] (see my earlier commands) [18:29:19] yup [18:29:36] and we verified on the druid overlord console, a lot of indexations failed but no alarm [18:29:40] * elukey cries in a corner [18:29:53] (03PS3) 10Joal: Update permissions of oozie jobs writing archives [analytics/refinery] - 10https://gerrit.wikimedia.org/r/654899 (https://phabricator.wikimedia.org/T270629) [18:29:58] /o\ [18:30:04] MEH this is sad :( [18:30:15] I am happy because we are finding a lot of weird corner cases [18:30:22] tighter perms are good [18:30:51] ok the above patch has proven almost working for hive + archive [18:31:08] I still have a problem with recursive folders - Will test a trick [18:39:50] trick fails :( [18:39:58] * joal is out of idea :( [18:40:23] * joal goes to the cave, waiting for a nice soul to exchange [18:40:56] joal: we are in the sync with research/PA, will join asap :) [18:41:06] actually joining that one [19:00:03] 10Analytics-Clusters, 10DC-Ops, 10SRE, 10ops-eqiad: (Need By: TBD) rack/setup/install an-worker11[18-41] - https://phabricator.wikimedia.org/T260445 (10Cmjohnson) a:05Cmjohnson→03RobH @robh can you complete the off-site work for an-worker1118-1138. Still needs dhcpd file updated and maybe netboot.cfg.... [19:26:16] 10Analytics, 10SRE, 10ops-eqiad, 10Patch-For-Review: Degraded RAID on an-coord1002 - https://phabricator.wikimedia.org/T270768 (10Cmjohnson) The server is out of warranty but there should be some disks on-site I can use. In the past, anytime /dev/sda goes bad a re-image needs to happen. Let me know if you... [19:27:41] 10Analytics, 10SRE, 10ops-eqiad, 10Patch-For-Review: Degraded RAID on an-coord1002 - https://phabricator.wikimedia.org/T270768 (10elukey) >>! In T270768#6729455, @Cmjohnson wrote: > The server is out of warranty but there should be some disks on-site I can use. In the past, anytime /dev/sda goes bad a re-... [19:47:45] 10Analytics: Separate RSVD anomaly detection into a systemd timer for better alarming with Icinga - https://phabricator.wikimedia.org/T263030 (10mforns) >>The anomalous metric and the corresponding deviation can not be attached to the email. The "alertee" needs to track down the alert (possibly ssh-ing into a st... [19:48:14] * elukey dinner~ [19:53:02] 10Analytics: Separate RSVD anomaly detection into a systemd timer for better alarming with Icinga - https://phabricator.wikimedia.org/T263030 (10mforns) >> So, what if we use the current anomaly file as state? Currently, every time the script detects an anomaly, it writes a file to HDFS. With some changes to the... [20:16:16] elukey: I have not managed to make it work :( [20:16:39] I'm gonna sign off for today - Will continue to work on fixing tomorrow [20:21:31] ouch :( [20:31:35] I have updated https://phabricator.wikimedia.org/T270629 with the current status [20:34:07] 10Analytics, 10Event-Platform, 10Fundraising-Backlog: CentralNoticeBannerHistory and CentralNoticeImpression Event Platform Migration - https://phabricator.wikimedia.org/T271168 (10EYener) Hi all! CCing @AndyRussG - do we in Fundraising need IP data for either CentralNoticeImpressions or CentralNoticeBannerH... [20:34:32] razzi: if you have time can you sync with mforns about the druid indexation jobs failed to re-run? [20:35:15] elukey: sure thing [20:35:45] mforns: when's a good time for you? 
I'm free for the next 20 minutes and again in an hour [20:36:04] with ssh -L 8081:druid1002.eqiad.wmnet:8081 druid1002.eqiad.wmnet you can see on localhost:8081 the current status of the indexations [20:36:33] there is a section called "tasks", with some failures reported [20:36:41] razzi: In an hour is better for me, if that's ok [20:36:45] mforns: sounds good [20:36:46] you can click and expand the list to show more than 20 entries [20:36:56] razzi: ok, let's meet at half past? [20:36:58] so you can have an idea about what failed (and didn't alert) [20:37:02] mforns: thanks a lot! [20:37:17] k, no problemo1 [20:37:33] mforns: yep [20:37:39] k :] [20:38:15] I am going to log off for today, will restart tomorrow morning :) [20:38:33] 10Analytics, 10Event-Platform, 10Fundraising-Backlog: CentralNoticeBannerHistory and CentralNoticeImpression Event Platform Migration - https://phabricator.wikimedia.org/T271168 (10AndyRussG) >>! In T271168#6729693, @EYener wrote: > Hi all! CCing @AndyRussG - do we in Fundraising need IP data for either Cent... [20:45:53] 10Analytics, 10Event-Platform, 10Fundraising-Backlog: CentralNoticeBannerHistory and CentralNoticeImpression Event Platform Migration - https://phabricator.wikimedia.org/T271168 (10AndyRussG) @Ottomata hey also (apologies if this question has already been resolved) is there any action needed on our part for... [20:47:44] (03PS1) 10Lex Nasser: Create and configure Oozie job to load 'Top Articles by Country Pageviews API' data into Cassandra [analytics/refinery] - 10https://gerrit.wikimedia.org/r/654924 (https://phabricator.wikimedia.org/T207171) [20:51:58] (03CR) 10Lex Nasser: "Tested this change up until loading into Cassandra, which failed because the keyspace was not yet created. I will be able to test this cha" [analytics/refinery] - 10https://gerrit.wikimedia.org/r/654924 (https://phabricator.wikimedia.org/T207171) (owner: 10Lex Nasser) [21:01:14] 10Analytics, 10Event-Platform, 10Fundraising-Backlog: CentralNoticeBannerHistory and CentralNoticeImpression Event Platform Migration - https://phabricator.wikimedia.org/T271168 (10Ottomata) Oh...that is a bit of a problem then. Can we change that so that it logs using the EventLogging client JS? [21:05:53] 10Analytics: Password for Kerberos - https://phabricator.wikimedia.org/T271467 (10kostajh) [21:32:07] razzi: I'm back, ping me whenever! [21:32:31] mforns: cool, give me 2 mins [21:32:36] sure
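The ssh tunnel mentioned above also makes the Druid overlord API reachable, which is handy for pulling the list of failed indexation tasks without clicking through the console (sketch; /druid/indexer/v1/completeTasks is the standard overlord task endpoint, though field names can vary a bit between Druid versions):

    ssh -N -L 8081:druid1002.eqiad.wmnet:8081 druid1002.eqiad.wmnet &
    curl -s http://localhost:8081/druid/indexer/v1/completeTasks \
      | jq -r '.[] | [.statusCode, .createdTime, .id] | @tsv' | grep FAILED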