[00:36:34] Analytics-Clusters: Convert labsdb1012 from multi-source to multi-instance - https://phabricator.wikimedia.org/T269211 (Bstorm) Ok, I can confirm that you don't want to use that role directly. The reason is that is what I'm using for building the proxy config here https://gerrit.wikimedia.org/r/c/operations/...
[00:39:33] PROBLEM - Check the last execution of drop-el-unsanitized-events on an-launcher1002 is CRITICAL: CRITICAL: Status of the systemd unit drop-el-unsanitized-events https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[09:03:46] https://databricks.com/session_eu20/project-zen-improving-apache-spark-for-python-users
[09:08:05] Thanks elukey --^ :)
[09:11:15] Analytics, Product-Analytics, Epic: Readership Retention: New vs. Returning Unique devices - https://phabricator.wikimedia.org/T269815 (JAllemandou) Adding a wikitech page @Nuria wrote 4 years ago (a few steps ahead :): https://wikitech.wikimedia.org/wiki/Analytics/Data_Lake/Traffic/UserRetention
[09:13:03] (CR) Joal: [V: +2 C: +2] "Merging!" [analytics/refinery] - https://gerrit.wikimedia.org/r/654502 (https://phabricator.wikimedia.org/T271260) (owner: Gerrit maintenance bot)
[11:24:55] elukey: hey, for when you have time https://gerrit.wikimedia.org/r/c/operations/puppet/+/651640 and https://gerrit.wikimedia.org/r/c/operations/puppet/+/654521
[11:30:31] Amir1: elukey is out today (public holiday in .it)
[11:31:18] klausman: oh I forgot, yeah. You can take a look if you want to too :D It's noop and straightforward
[11:31:51] Havin' a look-see
[11:32:39] Thanks!
[11:33:06] Both +2's
[11:33:13] s/'s/'d/
[11:34:46] I assume it needs submit too? I'd never learn how puppet git repo works
[11:35:12] https://integration.wikimedia.org/zuul/
[11:35:19] here nothing is being pushed to jenkins
[11:36:34] I can do the puppet merge for you
[11:36:59] But you have to submit the code, yes
[11:37:15] Top right should have a submit button on Gerrit
[11:38:00] Might need a reload of the page
[11:59:26] klausman: I don't have +2 to submit it, it's disabled for me
[11:59:28] https://usercontent.irccloud-cdn.com/file/P6kb7aQT/image.png
[12:00:51] https://gerrit.wikimedia.org/r/c/operations/puppet/+/650993/1#message-0117bf277bbd5ebaa6fac2a4e50fff71a41a3096 Luca usually submits my patches too
[12:20:03] I can do that, then
[12:20:42] submitted and doing a puppet merge now
[12:20:49] and done.
[12:27:59] Thank you so much
[12:28:55] 221 hiera() left
[12:29:50] (CR) WMDE-Fisch: [C: +2] Add new action and user fields [schemas/event/secondary] - https://gerrit.wikimedia.org/r/649599 (https://phabricator.wikimedia.org/T262209) (owner: Awight)
[12:44:20] (CR) ToprakM: [C: +1] "Looks OK" [analytics/refinery] - https://gerrit.wikimedia.org/r/654502 (https://phabricator.wikimedia.org/T271260) (owner: Gerrit maintenance bot)
[14:06:50] RECOVERY - Check the last execution of drop-el-unsanitized-events on an-launcher1002 is OK: OK: Status of the systemd unit drop-el-unsanitized-events https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[14:49:20] hello! razzi FYI i'm going to run the chmod and chown commands in https://phabricator.wikimedia.org/T270629
[14:49:43] hmm, actually wait, i guess we should apply the puppet patch first
[14:49:47] so that new data doesn't come in wonky
[14:52:03] ottomata: o/ please check the commands that I wrote in the task, those are more meant as examples rather than the final list..
There are surely more to do :(
[14:52:27] (check in the sense verify that they make sense :)
[14:52:45] elukey yeah i did, just added comment about what to do for archive
[15:00:19] perfect just seen those, thanks!
[15:12:04] Analytics, Event-Platform: DesktopWebUIActionsTracking Event Platform Migration - https://phabricator.wikimedia.org/T271164 (MNeisler) @Ottomata Yes, you can remove client IP and geocoded data. as part of this migration.
[15:25:11] ottomata: hey there, ready to deploy puppet patch for hdfs permissions
[15:27:19] hello!
[15:27:22] ok let's go razzi !
[15:28:24] brb then bc
[15:29:25] cool
[15:30:27] I won't be in the meetings today since I have a ton of interview feedback to write.
[15:31:05] Aside from "interviews and their feedback", my standup story for today is: "Hating partman with incandescent rage"
[15:40:11] elukey: how do you stop all timers on an-launcher1002?
[15:40:21] list-timers | awk | xargs?
[15:53:36] !log stopping analytics systemd timers on an-launcher1002
[15:53:39] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[16:07:05] (PS9) Milimetric: Add Active Editors per Country metric to Wikistats [analytics/wikistats2] - https://gerrit.wikimedia.org/r/647792 (https://phabricator.wikimedia.org/T188859) (owner: Fdans)
[16:07:07] (PS10) Milimetric: Wikistats testing framework: Replace Karma with Jest [analytics/wikistats2] - https://gerrit.wikimedia.org/r/648376 (owner: Fdans)
[16:07:09] (PS8) Milimetric: Upgrade Webpack from 2 to 5 [analytics/wikistats2] - https://gerrit.wikimedia.org/r/649311 (https://phabricator.wikimedia.org/T188759) (owner: Fdans)
[16:08:09] !log manually failover hdfs haadmin from an-master1002 to an-master1001
[16:08:11] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[16:26:40] razzi: how goes? just realized we have a PA sync in 3 mins
[16:28:57] ottomata: I'm reading logs for failed refine job and something unrelated poped up: 776649 WARN lines with "com.maxmind.geoip2.exception.AddressNotFoundException" (mostly for address 127.0.0.1)
[16:29:05] ottomata: Oh, and Hi :)
[16:29:21] joal: i think that is normal and annoying
[16:29:29] :)
[16:29:53] ottomata: good, it paused at a prompt so I'll wait for after PA sync
[16:30:09] ok i have to run an errand after standup
[16:30:19] might be good to try and get this done asap
[16:30:23] esp since jobs are paused
[16:30:26] ok yeah
[16:30:27] ottomata: should we update the UDF to not try to lookup this address? (and maybe some others?)
[16:30:35] joal: that would be good ya
[16:30:41] ok - new task
[16:31:06] razzi: since we are mostly waiting, can we do this async in background while we go to PA sync?
[16:31:09] ottomata: it printed
[16:31:09] ```
[16:31:09] Master status for HDFS:
[16:31:09] ----- OUTPUT of 'kerberos-run-com...1001-eqiad-wmnet' -----
[16:31:09] Picked up JAVA_TOOL_OPTIONS: -Dfile.encoding=UTF-8
[16:31:09] standby
[16:31:09] ```
[16:31:10] which seems expected, since the next cookbook steps are to manually failover
[16:31:15] I think so
[16:31:20] great
[16:31:28] proceed
[16:33:15] Analytics: Update geocode UDF to NOT lookup some addresses - https://phabricator.wikimedia.org/T271340 (JAllemandou)
[16:33:25] ottomata: --^
[16:34:11] Analytics: Update geocode UDF to NOT lookup some addresses - https://phabricator.wikimedia.org/T271340 (JAllemandou)
[16:36:25] razzi: when that is done proceed with the nodemanager restarts too ya?
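For context on the failover step logged at 16:08 above, here is a minimal sketch of how such a manual HDFS HA failover is typically issued; the service IDs and the kerberos-run-command wrapper are assumptions pieced together from this log, not the exact commands the cookbook runs.
```
# Assumed NameNode service IDs; in practice the cookbook drives this step.
sudo -u hdfs kerberos-run-command hdfs \
  hdfs haadmin -failover an-master1002-eqiad-wmnet an-master1001-eqiad-wmnet
# Confirm which NameNode is now active:
sudo -u hdfs kerberos-run-command hdfs \
  hdfs haadmin -getServiceState an-master1001-eqiad-wmnet
```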
[16:36:53] ottomata: sg
[16:38:12] ottomata: the last failure mentions memory errors
[16:39:11] huh interesting
[16:40:06] spark_executor_memory => '4G',
[16:40:23] gonna bump to 6G for that job
[16:40:39] ottomata: how many executors?
[16:40:46] hm
[16:40:50] dynamic
[16:40:54] right, masx?
[16:41:01] ah 64 is default
[16:41:07] you think that is better to bump?
[16:41:21] not really, I'm trying to get a feel of global resources usage
[16:41:28] ottomata: CPU per exec?
[16:41:34] sorry, I should read the specs :)
[16:41:39] will do that
[16:42:01] max cores unset, so whatever is default
[16:42:08] 1 is default
[16:44:26] joal how about i change default refine max exectuors to 128, and bump refine_event job exec mem to 6G?
[16:45:08] !log restart yarn nodemanagers
[16:45:10] hm, I wonder ottomata - IIRC we have a layer of parallelization for inner spark-jobs
[16:45:12] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[16:45:21] ottomata: How many jobs in parallel for those?
[16:45:25] yes but only per table
[16:45:29] Do we parallelize based on size?
[16:45:32] each table is sequential
[16:45:38] Ah ok
[16:45:50] so only one e.g. mediawiki_api_request hourlry refine at a time
[16:46:02] And we parallelize refining hours?
[16:46:15] I don't get what we parallelize
[16:47:07] Default: the number of local CPUs
[16:47:08] so
[16:47:36] lets say there are 10 tables each with 2 hours that need refinement
[16:47:39] that'd be 20 different refine jobs
[16:47:49] if parallelism >= 10
[16:47:58] there will be 10 running in parallel, one for each table
[16:48:09] but e.g. mediawiki_api_request hour 9 will run before mediawiki_api_request hour 10 will run
[16:48:11] Right
[16:48:22] Thanks for explanation
[16:49:49] just checked a hadoop worker an-worker1099, it shows 72 cores
[16:49:57] so that could be 72 parallel refine jobs at once
[16:50:04] maybe we should set a max parallism
[16:50:15] Let's try with your settings ottomata - I also think we should add "--conf spark.executor.memoryOverhead=1024"
[16:50:23] ok
[16:50:52] ottomata: That would be a good idea (reducing parallelism), but we would probably be willing to do that based on job size
[16:51:08] yeah hard to predict that though
[16:51:10] we'd have to specificy manually
[16:51:20] ottomata: not really, we can get HDFS data-size
[16:51:26] hm
[16:51:40] ottomata: With that data, we would also be better at partitioning
[16:51:49] oh?
[16:52:05] ottomata: for instance https://github.com/wikimedia/analytics-refinery-source/blob/master/refinery-job/src/main/scala/org/wikimedia/analytics/refinery/job/mediawikihistory/MediawikiXMLDumpsConverter.scala#L175
[16:52:19] (you mean spark partitioning, right?)
[16:52:24] ottomata: correct
[16:55:19] ottomata: looks like the last step is to restart systemd timers and oozie jobs; can you do the timers restart and I'll unpause the jobs on the yarn ui?
[16:56:34] yes
[16:56:40] razzi: w ready for that?
[16:57:00] joal...could this be failing now because we added the .cache() ?
[16:57:35] ottomata: I'm ready
[16:57:47] ok i'll start timers now then
[16:57:52] hmm acutally wait
[16:57:56] lets just test the umask!
[16:58:03] how so?
[16:58:14] i thnk i can just create a file...
[16:58:33] we also need to chmod the dirs first right...
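A hedged sketch of the Refine tuning being discussed above (larger executors, an explicit memory overhead, and a cap on dynamic executors). Only the --conf flags reflect the conversation; the wrapper, class name, jar path and remaining job arguments are placeholders, since the real invocation is generated by puppet.
```
spark2-submit \
  --master yarn \
  --conf spark.dynamicAllocation.maxExecutors=128 \
  --conf spark.executor.memory=6g \
  --conf spark.executor.memoryOverhead=1024 \
  --class org.wikimedia.analytics.refinery.job.refine.Refine \
  /path/to/refinery-job.jar
  # ...per-job Refine options (input/output paths, table whitelist, etc.) omitted
```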
[16:58:43] ottomata: it shouldn't :(
[16:58:58] ottomata: ah yes right
[16:59:15] hdfs dfs -ls /tmp/a0
[16:59:15] Found 1 items
[16:59:15] -rw-r----- 3 otto hdfs 0 2021-01-06 16:58 /tmp/a0/f1
[16:59:17] looks good
[16:59:45] ottomata: caching should lead to data falling on disk if it doesn't fit in memory, not fail :(
[16:59:47] razzi: i'm going to chmod and chgrp files now, ok?
[16:59:53] ottomata: go for it
[17:00:11] ottomata: Can you please let me know when you have changed the seetings and relaunched the job so that I monitor it?
[17:00:21] joal i won't get it done til later this afternoon
[17:00:27] ack ottomata
[17:00:36] the jobs are all paused righ tnow anyway
[17:09:32] (PS10) Milimetric: Add Active Editors per Country metric to Wikistats [analytics/wikistats2] - https://gerrit.wikimedia.org/r/647792 (https://phabricator.wikimedia.org/T188859) (owner: Fdans)
[17:09:34] (PS11) Milimetric: Wikistats testing framework: Replace Karma with Jest [analytics/wikistats2] - https://gerrit.wikimedia.org/r/648376 (owner: Fdans)
[17:09:36] (PS9) Milimetric: Upgrade Webpack from 2 to 5 [analytics/wikistats2] - https://gerrit.wikimedia.org/r/649311 (https://phabricator.wikimedia.org/T188759) (owner: Fdans)
[17:16:07] razzi here is where I am
[17:16:08] https://gist.github.com/ottomata/4d6a488256a32df2d4d3fa374b89fdc5
[17:16:19] i'm running the first two
[17:16:23] but they aren't 'done'
[17:16:25] it takes a while i guess
[17:16:35] also my internet is not being friendly to ssh sessions
[17:17:27] oh the chown,chmod commands should be run from e.g. an-master1001.e
[17:23:30] joal: i will rerun those failed refine jobs later this afternoon
[17:24:04] ack ottomata
[17:54:15] PROBLEM - Check the last execution of analytics-dumps-fetch-pageview on labstore1006 is CRITICAL: CRITICAL: Status of the systemd unit analytics-dumps-fetch-pageview https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[18:00:41] PROBLEM - Check the last execution of analytics-dumps-fetch-pageview on labstore1007 is CRITICAL: CRITICAL: Status of the systemd unit analytics-dumps-fetch-pageview https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[18:11:12] Niiiice :) https://engineering.linkedin.com/blog/2021/fastingest-low-latency-gobblin
[18:29:32] ottomata: systemctl stop refine-*.timer does the trick for all refines for example :)
[18:29:49] just don't do systemctl stop *.timer since it will stop also the system ones
[18:30:20] razzi: qq - I see that you did a manual failover for hdfs, was it needed after the cookbook?
[18:30:29] in theory it should take care of everying
[18:30:33] *everything
[18:30:39] failover restart failback
[18:31:17] (no manual steps are needed)
[18:32:48] ottomata: I'll update https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers#Disable_a_Systemd_timer
[18:33:14] in theory if we get to a common namespace/prefix, like analytics-refine-etc..
we could stop all timers in one go with
[18:33:19] systemctl stop analytics-*
[18:33:21] err
[18:33:25] systemctl stop analytics-*.timer
[18:33:56] elukey: while you're here - Open that page I linked above (for tomorrow :)
[18:34:19] already done :)
[18:34:23] <3
[18:34:24] :)
[18:35:05] elukey: that's a good point; in the later step I realized it does its own failback and let it do its thing, but did intervene the first time around
[18:35:17] ottomata, razzi - please also check analytics-dumps-fetch-pageview on labstore100[6,7].wikimedia.org, timers failed due to permissions :)
[18:35:54] razzi: yes it basically does everything for you, if it misses some docs or logging that can help feel free to send a code review :)
[18:36:02] you are the first one using it other than me
[18:36:30] need to run to dinner, ttl!
[18:54:36] elukey: the first manual failover we had to do
[18:54:46] because an-master1002 was master namenode
[18:55:00] and the cookbook asked us to verify that the normal setup (with an-master1001 as master) was correct
[18:55:16] oh intersting ok
[18:55:19] will check labstore shortly
[18:55:20] hmmm
[18:55:22] right
[18:55:22] hm
[18:57:38] ok that's public data nyway
[18:57:41] i'm chmoding that one back
[18:57:49] hdfs:///wmf/data/archive/{pageview,projectview}
[18:58:12] ottomata: chmoding won't be enough for new data, right?
[18:58:23] hm
[18:58:26] true.
[18:58:52] ottomata: we need to pull data from a user having reading rights on the cluster
[18:59:17] ottomata: and we can use the chmod option of hdfs-rsync to make data readable from all on labstore if needed
[18:59:28] it looks like that runs as root?
[18:59:30] on labstore?
[18:59:31] yeah
[18:59:31] hm
[18:59:35] MEH?
[18:59:42] * joal continues to parse queries
[18:59:45] have meetingg
[18:59:47] will be back
[18:59:51] and look later
[19:00:30] no
[19:00:31] 'dumpsgen'
[19:00:32] User dumpsgen executes as user dumpsgen the command
[19:22:43] Does anybody know why I wouldn't be able to ssh into labstore1006.eqiad.wmnet?
[19:24:17] I get `channel 0: open failed: administratively prohibited: open failed`
[19:50:54] Analytics, Analytics-EventLogging, Analytics-Kanban, Event-Platform, and 4 others: Migrate legacy metawiki schemas to Event Platform - https://phabricator.wikimedia.org/T259163 (ppelberg)
[19:59:51] ottomata: the chown and chmod for /wmf/data/event are still going, after a couple hours... how long would it be alright to pause oozie jobs and timers?
[20:03:52] i think oozie jobs can start
[20:03:59] razzi: sorry just got done with meeting
[20:04:08] np
[20:04:10] yeah event has a lot of dirs and files
[20:04:20] razzi: labstore1006.wikimedia.org
[20:04:22] ^^ :)
[20:04:32] razzi: ya i say go ahead and start oozie
[20:04:42] is /wmf/data/event the only remaining one?
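Building on the systemctl hints above (18:29-18:33) and the timer restarts that follow, a small sketch of stopping and starting a family of timers by glob; the refine_* pattern is only an example and would need to match the actual unit names on an-launcher1002.
```
# List matching timers first, then stop them; never glob on plain *.timer,
# since that would also stop unrelated system timers.
systemctl list-timers 'refine_*' --all
sudo systemctl stop 'refine_*.timer'
# ...maintenance window...
sudo systemctl start 'refine_*.timer'
```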
[20:04:51] if so, we can just start all systemd timers except for the refine ones
[20:05:21] haven't started the others; thought I'd wait for /wmf/data/event
[20:05:48] they can go in parallel
[20:05:50] it won't hurt
[20:06:06] hm
[20:06:24] raw doesn't really matter that much i think
[20:06:32] i'd really like to start camus back up
[20:06:36] the longer we wait the harder it will have to work
[20:08:10] alright, I'll start the other chowns/chmods in parallel
[20:08:55] ok great
[20:09:00] impressively, /wmf/data/wmf is 10x larger than /wmf/data/event, so I'm expecting that'll take a few days
[20:09:10] its more the number of files that makes it take so long
[20:09:11] than the size
[20:09:18] but, yeah
[20:09:20] yeah true
[20:09:29] buuut yeah actually, it might be more files
[20:09:33] welll, hard to say
[20:09:39] it has more than event does in it
[20:09:46] but it does have fewer days of data haning around
[20:09:48] (60 I think?)
[20:10:00] ok, i think we can't really wait days...hm
[20:10:03] i'm going to start camus
[20:10:08] i don't expect there to be a real problem
[20:12:24] Gone for tonight team - see you tomorrow :)
[20:14:01] !log re-starting camus systemd timers
[20:14:05] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[20:14:28] razzi: i'm going to start other timers that I think should lbe fine
[20:14:31] reportupdater ones should be ok
[20:14:47] eventlogging to druid ones too
[20:14:50] ack
[20:15:11] the drop data ones should be ggood too'
[20:18:27] !log restarted reportupdater timers
[20:18:30] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[20:19:42] !log restarted drop systemd timers
[20:19:43] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[20:19:55] !log restarted eventlogging_to_druid timers
[20:19:57] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[20:29:06] Analytics: dumps::web::fetches::stats job should use a user to pull from HDFS that exists in Hadoop cluster - https://phabricator.wikimedia.org/T271362 (Ottomata)
[20:29:14] Analytics: dumps::web::fetches::stats job should use a user to pull from HDFS that exists in Hadoop cluster - https://phabricator.wikimedia.org/T271362 (Ottomata) p:Triage→High
[20:40:33] razzi: i'm going to start the rest of the timers too
[20:40:41] i can't see how this chown would hurt this
[20:41:00] its only chowning and moding existant files to the perms that will be set when new files are created
[20:42:34] !log starting remaining refine systemd timers
[20:42:35] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[21:04:56] ottomata: ahhh right, we don't have an alarm for what namenode is primary (but only that at least one is active), I am curious to know if an-master1002 was active because 1001 died in some way or because an old maintenance
[21:05:19] we may want to add an alarm if an-master1001 is not hdfs primary for say a day
[21:05:22] wdyt?
[21:05:37] 10Analytics, 10Data-release, 10Privacy Engineering, 10Research, 10Privacy: Evaluate a differentially private solution to release wikipedia's project-title-country data - https://phabricator.wikimedia.org/T267283 (Isaac) > This looks awesome @Isaac! Can't wait to try it out. Thanks! If there's anything...
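One possible shape for the "alert if an-master1001 is not the active NameNode" idea floated just above: a sketch of the underlying check only, where the service ID and the kerberos-run-command wrapper are assumptions taken from this log, and the real Icinga check would of course be defined in puppet.
```
# Exit non-zero (CRITICAL) if the expected primary NameNode is not active.
state=$(sudo -u hdfs kerberos-run-command hdfs \
  hdfs haadmin -getServiceState an-master1001-eqiad-wmnet 2>/dev/null | tail -n 1)
if [ "$state" != "active" ]; then
  echo "CRITICAL: an-master1001 NameNode state is '${state:-unknown}' (expected active)"
  exit 2
fi
echo "OK: an-master1001 NameNode is active"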
[21:07:40] https://phabricator.wikimedia.org/T271362 is also interesting
[21:08:05] so dumpsgen is only in kerberos, it is surely my bad :(
[21:08:30] elukey: aye alarm if not master sounds good
[21:08:31] hmmm
[21:08:43] razzi: elukey, earlier, hdfs dfs -touchz f1
[21:08:49] resulted in a 640 file
[21:08:50] but now
[21:08:53] 644
[21:08:54] .
[21:08:56] if i do
[21:09:02] hdfs dfs -Dfs.permissions.umask-mode=027 -touchz f8
[21:09:04] i get the 640
[21:09:21] its like...it was working, but now its not?!
[21:09:51] OHHH
[21:09:52] wait
[21:09:56] maybe this is beacuse i'm doing it from an-launcher1002
[21:10:03] where the pppet has been disabled
[21:10:05] yeah
[21:10:13] the client picks up the old settings
[21:10:18] in theory from stat100x should work
[21:10:24] elukey: ...why did we need a namenode restart then?
[21:10:49] ok phew, it works right from stat008
[21:12:19] ottomata: the namenodes should be in sync with all the settings, I thought about some weird corner case of file creation etc.. It doesn't hurt to have it applied in there, just to avoid weird debugging sessions later on:)
[21:13:20] IIRC I had to roll restart also the Yarn node managers to force them to pick up the new settings (since the can act as clients in some cases)
[21:14:23] aye
[21:14:40] does dumpsgen need to be inside analytics-privatedata-users? It may be a quick fix but that user will have access to other PII data
[21:15:03] i think that's ok
[21:15:06] that would work
[21:15:07] or
[21:15:14] could we use the analytics-privatedata user on labstore?
[21:15:16] would that be better?
[21:15:45] or hm elukey
[21:15:55] we couldl override whatever jobs create those public files
[21:16:01] with a different umask
[21:16:19] IIRC one of the use cases that I discussed with Moritz at the time was to be mindful of the fact that labstores are reachable from the outside internet, so say one exploits the node for any reason it can also get access to more data on hadoop
[21:16:26] so the less the better
[21:16:31] aye
[21:16:49] ah yes we can make those public for sure
[21:17:54] I'll figure something out tomorrow with Joseph if you don't get to it today, you and razzi already did a ton of work, thanks a lot!
[21:20:53] elukey: ok great, thank you, i'll let this sit for you too, i'm not quite sure what the right thing to do is
[21:20:56] two*
[21:21:09] razzi, refine and camus look ok from here
[21:21:17] i think once those chmods are done we can call this done.
[21:22:01] very very very nice
[21:22:08] i sent email to analytlyics-announce
[21:22:31] yep saw it, hopefully no complains during the next days :)
[21:22:44] razzi: nice work :)
[21:22:53] going afk, ttl!
[21:28:40] Analytics: Separate RSVD anomaly detection into a systemd timer for better alarming with Icinga - https://phabricator.wikimedia.org/T263030 (ssingh) > @ssingh @elukey > > I've been looking into this a bit and have had some second thoughts. Thanks for summarizing this, @mforns! > Current approach: > - Oozi...
[23:20:08] Hi team, for the record, got very distracted by news; glad to see elukey and ottomata stayed on top of the hadoop permissions change. The various `/wmf/data/` chowns and chmods are still running
[23:20:19] Signing off for the day!
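For reference, the kind of umask check run at 21:08-21:10 above, written out as a repeatable snippet; /tmp/umask-test is just a scratch path, and the expected mode assumes the fs.permissions.umask-mode=027 default discussed in this log.
```
# With umask 027 in effect, new files should come out as -rw-r----- (640).
hdfs dfs -mkdir -p /tmp/umask-test
hdfs dfs -touchz /tmp/umask-test/f1
hdfs dfs -ls /tmp/umask-test
# On a client whose config has not been updated yet, force the umask explicitly:
hdfs dfs -Dfs.permissions.umask-mode=027 -touchz /tmp/umask-test/f2
```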