[02:30:00] 10Analytics, 10Analytics-EventLogging, 10Analytics-Kanban, 10MW-1.32-notes (WMF-deploy-2018-10-16 (1.32.0-wmf.26)), and 3 others: Spin out a tiny EventLogging RL module for lightweight logging - https://phabricator.wikimedia.org/T187207 (10Krinkle) 05Open→03Resolved @Milimetric My test is with (a simpl...
[06:55:29] PROBLEM - Check if the Hadoop HDFS Fuse mountpoint is readable on notebook1003 is CRITICAL: connect to address 10.64.21.109 port 5666: Connection refused
[06:55:39] good morning!
[06:55:42] -.-
[06:58:48] dsaez: o/
[06:59:03] hi elukey
[06:59:09] here is a python3 script a bit too hungry of memory on notebook1003
[06:59:15] *there
[06:59:21] yes, sorry I'll clean now
[06:59:45] thank youuuu
[07:17:44] 10Analytics, 10Analytics-Kanban, 10DBA, 10User-Elukey: Review dbstore1002's non-wiki databases and decide which ones needs to be migrated to the new multi instance setup - https://phabricator.wikimedia.org/T212487 (10Marostegui) @elukey once you've transferred all the files to the definite location, this t...
[07:18:22] 10Analytics, 10Analytics-Kanban, 10DBA, 10User-Elukey: Review dbstore1002's non-wiki databases and decide which ones needs to be migrated to the new multi instance setup - https://phabricator.wikimedia.org/T212487 (10Marostegui)
[08:43:36] joal: o/
[08:43:38] bonjour
[08:43:59] when you are online I'd need to copy some dbstore1002 files to hdfs
[08:44:30] I thought about /wmf/data/archive/backup/misc/dbstore1002_backup
[08:44:47] (essentially some .sql files with database dumps)
[08:44:53] what do you think?
[08:45:27] (we are backing up some oooold databases that are not going to be migrated to the new dbstore env)
[08:54:08] in the meantime, I merged the change to migrated the reportupdater hdfs jobs to timers and of course I made a mistake
[08:54:11] fixing
[09:19:54] so the interlanguage job kept failing even before the timer
[09:21:47] that is basically https://phabricator.wikimedia.org/T213219
[09:34:01] !log removed ./jobs/limn-language-data/interlanguage/.reportupdater.pid in /srv/reportupdater on stat1007 to force the first run of the timer
[09:34:02] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[09:34:26] so empty pid file
[09:35:30] will ask to milimetric moar info, I am seeing that the task to alert of empty pid file has been declined
[09:38:21] buuut it looks good :)
[09:38:32] today we should be able to move report update to timers \o/
[09:38:45] then we have spark jobs
[09:43:00] going afk for a bit, cat to the vet :)
[10:28:26] helloo elukey just got in, is there anything I should do about these alerts? :)
[10:30:58] fdans: hola! Which ones?
[10:31:13] the ones from today?
[10:31:33] notebook1003 is ok, the other would need to be checke
[10:31:35] *checked
[10:31:48] the other is the dataloss one?
[10:32:44] yes exactly
[10:32:51] there are only two right?
[10:57:11] RECOVERY - Check if the Hadoop HDFS Fuse mountpoint is readable on notebook1003 is OK: OK
[12:26:32] Hi elukey - Sorry late start for me today -
[12:27:24] elukey: copying files to HDFS is completely fine for me :)
[12:27:36] And the path looks ok
[12:28:38] hello joal!
[12:28:46] I have been busy merging the admin puppet change
[12:29:03] if you give me the green light I'd remove from test+production the unnecessary ssh keys from the Hadoop masters
[12:29:06] (finally)
[12:29:28] hm - ssh keys from gone people?
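(Editor's note on the pid-file removal logged above: reportupdater uses a .reportupdater.pid file as a run guard, and a leftover file — here an empty one — blocks the next run, which is why deleting it was needed to let the first timer run go through. Below is a minimal, hedged sketch of the kind of manual check that makes such a deletion safe; the path comes from the !log entry, but the "is the recorded pid still alive" logic is an assumption about how the guard behaves, not reportupdater's actual code.)

    # Hypothetical manual check of a leftover reportupdater pid file (path from the log above)
    PIDFILE=/srv/reportupdater/jobs/limn-language-data/interlanguage/.reportupdater.pid

    if [ -s "$PIDFILE" ] && kill -0 "$(cat "$PIDFILE")" 2>/dev/null; then
        echo "reportupdater still running as pid $(cat "$PIDFILE"); leaving the file alone"
    else
        # Empty or stale pid file: remove it so the next (timer-driven) run can start
        sudo rm -v "$PIDFILE"
    fi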
[12:31:12] 10Analytics, 10Analytics-Kanban, 10Operations, 10Patch-For-Review: Allow the deployment of users without SSH access - https://phabricator.wikimedia.org/T212949 (10elukey) [12:31:37] joal: no no, ssh keys from people that are not listed as admns [12:31:39] *admins [12:31:57] Ah ok [12:32:08] so we'll still map users to hdfs [12:32:13] but they will not be able to ssh [12:32:18] to the hadoop masters [12:32:22] ok [12:32:34] going to test it on the testing cluster now [12:32:48] this is the groundwork to then allow us to deploy users on all the workers [12:32:57] to use linux containers with yarn [12:33:15] Makes sense elukey [12:33:20] +1 ! [12:33:35] super, proceeding :) [12:41:06] joal: worked like a charm on testing [12:48:20] also done in prod: ) [12:48:22] gooood [12:48:25] now lunch :) [12:48:51] Buon apetitio elukey :) [12:48:56] oops -i [13:02:43] (03CR) 10Joal: "Two comments inline - I confirm the newly computed snapshot doesn't have the revision-duplication issue." (032 comments) [analytics/refinery/source] - 10https://gerrit.wikimedia.org/r/480796 (https://phabricator.wikimedia.org/T210543) (owner: 10Joal) [13:53:25] fdans: I'm currently on data quality - Do you want to discuss in da cave? [13:53:43] joal: yesss [13:55:09] all the report updater jobs have been migrated to timers \o/ [13:55:21] * joal clap clap clap ! [14:02:35] now last steps are the spark jobs [14:18:42] YAY, puppet provided users with no ssh keys :D [14:19:07] \o/ [14:21:30] joal: if you have time [14:21:33] sudo -u hdfs -copyFromLocal dbstore1002_backup /wmf/data/archive/backup/misc/ [14:21:38] from my home dir on stat1007 [14:21:42] does it look reasonable? [14:22:45] elukey: sudo -u hdfs hdfs dfs -copyFromLocal dbstore1002_backup /wmf/data/archive/backup/misc/ [14:22:54] elukey: sounds good :) [14:23:04] elukey: parent folder exists, so good [14:23:20] ah snap dfs yes [14:23:21] thanks :) [14:23:39] elukey: hdfs dfs actually [14:23:43] yep yep [14:23:58] for some reason my brain types it correctly in some cases [14:24:01] and forgets in others [14:24:09] :) [14:24:10] LRU cache buggy [14:24:28] I usually forget the "hdfs" bit when sudoing as [14:24:40] As if the username was the command [14:25:58] worked! [14:26:05] great :) [14:28:01] 10Analytics, 10Analytics-Kanban, 10DBA, 10User-Elukey: Review dbstore1002's non-wiki databases and decide which ones needs to be migrated to the new multi instance setup - https://phabricator.wikimedia.org/T212487 (10elukey) Done! ` elukey@stat1007:/srv/home/elukey$ sudo -u hdfs hdfs dfs -ls /wmf/data/arc... [14:28:34] 10Analytics, 10Analytics-Kanban, 10DBA, 10User-Elukey: Review dbstore1002's non-wiki databases and decide which ones needs to be migrated to the new multi instance setup - https://phabricator.wikimedia.org/T212487 (10elukey) [14:36:01] (03CR) 10Nuria: [V: 03+2 C: 03+2] Add VisualEditorFeatureUse schema to EventLogging whitelist [analytics/refinery] - 10https://gerrit.wikimedia.org/r/485354 (https://phabricator.wikimedia.org/T212588) (owner: 10Neil P. Quinn-WMF) [14:43:01] * elukey afk for a bit [14:49:08] 10Analytics, 10Analytics-Wikistats: Wikistats New Feature - DB size - https://phabricator.wikimedia.org/T212763 (10Nuria) @TheSandDoctor all files are compressed and thus much smaller than what you get once you uncompress them and tried to use them. Please take a look at: https://meta.wikimedia.org/wiki/Data... 
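(Editor's note: the backup copy discussed above boils down to a handful of `hdfs dfs` subcommands run as the hdfs superuser — the first attempt just dropped the `hdfs dfs` part. A sketch of the full sequence, assuming the target directory might still need creating; the paths are the ones from the log, the rest is a generic illustration.)

    # Copy the dbstore1002 dumps from the local home dir on stat1007 into HDFS.
    # In the log the parent directory already existed, so -mkdir -p is only a safety net.
    sudo -u hdfs hdfs dfs -mkdir -p /wmf/data/archive/backup/misc
    sudo -u hdfs hdfs dfs -copyFromLocal dbstore1002_backup /wmf/data/archive/backup/misc/

    # Sanity checks: list the copied files and their sizes on HDFS
    sudo -u hdfs hdfs dfs -ls /wmf/data/archive/backup/misc/dbstore1002_backup
    sudo -u hdfs hdfs dfs -du -h /wmf/data/archive/backup/misc/dbstore1002_backup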
[14:49:16] 10Analytics, 10Analytics-Wikistats: Wikistats New Feature - DB size - https://phabricator.wikimedia.org/T212763 (10Nuria) 05Open→03Declined
[14:56:28] elukey: yeah we decided the pid problem was just a weird one-off and it's safer to fail when we see it. The improved reporting of errors should help, as it did with the interlanguage job, thanks for fixing that
[15:01:21] joal: thanks for checking the snapshot, it looks like you were right!
[15:01:27] though I'm now scared about what that means :)
[15:01:43] I'm technically off today, though I'm still not sure if I should work, Stephanie's being strange
[15:15:01] milimetric: o/ - sure, maybe the error msg should be a bit more verbose, it took me some minute to figure out the problem
[15:15:44] true, that'd be an improvement
[15:20:33] (03PS1) 10Milimetric: Clarify pid file error message [analytics/reportupdater] - 10https://gerrit.wikimedia.org/r/485673 (https://phabricator.wikimedia.org/T213219)
[15:20:55] (03CR) 10Milimetric: [V: 03+2 C: 03+2] Clarify pid file error message [analytics/reportupdater] - 10https://gerrit.wikimedia.org/r/485673 (https://phabricator.wikimedia.org/T213219) (owner: 10Milimetric)
[15:21:30] k, that's merged, can be deployed next time we deploy RU
[15:34:47] 10Analytics, 10Analytics-Kanban, 10Page-Issue-Warnings: event_pageissues Turnilo view contains no valid data from before January 5 - https://phabricator.wikimedia.org/T214136 (10mforns) > I see - does that also explain the "No data was returned" error in Superset? I'm not sure, it fails even with latest data...
[15:38:34] mforns: o/
[15:38:45] heya Lucaaa
[15:39:06] holaaa
[15:39:25] mforns: I'd like to move sanitize_el_analytics to timers if you are ok (https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/483426/)
[15:39:40] is there any special thing to do?
[15:39:53] I am basically replacing the crons with the timers
[15:40:07] this is the first spark job migrated
[15:40:16] if it turns out to be ok I'll migrate the rest
[15:43:48] yes, I've seen that, thanks! It looks good to me
[15:43:59] is the 4:14 an easter egg?
[15:47:22] mforns: 4:14?
[15:47:42] elukey, yes there's a job that starts daily at 4:14h
[15:48:44] it surprised me, because of the minute 14, why not just 4:00h, and I thought it was an easter egg, like bible passage or sth :]
[15:48:56] mforns: ah yes! Not sure why there is 14, good point, it should be 4:15
[15:49:04] basically I took the cron timings from the lines below
[15:49:08] hehe
[15:49:11] ok
[15:49:43] fixed on the fly :)
[15:50:20] basically this
[15:50:21] # Puppet Name: monitor_sanitize_eventlogging_analytics
[15:50:21] 15 4 * * * /usr/local/bin/monitor_sanitize_eventlogging_analytics >> /var/log/refinery/monitor_sanitize_eventlogging_analytics.log 2>&1
[15:50:32] is it ok right?
[15:51:00] because in this case the cron will be absented and the timer (even for the refine monitor) will be added
[15:53:06] mforns: --^
[15:53:40] in any case, would you have 5/10 mins to check with me after the merge that nothing is exploding?
[15:56:36] 10Analytics, 10Analytics-Kanban, 10Patch-For-Review, 10User-Elukey: Refactor analytics cronjobs to alarm on failure reliably - https://phabricator.wikimedia.org/T172532 (10elukey)
[15:59:51] elukey, sure!
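(Editor's note: for reference, the cron entry quoted above maps onto a systemd timer/service pair roughly as sketched below. The unit names and layout are hypothetical — the real units are generated by Puppet — the point is only the schedule translation: "15 4 * * *" becomes OnCalendar at 04:15 daily, and the >> log redirection becomes journald capture.)

    # Hypothetical units; names, sections shown, and paths are illustrative only.
    # Cron:  15 4 * * *  /usr/local/bin/monitor_sanitize_eventlogging_analytics >> /var/log/refinery/... 2>&1

    # monitor_sanitize_eventlogging_analytics.timer
    [Timer]
    OnCalendar=*-*-* 04:15:00

    # monitor_sanitize_eventlogging_analytics.service
    [Service]
    ExecStart=/usr/local/bin/monitor_sanitize_eventlogging_analytics

One practical difference from cron is that stdout/stderr end up in the journal rather than in the redirected log file, which is also why the journalctl commands discussed later in this log become the way to read the output.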
[16:31:16] mforns: done1
[16:31:23] ok
[16:32:34] 10Analytics, 10Analytics-Kanban: virtualpageview_hourly lacks data from December 17 on - https://phabricator.wikimedia.org/T213602 (10Milimetric)
[16:33:53] 10Analytics: Update git lfs on stat1006/7 - https://phabricator.wikimedia.org/T214089 (10Milimetric) p:05Triage→03Normal
[16:34:51] 10Analytics: Update git lfs on stat1006/7 - https://phabricator.wikimedia.org/T214089 (10Milimetric) a:03elukey
[16:35:08] 10Analytics, 10Analytics-Kanban: Update git lfs on stat1006/7 - https://phabricator.wikimedia.org/T214089 (10Milimetric)
[16:37:16] 10Analytics, 10Analytics-Kanban: Move AQS to nodejs 10 - https://phabricator.wikimedia.org/T210706 (10Milimetric) a:03Milimetric
[16:38:10] 10Analytics: Deprecate Spark 1.6 in favor of Spark 2.x only - https://phabricator.wikimedia.org/T212134 (10Milimetric) a:05elukey→03Ottomata
[16:38:21] 10Analytics: Deprecate Spark 1.6 in favor of Spark 2.x only - https://phabricator.wikimedia.org/T212134 (10Milimetric) switching to Andrew per Luca
[16:42:18] mforns: from my side is all green, let me know if you see anything weird
[16:42:25] elukey, ok
[16:47:26] 10Analytics, 10Analytics-Kanban: Update git lfs on stat1006/7 - https://phabricator.wikimedia.org/T214089 (10elukey) Hey Aaron, I can see the following: ` elukey@stat1007:~$ apt-cache policy git-lfs git-lfs: Installed: 2.3.4-1 Candidate: 2.3.4-1 Version table: 2.6.1-1~bpo9+1 100 100 http:/...
[16:47:44] 10Analytics, 10Analytics-Kanban, 10User-Elukey: Update git lfs on stat1006/7 - https://phabricator.wikimedia.org/T214089 (10elukey)
[17:02:05] 10Analytics, 10Analytics-EventLogging, 10Discovery, 10EventBus, 10Services (watching): Rewrite Avro schemas (ApiAction, CirrusSearchRequestSet) as JSONSchema and produce to EventGate - https://phabricator.wikimedia.org/T214080 (10JAllemandou) I had in mind that one of the reason for using avro originally...
[17:05:18] Jan 21 17:00:01 an-coord1001 systemd[1]: Started Spark job for sanitize_eventlogging_analytics.
[17:05:21] Jan 21 17:00:01 an-coord1001 systemd[1]: sanitize_eventlogging_analytics.service: Main process exited, code=exited, status=203/EXEC
[17:05:24] mforns: --^
[17:05:26] no bueno :D
[17:05:33] mmmmm
[17:05:40] lookin
[17:05:55] I think it is a misconfiguration of the timer iselft
[17:08:04] ahhh snap
[17:08:11] the script doesn't have a shebang
[17:09:05] yeah
[17:09:10] found it mforns
[17:09:25] oh
[17:10:38] sending the patch now
[17:14:04] mforns: https://gerrit.wikimedia.org/r/485689
[17:14:11] lookin
[17:14:50] +1'ed
[17:14:54] thanks!
[17:14:55] merging
[17:14:59] the rest looks good afaics
[17:15:00] is that needed for timers?
[17:15:05] ok
[17:19:28] mforns: yeah basically it seems that systemd does not like anything in ExecStart (basically what to execute) that it is not explicit
[17:19:36] in this case, a script without the shebang
[17:19:41] aha
[17:19:46] didn't know it
[17:19:52] if you want to play around
[17:19:56] on an-coord you can type
[17:20:13] systemctl status sanitize_eventlogging_analytics.service to see how the script is doing
[17:20:22] now it is not executing since it completed
[17:20:30] but you can see (code=exited, status=0/SUCCESS)
[17:20:36] that is the last execution
[17:20:42] it's dead
[17:20:46] ok ok
[17:20:54] well it already finished
[17:21:01] cooool
[17:21:11] logs are in journalctl -u sanitize_eventlogging_analytics
[17:21:18] (with sudo)
[17:21:25] where they previously logged anywhere?
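(Editor's note: status=203/EXEC above is systemd reporting that exec() of the ExecStart command failed before the program even started — typically a script with no interpreter line or no execute bit. Cron entries run through a shell, so shebang-less scripts go unnoticed there. A hedged sketch of the fix and of the inspection commands quoted in the log; the script path is illustrative, while the systemctl/journalctl invocations are the ones mentioned above.)

    # The fix: make sure the wrapper script invoked by ExecStart begins with an
    # interpreter line (e.g. "#!/bin/bash" as its very first line) and is executable;
    # otherwise the kernel cannot exec() it directly and systemd reports 203/EXEC.
    head -1 /usr/local/bin/sanitize_eventlogging_analytics    # should print the shebang (path is hypothetical)
    chmod +x /usr/local/bin/sanitize_eventlogging_analytics   # missing exec bit is the other usual culprit

    # Checking the unit afterwards (commands from the log, run on an-coord1001):
    systemctl status sanitize_eventlogging_analytics.service   # last run, exit status
    sudo journalctl -u sanitize_eventlogging_analytics         # spark-submit output captured by journald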
[17:21:36] I might have missed it
[17:21:37] awesome Luca, I had this thing in my todo for many days now
[17:22:03] elukey, well the real logs I get using yarn logs
[17:22:32] even the driver ones, because this job is called with deploy-mode=cluster
[17:23:26] ack, we get some know in journald about spark-submit
[17:23:44] if everything goes well I'll migrate the rest :)
[17:23:45] k
[17:36:54] * elukey off!
[19:40:26] a-team, can you please look at my off-site email and respond until tomorrow's stand-up? thxxxxx
[21:24:13] (03PS1) 10Joal: Update delete/restore in mediawiki-history [analytics/refinery/source] - 10https://gerrit.wikimedia.org/r/485710 (https://phabricator.wikimedia.org/T213603)
[21:24:17] milimetric: --^
[21:24:36] milimetric: Please don't hesitate if the changes don't make sense :)
[21:24:41] joal: I'm at a total loss, where in the world did the quality job put the results?
[21:24:58] isn't it supposed to be in history_check_errors
[21:25:05] milimetric: youre talking about the mediawiki-checker right?
[21:25:20] milimetric: it only writes output in case of error :)
[21:25:25] yeah, I ran it, it said 93 errors out of 5463 and no output
[21:25:30] milimetric: no error - no output :)
[21:25:41] oh it's below the threshold!
[21:25:42] Ah - errors ?? hm
[21:25:45] ok, so I guess it's good
[21:26:15] checking milimetric
[21:26:24] 19/01/21 17:33:18 INFO DenormalizedHistoryChecker: DenormMetricsGrowthErrors ratio: (93 / 5463) = 0.017023613399231193
[21:26:37] (i'll do your code review)
[21:26:56] ok, joal, so I'll make the documentation tweak you suggested and besides that I think we're good to deploy, right?
[21:27:07] let's do it together tomorrow maybe
[21:27:07] +1 milimetri !
[21:27:11] Sounds good :)
[21:27:17] ok, cool, good night for now
[21:27:38] Bye ;)
[21:27:59] Gone for now team - see y'all tomorrow
[22:20:38] (03CR) 10Nuria: Update mediawiki-history comment and actor joins (031 comment) [analytics/refinery/source] - 10https://gerrit.wikimedia.org/r/480796 (https://phabricator.wikimedia.org/T210543) (owner: 10Joal)
[22:31:38] (03PS9) 10Milimetric: Update mediawiki-history comment and actor joins [analytics/refinery/source] - 10https://gerrit.wikimedia.org/r/480796 (https://phabricator.wikimedia.org/T210543) (owner: 10Joal)
[22:31:49] (03CR) 10Milimetric: Update mediawiki-history comment and actor joins (032 comments) [analytics/refinery/source] - 10https://gerrit.wikimedia.org/r/480796 (https://phabricator.wikimedia.org/T210543) (owner: 10Joal)
[22:56:59] (03CR) 10Nuria: Update mediawiki-history comment and actor joins (031 comment) [analytics/refinery/source] - 10https://gerrit.wikimedia.org/r/480796 (https://phabricator.wikimedia.org/T210543) (owner: 10Joal)
[23:34:18] milimetric: i think these docs are quite outdated: https://www.mediawiki.org/wiki/Extension:EventLogging#Using_mediawiki-vagrant but I also think i saw a page where you updated this info, right?
[23:42:32] milimetric: nevermind found ticket and corrected docs: https://www.mediawiki.org/wiki/Extension:EventLogging#Using_mediawiki-vagrant
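(Editor's note: the "93 errors and no output" puzzle above is resolved by the INFO line from the checker — going by this conversation, the mediawiki-history checker only writes error output when the error ratio exceeds a configured maximum, and 93/5463 ≈ 1.7% stayed under it. A tiny arithmetic sketch; the 5% threshold below is a made-up stand-in for whatever value the job is actually configured with.)

    errors=93; total=5463; threshold=0.05   # threshold value is hypothetical
    awk -v e="$errors" -v t="$total" -v th="$threshold" \
        'BEGIN { r = e / t; printf "ratio = %.4f -> %s\n", r, (r < th ? "below threshold, no error output written" : "above threshold, errors written") }'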