[00:02:37] 10Analytics, 10EventBus, 10Core Platform Team Backlog (Later), 10Services (later), and 2 others: EventBus should not use service container in application logic - https://phabricator.wikimedia.org/T204296 (10Pchelolo) a:03holger.knust
[05:03:53] 10Analytics, 10Analytics-Wikistats: There are two entries for Cantonese language - https://phabricator.wikimedia.org/T215139 (10fdans)
[08:23:17] Good morning jetlagged-euro-team :)
[08:23:31] I hope elukey made it home yesterday :(
[09:06:40] elukey: wow, just saw the email - Man, I hope you managed to get some sleep :(
[10:06:49] PROBLEM - Check the last execution of check_webrequest_partitions on an-coord1001 is CRITICAL: CRITICAL: Status of the systemd unit check_webrequest_partitions
[11:15:54] 10Analytics: Clean up home dirs for user mkroetzsch - https://phabricator.wikimedia.org/T214501 (10MoritzMuehlenhoff) The user has now been removed.
[11:48:51] 10Analytics, 10Operations, 10Product-Analytics, 10User-Elukey: notebook/stat server(s) running out of memory - https://phabricator.wikimedia.org/T212824 (10aborrero) >>! In T212824#4922929, @Dzahn wrote: >>>! In T212824#4922713, @elukey wrote: >> @aborrero has already done a similar thing for the tool-forg...
[13:33:04] Heya team - fighting with my computer to get it back on track (graphics driver issue after upgrade :(
[13:46:57] joal: o/
[13:47:16] I managed to get home
[13:47:25] was reaaaaally long
[14:05:01] Maaaan
[14:05:22] elukey: I'm back in the game (graphics issue solved) - Please get some rest :S
[14:06:19] joal: yep yep, I am not going to be fully around; I slept a bit but now I am going to avoid sleeping more, otherwise I'll stay in SF's timezone for this week :D
[14:06:22] how was your flight?
[14:06:45] elukey: a bit of delay for the second one, but nothing major
[14:07:02] and now same for me: staying awake!
[14:07:48] I wanted to avoid sleeping at all but I was destroyed a couple of hours ago :D
[14:08:05] I can imagine elukey
[14:08:36] elukey: now that the computer is working I will look into the failed webrequest job
[14:11:20] I was doing the same! :)
[14:13:47] I can't tell from the yarn logs whether it is a temporary HDFS glitch or something else
[14:27:09] on an-coord1001 the check_webrequest_partitions.service has been marked as failed for 4h, known issue?
[14:27:43] moritzm: hi - known issue, we're on it (sorry for not getting to it faster) - Thanks for the ping :)
[14:29:27] cool, just wanted to make sure it has reached you :-)
[14:29:28] moritzm: thanks! That is one of the crons that became timers
[14:29:34] (as FYI)
[14:30:19] good, with the cron we would have missed that I guess
[14:30:50] also, there's a disk space alert for stat1007: "DISK CRITICAL - free space: / 1805 MB (2% inode=97%):"
[14:30:57] the thing used to send an email to alert about missing data, buuut if that ends up in the spam .. :)
[14:31:04] yep I am also checking stat1007 now
[14:33:11] mmmm what is userarchive?
[14:33:17] I am kinda confused
[14:37:51] ahhh it is the admin module!
[14:39:14] TIL: it saves old users' data to /var/userarchive when deleting the user
[14:42:41] ah, that explains it
[14:42:53] mkroetzsch was removed earlier in the day
[14:43:02] project has ended, MOU expired
[14:43:24] the whole thing is also racy
[14:43:57] the same puppet run tries to deluser the UID in question, but while the tarball is being generated in userarchive that UID is in use and it fails...
[14:44:16] moritzm: but it doesn't remove the home's data, right?
[14:44:30] I mean, am I safe to remove /var/userarchive/* now?
[14:44:38] I have no idea, I had the same TIL five minutes ago :-)
[14:46:10] elukey: the logs in oozie for the failed job seem to be related to an HDFS glitch, but can't say more :(
[14:46:12] I think so
[14:46:16] elukey: ok for me to relaunch?
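[Editor's note: the cron-to-timer remark above is why this alert was visible at all - a timer-driven oneshot service that exits non-zero stays in a "failed" unit state that Icinga can check, whereas the old cron only emailed, and that email could land in spam. A minimal sketch of such a pair; unit names and the script path are illustrative, not the actual puppetized units:]

```ini
# check_webrequest_partitions.service (sketch)
[Unit]
Description=Check webrequest partitions for missing data

[Service]
Type=oneshot
ExecStart=/usr/local/bin/check_webrequest_partitions

# check_webrequest_partitions.timer (sketch)
[Timer]
OnCalendar=hourly

[Install]
WantedBy=timers.target
```

[If ExecStart fails, `systemctl status check_webrequest_partitions` reports the unit as failed until the next successful run or a manual reset, which is exactly the state the monitoring check observes.]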
[14:46:32] in /srv/home/mkroetzsch there's still 170G of user data
[14:47:01] joal: I had the same impression, +1
[14:47:04] I'm wondering what the purpose of this userarchive script is, if the home is kept anyway
[14:47:30] !log Rerun webrequest-load-coord-text for 2019-02-04T04:00:00
[14:47:32] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[14:47:52] moritzm: really strange
[14:47:53] /usr/sbin/deluser --remove-home --backup-to=$ARCHIVE_DIR $username &> /dev/null
[14:47:58] in theory this should remove the home
[14:48:10] but it doesn't, and it creates the archive
[14:48:32] the tarball is most likely incomplete
[14:48:37] the user has 170G in the home
[14:48:47] but the entire / partition is only 90G
[14:49:38] cleaned up the archive dir
[14:49:43] root partition is ok now
[14:49:50] so the 80G we saw in /var/userarchive was probably an incomplete tarball, so deluser failed
[14:50:10] going to open a task to investigate, I didn't know about this cleanup feature
[14:50:13] :D
[14:50:24] need to run an errand for half an hour, brb!
[14:51:06] on systems where /home is on a separate partition (like the stat hosts) we should probably set $ARCHIVE_DIR to a directory on that separate partition as well
[15:26:03] joal: Hello! Do you have a minute to discuss https://phabricator.wikimedia.org/T214897#4923780 ?
[15:26:45] Hi GoranSM1 - I have some time but will need to drop off in ~10 minutes - So either short or later :)
[15:27:07] joal: Later sounds better?
[15:27:39] Sounds easier (at least for me) - Would 18:30 CET work for you?
[15:27:52] joal: Thanks! Just let me know when you have some time. In a nutshell: from pyspark I can't read.parquet because it can't infer the schema, and I also have some trouble with parquet-tools schema
[15:28:04] joal: 18:30 CET is perfect, thank you!
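[Editor's note: the incomplete-tarball failure above is ultimately a capacity problem - `deluser --backup-to` wrote the archive to /var/userarchive on the ~90G root partition while the home directory held 170G on its own partition. A hypothetical pre-flight check, not part of the actual admin module, could look like:]

```python
import shutil
from pathlib import Path

def backup_fits(home: Path, archive_dir: Path) -> bool:
    """Rough pre-flight check: does archive_dir's filesystem have
    room for a worst-case (uncompressed) tarball of home?"""
    needed = sum(f.stat().st_size for f in home.rglob("*") if f.is_file())
    free = shutil.disk_usage(archive_dir).free
    return free > needed
```

[Running such a check before invoking deluser, or pointing $ARCHIVE_DIR at the same partition as /home as suggested below in the log, would avoid filling the root filesystem.]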
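[Editor's note: the schema-inference error GoranSM1 describes is typical when `spark.read.parquet` is pointed at a parent folder whose only contents are dated snapshot subfolders, as with the wikidata_parquet layout discussed here - Spark finds no parquet files directly under the path and cannot infer a schema. A minimal local sketch of the layout and the working read; all paths here are illustrative:]

```python
import tempfile
from pathlib import Path

from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[1]").getOrCreate()

# Mimic the HDFS layout: a parent folder whose children are
# dated snapshot subfolders rather than parquet files.
parent = Path(tempfile.mkdtemp()) / "wikidata_parquet"
snapshot = parent / "20181001"
spark.range(5).write.parquet(str(snapshot))

# Reading the dated snapshot subfolder works; reading `parent`
# directly would fail with "Unable to infer schema for Parquet".
df = spark.read.parquet(str(snapshot))
print(df.count())  # 5
```

[The fix is simply to append the desired snapshot date to the parent path, which is what resolves the issue later in the conversation.]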
[15:28:13] elukey: created https://phabricator.wikimedia.org/T215171 for the userarchive handling
[15:28:46] GoranSM1: You probably haven't appended a subfolder to the parent folder I sent in the task: there are 2 possible subfolders (based on the dump date)
[15:29:37] I assume you want the most recent one, so you should try to read with the path being: /user/joal/wmf/data/wmf/mediawiki/wikidata_parquet/20181001
[15:29:41] GoranSM1: --^
[15:29:51] Will ping you around 6:30 when I get back
[15:30:08] joal: I see them. Thank you, 18:30, and maybe I'll manage to solve the problem before then - will let you know!
[15:30:15] Sure :)
[15:30:25] joal: Thanks :)
[15:36:44] joal: All fine, I can access the files. Thanks!
[15:36:50] np GoranSM1 :)
[15:40:14] elukey: the failed job succeeded that time - not sure about the issue :(
[15:46:24] joal: back :) - so it seemed to be an issue talking with an-worker1080; I'd archive it as a glitch, but let's keep an eye out for new recurrences
[15:53:25] sure
[15:53:53] Will miss standup - Melissa can't get the kids - Back after
[16:01:57] :) only one in the batcave, but I'm in no condition to talk to humans anyway
[16:03:11] thinking out loud about the main thing that's on my mind: the actor/comment refactor:
[16:03:41] 1. merged puppet sqoop changes: https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/486203/
[16:05:18] 2. just thought of it!!! forgot to change logging_compat back to logging!!! Should do that now! Doing
[16:19:11] (03PS1) 10Milimetric: Switch back to sqooping from logging [analytics/refinery] - 10https://gerrit.wikimedia.org/r/487871
[16:20:12] joal: theoretically we should merge this ^ but I think it's fine to leave it as is too; we don't take too much of a performance hit. I'll wait for your thoughts because I don't want to trade correctness for performance with my jetlagged brain
[16:23:05] 3. merge and deploy refinery / refinery-source changes: Done
[16:23:42] 4. restart the oozie jobs (load, denormalize, check, reduce)
[16:25:05] judging by https://hue.wikimedia.org/oozie/list_oozie_coordinator/0070366-181112144035577-oozie-oozi-C/ and https://hue.wikimedia.org/oozie/list_oozie_coordinator/0070362-181112144035577-oozie-oozi-C/, joseph restarted load and denorm
[16:25:39] I guess maybe check and reduce don't need to be restarted... yeah, they should be compatible; the join only affected processing, not what the history looks like
[16:28:38] mmmk, that should cover it, but probably good to have a list
[16:30:13] 10Analytics, 10EventBus, 10Core Platform Team Backlog (Later), 10Services (later), 10good first bug: EventBus should make better use of DI - https://phabricator.wikimedia.org/T204295 (10Pchelolo) a:05Pchelolo→03holger.knust
[16:33:35] 10Analytics, 10EventBus, 10Core Platform Team Backlog (Later), 10Services (later), and 2 others: EventBus should not use service container in application logic - https://phabricator.wikimedia.org/T204296 (10Pchelolo) 05Open→03Declined Given that all the code is actually called from hooks, which are sta...
[16:48:13] RECOVERY - Check the last execution of check_webrequest_partitions on an-coord1001 is OK: OK: Status of the systemd unit check_webrequest_partitions
[16:51:02] 10Analytics, 10Analytics-EventLogging, 10EventBus, 10Security-Team, and 3 others: Modern Event Platform: Stream Intake Service: AJV usage security review - https://phabricator.wikimedia.org/T208251 (10sbassett) **Update:** review will be posted here by Friday (2/8/2019) at the latest.
[17:07:01] milimetric: Hi !
[17:07:16] Just read your list - I +1 everything :)
[17:08:22] Meaning that normally nothing else needs to be done
[17:09:51] joal: ok, cool, so we can merge the logging compat change later, you think?
[17:09:59] We can merge your patch for logging-no-compat when you want
[17:10:13] I don't think it should impact anything
[17:10:38] ok, let's let it go like this since that's how I tested the full pipeline, and merge after this snapshot
[17:10:49] thanks joal, I'm gonna go back to being dazed and sleepy
[17:11:23] oh joal, regarding an-coord1001, I saw some warnings from oozie last week but never looked at them; it was webrequest statistics
[17:11:38] I tried but couldn't look on my phone, and the wifi was always broken at the venue
[17:11:53] was going to follow up this week but saw you guys had problems with it too
[17:12:01] milimetric: I'm planning to double check them with the script, but did not do it today (maybe later)
[17:12:26] milimetric: The error one just needed a rerun, and the others are less important
[17:12:33] k
[17:22:58] 10Analytics, 10Anti-Harassment, 10Product-Analytics: Mediawiki history has no data on IP blocks - https://phabricator.wikimedia.org/T211627 (10nettrom_WMF) @JAllemandou : Thanks for meeting up with me during All Hands to discuss this, and also giving me handy tips on working with the Data Lake, I really appr...
[17:32:06] 10Analytics, 10Anti-Harassment, 10Product-Analytics: Mediawiki history has no data on IP blocks - https://phabricator.wikimedia.org/T211627 (10JAllemandou) Thanks @nettrom_WMF for the follow up :) I'll try to include that in the next bunch of big changes I'm working on for mediawiki-history :)
[17:38:48] 10Analytics, 10ORES, 10Scoring-platform-team: Backfill ORES Hadoop scores with historical data - https://phabricator.wikimedia.org/T209737 (10Ladsgroup) Since there's no assignee here, I'll move it to backlog, feel free to fix.
[17:56:36] going afk people, will hopefully be less sleepy tomorrow :)
[18:16:19] Bye elukey - Have a good and long night :)
[18:27:55] ciao joal,
[18:28:12] and milimetric
[18:59:22] 10Analytics-EventLogging, 10Analytics-Kanban, 10EventBus, 10Product-Analytics, and 5 others: Modern Event Platform: Schema Guidelines and Conventions - https://phabricator.wikimedia.org/T214093 (10EBernhardson) Overall the http structure looks perfectly reasonable. Adding a `has_cookies` boolean field coul...
[20:30:25] !log Confirm that last week's dataloss warnings were false alarms (upload -> 2019-1-28-15, 2019-1-28-16, 2019-2-1-1, 2019-2-1-4, 2019-2-1-13 -- text -> 2019-2-1-13, 2019-2-1-15)
[20:30:27] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[20:42:14] Gone for tonight team - See y'all tomorrow