[02:06:30] 10Analytics, 10Wikimedia-Stream: old revision length missing in /v2/stream/recentchange - https://phabricator.wikimedia.org/T160030#3086638 (10Ottomata) [02:18:57] 10Analytics, 10Wikimedia-Stream: old revision length missing in /v2/stream/recentchange - https://phabricator.wikimedia.org/T160030#3086638 (10Pchelolo) Em. That's weird. When I go to https://stream.wikimedia.org/v2/stream/recentchange I see proper value in "old" property. I only see `"old":null` for pages tha... [04:30:11] 10Analytics, 10Wikimedia-Stream: old revision length missing in /v2/stream/recentchange - https://phabricator.wikimedia.org/T160030#3086735 (10Ottomata) Ah! haha, no, I had an old grep in my CLI history that was grep -v edit (negation)! DOH! @Jdlrobson it looks fine to me, can you confirm you don't see them... [07:01:28] hi a-team :] [07:22:50] 10Analytics-Tech-community-metrics: Updated data in mediawiki-identities DB not deployed onto wikimedia.biterg.io? - https://phabricator.wikimedia.org/T157898#3086846 (10Lcanasdiaz) Data about Git and Gerrit is being refreshed this week. On Monday we found a bug in Gerrit which is blocking us, as soon as it is f... [08:17:59] o/ [08:18:59] hey o/ [08:19:41] checking the oozie jobs [08:21:15] mforns: we have been experiencing issues with load balancing during the past hours, and tracing the load-text jobs that failed I can see: [08:21:18] 0012364-170228165458841-oozie-oozi-W@send_email_with_attachments ERROR - ERROR EM007 [08:22:29] that was triggered by check_sequence_statistics [08:22:32] Actions [08:22:32] ------------------------------------------------------------------------------------------------------------------------------------ [08:22:35] ID Status Ext ID Ext Status Err Code [08:22:38] ------------------------------------------------------------------------------------------------------------------------------------ [08:22:41] 0012363-170228165458841-oozie-oozi-W@:start: OK - OK - [08:22:44] ------------------------------------------------------------------------------------------------------------------------------------ [08:22:47] 0012363-170228165458841-oozie-oozi-W@extract_error_data_loss OK job_1488294419903_29358SUCCEEDED - [08:22:50] ------------------------------------------------------------------------------------------------------------------------------------ [08:22:53] 0012363-170228165458841-oozie-oozi-W@check_error_data_loss OK - [08:22:56] mmmm [08:26:33] didn't paste it all [08:26:47] the last bit is [08:26:49] ------------------------------------------------------------------------------------------------------------------------------------ [08:26:52] 0012363-170228165458841-oozie-oozi-W@send_error_data_loss_email ERROR 0012364-170228165458841-oozie-oozi-WKILLED - [08:26:55] ------------------------------------------------------------------------------------------------------------------------------------ [08:26:58] 0012363-170228165458841-oozie-oozi-W@kill OK - OK E0729 [08:27:01] ------------------------------------------------------------------------------------------------------------------------------------ [08:28:06] !log re-run 186-09 Mar 2017 00:00:00 (webrequest-load-maps) on Hie [08:28:07] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [08:28:10] Hie [08:28:11] ahhaah [08:28:20] ok let's see it the maps job works [08:30:30] ahhh EM007 seems to be an error trying to contact the SMTP server [08:30:38] that might be related to our issue [08:30:54] so we didn't get the proper emails 
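A minimal sketch of the grep mix-up described above, assuming recentchange events carry a "type" field and that only edit events populate the "old" revision length (the stream URL is from the task; the filters are illustrative):

  # the old negated grep drops exactly the events that carry the old length
  curl -s https://stream.wikimedia.org/v2/stream/recentchange | grep -v edit
  # keeping only edit events shows the "old" property populated as expected
  curl -s https://stream.wikimedia.org/v2/stream/recentchange | grep '"type":"edit"'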
[08:31:21] weeeeird [08:33:30] org.apache.oozie.command.CommandException: E0800: Action it is not running its in [OK] state, action [0012281-170228165458841-oozie-oozi-W@mark_add_partition_done] [08:37:54] the maps job that failed in a similar weird way as depicted above just completed successfully [08:39:10] !log re-run all the failed misc webrequest-load oozie jobs (total of four) [08:39:11] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [08:39:17] in the meantime, I'll check the text ones [08:43:27] Example of failed oozie ID if anybody wants to check - 0012010-170228165458841-oozie-oozi-W (upload), 0012327-170228165458841-oozie-oozi-W and 0012385-170228165458841-oozie-oozi-W (text) [08:43:43] !log re-run via Hue the failed upload-load job [08:43:45] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [08:52:19] elukey, reading now your mesage [08:53:12] so it seems that all these jobs failed in check_sequence_statistics [08:53:17] but can't really find why [08:53:23] maybe we can check on hdfs [08:53:24] ? [08:53:37] reading :] [08:54:51] elukey, ok, now thinking :] [08:56:09] does it make sense what I am saying? Also weird that re-runing jobs fixes it [08:56:18] I would have expected an error email [08:57:07] !log re-running webrequest-load-text failed jobs too via Hue [08:57:09] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [08:59:51] the oozie emails seems to indicated "Failing action: send_warning_data_loss_email" too [08:59:54] but that's weird [09:00:33] elukey, yes weird [09:00:45] I'm going to check the integrity of the stats table [09:01:07] misc and maps are ok now [09:02:48] I can also see [09:02:49] 2017-03-09 08:43:09,160 ERROR CompletedActionXCommand:517 - SERVER[analytics1003.eqiad.wmnet] USER[-] GROUP[-] TOKEN[] APP[-] JOB[0012588-170228165458841-oozie-oozi-W] ACTION[0012588-170228165458841-oozie-oozi-W@mark_add_partition_done] XException, [09:02:54] org.apache.oozie.command.CommandException: E0800: Action it is not running its in [OK] state, action [0012588-170228165458841-oozie-oozi-W@mark_add_partition_done] [09:02:57] that doesn't block the job though [09:03:01] (might be not relevant) [09:03:05] aha [09:03:59] this is oozie job -log 0012588-170228165458841-oozie-oozi-W [09:04:02] from stat1004 [09:04:11] (current upload load job that I've restarted) [09:04:38] elukey, from yesterday no? [09:05:36] nono today, I just restarted the job [09:05:41] and it keeps going [09:05:51] so it might not be related [09:05:53] elukey, yes I mean for march 8th 19h [09:06:18] the only upload error was for yesterday 19h right? [09:06:22] ah yes! [09:06:25] ok ok [09:06:36] elukey, question: [09:07:04] the misc and maps errors, you re-run them and they failed, and then you re-run them again, and they went OK? [09:07:19] nope, I just re-ran them once [09:07:27] and they complete successfully [09:07:29] ok ok [09:07:53] but there were some re-run failures no? [09:08:17] where do you see them? [09:08:31] anyhow, even 0012588-170228165458841-oozie-oozi-W (new upload load job) is doing refine atm, so all good [09:09:15] elukey, you logged failed reruns: " !log re-running webrequest-load-text failed jobs too via Hue" [09:10:03] nono what I meant in there was hitting rerun on hue [09:10:18] ah! 
ok misunderstood, my bad :P [09:10:29] cool [09:10:32] nono note taken, I should have been more clear :) [09:10:35] thanks for pointing out [09:10:40] let me check the stats table [09:11:11] elukey, actually your sentence is perfectly correct [09:17:12] Hi guys ! Thanks for taking care of all that ! [09:17:19] Is there anything I can help with? [09:20:37] joal: o/ - yes! We have no idea what failed :9 [09:20:39] :( [09:20:51] Arf [09:21:20] it seemed to me an issue sending the check_sequence_statistics error email [09:21:24] via SMTP [09:21:31] elukey: From what I understood, jobs were in error, and rerunning them fixed them? [09:21:34] but then re-running the jobs fixes the problem [09:21:38] yeah [09:21:44] I wouldn't have expected it [09:21:46] weird [09:21:54] elukey, hi joal! it's weird, the error file (and warning file) are empty and also are named differently [09:21:54] and all of them, afaics, failed in check_sequence_statistics [09:22:03] not 00000_0, but 000000_1000 [09:22:24] empty suggests there was no error, but still the error message was triggered [09:22:31] and the file name is weird [09:23:18] hm [09:23:43] as far as I can see, webrequest-load-check_sequence_statistics-wf-text-2017-3-9-2 failed the re-run [09:24:04] from oozie info check_sequence_statistics [09:24:18] to be clear - we currently are experiencing some issues in eqiad with load balancing [09:24:30] but not sure how it could affect us [09:24:36] maybe errors were triggered because of internal hadoop errors, leading to some containers disconnected from their master (still writing results, but job failed) [09:25:16] the failed job in hue does not have any actions or logs [09:26:52] yeah you need to check it via oozie CLI [09:26:58] k [09:27:05] because of a weird "feature" of the new Hue [09:27:06] sigh [09:27:20] I am using oozie job -info JOBID [09:27:24] and -log [09:28:55] ok, doing it now [09:29:28] it looks like it's failing when marking the dataset done... [09:29:45] 2017-03-09 09:05:55,997 ERROR CompletedActionXCommand:517 - SERVER[analytics1003.eqiad.wmnet] USER[-] GROUP[-] TOKEN[] APP[-] JOB[0012588-170228165458841-oozie-oozi-W] ACTION[0012588-170228165458841-oozie-oozi-W@mark_raw_dataset_done] XException [09:30:07] mforns: about the names, it's normal that errors were triggered [09:30:56] mforns: in workflow.xml, check is made on 000000_0 [09:31:04] 0012614-170228165458841-oozie-oozi-W (the other webrequest-text-load job that I re-ran) is in refine now [09:31:05] joal, I know [09:31:31] mforns: But, if an error occurs during the check (attempt fail or something), then the name will be different [09:31:51] mmm [09:36:55] elukey, the other text job has the same error, it will probably fail [09:39:43] elukey: it looks as if the problem was impacting hadoop itself ... Is that possible? [09:42:30] joal: it might be, but we should double check timelines.. [09:42:58] k [09:43:09] first error started at ~3AM for me (CEST), before the impact [09:43:13] (on lbs) [09:43:53] mforns: one of the text jobs is in refine (and passed the check_seq_statistics), the other one failed.. weird [09:44:06] O.o [09:52:16] !log restarted Mar 2017 02:00:00 webrequest-load-text (second time) [09:52:17] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [09:52:59] elukey, it succeeded? [09:53:14] mforns: no that one died [09:53:22] I restarted it [09:53:49] elukey, but the 1h one is green now [09:53:58] yeah [09:54:06] what happened?
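A sketch of the oozie CLI inspection described above, run from stat1004 as in the log (the job ID is one of the failed text workflows mentioned earlier; OOZIE_URL is assumed to be preconfigured on the host):

  # list the workflow's actions with their status and error codes
  oozie job -info 0012385-170228165458841-oozie-oozi-W
  # fetch the workflow's oozie log, e.g. to spot E0800 / EM007 entries
  oozie job -log 0012385-170228165458841-oozie-oozi-W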
[09:54:19] mforns: I am not getting the point sorry [09:54:33] That's really weird ... Could be related to cluster being overwhelmed [09:54:49] two text jobs failed, I restarted both the first time and one succeeded and the other one didn't.. then I restarted the one that failed two times again [09:55:10] there were 2 hours missing for text, hour=1 and hour=2, you re-ran both and both failed, now hour=1 is success, do you know what made it succedd? [09:55:21] *succeed [09:55:30] nono as said above, I re-ran both and only one failed [09:55:34] hour=2 [09:55:35] oh [09:56:49] Cluster was not overwhelmed - looks really related to something else [09:57:51] elukey: from resource manager metrics, looks like an1041 is different than others [09:57:56] elukey: could be related? [10:01:58] not sure, both 1041 and 1040 have different behaviors in datanode/nodemanager, but let's keep in mind that 1) 1040's hdfs datanode parts have been wiped 2) 1041/1040 are running with a 4.4 kernel, meanwhile on trusty we are at 3.X [10:02:09] so there are a *lot* of differences in handling memory [10:02:25] elukey: ok [10:02:44] BUT [10:02:58] it could be interesting to see if those failures were coming from jobs running on these nodes [10:03:09] is it possible from oozie/mapred logs? [10:03:20] that could be an explanation [10:03:45] elukey: looks like 41 has a different nodemanager memory pattern than 40, which is possibly not expected [10:04:22] same number of containers running? [10:05:03] elukey: always 0 [10:05:14] Which explains memory, but is that expected? [10:06:32] ah! not at all [10:06:34] let me check [10:07:23] elukey: something else to check: overall node manager heap used over last 24 hours has really changed [10:09:19] !log restarted yarn-nodemanager on analytics1040 [10:09:20] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [10:09:27] nothing weird from the logs afaics [10:10:35] hm [10:10:54] overall node manager heap used over last 24 hours has really changed ==> I am not seeing it [10:11:12] elukey: batcave for a minute? [10:11:15] https://grafana.wikimedia.org/dashboard/db/analytics-hadoop?panelId=17&fullscreen&from=now-2d&to=now ? [10:11:57] joal: I am also checking the ops issue atm, a bit busy but maybe in 5/10 mins? [10:11:58] elukey: My bad, not nodemanager, resource manager ! [10:12:11] ahhh [10:12:16] elukey: probably not needed actually [10:13:56] this is on an1041 [10:14:01] 2017-03-09 10:13:31,411 WARN org.apache.hadoop.fs.LocalDirAllocator$AllocatorPerContext: Disk Error Exception: [10:14:04] org.apache.hadoop.util.DiskChecker$DiskErrorException: Directory is not readable: /var/lib/hadoop/data/j/yarn/local/nmPrivate [10:14:07] gneeeee [10:14:21] oh [10:14:21] riiiight [10:14:55] elukey@analytics1041:/var/log/hadoop-yarn$ sudo ls -l /var/lib/hadoop/data/j/yarn/local/nmPrivate [10:14:58] total 12 [10:15:01] drwxrwxr-x 2 ntp ssl-cert 4096 Mar 8 09:48 application_1488294419903_27409 [10:15:04] drwxrwxr-x 3 ntp ssl-cert 4096 Mar 8 09:55 application_1488294419903_27435 [10:15:07] drwxrwxr-x 5 ntp ssl-cert 4096 Mar 8 09:56 application_1488294419903_27436 [10:15:10] WHAAAAT? [10:15:16] this is probably my fault [10:15:29] I fixed /var/lib/hadoop/data/j/hdfs yesterday [10:15:45] ah snap [10:15:48] elukey: no big deal, a couple of jobs failed, and problem seems identified [10:16:02] mforns: Thanks for helping restarting the things !
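The ls output above shows the yarn local dirs still owned by the pre-reimage uid/gid (rendered as ntp:ssl-cert), which is why the nodemanager cannot read nmPrivate. A sketch of the cleanup that follows below; the glob, find flags and service name are illustrative, not the exact commands used:

  # on analytics1041: stop the nodemanager before touching its local dirs
  sudo service hadoop-yarn-nodemanager stop
  # list anything not owned by yarn:yarn, then re-own the yarn tree on every data disk
  sudo find /var/lib/hadoop/data/*/yarn \( ! -user yarn -o ! -group yarn \) -ls
  sudo chown -R yarn:yarn /var/lib/hadoop/data/*/yarn
  sudo service hadoop-yarn-nodemanager start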
[10:16:59] np, actually it was Luca, thanks guys for fixing this, has been interesing to follow :] [10:18:13] just to double check I am running root@analytics1042:/var/lib/hadoop/data/b# find ! -user yarn -group yarn [10:18:49] so the current theory is that a single new hosts, 1041, failed to run containers due to fs issues [10:20:33] ok stopped nodemanager on 1041, fixing perms [10:20:44] guys, I have to leave now, will be back 1h30m before standup, cheers [10:22:51] bye mforns [10:23:01] Thanks elukey for fixing :) [10:25:27] really sorry for the trouble, I haven't checked the /var/lib/hadoop/data/X/yarn dirs.. [10:26:10] and it makes sense: since I reinstalled the OS, new gid/uid for users have been created. So after mounting the partitions, the inodes were still pointing to old gid/uid [10:26:20] I fixed the hdfs dir [10:26:23] and not yarn [10:26:29] I thought it was empty or something [10:27:42] elukey: no big deal - It's good learning for global future reimaging ! [10:27:50] Thanks for doing that :) [10:28:05] elukey: I love the idea we'll be full-debian soon :D [10:29:08] me too! [10:31:53] !log analytics1041 yarn nodemanager stopped, chowning to yarn:yarn all the perms in /var/lib/hadoop/data/X/yarn dirs [10:31:55] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [10:40:58] ok going to write all the steps (including verification of permissions) in the Debian migration guide [10:41:11] !Log Restart webrequest load text 2017-03-09T02:00Z [10:56:22] still fixing perms on 1041 [10:56:26] ahahahha [10:56:30] looong command [11:04:25] !log an1041 yarn nodemanager back running [11:04:26] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [11:05:28] oh yess we have now containers on 1041 [11:05:30] \o/ [11:21:57] going to step away for ~30 mins! [11:50:54] joal: bonjour! one tiny question and I think I'm go for testing the job [11:52:41] right now my data is stored in hdfs, but it's compressed in gz, does it need to be decompressed before the job starts? [11:53:21] joal: also, I'm guessing mapreduce.input.fileinputformat.inputdir expects a regular expression path that matches all the files I want to load? [13:17:49] * fdans goes for some lunchy lunch [13:56:47] joal: http://events.linuxfoundation.org/events/apache-big-data-north-america/program/schedule [14:04:17] fdans: sorry ! Completely missed your ping !! [14:05:12] fdans: it's actually very easy easy with mapreduce: it decompresses by itself regular formats (gz, snappy, bz2), and reads every file present in a folder [14:06:42] elukey: mwarf, will not go there :( [14:08:25] http://events.linuxfoundation.org/events/apachecon-north-america/program/schedule - https://apachecon2017.sched.com/event/9zvG/cassandra-serving-netflix-scale-vinay-chella-netflix?iframe=no&w=100%&sidebar=yes&bg=no [14:09:04] https://apachecon2017.sched.com/event/9zp7/10-things-to-consider-when-using-apache-kafka-utilization-points-of-apache-kafka-obtained-from-iot-use-case-naoto-umemori-yuji-hagiwara-ntt-data-corporation?iframe=no&w=100%&sidebar=yes&bg=no [14:17:11] !log restart failed webrequest load [upload maps misc] 2017-03-09T09:00Z [14:17:12] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [14:18:09] oh joal, so I don't need to do anything to those files! cool, I'm going to give this a try, unless you want to take a look at it first [14:18:43] fdans: just a quick double check - Which keyspace are you ganna use? 
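As joal explains above, the standard MapReduce input formats decompress gz/snappy/bz2 transparently and read every file under the input directory, so nothing has to be unpacked first. A hedged sketch (jar, driver class and input path are hypothetical, and -D is only honoured if the driver goes through ToolRunner):

  # point the job at the directory holding the .gz files; no manual decompression needed
  hadoop jar my-loader.jar org.example.LoaderDriver \
    -D mapreduce.input.fileinputformat.inputdir=hdfs:///user/fdans/pageviews/2017/03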
[14:19:14] local_group_default_T_pageviews_per_project_v2 [14:19:22] joal ^ [14:19:57] fdans: I suggested to create a new one dedicated for those tests [14:20:32] fdans: if anything goes wrong with the loading, or anything else, this keyspace is the real prod one [14:20:51] you're totally right joal [14:23:29] fdans: easiest way is, in CQLSH, do a: describe "local_group_default_T_pageviews_per_project_v2"; [14:24:01] It'll give you the CREATE commands that needs to be run for creating a fake keyspace (changing the keyspace name manually, obviously) [14:24:09] yeah that's what I've been doing :) [14:25:57] joal however, I can't find in the docs the way of starting cqlsh in stat1004, is there any magic involved? [14:27:09] fdans: always :) [14:27:53] fdans: https://wikitech.wikimedia.org/wiki/Analytics/AQS#Cassandra_CLI [14:28:47] fdans: meaning: cqlsh -h aqs1004-a.eqiad.wmnet -u aqsloader -p [14:29:25] joal: what I mean is that cqlsh is not found as a command in stat1004 [14:29:48] fdans: on aqs1004 :) [14:29:57] ohhhhhh I seeeee [14:29:58] silly me [14:30:03] thank you joal [14:31:09] fdans, do you have 5 minutes to help me executing aqs test with local cassandra? [14:31:18] :] [14:31:23] mforns: hell yeah! [14:31:29] a la baticueva! [14:31:34] :D baticueva it is [14:48:41] I automated https://wikitech.wikimedia.org/wiki/Analytics/Cluster/Hadoop/Administration#Worker_Reimage_.2812_disk.2C_2_flex_bay_drives_-_analytics1028-analytics1057.29 [14:49:04] so now it is all written down and a worker reimage should take ~1 hour [14:49:15] and it shouldn't fail like this morning :) [14:49:20] joal,mforns --^ [14:49:40] cool elukey [14:49:52] 06Analytics-Kanban: Extract edit history denormalized data from intermediate data - https://phabricator.wikimedia.org/T144717#3087817 (10JAllemandou) [14:49:55] 06Analytics-Kanban: Productionize Edit History Reconstruction and Extraction - https://phabricator.wikimedia.org/T152035#3087816 (10JAllemandou) [14:50:09] today I leaned a trick with awk [14:50:12] awk -F 'sd|1' '{print $2}') [14:50:16] elukey: Thanks :) [14:50:24] sorry the last ) is not needed [14:50:35] elukey: What does it do? [14:50:38] say you have /dev/sdb1 and you need only "b" [14:50:39] I don't know it ! [14:50:54] Ohhh ! Get it :) [14:50:57] Awesome [14:51:20] elukey@analytics1041:~$ echo "/dev/sdb1" | awk -F 'sd|1' '{print $2}' [14:51:23] b [14:51:29] awk sometimes is so awesome [14:51:50] 06Analytics-Kanban: Create oozie job for mediawiki edit history job - https://phabricator.wikimedia.org/T160074#3087833 (10JAllemandou) [14:53:32] nice elukey! [14:55:03] 06Analytics-Kanban, 10ChangeProp, 06Operations, 10Reading-Web-Trending-Service, 06Services (watching): Upgrade librdkafka 0.9.4 on SCB and Varnishes - https://phabricator.wikimedia.org/T159379#3087865 (10Ottomata) I think we are ready to proceed with this when yall are. Should we schedule a day next wee... [14:56:22] elukey: q, do I have permission to access aqs1004? [15:01:15] fdans: probably not, have you tried to ssh ? [15:01:23] ottomata: \o/ [15:01:46] yeah, it's asking me for a password [15:01:48] https://www.irccloud.com/pastebin/1bQRQ7Ma/ [15:02:15] fdans: I can open a phab task and request access, but it will probably take a ops meeting [15:02:22] and next week is us holiday [15:02:26] so it might take a while [15:03:44] ah no maybe there is a quicker way [15:03:46] aqs-users [15:03:50] it does not have sudo [15:04:03] fdans: what do you need to do on aqs1004? 
D: [15:04:04] :D [15:09:27] 10Analytics-Cluster, 06Analytics-Kanban, 06Operations, 13Patch-For-Review, 15User-Elukey: Reimage a Trusty Hadoop worker to Debian jessie - https://phabricator.wikimedia.org/T159530#3070663 (10elukey) Just completed https://wikitech.wikimedia.org/wiki/Analytics/Cluster/Hadoop/Administration#Worker_Nodes_... [15:12:11] elukey: I need to create a test keyspace now and later the final keyspace for the corrected data [15:12:29] joal: https://mentors.debian.net/package/clickhouse [15:12:50] fdans: so you'll need to sudo as hdfs [15:12:55] err cassandra [15:13:19] why sudo elukey? [15:13:24] so you'd need https://github.com/wikimedia/puppet/blob/production/modules/admin/data/data.yaml#L422 [15:13:34] because you are changing keyspaces [15:13:43] IIRC the cassandra user needs to be used [15:13:47] but I might be wrong [15:14:00] right [15:14:40] 06Analytics-Kanban, 10ChangeProp, 06Operations, 10Reading-Web-Trending-Service, 06Services (watching): Upgrade librdkafka 0.9.4 on SCB and Varnishes - https://phabricator.wikimedia.org/T159379#3087960 (10mobrovac) Wednesday 15th? [15:14:44] elukey: I think aqsloader can do [15:14:45] elukey: should I create a phab task to request access then? [15:15:40] joal: sure but fdans can't sudo as aqsloader :) [15:15:53] and IIRC we don't set CREATE perms [15:15:56] for aqsloader [15:15:59] elukey: I'm assuming that clickhouse package in that page is good, but I really have no idea [15:16:15] elukey: hm [15:16:16] it is still not in Debian but in search of a mentor [15:16:24] might ask Moritz to check :) [15:16:25] elukey: riiiiiight :) [15:16:27] makes sense [15:16:29] 06Analytics-Kanban, 10Wikimedia-Stream: EventStreams Blog Post - https://phabricator.wikimedia.org/T160080#3087962 (10Ottomata) [15:16:49] fdans: it depends, not sure if I am ok with you messing on aqs [15:16:49] 06Analytics-Kanban, 10EventBus, 10Wikimedia-Stream, 06Services (watching), 15User-mobrovac: EventStreams - https://phabricator.wikimedia.org/T130651#3087976 (10Ottomata) [15:16:51] elukey, fdans : I'm gonna try to create a new keyspace using aqsloader and see [15:16:52] 06Analytics-Kanban, 10Wikimedia-Stream: EventStreams Blog Post - https://phabricator.wikimedia.org/T160080#3087975 (10Ottomata) [15:16:54] and sudo and stuff [15:17:03] (:D :D :D) [15:17:32] butbut elukey when I break stuff I'm super nice about it! [15:17:39] jokes aside, I don't see anything wrong, we can create a task requesting aqs-admins perms [15:18:38] elukey: I confirm, aqsloader has no create perms [15:18:58] elukey: Asking permission to sudo as cassandra in order to create keyspace [15:20:44] nice, sometimes it works as expected :P [15:22:15] elukey: actually I can't connect as cassandra even when sudoing [15:23:26] elukey: :5 [15:24:43] 10Analytics-Tech-community-metrics, 06Developer-Relations (Jan-Mar-2017): Deployment of Maniphest panel - https://phabricator.wikimedia.org/T138002#3088006 (10Aklapper) @Lcanasdiaz: Did Mukunda's comment help? (Or is there some issue that after a 503 you might need to start from scratch again and hence miss da... [15:25:09] fdans: I was wrong! [15:25:09] sorry [15:25:12] elukey@aqs1004:~$ cqlsh -u cassandra aqs1004-a [15:25:12] Password: [15:25:13] Connected to Analytics Query Service Storage at aqs1004-a:9042. [15:25:13] [cqlsh 5.0.1 | Cassandra 2.2.6 | CQL spec 3.3.1 | Native protocol v4] [15:25:16] Use HELP for help.
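A sketch of the test-keyspace recipe discussed above, run from an aqs host; the superuser login is the one shown just above, the test keyspace name is the one that appears later in the log, and the CREATE statements themselves should be copied from the DESCRIBE output rather than typed by hand:

  cqlsh -u cassandra aqs1004-a        # prompts for the cassandra superuser password
  cqlsh> DESCRIBE "local_group_default_T_pageviews_per_project_v2";
  -- paste back the emitted CREATE KEYSPACE / CREATE TABLE statements,
  -- renaming the keyspace to e.g. "test_fdans_pageviews_per_project_v2",
  -- so loading tests never touch the production keyspace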
[15:25:35] we should only need aqs-users, but it is best if we ask for aqs-admins [15:25:56] actually this is something to fix asap [15:26:11] (changing the pass for cassandra and keeping it somewhere safe) [15:26:28] let's discuss it during standup [15:27:03] mforns will probably need access too since he's working on the legacy pageview that will require him to create new keyspaces and populate them, right? [15:27:16] oh yes please :] [15:27:23] he is in aqs-users, he should be able to loging [15:27:26] *login [15:27:32] as well as nuria [15:28:21] elukey, yes I can ssh into aqs machines, if that is what you say [15:29:07] ah right, sorry [15:29:32] elukey: how should I do to do it for now (and even, can I ?) [15:31:58] 10Analytics-Tech-community-metrics, 06Developer-Relations (Jan-Mar-2017): Deployment of Maniphest panel - https://phabricator.wikimedia.org/T138002#3088023 (10Lcanasdiaz) Data collection is already done, we are polishing up the enrichment process (where raw data is converted to enriched data), so the panel wil... [15:35:24] joal: anybody with ssh access can loging as cassandra (that is bad) [15:35:34] but he/she needs to know the pass.. [15:36:06] are you not able to ssh using cqlsh -u cassandra? (sorry I am not getting the issue :) [15:36:46] 10Analytics-Tech-community-metrics: Add "Ticket Openers" to equivalent of "Activity by contributors" in kibana - https://phabricator.wikimedia.org/T105634#3088033 (10Aklapper) I'm going to merge this into T28 as it already lists "Top task creators in last month". T28 is currently blocked on T138002. [15:36:52] 10Analytics-Tech-community-metrics: Add "Ticket Openers" to equivalent of "Activity by contributors" in kibana - https://phabricator.wikimedia.org/T105634#3088037 (10Aklapper) [15:36:55] 10Analytics-Tech-community-metrics, 10Phabricator, 06Developer-Relations (Jan-Mar-2017): Decide on wanted metrics for Maniphest in kibana - https://phabricator.wikimedia.org/T28#7527 (10Aklapper) [15:36:58] elukey: the thing is: I don't know the pass ;) [15:37:07] which is actually good ! [15:37:51] 10Analytics-Tech-community-metrics: Add remaining KPIs to Overview once available in kibana - https://phabricator.wikimedia.org/T116572#3088045 (10Aklapper) [15:37:58] 10Analytics-Tech-community-metrics, 10Phabricator, 06Developer-Relations (Jan-Mar-2017): Decide on wanted metrics for Maniphest in kibana - https://phabricator.wikimedia.org/T28#11655 (10Aklapper) 05Open>03stalled Waiting for {T138002} here to see what would be the 'default' panel and displayed widgets.... [15:48:36] fdans: Thanks to some elukey magics, I have the needed grail: "test_fdans_pageviews_per_project_v2" (with expected data table) [15:49:47] elukey, ottomata: thoughts on https://phabricator.wikimedia.org/T149225? I think I'll decline this, changing the existing groups would be massive churn for hardly any gain [15:50:59] +1, at the moment we don't have the time in my opinion to migrate all the users to a new scheme.. 
and we have improved a lot the documentation [15:53:08] 06Analytics-Kanban: Create cron job in puppet sqooping prod and labs DBs - https://phabricator.wikimedia.org/T160083#3088109 (10JAllemandou) [15:53:30] ottomata: just created --^, and assigned it to you [15:54:18] 10Analytics-Tech-community-metrics, 06Developer-Relations (Jan-Mar-2017): Go through default Kibana widgets; decide which ones are not relevant for us and remove them - https://phabricator.wikimedia.org/T147001#3088127 (10Aklapper) Asked Bitergia for the naming scheme for custom widgets via email on 20170210.... [15:54:41] moritzm: yeah makes sense, i'll comment and close [15:55:13] k, thanks [15:56:33] elukey, ottomata: also https://phabricator.wikimedia.org/T149222, I don't see the point here either, the private ones are intentionally private after all, while the others are accidential at best [15:56:42] I'd say also close this [15:56:50] joal, elukey: thank you so much!! [15:57:02] (currently sweeping through auth bugs as part of the Q ops goal) [15:58:08] moritzm: +1 [16:00:53] moritzm: i'll explain and close [16:01:08] a-team: standddupppp [16:01:21] joal, ottomata : standdduppp [16:01:21] (03PS1) 10Joal: [WIP] Update static wiki projects list [analytics/refinery] - 10https://gerrit.wikimedia.org/r/342030 [16:02:51] ok, thanks [16:03:13] agh! [16:07:43] 06Analytics-Kanban, 10ChangeProp, 06Operations, 10Reading-Web-Trending-Service, 06Services (watching): Upgrade librdkafka 0.9.4 on SCB and Varnishes - https://phabricator.wikimedia.org/T159379#3088162 (10Ottomata) Ya. @elukey should we do varnishes before or after this? I can add librdkafka 0.9.4 to ou... [16:57:17] joal: another question, so I'm trying to load all granularities with a single job, however it seems I have to specify a constant value for granularity. is there any way of going around this that doesn't involve changing the reducer/creating three jobs? [17:09:58] fdans: can you point me to the job is it in gerrit? [17:15:20] hey fdans, you should review the data formats you're dealing with [17:15:48] fdans: If I don't mistake, in your use-case granularity will be a new field to add to the list of hive fields [17:16:48] 10Analytics-Tech-community-metrics: Updated data in mediawiki-identities DB not deployed onto wikimedia.biterg.io? - https://phabricator.wikimedia.org/T157898#3088359 (10Aklapper) Yay, thanks for working on this! (Note to myself, just adding more test cases: After data is refreshed, Andre to check whether https... [17:17:08] (03PS22) 10Joal: Add mediawiki history spark jobs to refinery-job [analytics/refinery/source] - 10https://gerrit.wikimedia.org/r/325312 (https://phabricator.wikimedia.org/T141548) [17:19:52] fdans: do you want us to spend some time on this now? [17:35:30] joal: sorry, had to pop out for a second [17:35:36] yes, batcave? [17:36:02] fdans: yes ! [18:02:45] 06Analytics-Kanban, 10EventBus, 10Wikimedia-Stream, 06Services (watching), 15User-mobrovac: EventStreams - https://phabricator.wikimedia.org/T130651#3088529 (10Jdlrobson) [18:02:48] 10Analytics, 10Wikimedia-Stream: old revision length missing in /v2/stream/recentchange - https://phabricator.wikimedia.org/T160030#3088527 (10Jdlrobson) 05Open>03Resolved Confirmed. I think I was looking at some non-edit events. Thanks for the follow up. [18:09:20] * elukey afk! 
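Following joal's suggestion of making granularity one more field in the data rather than a per-job constant, a purely illustrative sketch (the table, columns and output path below are made up, not the real refinery schema):

  # emit granularity as a column so a single loader job can cover daily/monthly/etc.
  hive -e "INSERT OVERWRITE DIRECTORY '/tmp/fdans/pageviews_per_project_test'
           SELECT project, granularity, dt, view_count
           FROM fdans.pageviews_per_project_all_granularities;"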
[18:59:07] (03PS4) 10Joal: Add oozie jobs for mw history denormalized [analytics/refinery] - 10https://gerrit.wikimedia.org/r/341030 [19:09:39] * halfak looks around for ottomata [19:10:05] I'm around for the analytics/devops/research checkin [19:13:32] halfak: Hey ! [19:13:40] o/ joal [19:13:42] Sorry missed the thing - anyone with you? [19:14:08] halfak: ---^ [19:14:34] Nope. All by myself [19:14:48] I don't actually have anything pressing, but thought I should make myself available :D [19:14:49] halfak: seems like it's a 'cancel' signal [19:14:53] yup [19:14:54] :D [19:15:04] ok - sorry for not showing up in time [19:15:09] no worries. :) [19:16:22] Have a good end of day, halfak, see you later ! [19:17:21] Gone for today a-team ! [19:17:48] nite jo [19:18:28] fdans: your job is still running :) [19:29:06] 10Analytics, 10EventBus, 10Reading-Web-Trending-Service, 13Patch-For-Review, and 4 others: Compute the trending articles over a period of 24h rather than 1h - https://phabricator.wikimedia.org/T156411#3088833 (10Jdlrobson) We may need to revisit the config values for max_pages The fact I'm seeing older art... [19:57:58] 10Analytics, 10EventBus, 10Reading-Web-Trending-Service, 13Patch-For-Review, and 4 others: Compute the trending articles over a period of 24h rather than 1h - https://phabricator.wikimedia.org/T156411#3088917 (10mobrovac) >>! In T156411#3088833, @Jdlrobson wrote: > We may need to revisit the config values... [20:04:49] 10Analytics, 10EventBus, 10Reading-Web-Trending-Service, 13Patch-For-Review, and 4 others: Compute the trending articles over a period of 24h rather than 1h - https://phabricator.wikimedia.org/T156411#3088961 (10Pchelolo) @Jdlrobson What you could do for quick testing of ideas and outputs is the following:... [20:19:42] 10Analytics, 10EventBus, 10Reading-Web-Trending-Service, 13Patch-For-Review, and 4 others: Compute the trending articles over a period of 24h rather than 1h - https://phabricator.wikimedia.org/T156411#3089015 (10Jdlrobson) Right now the fact is that older articles are not showing up in results. Theres onl... [20:30:10] 10Analytics, 10EventBus, 10Reading-Web-Trending-Service, 13Patch-For-Review, and 4 others: Compute the trending articles over a period of 24h rather than 1h - https://phabricator.wikimedia.org/T156411#3089068 (10mobrovac) There are 10 edits/second across all wikis. If we assume we are tracking all of them... [20:45:20] ottomata: I added you to CR for UA metrics, we think we are ready to merge that cc Krinkle [20:45:40] k [20:45:42] ottomata: https://gerrit.wikimedia.org/r/#/c/341723/ also :) [20:45:52] re: webperf, I'm okay with landing. [20:46:54] Actually, if you could wait 30min or so to give the train a bit more time to roll out and for us to catch regressions in dashbaords as new data comes in. Its' a hot time of the week for us. [20:46:57] just went to all wikis [20:47:19] Krinkle: wouldn't it be better to do the scap deploy first? [20:47:25] then you'll have your own eventlogging/webperf [20:47:29] and can just set PYTHONPATH=... [20:47:55] i guess it doesn't really matter [20:48:00] oh, i guess you haven't deploy the trebuchet one in a long time [20:48:08] so maybe better to stick with what you have, rather than change code [20:48:09] ok ok ok [20:48:10] :) [20:48:11] I'd rather do this step first :) [20:48:26] But.. 
wait 30min with merging (unless you can make sure it won't hit hafnium until 30min) [20:48:34] naw, i'll wait [20:48:41] would also like feedback on https://gerrit.wikimedia.org/r/#/c/341724/3 [20:48:58] actually, i think i'm going to get kicked out of this cafe, and i am going to peace out for the day in about an hour, so maybe we can merge tomorrow? [20:49:11] oo reading... [20:50:39] yeah, no rush [20:52:07] ottomata: ya, totally, no rush [21:50:41] 10Analytics, 10EventBus, 10Reading-Web-Trending-Service, 13Patch-For-Review, and 4 others: Compute the trending articles over a period of 24h rather than 1h - https://phabricator.wikimedia.org/T156411#3089513 (10Jdlrobson) @mobrovac that sounds like a good starting point and wouldn't hurt. I'll let you kno... [21:55:05] 10Analytics, 10EventBus, 10Reading-Web-Trending-Service, 13Patch-For-Review, and 4 others: Compute the trending articles over a period of 24h rather than 1h - https://phabricator.wikimedia.org/T156411#3089544 (10Jdlrobson) Thanks for the discussion - I've captured all this in a spike - T160127. [22:04:29] 10Analytics, 10EventBus, 10Reading-Web-Trending-Service, 13Patch-For-Review, and 4 others: Compute the trending articles over a period of 24h rather than 1h - https://phabricator.wikimedia.org/T156411#3089588 (10mobrovac) I have bumped `max_pages`, let's see if that helps. The memory has increased but the... [22:19:23] 10Analytics, 10EventBus, 10Reading-Web-Trending-Service, 13Patch-For-Review, and 4 others: Compute the trending articles over a period of 24h rather than 1h - https://phabricator.wikimedia.org/T156411#3089628 (10Jdlrobson) <3 [23:29:20] 10Analytics-EventLogging, 06Analytics-Kanban, 13Patch-For-Review: Change userAgent field to user_agent_map in EventCapsule - https://phabricator.wikimedia.org/T153207#3089847 (10Nuria) @Tbayer and @Nemo_bis FYI that we will be deploying this next week, after our work with @Krinkle we have also included brows...