[02:06:30] 10Analytics, 10Wikimedia-Stream: old revision length missing in /v2/stream/recentchange - https://phabricator.wikimedia.org/T160030#3086638 (10Ottomata) [02:18:57] 10Analytics, 10Wikimedia-Stream: old revision length missing in /v2/stream/recentchange - https://phabricator.wikimedia.org/T160030#3086638 (10Pchelolo) Em. That's weird. When I go to https://stream.wikimedia.org/v2/stream/recentchange I see proper value in "old" property. I only see `"old":null` for pages tha... [04:30:11] 10Analytics, 10Wikimedia-Stream: old revision length missing in /v2/stream/recentchange - https://phabricator.wikimedia.org/T160030#3086735 (10Ottomata) Ah! haha, no, I had an old grep in my CLI history that was grep -v edit (negation)! DOH! @Jdlrobson it looks fine to me, can you confirm you don't see them... [07:01:28] hi a-team :] [07:22:50] 10Analytics-Tech-community-metrics: Updated data in mediawiki-identities DB not deployed onto wikimedia.biterg.io? - https://phabricator.wikimedia.org/T157898#3086846 (10Lcanasdiaz) Data about Git and Gerrit is being refreshed this week. On Monday we found a bug in Gerrit which is blocking us, as soon as it is f... [08:17:59] o/ [08:18:59] hey o/ [08:19:41] checking the oozie jobs [08:21:15] mforns: we have been experiencing issues with load balancing during the past hours, and tracing the load-text jobs that failed I can see: [08:21:18] 0012364-170228165458841-oozie-oozi-W@send_email_with_attachments ERROR - ERROR EM007 [08:22:29] that was triggered by check_sequence_statistics [08:22:32] Actions [08:22:32] ------------------------------------------------------------------------------------------------------------------------------------ [08:22:35] ID Status Ext ID Ext Status Err Code [08:22:38] ------------------------------------------------------------------------------------------------------------------------------------ [08:22:41] 0012363-170228165458841-oozie-oozi-W@:start: OK - OK - [08:22:44] ------------------------------------------------------------------------------------------------------------------------------------ [08:22:47] 0012363-170228165458841-oozie-oozi-W@extract_error_data_loss OK job_1488294419903_29358SUCCEEDED - [08:22:50] ------------------------------------------------------------------------------------------------------------------------------------ [08:22:53] 0012363-170228165458841-oozie-oozi-W@check_error_data_loss OK - [08:22:56] mmmm [08:26:33] didn't paste it all [08:26:47] the last bit is [08:26:49] ------------------------------------------------------------------------------------------------------------------------------------ [08:26:52] 0012363-170228165458841-oozie-oozi-W@send_error_data_loss_email ERROR 0012364-170228165458841-oozie-oozi-WKILLED - [08:26:55] ------------------------------------------------------------------------------------------------------------------------------------ [08:26:58] 0012363-170228165458841-oozie-oozi-W@kill OK - OK E0729 [08:27:01] ------------------------------------------------------------------------------------------------------------------------------------ [08:28:06] !log re-run 186-09 Mar 2017 00:00:00 (webrequest-load-maps) on Hie [08:28:07] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [08:28:10] Hie [08:28:11] ahhaah [08:28:20] ok let's see it the maps job works [08:30:30] ahhh EM007 seems to be an error trying to contact the SMTP server [08:30:38] that might be related to our issue [08:30:54] so we didn't get the proper emails 
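A minimal sketch of the grep mix-up described above, assuming recentchange events carry a "type" field and that only edit events populate the "old" revision length (the stream URL is from the task; the filters are illustrative):

  # the old negated grep drops exactly the events that carry the old length
  curl -s https://stream.wikimedia.org/v2/stream/recentchange | grep -v edit
  # keeping only edit events shows the "old" property populated as expected
  curl -s https://stream.wikimedia.org/v2/stream/recentchange | grep '"type":"edit"'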
[08:31:21] weeeeird [08:33:30] org.apache.oozie.command.CommandException: E0800: Action it is not running its in [OK] state, action [0012281-170228165458841-oozie-oozi-W@mark_add_partition_done] [08:37:54] the maps job that failed in a similar weird way as depicted above just completed successfully [08:39:10] !log re-run all the failed misc webrequest-load oozie jobs (total of four) [08:39:11] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [08:39:17] in the meantime, I'll check the text ones [08:43:27] Example of failed oozie ID if anybody wants to check - 0012010-170228165458841-oozie-oozi-W (upload), 0012327-170228165458841-oozie-oozi-W and 0012385-170228165458841-oozie-oozi-W (text) [08:43:43] !log re-run via Hue the failed upload-load job [08:43:45] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [08:52:19] elukey, reading now your mesage [08:53:12] so it seems that all these jobs failed in check_sequence_statistics [08:53:17] but can't really find why [08:53:23] maybe we can check on hdfs [08:53:24] ? [08:53:37] reading :] [08:54:51] elukey, ok, now thinking :] [08:56:09] does it make sense what I am saying? Also weird that re-runing jobs fixes it [08:56:18] I would have expected an error email [08:57:07] !log re-running webrequest-load-text failed jobs too via Hue [08:57:09] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [08:59:51] the oozie emails seems to indicated "Failing action: send_warning_data_loss_email" too [08:59:54] but that's weird [09:00:33] elukey, yes weird [09:00:45] I'm going to check the integrity of the stats table [09:01:07] misc and maps are ok now [09:02:48] I can also see [09:02:49] 2017-03-09 08:43:09,160 ERROR CompletedActionXCommand:517 - SERVER[analytics1003.eqiad.wmnet] USER[-] GROUP[-] TOKEN[] APP[-] JOB[0012588-170228165458841-oozie-oozi-W] ACTION[0012588-170228165458841-oozie-oozi-W@mark_add_partition_done] XException, [09:02:54] org.apache.oozie.command.CommandException: E0800: Action it is not running its in [OK] state, action [0012588-170228165458841-oozie-oozi-W@mark_add_partition_done] [09:02:57] that doesn't block the job though [09:03:01] (might be not relevant) [09:03:05] aha [09:03:59] this is oozie job -log 0012588-170228165458841-oozie-oozi-W [09:04:02] from stat1004 [09:04:11] (current upload load job that I've restarted) [09:04:38] elukey, from yesterday no? [09:05:36] nono today, I just restarted the job [09:05:41] and it keeps going [09:05:51] so it might not be related [09:05:53] elukey, yes I mean for march 8th 19h [09:06:18] the only upload error was for yesterday 19h right? [09:06:22] ah yes! [09:06:25] ok ok [09:06:36] elukey, question: [09:07:04] the misc and maps errors, you re-run them and they failed, and then you re-run them again, and they went OK? [09:07:19] nope, I just re-ran them once [09:07:27] and they complete successfully [09:07:29] ok ok [09:07:53] but there were some re-run failures no? [09:08:17] where do you see them? [09:08:31] anyhow, even 0012588-170228165458841-oozie-oozi-W (new upload load job) is doing refine atm, so all good [09:09:15] elukey, you logged failed reruns: " !log re-running webrequest-load-text failed jobs too via Hue" [09:10:03] nono what I meant in there was hitting rerun on hue [09:10:18] ah! 
ok misunderstood, my bad :P [09:10:29] cool [09:10:32] nono note taken, I should have been more clear :) [09:10:35] thanks for pointing out [09:10:40] let me check the stats table [09:11:11] elukey, actually your sentence is perfectly correct [09:17:12] Hi guys ! Thanks for taking care of all that ! [09:17:19] Is there anything I can help with? [09:20:37] joal: o/ - yes! We have no idea what failed :9 [09:20:39] :( [09:20:51] Arf [09:21:20] it seemed to me an issue sending the check_sequence_statistics error email [09:21:24] via SMTP [09:21:31] elukey: From what I understood, jobs were in error, and rerunning them fixed them? [09:21:34] but then re-running the jobs fixes the problem [09:21:38] yeah [09:21:44] I wouldn't have expected it [09:21:46] weird [09:21:54] elukey, hi joal! it's weird, the error file (and warning file) are empty and also are named differently [09:21:54] and all of them, afaics, failed in check_sequence_statistics [09:22:03] not 00000_0, but 000000_1000 [09:22:24] empty suggests there was no error, but still the error message was triggered [09:22:31] and the file name is weird [09:23:18] hm [09:23:43] as far as I can see, webrequest-load-check_sequence_statistics-wf-text-2017-3-9-2 failed the re-run [09:24:04] from oozie info check_sequence_statistics [09:24:18] to be clear - we currently are experiencing some issues in eqiad with load balancing [09:24:30] but not sure how it could affect us [09:24:36] maybe errors were triggered because of internal hadoop errors, leading to some containers disconnected from their master (still writing results, but job failed) [09:25:16] the failed job in hue does not have any actions or logs [09:26:52] yeah you need to check it via oozie CLI [09:26:58] k [09:27:05] because of a weird "feature" of the new Hue [09:27:06] sigh [09:27:20] I am using oozie job -info JOBID [09:27:24] and -log [09:28:55] ok, doing it now [09:29:28] it looks like it's failing when marking the dataset done... [09:29:45] 2017-03-09 09:05:55,997 ERROR CompletedActionXCommand:517 - SERVER[analytics1003.eqiad.wmnet] USER[-] GROUP[-] TOKEN[] APP[-] JOB[0012588-170228165458841-oozie-oozi-W] ACTION[0012588-170228165458841-oozie-oozi-W@mark_raw_dataset_done] XException [09:30:07] mforns: about the names, it's normal that errors were triggered [09:30:56] mforns: in workflow.xml, check is made on 000000_0 [09:31:04] 0012614-170228165458841-oozie-oozi-W (the other webrequest-text-load job that I re-ran) is in refine now [09:31:05] joal, I know [09:31:31] mforns: But, if an error occurs during the check (attempt fail or something), then the name will be different [09:31:51] mmm [09:36:55] elukey, the other text job has the same error, it will probably fail [09:39:43] elukey: it looks as if the problem was impacting hadoop itself ... Is that possible? [09:42:30] joal: it might be, but we should double check timelines.. [09:42:58] k [09:43:09] first error started at ~3AM for me (CEST), before the impact [09:43:13] (on lbs) [09:43:53] mforns: one of the text jobs is in refine (and passed the check_seq_statistics), the other one failed.. weird [09:44:06] O.o [09:52:16] !log restarted Mar 2017 02:00:00 webrequest-load-text (second time) [09:52:17] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [09:52:59] elukey, it succeeded? [09:53:14] mforns: no that one died [09:53:22] I restarted it [09:53:49] elukey, but the 1h one is green now [09:53:58] yeah [09:54:06] what happened?
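A sketch of the oozie CLI inspection described above, run from stat1004 as in the log (the job ID is one of the failed text workflows mentioned earlier; OOZIE_URL is assumed to be preconfigured on the host):

  # list the workflow's actions with their status and error codes
  oozie job -info 0012385-170228165458841-oozie-oozi-W
  # fetch the workflow's oozie log, e.g. to spot E0800 / EM007 entries
  oozie job -log 0012385-170228165458841-oozie-oozi-W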
[09:54:19] mforns: I am not getting the point sorry [09:54:33] That's really weird ... Could be related to cluster being overwhelmed [09:54:49] two text jobs failed, I restarted both the first time and one succeeded and the other one didn't.. then I restarted the one that failed two times again [09:55:10] there were 2 hours missing for text, hour=1 and hour=2, you re-ran both and both failed, now hour=1 is success, do you know what made it succedd? [09:55:21] *succeed [09:55:30] nono as said above, I re-ran both and only one failed [09:55:34] hour=2 [09:55:35] oh [09:56:49] Cluster was not overwhelmed - looks really related to something else [09:57:51] elukey: from resource manager metrics, looks like an1041 is different than others [09:57:56] elukey: could be related? [10:01:58] not sure, both 1041 and 1040 have different behaviors in datanode/nodemanager, but let's keep in mind that 1) 1040's hdfs datanode parts have been wiped 2) 1041/1040 are running with a 4.4 kernel, meanwhile on trusty we are at 3.X [10:02:09] so there are a *lot* of differences in handling memory [10:02:25] elukey: ok [10:02:44] BUT [10:02:58] it could be interesting to see if those failures were coming from jobs running on these nodes [10:03:09] is it possible from oozie/mapred logs? [10:03:20] that could be an explanation [10:03:45] elukey: looks like 41 has a different nodemanager memory pattern than 40, which is possibly not expected [10:04:22] same number of containers running? [10:05:03] elukey: always 0 [10:05:14] Which explains memory, but is that expected? [10:06:32] ah! not at all [10:06:34] let me check [10:07:23] elukey: something else to check: overall node manager heap used over last 24 hours has really changed [10:09:19] !log restarted yarn-nodemanager on analytics1040 [10:09:20] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [10:09:27] nothing weird from the logs afaics [10:10:35] hm [10:10:54] overall node manager heap used over last 24 hours has really changed ==> I am not seeing it [10:11:12] elukey: batcave for a minute? [10:11:15] https://grafana.wikimedia.org/dashboard/db/analytics-hadoop?panelId=17&fullscreen&from=now-2d&to=now ? [10:11:57] joal: I am also checking the ops issue atm, a bit busy but maybe in 5/10 mins? [10:11:58] elukey: My bad, not nodemanager, resource manager ! [10:12:11] ahhh [10:12:16] elukey: probably not needed actually [10:13:56] this is on an1041 [10:14:01] 2017-03-09 10:13:31,411 WARN org.apache.hadoop.fs.LocalDirAllocator$AllocatorPerContext: Disk Error Exception: [10:14:04] org.apache.hadoop.util.DiskChecker$DiskErrorException: Directory is not readable: /var/lib/hadoop/data/j/yarn/local/nmPrivate [10:14:07] gneeeee [10:14:21] oh [10:14:21] riiiight [10:14:55] elukey@analytics1041:/var/log/hadoop-yarn$ sudo ls -l /var/lib/hadoop/data/j/yarn/local/nmPrivate [10:14:58] total 12 [10:15:01] drwxrwxr-x 2 ntp ssl-cert 4096 Mar 8 09:48 application_1488294419903_27409 [10:15:04] drwxrwxr-x 3 ntp ssl-cert 4096 Mar 8 09:55 application_1488294419903_27435 [10:15:07] drwxrwxr-x 5 ntp ssl-cert 4096 Mar 8 09:56 application_1488294419903_27436 [10:15:10] WHAAAAT? [10:15:16] this is probably my fault [10:15:29] I fixed /var/lib/hadoop/data/j/hdfs yesterday [10:15:45] ah snap [10:15:48] elukey: no big deal, a couple of jobs failed, and problem seems identified [10:16:02] mforns: Thanks for helping restarting the things !
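The ls output above shows the yarn local dirs still owned by the pre-reimage uid/gid (rendered as ntp:ssl-cert), which is why the nodemanager cannot read nmPrivate. A sketch of the cleanup that follows below; the glob, find flags and service name are illustrative, not the exact commands used:

  # on analytics1041: stop the nodemanager before touching its local dirs
  sudo service hadoop-yarn-nodemanager stop
  # list anything not owned by yarn:yarn, then re-own the yarn tree on every data disk
  sudo find /var/lib/hadoop/data/*/yarn \( ! -user yarn -o ! -group yarn \) -ls
  sudo chown -R yarn:yarn /var/lib/hadoop/data/*/yarn
  sudo service hadoop-yarn-nodemanager start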
[10:16:59] np, actually it was Luca, thanks guys for fixing this, has been interesing to follow :] [10:18:13] just to double check I am running root@analytics1042:/var/lib/hadoop/data/b# find ! -user yarn -group yarn [10:18:49] so the current theory is that a single new hosts, 1041, failed to run containers due to fs issues [10:20:33] ok stopped nodemanager on 1041, fixing perms [10:20:44] guys, I have to leave now, will be back 1h30m before standup, cheers [10:22:51] bye mforns [10:23:01] Thanks elukey for fixing :) [10:25:27] really sorry for the trouble, I haven't checked the /var/lib/hadoop/data/X/yarn dirs.. [10:26:10] and it makes sense: since I reinstalled the OS, new gid/uid for users have been created. So after mounting the partitions, the inodes were still pointing to old gid/uid [10:26:20] I fixed the hdfs dir [10:26:23] and not yarn [10:26:29] I thought it was empty or something [10:27:42] elukey: no big deal - It's good learning for global future reimaging ! [10:27:50] Thanks for doing that :) [10:28:05] elukey: I love the idea we'll be full-debian soon :D [10:29:08] me too! [10:31:53] !log analytics1041 yarn nodemanager stopped, chowning to yarn:yarn all the perms in /var/lib/hadoop/data/X/yarn dirs [10:31:55] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [10:40:58] ok going to write all the steps (including verification of permissions) in the Debian migration guide [10:41:11] !Log Restart webrequest load text 2017-03-09T02:00Z [10:56:22] still fixing perms on 1041 [10:56:26] ahahahha [10:56:30] looong command [11:04:25] !log an1041 yarn nodemanager back running [11:04:26] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [11:05:28] oh yess we have now containers on 1041 [11:05:30] \o/ [11:21:57] going to step away for ~30 mins! [11:50:54] joal: bonjour! one tiny question and I think I'm go for testing the job [11:52:41] right now my data is stored in hdfs, but it's compressed in gz, does it need to be decompressed before the job starts? [11:53:21] joal: also, I'm guessing mapreduce.input.fileinputformat.inputdir expects a regular expression path that matches all the files I want to load? [13:17:49] * fdans goes for some lunchy lunch [13:56:47] joal: http://events.linuxfoundation.org/events/apache-big-data-north-america/program/schedule [14:04:17] fdans: sorry ! Completely missed your ping !! [14:05:12] fdans: it's actually very easy easy with mapreduce: it decompresses by itself regular formats (gz, snappy, bz2), and reads every file present in a folder [14:06:42] elukey: mwarf, will not go there :( [14:08:25] http://events.linuxfoundation.org/events/apachecon-north-america/program/schedule - https://apachecon2017.sched.com/event/9zvG/cassandra-serving-netflix-scale-vinay-chella-netflix?iframe=no&w=100%&sidebar=yes&bg=no [14:09:04] https://apachecon2017.sched.com/event/9zp7/10-things-to-consider-when-using-apache-kafka-utilization-points-of-apache-kafka-obtained-from-iot-use-case-naoto-umemori-yuji-hagiwara-ntt-data-corporation?iframe=no&w=100%&sidebar=yes&bg=no [14:17:11] !log restart failed webrequest load [upload maps misc] 2017-03-09T09:00Z [14:17:12] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [14:18:09] oh joal, so I don't need to do anything to those files! cool, I'm going to give this a try, unless you want to take a look at it first [14:18:43] fdans: just a quick double check - Which keyspace are you ganna use? 
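As joal explains above, the standard MapReduce input formats decompress gz/snappy/bz2 transparently and read every file under the input directory, so nothing has to be unpacked first. A hedged sketch (jar, driver class and input path are hypothetical, and -D is only honoured if the driver goes through ToolRunner):

  # point the job at the directory holding the .gz files; no manual decompression needed
  hadoop jar my-loader.jar org.example.LoaderDriver \
    -D mapreduce.input.fileinputformat.inputdir=hdfs:///user/fdans/pageviews/2017/03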
[14:19:14] local_group_default_T_pageviews_per_project_v2 [14:19:22] joal ^ [14:19:57] fdans: I suggested to create a new one dedicated for those tests [14:20:32] fdans: if anything goes wrong with the loading, or anything else, this keyspace is the real prod one [14:20:51] you're totally right joal [14:23:29] fdans: easiest way is, in CQLSH, do a: describe "local_group_default_T_pageviews_per_project_v2"; [14:24:01] It'll give you the CREATE commands that needs to be run for creating a fake keyspace (changing the keyspace name manually, obviously) [14:24:09] yeah that's what I've been doing :) [14:25:57] joal however, I can't find in the docs the way of starting cqlsh in stat1004, is there any magic involved? [14:27:09] fdans: always :) [14:27:53] fdans: https://wikitech.wikimedia.org/wiki/Analytics/AQS#Cassandra_CLI [14:28:47] fdans: meaning: cqlsh -h aqs1004-a.eqiad.wmnet -u aqsloader -p [14:29:25] joal: what I mean is that cqlsh is not found as a command in stat1004 [14:29:48] fdans: on aqs1004 :) [14:29:57] ohhhhhh I seeeee [14:29:58] silly me [14:30:03] thank you joal [14:31:09] fdans, do you have 5 minutes to help me executing aqs test with local cassandra? [14:31:18] :] [14:31:23] mforns: hell yeah! [14:31:29] a la baticueva! [14:31:34] :D baticueva it is [14:48:41] I automated https://wikitech.wikimedia.org/wiki/Analytics/Cluster/Hadoop/Administration#Worker_Reimage_.2812_disk.2C_2_flex_bay_drives_-_analytics1028-analytics1057.29 [14:49:04] so now it is all written down and a worker reimage should take ~1 hour [14:49:15] and it shouldn't fail like this morning :) [14:49:20] joal,mforns --^ [14:49:40] cool elukey [14:49:52] 06Analytics-Kanban: Extract edit history denormalized data from intermediate data - https://phabricator.wikimedia.org/T144717#3087817 (10JAllemandou) [14:49:55] 06Analytics-Kanban: Productionize Edit History Reconstruction and Extraction - https://phabricator.wikimedia.org/T152035#3087816 (10JAllemandou) [14:50:09] today I leaned a trick with awk [14:50:12] awk -F 'sd|1' '{print $2}') [14:50:16] elukey: Thanks :) [14:50:24] sorry the last ) is not needed [14:50:35] elukey: What does it do? [14:50:38] say you have /dev/sdb1 and you need only "b" [14:50:39] I don't know it ! [14:50:54] Ohhh ! Get it :) [14:50:57] Awesome [14:51:20] elukey@analytics1041:~$ echo "/dev/sdb1" | awk -F 'sd|1' '{print $2}' [14:51:23] b [14:51:29] awk sometimes is so awesome [14:51:50] 06Analytics-Kanban: Create oozie job for mediawiki edit history job - https://phabricator.wikimedia.org/T160074#3087833 (10JAllemandou) [14:53:32] nice elukey! [14:55:03] 06Analytics-Kanban, 10ChangeProp, 06Operations, 10Reading-Web-Trending-Service, 06Services (watching): Upgrade librdkafka 0.9.4 on SCB and Varnishes - https://phabricator.wikimedia.org/T159379#3087865 (10Ottomata) I think we are ready to proceed with this when yall are. Should we schedule a day next wee... [14:56:22] elukey: q, do I have permission to access aqs1004? [15:01:15] fdans: probably not, have you tried to ssh ? [15:01:23] ottomata: \o/ [15:01:46] yeah, it's asking me for a password [15:01:48] https://www.irccloud.com/pastebin/1bQRQ7Ma/ [15:02:15] fdans: I can open a phab task and request access, but it will probably take a ops meeting [15:02:22] and next week is us holiday [15:02:26] so it might take a while [15:03:44] ah no maybe there is a quicker way [15:03:46] aqs-users [15:03:50] it does not have sudo [15:04:03] fdans: what do you need to do on aqs1004? 
D: [15:04:04] :D [15:09:27] 10Analytics-Cluster, 06Analytics-Kanban, 06Operations, 13Patch-For-Review, 15User-Elukey: Reimage a Trusty Hadoop worker to Debian jessie - https://phabricator.wikimedia.org/T159530#3070663 (10elukey) Just completed https://wikitech.wikimedia.org/wiki/Analytics/Cluster/Hadoop/Administration#Worker_Nodes_... [15:12:11] elukey: I need to create a test keyspace now and later the final keyspace for the corrected data [15:12:29] joal: https://mentors.debian.net/package/clickhouse [15:12:50] fdans: so you'll need to sudo as hdfs [15:12:55] err cassandra [15:13:19] why sudo elukey? [15:13:24] so you'd need https://github.com/wikimedia/puppet/blob/production/modules/admin/data/data.yaml#L422 [15:13:34] because you are changing keyspaces [15:13:43] IIRC the cassandra user needs to be used [15:13:47] but I might be wrong [15:14:00] right [15:14:40] 06Analytics-Kanban, 10ChangeProp, 06Operations, 10Reading-Web-Trending-Service, 06Services (watching): Upgrade librdkafka 0.9.4 on SCB and Varnishes - https://phabricator.wikimedia.org/T159379#3087960 (10mobrovac) Wednesday 15th? [15:14:44] elukey: I think aqsloader can do [15:14:45] elukey: should I create a phab task to request access then? [15:15:40] joal: sure but fdans can't sudo as aqsloader :) [15:15:53] and IIRC we don't set CREATE perms [15:15:56] for aqsloader [15:15:59] elukey: I'm assuming that clickhouse package in that page is good, but I really have no idea [15:16:15] elukey: hm [15:16:16] it is still not in Debian but in search of a mentor [15:16:24] might ask Moritz to check :) [15:16:25] elukey: riiiiiight :) [15:16:27] makes sense [15:16:29] 06Analytics-Kanban, 10Wikimedia-Stream: EventStreams Blog Post - https://phabricator.wikimedia.org/T160080#3087962 (10Ottomata) [15:16:49] fdans: it depends, not sure if I am ok with you messing on aqs [15:16:49] 06Analytics-Kanban, 10EventBus, 10Wikimedia-Stream, 06Services (watching), 15User-mobrovac: EventStreams - https://phabricator.wikimedia.org/T130651#3087976 (10Ottomata) [15:16:51] elukey, fdans : I'm gonna try to create a new keyspace using aqsloader and see [15:16:52] 06Analytics-Kanban, 10Wikimedia-Stream: EventStreams Blog Post - https://phabricator.wikimedia.org/T160080#3087975 (10Ottomata) [15:16:54] and sudo and stuff [15:17:03] (:D :D :D) [15:17:32] butbut elukey when I break stuff I'm super nice about it! [15:17:39] jokes aside, I don't see anything wrong, we can create a task requesting aqs-admins perms [15:18:38] elukey: I confirm, aqsloader has no create perms [15:18:58] elukey: Asking permission to sudo as cassandra in order to create keyspace [15:20:44] nice, sometimes it works as expected :P [15:22:15] elukey: actually I can't connect as cassandra even when sudoing [15:23:26] elukey: :5 [15:24:43] 10Analytics-Tech-community-metrics, 06Developer-Relations (Jan-Mar-2017): Deployment of Maniphest panel - https://phabricator.wikimedia.org/T138002#3088006 (10Aklapper) @Lcanasdiaz: Did Mukunda's comment help? (Or is there some issue that after a 503 you might need to start from scratch again and hence miss da... [15:25:09] fdans: I was wrong! [15:25:09] sorry [15:25:12] elukey@aqs1004:~$ cqlsh -u cassandra aqs1004-a [15:25:12] Password: [15:25:13] Connected to Analytics Query Service Storage at aqs1004-a:9042. [15:25:13] [cqlsh 5.0.1 | Cassandra 2.2.6 | CQL spec 3.3.1 | Native protocol v4] [15:25:16] Use HELP for help.
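A sketch of the test-keyspace recipe discussed above, run from an aqs host; the superuser login is the one shown just above, the test keyspace name is the one that appears later in the log, and the CREATE statements themselves should be copied from the DESCRIBE output rather than typed by hand:

  cqlsh -u cassandra aqs1004-a        # prompts for the cassandra superuser password
  cqlsh> DESCRIBE "local_group_default_T_pageviews_per_project_v2";
  -- paste back the emitted CREATE KEYSPACE / CREATE TABLE statements,
  -- renaming the keyspace to e.g. "test_fdans_pageviews_per_project_v2",
  -- so loading tests never touch the production keyspace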
[15:25:35] we should only need aqs-users, but it is best if we ask for aqs-admins [15:25:56] actually this is something to fix asap [15:26:11] (changing the pass for cassandra and keeping it somewhere safe) [15:26:28] let's discuss it during standup [15:27:03] mforns will probably need access too since he's working on the legacy pageview that will require him to create new keyspaces and populate them, right? [15:27:16] oh yes please :] [15:27:23] he is in aqs-users, he should be able to loging [15:27:26] *login [15:27:32] as well as nuria [15:28:21] elukey, yes I can ssh into aqs machines, if that is what you say [15:29:07] ah right, sorry [15:29:32] elukey: how should I do to do it for now (and even, can I ?) [15:31:58] 10Analytics-Tech-community-metrics, 06Developer-Relations (Jan-Mar-2017): Deployment of Maniphest panel - https://phabricator.wikimedia.org/T138002#3088023 (10Lcanasdiaz) Data collection is already done, we are polishing up the enrichment process (where raw data is converted to enriched data), so the panel wil... [15:35:24] joal: anybody with ssh access can loging as cassandra (that is bad) [15:35:34] but he/she needs to know the pass.. [15:36:06] are you not able to ssh using cqlsh -u cassandra? (sorry I am not getting the issue :) [15:36:46] 10Analytics-Tech-community-metrics: Add "Ticket Openers" to equivalent of "Activity by contributors" in kibana - https://phabricator.wikimedia.org/T105634#3088033 (10Aklapper) I'm going to merge this into T28 as it already lists "Top task creators in last month". T28 is currently blocked on T138002. [15:36:52] 10Analytics-Tech-community-metrics: Add "Ticket Openers" to equivalent of "Activity by contributors" in kibana - https://phabricator.wikimedia.org/T105634#3088037 (10Aklapper) [15:36:55] 10Analytics-Tech-community-metrics, 10Phabricator, 06Developer-Relations (Jan-Mar-2017): Decide on wanted metrics for Maniphest in kibana - https://phabricator.wikimedia.org/T28#7527 (10Aklapper) [15:36:58] elukey: the thing is: I don't know the pass ;) [15:37:07] which is actually good ! [15:37:51] 10Analytics-Tech-community-metrics: Add remaining KPIs to Overview once available in kibana - https://phabricator.wikimedia.org/T116572#3088045 (10Aklapper) [15:37:58] 10Analytics-Tech-community-metrics, 10Phabricator, 06Developer-Relations (Jan-Mar-2017): Decide on wanted metrics for Maniphest in kibana - https://phabricator.wikimedia.org/T28#11655 (10Aklapper) 05Open>03stalled Waiting for {T138002} here to see what would be the 'default' panel and displayed widgets.... [15:48:36] fdans: Thanks to some elukey magics, I have the needed grail: "test_fdans_pageviews_per_project_v2" (with expected data table) [15:49:47] elukey, ottomata: thoughts on https://phabricator.wikimedia.org/T149225? I think I'll decline this, changing the existing groups would be massive churn for hardly any gain [15:50:59] +1, at the moment we don't have the time in my opinion to migrate all the users to a new scheme.. 
and we have improved a lot the documentation [15:53:08] 06Analytics-Kanban: Create cron job in puppet sqooping prod and labs DBs - https://phabricator.wikimedia.org/T160083#3088109 (10JAllemandou) [15:53:30] ottomata: just created --^, and assigned it to you [15:54:18] 10Analytics-Tech-community-metrics, 06Developer-Relations (Jan-Mar-2017): Go through default Kibana widgets; decide which ones are not relevant for us and remove them - https://phabricator.wikimedia.org/T147001#3088127 (10Aklapper) Asked Bitergia for the naming scheme for custom widgets via email on 20170210.... [15:54:41] moritzm: yeah makes sense, i'll comment and close [15:55:13] k, thanks [15:56:33] elukey, ottomata: also https://phabricator.wikimedia.org/T149222, I don't see the point here either, the private ones are intentionally private after all, while the others are accidential at best [15:56:42] I'd say also close this [15:56:50] joal, elukey: thank you so much!! [15:57:02] (currently sweeping through auth bugs as part of the Q ops goal) [15:58:08] moritzm: +1 [16:00:53] moritzm: i'll explain and close [16:01:08] a-team: standddupppp [16:01:21] joal, ottomata : standdduppp [16:01:21] (03PS1) 10Joal: [WIP] Update static wiki projects list [analytics/refinery] - 10https://gerrit.wikimedia.org/r/342030 [16:02:51] ok, thanks [16:03:13] agh! [16:07:43] 06Analytics-Kanban, 10ChangeProp, 06Operations, 10Reading-Web-Trending-Service, 06Services (watching): Upgrade librdkafka 0.9.4 on SCB and Varnishes - https://phabricator.wikimedia.org/T159379#3088162 (10Ottomata) Ya. @elukey should we do varnishes before or after this? I can add librdkafka 0.9.4 to ou... [16:57:17] joal: another question, so I'm trying to load all granularities with a single job, however it seems I have to specify a constant value for granularity. is there any way of going around this that doesn't involve changing the reducer/creating three jobs? [17:09:58] fdans: can you point me to the job is it in gerrit? [17:15:20] hey fdans, you should review the data formats you're dealing with [17:15:48] fdans: If I don't mistake, in your use-case granularity will be a new field to add to the list of hive fields [17:16:48] 10Analytics-Tech-community-metrics: Updated data in mediawiki-identities DB not deployed onto wikimedia.biterg.io? - https://phabricator.wikimedia.org/T157898#3088359 (10Aklapper) Yay, thanks for working on this! (Note to myself, just adding more test cases: After data is refreshed, Andre to check whether https... [17:17:08] (03PS22) 10Joal: Add mediawiki history spark jobs to refinery-job [analytics/refinery/source] - 10https://gerrit.wikimedia.org/r/325312 (https://phabricator.wikimedia.org/T141548) [17:19:52] fdans: do you want us to spend some time on this now? [17:35:30] joal: sorry, had to pop out for a second [17:35:36] yes, batcave? [17:36:02] fdans: yes ! [18:02:45] 06Analytics-Kanban, 10EventBus, 10Wikimedia-Stream, 06Services (watching), 15User-mobrovac: EventStreams - https://phabricator.wikimedia.org/T130651#3088529 (10Jdlrobson) [18:02:48] 10Analytics, 10Wikimedia-Stream: old revision length missing in /v2/stream/recentchange - https://phabricator.wikimedia.org/T160030#3088527 (10Jdlrobson) 05Open>03Resolved Confirmed. I think I was looking at some non-edit events. Thanks for the follow up. [18:09:20] * elukey afk! 
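Following joal's suggestion of making granularity one more field in the data rather than a per-job constant, a purely illustrative sketch (the table, columns and output path below are made up, not the real refinery schema):

  # emit granularity as a column so a single loader job can cover daily/monthly/etc.
  hive -e "INSERT OVERWRITE DIRECTORY '/tmp/fdans/pageviews_per_project_test'
           SELECT project, granularity, dt, view_count
           FROM fdans.pageviews_per_project_all_granularities;"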
[18:59:07] (03PS4) 10Joal: Add oozie jobs for mw history denormalized [analytics/refinery] - 10https://gerrit.wikimedia.org/r/341030 [19:09:39] * halfak looks around for ottomata [19:10:05] I'm around for the analytics/devops/research checkin [19:13:32] halfak: Hey ! [19:13:40] o/ joal [19:13:42] Sorry missed the thing - anyone with you? [19:14:08] halfak: ---^ [19:14:34] Nope. All by myself [19:14:48] I don't actually have anything pressing, but thought I should make myself available :D [19:14:49] halfak: seems like it's a 'cancel' signal [19:14:53] yup [19:14:54] :D [19:15:04] ok - sorry for not showing up in time [19:15:09] no worries. :) [19:16:22] Have a good end of day, halfak, see you later ! [19:17:21] Gone for today a-team ! [19:17:48] nite jo [19:18:28] fdans: your job is still running :) [19:29:06] 10Analytics, 10EventBus, 10Reading-Web-Trending-Service, 13Patch-For-Review, and 4 others: Compute the trending articles over a period of 24h rather than 1h - https://phabricator.wikimedia.org/T156411#3088833 (10Jdlrobson) We may need to revisit the config values for max_pages The fact I'm seeing older art... [19:57:58] 10Analytics, 10EventBus, 10Reading-Web-Trending-Service, 13Patch-For-Review, and 4 others: Compute the trending articles over a period of 24h rather than 1h - https://phabricator.wikimedia.org/T156411#3088917 (10mobrovac) >>! In T156411#3088833, @Jdlrobson wrote: > We may need to revisit the config values... [20:04:49] 10Analytics, 10EventBus, 10Reading-Web-Trending-Service, 13Patch-For-Review, and 4 others: Compute the trending articles over a period of 24h rather than 1h - https://phabricator.wikimedia.org/T156411#3088961 (10Pchelolo) @Jdlrobson What you could do for quick testing of ideas and outputs is the following:... [20:19:42] 10Analytics, 10EventBus, 10Reading-Web-Trending-Service, 13Patch-For-Review, and 4 others: Compute the trending articles over a period of 24h rather than 1h - https://phabricator.wikimedia.org/T156411#3089015 (10Jdlrobson) Right now the fact is that older articles are not showing up in results. Theres onl... [20:30:10] 10Analytics, 10EventBus, 10Reading-Web-Trending-Service, 13Patch-For-Review, and 4 others: Compute the trending articles over a period of 24h rather than 1h - https://phabricator.wikimedia.org/T156411#3089068 (10mobrovac) There are 10 edits/second across all wikis. If we assume we are tracking all of them... [20:45:20] ottomata: I added you to CR for UA metrics, we think we are ready to merge that cc Krinkle [20:45:40] k [20:45:42] ottomata: https://gerrit.wikimedia.org/r/#/c/341723/ also :) [20:45:52] re: webperf, I'm okay with landing. [20:46:54] Actually, if you could wait 30min or so to give the train a bit more time to roll out and for us to catch regressions in dashbaords as new data comes in. Its' a hot time of the week for us. [20:46:57] just went to all wikis [20:47:19] Krinkle: wouldn't it be better to do the scap deploy first? [20:47:25] then you'll have your own eventlogging/webperf [20:47:29] and can just set PYTHONPATH=... [20:47:55] i guess it doesn't really matter [20:48:00] oh, i guess you haven't deploy the trebuchet one in a long time [20:48:08] so maybe better to stick with what you have, rather than change code [20:48:09] ok ok ok [20:48:10] :) [20:48:11] I'd rather do this step first :) [20:48:26] But.. 
wait 30min with merging (unless you can make sure it won't hit hafnium until 30min) [20:48:34] naw, i'll wait [20:48:41] would also like feedback on https://gerrit.wikimedia.org/r/#/c/341724/3 [20:48:58] actually, i think i'm going to get kicked out of this cafe, and i am going to peace out for the day in about an hour, so maybe we can merge tomorrow? [20:49:11] oo reading... [20:50:39] yeah, no rush [20:52:07] ottomata: ya, totally, no rush [21:50:41] 10Analytics, 10EventBus, 10Reading-Web-Trending-Service, 13Patch-For-Review, and 4 others: Compute the trending articles over a period of 24h rather than 1h - https://phabricator.wikimedia.org/T156411#3089513 (10Jdlrobson) @mobrovac that sounds like a good starting point and wouldn't hurt. I'll let you kno... [21:55:05] 10Analytics, 10EventBus, 10Reading-Web-Trending-Service, 13Patch-For-Review, and 4 others: Compute the trending articles over a period of 24h rather than 1h - https://phabricator.wikimedia.org/T156411#3089544 (10Jdlrobson) Thanks for the discussion - I've captured all this in a spike - T160127. [22:04:29] 10Analytics, 10EventBus, 10Reading-Web-Trending-Service, 13Patch-For-Review, and 4 others: Compute the trending articles over a period of 24h rather than 1h - https://phabricator.wikimedia.org/T156411#3089588 (10mobrovac) I have bumped `max_pages`, let's see if that helps. The memory has increased but the... [22:19:23] 10Analytics, 10EventBus, 10Reading-Web-Trending-Service, 13Patch-For-Review, and 4 others: Compute the trending articles over a period of 24h rather than 1h - https://phabricator.wikimedia.org/T156411#3089628 (10Jdlrobson) <3 [23:29:20] 10Analytics-EventLogging, 06Analytics-Kanban, 13Patch-For-Review: Change userAgent field to user_agent_map in EventCapsule - https://phabricator.wikimedia.org/T153207#3089847 (10Nuria) @Tbayer and @Nemo_bis FYI that we will be deploying this next week, after our work with @Krinkle we have also included brows...