[09:19:13] (03CR) 10DCausse: [C: 031] Lucene Stemmer UDF (031 comment) [analytics/refinery/source] - 10https://gerrit.wikimedia.org/r/326168 (https://phabricator.wikimedia.org/T148811) (owner: 10EBernhardson)
[11:50:01] Hi A-team, I'm teaching this afternoon, will be here at standup :)
[11:51:48] o/
[12:05:25] joal elukey hellooo A-team!
[12:06:53] o/
[12:20:59] * elukey lunch!
[12:49:07] hi team!
[12:49:17] mforns: hello!
[12:49:29] hello, how's it going :]
[12:50:07] mforns hello!
[12:50:31] fdans, hello! :] how are you?
[13:00:33] great! first day :) you're in bcn right?
[13:01:18] I was born in bcn, but I live in Mallorca today :]
[13:01:36] you live in Madrid, no?
[13:02:55] ahhh now I got who fdans is! Hello!
[13:03:08] sorry I was working on kafka and my brain didn't connect
[13:03:16] (all excuses I know :P)
[13:03:22] hi :) no problem! really excited to be here
[13:03:29] yeah mforns I'm in Madrid
[13:03:35] super happy to have you here!
[13:03:48] let me know if you need anything related to ops
[13:03:59] yes, we're excited to have you here too!
[13:04:18] I think you're the only one I haven't spoken to so far in the team
[13:04:52] bad guy, don't trust him!
[13:05:04] hahah will keep that in mind
[13:05:07] have you had a hangout today already with someone in the team?
[13:05:08] xD
[13:05:31] well for now I'm just setting up my stuff and I do have a hangout later with Nuria to kick things off
[13:05:55] awesome
[13:08:24] 10Analytics, 10MediaWiki-Vagrant: Cannot enable 'analytics' role on Labs instance - https://phabricator.wikimedia.org/T151861#2864613 (10mschwarzer) Not rebooting is not really a suitable solution when using the VM for development, since I also need to enable other roles or change port-forwarding. For instanc...
[13:08:42] fdans, so feel free to ping me whenever you want to talk about anything, questions or whatever :]
[13:09:28] thank you, I appreciate that!
[13:27:59] 06Analytics-Kanban: Webrequest dataloss registered during the last kafka restarts - https://phabricator.wikimedia.org/T152674#2864639 (10elukey) List the restarts and their timings (UTC): ``` Dec 7 18:45:38 kafka1012 sudo: otto : TTY=pts/1 ; PWD=/home/otto ; USER=root ; COMMAND=/usr/sbin/service kafka rest...
[13:32:32] 10Analytics, 10MediaWiki-Vagrant: Cannot enable 'analytics' role on Labs instance - https://phabricator.wikimedia.org/T151861#2864652 (10Physikerwelt) I think it might be a good idea to play with https://github.com/wikimedia/mediawiki-vagrant/blob/master/puppet/modules/role/manifests/hadoop.pp#L7 Maybe puttin...
[13:37:43] 10Analytics, 10MediaWiki-Vagrant: Cannot enable 'analytics' role on Labs instance - https://phabricator.wikimedia.org/T151861#2864657 (10Physikerwelt) However, I still don't really understand that particular error message. If you `export VAGRANT_LOG=DEBUG` you'll see that ``` INFO subprocess: Starting process...
[13:38:10] 10Analytics, 10MediaWiki-Vagrant: Cannot enable 'analytics' role on Labs instance - https://phabricator.wikimedia.org/T151861#2864659 (10mschwarzer) Due to the error with --provision the Hadoop ports weren't set up correctly: ``` $ vagrant forward-port -l Local port => VM's port ----------------------- ``` T...
[13:41:00] 10Analytics, 10MediaWiki-Vagrant: Cannot enable 'analytics' role on Labs instance - https://phabricator.wikimedia.org/T151861#2864667 (10mschwarzer) Vagrant seems to call the lxc-attach help function: ``` INFO subprocess: Starting process: ["/usr/bin/sudo", "/usr/bin/env", "lxc-attach", "-h"] ``` I don't see...
[13:45:40] 10Analytics, 10MediaWiki-Vagrant: Cannot enable 'analytics' role on Labs instance - https://phabricator.wikimedia.org/T151861#2864669 (10Physikerwelt) I have no idea what you could try next. Probably run vagrant destroy and adjust the config. Are you trying the analytics role or just the hadoop role? I wonder...
[13:51:55] mforns: do you have 10 minutes?
[13:52:07] elukey, sure
[13:52:11] thanks :)
[13:52:11] batcave?
[13:52:15] sure
[13:52:34] mforns: mmm maybe better in here
[13:52:37] for the moment
[13:52:43] I need to write you some things :D
[13:52:44] elukey, yes ok
[13:53:03] so I am checking logs for T152674
[13:53:03] T152674: Webrequest dataloss registered during the last kafka restarts - https://phabricator.wikimedia.org/T152674
[13:53:30] so after digging in kafka/varnishkafka I tried to execute the select_missing_sequence_runs.hql for other partitions
[13:53:33] like misc and maps
[13:53:38] aha
[13:53:44] because all the segments are alarming
[13:53:49] err were alarming
[13:53:51] for the same hour
[13:53:55] yes
[13:54:28] so I copied the script from the refinery to my home on stat1004 (adding the ADD JAR etc.. to the top of the script)
[13:54:32] and then tried hive -f select_missing_sequence_runs.hql -d webrequest_source=text -d year=2016 -d month=12 -d day=7 -d hour=18
[13:54:53] that works
[13:54:54] BUT
[13:54:59] if I use misc or maps, no result
[13:55:07] mmm
[13:56:45] elukey, this query though is not the one that the alarms use, right?
[13:57:48] mforns: probably, it is old.. what I wanted to ask you is if we could review both and check the last alarms, to figure out if we got false positives (for some reason) or if there is a weird kafka transport issue
[13:57:53] that I can't find in any metrics
[13:58:16] elukey, of course
[13:58:30] 10Analytics, 10Analytics-Wikistats: Design new UI for Wikistats 2.0 - https://phabricator.wikimedia.org/T140000#2864687 (10MelodyKramer) Yup, I added it because the UI there really helped lay folks understand exactly what metrics they were looking at (because they were framed as questions/answers.) The content...
[13:58:32] looking
[14:22:34] 10Analytics, 10MediaWiki-Vagrant: Cannot enable 'analytics' role on Labs instance - https://phabricator.wikimedia.org/T151861#2864758 (10mschwarzer) Same problem when only enabling the hadoop role :/
[14:27:36] mforns: let me know if you want help!
[14:27:43] you went dark for 30 minutes :D
[14:30:16] elukey, xD, yes still checking
[14:30:25] it's weird
[14:31:18] elukey, the percentages in the query that populates the stats table do not add up to the amount shown in the error file
[14:36:25] mforns: but there is indeed data loss registered in the stats table right? (still need to check)
[14:37:28] ah yes
[14:41:46] actually elukey, the webrequest_sequence_stats table does have matching percentage_lost, but how it got there I don't know
[14:42:07] hiiii
[14:42:57] o/
[14:43:03] 10Analytics, 10MediaWiki-Vagrant: Cannot enable 'analytics' role on Labs instance - https://phabricator.wikimedia.org/T151861#2864815 (10Physikerwelt) @mschwarzer Maybe you find someone using the analytics vagrant role on [IRC](https://webchat.freenode.net/?channels=#wikimedia-analytics).
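A rough sketch of the gap hunt described above (an illustration written for this log, not the actual refinery select_missing_sequence_runs.hql): per host, compare each sequence number with the previous one and report any run of missing values. The ${...} variables line up with the -d definitions in the hive invocation quoted above.

```
-- Illustrative only; the real refinery script may differ.
WITH ordered AS (
  SELECT
    hostname,
    sequence,
    LAG(sequence) OVER (PARTITION BY hostname ORDER BY sequence) AS prev_sequence
  FROM wmf_raw.webrequest
  WHERE webrequest_source = '${webrequest_source}'
    AND year = ${year} AND month = ${month} AND day = ${day} AND hour = ${hour}
)
SELECT
  hostname,
  prev_sequence + 1            AS missing_start,
  sequence - 1                 AS missing_end,
  sequence - prev_sequence - 1 AS missing_count
FROM ordered
WHERE sequence - prev_sequence > 1;
```

An empty result for misc/maps, as reported above, would then simply mean no per-host sequence gaps in that hour.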
[14:45:17] mforns: I tried to run select * from webrequest where webrequest_source = 'text' and year = 2016 and month = 12 and day = 7 and hour = 18 and sequence >= 2641357262 and sequence <= 2655443912 and hostname = 'cp3030.esams.wmnet' limit 100
[14:45:39] elukey, aha
[14:45:49] and I can see reqs
[14:45:55] logged correctly
[14:46:28] let me count them
[14:46:29] to see
[14:46:44] the sequences are the ones reported missing in webreq-stats
[14:49:46] so I get 14086628
[14:49:58] and it should be 14086650
[14:50:09] so about ~300 reqs missing
[14:51:14] and webreq stats report as count_missing 384218
[14:51:31] sorry, count_different
[14:51:39] 10Analytics, 10Analytics-Wikistats: Design new UI for Wikistats 2.0 - https://phabricator.wikimedia.org/T140000#2864849 (10Milimetric) @Nemo_bis: The design right now is revolving around "Content, Contributing, and Reading". So that's the direction, still the same as the current wikistats but trying to add mo...
[14:51:45] mmmmmm
[14:52:08] hi fdans, welcome!
[14:52:34] fdans: do you have the onboarding guide?
[14:52:44] hi milimetric, thank you!
[14:53:05] I do, the only thing is that yesterday I managed to lock myself out of my wikimedia.org google account
[14:53:26] oh ok, np
[14:53:38] you can still do the labs / ssh / gerrit stuff, right?
[14:53:43] by resetting the password but not setting 2FA, so I'm waiting to hear from them about that
[14:54:04] elukey: can you help with that or is that just OIT?
[14:55:22] (OIT: office IT, they run the office infrastructure, meetings, equipment, our google accounts and more)
[14:55:46] fdans: HIIII! :D
[14:55:49] I think that office IT should be able to help during SFO daytime
[14:56:07] I do have accounts for all of this, but the credentials are in the gmail account
[14:56:16] ottomata hi!!!
[14:57:03] oh, makes sense
[14:57:09] soon as I changed the password yesterday I realised I didn't have any more 8-digit access codes and it was like OH NO
[14:57:18] elukey: count different could be duplicates
[14:57:32] are you selecting your seqs from wmf_raw webrequest? or refined?
[14:57:38] raw
[14:57:41] hm
[14:58:11] also I tried to run the select_missing_sequence_runs.hql script for upload/misc/maps
[14:58:24] but nothing pops up for maps/misc
[14:58:28] wait, so seq stats is wrong?
[14:58:43] fdans: well, so you'll be locked out for at least another 2-3 hours until OIT comes online. We can hang out and talk about the projects going on
[14:58:55] sure!
[14:59:13] fdans: do you have the batcave linked? :) https://plus.google.com/hangouts/_/wikimedia.org/a-batcave
[14:59:21] ottomata: not sure, after digging into logs and related things (without getting anything out of it) I want to make sure that our alarms are good and not throwing false positives
[14:59:38] (we all bookmark that, it's what we mean when people go: "batcave!" or "cave" or "hang?" or any such thing)
[14:59:44] ottomata: IIRC the last alarms are the first ones with the new alarming
[14:59:57] milimetric I'm requesting access, since I'm entering with my personal address
[15:00:02] and we got ALL the segments throwing errors
[15:00:06] ottomata, elukey, I think there is a problem with generate_sequence_stats_hourly.hql
[15:00:27] elukey: yeah, but it did correspond to when i restarted broker on 1012 that day
[15:01:15] ottomata: yep, but I really don't see any proof that kafka1012 caused this mess :(
[15:01:42] mforns: what did you find?
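The hand arithmetic above can be expressed directly in Hive. In this hedged sketch (output column names are just descriptive, the real stats job may differ), the sequence span gives the expected count per host, and comparing total rows with distinct sequences separates duplicates, ottomata's point about count_different, from genuinely missing requests:

```
-- Rough illustration; wmf_raw.webrequest is real, the aliases are not.
SELECT
  hostname,
  MAX(sequence) - MIN(sequence) + 1                              AS count_expected,
  COUNT(*)                                                       AS count_actual,
  COUNT(*) - COUNT(DISTINCT sequence)                            AS count_duplicate,
  (MAX(sequence) - MIN(sequence) + 1) - COUNT(DISTINCT sequence) AS count_missing
FROM wmf_raw.webrequest
WHERE webrequest_source = 'text'
  AND year = 2016 AND month = 12 AND day = 7 AND hour = 18
  AND hostname = 'cp3030.esams.wmnet'
GROUP BY hostname;
```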
[15:02:32] elukey, I think webrequest_sequence_stats is correct, I can see a couple losses, but they do not add up to the threshold
[15:02:59] but somehow, the webrequest_sequence_stats_hourly table has a percent loss that is over threshold
[15:03:29] I'm analysing the queries right now to see
[15:03:43] mforns: so it could be that the kafka1012 restarts caused some loss, but not up to the WARNING threshold
[15:03:57] yes, that is the hypothesis
[15:04:05] nice
[15:05:31] but why would seq stats be wrong? that's strange
[15:05:56] ottomata, not seq stats, seq stats hourly
[15:06:05] maybe the query is wrong?
[15:06:13] hm
[15:07:09] ottomata: if you check /mnt/hdfs/wmf/data/raw/webrequests_data_loss/text/2016/12/7/18/WARNING/000000_0 it seems that 6490442 requests (2.726% of valid ones) were lost, but I don't count such a big number from the results that you posted
[15:07:57] there is indeed a big hole for cp3030 (not sure why) up to ~1M, but the alarm says 6M
[15:07:59] oh from the hole checker, yeah I didn't count
[15:09:29] Hi everyone. Anybody using the Vagrant analytics role?
[15:10:40] not for a long time :/
[15:13:38] I'm trying to get it running to see how I can use the oozie workflow to write data to ES. but it seems a bit buggy: https://phabricator.wikimedia.org/T151861
[15:19:46] elukey, ottomata, has that hour been rerun after kafka restarts?
[15:20:30] rerun? no
[15:20:35] did the oozie jobs fail?
[15:20:50] nope it was a warning
[15:21:09] so no stops
[15:21:20] mmmmm: i'll try to spin it up today
[15:21:24] and see if i have similar problems
[15:21:29] but the contents of the webrequest_sequence_table for that hour are different from if I run the query right now...
[15:21:41] ottomata: thanks very much!
[15:21:44] mmmmm: real quick, did you increase RAM for vagrant?
[15:22:27] ottomata: no. see ticket. it fails already when provisioning the hadoop role.
[15:23:37] mforns: batcave?
[15:23:44] elukey, yea..
[15:24:12] ok
[15:24:49] elukey, https://hangouts.google.com/hangouts/_/wikimedia.org/a-batcave-2
[15:25:04] mforns: joining, need to grab some headphones :)
[15:39:20] 10Analytics-Dashiki, 06Analytics-Kanban, 13Patch-For-Review: Remove dependency on available-projects.json file hosted in labs - https://phabricator.wikimedia.org/T136120#2864952 (10Milimetric) p:05Triage>03Normal a:03Nuria
[15:45:31] milimetric: just to follow up from last time, found the problem. It was that unix_timestamp() is considered non-deterministic, because it has a 0-argument version that returns the current timestamp. using to_unix_timestamp made everything happy
[15:49:05] Morning!! :) Any thoughts on where I should put a shared directory for fundraising, to share a bit of code and aggregated data (in csv) among fr-tech on stat1002? (mostly to be used with ipython notebooks)? thx in advance!
[15:53:28] ottomata: something super weird has happened
[15:54:11] mforns has re-run the query to generate the stats table percent loss etc.. (without the insert) and it is not showing any loss, or a veery veeery tiny amount
[15:54:36] that is very weird!!!
[15:54:38] HMMMMMM
[15:54:39] meanwhile, the stats table shows for each partition a constant amount of loss (text is 2.x percent, etc..)
[15:54:56] so it seems like something cut the flow at 98% or something
[15:54:57] could the stats table have possibly been populated before camus finished?
[15:55:05] this is the theory!
[15:55:14] you restarted kafka1012 two times close to the hour
[15:55:22] back in a bit (will get backscroll...)
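For orientation, the hourly rollup under suspicion plausibly looks something like the following. This is a guess at the shape of generate_sequence_stats_hourly.hql, reusing field names the conversation mentions (count_missing, percentage_lost) plus an assumed sequence_actual column, rather than anything taken from the refinery source:

```
-- Assumed sketch: collapse per-host stats into one loss percentage per
-- source and hour; the alarm compares percentage_lost to a threshold.
-- percentage is relative to valid requests, matching the "% of valid
-- ones" wording in the alarm file above.
SELECT
  webrequest_source, year, month, day, hour,
  SUM(count_missing)                              AS count_missing,
  SUM(sequence_actual)                            AS sequence_actual,
  100.0 * SUM(count_missing) / SUM(sequence_actual) AS percentage_lost
FROM wmf_raw.webrequest_sequence_stats
WHERE year = 2016 AND month = 12 AND day = 7 AND hour = 18
GROUP BY webrequest_source, year, month, day, hour;
```

If that table was populated before camus finished importing the hour, the frozen percentage would overstate the loss, which is exactly the mismatch being chased here.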
[15:55:27] so it might be that big batches of data (the last ones) got delayed
[15:55:40] and imported/processed during the subsequent hour?
[15:55:51] this is what Marcel and I are thinking
[15:56:11] and it would explain why it happened when you restarted and why we didn't get more alarms
[15:56:36] yeah, but hm, the load job isn't supposed to be launched until camus writes a file into the dirs, and that is based on the camus offset files and timestamps
[15:56:37] HMMMM
[15:56:38] mayybe
[15:56:46] 06Analytics-Kanban: Webrequest dataloss registered during the last kafka restarts - https://phabricator.wikimedia.org/T152674#2856720 (10elukey) a:03elukey
[15:56:46] maybe some reordering issues because of retries from varnishkafka?
[15:56:51] like, if i restarted near the hour
[15:57:05] some messages with dts from after the hour were successfully produced to kafka before the hour
[15:57:15] so they had earlier offsets
[15:57:54] and camus had some race condition where it finished importing, wrote data for the next hour, so the camus offset files had timestamps with stuff for the next hour, which would cause the camus job to write the _IMPORTED flag so the load oozie job would run?
[15:58:15] before camus ran the next time, when it would import the remaining messages for the previous hour?
[15:58:21] that is crazy, but possible i guess
[15:58:44] ottomata: so we checked sequence_expected and sequence_actual. In the current stats table they show a big difference, meanwhile in the re-run of the query they are the same
[16:00:06] elukey: sounds like data is present then
[16:00:27] must be some crazy varnishkafka retries + camus + oozie race condition
[16:06:51] (03PS3) 10Nuria: Fixing tests on master [analytics/dashiki] - 10https://gerrit.wikimedia.org/r/326152 (https://phabricator.wikimedia.org/T152631)
[16:07:31] 06Analytics-Kanban: Dashiki tests broken on master - https://phabricator.wikimedia.org/T152631#2855497 (10Nuria) https://gerrit.wikimedia.org/r/#/c/326152/
[16:37:36] 10Analytics, 10Pageviews-API, 07Easy, 03Google-Code-In-2016: Add monthly request stats per article title to pageview api - https://phabricator.wikimedia.org/T139934#2865132 (10Milimetric) Just reporting back about the current way this is being implemented (mini request for comment): The per-article API is...
[16:43:54] elukey: mforns, the loss from nodes like cp3030 was real, right?
[16:44:28] ottomata: mmmm not sure, from what Marcel found out no
[16:44:29] ottomata, what do you mean?
[16:44:45] I'd say there was no loss, only delay no?
[16:44:46] but we can check
[16:44:51] easily
[16:44:59] yes
[16:50:58] fdans: can you join batcave? with wikimedia address?
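One concrete way to test the delayed-not-lost hypothesis above, again a sketch assuming the sequence_actual column name the chat uses and per-host rows in the stats table: compare the counts frozen in at alarm time with a fresh recount from the raw data, now that every camus run for that hour has completed.

```
-- Hosts returned are those whose data kept arriving after the stats
-- were computed: delayed rather than lost.
WITH fresh AS (
  SELECT hostname, COUNT(*) AS actual_now
  FROM wmf_raw.webrequest
  WHERE webrequest_source = 'text'
    AND year = 2016 AND month = 12 AND day = 7 AND hour = 18
  GROUP BY hostname
)
SELECT s.hostname, s.sequence_actual AS actual_at_alarm_time, f.actual_now
FROM wmf_raw.webrequest_sequence_stats s
JOIN fresh f ON f.hostname = s.hostname
WHERE s.webrequest_source = 'text'
  AND s.year = 2016 AND s.month = 12 AND s.day = 7 AND s.hour = 18
  AND s.sequence_actual <> f.actual_now;
```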
[16:53:49] 06Analytics-Kanban, 10EventBus, 10Wikimedia-Stream, 13Patch-For-Review: Implement server side filtering (if we should) - https://phabricator.wikimedia.org/T152731#2865189 (10Nuria)
[16:56:54] 10Analytics: de-duplicate archive records matching revision records in mediawiki_history - https://phabricator.wikimedia.org/T152546#2852209 (10Nuria) p:05Triage>03Low
[16:57:12] ok
[16:57:21] i guess a few hosts just were delayed more than others
[16:58:08] 06Analytics-Kanban, 10MediaWiki-Vagrant: Cannot enable 'analytics' role on Labs instance - https://phabricator.wikimedia.org/T151861#2865209 (10Nuria)
[17:03:33] 10Analytics, 14Trash: --- Items above are triaged ----------------------- - https://phabricator.wikimedia.org/T115634#2865231 (10Nuria)
[17:08:21] 10Analytics: Productionize loading of edit data into Druid (contingent on success of research spike) - https://phabricator.wikimedia.org/T141473#2865240 (10Nuria)
[17:08:23] 06Analytics-Kanban: Productionize Edit History Reconstruction and Extraction - https://phabricator.wikimedia.org/T152035#2865239 (10Nuria)
[17:13:04] 06Analytics-Kanban, 10GitHub-Mirrors, 07Documentation, 07Easy: Mark documentation about limn as deprecated - https://phabricator.wikimedia.org/T148058#2865249 (10Nuria)
[17:13:15] 06Analytics-Kanban, 10GitHub-Mirrors, 07Documentation, 07Easy: Mark documentation about limn as deprecated - https://phabricator.wikimedia.org/T148058#2713665 (10Nuria) a:03mforns
[17:13:50] Hi a-team, sorry to have missed standup!
[17:13:59] np joal :]
[17:14:06] we are finishing grooming
[17:14:17] 10Analytics, 07Spike: Research spike: load enwiki data into Druid to study lookup table performance - https://phabricator.wikimedia.org/T141472#2865271 (10Nuria)
[17:14:19] I've backlogged the chan on kafka issues, interesting theory about reordering close to the hour
[17:14:20] 10Analytics, 07Spike: Spike - Slowly Changing Dimensions on Druid - https://phabricator.wikimedia.org/T134792#2865269 (10Nuria)
[17:15:19] This type of reordering could definitely mess with the partition-checker (or at least I really think so)
[17:15:37] joal: good news is that we didn't lose data and our alarming works :)
[17:15:49] (but it was a weird race condition)
[17:16:12] elukey: cool :)
[17:16:34] 10Analytics-Cluster: Report pageviews to the annual report - https://phabricator.wikimedia.org/T95573#2865276 (10Nuria) 05Open>03Resolved
[17:17:02] joal: Andrew and Marcel are going to update the task with findings and theory, if you want to chime in and add your opinion/review it would be great :)
[17:17:29] joal, elukey, I was doing it, but grooming started, will post in a couple minutes
[17:17:49] sure elukey, mforns :)
[17:18:29] elukey: i'm trying to pull out sequence vs offsets from kafka at that hour
[17:18:42] things seem to be so out of order that it's actually hard to do
[17:18:47] mforns, milimetric: a couple of minutes for me, to explain the idea about code separation?
[17:18:49] which, supports our hypothesis
[17:19:00] but, usually i can search for an offset close to the time in question
[17:19:01] join us in the cave joal, we're just wrapping up grooming
[17:19:02] consume a bunch of messages
[17:19:03] ottomata: riiiight
[17:19:04] sort them
[17:19:09] sure, joining!
[17:19:12] and then examine them to see what's up
[17:19:27] thanks to everybody :)
[17:19:28] but, in this case, when I consume just the couple of minutes around the hour border
[17:19:46] i see lots of holes in sequences
[17:20:25] so, i dunno, i think we are correct, but i can't really verify from kafka, without consuming and sorting a LOT of data
[17:20:38] it'd be nice if we had kafka partition offsets in the webrequest raw data :/
[17:21:41] ottomata, elukey: should we think of a way to improve the camus-partition-checker?
[17:21:57] That could be a nice actionable out of that task
[17:22:16] hmmm
[17:22:21] joal: i'm not sure what we could do in this case
[17:22:25] we could delay
[17:22:30] running the checker
[17:22:38] until +1 camus runs later
[17:22:42] or some fixed amount of time later
[17:23:00] but the camus checker did the best it could with camus offset file timestamps
[17:23:14] ottomata: hm, I need to think about that a bit more
[17:23:52] joal: probably a cool actionable would be to get partition offset info in the webrequest data we save to hdfs
[17:23:52] (03PS1) 10Addshore: ExactValueQuantityProcessor use short property IDs [analytics/wmde/toolkit-analyzer] - 10https://gerrit.wikimedia.org/r/326477
[17:24:00] that would make verifying problems like this much easier
[17:27:40] ottomata: camus update, after the avro stuff we've been through, must be an easy one ;)
[17:30:52] :)
[17:30:58] what would you change though?
[17:31:34] mforns: come back
[17:31:40] xD
[17:31:43] not sure ottomata, would need to read the code again
[17:33:32] 06Analytics-Kanban: Webrequest dataloss registered during the last kafka restarts - https://phabricator.wikimedia.org/T152674#2865351 (10Ottomata) Ok, from Luca and Marcel's research today, here's the current hypothesis: Kafka restart caused varnishkafka messages to be delayed. Since one of the restarts was ve...
[17:37:20] joal: this would be fun, ehhHH???? http://sf.flink-forward.org/registration/
[18:03:16] 10Analytics, 10Pageviews-API, 07Easy, 03Google-Code-In-2016: Add monthly request stats per article title to pageview api - https://phabricator.wikimedia.org/T139934#2865478 (10Nuria) On 1) I think aggregating partial months might be confusing, it is hard to relate to data that you do not see often and when...
[18:07:25] (03PS1) 10Addshore: Count per property for the 3 groups (ExactValueQuantityP..) [analytics/wmde/toolkit-analyzer] - 10https://gerrit.wikimedia.org/r/326481 (https://phabricator.wikimedia.org/T152615)
[18:09:02] (03PS1) 10Addshore: New build [analytics/wmde/toolkit-analyzer-build] - 10https://gerrit.wikimedia.org/r/326482 (https://phabricator.wikimedia.org/T152615)
[18:14:51] (03PS2) 10Addshore: New build [analytics/wmde/toolkit-analyzer-build] - 10https://gerrit.wikimedia.org/r/326482 (https://phabricator.wikimedia.org/T152615)
[18:17:07] ottomata: This would indeed be fun!!
[18:18:06] joal: if you go, I go :)
[18:18:16] * elukey afk!
[18:18:45] Aha ottomata :)
[18:19:32] ottomata: unfortunately SF is not super easy for me :S
[18:19:44] ottomata: I'll discuss it with nuria :) Could be my 2017 conf ;)
[18:19:48] (03PS1) 10Addshore: Add ExactValueQuantityProcessor [analytics/wmde/toolkit-analyzer-build] - 10https://gerrit.wikimedia.org/r/326485
[18:19:54] (03CR) 10Addshore: [V: 032 C: 032] Add ExactValueQuantityProcessor [analytics/wmde/toolkit-analyzer-build] - 10https://gerrit.wikimedia.org/r/326485 (owner: 10Addshore)
[18:20:11] (03PS1) 10Addshore: In quantities units of "1" are converted to "" [analytics/wmde/toolkit-analyzer-build] - 10https://gerrit.wikimedia.org/r/326486 (https://phabricator.wikimedia.org/T152615)
[18:20:16] (03CR) 10Addshore: [V: 032 C: 032] In quantities units of "1" are converted to "" [analytics/wmde/toolkit-analyzer-build] - 10https://gerrit.wikimedia.org/r/326486 (https://phabricator.wikimedia.org/T152615) (owner: 10Addshore)
[18:21:14] ottomata: "if you go i go".. wait where?
[18:21:41] nuria: http://sf.flink-forward.org/
[18:21:59] ottomata: ooohhhh
[18:22:13] (03PS3) 10Addshore: New build (with 2 changes) [analytics/wmde/toolkit-analyzer-build] - 10https://gerrit.wikimedia.org/r/326482 (https://phabricator.wikimedia.org/T152615)
[18:23:57] hey joal, yt still?
[18:24:10] (03PS4) 10Addshore: New build (with 2 changes) [analytics/wmde/toolkit-analyzer-build] - 10https://gerrit.wikimedia.org/r/326482 (https://phabricator.wikimedia.org/T152615)
[18:24:11] if so, could you vet new pivot instance on thorium?
[18:24:14] i want to move DNS soon
[18:24:17] as part of stat1001 replacement
[18:24:28] ottomata: I'm here
[18:24:37] testing
[18:25:04] (03PS1) 10Addshore: New build (with 2 changes) [analytics/wmde/toolkit-analyzer-build] - 10https://gerrit.wikimedia.org/r/326487 (https://phabricator.wikimedia.org/T152615)
[18:25:32] ok so
[18:25:37] (03CR) 10Addshore: [V: 032 C: 032] ExactValueQuantityProcessor use short property IDs [analytics/wmde/toolkit-analyzer] - 10https://gerrit.wikimedia.org/r/326477 (owner: 10Addshore)
[18:25:42] (03CR) 10Addshore: [V: 032 C: 032] Count per property for the 3 groups (ExactValueQuantityP..) [analytics/wmde/toolkit-analyzer] - 10https://gerrit.wikimedia.org/r/326481 (https://phabricator.wikimedia.org/T152615) (owner: 10Addshore)
[18:25:56] joal
[18:25:58] ssh -Nv thorium.eqiad.wmnet -L 9090:thorium.eqiad.wmnet:9090
[18:25:59] then
[18:26:07] visit http://localhost:9090
[18:26:09] (03CR) 10Addshore: [V: 032 C: 032] New build (with 2 changes) [analytics/wmde/toolkit-analyzer-build] - 10https://gerrit.wikimedia.org/r/326482 (https://phabricator.wikimedia.org/T152615) (owner: 10Addshore)
[18:26:12] (03CR) 10Addshore: [V: 032 C: 032] New build (with 2 changes) [analytics/wmde/toolkit-analyzer-build] - 10https://gerrit.wikimedia.org/r/326487 (https://phabricator.wikimedia.org/T152615) (owner: 10Addshore)
[18:26:15] if that looks good to you, we should be good to go
[18:26:27] sorry for the spam (should be over now...)
[18:27:25] ottomata: circadian rhythms and sun: http://localhost:9090/#pageviews-hourly/line-chart/2/EQUQLgxg9AqgKgYWAGgN7APYAdgC5gQAWAhgJYB2KwApgB5YBO1Azs6RpbutnsEwGZVyxALbVeAfQlhSY4AF9kwYhBkc86FWs7AKVOoxZt1XTDnxEylJQaat2nbub7VBS4XPwiFSrQ43Kqv74MmIASsTkAObiSgAmAK4MxNq8AAoAjAAiVMxg1OYAtADs8mWKANrotkbBTrwCQqLi+IwYAFbUqj7AzBgMYACCQSaaIzp9A/r0dsaOZg2uTZ7AAG6k1ADuEhAYCeRgPXGkTOO8cSwQ1OTH0QqK1TO1owv4je7NktKy4orA
[18:27:31] AEYJCAAa2oQzOpj8JmAoWoACEgaDDvEkilgsA0nAMgAJHqTcGpSEQ3r9ZE0J72F48N5LD4rKSwo4nLqE5TMK43ChRe4AXWQ5ASABtBUp1lsdnsDigKmsNttdvtDnyKnyBcK0MBjmJyHNeG1Ot0lILZKRDrgAKzyIA==
[18:27:34] So cool
[18:27:36] sorry ottomata
[18:28:23] joal: ?
[18:28:35] ottomata: link is awful
[18:28:39] oh
[18:29:37] ottomata: What I have seen looks correct :)
[18:30:39] great
[18:38:01] 06Analytics-Kanban: Webrequest dataloss registered during the last kafka restarts - https://phabricator.wikimedia.org/T152674#2865627 (10mforns) Adding to what @Ottomata said: If today you execute the query[1] that populates the wmf_raw.webrequest_sequence_stats table, its results are not the same as the ones e...
[18:41:02] mforns: do you know if the data is missing in refined table yet?
[18:41:21] having an odd error, i'm creating a table with stat1002.eqiad.wmnet:~/create_query_clicks.hql, and then populating it with stat1002.eqiad.wmnet:~/query_clicks.hql. Trying to query the table after the query to populate 1 partition runs i get: Failed with exception java.io.IOException:parquet.io.ParquetDecodingException: Can not read value at 0 in block -1 in file ...
[18:41:22] ottomata, query is running right now
[18:41:58] i tried using a CREATE TABLE ... AS SELECT ... to double check the schema i created is correct, and it fully matches my create_query_clicks.hql. Not sure what's wrong
[18:43:00] ottomata, the host I tried for the misc source is fine!
[18:43:05] let me try others
[18:45:09] ebernhardson: are you accessing the wmf db?
[18:45:17] not wmf_raw
[18:45:55] ebernhardson: cause there should not be any need to read parquet files, right?
[18:46:42] ottomata, all hosts for misc are good
[18:46:57] ebernhardson: ah, you are storing as parquet
[18:47:10] ebernhardson: but that should not be needed, it can be plain text right?
[18:47:10] nuria: the original query joins wmf.webrequest against wmf_raw.cirrussearchrequestset, the table i'm creating uses 'STORED AS PARQUET' in the definition. I suppose when i use CTAS it might not be using parquet
[18:47:38] nuria: not sure it can be plain text, i'm storing complex data types (array<struct<...>>)
[18:47:53] ebernhardson: aham
[18:48:25] ebernhardson: does your query work w/o the writing part (so we know whether the exception comes from writing)
[18:49:38] ebernhardson: i *think* i remember parquet has different compressions
[18:49:47] nuria: yes, changing the query to use `create table .. as select ...` creates an appropriate table i can read from, the table i can't read is ebernhardson.query_clicks, the one i can read from is ebernhardson.query_clicks_ctas
[18:49:52] mforns: cool
[18:49:57] then i think things are probably fine
[18:49:59] ottomata, looking at text
[18:50:08] especially since misc refine job usually runs before text
[18:50:11] the test query i'm using is just a `select * from ebernhardson.query_clicks where year=2016 limit 2;`
[18:50:13] since sequence stats takes less time there
[18:50:38] ebernhardson: how are you telling your select it has to read parquet?
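To make the question above concrete: the storage format lives in the table definition, so a plain select needs no hint, which is where the exchange lands just below. A minimal sketch of the kind of DDL being debugged here, with invented field names rather than the actual query_clicks schema, and the compression set at query time the way ebernhardson describes later:

```
-- Hypothetical schema, for illustration only.
SET parquet.compression=SNAPPY;  -- affects subsequent writes

CREATE TABLE query_clicks_example (
  query STRING,
  hits  ARRAY<STRUCT<page_id:BIGINT, clicked:BOOLEAN>>
)
PARTITIONED BY (year INT, month INT, day INT, hour INT)
STORED AS PARQUET;
```

A complex column that is NULL for most rows (the left-join situation raised later in the log) has historically been a rough edge for Hive's Parquet reader, which would fit the ParquetDecodingException above.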
[18:50:42] ebernhardson: sorry
[18:50:52] nuria: no, i was assuming it would find that in the table definition
[18:51:16] ebernhardson: let me see, that might have changed, i thought before we had to be specific and pass the parquet jar
[18:51:31] actually it looks like text might work, it seems my CTAS is storing as org.apache.hadoop.mapred.TextInputFormat
[18:51:44] i don't know that i need parquet, it just seems to be the default used elsewhere in our setup
[18:52:18] 06Analytics-Kanban, 13Patch-For-Review: Replace stat1001 - https://phabricator.wikimedia.org/T149438#2865664 (10Ottomata) Announcement out. I plan to do this 15:00 UTC tomorrow Dec 13th.
[18:52:34] ebernhardson: for raw data yes but for "created" data we do not use it much
[18:54:51] nuria: ok, i'll try changing my create table to use the same storage config as the CTAS is creating then
[18:55:27] the size on disk looks to be about 1/2 for parquet though
[18:55:48] ebernhardson: I'm against nuria's opinion here
[18:55:54] ebernhardson: parquet is usually better
[18:55:59] unless you want to export the data out of hdfs
[18:56:06] joal: ah sorry, i might be totally off cc ebernhardson
[18:56:19] ebernhardson: parquet is very helpful if you're planning on analytics-oriented queries more than row-by-row reading
[18:56:20] joal: do we use parquet for tables with aggregated data?
[18:56:25] nuria: every
[18:56:55] nuria: pageview, projectview, webrequest, mediawiki-history
[18:57:12] joal: wait ...
[18:57:27] joal: i am totally confused then, i thought all those were text
[18:57:34] ebernhardson: SORRY!
[18:57:34] (back in 10)
[18:57:54] nuria: raw is text in sequence files, almost all the rest is parquet :)
[18:58:11] joal: oohhhh, my mistake ebernhardson
[18:58:14] nuria: no worries :) back to wondering though why hive can't read the tables created with insert overwrite table ... partition(...) select ...
[18:58:21] or the partitions actually
[18:58:25] ottomata, text has some minor differences: 5 hosts with 237, 160, 87, 64, 27, 17 respectively. It looks like a minor difference that we would have experienced anyway.
[18:58:41] joal: doesn't parquet have different compression modes?
[18:58:51] nuria: it does
[18:59:12] ahh, i set parquet.compression = SNAPPY in the query (because i was looking at webrequest refine), maybe my issue is happening there?
[18:59:22] nuria: we usually do snappy, which is default
[18:59:28] oh hmm, then maybe not
[18:59:45] ebernhardson: ?
[18:59:57] oh sorry, missed a line
[19:00:27] ebernhardson: I wonder if things could come from your complex structures rather than the actual feeding query
[19:00:40] ebernhardson: What about the case of empty, or nulls, for instance?
[19:00:43] ebernhardson: that is some query
[19:00:53] nuria: yes :(
[19:02:20] it joins search logs against webrequest logs, records which results were clicked, then sessionizes. this will be used to a) feed a clickmodel that predicts relevance based on query behavior and b) start looking into generating session-based metrics instead of per-query metrics
[19:03:37] ebernhardson: I need to leave now for dinner, will backlog the chan tomorrow morning and see if I can solve it if not yet done :)
[19:03:49] joal: thanks! i'll poke at this some more as well
[19:03:59] have a good evening a-team!
[19:04:07] bye ebernhardson :)
[19:04:08] bye joal!
[19:04:15] bye!
[19:04:34] laters!
[19:13:56] ebernhardson: and (asking honestly) you prefer sql over writing similar things in scala?
[19:14:21] nuria: i dunno if prefer is the right answer, but i've found it quite annoying to figure out how to handle memory usage in spark
[19:14:50] ebernhardson: right, and hive just figures it out
[19:15:07] mforns_brb: ok, so that looks fine then
[19:15:37] writing in sql certainly has its warts, and not being able to break it up into functions makes it harder to read, probably much harder to debug in 6 months after forgetting exactly how it all works though
[19:16:27] i suppose i could create temporary tables to make it a little easier to read through, not sure what that does to query planning though
[19:16:59] ebernhardson: use spark instead?
[19:17:08] ebernhardson: right, i think the type of syntax we use for last access is a lot easier to read: see:
[19:17:23] (oh sorry, should read all backlog before I comment)
[19:17:26] ottomata: well yes i think that's what nuria was suggesting as well :)
[19:18:42] ebernhardson: with / as, see: https://github.com/wikimedia/analytics-refinery/blob/master/oozie/last_access_uniques/monthly/last_access_uniques_monthly.hql
[19:19:05] ebernhardson: the with / as syntax kind of helps document the query
[19:19:09] nuria: oh interesting, i didn't know you could do that in hive
[19:19:15] (or really in any sql)
[19:19:27] that does look easier to comprehend
[19:20:02] ebernhardson: that way it is easier to troubleshoot every chunk on its own, especially useful for self joins
[19:56:48] 06Analytics-Kanban, 10Mobile-Content-Service, 06Wikipedia-Android-App-Backlog, 13Patch-For-Review, 07Spike: [Feed] Establish criteria for blacklisting likely bot-inflated most-read articles - https://phabricator.wikimedia.org/T143990#2866027 (10Mholloway) back from the dead: {T153001}
[20:46:09] 06Analytics-Kanban, 06Discovery, 06Discovery-Analysis (Current work), 03Interactive-Sprint: Add Maps tile usage counts as a Data Cube in Pivot - https://phabricator.wikimedia.org/T151832#2866245 (10Yurik) The most important aspect missing here is if the usage (referer?) came from: external site, wmflabs (...
[20:52:18] mmmmm: you still around?
[20:52:35] yes
[20:54:11] hey, so, i'm trying to set this up too, and having trouble with my local vagrant and internet dling all the packages, but i had an idea for you
[20:54:18] do you need hive/spark, etc?
[20:54:31] probably not, right? you just want hadoop? so you can launch flink in yarn?
[20:55:04] mmmmm: ^
[20:55:54] first a working hadoop role would be enough. but in general i also need cirrussearch, eventlogging, wikidata, ...
[20:56:51] ottomata
[20:56:56] aye, well, the analytics role runs a lot of stuff other than just hadoop
[20:57:03] you might be able to get away with vagrant roles enable hadoop
[20:57:09] instead of all of analytics
[20:57:10] also
[20:57:19] if the hdfs fuse mount is causing you problems
[20:57:27] you could just comment that out in the hadoop role class before provisioning
[20:58:03] with only the hadoop role enabled it's the same problem
[20:58:13] see you tomorrow a-team!
[20:58:20] (03PS1) 10Phantom42: Monthly request stats per article title [analytics/aqs] - 10https://gerrit.wikimedia.org/r/326545 (https://phabricator.wikimedia.org/T139934)
[20:58:29] bye fdans! see you :]
[20:58:40] nite fdans
[21:01:09] mmmmm: what exactly is the problem? there were lots of issues in that thread
[21:01:13] what's your current issue?
[21:01:30] the fuse mount?
[21:02:55] ottomata: yes and some weird lxc-attach error
[21:03:05] oh
[21:03:16] i don't know much about lxc-attach, but maybe it's causing problems with fuse?
[21:03:28] not sure if it depends on the lab instance or generally on lxc-attach
[21:03:33] mmmmm: can you edit the hadoop.pp role file?
[21:03:57] comment out this stuff: https://github.com/wikimedia/mediawiki-vagrant/blob/master/puppet/modules/role/manifests/hadoop.pp#L89-L98
[21:04:07] currently not. but tomorrow i'll have a look at it. thanks for the support.
[21:04:42] ok
[21:05:00] will comment this suggestion on ticket
[21:05:14] 06Analytics-Kanban, 10MediaWiki-Vagrant: Cannot enable 'analytics' role on Labs instance - https://phabricator.wikimedia.org/T151861#2830159 (10Ottomata) I don't know what is up at all with lxc-attach, but if fuse is causing you problems, try commenting it out and re-provisioning: https://github.com/wikimedia...
[21:06:54] Krenair: any more thoughts on this? https://gerrit.wikimedia.org/r/#/c/325589/
[21:07:42] ottomata, haven't had a chance to look at it
[21:07:45] likely won't get a chance
[21:07:57] I'm drowning in email
[21:08:19] haha, ok
[21:08:20] hmmm
[21:08:24] need to find a reviewer!
[21:17:57] any best practices for taking data that is calculated hourly, and collapsing it into daily partitioned tables? Best i can figure, my problem from earlier today is due to having an array<struct<...>> where all elements are null for ~80% of data due to a left join, so removing the left join, but then the hourly data is 10k-50k records depending on time of day, which is quite small
[21:18:21] i suppose i could also just re-do the aggregation to do daily, but it seems aggregating a day's worth of webrequest is a pretty heavy operation to do all at once
[21:19:19] (when i tried it on an unloaded cluster, hive took ~1TB of memory. Although i imagine it will work with fewer instances and more time)
[21:20:20] ebernhardson: that's what our druid job does for pageviews_daily, but what do you mean best practices, it's a simple group by in the case of hourly -> daily
[21:20:34] milimetric: so just have two tables then, and delete from the hourly table when not needed anymore?
[21:20:55] yeah, that works, yep
[21:20:59] ok sounds reasonable
[22:37:41] (03PS1) 10MaxSem: Add stuff from logging phases 2 and 3 [analytics/discovery-stats] - 10https://gerrit.wikimedia.org/r/326831 (https://phabricator.wikimedia.org/T152559)
[23:32:46] (03CR) 10Nuria: "Couple comments, looks good. I think more work is needed in tests and error messages." (033 comments) [analytics/aqs] - 10https://gerrit.wikimedia.org/r/326545 (https://phabricator.wikimedia.org/T139934) (owner: 10Phantom42)
[23:46:02] (03CR) 10Milimetric: [C: 04-1] Fixing tests on master (034 comments) [analytics/dashiki] - 10https://gerrit.wikimedia.org/r/326152 (https://phabricator.wikimedia.org/T152631) (owner: 10Nuria)
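Pulling the last two threads together, here is a hedged sketch of the hourly-to-daily collapse milimetric describes, written in the with / as style nuria recommended earlier. Table and column names are invented for illustration, not the real query_clicks tables:

```
-- Collapse one day of hourly partitions into a single daily partition.
WITH day_of_data AS (
  SELECT query, session_id, click_count
  FROM query_clicks_hourly           -- hypothetical hourly table
  WHERE year = 2016 AND month = 12 AND day = 7
)
INSERT OVERWRITE TABLE query_clicks_daily PARTITION (year = 2016, month = 12, day = 7)
SELECT query, session_id, SUM(click_count) AS click_count
FROM day_of_data
GROUP BY query, session_id;

-- Once the daily partition is verified, a partial partition spec drops
-- all 24 hourly partitions for that day at once.
ALTER TABLE query_clicks_hourly DROP PARTITION (year = 2016, month = 12, day = 7);
```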