[07:39:19] * elukey reads emails
[07:40:04] * elukey has also read nuria_'s statement about "some experiments" :D
[08:58:28] so aqs100[123] seems to have an auth timeout issue again
[08:58:36] and this doesn't make any sense
[09:57:14] Analytics-Kanban: AQS Cassandra READ timeouts caused an increase of 503s - https://phabricator.wikimedia.org/T143873#2581800 (elukey)
[11:37:32] hi a-team :]
[11:41:59] o/
[11:42:17] mforns: when you have time can you give me a summary of what has been done with AQS?
[12:32:23] hey elukey, sorry I missed your message
[12:33:03] * elukey is sad since Marcel doesn't pay attention to him
[12:33:10] :D
[12:33:13] sure! basically, we loaded a couple of months successfully, the way joseph was doing it before leaving on vacation
[12:33:14] xDDD
[12:33:54] and as everything was going fine, nuria had the idea to test how loading and compaction would interact
[12:34:24] we thought this was interesting so we launched a loading job while the compaction of the last job was still running
[12:35:14] nuria_ also did a couple of network load tests with satisfactory results, you have received the emails she sent to internal
[12:35:24] yes yes
[12:35:33] elukey, do you want more detail in the batcave?
[12:35:36] but we already knew the answer about loading + compacting, sadly
[12:35:42] nono all good
[12:36:01] something weird happened on aqs100[123] so I wanted to know if anything was done while I was away
[12:36:14] elukey, is there any problem with loading and compacting at the same time?
[12:37:44] in aqs100[123]? I don't recall, what did you see?
[12:37:53] the main problem is when the two things are "big", like in this case.. we use leveled compaction now and it takes a lot of time by itself, because it has to guarantee certain constraints (as Nuria observed, the avg number of sstables read is much lower than in the current cluster).
[12:38:23] so each time you load while compacting, you cause more work for cassandra to keep up with consistency and constraints
[12:38:43] this is why it is taking a huge amount of time now
[12:38:48] it will finish eventually
[12:39:03] anyway, nothing problematic :)
[12:39:05] elukey, oh so the problem you see is just time? ok
[12:39:18] ah yes time, not consistency (at least, afaik)
[12:39:40] about aqs100[123]: READ timeouts rose again causing 503s, starting from the 17th
[12:39:43] and I can't find a reason
[12:39:44] :(
[12:40:42] elukey, I can not see in the cassandra compaction dashboard that combining loading and compacting is slowing down the process...
[12:41:15] it looks like compaction stops while loading, and then immediately after loading compaction resumes
[12:41:56] I think what makes compaction slower is the amount of data in the cluster
[12:42:06] and the amount of data loaded, of course
[12:42:48] a month of data is very big and overflows a lot of compaction levels, so they need to be fully recompacted, and this takes a lot of time, no?
[12:43:06] yes but afaiu we are doing it for two months right?
[12:43:23] we are loading one month at a time
[12:43:48] ok ok maybe I wrote too little
[12:44:05] the last two months have been loaded one at a time
[12:44:20] but the second one was loaded while compaction of the former was ongoing, right?
[12:44:28] this is why I said two at a time
[12:44:33] elukey, yes, I see
[12:44:38] I meant the whole loading+compaction process
[12:44:45] aha ok
[12:44:56] yes, and that's right for the last 3 months
[12:45:26] what I fear is that compacting two months will take longer than doing one at a time
[12:45:43] I see
[12:45:57] do you want to look at the graph together in da cave?
[12:47:12] mforns: nah don't worry, we can discuss them during standup so we'll get everybody's opinion
[12:47:20] elukey, makes sense
[12:48:07] elukey, anyway, welcome back! :] how was vacation?
[12:49:27] mforns: thanks!! It was really good.. Lisbon and Porto were lovely.. Food and drinks too!
[12:50:08] elukey, didn't know you were visiting Portugal, yes... food is amazing there
[12:51:39] Analytics, Analytics-Wikimetrics: Cannot remove invalid members from cohort - https://phabricator.wikimedia.org/T113454#2582137 (Lokal_Profil) Tried it on Chrome now. It works there, so it seems to be a Firefox-specific bug. Running Ubuntu 14.04
[12:52:00] Analytics, Analytics-Wikimetrics, Browser-Support-Firefox: Cannot remove invalid members from cohort - https://phabricator.wikimedia.org/T113454#2582138 (Lokal_Profil)
[14:12:38] Analytics, Beta-Cluster-Infrastructure, Services, scap, and 3 others: Set up AQS in Beta - https://phabricator.wikimedia.org/T116206#2582429 (elukey) Thanks for reporting, this is my bad since analytics_hadoop_hosts is not in hiera labs. Since this value should be removed soon I would prefer to n...
[14:13:45] brb, coffee!
[14:26:32] elukey: yt?
[14:27:27] elukey: yt? we can talk about compacting in the batcave, just let me know
[14:28:54] nuria_: hola! Don't worry, Marcel already explained everything, we can discuss it post-standup.. :)
[14:41:02] elukey: ok
[15:08:39] Analytics: Provide a ua-parser service using the One True Wikimedia UA-Parser™ - https://phabricator.wikimedia.org/T1336#2582645 (Milimetric) p:Triage>Normal
[15:10:40] Analytics-Kanban, Analytics-Wikimetrics, Browser-Support-Firefox: Cannot remove invalid members from cohort - https://phabricator.wikimedia.org/T113454#2582653 (Milimetric) p:Triage>Normal a:Milimetric
[15:51:02] Analytics: solving issues with scalability, querying and high-cardinality dimensions (maybe hosting in Druid, ClickHouse or other) - https://phabricator.wikimedia.org/T138269#2582765 (Nuria)
[16:18:27] Analytics: Show pageviews prior to 2015 in dashiki - https://phabricator.wikimedia.org/T143906#2582806 (Nuria)
[16:19:46] Analytics: Show pageviews prior to 2015 in dashiki - https://phabricator.wikimedia.org/T143906#2582821 (Nuria)
[16:19:50] Analytics: Deprecate reportcard - https://phabricator.wikimedia.org/T130117#2582820 (Nuria)
[16:20:48] Analytics: Deprecate reportcard - https://phabricator.wikimedia.org/T130117#2126320 (Nuria) We will stop updating reportcard pretty soon. https://analytics.wikimedia.org/dashboards/vital-signs/#projects=all,eswiki,itwiki,enwiki,jawiki,dewiki,ruwiki,frwiki/metrics=Pageviews FYI, reportcard doesn't get much tr...
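A note on the leveled compaction discussed above: in Cassandra it is a per-table setting, and its promise of few sstables per read is exactly what makes it expensive to keep up with during bulk loads. A minimal CQL sketch; the keyspace/table name here is a placeholder, not the real AQS schema:

```sql
-- Hypothetical table name for illustration only.
ALTER TABLE pageview_data.per_article
WITH compaction = {
    'class': 'LeveledCompactionStrategy',
    'sstable_size_in_mb': '160'
};
```

Each level holds non-overlapping sstables, so a read usually touches about one sstable per level; the trade-off is that a large bulk load overflows several levels at once and the data above has to be rewritten, which matches the very long compactions described above.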
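On whether loading stalls compaction or just enlarges it: the Grafana dashboard can be cross-checked from a shell on an aqs node with standard Cassandra tooling, e.g.:

```sh
# Pending and in-flight compactions, with per-task progress
nodetool compactionstats

# Throughput cap in MB/s (0 = unthrottled); raising it can drain a backlog
# faster, at the cost of more I/O contention with live reads
nodetool setcompactionthroughput 16
```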
[16:52:09] Analytics-Kanban: Productionize Pivot UI - https://phabricator.wikimedia.org/T138262#2582925 (Nuria) a:elukey
[16:55:50] Analytics-Kanban: AQS Cassandra READ timeouts caused an increase of 503s - https://phabricator.wikimedia.org/T143873#2582962 (Milimetric) p:Triage>Normal
[16:57:46] Analytics: Make reportupdater support passing the values of an explode_by using a file path - https://phabricator.wikimedia.org/T132481#2582966 (Milimetric) p:Triage>Normal a:mforns
[17:00:16] Analytics-Kanban: AQS Cassandra READ timeouts caused an increase of 503s - https://phabricator.wikimedia.org/T143873#2581800 (Nuria) SSTables read is increasing: https://grafana-admin.wikimedia.org/dashboard/db/aqs-cassandra-cf-sstables-per-read?from=1469552346187&to=1472144346187&var-node=aqs1001&var-node...
[17:11:08] Analytics, Easy: Implement re-run script for reportupdater - https://phabricator.wikimedia.org/T117538#2583012 (Milimetric) a:mforns
[17:15:28] (PS1) Bearloga: Add DuckDuckGo as a search engine [analytics/refinery/source] - https://gerrit.wikimedia.org/r/306698 (https://phabricator.wikimedia.org/T143287)
[17:17:07] (PS1) MaxSem: Import from GitHub [analytics/discovery-stats] - https://gerrit.wikimedia.org/r/306699
[17:18:40] (CR) Bearloga: "recheck" [analytics/refinery/source] - https://gerrit.wikimedia.org/r/306698 (https://phabricator.wikimedia.org/T143287) (owner: Bearloga)
[17:20:12] Analytics-Kanban, MW-1.28-release-notes, Patch-For-Review, WMF-deploy-2016-08-23_(1.28.0-wmf.16): Update MediaWiki hooks to generate data for new event-bus schemas - https://phabricator.wikimedia.org/T137287#2583033 (Nemo_bis)
[17:20:35] Analytics-Kanban, EventBus, MW-1.28-release-notes, Patch-For-Review, WMF-deploy-2016-08-23_(1.28.0-wmf.16): Update MediaWiki hooks to generate data for new event-bus schemas - https://phabricator.wikimedia.org/T137287#2363796 (Nemo_bis)
[17:20:35] (PS2) MaxSem: Import from GitHub [analytics/discovery-stats] - https://gerrit.wikimedia.org/r/306699 (https://phabricator.wikimedia.org/T143048)
[17:35:38] (CR) Nuria: [C: 2] Add DuckDuckGo as a search engine [analytics/refinery/source] - https://gerrit.wikimedia.org/r/306698 (https://phabricator.wikimedia.org/T143287) (owner: Bearloga)
[17:38:29] milimetric, debug? :]
[17:38:39] omw
[17:42:00] nuria_: thanks! :D
[17:47:53] going afk team! talk with you tomorrow!
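On Nuria's observation at 17:00 that sstables read per query is increasing: the same signal behind that Grafana dashboard is available per table via nodetool. A sketch (the subcommand is named cfhistograms on Cassandra 2.x, tablehistograms from 3.0):

```sh
# The SSTables column shows how many sstables a single read touches at each
# percentile; on a healthy leveled-compaction table it stays near 1-2, and a
# climbing value lines up with READ timeouts surfacing as 503s.
nodetool cfhistograms <keyspace> <table>
```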
[18:28:17] Analytics: Replacing vital signs metrics in dashiki with data from new edit data depot - https://phabricator.wikimedia.org/T143924#2583423 (Nuria)
[18:30:59] Analytics-Kanban, EventBus, Wikimedia-Stream: Productionize Public Event Stream Prototype - https://phabricator.wikimedia.org/T143925#2583449 (Nuria)
[18:52:32] Analytics: Global Unique Devices Counts - https://phabricator.wikimedia.org/T143927#2583508 (Nuria)
[18:54:51] Analytics: Add global last-access cookie for top domain (*.wikipedia.org) - https://phabricator.wikimedia.org/T138027#2583527 (Nuria)
[18:54:52] Analytics: Global Unique Devices Counts - https://phabricator.wikimedia.org/T143927#2583526 (Nuria)
[18:55:07] Analytics: Hive code to count global devices - https://phabricator.wikimedia.org/T143928#2583528 (Nuria)
[18:56:26] Analytics: Hive code to count global unique devices per project - https://phabricator.wikimedia.org/T143928#2583528 (Nuria)
[18:56:57] Analytics-Kanban: Make reportupdater support passing the values of an explode_by using a file path - https://phabricator.wikimedia.org/T132481#2583544 (Nuria)
[18:57:18] Analytics-Kanban, Easy: Implement re-run script for reportupdater - https://phabricator.wikimedia.org/T117538#2583545 (Nuria)
[19:04:20] milimetric, so I've been trying a couple of things, and it seems the large number of stages was due to the subgraph algorithm
[19:04:42] but it ended up working and finishing the partitioning,
[19:05:27] I got another error now, which seems to be related to the ordering function that sorts move events first (before delete events)
[19:06:03] The error message makes sense, but the thing is... how the hell did that work before
[19:07:08] I fixed that and am executing it again, let's see
[19:18:38] Analytics: Hive code to count global unique devices per top domain (like *.wikipedia.org) - https://phabricator.wikimedia.org/T143928#2583578 (Nuria)
[19:25:29] leila: do we have anywhere the coms material about wikipedia's ranking and such?
[19:25:47] leila: I *think* you were working on that with Erik Z
[19:33:40] mforns: sorry, missed your ping
[19:34:08] I see, so after you added .count and print lines, it didn't time out anymore, without other changes?
[19:35:35] hey nuria_. when you say "coms" you mean from the "coms" team?
[19:35:55] "comms"
[19:36:06] or you mean the draft proposal that Erik and I had?
[19:36:19] and we talked to Toby, Wes, Comms team about?
[19:38:20] leila: sorry, yes comms
[19:38:20] leila: right
[19:38:32] mforns: FYI, ash updated friday meeting, I will not be able to attend
[19:38:43] mforns: but that is fine
[19:43:56] (CR) Gehel: "LGTM. Very minor comment (inline). Note that my PHP-fu is close to non existent." (1 comment) [analytics/discovery-stats] - https://gerrit.wikimedia.org/r/306699 (https://phabricator.wikimedia.org/T143048) (owner: MaxSem)
[19:46:19] nuria_: we don't have access to any documentation from the Comms team.
[19:46:41] we met with them and they shared some background verbally, but no documents, as far as I know.
[19:46:46] leila: sorry, I meant the "draft proposal"
[19:47:02] ah, nuria_. that I can share with you.
[19:47:07] fine in a few hours?
[19:47:16] leila: ya, sure
[19:47:33] (I need to write a bit for you about what state it's in, so you don't end up spending a lot more time on it than it deserves at this stage)
[19:47:38] I'll do that, nuria_, then.
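Background for milimetric's .count-and-print question at 19:34 (answered just below): Spark transformations are lazy, so inserting an action such as count(), ideally after cache(), materializes the lineage at that point and splits one huge job into smaller stages. A sketch with hypothetical names, not the real job's code:

```scala
import org.apache.spark.rdd.RDD

case class Event(pageId: Long, timestamp: Long, eventType: String)

def buildSubgraphs(events: RDD[Event]): RDD[(Long, Seq[Event])] = {
  // cache() keeps the grouping in memory; count() is an action that forces
  // the lineage up to here to actually run, once.
  val subgraphs = events.groupBy(_.pageId).cache()
  println(s"subgraphs built: ${subgraphs.count()}")

  // Downstream stages now start from the cached RDD instead of recomputing
  // the whole lineage, which can avoid timeout-prone mega-stages.
  subgraphs.mapValues(_.toSeq.sortBy(_.timestamp))
}
```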
[19:48:29] milimetric, yes, after I added the count and print, no more timeouts
[19:48:45] milimetric, and after fixing the comparison problem, it ran until the end.
[19:49:11] milimetric, I'm going to remove the count and try it again
[19:49:27] nuria_, no problem
[19:49:54] nuria_, let me know if you want to shift it or just skip it
[19:50:10] mforns: I will skip it, that is np
[19:50:15] mforns: we can talk after
[19:50:40] interesting, hope it was just the comparison problem
[19:51:04] milimetric, yes I also hope that, but the thing is, this comparison function was always like that
[19:51:30] nuria_, ok
[19:53:35] (PS3) MaxSem: Import from GitHub [analytics/discovery-stats] - https://gerrit.wikimedia.org/r/306699 (https://phabricator.wikimedia.org/T143048)
[20:24:46] milimetric, it worked
[20:24:52] * mforns vets data
[20:24:57] oh no!
[20:25:06] that's the worst, now we'll never know what was wrong :)
[20:25:56] xD, my guess is that executing everything all together was masking the comparison error within other more cryptic errors
[20:26:40] and the condition that made the comparison function break its contract was comparing 2 move events with the same timestamp
[20:27:03] within the same subgraph!
[20:27:18] so this was probably not happening in the tests joseph did
[20:28:14] and now that we tried all sqooped wikis together, this happened
[20:29:40] the comparison function was breaking its contract because if A and B are move events with the same timestamp, it would evaluate A < B as true, and B < A as true as well
[20:30:24] it's surprising that scala checks that at execution time...
[20:41:44] nuria_, the candidate moved the meeting to 4pm Spanish time, so I guess you won't be there. Are there any questions about the task you want me to pass to them, cc milimetric
[20:41:52] ?
[20:42:01] mforns: nah, your take on it is fine.
[20:42:11] ok
[20:42:33] yeah mforns, I'll just read your notes
[20:42:38] ok
[21:39:53] milimetric, the page history data looks good! I'll move the task to done and when we meet again we can work on the mega-query, no?
[21:40:09] thanks mforns, looking forward to it
[21:40:25] ok, logging off a-team! see you tomorrow
[21:40:32] nice weekend milimetric!
[21:40:36] the latest join query I have is in my patch (one sec)
[21:40:39] https://gerrit.wikimedia.org/r/#/c/306292/1/hive/mediawiki/edit-history/rebuild-history-from-intermediate.sql
[21:40:42] nite mforns
[21:40:47] see you monday
[21:40:54] ok milimetric, I'll have a look
[21:41:05] thanks!
[21:41:13] (we can look together, just letting you know where it is)
[22:26:45] (PS2) Milimetric: [WIP] Script sqooping mediawiki tables into hdfs [analytics/refinery] - https://gerrit.wikimedia.org/r/306292 (https://phabricator.wikimedia.org/T141476)
[22:26:47] (CR) Milimetric: "* Let's default to using underscores in filenames that are not executables, unless there is a reason to do so. E.g. create_project_namespa" (3 comments) [analytics/refinery] - https://gerrit.wikimedia.org/r/306292 (https://phabricator.wikimedia.org/T141476) (owner: Milimetric)
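Two closing sketches for the threads above. First, the contract violation mforns describes at 20:29: when two move events share a timestamp, a comparator that always answers "move sorts first" makes both A < B and B < A true, and the JVM's TimSort detects the inconsistency at runtime with "Comparison method violates its general contract!", which is why Scala appears to check it at execution time. A hedged reconstruction with hypothetical types, not the real job's classes:

```scala
case class Event(timestamp: Long, eventType: String) // "move" or "delete"

// Broken: on a timestamp tie between two move events this always returns -1,
// so compare(a, b) < 0 and compare(b, a) < 0 both hold.
val broken = new Ordering[Event] {
  def compare(a: Event, b: Event): Int =
    if (a.timestamp != b.timestamp) a.timestamp.compare(b.timestamp)
    else if (a.eventType == "move") -1
    else 1
}

// Fixed: move still sorts before delete, but a tie between two events of the
// same type compares as 0, restoring antisymmetry.
val fixed = new Ordering[Event] {
  private def rank(t: String): Int = if (t == "move") 0 else 1
  def compare(a: Event, b: Event): Int = {
    val byTime = a.timestamp.compare(b.timestamp)
    if (byTime != 0) byTime else rank(a.eventType).compare(rank(b.eventType))
  }
}
```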
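Second, on the [WIP] sqooping patch at 22:26: the general shape of a Sqoop import of one MediaWiki table into HDFS is roughly as follows; host, credentials, and paths are placeholders, not what the refinery script actually uses:

```sh
sqoop import \
  --connect jdbc:mysql://db-host.example.org/enwiki \
  --username research \
  --password-file /user/hdfs/mysql.password \
  --table revision \
  --target-dir /wmf/data/raw/mediawiki/revision/wiki_db=enwiki \
  --as-avrodatafile \
  --num-mappers 4
```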