[11:23:22] (PS20) Milimetric: Oozify sqoop import of mediawiki tables [analytics/refinery] - https://gerrit.wikimedia.org/r/303339 (https://phabricator.wikimedia.org/T141476)
[11:54:02] (PS21) Milimetric: Oozify sqoop import of mediawiki tables [analytics/refinery] - https://gerrit.wikimedia.org/r/303339 (https://phabricator.wikimedia.org/T141476)
[12:46:10] (CR) Milimetric: "This is now ready for review." [analytics/refinery] - https://gerrit.wikimedia.org/r/303339 (https://phabricator.wikimedia.org/T141476) (owner: Milimetric)
[14:53:45] joal, elukey: this may or may not be of interest to you (since you have one small cluster), but: https://wikitech.wikimedia.org/wiki/Cassandra/Tools/cdsh
[14:54:47] joal, elukey: https://phabricator.wikimedia.org/T132958 and https://github.com/eevans/cassandra-tools-wmf are actually probably of more interest
[15:00:29] urandom, joal and elukey are ooo :]
[15:00:48] mforns: k, thanks!
[15:04:56] Analytics-Kanban, Editing-Analysis, Documentation: Remove outdated docs regarding dashboard info - https://phabricator.wikimedia.org/T137883#2560828 (Nuria) a:Nuria
[15:11:59] am i dropped from hangout?
[15:11:59] milimetric: anyway: https://github.com/ottomata/kasocki
[15:28:50] Analytics-Wikimetrics: Scheduled reports 404 despite alleged success - https://phabricator.wikimedia.org/T143218#2560966 (Nemo_bis)
[16:26:33] mobrovac: heya yt?
[16:33:24] ottomata: meetingsssss
[16:33:35] ayyye, k, i leave you with a question, answer whenever
[16:33:45] does this work?
[16:33:46] https://github.com/wikimedia/operations-puppet/blob/production/modules/role/lib/puppet/parser/functions/kafka_cluster_name.rb#L30
[16:33:48] as documented?
[16:33:50] if you do
[16:34:04] kafka_cluster_name('prefix', 'site')
[16:34:09] prefix = args.pop
[16:34:13] will make prefix == 'site'
[16:40:37] Analytics-Kanban, Editing-Analysis, Documentation: Remove outdated docs regarding dashboard info - https://phabricator.wikimedia.org/T137883#2561282 (Nuria) I have spent a couple of hours updating some of those pages. In my opinion we should simply delete all outdated content from mediawiki under Analytic...
[16:42:44] mobrovac: ^^^ when you have a free min
[16:42:57] i think it might be a bug, that somehow we haven't run into
[16:43:00] i think it should be args.shift
[16:57:24] Analytics-Kanban, Editing-Analysis, Documentation: Remove outdated docs regarding dashboard info - https://phabricator.wikimedia.org/T137883#2561312 (Nuria) I have also moved loads of mediawiki pages to archive.
[17:07:46] indeed ottomata, you are correct, sir!
[17:07:48] it's a bug!
[17:08:16] ok just checking, that's a tricky thing to change, but i guess if we have no problems with it, it isn't affecting us
[17:08:18] will fix. thanks!
[17:10:00] thnx for noticing!
[17:49:40] Analytics-Kanban, Editing-Analysis, Documentation: Remove outdated docs regarding dashboard info - https://phabricator.wikimedia.org/T137883#2561453 (Nuria) Again, the majority of these pages should be deleted, calling this ticket done for now
[17:52:04] Analytics-Cluster, Analytics-Kanban: Make cross DC zookeeper_hosts hiera lookups possible for Kafka - https://phabricator.wikimedia.org/T143232#2561463 (Ottomata)
[18:22:58] Analytics-Kanban: Compile a request data set for caching research and tuning - https://phabricator.wikimedia.org/T128132#2561623 (Nuria) Ping @Danielsberger
[18:23:59] (CR) Ottomata: "Looks like we should talk about reorganizing the oozie/mediawiki directory a bit, eh?" (10 comments) [analytics/refinery] - https://gerrit.wikimedia.org/r/303339 (https://phabricator.wikimedia.org/T141476) (owner: Milimetric)
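The kafka_cluster_name bug discussed above (16:33-17:10) comes down to argument order: Ruby's Array#pop takes the last element while Array#shift takes the first, so with the documented call kafka_cluster_name('prefix', 'site') the pop grabs 'site'. A minimal Python analogy of the broken and fixed handling; this is a sketch, not the actual Puppet parser function:

```python
# Python analogy for the Ruby bug discussed above: Ruby's Array#pop removes
# from the END of the array (like Python's list.pop()), while Array#shift
# removes from the FRONT (like Python's list.pop(0)).

def kafka_cluster_name_buggy(*args):
    args = list(args)
    prefix = args.pop()    # takes the LAST argument, so 'site' becomes the prefix
    return prefix

def kafka_cluster_name_fixed(*args):
    args = list(args)
    prefix = args.pop(0)   # takes the FIRST argument, the equivalent of args.shift
    return prefix

print(kafka_cluster_name_buggy('prefix', 'site'))  # -> 'site'   (the bug)
print(kafka_cluster_name_fixed('prefix', 'site'))  # -> 'prefix' (as documented)
```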
[18:57:36] (PS7) Nuria: Bug fixes on datepicker [analytics/dashiki] - https://gerrit.wikimedia.org/r/303693 (https://phabricator.wikimedia.org/T141165)
[18:58:03] mforns: pushed fix for timeshift issue. our dates are absolute so no need to convert them to utc offsets
[18:58:16] mforns: tested in FF so ready to merge
[19:04:26] nuria_, will look at it now
[19:04:31] mforns: k
[19:10:12] (CR) Mforns: [C: 2 V: 2] "LGTM!" [analytics/dashiki] - https://gerrit.wikimedia.org/r/303693 (https://phabricator.wikimedia.org/T141165) (owner: Nuria)
[19:12:24] mforns: ok, will take a couple hours to see if i can come up with a bookmark for dates, if not i will just deploy this changeset
[19:19:14] nuria_, ok, let me know if you need to brainbounce
[19:19:23] mforns: will do
[19:20:33] mforns: looks like compaction is proceeding well, it is identical to last time https://grafana.wikimedia.org/dashboard/db/aqs-cassandra-compaction
[19:20:46] nuria_, yes I've seen
[19:21:03] mforns: we will see but it should be done tomorrow
[19:21:28] nuria_, mmmm... I'd say it will take a bit more
[19:21:44] mforns: ya, cause slope is less pronounced 2nd time around
[19:21:47] right?
[19:22:02] nuria_, note that the real compaction has not started yet, it's only preparing data
[19:22:16] the slope has not started yet
[19:22:37] and I guess when it starts, this time will take around 28 hours
[19:22:41] mforns: true true
[19:23:11] mforns: which ahem ... scary thought
[19:23:19] as things progress
[19:23:36] nuria_, I guess it will finish in like 40 hours
[19:23:46] yes...
[19:24:23] it looks like it's proportional to the size of the whole data, ~7 hours per month
[19:25:11] and as we have 13 months to backfill...
[19:26:05] nuria_, it looks like the last compaction will take 4 days, no?
[19:26:54] mforns: that seems a lot even if it is linear with # of months
[19:27:15] nuria_, I wonder how this compaction is done on regular input (not backfilling)
[19:27:26] mforns: data is loaded by day
[19:27:31] not by month like now
[19:27:51] nuria_, yes, but... it looks like all the data is recompacted after loading, no?
[19:31:08] that'd be scary because it would mean it would have 4 days of compaction to do every day. So that can't be right...
[19:32:07] milimetric, aha
[19:37:23] urandom: yt?
[19:37:47] nuria_: maybe?
[19:37:51] :)
[19:37:52] jaja
[19:38:21] urandom: available for questions that is, it is ok to say no, but WE WILL FIND YOU
[19:38:46] GOOD LUCK
[19:38:48] :)
[19:38:58] nuria_: no, i can answer questions
[19:38:58] sooo
[19:39:05] nuria_: should i have a lawyer present for this?
[19:39:12] urandom: on the topic of cassandra compaction
[19:39:32] jaja, nah you know this from the top of your head i bet
[19:39:49] urandom: take a look at the dashboard: https://grafana.wikimedia.org/dashboard/db/aqs-cassandra-compaction
[19:40:11] ok
[19:40:27] urandom: compaction is taking longer and longer as we load more data (we are loading months one by one). loading is fast but compaction is progressively slower
[19:41:42] urandom: look at the last week so you can see compaction since we rebuilt the array
[19:41:56] nuria_: when did you rebuild the array?
[19:42:07] ~8th?
[19:42:20] urandom: 8/10
[19:43:13] urandom: I guess i do not understand why compaction will take longer and longer when the amount of data compacted is the same (1 month)
[19:44:07] leveled compaction is... well, leveled
[19:44:21] and the more data you have, the more levels are used
[19:45:36] when you go over the threshold in 0, some tables are merged into the tables in 1, and when 1 has too many, some are merged into 2, and so on
[19:45:47] each level has an order of magnitude more tables than the last
[19:45:54] all tables are the same size
[19:46:22] so the bigger your data set, the more compaction across levels you get
[19:47:03] urandom: but wait ...
[19:47:41] * urandom wishes he had a whiteboard
[19:47:57] urandom: the spillover into a newer level is always approximately the same (as the data we are loading for 1 month is approximately of the same size)
[19:48:55] there is an amplification, you push data into level 0, and require some of it to be merged into level 1, exceeding the threshold and causing some of that to be merged into level 2
[19:49:37] urandom: and what defines the levels in which we are pushing?
[19:49:46] everything goes into level 0
[19:50:05] and everything moves through the levels sequentially
[19:51:53] nuria_: this is from aqs1004 right now
[19:51:58] nodetool-a tablestats -- local_group_default_T_pageviews_per_article_flat.data
[19:52:03] SSTables in each level: [14/4, 10, 101/100, 1045/1000, 328, 0, 0, 0, 0]
[19:52:18] they start at left
[19:52:28] level 0 has 14 of 4
[19:52:37] level 1 has 10
[19:53:25] when compactions in level 0 manage to get the number of tables at or below the target (of 4), it'll be because it merged data into level 1
[19:53:29] which currently has 10/10
[19:53:52] so that will violate the threshold of level 1, and cause data to be merged into level 2, which currently has 101/100 (already too many)
[19:54:26] level 2 (101/100) is compacting now, merging tables into level 3 (1045/1000)
[19:54:43] level 3 is already over by 45 tables, and is compacting into level 4
[19:55:02] and, one thing to keep in mind, there is no concurrency within a level
[19:55:15] you have one compactor thread, which is likely cpu bound due to compression
[19:56:28] nuria_: when you started, you didn't have this many levels
[19:56:43] and with time the number of them will grow
[19:56:53] urandom: ahem.. does this mean that even small data pushes might trigger huge compaction times, correct?
[19:57:06] yeah, it could
[19:57:27] nuria_: it's doing size-tiered compaction in level 0 though
[19:57:50] so getting behind for a while might not be the end of the world
[19:58:32] obviously once you've back-filled, you're going to want to be able to get caught up, and stay current
[19:59:11] urandom: do levels exist per cassandra instance then?
[19:59:17] yes
[19:59:35] urandom: so fewer cassandra instances, less time in compaction
[19:59:56] no, the opposite
[20:00:11] urandom: wait how so?
[20:00:25] more instances mean a smaller dataset size per-instance, which means less required compaction throughput (per instance)
[20:00:41] nuria_: because you're applying the algorithm on a smaller set of data
[20:01:04] urandom: ok, yes, in our case since data is replicated and we have 6 instances
[20:01:37] urandom: i think it is replicated 3 times..
[20:01:46] yeah
[20:02:04] urandom: we should see lower compaction times when we add the couple nodes we want to add
[20:02:25] urandom: seems that our array is much too small in terms of instances of cassandra for the replication level we have
[20:02:37] your array?
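A minimal sketch of the LeveledCompactionStrategy behaviour urandom describes above, assuming the defaults he quotes (a target of 4 SSTables in level 0, and each level N >= 1 holding roughly 10^N fixed-size SSTables, an order of magnitude more per level). The level_targets/overfull_levels helpers are illustrative, not Cassandra code:

```python
# A sketch (not Cassandra code) of how leveled compaction level occupancy
# behaves, following urandom's explanation above.

def level_targets(num_levels, l0_target=4):
    """Target SSTable counts per level: [4, 10, 100, 1000, ...]."""
    return [l0_target] + [10 ** n for n in range(1, num_levels)]

def overfull_levels(counts, l0_target=4):
    """Which levels are over threshold and will cascade merges upward."""
    targets = level_targets(len(counts), l0_target)
    return [(lvl, c, t) for lvl, (c, t) in enumerate(zip(counts, targets)) if c > t]

# The aqs1004-a snapshot quoted above:
#   SSTables in each level: [14/4, 10, 101/100, 1045/1000, 328, 0, 0, 0, 0]
snapshot = [14, 10, 101, 1045, 328, 0, 0, 0, 0]

for lvl, count, target in overfull_levels(snapshot):
    print(f"L{lvl}: {count}/{target} -> must merge SSTables into L{lvl + 1}")
# L0 (14/4), L2 (101/100) and L3 (1045/1000) are over target, so a single flush
# into L0 can cascade merges through several levels. With only one compactor
# thread per level, compaction time grows with the total data held by an
# instance, and shrinks when more instances each hold a smaller share.
```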
[20:02:38] urandom: in terms of instances and boxes
[20:02:44] urandom: our cluster, sorry
[20:02:50] well
[20:03:06] i wonder if leveled compaction is the right set of trade-offs
[20:03:56] nuria_: oh, btw, this is the metric you want to keep an eye on: https://grafana-admin.wikimedia.org/dashboard/db/aqs-cassandra-cf-sstables-per-read?panelId=5&fullscreen&from=1471377813796&to=1471464213797&var-node=aqs1004-a&var-node=aqs1004-b&var-node=aqs1005-a&var-node=aqs1005-b&var-node=aqs1006-a&var-node=aqs1006-b&var-keyspace=local_group_default_T_pageviews_per_article_flat&var-columnfamily=data&var-quantiles=99percentile
[20:04:02] (PS22) Milimetric: Oozify sqoop import of mediawiki tables [analytics/refinery] - https://gerrit.wikimedia.org/r/303339 (https://phabricator.wikimedia.org/T141476)
[20:04:24] (CR) Milimetric: Oozify sqoop import of mediawiki tables (10 comments) [analytics/refinery] - https://gerrit.wikimedia.org/r/303339 (https://phabricator.wikimedia.org/T141476) (owner: Milimetric)
[20:05:07] nuria_: aside from garbage collection, compaction's main job is optimizing the data at rest for reads, and this dashboard will tell you how well that is working
[20:05:50] nuria_: and in your case, the 99p is for 3 tables to be consulted for a read
[20:05:56] (which is really good)
[20:06:24] if compaction gets too far behind, you'll see that number climb
[20:06:48] urandom: i see, on the current prod cluster that number is 15
[20:06:56] right
[20:07:02] which is... not as good
[20:07:15] urandom: but given that the newer cluster doesn't have traffic it cannot be compared, other than with the load tests i did yesterday
[20:07:28] nuria_: ok
[20:07:39] urandom: remember the new cluster is not doing anything
[20:07:46] yeah, this might not be The Number then
[20:07:49] urandom: so the metric is not meaningful yet
[20:08:12] but it's still what you want to use to determine the efficacy (when it becomes relevant)
[20:08:32] urandom: but i will load test for longer once compaction is over and try to keep an eye on that, thank you!
[20:08:45] Analytics-Wikistats: Ireland in Tagalog, Bengali and Urdu Wikipedia traffic breakdown - https://phabricator.wikimedia.org/T143254#2562053 (Nemo_bis)
[20:09:00] nuria_: you all might do better with TWCS (time-windowed compaction)
[20:09:16] urandom: i think joal looked into that and it had several issues
[20:09:24] oh?
[20:09:39] i know they tested with date-tiered
[20:09:55] that might be what the 3 prod nodes are running now
[20:09:57] urandom: ah yes
[20:10:06] urandom: it requires a new version of cassandra, right?
[20:10:13] twcs is out-of-tree for 2.2.6, we'd need to build a jar
[20:10:25] it's being incorporated upstream though
[20:10:47] urandom: ya, see, it doesn't seem production ready yet
[20:10:54] and there are quite a lot of people using it at this point, even out-of-tree on 2.2.x
[20:11:08] i'd chance it
[20:11:43] time-permitting of course, which is i think where we ran afoul
[20:11:55] it's been quite heavily tested by a lot of people
[20:13:12] urandom: ideally for us it would be best if you guys tested it with some of our data, cause at this point it seems a bit uncertain.
[20:13:13] honestly, it has a much better track record than leveled compaction ;)
[20:14:10] nuria_: how long before your backfill is complete?
[20:14:29] how long before we're looking at a steady-state daily import?
[20:14:55] urandom: at this rate? A LOT, we need to backfill 10 more months
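A back-of-envelope sketch of the backfill timeline, using only the figures from the conversation: mforns's earlier estimate of roughly 7 hours of compaction per accumulated month, nuria's flat two days per monthly load, and 10 of the 13 months still to load (so 3 are assumed already loaded):

```python
# Rough backfill timeline using the two models tossed around above; the
# constants are the conversation's estimates, not measured values.

HOURS_PER_ACCUMULATED_MONTH = 7   # mforns's estimate from the compaction slope
MONTHS_TOTAL = 13                 # total months to backfill
MONTHS_LOADED = 3                 # inferred: 10 of 13 months remain

# Flat model: ~2 days of compaction per remaining monthly load.
flat_days = 2 * (MONTHS_TOTAL - MONTHS_LOADED)

# Linear model: loading month k compacts against k months of accumulated data.
linear_hours = sum(HOURS_PER_ACCUMULATED_MONTH * k
                   for k in range(MONTHS_LOADED + 1, MONTHS_TOTAL + 1))

last_load_days = HOURS_PER_ACCUMULATED_MONTH * MONTHS_TOTAL / 24

print(f"flat estimate:   ~{flat_days} days")              # ~20 days
print(f"linear estimate: ~{linear_hours / 24:.0f} days")  # ~25 days
print(f"last load alone: ~{last_load_days:.1f} days")     # ~3.8 days ('4 days')
```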
[20:15:10] urandom: so two days per compaction will put us at 20 days
[20:15:23] do you need to wait for compaction?
[20:15:38] it's meant to be async
[20:15:46] urandom: to load more data? seems wisest, no?
[20:16:07] ¯\_(ツ)_/¯
[20:16:13] urandom: but you are the expert
[20:17:01] if it's only compaction we're talking about, then more data will just push out what is behind further
[20:17:11] it's already playing catch up
[20:17:17] Analytics, Beta-Cluster-Infrastructure, Services, scap, and 3 others: Set up AQS in Beta - https://phabricator.wikimedia.org/T116206#2562093 (Milimetric) @elukey is on vacation, and I'm not really sure what changed. But if this is urgent for anyone, just ping me on IRC in #wikimedia-analytics
[20:17:45] urandom: ok, we will consider loading more data after this compaction is over and we load test
[20:17:47] nuria_: pushing it harder will also tell you what that does to read performance as it gets further behind
[20:18:22] urandom: yes, understood
[20:18:27] nuria_: worst-case, start with two months at a time and ramp up
[20:18:50] nuria_: oh, also, that number "pending compactions" is awful
[20:19:04] for leveled compaction anyway
[20:19:05] urandom: problem is the bigger the loading job the harder it is to rerun with success
[20:19:12] urandom: how so?
[20:19:35] nuria_: it's meant to be an estimate, but it's almost completely bogus
[20:19:52] if it's non-zero, then you do have some pending
[20:20:16] but it's not worth much more than that
[20:20:40] urandom: ok
[20:23:14] huh
[20:23:39] on 1004-a, you have only one compaction task running
[20:23:52] but something like 71 tmp files
[20:24:07] urandom: tmp files?
[20:24:36] yeah, the working files for an on-going compaction
[20:24:40] Analytics-Kanban, Trash: ---- DISCUSSED BELOW ---- - https://phabricator.wikimedia.org/T114124#2562150 (Milimetric) p:Triage>Lowest
[20:25:08] nuria_: mind if i restart this instance to see if they go away?
[20:25:13] urandom: please
[20:25:19] urandom: so 71 is too many?
[20:25:40] i was expecting 1, to correspond with the only on-going compaction
[20:27:42] apparently i don't have permissions for that
[20:27:52] Analytics-Kanban, Trash: ---- DISCUSSED BELOW ---- - https://phabricator.wikimedia.org/T114124#2562155 (Milimetric) We would make another column but we have a lot of columns already on boards like this: https://phabricator.wikimedia.org/tag/analytics/. We work our boards pretty aggressively and this hac...
[20:28:20] Analytics-Kanban, Trash: ---- DISCUSSED BELOW ---- - https://phabricator.wikimedia.org/T114124#2562156 (Milimetric) Open>stalled
[20:28:45] Analytics-Kanban, Trash: --- DISCUSSED BELOW --- - https://phabricator.wikimedia.org/T114124#1685280 (Milimetric)
[20:29:06] Analytics-Kanban, Trash: --- RUBICON --- - https://phabricator.wikimedia.org/T104390#2562187 (Milimetric) Open>stalled p:Low>Lowest
[20:29:22] nuria_: gah, because the sudo rules haven't been fixed to account for instances
[20:29:38] Analytics, Trash: --- Immediate Above --- - https://phabricator.wikimedia.org/T115634#2562190 (Milimetric) Open>stalled p:Triage>Lowest
[20:29:59] urandom: ah, i see
[20:30:24] urandom: you have no permits on the new cluster to restart the two cassandra instances, right?
[20:30:53] nuria_: because the sudo rule looks like: 'ALL = NOPASSWD: /usr/sbin/service cassandra *'
[20:31:13] and we need /usr/sbin/service cassandra-{a,b} *
[20:31:15] Analytics-Kanban: Need permits on New pageview API cluster to restart cassandra - https://phabricator.wikimedia.org/T143259#2562198 (Nuria)
[20:31:23] urandom: can you add that here? https://phabricator.wikimedia.org/T143259
[20:31:35] urandom: otto can help us change that tomorrow
[20:31:51] yeah
[20:32:18] Analytics: Better publishing of Annotations about Data Issues - https://phabricator.wikimedia.org/T142408#2562210 (Milimetric) Thanks, these are all great points. We're aiming to get to this next quarter.
[20:35:52] (PS1) Nuria: Merging fixes for datepicker on browser dashboard [analytics/analytics.wikimedia.org] - https://gerrit.wikimedia.org/r/305355 (https://phabricator.wikimedia.org/T141165)
[20:36:19] (CR) Nuria: [C: 2 V: 2] "Self merging to deploy to analytics.wikimedia.org" [analytics/analytics.wikimedia.org] - https://gerrit.wikimedia.org/r/305355 (https://phabricator.wikimedia.org/T141165) (owner: Nuria)
[20:37:04] Analytics-Kanban: Need permits on New pageview API cluster to restart cassandra - https://phabricator.wikimedia.org/T143259#2562226 (Nuria) The sudo rule looks like: 'ALL = NOPASSWD: /usr/sbin/service cassandra *' and we need /usr/sbin/service cassandra-{a,b} *
[20:37:51] msg lzia : yt?
[20:48:20] a-team logging off, have a good night!