[09:42:23] Analytics-Wikistats: Determine total number of external links in all Wikipedias - https://phabricator.wikimedia.org/T137984#2393122 (PleaseStand) >>! In T137984#2390918, @PleaseStand wrote: > Here is one possible way to count external links: Now I ran a modified version of my queries, which break the (de-du...
[10:16:24] Analytics, Research-and-Data, Research-consulting: Update official Wikimedia press kit with accurate numbers - https://phabricator.wikimedia.org/T117221#2393189 (leila)
[10:20:30] Analytics, Research-and-Data, Research-consulting: Update official Wikimedia press kit with accurate numbers - https://phabricator.wikimedia.org/T117221#2393196 (leila)
[10:27:21] Analytics, Research-and-Data, Research-consulting: Update official Wikimedia press kit with accurate numbers - https://phabricator.wikimedia.org/T117221#2393233 (leila)
[10:28:24] Analytics-Kanban, Operations, Traffic, Patch-For-Review: Verify why varnishkafka stats and webrequest logs count differs - https://phabricator.wikimedia.org/T136314#2393240 (elukey) We finally tracked down all the sources of null/missing end timestamps coming from Varnish: 1) Varnish Pipe logs,...
[10:35:41] Analytics, Research-and-Data, Research-consulting: Update official Wikimedia press kit with accurate numbers - https://phabricator.wikimedia.org/T117221#2393275 (leila)
[10:36:36] joal: o/
[10:36:46] http://www.datastax.com/dev/blog/improved-cassandra-2-1-stress-tool-benchmark-any-schema
[10:36:49] looks nice
[10:37:16] hi joal. Question. :D
[10:38:02] joal: Erik Zachte and I are discussing T117221, and we are wondering can we see the druid environment you showed in the meeting with WMDE the other day? (you -> Nuria)
[10:38:16] it can help us understand how much technology will be available.
[10:38:48] Analytics-Kanban: Test cassandra compactions on new AQS nodes - https://phabricator.wikimedia.org/T135145#2393276 (elukey)
[10:44:43] Analytics, Research-and-Data, Research-consulting: Update official Wikimedia press kit with accurate numbers - https://phabricator.wikimedia.org/T117221#2393278 (leila)
[11:02:29] Analytics, Research-and-Data, Research-consulting: Update official Wikimedia press kit with accurate numbers - https://phabricator.wikimedia.org/T117221#2393294 (leila)
[11:04:38] Analytics, Research-and-Data, Research-consulting: Update official Wikimedia press kit with accurate numbers - https://phabricator.wikimedia.org/T117221#2393297 (leila)
[11:42:57] Analytics, Research-and-Data, Research-consulting: Update official Wikimedia press kit with accurate numbers - https://phabricator.wikimedia.org/T117221#2393363 (leila) @Heather: Some background: Erik and I are working on a list of metrics and qualifiers for the Comms team. We need your help to make...
[11:53:14] Analytics, Research-and-Data, Research-consulting: Update official Wikimedia press kit with accurate numbers - https://phabricator.wikimedia.org/T117221#2393385 (leila)
[13:05:24] Analytics, Research-and-Data, Research-consulting: Update official Wikimedia press kit with accurate numbers - https://phabricator.wikimedia.org/T117221#2393525 (leila)
[13:14:40] Analytics: Add global last-access cookie for top domain (*.wikipedia.org) - https://phabricator.wikimedia.org/T138027#2393562 (BBlack) I haven't been able to find the ticket ref, but I know in the past there was some longer-term question around the basic utility of the WMF-Last-Access data (as in, whether th...
[13:17:11] urandom: hi! Whenever you have time I'd like to chat about cassandra-rackdc.properties :)
[14:37:19] Analytics, Research-and-Data, Research-consulting: Update official Wikimedia press kit with accurate numbers - https://phabricator.wikimedia.org/T117221#2393734 (leila)
[14:37:44] Analytics, Research-and-Data, Research-consulting: Update official Wikimedia press kit with accurate numbers - https://phabricator.wikimedia.org/T117221#1769033 (leila)
[14:59:37] elukey: sup?
[15:00:32] hello!
[15:01:06] during the offsite we were wondering if it would be good to add rack awareness to aqs100[456] to avoid having two replicas on the same node
[15:01:12] (for the same data)
[15:01:34] oh, you definitely do need to, yes
[15:01:49] now they are all rack1
[15:01:57] that is the default for the cassandra class afaiu
[15:02:05] right
[15:02:30] you need to use the network topology strategy, and make each machine its own 'rack'
[15:02:32] so I was wondering what is the correct procedure to enable rack awareness.. just add the rack/dc keys to each instance one at the time in Hiera?
[15:02:56] have you already initialized the cluster?
[15:03:03] you can't change it after
[15:03:51] by after i mean, after loading with data
[15:04:27] * elukey starts crying
[15:04:46] uh oh
[15:04:58] we have loaded some data but we are still in the load testing phase :)
[15:05:14] so it is fine to tear everything down
[15:05:19] even if we'll lose time
[15:05:52] ok
[15:06:37] so I'd need to follow https://wikitech.wikimedia.org/wiki/Cassandra#Bootstrap_a_brand_new_cluster to flush everything
[15:07:00] but before that I'd need to add hiera config for rack awareness
[15:07:59] yeah
[15:08:14] joal will be soooo happy
[15:08:18] :D
[15:09:24] so, if you're resetting the whole cluster, you'd have the rack/dc info setup, and then you'd bring up the first node with `auto_bootstrap: false`, then add each new one with `auto_bootstrap: true`, and then load your data
[15:10:06] and you'll have to wipe /srv/cassandra-[a-z]/ on each, yeah
[15:10:30] maybe I'll let joal to finish his tests
[15:10:37] and then we'll wipe everything
[15:11:37] or, you can decommission each node in turn, and then rebootstrap it with the right rack info
[15:11:41] or you can do repair
[15:12:03] the latter being very undesirable
[15:12:37] might try the one node at the time re-bootstrap
[15:13:18] yeah, i dunno, i guess it depends on how long it takes you to reload
[15:13:27] might be faster than a decomm/bootstrap
[15:13:46] yeah
[15:13:57] can I send you a code review for the rack awareness?
[15:14:03] you guys really need to move to bulk loading
[15:14:31] s/move to/explore/
[15:14:39] we load in bulk daily from what I know.. what do you mean?
[15:14:45] (surely missing the point)
[15:15:04] there is a mechanism for bulk-loading SSTable files
[15:15:15] that uses streaming
[15:15:28] and there are tools to create the SSTable files to stream
[15:15:51] woa okok :)
[15:17:35] elukey: https://phabricator.wikimedia.org/T126243
[15:18:18] subscribed :)
[15:21:50] oh, also, yes (of course) to the code review
[15:23:06] urandom: something like https://gerrit.wikimedia.org/r/#/c/295233/1 ?
[15:24:38] yup!
[15:25:01] good! So this one can be safely merged now since I
[15:25:11] I'd need to do a decommission first to apply it
[15:25:15] or should I wait?
[15:26:44] yeah, you'd want to decomm, apply to an affected node, and rebootstrap
[15:26:51] also
[15:27:07] are you using network topology strategy?
[15:27:19] I'd need to double check but I think so
[15:30:11] checked: it's good
[15:30:24] also, your superuser password is still the default
[15:30:27] :)
[15:30:54] * elukey knows it with a lot of shame
[15:33:06] * urandom shrugs
[15:33:52] it's one of those depth-of-defense things i guess, it makes it slightly harder to do something nasty if you have local access and are already in a position to do very nasty things. :)
[15:34:01] but only slightly harder
[15:34:14] mildly inconvenient :)
[15:34:41] elukey: are you going to wikimania?
[15:34:52] * urandom probably already asked this
[15:37:37] urandom: nope I am not! Need to take care of some family doctor appointment (my grandma mainly :) so I decided not to go after the offsite.. Even if it is very close to home!
[15:37:58] are you going?
[15:38:02] yup
[15:38:20] i'm traveling atm, as a matter of fact
[15:38:38] first leg complete (in newark waiting for the flight to milan)
[15:38:55] sorry you won't be there, but i understand
[15:39:05] best of luck with your grandma!
[15:39:26] are others from analytics going to be there?
[15:41:31] thanks! Dan and Leila will be there!
[16:00:26] a-team: going to the ops meeting, will send the E-Scrum in a bit!
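Editor's note on the rack-awareness exchange (15:01 to 15:27): the point about "all rack1" versus "each machine its own 'rack'" can be illustrated with a toy model. The sketch below is not Cassandra code. It assumes a replication factor of 3, two Cassandra instances per host (suggested only by the /srv/cassandra-[a-z] path), and a simplified ring with no vnodes. The `Instance`, `build_ring` and `place_replicas` names are hypothetical. It only mimics the general idea of rack-aware placement: walk the ring, prefer nodes in racks not yet used, and fall back to skipped nodes when racks run out.

```python
from typing import List, NamedTuple


class Instance(NamedTuple):
    name: str  # hypothetical instance label, e.g. "aqs1004-a"
    host: str
    rack: str


def place_replicas(ring: List[Instance], start: int, rf: int) -> List[Instance]:
    """Toy rack-aware placement: walk the ring from `start`, preferring unseen racks."""
    ordered = [ring[(start + i) % len(ring)] for i in range(len(ring))]
    chosen: List[Instance] = []
    skipped: List[Instance] = []
    seen_racks = set()
    for inst in ordered:
        if len(chosen) == rf:
            break
        if inst.rack in seen_racks:
            skipped.append(inst)  # same rack as an earlier replica, keep as fallback
        else:
            seen_racks.add(inst.rack)
            chosen.append(inst)
    # Fewer distinct racks than replicas: fill up with skipped nodes in ring order.
    chosen.extend(skipped[: rf - len(chosen)])
    return chosen


def build_ring(rack_per_host: bool) -> List[Instance]:
    """Build a six-instance ring; racks are either all 'rack1' or one per host."""
    hosts = ["aqs1004", "aqs1005", "aqs1006"]
    return [
        Instance(f"{host}-{suffix}", host, host if rack_per_host else "rack1")
        for host in hosts
        for suffix in ("a", "b")  # assumption: two instances per host
    ]


def distinct_hosts(replicas: List[Instance]) -> int:
    """Count how many physical hosts hold a replica."""
    return len({r.host for r in replicas})


if __name__ == "__main__":
    for label, ring in (("all rack1", build_ring(False)),
                        ("rack per host", build_ring(True))):
        replicas = place_replicas(ring, start=0, rf=3)
        print(label, [r.name for r in replicas], "hosts:", distinct_hosts(replicas))
```

In this toy ring, the all-rack1 layout puts two of the three replicas on aqs1004, while the rack-per-host layout spreads them over three hosts. That is the situation the rack/dc keys are meant to avoid.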
[16:00:34] hola elukey
[16:00:54] o/
[16:01:03] joal: yt?
[16:01:11] joal: my hangouts no work
[16:01:14] I am !
[16:01:21] Just arrived, backfilling irc
[16:01:26] joal: does hangout work?
[16:01:32] s/filling/logging
[16:01:35] It seems to
[16:01:44] I am connected
[16:01:49] elukey: you here ?
[16:02:02] yess
[16:18:35] Hi :)
[16:18:56] Just read the discussion with urandom
[16:19:16] It's a shame we'll have to reload, but there seems to be no other way ;0
[16:25:35] :(
[16:25:50] we could do each node one at the time
[16:25:57] but I think it might be long and messy
[16:26:06] (messy because I'll do it :)
[16:26:10] elukey: I agree
[16:26:36] elukey: I might try to work on urandom suggestion about bulk
[16:26:52] that one seems awesome
[16:27:36] elukey: Will change a bit, but mostly on cassandra cpu-load at write time
[16:57:50] joal: I've also discovered some nice things about vk today :(
[16:57:58] and varnish api
[16:58:13] elukey: Arf, so what's up?
[16:58:54] Analytics: Add global last-access cookie for top domain (*.wikipedia.org) - https://phabricator.wikimedia.org/T138027#2394211 (Nuria) @BBlack: short answer: yes. The Last-access cookie has helped us quantify, for example, the shift to mobile in all our projects. More so for data split by country. Some info o...
[17:00:21] so the timeout seems to be in the varnish log api.. by default varnish flushes a log (even incomplete) if a Begin tag doesn't see its buddy End after 2 mins,
[17:01:12] elukey: told it was a timeout :-P
[17:01:31] so it's very difficult for us to get to know what is actually happening
[17:02:04] yeah we thought it was a timeout but the problem was figuring out where :)
[17:02:15] indeed :)
[17:02:24] I think it is a safe net for Varnish
[17:02:30] And by the way, thanks for the amount of effort you put into that !
[17:02:51] elukey: It's really great to have one person in the team knowing good on that part of the process !
[17:02:59] still a bit frustrated by no real progress, but we'll get to the end eventually :)
[17:03:12] elukey: knowledge is the first step to action ;0
[17:03:29] * elukey knows that joal is a wise man
[17:03:52] hmm, not so sure of that depending of the days ;)
[17:04:02] But I'll take the compliment ;)
[17:04:11] joal: one thing that I could add to vk would be a formatter like %{VSL:timeout}x
[17:04:37] that would output something if vk sees the api timeout somewhere
[17:05:07] elukey: That could actually help us a lot I think
[17:05:45] and maybe expose a tunable parameter for the timeout?
[17:05:52] it shouldn't be super difficult
[17:07:13] elukey: Yeah, would involve a small change in camus, but nothing dramatic, and would help quantify the real errors from the timeout
[17:07:45] Also, if we modify the raw log format (adding timeout timestamp), we could take the opportunity to add an error field?
[17:08:19] yeah the formatter should be an error message
[17:08:31] something like "timeout"
[17:08:44] (that is what Varnish returns)
[17:10:23] elukey: Arf, I was expecting a timestamp
[17:10:33] elukey: both values would be great
[17:11:08] the timestamp might be a bit painful to implement, since I'd need to pick one, format it, etc.
[17:11:27] not really a huge deal but an error message would be much better :P
[17:12:53] elukey: I really think we need both :)
[17:13:17] elukey: The idea is that without timestamp, the rows need to be removed (which is a shame)
[17:13:33] elukey: But if not feasible, we'll do without ;)
[17:16:37] the problem would be what timestamp to pick :)
[17:24:23] elukey: I don't feel like starting a cassandra talk now, tomorrow morning would be ok for you?
[17:24:37] sure sure! I have to go in a bit :)
[17:24:46] ok awesome
[17:24:49] and I need to think about vk
[17:25:31] elukey: Don't break your head over that, we can make decision on best fit
[17:25:39] joal: just one question - would the start timestamp be ok instead of the end: one in case of a timeout?
[17:26:15] elukey: hmm ... Need to think about that, but seems reasonable
[17:26:32] it is the only way to make something that always is there
[17:26:56] and without adding any more time tags
[17:27:36] elukey: makes sense
[17:27:52] the only drawback would be getting some misalignment again
[17:30:00] elukey: I'd take the 'error' logs out of the sanity check ;)
[17:30:45] elukey: Like that we have coherent logs from a time point of view, no need to remove lines (and we can help Brandon if data is needed cause we have some)
[17:31:01] And we still have a good enough sanity metric
[17:32:38] all right so if the VSL marker is "timeout" then it will not be used
[17:33:17] ok I'll try to work on a solution tomorrow
[17:33:26] :)
[17:33:29] o/
[17:33:31] byeeeee
[17:33:35] Bye elukey !
[17:42:52] ciaooo
[17:42:56] so lonelyyyyy
[17:43:11] nuria_: hehe
[20:43:31] Analytics-Cluster, Analytics-Kanban, Operations, ops-eqiad: Smartctl disk defects on kafka1012 - https://phabricator.wikimedia.org/T136933#2395046 (Nuria) Open>Resolved
[20:44:00] Analytics-Kanban, Patch-For-Review: Announce analytics.wikimedia.org - https://phabricator.wikimedia.org/T136426#2395047 (Nuria) Open>Resolved
[20:44:59] Analytics, Analytics-Kanban, Patch-For-Review: analytics specific icinga alerts should ping in our IRC channel. - https://phabricator.wikimedia.org/T125128#2395060 (Nuria) Open>Resolved
[20:46:18] Analytics-Kanban, DBA, Editing-Analysis, Patch-For-Review: Reportupdater does not commit changes after each query - https://phabricator.wikimedia.org/T134950#2395067 (Nuria) Open>Resolved
[20:48:20] Analytics-Kanban: Prototype Data Pipeline on Druid - https://phabricator.wikimedia.org/T130258#2395068 (Nuria) Open>Resolved
[20:49:32] Analytics: Productionize Druid Pageview Pipeline - https://phabricator.wikimedia.org/T138261#2395082 (Nuria)
[20:49:52] Analytics: Puppetize pivot UI - https://phabricator.wikimedia.org/T138262#2395083 (Nuria)
[20:50:37] Analytics: Puppetize Zookeeper - https://phabricator.wikimedia.org/T138263#2395096 (Nuria)
[20:52:04] Analytics: Productionize Druid loader - https://phabricator.wikimedia.org/T138264#2395115 (Nuria)
[20:53:20] Analytics: Puppetize pivot UI - https://phabricator.wikimedia.org/T138262#2395134 (Nuria) Access should be restricted by LDAP
[20:55:03] o/ Hey folks
[20:55:06] I'm looking at https://en.wikipedia.org/wiki/User_talk:EpochFail#Improving_POPULARLOWQUALITY_efficiency
[20:55:29] And trying to figure out why we can't get a larger top N list of articles by pageviews without running into privacy issues?
[20:55:35] Analytics: Upgrade Kafka (non-analytics cluster) - https://phabricator.wikimedia.org/T138265#2395136 (Nuria)
[20:56:42] Analytics: Puppetize MirrorMaker - https://phabricator.wikimedia.org/T138267#2395161 (Nuria)
[21:01:26] Analytics-Kanban: Mediawiki changes to publish data for analyrtics schemas - https://phabricator.wikimedia.org/T138268#2395185 (Nuria)
[21:05:20] Analytics: Host edit data on Druid for all wikis. - https://phabricator.wikimedia.org/T138269#2395206 (Nuria)
[22:40:27] joal: can you double check my comment here: https://en.wikipedia.org/wiki/User_talk:EpochFail#Improving_POPULARLOWQUALITY_efficiency
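
Editor's note on the varnishkafka timeout discussion earlier in the log (17:04 to 17:32): the agreed idea was that a record flushed by the Varnish log API timeout carries a "timeout" marker, uses the start timestamp in place of the missing end timestamp, and is left out of the sanity check without being deleted. A minimal sketch of that downstream handling follows. The field names (`vsl_error`, `start_ts`, `end_ts`) and both function names are hypothetical. Nothing in the log says how the formatter output would be named in the raw log format or how camus would consume it.

```python
from typing import Dict, Iterable, List, Tuple

Record = Dict[str, str]

TIMEOUT_MARKER = "timeout"  # the error message elukey says Varnish returns


def effective_timestamp(record: Record) -> str:
    """Pick the timestamp used to place a record in time.

    Timed-out records have no reliable end timestamp, so fall back to the
    start timestamp, which is always present (hypothetical field names).
    """
    if record.get("vsl_error") == TIMEOUT_MARKER:
        return record["start_ts"]
    return record["end_ts"]


def split_for_sanity_check(records: Iterable[Record]) -> Tuple[List[Record], List[Record]]:
    """Separate records that feed the sanity check from timed-out ones.

    Timed-out records are kept (not dropped) so they stay available for
    debugging, but they do not count towards the sanity metric.
    """
    checked: List[Record] = []
    excluded: List[Record] = []
    for record in records:
        if record.get("vsl_error") == TIMEOUT_MARKER:
            excluded.append(record)
        else:
            checked.append(record)
    return checked, excluded


if __name__ == "__main__":
    sample = [
        {"start_ts": "2016-06-20T17:00:00", "end_ts": "2016-06-20T17:00:01", "vsl_error": "-"},
        {"start_ts": "2016-06-20T17:00:05", "end_ts": "-", "vsl_error": "timeout"},
    ]
    checked, excluded = split_for_sanity_check(sample)
    print(len(checked), "checked,", len(excluded), "excluded")
    print([effective_timestamp(r) for r in sample])
```

The drawback raised at 17:27:52 still applies to this sketch: using the start timestamp for timed-out records can misalign them relative to records stamped at their end.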