[02:43:29] Analytics, Engineering-Community, MediaWiki-API, Research-and-Data, and 3 others: Metrics about the use of the Wikimedia web APIs - https://phabricator.wikimedia.org/T102079#1523207 (Tgr) RESTBase goes through Varnish, right? So we can just apply the same Kafka-Hadoop-Hive-dashboard data flow to it... [02:46:51] Analytics, Engineering-Community, MediaWiki-API, Research-and-Data, and 3 others: Metrics about the use of the Wikimedia web APIs - https://phabricator.wikimedia.org/T102079#1523209 (GWicke) @Tgr: Yes, indeed. We should already have all the information we need in the Varnish logs. [08:16:59] Analytics, Labs, Labs-Infrastructure, Labs-Sprint-108, Patch-For-Review: Set up cron job on labstore to rsync data from stat* boxes into labs. - https://phabricator.wikimedia.org/T107576#1523508 (akosiaris) Hey @Halfak, glad you joined us on this one. >>! In T107576#1519641, @Halfak wrote: > He... [08:21:46] Analytics, Labs, Labs-Infrastructure, Labs-Sprint-108, Patch-For-Review: Set up cron job on labstore to rsync data from stat* boxes into labs. - https://phabricator.wikimedia.org/T107576#1523509 (yuvipanda) My understanding is that halfak is suggesting that we automatically sync datasets.wikimed... [08:22:46] Analytics, Labs, Labs-Infrastructure, Labs-Sprint-108, Patch-For-Review: Set up cron job on labstore to rsync data from stat* boxes into labs. - https://phabricator.wikimedia.org/T107576#1523511 (akosiaris) >>! In T107576#1523509, @yuvipanda wrote: > My understanding is that halfak is suggesting... [10:13:59] Analytics, Labs, Labs-Infrastructure, Labs-Sprint-108, Patch-For-Review: Set up cron job on labstore to rsync data from stat* boxes into labs. - https://phabricator.wikimedia.org/T107576#1523627 (ArielGlenn) which directories do you want synced over? [13:24:26] (PS1) Ottomata: Make camus import webrequest_maps from new maps varnish cluster [analytics/refinery] - https://gerrit.wikimedia.org/r/230535 (https://phabricator.wikimedia.org/T105076) [13:29:18] Analytics-Cluster: Make varnishkafka produce using dynamic topics - https://phabricator.wikimedia.org/T108379#1523927 (Ottomata) https://gerrit.wikimedia.org/r/#/c/230173/ [13:33:55] ottomata1: Hi ! [13:34:06] ottomata1: let me know when available :) [13:34:12] hiay! [13:34:14] 5 mins! [13:34:18] np [13:38:41] Analytics, Labs, Labs-Infrastructure, Labs-Sprint-108, Patch-For-Review: Set up cron job on labstore to rsync data from stat* boxes into labs. - https://phabricator.wikimedia.org/T107576#1523941 (Ottomata) datasets.wikimedia.org lives on stat1001. The contents of it are rsynced from various /sr... [13:39:35] whoa bits. joal i think bits might be dead! [13:39:52] hellooo team :] [13:39:54] ottomata: you read my mind :) [13:40:03] mforns is back ! [13:40:09] hello mforns :) [13:40:12] hello! [13:40:17] hello joal [13:40:21] hey ottomata [13:40:52] so, ottomata, do you think bits is dead, or is there any issue ? [13:41:14] PROBLEM - Difference between raw and validated EventLogging overall message rates on graphite1001 is CRITICAL 20.00% of data above the critical threshold [30.0] [13:41:37] it is gone, [13:41:41] the data exists [13:41:45] but it is all in webrequest_text now :( [13:41:58] hm ... How come, is that expected ? 
[13:42:15] yeah, it's going to happen to mobile too [13:42:29] i think they're doing it for maintenance and performance reasons [13:42:30] or something [13:42:47] hence: i did this on friday [13:42:47] https://gerrit.wikimedia.org/r/#/c/230173/ [13:42:57] hm, because something other than kafka is based on those data definitions I guess [13:43:11] joal, yeah, it's just varnishkafka that sets the topic [13:43:15] RECOVERY - Difference between raw and validated EventLogging overall message rates on graphite1001 is OK Less than 15.00% above the threshold [20.0] [13:43:17] and varnishkafka can only produce to a single topic [13:43:27] so, if they move bits traffic to the text varnish instance [13:43:34] the text varnishkafka instance will produce that data [13:43:53] also, we are about to get a new webrequest_source: maps [13:44:01] https://phabricator.wikimedia.org/T105076 [13:44:05] right, and therefore no need to have multiple varnishkafka instances based on topics [13:44:15] Yes, I have seen that [13:44:18] yeah we don't want to do that. [13:44:26] but ja, if we merge and use that patch of mine [13:44:34] i think we should reevaluate how we partition the webrequest traffic [13:44:45] since cache cluster looks like it is going to be a bad way to do so in the future [13:44:55] maybe we should partition on domain? dunno. [13:45:09] i'd rather the list of webrequest topics be known ahead of time [13:45:21] makes sense [13:45:34] anyway, joal i want to do part one of kafka upgrade! want to do it with me? [13:45:44] ottomata: let's go :) [13:45:50] batcave [13:46:03] haha, cool, wait gimme 10 mins, still working on email a bit, want to move upstairs [13:46:10] no prob [13:46:12] i've edited this slightly if you want to look over it [13:46:13] https://etherpad.wikimedia.org/p/kafka_0.8.2.1_migration2 [13:50:02] I had 199 emails [13:51:51] mforns: Wow, if you prefer round numbers, I can send another one ;) [13:51:59] xD [13:52:15] 199 is fine, it seems less work [13:54:14] ok ... I could have done an easy one though: One email to group them all ... And in the darkness bin them ! [13:54:17] :D [13:58:19] ah joal, before we begin, i need to build some packages and put them in apt, few more mins [13:58:46] ottomata: no prob [14:14:59] (CR) BBlack: [C: 1] Make camus import webrequest_maps from new maps varnish cluster [analytics/refinery] - https://gerrit.wikimedia.org/r/230535 (https://phabricator.wikimedia.org/T105076) (owner: Ottomata) [14:15:23] (CR) BBlack: "https://gerrit.wikimedia.org/r/#/c/230539 has the other side of this" [analytics/refinery] - https://gerrit.wikimedia.org/r/230535 (https://phabricator.wikimedia.org/T105076) (owner: Ottomata) [14:20:02] Ooook joal, i am ready! [14:20:18] ottomata: In da cave :) [14:20:29] me too [14:46:35] Analytics-Tech-community-metrics: Closed tickets in Bugzilla migrated without closing event? - https://phabricator.wikimedia.org/T107254#1524083 (Lcanasdiaz) Guys, any news about the details of the Bugzilla migration? Is it going to be possible to have this "closing" event in the migrated tickets? [15:29:36] Analytics-Tech-community-metrics: Closed tickets in Bugzilla migrated without closing event? - https://phabricator.wikimedia.org/T107254#1524211 (Aklapper) I don't think anybody currently plans to add those //John Doe closed this task as "Resolved".// events to the resolved (etc.) tickets imported from Bugzi...
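To make the partitioning question above concrete: varnishkafka produces each instance's traffic to one configured topic, so folding bits traffic into the text cluster silently moved that data into webrequest_text, and the new maps cluster adds yet another webrequest_source. Below is a minimal sketch of the "partition on domain, with the topic list known ahead of time" idea. It is illustrative Python only, not varnishkafka's C code or its actual configuration, and the host-to-topic mapping is an assumption.

```python
# Illustrative only -- not varnishkafka code or config. varnishkafka produces each
# instance's traffic to a single topic; this just sketches what routing on domain
# could look like, and why a fixed topic list helps downstream Camus imports.

# Hypothetical, known-ahead-of-time list of webrequest topics.
KNOWN_TOPICS = {
    'webrequest_text',
    'webrequest_upload',
    'webrequest_mobile',
    'webrequest_maps',
    'webrequest_misc',
}

def topic_for_host(host):
    """Map a request Host header to a webrequest topic (assumed mapping)."""
    if host.endswith('maps.wikimedia.org'):
        topic = 'webrequest_maps'
    elif host.endswith('upload.wikimedia.org'):
        topic = 'webrequest_upload'
    elif '.m.' in host or host.startswith('m.'):
        topic = 'webrequest_mobile'
    else:
        topic = 'webrequest_text'
    assert topic in KNOWN_TOPICS  # never invent topics consumers don't know about
    return topic

print(topic_for_host('maps.wikimedia.org'))   # webrequest_maps
print(topic_for_host('en.m.wikipedia.org'))   # webrequest_mobile
```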
[15:34:01] (CR) Madhuvishy: [C: 2 V: 2] "LGTM" [analytics/dashiki] - https://gerrit.wikimedia.org/r/230117 (https://phabricator.wikimedia.org/T108337) (owner: Milimetric) [15:56:05] joal: https://plus.google.com/hangouts/_/wikimedia.org/kafka-upgrade [15:59:52] milimetric, mforns, madhuvishy: I'll miss the beginning of tasking trying to help ottomata [16:00:04] joal, ok [16:00:11] joal: no worries [16:18:53] Analytics-Backlog, Analytics-EventLogging: EventLogging Icinga Alerts should look at a longer period of time to prevent false positives {stag} - https://phabricator.wikimedia.org/T108339#1524338 (Milimetric) [16:21:06] Analytics-Backlog, Analytics-EventLogging: EventLogging Icinga Alerts should look at a longer period of time to prevent false positives {stag} [5 pts] - https://phabricator.wikimedia.org/T108339#1524339 (Milimetric) [16:31:44] PROBLEM - Difference between raw and validated EventLogging overall message rates on graphite1001 is CRITICAL 20.00% of data above the critical threshold [30.0] [16:33:55] RECOVERY - Difference between raw and validated EventLogging overall message rates on graphite1001 is OK Less than 15.00% above the threshold [20.0] [16:35:00] Analytics-Backlog: Provide the Wikimedia DE folks with Hive access/training {flea} [8 pts] - https://phabricator.wikimedia.org/T106042#1524369 (Milimetric) [16:35:45] Analytics-Backlog: Provide the Wikimedia DE folks with Hive access/training {flea} [8 pts] - https://phabricator.wikimedia.org/T106042#1524372 (kevinator) We'll create some documentation. Some notes on the documentation: * section on getting access, with pre-requisites & links (shell account, ssh configurati... [16:38:09] Analytics-Kanban: {flea} Teaching people to fish - https://phabricator.wikimedia.org/T107955#1524384 (Milimetric) [16:38:31] Analytics-Tech-community-metrics: Closed tickets in Bugzilla migrated without closing event? - https://phabricator.wikimedia.org/T107254#1524386 (Qgil) I agree, I don't think any further work will be done to fix this at this point. I think we should decline this task and move forward with what we have. If th... [16:53:01] Analytics-Backlog: Check and potentially timebox limn-flow-data reports {tick} - https://phabricator.wikimedia.org/T107502#1524456 (kevinator) p:Triage>High [16:53:20] Analytics-Backlog: Check and potentially timebox limn-language-data reports {tick} - https://phabricator.wikimedia.org/T107504#1524457 (kevinator) p:Triage>High [16:56:46] Analytics-Backlog: Make reportupdater support weekly granularity {tick} - https://phabricator.wikimedia.org/T108593#1524467 (Milimetric) NEW [16:58:44] Analytics-Backlog: Make reportupdater support weekly granularity {tick} [8 pts] - https://phabricator.wikimedia.org/T108593#1524492 (Milimetric) [16:59:26] Analytics-Backlog: Check and potentially timebox limn-flow-data reports {tick} [5 pts] - https://phabricator.wikimedia.org/T107502#1524494 (Milimetric) [16:59:33] Analytics-Backlog: Check and potentially timebox limn-language-data reports {tick} [5 pts] - https://phabricator.wikimedia.org/T107504#1524495 (Milimetric) [17:18:12] Analytics-Kanban: Create Hadoop Job to load data into cassandra [21?? 
pts] {slug} - https://phabricator.wikimedia.org/T108174#1524557 (Milimetric) [17:18:57] Analytics-Backlog: Backfill data for the API from the historic pageview dumps - https://phabricator.wikimedia.org/T108596#1524562 (Milimetric) NEW [17:26:17] Analytics-Backlog: Change the agent_type UDF to have three possible outputs: spider, bot, user {hawk} [13 pts] - https://phabricator.wikimedia.org/T108598#1524589 (Milimetric) NEW [17:26:28] Analytics-Backlog: Change the agent_type UDF to have three possible outputs: spider, bot, user {hawk} [13 pts] - https://phabricator.wikimedia.org/T108598#1524596 (Milimetric) [17:26:45] Analytics-Backlog: Communicate the WikiBot convention {hawk} [5 pts] - https://phabricator.wikimedia.org/T108599#1524599 (kevinator) NEW [17:27:09] Analytics-Backlog: Change the agent_type UDF to have three possible outputs: spider, bot, user {hawk} [13 pts] - https://phabricator.wikimedia.org/T108598#1524608 (Milimetric) p:Triage>Normal [17:27:49] Analytics-Backlog: Communicate the WikiBot convention {hawk} [5 pts] - https://phabricator.wikimedia.org/T108599#1524612 (kevinator) p:High>Normal [17:49:30] Analytics-Tech-community-metrics: Closed tickets in Bugzilla migrated without closing event? - https://phabricator.wikimedia.org/T107254#1524708 (chasemp) I mean in theory we could do this I think as we have the data: ` select header from bugzilla_meta where id=2002\G;` ```header: {"cf_hugglebeta": "---",... [17:56:39] PROBLEM - Difference between raw and validated EventLogging overall message rates on graphite1001 is CRITICAL 20.00% of data above the critical threshold [30.0] [17:58:38] * milimetric lunching [17:58:59] RECOVERY - Difference between raw and validated EventLogging overall message rates on graphite1001 is OK Less than 15.00% above the threshold [20.0] [18:01:27] hm, joal, yeah the mean messages-in rate is really fluctuating, probably related to the metadata polling period [18:01:41] ottomata: Was looking at that [18:04:18] I don't get it though [18:05:23] ottomata: Plus, seems we don't have the correct amount of messages [18:05:59] oh? [18:06:07] according to hadoop? [18:06:26] no hadoop yet, just the number of messages seems low [18:06:37] oh we don't [18:06:37] yeah [18:13:56] Analytics-Kanban: Make reportupdater support weekly granularity {tick} [8 pts] - https://phabricator.wikimedia.org/T108593#1524838 (mforns) a:mforns [18:15:46] halfak: here is the bigquery that exceeded the quota: [18:15:49] SELECT SUM(requests) FROM [fh-bigquery:wikipedia.pagecounts_201505] WHERE LEFT(TITLE, 8) = 'List_of_' [18:32:31] rats, joal. metadata timeouts happen for both librdkafka 0.8.5 and 0.8.6 [18:58:31] ottomata: so it's a known bug, is it ottomata ? [18:58:48] no [18:59:15] hm, you have created a new varnishkafka with the new version ! [18:59:32] ottomata: What are our options ? [19:00:47] joal: librdkafka is a shared lib, so installing that and restarting vk makes it pick up the new version [19:01:02] joal: am sorta chatting with magnus he said he'd be back around in a couple of hours to help [19:01:10] i'm trying to collect relevant debug logs [19:01:15] ok [19:02:02] Can I help on something?
[19:02:20] I'll get back in any case in a few hours to see projectview counts [19:03:33] ottomata: --^ [19:05:35] k [19:05:41] thanks, yea not sure joal, thank you [19:06:17] No worries, let me know if I can do more :) [19:11:08] PROBLEM - Difference between raw and validated EventLogging overall message rates on graphite1001 is CRITICAL 20.00% of data above the critical threshold [30.0] [19:13:06] milimetric: I'm at the office - we can chat whenever [19:13:18] RECOVERY - Difference between raw and validated EventLogging overall message rates on graphite1001 is OK Less than 15.00% above the threshold [20.0] [19:13:29] Analytics, Engineering-Community, MediaWiki-API, Research-and-Data, and 3 others: Metrics about the use of the Wikimedia web APIs - https://phabricator.wikimedia.org/T102079#1525037 (Spage) T102079#1418217 == fascinating stats, @anomie ++! (Do you have any api.log analysis code on fluorine worth s... [19:20:19] Analytics, MediaWiki-API, Research-and-Data: log user agent in api.log - https://phabricator.wikimedia.org/T108618#1525053 (Spage) NEW [19:28:52] madhuvishy: give me about 30 minutes, trying to finish something up [19:29:13] milimetric: ya no problem [19:44:59] ottomata: hours 16/17 UTC we have about 20% loss from the day before [19:46:22] yeh SIGHHH [19:47:38] not sure what I should do here joal, revert? I think this is fixiable, we haven't reproduce in labs, magnus will find me shortly (I hope). [19:58:24] Analytics, Engineering-Community, MediaWiki-API, Research-and-Data, and 3 others: Metrics about the use of the Wikimedia web APIs - https://phabricator.wikimedia.org/T102079#1525152 (Tgr) >>! In T102079#1525037, @Spage wrote: > If the API request is a `POST` does the web request log have the inform... [20:02:41] madhuvishy: cave? [20:03:05] milimetric: joining [20:06:05] hm, idea, meeting with kevin. [20:17:51] off for today guys [20:18:09] ottomata: Let me know by email if there is anything I can do tomorrow morning [20:19:15] k, thanks joal [20:25:12] Analytics-Engineering, VisualEditor: VE-related data for the Galician Wikipedia - https://phabricator.wikimedia.org/T86944#1525331 (Jdforrester-WMF) p:Triage>Normal [20:46:04] (PS1) Mforns: Add support for weekly frequency and granularity [analytics/limn-mobile-data] - https://gerrit.wikimedia.org/r/230649 (https://phabricator.wikimedia.org/T108593) [21:05:26] (CR) Mforns: "I considered the week starting on Monday as defined in ISO 8601 (https://en.wikipedia.org/wiki/ISO_8601#Week_dates). But I have no problem" [analytics/limn-mobile-data] - https://gerrit.wikimedia.org/r/230649 (https://phabricator.wikimedia.org/T108593) (owner: Mforns) [21:15:02] hey team, see you tomorrow, good night! [21:34:43] Analytics, MediaWiki-extensions-TimedMediaHandler, Multimedia, Wikimedia-Video: Record and report metrics for audio and video playback - https://phabricator.wikimedia.org/T108522#1525639 (brion) Should also track usage of inline vs embedded iframe mode [21:36:27] PROBLEM - Difference between raw and validated EventLogging overall message rates on graphite1001 is CRITICAL 20.00% of data above the critical threshold [30.0] [21:38:37] RECOVERY - Difference between raw and validated EventLogging overall message rates on graphite1001 is OK Less than 15.00% above the threshold [20.0] [21:41:37] Hi ottomata, got a second? At the last minute I realized that... sadly... 
event logging is a poor fit for the CN banner history data we want to log, because of the requirement of a flat data structure :( Trying to figure out the best cheap solution to get an MVP out the door. [21:45:00] milimetric: Hi! ^ ? any thoughts? [21:45:28] AndyRussG: how "not flat" is it? [21:45:37] ottomata: milimetric: Here u can see an example of the data structure we have: https://www.mediawiki.org/wiki/Extension:CentralNotice/Notes/Campaign-associated_mixins_and_banner_history#Data_and_logging [21:45:41] * milimetric looks [21:45:47] The meaty part is an array of objects [21:46:39] I have never done EL querying, nor do I know how much Ellery can munge ugly data, but in essence it shouldn't be unqueriable, either... [21:46:43] AndyRussG: that's true that it saves space to factor out common data like that, but would it break something important to just flatten out by denormalizing? [21:47:16] EL querying is just SQL querying right now. Or Hive querying that's basically just SQL [21:47:44] Ellery's plenty good at that, and I don't think from an analytics perspective it much matters the shape of the data as long as the quality is good [21:48:03] but will the quality be affected somehow if you flattened that schema out? [21:48:21] milimetric: the top-level stuff can be made repetivite. The objects in the array in log: have to be associated with each other because they're from a single device [21:48:25] as in, make each event represent each item inside the log array and duplicate the other data (like project, device, etc.) [21:48:51] AndyRussG: gotcha, is there any way to assign a token so that you can associate those together? [21:48:58] yeah, we'd have to add a unique ID for a group of EL logs [21:49:01] AndyRussG: for eventlogging as is now, milimetric can advise better than me [21:49:02] Editing team does this for their wikitext and VE instrumentation [21:49:11] for future cool awesome system, unflat should be fine...i think [21:49:37] heh /me imagines a semantic web triple store :) [21:50:16] Yeah I was thinking making a bunch of event logs rather than one was pretty hacky, but if that's the recommended solution [21:50:32] AndyRussG: so if you used a token to group these log events together, would it kill quality if that token got reset? Which it probably will at some point depending on how you generate it [21:51:07] It also would mean a lot more round trips to the server, but since it doesn't have to be at any particular time, it wouldn't degrade user performance, just use up our bandwidth [21:51:31] yes, it uses up bandwidth, but you were saying this is like a few per second right, nothing major [21:51:32] Yeah tokens can be reset. We can just choose a random one and run a bunch of events with it [21:51:59] AndyRussG: k, then you're all set. 
Just make sure Ellery or whoever's going to analyze this is familiar with how / when exactly the resets should be expected [21:52:10] so they can know what assumptions they can make and let you know how that will affect the analysis [21:52:57] Analytics-Kanban, RESTBase-API: create third RESTBase endpoint [8 pts] {slug} - https://phabricator.wikimedia.org/T107055#1525710 (Milimetric) a:Milimetric>madhuvishy [21:53:01] Analytics-Kanban, RESTBase-API: create second RESTBase endpoint [8 pts] {slug} - https://phabricator.wikimedia.org/T107054#1525714 (Milimetric) a:Milimetric>madhuvishy [21:53:07] Analytics-Kanban, RESTBase-API: create first RESTBase endpoint [8 pts] {slug} - https://phabricator.wikimedia.org/T107053#1525715 (Milimetric) a:Milimetric>madhuvishy [21:53:14] Yeah... Also consider that the number of objects in the array can be many more than in the example. As in, maybe 20 or 30 or 40... (Hmm in fact we haven't set an upper limit on number of items, just age, but we should do that, too.) [21:53:41] Is it kosher to fire of 40 EL events in sequence from a single client? [21:57:14] Analytics-Backlog: Restart Pentaho - https://phabricator.wikimedia.org/T105107#1525724 (Milimetric) Open>Resolved a:Milimetric This Pentaho instance was never meant to be commissioned in the first place. The service was running from my home folder, which is probably why it's completely dead now as N... [22:00:13] milimetric: ottomata: what would be upper limit of EL events fired in sequence from a single client on a single page view be? [22:05:57] * awight cocks ear at the crickets [22:06:45] AndyRussG: I liked your suggestion that we package a few dozen EL records and send using a bulk API, which unpacks on the server side. [22:10:41] AndyRussG: there should be no limit per client as far as EL is concerned [22:11:02] milimetric: cool thx! [22:11:19] but AndyRussG were your previous estimates about events per second based on bundling? [22:11:30] either way, what would your new estimates be? [22:11:43] we do have to make sure our total event throughput doesn't grow too much until we migrate completely to Kafka [22:12:00] milimetric: awight ^ mentioned another idea we had, to send just one request and unbundle server-side. The new estimate would be... arg, hard to guess. Somewhere between 3 and 10 times that? [22:12:30] AndyRussG: unpacking on the server and sending there would only buy you complexity [22:12:35] EL has to deal with the events anyway [22:13:01] well, a few dozen round trips from the client is not peanuts [22:13:19] Hmmm... yeah and I was hoping to sell... On that topic, another option would be just to have an ad-hoc database, though there is some reticience at FR having to own more infrastructure [22:13:50] awight: the EL endpoint just responds with 204, so the round part of the trip is not too bad [22:13:58] * AndyRussG pines for communism [22:14:17] it does add a bit of bandwidth to denormalize the schema, but if it's just the few fields you had on that example, it doesn't seem that bad [22:15:07] AndyRussG: communism is nice, and EL will get better, but I think we should at least try the basic approach on the beta cluster before we solve problems that may not exist [22:15:28] awight / AndyRussG: EL also uses sendBeacon to send stuff if it's available on the browser [22:15:32] AndyRussG: ah, I just realized we can send all the records in parallel, if the browser supports it [22:15:33] K yea that makes sense [22:15:33] so that should help with performance [22:15:48] awight: ? 
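A minimal sketch of the denormalized approach milimetric describes above: one EventLogging event per entry in the banner-history log, with the shared fields duplicated and a random grouping token added so the entries can be re-associated at analysis time. The capsule shape and field names (including groupToken) are assumptions, not the actual CentralNotice schema.

```python
# Sketch of the "denormalize and add a grouping token" approach discussed above.
# Field names and the capsule shape are assumptions, not the real CN schema.
import copy
import random

def flatten_history(capsule):
    """capsule looks like {'project': ..., 'uselang': ..., 'log': [{...}, {...}]}."""
    # Random, non-persistent token: it only links entries sent together and is
    # regenerated (reset) on every send, so it is not a persistent user ID.
    group_token = '%016x' % random.getrandbits(64)
    shared = {k: v for k, v in capsule.items() if k != 'log'}
    events = []
    for entry in capsule['log']:
        event = copy.copy(shared)      # duplicate the common fields per entry
        event.update(entry)
        event['groupToken'] = group_token
        events.append(event)
    return events

# Each returned dict is sent as its own EventLogging event (on the client this
# would be something like mw.eventLog.logEvent(schema, event)); the analyst then
# groups rows on groupToken, knowing the token resets between sends.
example = {'project': 'wikipedia', 'uselang': 'en',
           'log': [{'banner': 'B1', 'shown': True}, {'banner': 'B2', 'shown': False}]}
print(flatten_history(example))
```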
[22:16:06] https://developer.mozilla.org/en-US/docs/Web/API/Navigator/sendBeacon [22:16:14] On the other hand, milimetric it's not bandwidth we're worried about, it's potentially blocking all page javascript for 30 x round-trip lag, which could easily add up to a minute [22:16:36] oh none of it would be blocking, it's all AJAX, regardless of sendBeacon support [22:16:37] yeah, we're using sendBeacon for the doomed Special:RecordImpression [22:16:40] it's fire and forget essentially [22:17:46] so this data all becomes available at once? It can't be sent one message at a time earlier? [22:18:05] I don't think it works quite that way... I understand that sendBeacon is a big step forward, but not widely supported yet. Even with AJAX, there are only so many js interpreter threads (one in firefox) [22:18:37] It can't be sent one at a time, because we've decided not to assign individuals unique IDs [22:19:11] fair enough, so is the data all available only at the time you'd send the big batch? Or would it be available earlier? [22:20:13] milimetric: awight: ottomata: we could hyper-denormalize it and have a schema with historyEntryProp00, historyEntryProp01, historyEntryProp02 [22:20:32] It's available earlier, but we send it all at once so it can be linked together in the database. [22:20:53] can you generate the token to link it all early, send the events as they happen, and link together later via the token? [22:21:08] milimetric: it'd complicate things to send it at different times, since we're sampling banner display histories and we want to include single-visit or occasional-visit users equally in the sample [22:21:08] FYI sendBeacon experiment to see support in browsers hitting us: https://www.mediawiki.org/wiki/Extension:EventLogging/SendBeacon [22:21:26] I see... [22:22:22] milimetric: here is our most recent WIP patch: https://gerrit.wikimedia.org/r/#/c/229560/ [22:22:43] the meat... err protein is in resources/subscribing/ext.centralNotice.bannerHistoryLogger.js [22:22:51] milimetric: Nice to know there aren't many borken sendBeacon implementations out there, but what's the total share of browsers that support it? [22:23:10] That study seems to only look at browsers that claim to support it, not those that don't support it at all... [22:23:15] but that sampling is a level higher than the events, right? Like the sampling just decides include / exclude per client? [22:23:35] awight, yeah :) I was thinking the same thing [22:23:45] milimetric: yeah the sampling is at the page-view level [22:24:26] So summary of what happens: people go to the site and see banners. Every time a banner is shown (or hidden from them, for example because they clicked the close button last time) we store an event in LocalStorage [22:24:53] After storing the event, we randomly decide to send back the whole log in localStorage for a smallish percentage of users [22:24:55] milimetric: This says it's just straight up not in IE or Safari: https://developer.mozilla.org/en-US/docs/Web/API/Navigator/sendBeacon [22:28:02] AndyRussG: several other schemas follow a similar pattern, but they use event logging instead of local storage and they decide on the send / don't send based on the unique token they assign [22:28:34] What about hypernormalization? ^ Say, an EL schema with 70-80 properties (we'd be limited to say, 6 events from the log, though) [22:28:57] * awight hides in a cupboard [22:29:06] So: 1. assign token, 2. call their log event code always, 3.
if token meets random criteria, send to EL [22:29:16] Analytics-Backlog, Research-and-Data, Research collaborations: Meet with Felipe Hoffa: Google BigQuery + Wikimedia PV data - https://phabricator.wikimedia.org/T107911#1525886 (DarTar) @kevinator I am closing this as the scope was limited to today's meeting. Do you want to set up a separate task for the... [22:29:43] Yeah, hyper everything is cool and all but I think your problem is not different enough from others we've seen [22:30:06] milimetric: Doesn't that require a persistent unique ID per reader? [22:30:44] that's why i was asking earlier if the token could be reset [22:31:01] In other words, if you were trying to sample sessions or users [22:31:14] ah, no this data is only worth anything if we get a longitudinal thing per individual [22:31:27] sampling users. [22:31:41] gotcha, ok [22:32:13] then yes, but maybe you can do that by storing the token in local storage [22:33:08] AndyRussG: ^ interesting [22:33:19] sorry, not trying to cause you more work. Trying to prevent this batch API or hypernormal thing :) [22:33:27] haha I don't blame you [22:33:44] ottomata, around? do you know if the kartotherian is now logged? [22:34:42] ha, i don't know what kartotherian [22:34:43] is [22:34:43] ! [22:34:58] their fancy map service thing [22:35:14] milimetric: awight: in that case we'd decide as soon as someone hits the site whether they are in the sample or not, and only in those cases set a unique token, and basically send back events continuously. I confess I don't like it at all :( If we put the unique token in local storage, there is no guarantee to the user that it's not being used to track them for something else [22:36:29] putting all their events in local storage seems just as if not more invasive, no? [22:36:43] If a unique token is needed, I'd rather have a one that you can see by the code is never stored locally and that you can pretty muc gurantee what it's used for and what it isn't [22:37:24] this will of course be a moot point when we get uniques with proper opt-outs, but that's in the future [22:37:26] hmmm maybe on the surface, but it lets users know what we're storing about them [22:37:37] ottomata, maps :) [22:37:52] ah, yurik, i'm not sure what has been merged, i'm still working on this kafka migration and haven't looked at anything else all day [22:37:56] AndyRussG: true [22:38:00] it will be easy to log, but i haven't merged any camus stuff yet [22:38:08] so it isn't going into hadoop yet for sure [22:38:23] ottomata, https://maps.wikimedia.org/static/#7/19.435/-99.146 [22:38:49] but you could call it "token to randomly decide analysis inclusion" or something. And our code is open source [22:39:16] hmmm [22:39:33] I dunno, I'm leaning towards buying complexity at this point... [22:39:57] buying? [22:40:48] (mmm jk, above you said "unpacking on the server and sending there would only buy you complexity") [22:40:52] AndyRussG: what about an intermediary service that adds an id and splits up messages? [22:41:44] cwdent: ^ yeah exactly... So one server call, no uniqs on the client [22:42:29] ah, I see... Well, I mean, do what you think is right for FR but for EL the approaches above are all fairly similar. 
Spikes of events aren't awesome but as long as it's around 20 events per second or so, we're cool [22:43:01] if the intermediary service was an MQ of some sort you could also throttle the output [22:43:27] and with the move to kafka we'll lift that limit to "as long as you get more value out of higher sampling rates" [22:44:07] I'm eyeing Kafkatee and related but I don't know much about it [22:44:16] what rate would this unsampled stream be coming in at though? [22:44:50] Heh unsampled, well every pageview for users in a CentralNotice campaign that has the featuer activated... Though I'm not sure that'd be even useful [22:45:28] Could we send the whole history directly from the client to Kafka and use that for code that splits it into several EL events? [22:46:00] we are going to kafka fairly soon and we could just blacklist your schema from hitting the bottlenecked mysql part of the pipeline. Then you could have everything in hdfs and not worry about this. [22:46:33] but so I understand, you're basically talking like 100k per second or so?! [22:46:59] 100k per second? [22:47:16] I'm not clear yet, yeah, we couldn't handle that [22:47:49] ottomata: milimetric: well if we break it into several events, we don't really know how much traffic. Or I should say, we really know even less [22:47:59] AndyRussG just said every request in a campaign, which could be "most requests" during fundraisers? [22:48:06] Or I should say, we really know nothing and sort of know even less [22:48:20] ... :) [22:49:21] i think AndyRussG said every request in a campaign that has a feature activated [22:49:45] um, we need to start with the rate. So if you want to potentially send unsampled, we need to know how many that is. Standing up your own infrastructure to handle that is probably a bad road to go down [22:49:51] AndyRussG: are you trying to do funnel type analysis for donations? [22:49:56] ottomata: yeah, but that will be every fundraiser... We're just starting small because it's August, not December} [22:50:18] i was hoping 'feature activated' targeted a small set of users [22:50:32] right, so in December it would be a lot. And building infrastructure for that seems hard [22:50:37] ottomata: yes, to start, 'cause a small set of users are targeted by fundraising at this time of year [22:50:55] it's so that the effectiveness of banner analyses can take into account the history of banners a user has seen [22:51:10] right, but that thinking would paint us in a corner. It's only 12 weeks to December [22:51:29] We can always downsample until the volume is manageable. [22:51:39] ottomata: ^ [22:51:55] The sample rate is adjustable on a dime in this feature [22:52:28] the point is, sending everything to your own API to then denormalize and downsample seems hard too [22:52:51] or were you going to sample on the client still? [22:53:17] Sampling is on the client, configured by PHP globals. [22:53:38] milimetric: Can you give us a ballpark for how many EL requests/second we can send, then we can engineer our stuff to play nice? [22:53:53] awight: better yet! configuered on a campaign-by-campaign basis in the CN admin UI [22:54:03] rad, thanks for clarifying! [22:54:06] yeah, around 20-40 per second without changing anything and that works right now [22:54:22] milimetric: for EL overall, or that's an allowance for just our stream? [22:54:36] just for you guys [22:54:44] 8D [22:54:55] milimetric: also, is that the same if the request comes over the Web from a client vs. from a process within our cluster? 
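For reference, the token-based sampling pattern suggested above (assign a token, always call the logging code, only send when the token falls inside the sample, with the rate adjustable per campaign) might look roughly like this. The rate, token size and bucketing below are placeholders, not the CentralNotice implementation.

```python
# Rough sketch of the pattern suggested above: assign a token, always call the
# logging code, and only send when the token falls inside the sample. The rate,
# token size and bucketing are placeholders, not the CentralNotice implementation.
import random

SAMPLE_RATE = 0.01          # would come from per-campaign configuration
BUCKETS = 10000

def new_token():
    return random.getrandbits(32)

def in_sample(token, rate=SAMPLE_RATE):
    # Deterministic per token: a given token is always in or out of the sample,
    # so one client's events stay together without a persistent unique ID.
    return (token % BUCKETS) < rate * BUCKETS

def maybe_send(token, event, send):
    if in_sample(token):
        send(event)

token = new_token()
maybe_send(token, {'banner': 'B1', 'shown': True}, print)  # prints for ~1% of tokens
```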
[22:54:56] :D [22:54:59] it's basically mostly limited by mysql tables getting huge if you store much more [22:55:30] cool. So if we have bursty traffic, we're not causing excessive wear on anything? [22:55:39] it's the same regardless of source, yes. Probably less reliable from the server because the client goes through kafka [22:56:32] bursty is ok as long as it stays under the overall capability I think, which we haven't pushed too much but we think is around 1k events per second right now [22:56:40] ah hmm k thx :) [22:56:49] of which we're usually using about 400 per second [22:57:54] Analytics-Tech-community-metrics: Closed tickets in Bugzilla migrated without closing event? - https://phabricator.wikimedia.org/T107254#1526018 (Aklapper) p:Low>Lowest [22:58:06] BTW out of curiosity, can I look at the Kafkatee code somewhere that munges Special:RecordImpression into a format that's the same as what's output by udp2log? Just mainly to satisfy my own stubbornness that doing something in a similar way doesn't make sense here... [22:58:34] ottomata might know more on that ^ [22:58:42] AndyRussG: I can get u that [23:00:34] AndyRussG: don't understand the question [23:01:25] AndyRussG: http://git.wikimedia.org/blob/operations%2Fpuppet.git/97dfd5745af7294c15b652750f87c0d0b875a298/manifests%2Frole%2Flogging.pp#L481 [23:04:51] awight: thx! ottomata: it was just that bit that awight just sent ^ [23:05:14] Mmmm what I was thinking, is that a bit of infrastructure where we can put any additional logic? [23:05:30] * AndyRussG wears his ignorance like a badge of dishonour [23:06:23] kafkatee just lets you split off parts of a stream [23:07:14] ah hmmm K I was deluded... heh thx [23:09:42] What about processing inside Hive/Hadoop? If say we were OK with only getting the data there? Isn't there code, reduce-map, or map-boil or something, that you can run? Could we send it all raw to Hive and split it into useful bits there? [23:10:50] * AndyRussG claws at steel bars with broken fingernails [23:11:48] * awight shares a shard of his saltpeter [23:13:59] ottomata: milimetric: k just one last question about the many-propertied schema option--besides ugliness, is there a limit to how many properties a schema could have? [23:14:03] awight / AndyRussG: yeah, you could send it raw, but in December that'd be nearly duplicating the current clickstream [23:14:08] so we'd have to seriously increase capacity [23:14:33] if you can make a business case, we'll get 'er done [23:16:34] milimetric: ? I'm not talking about unsampled, or what do u mean? [23:16:39] AndyRussG: We'd be running into the URL length limit, if we tried the hypernormalized case. [23:16:42] one sec guys [23:16:47] talking 'bout stuff that's down [23:16:52] awight: ohh good point [23:16:53] hehe [23:17:25] heh we just scared the infrastructure [23:25:58] AndyRussG: yes, awight is right sort of, it's not the URL length limit [23:26:13] it's a weird length limit imposed somehow by varnish ? [23:26:30] I think marcel knows exactly the limit, but it's smaller than the URL length [23:26:49] awight: milimetric: that's of course if we do EL directly. If it were something else, we could POST, no?
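A back-of-the-envelope check against the numbers quoted above (roughly 20-40 events/sec allowed for this stream, about 1k/sec total EventLogging capacity with ~400/sec already in use). The campaign traffic and history length below are made-up placeholders, not measured figures.

```python
# Back-of-the-envelope rate check against the budget mentioned above: ~20-40
# events/sec for this stream, ~1k/sec total EL capacity, ~400/sec already used.
# All inputs are made-up placeholders; substitute real campaign numbers.
campaign_pageviews_per_sec = 2000   # hypothetical campaign traffic
sample_rate = 0.005                 # fraction of page views that send their history
entries_per_history = 20            # one EL event per banner-history entry

events_per_sec = campaign_pageviews_per_sec * sample_rate * entries_per_history
print(events_per_sec)               # 200.0 -> well over a 20-40/sec budget

budget = 20
max_rate = budget / (campaign_pageviews_per_sec * entries_per_history)
print(max_rate)                     # 0.0005 -> sample rate that stays within budget
```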
[23:26:50] in short, long events would fail, I can get you the limit when I'm less scrambled, perhaps write me an email [23:27:10] AndyRussG: um, there's no other way right now to get into the EL pipeline [23:28:00] the EL thingy reads out of varnish logs [23:28:11] and so our whole pipeline is subject to the weirdness there [23:28:53] milimetric: cool, got it :) [23:29:31] but that's a good point, and we could allow people to produce directly into kafka topics, eliminating this problem in the future [23:30:22] hmmmm [23:30:48] people can produce to kafka topics :) [23:30:50] if they ask nicely! [23:31:01] search folks are getting ready to do it from Cirrus in mediawiki [23:31:38] milimetric: https://wikitech.wikimedia.org/wiki/EventLogging#Log_size_limit this? [23:32:06] madhuvishy: thanks! Wow, that's a surprisingly bonsai length limit [23:32:21] thanks madhuvishy! [23:32:45] i think that is fixable if someone finds the time to fix that [23:33:13] https://phabricator.wikimedia.org/T91347 [23:33:25] no one has pushed it since may, cause folks who were affected just changed their data size [23:33:57] hmmmm... [23:34:32] Probably not worth dealing with that--I read that IE comes with a 2083 character limit for the full URL, anyway. [23:37:45] Analytics-Backlog, Analytics-EventLogging, Traffic, operations: EventLogging query strings are truncated to 1014 bytes by ?(varnishncsa? or udp packet size?) - https://phabricator.wikimedia.org/T91347#1526151 (awight) It seems like wasted effort to increase the length limit beyond 2,000 chars. See... [23:40:03] awight: hmmm... I wonder if SendBeacon can POST... [23:42:50] AndyRussG: sendBeacon uses POST [23:42:57] (behind the scenes) [23:43:09] get sucks too much, one reason being the limitation awight just pointed out (URL length) [23:44:35] milimetric: ah cool [23:45:32] milimetric: ottomata: awight: cwdent: in a couple minutes I have to go to dinner and stuff... all this has been really helpful!!! Really appreciate it! [23:46:30] been very educational! got like 100 browser tabs open [23:49:30] for me too! [23:49:41] cya soon, thx again! [23:50:52] hope i wasn't too confusing. Do keep bouncing ideas off me if it helps, we're trying to figure out how to make this infrastructure better and this helps a lot [23:51:50] milimetric: it was great! K will do fer sure :) [23:57:43] Analytics-Backlog, Analytics-EventLogging, Traffic, operations: EventLogging query strings are truncated to 1014 bytes by ?(varnishncsa? or udp packet size?) - https://phabricator.wikimedia.org/T91347#1526242 (Tgr) >>! In T91347#1526151, @awight wrote: > It seems like wasted effort to increase the...
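Given the limits discussed at the end (EventLogging query strings truncated around 1014 bytes per T91347, and IE's 2083-character cap on full URLs), a client could defensively check the encoded payload size before sending. This is a rough sketch; the limit constant, helper name and beacon URL are assumptions for illustration, not the EventLogging client API.

```python
# Rough pre-send size check against the limits discussed above: EL query strings
# get truncated around 1014 bytes (T91347) and IE caps full URLs at 2083 chars.
# The constant, helper name and beacon URL below are assumptions for illustration.
import json
import urllib.parse

EL_QUERY_LIMIT = 1014  # bytes, per T91347

def fits_in_eventlogging(capsule, beacon='https://example.org/beacon/event?'):
    payload = urllib.parse.quote(json.dumps(capsule, separators=(',', ':')))
    return len(beacon) + len(payload) <= EL_QUERY_LIMIT

small = {'schema': 'CentralNoticeBannerHistory', 'event': {'log': [{'b': 'B1'}]}}
big = {'schema': 'CentralNoticeBannerHistory',
       'event': {'log': [{'b': 'B%d' % i} for i in range(40)]}}
print(fits_in_eventlogging(small))  # True
print(fits_in_eventlogging(big))    # False -> trim the history or split the send
```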