[03:18:12] (CR) Milimetric: "Changes in response to Marcel's comments look good, but you only replied to one of my comments from PS7" (1 comment) [analytics/refinery] - https://gerrit.wikimedia.org/r/552943 (https://phabricator.wikimedia.org/T238360) (owner: Nuria)
[04:52:44] (CR) Nuria: [WIP] Table and workflow for features computations per session (6 comments) [analytics/refinery] - https://gerrit.wikimedia.org/r/552943 (https://phabricator.wikimedia.org/T238360) (owner: Nuria)
[07:07:23] !log kill/re-run pageview 2019-12-10-17, stuck in whitelist check for hours (https://hue.wikimedia.org/jobbrowser/jobs/job_1573208467349_171800 for more info)
[07:07:25] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[08:11:16] Analytics, Operations, ops-eqiad, User-Elukey: Check if a GPU fits in any of the remaining stat or notebook hosts - https://phabricator.wikimedia.org/T220698 (elukey) Announced the shutdown of stat1007 for Thu Dec 12th 15:30 CET (more or less) since it is a more crowded and used node. Since stat1...
[08:42:35] Analytics: Create user quotas for notebook100[3,4] to limit home directory size - https://phabricator.wikimedia.org/T240422 (elukey)
[09:57:18] Analytics, Analytics-Cluster, Analytics-Kanban, Patch-For-Review: an-coord1001 hive metastore not listening on ipv6 - https://phabricator.wikimedia.org/T240255 (elukey)
[10:58:06] Analytics: Analytics Ops Technical Debt - https://phabricator.wikimedia.org/T240437 (elukey)
[10:59:09] Analytics: Analytics Ops Technical Debt - https://phabricator.wikimedia.org/T240437 (elukey)
[10:59:11] Analytics: Archive /home/ezachte data on stat1007 - https://phabricator.wikimedia.org/T238243 (elukey)
[11:02:34] Analytics: Move https termination from nginx to envoy (if possible) - https://phabricator.wikimedia.org/T240439 (elukey)
[11:03:12] Analytics: Create user quotas for notebook100[3,4] to limit home directory size - https://phabricator.wikimedia.org/T240422 (elukey)
[11:03:14] Analytics: Analytics Ops Technical Debt - https://phabricator.wikimedia.org/T240437 (elukey)
[11:06:45] Analytics: Add CPU quota to stat and notebook hosts - https://phabricator.wikimedia.org/T240440 (elukey)
[11:08:09] Analytics: Analytics Ops Technical Debt - https://phabricator.wikimedia.org/T240437 (elukey)
[11:08:13] Analytics, Analytics-Kanban, Patch-For-Review, User-Elukey: Repurpose db1108 as generic Analytics db replica - https://phabricator.wikimedia.org/T234826 (elukey)
[11:08:15] Analytics, Analytics-Cluster, Analytics-Kanban, Patch-For-Review: an-coord1001 hive metastore not listening on ipv6 - https://phabricator.wikimedia.org/T240255 (elukey)
[11:09:23] Analytics, Analytics-Kanban, Event-Platform, User-Elukey: Documentation improvements for Eventstreams - https://phabricator.wikimedia.org/T240181 (faidon) >>! In T240181#5724161, @Ottomata wrote: > BTW, I don't think the IRC recentchanges stuff needs to consider historical consumption. The curr...
[11:24:32] going afk, just sent e-scrum!
[11:24:32] o/
[12:34:35] Analytics, Operations, SRE-Access-Requests, Patch-For-Review: Add accraze to analytics-privatedata-users - https://phabricator.wikimedia.org/T240243 (jcrespo) a:jcrespo→ACraze ` Notice: /Stage[main]/Admin/Admin::Hashuser[accraze]/Admin::User[accraze]/User[accraze]/ensure: created Notice:...
[12:35:23] Analytics, Operations, SRE-Access-Requests, Patch-For-Review: Add accraze to analytics-privatedata-users - https://phabricator.wikimedia.org/T240243 (jcrespo) p:Triage→Normal
[13:48:31] Analytics, Better Use Of Data, Event-Platform, Product-Infrastructure-Team-Backlog, Epic: Event Platform Client Libraries - https://phabricator.wikimedia.org/T228175 (jlinehan)
[14:05:25] Analytics, Better Use Of Data, Event-Platform, Product-Infrastructure-Team-Backlog, Epic: Event Platform Client Libraries - https://phabricator.wikimedia.org/T228175 (jlinehan)
[14:16:36] o/
[14:20:28] (CR) Milimetric: [C: +2] "My bad, I missed the rename and didn't see the corrections. Looks good to me." [analytics/refinery] - https://gerrit.wikimedia.org/r/552943 (https://phabricator.wikimedia.org/T238360) (owner: Nuria)
[14:28:06] Analytics: Move https termination from nginx to envoy (if possible) - https://phabricator.wikimedia.org/T240439 (Ottomata) I will also need to do this for the schema service too! https://phabricator.wikimedia.org/T233630#5644875
[14:35:57] Analytics, Analytics-Kanban, Event-Platform, User-Elukey: Documentation improvements for Eventstreams - https://phabricator.wikimedia.org/T240181 (Ottomata) > The problem is that the SSE connection breaks every now and then It does? I've seen some bug reports now and then about this, but they'v...
[15:22:48] Analytics, Better Use Of Data, Event-Platform, Product-Infrastructure-Team-Backlog, Epic: Event Platform Client Libraries - https://phabricator.wikimedia.org/T228175 (jlinehan)
[15:23:12] Analytics, Better Use Of Data, Event-Platform, Product-Infrastructure-Team-Backlog, Epic: Event Platform Client Libraries - https://phabricator.wikimedia.org/T228175 (jlinehan)
[15:26:32] (CR) Nuria: "Can this patch be abandoned?" [analytics/refinery] - https://gerrit.wikimedia.org/r/549876 (owner: Fdans)
[16:01:01] ottomata: yt?
[16:03:55] nuria: ya
[16:05:16] ottomata: on the client side error logging patch
[16:05:26] ya
[16:05:42] ottomata: https://gerrit.wikimedia.org/r/#/c/mediawiki/core/+/553376/
[16:06:22] ottomata: you cannot do utc dates client side in the browser with a native api for all browsers we support
[16:06:41] ottomata: so the dt field needs to be "non mandatory"
[16:07:19] ottomata: as it will only be filled for some browsers (unless we add a bunch of js for date manipulation, which we will most likely not do)
[16:07:28] nuria: what will set the date then?
[16:07:44] ottomata: it cannot be done client side for all browsers
[16:08:14] i think hip said he looked at it and said maybe it was supported by enough? not sure (just asked him to join here)
[16:08:19] ottomata: only for some, thus in some events will not be set or
[16:08:23] ottomata: it is not
[16:09:04] nuria: it is possible to make eventgate do some stuff, and for dt, actually kind of easy because of https://github.com/epoberezkin/ajv-keywords#dynamicdefaults
[16:09:05] but
[16:09:29] that will mean that eventgate is required to do events anywhere, including dev envs, for apps as well.
[16:09:40] and means that we open up the door to server side modifications of events.
[16:09:46] ottomata: https://caniuse.com/#search=ISOString
[16:10:22] interesting, so around 12% of browsers can't use it?
[16:10:36] is there no way to do it manually though?
[16:10:38] ottomata: browsers and traffic are a different thing
[16:10:51] sorry, 12% of 'global usage'?
[16:10:53] i mean
[16:10:57] 12% of traffic?
[16:10:59] ottomata: there is, with quite a bit of js
[16:11:45] nuria: there is also the problem of id and user agent
[16:11:47] ottomata: for us for some locations it is quite a bit more because of KaiOS and UC browser, if you look at it overall it would seem smaller than it really is
[16:11:53] i don't mind not having id, but it'd be nice if we did
[16:11:54] and
[16:12:07] user agent sucks because it'd be sent twice in the payload
[16:12:09] ottomata: user agent to be useful in an event needs to be parsed so there needs to be refinement server side
[16:12:12] (once in the header, and once in the event body)
[16:12:19] that's different
[16:12:26] refinement is different, that's downstream processing
[16:12:37] (CR) Mforns: "I think I found a typo. Otherwise +1!" (1 comment) [analytics/refinery] - https://gerrit.wikimedia.org/r/552943 (https://phabricator.wikimedia.org/T238360) (owner: Nuria)
[16:12:40] we're talking about fields that only the client knows about, so has to add
[16:12:48] (or be added by the server receiving the client HTTP request)
[16:13:04] ottomata: right right, but that being the case why would we send it on the event (for analytics events, for errors i can see that we need to)
[16:13:13] ?
[16:13:18] how else will it get there?
[16:14:07] ottomata: the user agent? in analytics events with post processing of UA header
[16:16:09] nuria: the header is only available during the http request/response
[16:16:10] in eventgate
[16:16:30] only the event body is produced to kafka
[16:16:51] ottomata: how would we get our parsed UAS in eventlogging events then?
[16:16:59] ottomata: i might be missing something here...
[16:17:16] currently?
varnishkafka + eventlogging-client-side (raw) processor
[16:17:26] ottomata: no, in the event gate case
[16:17:35] in eventgate, the client has to set it in the event body
[16:17:39] ottomata: the requirement of having geo info and user agent is the same one
[16:17:51] ottomata: that is not going to work for geoinfo
[16:17:54] hmmm
[16:17:55] true.
[16:17:58] ottomata: we need the IP
[16:18:00] but we don't have varnish to help us here
[16:18:05] ottomata: not the geo cookie
[16:18:12] no, the client ip
[16:18:20] ottomata: js does not have access to that
[16:18:30] indeed.
[16:18:35] ottomata: that is a strong requirement
[16:18:39] so we need some server side stuff anyway.
[16:18:42] ottomata: ya
[16:18:52] i'm ok with that, but we need to be careful how we do it and limit it
[16:19:00] there are a couple of ways to do it.
[16:19:07] we could make some custom default handlers for jsonschema with ajv.
[16:19:22] which is cool; and would allow schema designers to specify how the default values for fields are filled in.
[16:19:29] works great for dt, since that is built into ajv
[16:19:44] ottomata: ya, that makes sense, i can see that for UA and IP it is a must cause increasing the payload by 100 chars when those are sent on the headers anyways is also not good practice for poerf
[16:19:46] for headers, we'd need custom functions in eventgate, and config to map from the jsonschema defaults to those functions.
[16:19:48] *performance
[16:20:05] but, that means we're tightly coupling the jsonschemas with eventgate
[16:20:07] ottomata: re: schema designers specifying default value functions - well, I think that's just enough rope for them to hang themselves in some cases
[16:20:33] alternatively, i could implement custom code in eventgate-wikimedia that knew about fields that need default filling, like user agent
[16:20:37] id, ip
[16:20:47] but, that means that WHICH fields those are is hardcoded into code.
[16:20:48] ottomata: i do not think ANY client would not want IP and UA
[16:20:56] instead of being specified by schema designers
[16:21:05] ottomata: so those are not "custom" in that sense, all schemas would want them
[16:22:08] I think it would be good if the things that did this were opaque to the caller, so for example the caller throwing the event isn't setting 'dt', the library is. Same for id, etc. UA is more ambiguous since right now the lib isn't setting that in any patchset, but I'd argue it might be beneficial because of how people tend to try and pre-process UA sometimes. IP obviously would not be something anybody could set client-side.
[16:22:33] Maybe if these were all cordoned off (whether within meta.* or not, I don't know) in some way to make it clear that they are not things you need to worry about, but you will get them
[16:22:42] some of them from going through the client, and the final refinement after going through EG
[16:23:05] maybe in the future we can support custom default computations in the schema, I'm just wary of doing it at first
[16:23:21] for any of them, EG will only have defaults; if client sets them in body those will be kept
[16:23:30] right, that should be fine
[16:23:34] i think EL lib should do the same
[16:23:37] but fill in what it can
[16:23:40] like dt if it can
[16:23:43] if not EG will do it
[16:23:54] Analytics-Kanban, Discovery-Search (Current work), Patch-For-Review: "Wikidata Query Service Updater" should have 'bot' in the user agent to indicate is a tool - https://phabricator.wikimedia.org/T238106 (Gehel) Open→Resolved
[16:24:13] we might need to do that for other things anyway, we talked about this on the patch and I've been thinking about it, it already came up when we were debugging and I think if nowhere else that when debugging people will want to be able to override
[16:24:16] ottomata: I agree with hip that UA< IP are things that givene every use needs and wants
[16:24:23] ottomata: shoudl be
set by default
[16:24:25] *should
[16:24:52] ottomata, hip: sorry
[16:25:00] ok, i do like the idea of not being tied to a specific field by EG code if we don't have to...which would mean schema mods, buuuuut, lemme see how ugly that is.
[16:25:05] "given that are things that everybody needs and wants"
[16:25:06] if it is too ugly we will just do code
[16:25:11] and hardcoded event fields
[16:25:12] "should be set by default"
[16:25:21] lemme make a ticket.
[16:25:25] Analytics, Operations, ops-eqiad, User-Elukey: Check if a GPU fits in any of the remaining stat or notebook hosts - https://phabricator.wikimedia.org/T220698 (EBernhardson) IMO the important things to check beyond physical space: * Does the PSU have enough overhead to support GPU's (current wx 91...
[16:27:06] ottomata: ok, ticket sounds good
[16:27:13] ottomata: for the time question
[16:27:20] ottomata: thanks
[16:27:25] ottomata: is doable in some browsers but not others (with native api)
[16:27:29] cc hip
[16:27:52] ottomata: now, that means that the schema needs to admit that field as being optional
[16:28:23] nuria: yeah, just a heads up that there's a ticket for the time question now https://phabricator.wikimedia.org/T240460
[16:28:33] nuria: it will be a little weird if some dts are client-side and some are server-side
[16:28:54] nuria: but I don't see a way around it unless we just don't send events if they can't make an ISO string, or we make some polyfill in MW somewhere
[16:29:07] hip: just like in the EL case, we need to move forward with this patch before we get an answer to that question
[16:29:27] nuria: fully agree the patch will just do its best, it's not going to be blocked by this kind of thing, that's why I put it on a ticket to triage for later
[16:30:04] hip: agreed.
In the near term let's make the field optional and check for support of api
[16:30:15] nuria: no schema doesn't need it to be optional if eventgate will fill it in
[16:30:23] nuria: I think the problem is that EG can't tolerate the field being optional, so unless we have server-side generation in time
[16:30:28] nuria: we're back to square one
[16:30:54] ottomata, hip: the other option is non-utc times sent client side which in the case of errors will be hugely confusing
[16:30:56] ottomata: do you think we'd get server-side generation in the next few days?
[16:31:10] hip, ottomata : on meeting, can talk later
[16:31:18] hm i can work on that, sure; do you need it in next few days? oh for client errors you mean
[16:31:25] for errors yeah, not for EL
[16:31:37] hip: i really doubt we are going to get that deployed, deploy freeze is coming up, and all of this requires review from MW devs, etc.
[16:31:47] Yeah I know, that's what I'm thinking too
[16:31:49] filippo said as much in the SRE meeting this past monday
[16:32:13] Okay, well nuria and I said we'd try our best and I think we did but maybe he will have to downgrade his OKR to vagrant by EOQ
[16:32:15] ottomata: deployed seems unlikely, yes, but we should have our part done as much as we can
[16:33:27] nuria (or anyone else familiar with the webrequest referer classification): Referers that don't include "http(s)://", such as 'en.wikipedia.org' and 'google.com', are classified as unknown. Is this appropriate behavior or should these be classified as internal and external (search engine) respectively?
[16:33:47] Analytics, Analytics-EventLogging, Event-Platform: eventgate-wikimedia should fill in defaults for some important fields - https://phabricator.wikimedia.org/T240477 (Ottomata)
[16:34:21] hm who knows about referrers... mforns ?
[16:41:42] Analytics, Better Use Of Data, Event-Platform, Product-Infrastructure-Team-Backlog: Explore sending batches of events from EPC libraries - https://phabricator.wikimedia.org/T239996 (LGoto) a:jlinehan
[16:45:58] ottomata, mmmmmmmmaybe? what's the question?
[16:46:14] I see
[16:46:16] see lex's q above
[16:46:17] ya
[16:47:22] lexnasser, I was not aware that there were referrers without the http(s):// prefix, are those a big share of the total?
[16:47:54] I mean, I was not aware there was both with and without prefix
[16:49:41] mforns: Not sure about the proportion, but roughly 20% of all referer_class=unknown referers are without http(s)
[16:55:41] lexnasser, hm, I'd say if a request comes from the outside (not *.wiki*.org) then we're probably OK, but if it comes from within, then we might want to improve parsing no?
[16:56:39] lexnasser, do you have an idea of how many *.wiki*.org referrers without http(s):// vs non *.wiki*.org?
[16:57:15] mforns: Sounds good, what would you guess is the source of the non-http wikipedia referers?
[16:58:39] maybe very old browsers? Depends... Are there all kinds of projects there? Like do those http-less wikimedia referrers belong all to the same project, say en.wikipedia.org? Or are they varied?
[16:59:25] mforns: I think this answers your question: There are roughly 10 times as many external referers as non-http wiki.orgs
[17:00:02] aha
[17:00:25] mforns: I just queried the top 500 unknown, and the only project without https internally is english wikipedia
[17:00:34] aha
[17:02:31] Analytics, Better Use Of Data, Event-Platform, Product-Infrastructure-Team-Backlog: Explore sending batches of events from EPC libraries - https://phabricator.wikimedia.org/T239996 (mpopov) p:Triage→Lowest Marking as lowest priority since this is not critical to any work or blocking anyth...
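Editor's sketch of the parsing improvement discussed above for scheme-less referers such as 'en.wikipedia.org' and 'google.com'. This is a minimal illustration in Python, not the refinery classifier (which is Java and is not shown in this log). The function name, the domain lists, and the "none" and "external" labels are assumptions made for the example; only "unknown", "internal" and "external (search engine)" come from the conversation.

```python
from urllib.parse import urlsplit

# Hypothetical, heavily abbreviated lists; the real classifier's lists are not in this log.
INTERNAL_SUFFIXES = (".wikipedia.org", ".wikimedia.org", ".wikidata.org")
SEARCH_ENGINE_PREFIXES = ("google.", "bing.", "duckduckgo.")


def classify_referer(referer):
    """Classify a referer string, tolerating a missing http(s):// prefix."""
    referer = (referer or "").strip()
    if not referer or referer == "-":
        return "none"
    # Without a scheme, urlsplit() treats 'en.wikipedia.org' as a path and
    # finds no host, so prepend a scheme before parsing.
    if "://" not in referer:
        referer = "http://" + referer
    host = (urlsplit(referer).hostname or "").lower()
    if "." not in host:
        return "unknown"
    if host.endswith(INTERNAL_SUFFIXES):
        return "internal"
    bare = host[4:] if host.startswith("www.") else host
    if bare.startswith(SEARCH_ENGINE_PREFIXES):
        return "external (search engine)"
    return "external"
```

With this normalisation step the two examples from the question stop falling into "unknown"; whether that is the desired behaviour is exactly what the thread leaves for the post-standup discussion.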
[17:02:58] Analytics, Analytics-Kanban: Request for a large request data set for caching research and tuning - https://phabricator.wikimedia.org/T225538 (Nuria) Open→Resolved
[17:03:25] lexnasser, let's post-standup this!
[17:03:59] mforns: Definitely
[17:28:06] Analytics: Check home leftovers of dfoy - https://phabricator.wikimedia.org/T239571 (Milimetric) I tracked down the other zero files I saved for @DFoy: /wmf/data/archive/backup/zero-raw-logs-for-dan-foy Let's ask @kzimmerman if anyone in PA is interested in analyzing this data, and delete if not. Kate - f...
[17:44:40] Analytics, Growth-Team, Product-Analytics, Patch-For-Review: Growth: implement wider data purge window - https://phabricator.wikimedia.org/T237124 (nettrom_WMF)
[17:46:51] Analytics, GrowthExperiments-Homepage, NewcomerTasks 1.0, Growth-Team (Current Sprint), Product-Analytics (Kanban): Add start_startediting_state to the whitelisting of the HomepageVisit schema - https://phabricator.wikimedia.org/T240481 (nettrom_WMF)
[17:47:07] Analytics, GrowthExperiments-Homepage, NewcomerTasks 1.0, Growth-Team (Current Sprint), Product-Analytics (Kanban): Add start_startediting_state to the whitelisting of the HomepageVisit schema - https://phabricator.wikimedia.org/T240481 (nettrom_WMF) a:MMiller_WMF→nettrom_WMF
[17:55:12] joal: maybe you should make hdfsrsync its own project! This might be cool to open source for others to use too
[17:56:20] ottomata: why not :) hdfs-tools, which would include the HDFS-cleaner I guess
[17:56:38] hm, yeah i guess that is useful too....
[17:56:39] hmmm
[17:56:43] i dunno though this might be better standalone
[17:56:50] i'm sure people google for 'hadoop rsync' ALL the time
[17:57:11] ottomata: I have not looked close enough but I bet there are functions/traits we could make generic across those two
[17:57:28] ottomata: Ok let's go for that :)
[17:57:32] joal: once you are free from the scala deploy talk let's talk about wdqs metrics
[17:57:39] sure nuria
[17:57:58] ottomata: is sbt an option, or maven?
[17:58:33] if standalone project, sbt is fine if you want to....
[17:58:39] as long as it works with archiva
[17:58:49] none of us here know sbt though
[17:58:56] and whenever i look at it i get scared
[17:59:08] Analytics, Product-Analytics: Many special pages missing from pageview_hourly dataset starting on July 23, 2019 - https://phabricator.wikimedia.org/T239672 (Neil_P._Quinn_WMF) >>! In T239672#5729650, @Nuria wrote: > @Neil_P._Quinn_WMF > Indeed it makes sense to include things such us Special:Book. Can y...
[18:01:27] ottomata: will think about it :)
[18:01:56] ottomata: I don't have rights on gerrit to manage stuff - Do you mind creating it for me?
[18:02:06] I guess it needs some name bikeshedding first :)
[18:02:25] nuria: what's up about wdqs?
[18:02:45] joal: did you decide with dcausse on a set of metrics?
[18:02:49] joal: let's do it on github; i like doing things there that we want to get more exposure
[18:03:16] i was talking to gehel and we thought it will be good to surface (as part of MTP metrics on "programmatic access to structured content")
[18:03:16] go ahead and start it under your account, we'll move it to wikimedia after/during review, etc.
[18:03:17] ok ottomata
[18:03:23] naming...
[18:03:28] should we do hdfs-tools?
[18:03:29] maybe?
[18:03:31] and put cleaner there?
[18:03:33] maybe that's ok...?
[18:03:33] joal: the request throughput of wdqs and timeout ratios
[18:04:05] ottomata: sounds good - org.wikimedia.analytics.hdfstools
[18:04:24] nuria: we have not decided on a set of metrics yet
[18:04:55] ok
[18:05:20] nuria: for the moment I need to do some data exploration to get familiar with the dataset (should be quick), then we can go for metrics, it's a good idea (report in grafana hourly maybe?)
[18:06:00] joal:
[18:06:00] https://github.com/BROADSoftware/hsync
[18:06:33] AH! I had not found one when looking :(
[18:06:34] oh
[18:06:39] that is only from local to hdfs i think
[18:07:52] hm
[18:08:28] ottomata, nuria: dropping for dinner with the kids, will be back after
[18:08:32] k
[18:10:44] joal: k, let's talk when you are back
[18:23:50] Analytics, Product-Analytics: Many special pages missing from pageview_hourly dataset starting on July 23, 2019 - https://phabricator.wikimedia.org/T239672 (Nuria) @Neil_P._Quinn_WMF Erik's initial definition counted all special pages and did not do much in filtering for bots, this created issues aroun...
[18:28:54] (CR) Mforns: "@Nuria I think it's very difficult to delete something by mistake with this script. You'd have to run the script (with --skip-trash) once," [analytics/refinery] - https://gerrit.wikimedia.org/r/556231 (https://phabricator.wikimedia.org/T237124) (owner: Mforns)
[18:51:06] Back
[18:51:16] Hi nuria - wanna talk?
[18:52:12] * gehel is reading backlog...
[18:52:23] Hi gehel :)
[18:52:26] * gehel is also scared of sbt
[18:52:38] (PS1) Lex Nasser: Modified external webrequest search engine classification and added tests. [analytics/refinery/source] - https://gerrit.wikimedia.org/r/556449 (https://phabricator.wikimedia.org/T239625)
[18:52:40] (CR) Welcome, new contributor!: "Thank you for making your first contribution to Wikimedia!
:) To learn how to get your code changes reviewed faster and more likely to get" [analytics/refinery/source] - https://gerrit.wikimedia.org/r/556449 (https://phabricator.wikimedia.org/T239625) (owner: Lex Nasser)
[18:53:15] looks like I was mentioned, but you don't have any immediate need for me...
[18:53:52] nope gehel - Except if you already have a list of metrics you'd like us to implement :)
[18:54:21] micro chat with _nuria before, we had a few very general ideas
[18:54:37] I'm not entirely sure in which context we need those metrics
[18:54:45] nuria, milimetric: Hopefully I did this right! Just sent to Gerrit. Take a look when you have time: https://gerrit.wikimedia.org/r/556449
[18:55:17] lexnasser: will look in a couple minutes
[19:12:29] (CR) Mforns: [V: +2] "@Nuria, I responded to your comments or prior patch sets inline :]" (3 comments) [analytics/refinery] - https://gerrit.wikimedia.org/r/547320 (https://phabricator.wikimedia.org/T235486) (owner: Mforns)
[19:16:42] (CR) Ottomata: [WIP] Add HdfsRsync scala tool (17 comments) [analytics/refinery/source] - https://gerrit.wikimedia.org/r/556237 (owner: Joal)
[19:24:57] lexnasser: real nice, will take a look later on today
[19:25:00] joal: hola
[19:25:09] heya nuria
[19:25:16] joal: regarding metrics
[19:25:30] joal: i think there are two audiences
[19:25:53] joal: one is the search team and how can they benefit the most from the data
[19:26:17] joal: the other is surfacing programmatic access to structured content on the tuning session
[19:26:29] nuria: can talk for the next 1.5 hours about SDC metrics
[19:26:36] anytime you like
[19:27:42] nuria: I picture the first one, but lack info on the second
[19:27:45] milimetric: give me a sec so i can talk to joal about metrics
[19:28:10] joal: the second i think is going to be for us to define
[19:28:35] joal: but i was thinking very simply reporting throughputs and timeouts (as a sign of not being able to provide enough service)
[19:28:48] joal: and for the first one?
[19:28:56] joal: are there any ideas floating around?
[19:30:14] nuria: about the second, it means we have queries for SDC (by opposition to classic WDQS I mean) - I guess we have them since you mention
[19:30:36] nuria: queries for SDC? no
[19:30:50] joal: no, we have queries just for wdqs
[19:31:02] nuria: about the first, the basics you mentioned: throughput and timeouts, maybe more (need to know more about data)
[19:31:11] joal: so "structured data" here means "wikidata + SDC"
[19:31:39] nuria: then I'd like to do query analysis using a sparql parser, and see if there are not-too-complex ways to cluster the queries
[19:31:49] and/or emerging patterns
[19:32:37] nuria: in my understanding, SDC queries are on their own wikibase - I might be super wrong
[19:33:11] joal: no you are correct, wdqs does not show results for SDC
[19:34:00] joal: but the MTP metrics work for when it does i think if we do "programmatic access to structured data" which is inclusive of SDC (when it comes) and wikidata (for now), makes sense?
[19:34:57] ehhhh - not really :)
[19:35:01] nuria: --^
[19:35:10] joal: bc?
[19:35:13] sure
[19:52:50] Analytics: Ingest wdqs metrics into druid - https://phabricator.wikimedia.org/T240498 (Nuria)
[19:55:13] Analytics, Analytics-Kanban: Estimate percentage wise the number of requests on mediarequest dataset that are previews - https://phabricator.wikimedia.org/T240362 (fdans) This jupyter notebook shows that about 0.7% of all mediarequests are prefetches caused by Media Viewer: ` /user/fdans/notebooks/Propo...
[20:00:23] Analytics, Analytics-EventLogging, Event-Platform: eventgate-wikimedia should fill in defaults for some important fields - https://phabricator.wikimedia.org/T240477 (Ottomata) Ok, to do this we need to standardize and bikeshed the client_ip and user_agent fields, since this is not yet in any official...
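Editor's sketch of the rule stated earlier in this log for T240477: the receiving service only supplies defaults, and any value the client already set in the event body is kept. This is a minimal Python illustration, not eventgate-wikimedia code (eventgate is a Node.js service). The function name `fill_defaults`, the `http.client_ip` and `http.request_headers` field layout, and the lower-cased header keys are all assumptions for the example, since the ticket notes the field names still needed to be standardized.

```python
from datetime import datetime, timezone


def fill_defaults(event, headers, client_ip, now=None):
    """Return a copy of `event` with server-side defaults for unset fields only.

    `headers` is assumed to be a dict of lower-cased HTTP request header names.
    """
    now = now or datetime.now(timezone.utc)
    filled = dict(event)

    # dt: keep a client-supplied timestamp, otherwise use the receive time in UTC.
    filled.setdefault("dt", now.strftime("%Y-%m-%dT%H:%M:%SZ"))

    # Client IP and user agent are only known to the server handling the request.
    http = dict(filled.get("http") or {})
    http.setdefault("client_ip", client_ip)
    request_headers = dict(http.get("request_headers") or {})
    user_agent = headers.get("user-agent")
    if user_agent is not None:
        request_headers.setdefault("user-agent", user_agent)
    http["request_headers"] = request_headers
    filled["http"] = http
    return filled
```

Because every assignment goes through `setdefault`, a browser that can produce a UTC timestamp keeps its own `dt`, while one that cannot gets the server's receive time. That is the mixed client-side/server-side `dt` situation flagged as "a little weird" in the discussion.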
[20:05:38] Analytics, Analytics-EventLogging, Event-Platform: eventgate-wikimedia should fill in defaults for some important fields - https://phabricator.wikimedia.org/T240477 (Nuria) +1 to including UA in http.request_headers field
[20:06:25] milimetric: bc?
[20:06:41] omw
[20:11:47] Analytics, Analytics-EventLogging, Event-Platform: eventgate-wikimedia should fill in defaults for some important fields - https://phabricator.wikimedia.org/T240477 (Ottomata) Hm, actually, we get X-Client-IP as a request header from varnish VCL right now. I guess that will continue to be the case i...
[20:31:26] Analytics, Analytics-EventLogging, Event-Platform: eventgate-wikimedia should fill in defaults for some important fields - https://phabricator.wikimedia.org/T240477 (Ottomata) Oh, we already made an `http.client_ip` field for api/request and cirrusearch/request. I guess that's where it goes! Hm....
[20:35:36] ottomata: About ConfigHelper, I wondered about using it but didn't want the refinery-core import
[20:35:43] yeah
[20:35:51] ottomata: shall I copy it to hdfstools?
[20:35:52] makes sense now that we've talked about making it standalone
[20:35:54] naww
[20:35:56] let's keep it without
[20:36:24] ok - I'd have loved the config through properties for free though :)
[20:36:52] Thanks for the first comments by the way ottomata :)
[20:38:05] ottomata: another question: is the 4-spaces important for you?
If so I'll do it, but really prefer the 2-spaces for scala :)
[20:39:04] joal: if it is another project we maybe can get away with it
[20:39:04] but
[20:39:09] i think we should probably just be consistent everywhere
[20:39:15] and do 4 spaces
[20:39:24] ok - changing :)
[20:39:27] ty
[21:03:58] (PS1) Milimetric: Document JDK version requirement [analytics/refinery/source] - https://gerrit.wikimedia.org/r/556470
[21:04:19] (CR) Milimetric: [V: +2 C: +2] Document JDK version requirement [analytics/refinery/source] - https://gerrit.wikimedia.org/r/556470 (owner: Milimetric)
[21:09:25] (CR) Nuria: "Thanks for doing this." [analytics/refinery/source] - https://gerrit.wikimedia.org/r/556470 (owner: Milimetric)
[21:10:02] (CR) Nuria: [C: +1] "Convinced, +1 to changes" [analytics/refinery] - https://gerrit.wikimedia.org/r/556231 (https://phabricator.wikimedia.org/T237124) (owner: Mforns)
[21:11:02] thx!
[21:14:14] (CR) Nuria: "Deferring to @milimetric for naming question, I think data_quality_stats makes clear what this table is about."
[analytics/refinery] - https://gerrit.wikimedia.org/r/547320 (https://phabricator.wikimedia.org/T235486) (owner: Mforns)
[21:14:54] lexnasser: one more thing we need to do is to make sure the UDFs built with your new code work well
[21:15:39] lexnasser: although you did not modify the UDF we need to see it continues to work like before, see: https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hive/QueryUsingUDF#Testing_a_UDF_you_just_wrote
[21:16:16] lexnasser: https://github.com/wikimedia/analytics-refinery-source/blob/master/refinery-hive/src/main/java/org/wikimedia/analytics/refinery/hive/GetRefererSearchEngineUDF.java
[21:17:32] lexnasser: I think this is the one: https://github.com/wikimedia/analytics-refinery/blob/master/oozie/webrequest/load/refine_webrequest.hql#L56
[21:18:03] lexnasser: https://github.com/wikimedia/analytics-refinery-source/blob/master/refinery-hive/src/main/java/org/wikimedia/analytics/refinery/hive/SmartReferrerClassifierUDF.java
[21:23:51] (PS12) Nuria: [WIP] Table and workflow for features computations per session [analytics/refinery] - https://gerrit.wikimedia.org/r/552943 (https://phabricator.wikimedia.org/T238360)
[21:24:49] (CR) Nuria: [C: -1] "Testing job now that we think we got naming." [analytics/refinery] - https://gerrit.wikimedia.org/r/552943 (https://phabricator.wikimedia.org/T238360) (owner: Nuria)
[21:26:27] Analytics, Operations, ops-eqiad, User-Elukey: replace onboard NIC in kafka-jumbo100[1-6] - https://phabricator.wikimedia.org/T236327 (RobH)
[21:26:44] Analytics, Analytics-Kanban: Estimate percentage wise the number of requests on mediarequest dataset that are previews - https://phabricator.wikimedia.org/T240362 (Nuria) Nice, let's make sure the main docs for API call out the fact that about 0.7% of the data represents purely just requests that we kno...
[21:28:47] Analytics, Operations, ops-eqiad, User-Elukey: replace onboard NIC in kafka-jumbo100[1-6] - https://phabricator.wikimedia.org/T236327 (RobH)
[21:31:27] Analytics, Operations, ops-eqiad, User-Elukey: replace onboard NIC in kafka-jumbo100[1-6] - https://phabricator.wikimedia.org/T236327 (RobH) >>! In T236327#5729983, @Jclark-ctr wrote: > @RobH > Server New Rack Switchport > kafka-jumbo1001 a4 39 > kafka-jumbo1003 b2...
[22:38:17] Analytics: Ingest wdqs metrics into druid - https://phabricator.wikimedia.org/T240498 (Nuria)