[00:22:26] (PS2) Madhuvishy: Add oozie job to schedule restbase metrics generation job [analytics/refinery] - https://gerrit.wikimedia.org/r/235519 (https://phabricator.wikimedia.org/T110691) [03:37:32] Analytics-Kanban, Reading-Admin, Patch-For-Review, Wikipedia-Android-App: Update definition of page view and implementation for mobile apps {hawk} [8 pts] - https://phabricator.wikimedia.org/T109383#1605362 (bearND) @dr0ptp4kt @Nuria I'll be ooo tomorrow already. Just reiterating what I said in T110... [06:48:52] Analytics, Engineering-Community, MediaWiki-API, Research consulting, and 3 others: Metrics about the use of the Wikimedia web APIs - https://phabricator.wikimedia.org/T102079#1605475 (Qgil) Thank you to all the parties involved in solving this task (mostly out of personal interest)! I was wonder... [13:31:51] hi a-team! [14:06:21] good morning ottomata [14:17:05] morning! [14:18:02] gwicke: hiya [14:19:01] ottomata: will be in a call the next ~30 minutes, but happy to chat after [14:19:17] ok, let's chat some this afternoon with other interested folks are around too [14:19:35] yup, those sleepy heads ;) [14:23:54] Analytics-Kanban: Set up auto-purging after 90 days {tick} - https://phabricator.wikimedia.org/T108850#1606513 (mforns) a:mforns [14:24:26] Analytics-Kanban, Reading-Admin, Patch-For-Review, Wikipedia-Android-App: Update definition of page view and implementation for mobile apps {hawk} [8 pts] - https://phabricator.wikimedia.org/T109383#1606518 (Nuria) Ok, since @bearND is OOT I will take care of code changes cc @dr0ptp4kt [14:24:41] ottomata: holaaa [14:24:51] hello [14:25:25] ottomata: i tried yesterday to curl confluent-event-bus.services.eqiad.wmflabs:8081 [14:25:50] ottomata: but it seems that was not reachable, can i reach it from anywhere? [14:27:28] from services project [14:29:04] ottomata: whatayamean? [14:30:17] nuria, that instance is in services labs project [14:30:23] so you ahve to be logged into some host there [14:31:28] ottomata: can you give me permits to services labs? cause i cannot ssh to confluent-event-bus.services.eqiad.wmflabs [14:31:56] gwicke: can I add nuria to services? i'm going to assume yes :) [14:32:21] k did nuria :) [14:32:54] ottomata: then ssh-ing to the instance I should be able to access that endpoint via http [14:33:02] ottomata: rightttt/ [14:34:15] yes [15:04:41] (CR) Milimetric: "let me know if you want to chat about it, Baha, I want to make sure this code is easy to work with." [analytics/dashiki] - https://gerrit.wikimedia.org/r/231424 (https://phabricator.wikimedia.org/T104261) (owner: Milimetric) [15:10:07] (PS1) Joal: Correct bug in refine oozie job definition [analytics/refinery] - https://gerrit.wikimedia.org/r/236027 [15:20:29] ottomata: and .. where are logs of registry instance on labs? cannnot find them under /var [15:25:24] (CR) Joal: [C: 2] "LGTM !" [analytics/refinery/source] - https://gerrit.wikimedia.org/r/234453 (https://phabricator.wikimedia.org/T109547) (owner: Madhuvishy) [15:26:13] joal|night, yt? [15:26:19] I am ! [15:26:27] * joal|night should change nick [15:26:42] can you batcave for 3 mins? [15:26:45] sure [15:26:48] :] [15:34:26] Analytics-EventLogging, Analytics-Kanban, Patch-For-Review: Fix auto_offset_reset bug in pykafka and build new .deb package {stag} [8 pts] - https://phabricator.wikimedia.org/T111182#1606640 (Ottomata) [15:34:41] nuria: standup! 
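On the earlier question about reaching confluent-event-bus.services.eqiad.wmflabs:8081 — once logged into an instance inside the services labs project, a quick reachability check is to list the registry's subjects. A minimal sketch, assuming the standard Confluent Schema Registry HTTP API and the host/port mentioned in the conversation:

```python
import requests

# Assumption: this hostname only resolves from inside the services labs project.
REGISTRY = 'http://confluent-event-bus.services.eqiad.wmflabs:8081'

# GET /subjects returns the list of subjects (schema names) the registry knows about.
resp = requests.get(REGISTRY + '/subjects',
                    headers={'Accept': 'application/vnd.schemaregistry.v1+json'})
resp.raise_for_status()
print(resp.json())
```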
[15:37:01] i recently got access to hive and was trying to query out some information about how often the opensearch api action is used, but i'm getting access control errors: Job Submission failed with exception 'org.apache.hadoop.security.AccessControlException(Permission denied: user=ebernhardson, access=WRITE, inode="/user":hdfs:hadoop:drwxrwxr-x [15:37:06] anyone know which other rights i should be requesting from ops? [15:38:01] Analytics-Backlog, Analytics-EventLogging, Analytics-Kanban, Patch-For-Review: eventlogging tests in mainline error - https://phabricator.wikimedia.org/T111438#1606670 (ggellerman) a:Nuria [15:38:02] ebernhardson, did you ask Ops for Hive access or for stat1002 access? [15:38:59] Ironholds: hive [15:39:31] looks like i was added to statistics-private-users, is there another group i should ask for? [15:40:29] ebernhardson: can you ssh to 1002? [15:41:06] ebernhardson: like > ssh your-user@stat1002.eqiad.wmnet [15:42:49] nuria: yea, and thats where i've run hive from [15:43:35] ebernhardson, there is. Who added you? We try to get ottomata to handle these requests for precisely this reason [15:44:17] and unrelated to this I'm being told the HTTP/HTTPS web proxy isn't resolving isself on 1002 - I've tried adding them with export HTTP_PROXY=http://webproxy.eqiad.wmnet:8080 but bupkis. Any ideas? [15:46:13] ebernhardson: you ned to be in analytics-privatedata-users to use hive and webrequest logs [15:46:14] are you? [15:46:38] i would have been added by whoever was on RT duty for ops. looks like akosiaris [15:46:56] ottomata: yes, and i can start up hive and issue a query with no fancy conditions (just partition equality) [15:47:23] but once i add anything that looks to need intermediate state, i get the above access control exception [15:47:26] s/add/query/ [15:47:38] ebernhardson: [15:47:39] on stat1002 [15:47:41] yes [15:47:43] groups ebernhardson [15:47:48] or your username [15:47:49] whatever it is [15:47:57] ebernhardson : wikidev statistics-privatedata-users [15:48:02] yeah, you are not in the right grou [15:48:09] haha, i think i even talked with alex about this! [15:48:10] sigh [15:48:16] doh. There is also smalyshev [15:48:25] we both asked/got access same time [15:48:30] ticket? [15:48:35] (s)? [15:48:48] https://phabricator.wikimedia.org/T109356 and https://phabricator.wikimedia.org/T109357 [15:51:54] (CR) Joal: "One small comment, but looks good otherwise !" (1 comment) [analytics/refinery] - https://gerrit.wikimedia.org/r/235519 (https://phabricator.wikimedia.org/T110691) (owner: Madhuvishy) [16:03:03] Hi a-team, my phone died in the night for no reason and I overslept :/ I feel silly [16:03:15] it's ok madhuvishy [16:03:23] we're still hanging out but winding down [16:03:26] :] np [16:05:45] Analytics-Backlog, Analytics-EventLogging, Privacy: Opt-out from logging some of the default EventLogging fields - https://phabricator.wikimedia.org/T108757#1606740 (csteipp) > > For schemas where tracking a user across log entries isn't needed, I think this is a very good idea. The more schemas we hav... 
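Back on the Hive AccessControlException earlier in the hour: the job fails because the user isn't in the group that grants Hive/webrequest access, so the job can't write its scratch data under /user on HDFS. A small sketch of the membership check ottomata did with `groups`; the username comes from the conversation and the required group name (analytics-privatedata-users rather than statistics-privatedata-users) is taken from his comment:

```python
import grp
import pwd

def groups_for(username):
    """Return primary + supplementary group names for a local user."""
    primary = grp.getgrgid(pwd.getpwnam(username).pw_gid).gr_name
    supplementary = [g.gr_name for g in grp.getgrall() if username in g.gr_mem]
    return set([primary] + supplementary)

# Hive + webrequest access requires analytics-privatedata-users (per the discussion above).
print('analytics-privatedata-users' in groups_for('ebernhardson'))
```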
[16:35:45] !log deployed Event Logging [16:39:12] nuria: joal http://spark.apache.org/docs/1.3.0/running-on-yarn.html Under debugging your application [16:39:56] Thx madhuvishy [16:46:24] milimetric: I double checked parquet with columns vs map (we have the is_zero field extracted from the map already) [16:46:53] field only is 30% faster than map --> faster, but not that much [17:20:59] (CR) GWicke: Add oozie job to schedule restbase metrics generation job (1 comment) [analytics/refinery] - https://gerrit.wikimedia.org/r/235519 (https://phabricator.wikimedia.org/T110691) (owner: Madhuvishy) [17:26:35] (PS3) Madhuvishy: Add oozie job to schedule restbase metrics generation job [analytics/refinery] - https://gerrit.wikimedia.org/r/235519 (https://phabricator.wikimedia.org/T110691) [17:26:58] 30% for a query over a month might be a lot though, joal. I guess we'll wait and see if the map causes their specific jobs to be slow [17:28:59] ebernhardson: let us know if you need anything once you gain access [17:29:27] nuria: hopefully be alright. thanks for looking into it [17:32:05] milimetric: good for me :) [17:44:14] madhuvishy or mforns , wnat to CR my eventlogging patch? [17:44:17] *want [17:44:23] nuria, sure [17:44:29] is it urgent? [17:44:39] it doesn't even deserve the name patch... [17:44:43] mforns: nah [17:44:47] ok [17:44:47] not at all actually [17:44:54] let me see [17:45:07] next week will work too, no rush [17:45:49] nuria, I thought it was longer :] [17:46:08] question: how do you normally run eventlogging server testS? [17:47:11] mforns: tox -e py27 [17:47:16] ok [17:48:06] mforns: i foerget so i wrote it here: https://www.mediawiki.org/wiki/Extension:EventLogging#How_to_run_tests [17:48:55] *i forget [17:49:00] nuria, ok, I asked because I always used nosetests and flake8 [17:49:32] and that way never had the problem that you fixed, but I also tried tox and yes, got errors [17:54:24] nuria: have a minute for me ? [17:54:32] joal of course! [17:54:36] batcave? [17:54:43] omw [17:57:30] nuria, merged it looks good! [17:57:50] mforns: thank you! [17:57:53] :] [17:58:55] joal: I fixed the trailing white space [17:59:56] ok madhuvishy, give me a minute [18:11:47] madhuvishy: do I merge, or will you do it later ? [18:12:35] joal: I don't want to self merge, can you :) [18:12:43] I will :) [18:13:24] But, thinking of that, I will merge the source repo, then wait for a deploy to merge the refinery repo [18:13:44] (CR) Joal: [C: 2 V: 1] "Good for me, deploy needed before merge." [analytics/refinery] - https://gerrit.wikimedia.org/r/235519 (https://phabricator.wikimedia.org/T110691) (owner: Madhuvishy) [18:14:19] (CR) Joal: [V: 2] "Good for me :)" [analytics/refinery/source] - https://gerrit.wikimedia.org/r/234453 (https://phabricator.wikimedia.org/T109547) (owner: Madhuvishy) [18:14:27] joal: alright, do we deploy now? [18:14:38] or friday night so we do monday? [18:14:42] Nope, not on friday :) [18:14:53] joal: cool :) monday after tasking then [18:15:00] Sounds good ! [18:15:14] Ooooh, monday --> holiday for you, no ? [18:15:19] madhuvishy: --^ [18:15:29] joal: oh yeah [18:15:39] i will probably work that day though [18:15:45] are you planning to work? [18:15:55] I plan to, yes [18:16:04] joal: ah cool i'll ping you then [18:16:09] ok :) [18:16:10] i'm saving up my holidays [18:16:26] because I'm going to India in October [18:16:30] longer bundled time to get home I guess [18:16:34] right :) [18:16:42] :) [18:16:49] Let me recall, when do you get mqrried ? [18:16:59] is it that time ? 
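joal's 30% number above comes from comparing a filter on a dedicated Parquet column against pulling the same flag out of a map column. A rough sketch of that kind of comparison under Spark 1.3; the partition path and the x_analytics_map / is_zero field names are assumptions used for illustration only:

```python
import time

from pyspark import SparkContext
from pyspark.sql import SQLContext

sc = SparkContext(appName='column-vs-map-timing')
sqlContext = SQLContext(sc)

# Placeholder path to a refined webrequest Parquet partition.
webrequest = sqlContext.parquetFile('/path/to/webrequest/partition')
webrequest.registerTempTable('webrequest')

def timed(query):
    """Run a SQL query and return (result, wall-clock seconds)."""
    start = time.time()
    result = sqlContext.sql(query).collect()
    return result, time.time() - start

# Same predicate, once via a map lookup and once via the extracted column.
print(timed("SELECT COUNT(*) FROM webrequest WHERE x_analytics_map['zero'] IS NOT NULL"))
print(timed("SELECT COUNT(*) FROM webrequest WHERE is_zero"))
```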
[18:20:12] joal: ha ha [18:21:01] Announcement time I guess, I was gonna say in next staff meeting, I'm going to India this october to get engaged. Wedding will be a year later :) [18:21:15] !!!! [18:21:19] Nov 5th is the engagement :) [18:21:22] WooooOOOOOOOTTTTTTT [18:21:28] :D [18:21:33] I am sorry to have pushed, I didn't mean :) [18:21:59] ha ha no I was gonna tell myself :) [18:22:42] milimetric: nuria mforns ^ [18:22:54] yesss [18:23:03] owowowowow [18:23:23] super congrats [18:23:30] thank youuu :D [18:23:49] So we just have to convince Kevin to organize an off-site in India next year :D [18:23:49] Analytics, Engineering-Community, MediaWiki-API, Research consulting, and 3 others: Metrics about the use of the Wikimedia web APIs - https://phabricator.wikimedia.org/T102079#1608576 (SVentura) Ditto! Thank you to everyone working on this. +1 @Qgil, this is an important project and thank you Qg... [18:25:59] joal: yes! :D I'll have dates in a couple months :D [18:26:09] That sounds awesome ! [18:26:36] congratulations again :) [18:26:44] That's amazing news : [18:27:16] thanks joal :D [18:44:54] (PS1) Ottomata: Release 0.1.4-2 with upstart script [analytics/kafkatee] (debian) - https://gerrit.wikimedia.org/r/236066 [18:59:56] (CR) Ottomata: [C: 2 V: 2] Release 0.1.4-2 with upstart script [analytics/kafkatee] (debian) - https://gerrit.wikimedia.org/r/236066 (owner: Ottomata) [19:00:37] ottomata: yt? have a sec? [19:03:11] nuria: hiya [19:03:12] sure [19:03:26] ottomata: want to make sure i did not discover wonderbread here [19:03:51] ottomata: so through the rest proxy you can consume in text form [19:04:00] regardless of setup of topic [19:04:04] ottomata: did we knew that [19:04:06] ? [19:04:17] ottomata: so i can setup a plaintext json consumer [19:04:36] ottomata: from something that holds events in any from [19:04:50] yes [19:04:52] ottomata: you probably knew this [19:04:53] nuria, yes [19:05:03] but, using rest for consumption isn't actually very useful [19:05:22] its nice [19:05:38] but for building apps not that useful, youd' have to poll periodically in a loop or something for new messages [19:05:42] * madhuvishy wants wonderbread [19:05:45] its ok [19:06:08] ottomata: yess got it but for debug purposes and humans seeing plain text on binary streams it sure makes it easy [19:06:31] *easy* :) [19:06:31] but yeah [19:06:48] also, confluent ships with a kafka-avro-console-{produucer,consumer} [19:06:55] ottomata: haha yes, they do not give you logs just in case you get too confident [19:08:06] k, then, with the schema registry that you have on labs how can i access kafka directly to test consumption from a regular consumer, is that what we are interested on? 
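For the REST-proxy consumption nuria and ottomata discuss above (plain-text/JSON out of a binary Avro topic, at the cost of polling over HTTP), here is a minimal sketch. It assumes the Confluent REST proxy v1 API, the default proxy port 8082 on the same labs instance, and a made-up consumer group name; the EditEvent topic is the one from the conversation:

```python
import requests

PROXY = 'http://confluent-event-bus.services.eqiad.wmflabs:8082'  # assumed host:port

# Register a consumer instance; the proxy tracks its offsets server-side.
create = requests.post(
    PROXY + '/consumers/debug-group',
    headers={'Content-Type': 'application/vnd.kafka.v1+json'},
    json={'format': 'avro', 'auto.offset.reset': 'smallest'},
)
create.raise_for_status()
base_uri = create.json()['base_uri']

# Poll the topic; Avro records come back with JSON-decoded keys and values.
records = requests.get(
    base_uri + '/topics/EditEvent',
    headers={'Accept': 'application/vnd.kafka.avro.v1+json'},
)
records.raise_for_status()
for record in records.json():
    print(record['value'])
```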
[19:08:44] ottomata: we would like to consume in plain text from that consumer regardless that your Editevent topic is actually stored on binary [19:10:18] nuria, easiest to log into kafka-event-bus instance there [19:10:21] you will have access to kafka cli [19:10:25] just enter [19:10:27] kafka [19:10:28] for usage [19:10:35] but, you could also install kafkacat [19:10:36] and use that [19:10:40] that is a nice tool too [19:10:48] you should have sudo right [19:10:49] s [19:10:52] kafka addy is [19:10:58] kafka-event-bus.services.eqiad.wmflabs:9092 [19:11:34] i have sudoyes [19:12:19] ottomata: ok, will try some things [19:23:41] nuria: if you have some time today, we can chat about last access uniques [19:25:43] (CR) Nuria: [C: 2 V: 2] Correct bug in refine oozie job definition [analytics/refinery] - https://gerrit.wikimedia.org/r/236027 (owner: Joal) [19:26:01] nuria: got a moment? [19:26:15] gwicke: yessir [19:26:44] re avro: do you mean using avro schema with a JSON data option? [19:27:32] gwicke: no, i was thinking (cc ottomata) of doing away with json and just using avro, as it being a more descriptive set [19:27:48] gwicke: seems a better choice for starters. [19:28:19] gwicke: ideally you should be able to consume avro in text form [19:28:40] for many clients, json will be easier to work with [19:28:46] gwicke: so usage of avro should not imply binary consumption, that would be ideal [19:29:05] gwicke: mmm.. do you have a use case in mind? [19:30:02] sure.. in JS and PHP, json is built in, while avro requires a separate library & dealing with (hopefully up to date) schemas [19:30:18] python too, of course [19:30:44] for many consumers, json is really the only option [19:31:07] none of this means that we have to use json internally [19:31:19] but I think it makes the case for supporting JSON well [19:32:04] gwicke: mmm..not sure about the library argument, it will be less out-of-the-box (y) but clients do not gain/loose any ability from using one or other [19:32:13] I'm pretty agnostic about the storage format; we should perhaps benchmark compressed json vs. compressed avro [19:33:14] https://en.wikipedia.org/wiki/Turing_tarpit [19:33:15] gwicke: rest proxy gives json back to consumers [19:33:24] as it consumes binary from kafka [19:33:38] but, i think we both think that rest is not ideal for many consumers [19:33:45] great for producers, but maybe not consumers [19:33:46] ease of use matters in practice [19:34:36] gwicke: but we live in microlibrary world now, any usage of js will require you to install additional deps/libs [19:34:55] gwicke: php the way we use it is also quite customized [19:34:55] not really [19:35:16] any browser supports fetching resources & decoding JSON [19:35:16] gwicke: i think we should be able to build something that using Avro schemas, and then allows for producer-clients to choose whether or not their json encoded data will be produced to kafka as binary or as json, after validation [19:35:48] sure, I'm open to offering avro as another option [19:36:07] gwicke: do you see your consumer uses cases consuming via http? [19:36:08] my point is mainly that JSON should be well supported at the API [19:36:43] gwicke: I am not so sure that easy of use is that different here [19:36:55] ottomata: definitely some [19:37:17] gwicke: to validate json against an schema (from client side) you will also need additional libraries, and data is not going to be freeform in any case right? 
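For consuming straight from the broker (the kafka-event-bus address ottomata gave above) rather than via HTTP, a minimal sketch using pykafka — the library referenced in the EventLogging task earlier in this log. Note this returns the raw message bytes, i.e. Avro binary if that is how the topic was produced:

```python
from pykafka import KafkaClient
from pykafka.common import OffsetType

client = KafkaClient(hosts='kafka-event-bus.services.eqiad.wmflabs:9092')
topic = client.topics[b'EditEvent']

# Read from the beginning of the topic; message.value is the raw payload bytes.
consumer = topic.get_simple_consumer(auto_offset_reset=OffsetType.EARLIEST,
                                     reset_offset_on_start=True)
for message in consumer:
    if message is not None:
        print(message.offset, message.value)
```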
[19:37:21] most public API users will be using JSON, and that would be where the volume is going to be [19:37:45] nuria: you don't have to mess with schemas, though [19:37:54] it's going to be validated on the server anyway [19:38:06] a producer will have to know the schema in order to write data [19:38:22] right [19:38:26] know in the 'looked at the spec', yes [19:38:26] otherwise how will they know what to write? [19:38:29] otherwise data will go to /dev/null [19:38:37] *as in [19:38:48] right [19:39:02] there is no need to download specs & validate, though [19:39:13] yeah i think that's true [19:39:19] validation is up to this system [19:39:32] being schema-aware adds a layer that is not satisfied by any native php or js json capabilities [19:39:34] if the spec changes in incompatible ways, then the client code needs updating anyway [19:39:45] I don't see anything in avro that will help with that [19:40:01] we generally try to avoid breaking schema changes for that reason [19:40:10] gwicke: you might as well use an avro-aware client to write your unit tests right? [19:40:27] aye, but avro has the built in ability to validate compatibility for you [19:40:30] of different schemas [19:40:40] same with json schema [19:40:44] oh? [19:40:53] how so? [19:41:01] our swagger api specs all integrate with json schema, too [19:41:32] ottomata: you pick $json_schema_validator_lib, and call validate [19:41:47] how does it deal with field rename? [19:42:13] you mean a breaking rename, where the old field is no longer valid? [19:42:38] or aliased [19:43:33] well, normally you look for the new one & fall back to the old one [19:43:52] json schema does that? [19:43:54] and keep the old one working in your schema for a while, and possibly return some warnings [19:44:39] it's not a very common thing, though [19:44:46] in my experience, at least [19:45:38] basically, with avro you update the schema in your code, while with json you update the code that emits the data [19:47:11] hm, no, in this case, with avro, you update the schema in a registry [19:47:34] not in each client? [19:47:41] for cases where there are many consumers and producers, it is inconvenient or not possible to update them all [19:47:42] no [19:47:52] how does that work for consumers? [19:48:26] will they magically guess the new name & map it to the old one? [19:49:15] if you renamed a field via an alias, consumers with old versions of the schema would be able to read the new data via the old name alias, and consumers with the newer schema would be able to read the data with either name [19:49:31] I see where you are coming from & sympathize with the desire to leverage schemas, but I don't think it's realistic that everything is going to switch to Avro [19:49:33] in either case, the consumers could have a local copy of whichever version of the schema they are using, OR they could look it up by Id in the schema registry [19:50:19] in the case of streams, the schemas will not be associated with each record, only the schema _id will. 
confluent uses this to look up the schema for a given record [19:50:42] gwicke: i agree with that, i also think producers and consumers shoudl be able to work with json [19:51:05] if we can do both of those with Avro, i think there is a lot of value on standardizing on avro schemas [19:51:36] I agree that if we can consume json from an avro stream it would be the best of all worlds [19:51:53] there are two ways that could be done: [19:52:07] we can convert between avro & json schemas [19:52:09] everything avro binary in kafka, but that would mean there'd ahve to be some proxy in between kafka and the consumer in order to translate [19:52:14] gwicke: i think that doesn't work very well [19:52:33] or the other way, is to produce avro json text directly to kafka for those consumers that want to consume straight json from kafka [19:53:05] JSON Schema to Avro schemas [19:53:06] This processor is not complete yet. It is _much_ more difficult to write than the reverse for two reasons: [19:53:06] • JSON Schema can describe a much broader set of data than Avro (Avro can only have strings in enums, for instance, while enums in JSON Schema can have any JSON value); [19:53:06] • but Avro has notions which are not available in JSON (property order in records, binary types). [19:54:03] that sounds bad [19:54:05] i also don't see a need for JSON schema if Avro schema does JSON. it will make going back and forth between binary and jsontext much easier, which will be extra handy for analytics purposes [19:55:29] if the avro json serialization would be suitable for directly emitting as a JSON blob, without parsing, then that could work for me [19:55:41] aye, i think that would be cool [19:55:52] it also sounds that we should probably stick to the subset that's supported in both JSON schema & avro [19:56:23] agreed, regardless of the decision, staying compatible makes sense [19:56:40] how would you hook up avro schemas with eventlogging schema storage? [19:57:01] should work very similarly [19:58:18] gwicke: or why not just hook it up to confluent schema registry? i had talked with ori, and we had thought about making meta just a frontend editor for schema registry [19:58:47] also a possibility: make confluent rest proxy able to produce avro-json instead of just avro-binary [19:58:53] the schema registry has issues with the id stuff [19:58:56] ? [19:59:07] clients need to know opaque schema ids [19:59:59] not totally true, you can get the latest schema by name [20:00:03] where would you maintain the mapping from topic to schema id? [20:00:34] but, even so, for producing you have said yourself you don't need the schema, and for consuming, you'd only need it if you were consuming binary [20:00:54] the producer proxy needs the schema for a given topic [20:01:07] and others should ideally have an easy way to retrieve it, as well [20:01:22] we also need documentation for this api [20:01:38] # Fetch the most recently registered schema under subject "Kafka-value" [20:01:38] $ curl -X GET -i -H "Content-Type: application/vnd.schemaregistry.v1+json" \ [20:01:38] http://localhost:8081/subjects/Kafka-value/versions/latest [20:02:24] oh, okay [20:02:29] so the service could wrap that [20:03:55] yeah think so, this service still could be just the kafka rest proxy though. but that's only if we can get it to produce avro-json directly, [20:03:56] and [20:04:16] you say, only if we can map a topic to a schema 'subject' [20:04:30] and make sure that no one produces to a topic schemas from a different subject, right? 
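The schema-registry curl call quoted above, expressed as a small Python sketch; the registry URL and the EditEvent-value subject (the usual topic-value naming convention) are assumptions for illustration:

```python
import json
import requests

REGISTRY = 'http://confluent-event-bus.services.eqiad.wmflabs:8081'  # assumed registry URL
SUBJECT = 'EditEvent-value'                                          # assumed subject name

resp = requests.get(
    '{0}/subjects/{1}/versions/latest'.format(REGISTRY, SUBJECT),
    headers={'Accept': 'application/vnd.schemaregistry.v1+json'},
)
resp.raise_for_status()
body = resp.json()

# The registry returns the schema id and version plus the Avro schema as a JSON string.
avro_schema = json.loads(body['schema'])
print(body['id'], body['version'], avro_schema.get('name'))
```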
[20:05:13] we probably want to provide 'production' schemas right in the config to keep things simple & reliable [20:05:25] ? [20:05:34] but eventlogging needs to support users editing schemas [20:06:13] hm, maybe. i'm for that, but i'm not sure it needs to be part of MVP. ori seems to think we can make meta work with confluent schema registry [20:06:17] schemas for events emitted by mediawiki or some other service should go through code review etc, so belong in the config [20:06:29] ah yes [20:06:46] i've thought a little about that, and am not really clear how that would work. i've seen some discussions about that on mailing lists [20:06:56] best practices on managing schemas that are in git as well as in schema registry [20:07:13] likely, they will be managed in git, but then posted to schema registry as part of deploy [20:07:14] maybe. [20:07:15] not sure. [20:07:39] hmm, at that point you might lose your compatibility guarantees [20:07:58] yeah they'd need to verify the schema changes as part of review [20:08:11] sounds like JSON schema ;) [20:08:12] this gets weird, because now there are 2 sources of truth for schemas [20:08:46] honestly, I'm not sure that the renaming thing is worth all this hassle [20:08:47] yeah, this would be an issue for either JSON schema or avro [20:09:35] we wouldn't even need a schema registry if we put all the schemas in the config [20:09:42] its not just renaming, its standardizing on something the whole org can use. evolution is more than naming, and there are many advantages of typed binary too [20:09:46] what config? [20:09:57] the service config [20:10:16] how would people get new schemas in? [20:10:19] we are already using json schema widely [20:10:27] to us, avro is the new kid on the block [20:10:32] gwicke: ... you would if you wanted to have more than 1 schema alive at one point [20:10:47] gwicke: "you would need a registry" [20:11:11] gwicke: as you only have 1 instance of config [20:11:38] gwicke: but potentially several versions of same schema can be alive [20:11:43] gwicke: at any one time [20:11:47] sure, but what do we gain by having multiple schemas? [20:12:02] compared to a backwards-compatible schema [20:12:32] if it's only about the renaming, then I think that's a bit weak [20:13:11] gwicke: you do not control at which rate clients update or at which rate events are published to an schema [20:13:28] so you could have several versions alive (of one schema) at any one time [20:13:53] we should control the rate of schema changes with code review [20:14:10] as well as making sure that we are deliberate about how we change schemas [20:14:20] gwicke: its not just renaming [20:14:35] what if you want to add a new field? [20:14:52] what value should that field have if a client with an old version of the schema is getting the new data? [20:14:53] not sure I follow.. just add it? [20:15:25] or, sorry, the other way around. 
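Before the thread continues below, here is the "add a new field" case made concrete: with Avro, data written under an old schema can be decoded with a newer reader schema, and the missing field is filled from the reader schema's default (renames would additionally use aliases). A minimal sketch assuming the fastavro library and a made-up 'Edit' record schema:

```python
import io

from fastavro import schemaless_reader, schemaless_writer

# Version 1: what old producers wrote.
writer_schema = {
    'type': 'record', 'name': 'Edit',
    'fields': [
        {'name': 'page_title', 'type': 'string'},
        {'name': 'user_id', 'type': 'long'},
    ],
}

# Version 2: adds a field with a default (a rename would use "aliases" instead).
reader_schema = {
    'type': 'record', 'name': 'Edit',
    'fields': [
        {'name': 'page_title', 'type': 'string'},
        {'name': 'user_id', 'type': 'long'},
        {'name': 'is_bot', 'type': 'boolean', 'default': False},
    ],
}

buf = io.BytesIO()
schemaless_writer(buf, writer_schema, {'page_title': 'Main_Page', 'user_id': 42})
buf.seek(0)

# Old binary data decoded against the new schema: is_bot comes back as False.
print(schemaless_reader(buf, writer_schema, reader_schema))
```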
[20:15:28] clients will normally be behind, not ahead in terms of schema [20:15:38] no [20:15:41] data sits around [20:15:41] :) [20:16:02] for a limited time, yes [20:16:40] with JSON, you either get the new field or you don't [20:16:41] limited == temporary == forever :) [20:16:52] if you have an old client, you just ignore fields you don't know about [20:16:53] right, but then each client needs to special case that [20:17:08] if you have a client with new schema, reading old data [20:17:35] that client then needs to special case the absense of that field [20:17:38] you still need to support the old data if you care [20:17:40] as do every consumer of that data [20:17:45] does* [20:18:01] I get that avro can handle renames, but there are many other schema changes that will require code changes anyway [20:18:12] if evolutionis built into your schema system, then you don't need to special case it. the new schema specifies what shoudl happen [20:18:26] semantic code changes yes [20:18:40] but, you shouldn't have to special case your code in order to read old data [20:18:46] lets say you replace a string with an object and some properties [20:18:59] gwicke: evolution on schema system makes your life easy in many realms, backfilling, migrations [20:19:03] naw, you can't change types :) [20:19:06] (a pretty common scenario in my experience) [20:19:13] normally, you'd add a new property for the object [20:19:26] so that you are both forward & backward compatible [20:19:32] and keep support for processing both [20:19:41] how would avro handle this? [20:19:44] but you have to keep that support for processing in every client [20:19:53] it doesn't, you can't change field types [20:20:07] if you need a new type, you need a new field [20:20:20] it sounds like it would only handle trivial renames [20:21:37] and field additions and deletions [20:22:37] but yes, that is a good thing. you should not be able to change the types of your fields. otherwise all your downstream consumers will go crazy! [20:22:44] both are covered by JSON schema too [20:23:21] you can specify a default value, for example [20:23:31] gwicke: in the case of field addition. reader has newer schema, data was written with older schema [20:23:37] there is also fairly rich string field validation [20:23:38] json schema will fill in the field with a default value? [20:23:48] yes, the validator can [20:24:33] example: https://github.com/epoberezkin/ajv#formats [20:27:18] gwicke: from what I can tell, that is not in the jsonschema spec, just that guy happened to decided to support it [20:29:04] no, there is patternProperty (to define property names by regexp), and http://json-schema.org/latest/json-schema-validation.html#anchor107 [20:29:25] they are part of the spec [20:30:23] Some keywords, if absent, MAY be regarded by implementations as having a default value. In this case, the default value will be mentioned. [20:31:20] http://json-schema.org/latest/json-schema-validation.html#anchor104 is the more general section [20:31:48] this is actually used in the swagger specs to document validation constraints [20:32:15] gwicke: i'm not seeing what is relevant in those links, maybe they aren't linking right? [20:32:39] that most recent one is scrolling me to 7.  
Semantic validation with "format" [20:32:45] yup [20:33:09] it lets you assert the format of strings [20:33:38] using pre-defined patterns, but also by regexp [20:33:42] oh oh [20:33:50] sorry, thought we were still taking about defaults [20:34:33] also: http://json-schema.org/latest/json-schema-validation.html#anchor25 [20:35:13] yyesssss? [20:35:26] minLength / maxLength and regexp [20:35:54] these are not defaults though, no? [20:36:04] no, they are validators [20:36:24] http://json-schema.org/latest/json-schema-validation.html#anchor101 [20:36:30] these don't help with schema evolution though [20:36:30] that's 'default' ^^ [20:37:03] This keyword can be used to supply a default JSON value associated with a particular schema. It is RECOMMENDED that a default value be valid against the associated schema. [20:37:10] haha [20:37:22] what happens if the default value is not valid! [20:37:22] hah [20:37:36] i guess there is no way for json schema to evaluate itself :p [20:37:39] probably a bug report [20:37:40] validate* [20:37:42] hah, yeah [20:38:08] so, the json schema validator you are using will return data filled with with the default valud if the field is not present? [20:38:26] many validators support that, yes [20:39:09] the one I picked doesn't yet: https://github.com/epoberezkin/ajv/issues/42 [20:40:14] i just tried python jsonschema [20:40:21] there are many things that validate that shouldn't! :) [20:40:38] and also default doesn't do antyhing [20:41:00] does it pass the spec tests? [20:41:31] oh, i'm sure it does, i mean that shouldn't if one were to do schema validation properly [20:41:40] sory [20:41:44] schema evolution* [20:42:18] as I said, default is in the spec but not supported by all libraries [20:42:21] i have to change a type of a field to get an error [20:42:54] you were trying unspecified properties? [20:43:04] yes [20:43:15] so, I'm having some issues setting the HTTP/HTTPS proxy on stat1002. Anyone feel like helping me out? [20:43:17] normally those can be allowed or not in the validator [20:43:31] gotta do setdefault('additionalProperties', False) [20:43:31] aye [20:43:58] Ironholds: holaaa.. proxy for what thing? [20:44:06] ottomata: I'm going to grab lunch, bbiab [20:44:12] oh that is eventloggin [20:44:12] ok laters [20:46:16] ottomata: the kafka broker was localhost:9092 [20:46:23] in confluent-event-bus? [20:47:54] nuria: yeah, usually kafka is 9092 and zookeeper is 2181 [20:48:41] nuria, using a git repo for some of our standardised retrieval scripts [20:48:51] https://www.irccloud.com/pastebin/WW56rQYR/ [20:48:53] I mean, we are using that git repo, but being able to git clone/pull [20:49:58] Ironholds: is cloning what is not working? [20:55:03] nuria, indeed. I mean theoretically both should fail [20:55:13] specifically, it can't resolve the proxy [20:55:22] I've been trying export HTTPS_PROXY=blah [20:56:48] Ironholds: for cloning i normally do "git clone https://some-depot" (cloning anonymously) [20:57:05] Ironholds: but that doesn't work for ya? [20:57:41] nuria, on stat1002? I thought it didn't have an internet connection? [20:57:59] holy crap it works [20:58:01] ...what?! ;p [20:58:11] thanks nuria! [20:58:21] ottomata: is this the right connection string? "./kafkacat -b localhost:9092 -t EditEvent" on service instance on labs? [20:58:34] Ironholds: ya cause i did A LOT! 
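On the python-jsonschema point above: the library accepts "default" in a schema but does not apply it during validation out of the box; the usual recipe is to extend a validator class so it fills defaults in while walking "properties". A sketch of that, together with the additionalProperties: False setting mentioned at the end of the exchange (schema contents are made up for illustration):

```python
from jsonschema import Draft4Validator, validators

def extend_with_default(validator_class):
    """Build a validator class that also fills missing properties from their defaults."""
    validate_properties = validator_class.VALIDATORS['properties']

    def set_defaults(validator, properties, instance, schema):
        for prop, subschema in properties.items():
            if isinstance(instance, dict) and 'default' in subschema:
                instance.setdefault(prop, subschema['default'])
        for error in validate_properties(validator, properties, instance, schema):
            yield error

    return validators.extend(validator_class, {'properties': set_defaults})

DefaultFillingValidator = extend_with_default(Draft4Validator)

schema = {
    'type': 'object',
    'additionalProperties': False,   # reject properties the schema doesn't declare
    'properties': {
        'page_title': {'type': 'string'},
        'is_zero': {'type': 'boolean', 'default': False},
    },
    'required': ['page_title'],
}

event = {'page_title': 'Main_Page'}
DefaultFillingValidator(schema).validate(event)
print(event)   # {'page_title': 'Main_Page', 'is_zero': False}
```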
[20:59:27] nuria: which project on labs is this in [20:59:42] services [20:59:48] madhu ^ [20:59:56] hmmm i'm not in it [21:02:24] madhuvishy: do not worry, it can wait for andrew to come back. I basically learned a bunch of things that andrew already knew so nothing new under the sun. we can talk about uniques if you want. [21:02:54] nuria: sure, batcave? [21:02:59] sure [21:14:09] nuria: kafka-event-bus [21:14:14] is the instance [21:14:31] Ironholds: hi sorry! [21:14:47] Ironholds: not working now? [21:15:18] Ironholds: you did [21:15:19] export HTTPS_PROXY=http://webproxy.eqiad.wmnet:8080/ [21:15:19] ? [21:16:47] ottomata, yeah, it works now [21:16:49] and that doesn't ;p [21:17:19] eh? [21:27:29] ottomata, nuria: https://phabricator.wikimedia.org/T88459#1610654 [21:32:27] ottomata, proxy setting, oddly, makes git cloning /not/ work [21:32:30] I don't know why [21:32:39] but NOT setting a proxy, on a machine that has no internet connection, works [21:48:32] weird ¯\_(ツ)_/¯ [22:23:07] laters all! [23:15:38] (PS1) Joal: [WIP] Add cassandra load job for pageview API [analytics/refinery] - https://gerrit.wikimedia.org/r/236224 (https://phabricator.wikimedia.org/T108174)