[00:36:26] Analytics, Services (watching): Services team should have access to EventBus logs - https://phabricator.wikimedia.org/T153028#2866893 (Pchelolo)
[02:32:48] Analytics, ChangeProp, Citoid, ContentTranslation-CXserver, and 10 others: Node 6 upgrade planning - https://phabricator.wikimedia.org/T149331#2867141 (GWicke) Considering the upcoming deployment freezes and relatively low priority, the start of the roll-out is looking like January at this point.
[05:04:36] Analytics, Services (watching): Services team should have access to EventBus logs - https://phabricator.wikimedia.org/T153028#2867204 (Nuria) @Pchelolo : feel free to commit a puppet change and we can review (no worries if you prefer not to do that). Probably you just need sudo on those boxes.
[07:52:09] Analytics-Kanban, MediaWiki-Vagrant: Cannot enable 'analytics' role on Labs instance - https://phabricator.wikimedia.org/T151861#2867323 (Physikerwelt) @mschwarzer did that resolve the problem?
[09:31:29] joal: o/
[09:31:34] hi elukey
[09:31:54] random question, answer whenever you have time.. Do we check sequence holes for eventlogging data?
[09:32:05] and if not, would it be wise to do it or maybe not?
[09:37:44] hm elukey, I think we don't check regular holes in eventlogging
[09:37:58] elukey: IIRC we check for holes only when errors occur
[09:40:17] joal: I am reviewing all the configs for the varnishkafka instances that we were not monitoring super well (statsv and eventlogging), sometimes I find weird things like https://gerrit.wikimedia.org/r/#/c/326904
[09:40:34] not heavy but I was wondering what level of "trust" we put in the eventlogging data source
[10:10:13] elukey: I think eventlogging is supposedly trustable, so it might be good to implement better data quality checks !
[10:10:54] super, I was thinking the same, I'll propose this to the team during standup :)
[10:11:33] joal: unrelated news https://eng.uber.com/chaperone/
[10:11:36] \o/
[10:14:03] Analytics-Wikistats, Bengali-Sites: Ireland in Tagalog, Bengali and Urdu Wikipedia traffic breakdown - https://phabricator.wikimedia.org/T143254#2867642 (MarcoAurelio)
[10:24:29] elukey: Indeed ! chaperone website looks really good ! Hope the thing is same ;)
[10:31:54] joal: https://phabricator.wikimedia.org/T149451 is very interesting
[10:32:18] spark streaming or flink? :P
[11:12:44] !log deleted /srv/stat1001 on stat1004
[11:12:45] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[11:35:13] Analytics, Trash: --- Items above are triaged ----------------------- - https://phabricator.wikimedia.org/T115634#2868309 (Aklapper) Invalid>Open @Pokefan95: Please check first whether task authors are established community members before closing a task as invalid.
[11:41:48] * elukey lunch!
[12:42:50] Analytics, Operations, Ops-Access-Requests, Services (watching): Services team should have access to EventBus logs - https://phabricator.wikimedia.org/T153028#2868470 (mobrovac)
[13:02:30] Analytics, Operations, Ops-Access-Requests, Services (watching): Services team should have access to EventBus logs - https://phabricator.wikimedia.org/T153028#2868534 (mobrovac) a:Nuria>None @Pchelolo reading the logs works just fine for me: ``` mobrovac@kafka1001:~$ tail /srv/log/eventl...
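As a concrete illustration of the sequence-hole check joal and elukey discuss above (09:31–09:40): varnishkafka stamps each message with a per-host sequence number, so a data-quality check can compare how many messages actually arrived with the span of sequence numbers emitted, much like the existing webrequest sequence statistics. The sketch below is only that — the table name, partition values and column names are placeholders, not the real eventlogging schema:

```sql
-- Per producing host and hour: if the number of messages received differs
-- from the span of varnishkafka sequence numbers, something was lost
-- (or duplicated) on the way in.
SELECT
  hostname,
  COUNT(*)                                       AS count_actual,
  MAX(sequence) - MIN(sequence) + 1              AS count_expected,
  (MAX(sequence) - MIN(sequence) + 1) - COUNT(*) AS count_missing
FROM eventlogging_raw            -- placeholder table name
WHERE year = 2016 AND month = 12 AND day = 13 AND hour = 14
GROUP BY hostname
HAVING (MAX(sequence) - MIN(sequence) + 1) - COUNT(*) != 0;
```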
[13:10:33] Analytics, EventBus, Operations, Services (watching), User-mobrovac: Services team should have access to EventBus logs - https://phabricator.wikimedia.org/T153028#2868690 (mobrovac) a:mobrovac Oh, you may mean syslog logs. We need to output them just as we do for SCB services.
[13:27:56] hey team :]
[14:06:45] Analytics, EventBus, Operations, Services (watching), User-mobrovac: EventBus HTTP proxy service's syslog entries should be readable - https://phabricator.wikimedia.org/T153028#2868847 (mobrovac) p:Triage>Normal
[14:07:19] Analytics, EventBus, Operations, Services (doing), User-mobrovac: EventBus HTTP proxy service's syslog entries should be readable - https://phabricator.wikimedia.org/T153028#2866893 (mobrovac)
[14:16:07] o/ (sorry I was in a meeting)
[14:26:10] (CR) Milimetric: [C: -1] Monthly request stats per article title (4 comments) [analytics/aqs] - https://gerrit.wikimedia.org/r/326545 (https://phabricator.wikimedia.org/T139934) (owner: Phantom42)
[14:30:13] (CR) Phantom42: Monthly request stats per article title (1 comment) [analytics/aqs] - https://gerrit.wikimedia.org/r/326545 (https://phabricator.wikimedia.org/T139934) (owner: Phantom42)
[14:33:57] Analytics, EventBus, Operations, Patch-For-Review, and 2 others: EventBus HTTP proxy service's syslog entries should be readable - https://phabricator.wikimedia.org/T153028#2868944 (mobrovac) The above patch fixes the permissions, but further work should be conducted here to bring the logs in a s...
[14:58:33] Analytics, EventBus, Operations, Patch-For-Review, and 2 others: EventBus HTTP proxy service's syslog entries should be readable - https://phabricator.wikimedia.org/T153028#2868979 (Ottomata) Hmmm, is `/srv/log` is the standard location for process logging? I put service output event content (er...
[15:09:04] Analytics, EventBus, Services (next): Create alerts on EventBus error rate - https://phabricator.wikimedia.org/T153034#2868993 (Ottomata) p:Triage>Normal
[15:24:58] Analytics, EventBus, Operations, Patch-For-Review, and 2 others: EventBus HTTP proxy service's syslog entries should be readable - https://phabricator.wikimedia.org/T153028#2869040 (mobrovac) >>! In T153028#2868979, @Ottomata wrote: > Hmmm, is `/srv/log` is the standard location for process loggi...
[15:26:21] Analytics, EventBus, Operations, Patch-For-Review, and 2 others: EventBus HTTP proxy service's syslog entries should be readable - https://phabricator.wikimedia.org/T153028#2869044 (mobrovac) Permissions are now ok: ``` mobrovac@kafka1001:~$ ls -alhF /var/log/eventlogging/ total 2.4G drwxr-xr-x...
[15:26:36] Analytics, EventBus, Operations, Patch-For-Review, and 2 others: EventBus HTTP proxy service's syslog entries should be readable - https://phabricator.wikimedia.org/T153028#2869045 (Ottomata) Lower chance of filling up `/`, but higher chance of filling up the same partition on which Kafka stores...
[15:26:57] hey everyone - sorry been busy this morning
[15:27:35] hiiii
[15:27:37] fdans: hi! How's it going, have you learned *everything* yet?
[15:27:38] :)
[15:27:47] milimetric: is dashiki-01 an analytics labs project instance?
[15:27:52] yes ottomata
[15:28:03] i don't see it...
[15:28:11] I think it's in a dashiki project
[15:28:13] (not analytics)
[15:28:22] og
[15:28:23] oh
[15:28:27] is fdans in that project?
[15:28:46] I don't know, he should be in everything eventually, don't know who added him where
[15:29:11] fdans: I can walk you through how labs works, that's what the above is about
[15:29:58] well, he's trying to log into somethign in labs, and having troulbe
[15:30:06] fdans, let's try an analytics instance, cause I know you are in that
[15:30:08] try
[15:30:09] yeah I was trying to access an instance, using dashiki as test
[15:30:53] ssh kafka701.analytics.eqiad.wmflabs
[15:31:18] (I'm looking to add him to dashiki)
[15:31:29] yeah that did it :]
[15:31:46] fdans: cool, ok, so labs ssh stuff is working for ya
[15:31:50] milimetric: , it should be here: https://wikitech.wikimedia.org/wiki/Special:NovaProject
[15:31:51] brb destroying everything
[15:31:53] fdans: you're not in the dashiki project, yep
[15:32:00] your username is fdans?
[15:32:06] that's right
[15:32:15] k, I'll give you access to destroy even more
[15:32:30] excellent
[15:32:41] Analytics, Pageviews-API, RESTBase-API, Services (watching): Pageviews Data : removes 1000 limit in the most viewed articles for a given project and timespan API - https://phabricator.wikimedia.org/T153081#2869070 (mobrovac)
[15:32:43] k, you should be able to get into the dashiki instances too
[15:32:57] just fyi there's a dashiki-staging-01 that you can really destroy as much as you like
[15:33:11] but let's hang out and I can walk you through how all that works
[15:33:38] sure
[15:34:41] I'm batcavin
[15:38:53] Analytics-Kanban: Stop and remove legacy TSV generation jobs - https://phabricator.wikimedia.org/T153082#2869089 (Milimetric)
[15:53:31] Analytics-Kanban, MediaWiki-Vagrant: Cannot enable 'analytics' role on Labs instance - https://phabricator.wikimedia.org/T151861#2869165 (mschwarzer) Without having HDFS mounted Oozie fails, because it cannot access HDFS: ``` DEBUG ssh: stdout: Notice: /Stage[main]/Cdh::Oozie::Server/Exec[oozie_sharelib...
[15:57:34] Analytics-Kanban, MediaWiki-Vagrant: Cannot enable 'analytics' role on Labs instance - https://phabricator.wikimedia.org/T151861#2869166 (Ottomata) Naw, that can't be right. Nothing accesses the `/mnt/hdfs` fuse mount in Puppet. It is only created for user convenience, so you can `cd` and `ls` around l...
[16:01:24] ottomata, standupppp :]
[16:01:54] AH
[16:08:03] Analytics-Kanban, MediaWiki-Vagrant: Cannot enable 'analytics' role on Labs instance - https://phabricator.wikimedia.org/T151861#2869187 (Physikerwelt) @ottomata that would be excellent. We are working with the math cluster.
[16:11:25] ottomata: ping
[16:12:12] physikerwelt: HI in meetings, but hi!
[16:16:17] ottomata: thank you for your offer to setup a hadoop cluster, I think it would be great if malte or me could try to follow the instructions first. Thus we are sure to understand what's going on;-)
[16:17:26] aye, ok, those instruction are really old, and it does kinda change from time to time, so its hard to right explicit instructions on it
[16:17:47] but, its mostly just setting up a few nodes (at least 3 i think) and including some ops/puppet role classes and setting some hiera vars
[16:18:37] ottomata: ok
[16:19:02] but, if yall want you can follow along
[16:19:08] mabye we can do hangout share screen or something?
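For reference, the Labs SSH access fdans gets working above (15:30–15:31) normally goes through the Labs bastion host. A typical ~/.ssh/config looks roughly like the sketch below; the username, key path and exact bastion hostname are assumptions here and should be taken from the wikitech access documentation:

```
# Reach *.eqiad.wmflabs instances by proxying through the Labs bastion.
Host *.eqiad.wmflabs
    User fdans
    ProxyCommand ssh -W %h:%p bastion.wmflabs.org

Host bastion.wmflabs.org
    User fdans
    # The private key whose public half is registered on wikitech.
    IdentityFile ~/.ssh/id_rsa
```

With something like this in place, `ssh kafka701.analytics.eqiad.wmflabs` works directly from a local shell.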
[16:19:18] yes that would be great
[16:19:19] s/right/write * ;p
[16:19:33] ottomata: so grepping in puppet I can see references of stat1001 in hieradata/common.yaml, role/manifests/analytics_cluster/rsyncd.pp, role/manifests/statistics/base.pp and role/manifests/statistics/cruncher.pp
[16:20:11] just reporting but they might not be needed (if so, let's BURN THEM!)
[16:20:19] ottomata: what would be a good time/data for you
[16:20:32] s/data/date
[16:21:13] physikerwelt: what time zone are you and mschwarzer in?
[16:21:27] europe/berlin
[16:21:30] ah, elukey, yeah, we can probably remove those, but let's leave those for now
[16:21:36] they just specify rsync allow hosts i think
[16:21:56] ok, so today would be good, but probably can't get to it for an hour or two at least. would that be too late for you?
[16:22:35] I can try to take care of it now
[16:22:40] if you want
[16:23:00] ha, elukey sorry, that was about hadoop in labs
[16:23:03] oh, is that what you mean?
[16:23:15] haha, having two convos at once :p
[16:23:24] ottomata: let me check with malte
[16:23:26] k
[16:23:42] physikerwelt: you'll probably need to add me to the math labs project
[16:23:43] ahhahahha
[16:23:46] okok sorry
[16:23:49] :)
[16:26:09] ottomata: ok
[16:30:11] hey a-team, I just merged the change to route requests to throrium
[16:30:15] please try stuff, out
[16:30:18] everything looks good from here
[16:30:22] but, this should affect:
[16:30:51] - stats
[16:30:51] - metrics
[16:30:51] - analytics
[16:30:51] - datasets
[16:30:51] - pivot
[16:30:52] - yarn
[16:30:53] all .wikimedia.org
[16:33:13] (and it takes a while until all the caches are updated)
[16:35:18] elukey: no DNS change, its all cache misc routing
[16:35:21] i ran puppet on all of those
[16:35:40] elukey: there was one stat1001 reference that i missed that was not rsync! :o
[16:35:43] dunno how that happened
[16:35:48] anyway: https://gerrit.wikimedia.org/r/#/c/326988/1
[16:35:48] :)
[16:36:11] ahhh you ran puppet to all of them, ok!
[16:36:15] I thought we needed to wait
[16:36:16] good :)
[16:38:37] Analytics-Kanban, Patch-For-Review: Replace stat1001 - https://phabricator.wikimedia.org/T149438#2869284 (Ottomata) Hoookay! Routes changed, the following sites are now routing to thorium: - https://stats.wikimedia.org - https://datasets.wikimedia.org - https://analytics.wikimedia.org - https://pivot.w...
[16:40:37] Analytics-Kanban, Patch-For-Review: Replace stat1001 - https://phabricator.wikimedia.org/T149438#2869311 (Ottomata) Next up is to decom stat1001. I'll wait a week or so to verify that thorium is totally good.
[16:44:27] Heya.
[16:44:51] I'm getting "No 'Access-Control-Allow-Origin' header is present on the requested resource." errors on https://edit-analysis.wmflabs.org/editor-engagement/ but it was working yesterday.
[16:45:00] Did the server move possible break CORS somehow?
[16:45:18] ("XMLHttpRequest cannot load https://datasets.wikimedia.org/limn-public-data/metrics/ee/daily_edits_by_nonbot_reg_users/dewiki.tsv" etc.)
[16:45:44] Hey, possibly!
[16:46:20] hm
[16:46:26] looking
[16:46:32] James_F: almost certainly this is the server move
[16:46:46] Kk.
[16:50:55] * elukey afk!
[16:57:33] James_F: better?
[16:57:57] ottomata: Yup, all fixed! Thanks. :-)
[17:12:08] Analytics-Kanban, Easy: Standardize logic, names, and null handling across UDFs in refinery-source {hawk} - https://phabricator.wikimedia.org/T120131#1845910 (fdans) a:fdans
[17:15:35] ottomata, do you know mark bergma's irc handle?
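The CORS breakage James_F reports above (16:44) and ottomata fixes by 16:57 comes down to the Access-Control-Allow-Origin response header no longer being sent for datasets.wikimedia.org after the move to thorium. Assuming the site is served by Apache with mod_headers enabled, the fix amounts to something like the snippet below; the directory path and scoping are illustrative, not the actual puppetized configuration:

```
# Re-add the CORS response header for files served from datasets.wikimedia.org
# ("/srv/datasets" is a placeholder document root).
<Directory "/srv/datasets">
    Header set Access-Control-Allow-Origin "*"
</Directory>
```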
[17:16:38] milimetric: 'mark'
[17:16:40] oops
[17:16:42] mforns: ^
[17:16:49] thanks chasemp!
[17:20:24] (PS1) Milimetric: Remove legacy TSV jobs [analytics/refinery] - https://gerrit.wikimedia.org/r/327003 (https://phabricator.wikimedia.org/T153082)
[17:21:03] (CR) Joal: "This will my heart so joyful !" [analytics/refinery] - https://gerrit.wikimedia.org/r/327003 (https://phabricator.wikimedia.org/T153082) (owner: Milimetric)
[17:21:36] joal: already killed the job: https://hue.wikimedia.org/oozie/list_oozie_bundle/0024503-160420145651441-oozie-oozi-B
[17:21:40] *bundle
[17:21:48] mforns: you got it right!
[17:21:57] * joal dances in his small
[17:22:01] office
[17:22:02] ottomata, yes thx
[17:22:09] Thanks a lot milimetric :)
[17:22:13] Wil update the charts
[17:22:23] joal: nono, that's my bad, I can update
[17:22:27] just -1 the change
[17:22:40] Nono, no -1 for that patch ;)
[17:25:29] physikerwelt: i'm going to grab some lunch, want to do labs hadoop stuff in 35 mins? 13:00 east coast, 19:00 berlin (I think?)
[17:39:59] ebernhardson: Hi Sir !
[17:40:18] ebernhardson: have a minute to discuss your questions from yesterday?
[17:47:04] Analytics-Kanban, MediaWiki-Vagrant: Cannot enable 'analytics' role on Labs instance - https://phabricator.wikimedia.org/T151861#2869589 (Ottomata) The math cluster's quota is almost full. We'll need room for at least 3 more instances, probably totally +16G RAM between the 3. Looks like there's only ab...
[18:00:59] ottomata: I just went home. I'm ready now
[18:01:42] ah great, ok, so, first we need to spawn some machines
[18:01:50] but math project is pretty much full on quota
[18:01:55] physikerwelt: ^
[18:02:05] can we fix that? either delete some instances, or increase quota somehow?
[18:02:09] we can delete the mlp instance
[18:02:11] milimetric, fdans, whenever is good for you, we can discuss Wikistats metrics?
[18:02:25] physikerwelt: that'll probalby do it
[18:02:27] its a big one
[18:02:35] if you delete that, i'll spawn the nodes
[18:02:35] mforns yep!
[18:03:06] fdans, cool
[18:03:12] omw
[18:03:20] ok!
[18:03:21] yes...will do
[18:04:48] ottomata: it should be gone, now
[18:04:56] k
[18:07:44] ok physikerwelt am in this hangout https://hangouts.google.com/hangouts/_/wikimedia.org/hadoopify
[18:07:45] if you wanna
[18:07:56] this will actually be the first time i've done this with the horizon interface
[18:10:50] I have configured a hadoop flink cluster on the bwcould which also uses openstack
[18:11:09] aye, but this will be using the ops/puppet stuff
[18:11:11] should be pretty easy
[18:11:15] just hte interface is different than wikitech
[18:12:24] the hangout link does not work for me
[18:12:29] i gotta invite you, i PMed
[18:12:33] what email(s) shoudl I invite?
[19:05:42] a-team: holaaa
[19:05:50] Hola nuria :)
[19:06:02] hola...
[19:06:14] hey nuria, we're in the cave talking wikistats, if you wanna join
[19:06:28] sure, give me 2 mins
[19:12:02] ebernhardson: helloooo?
[19:12:53] joal: hi!
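The Labs Hadoop setup ottomata describes at 16:17 above ("a few nodes... including some ops/puppet role classes and setting some hiera vars") and then walks physikerwelt through here reduces to per-instance role assignment plus a little project-wide hiera. The sketch below is only a rough shape of that: the class and key names are illustrative, not the exact identifiers in ops/puppet, and the hostnames are the instance names Ottomata reports later (21:31: hadoop000 as master, hadoop00[23] as workers) with the usual instance.project.eqiad.wmflabs pattern assumed:

```yaml
# Project-wide hiera for the math Labs project (illustrative key names):
cdh::hadoop::cluster_name: math-hadoop
cdh::hadoop::namenode_hosts:
  - hadoop000.math.eqiad.wmflabs

# Each instance then includes a role class: a Hadoop master/NameNode role
# on hadoop000, and a Hadoop worker role on hadoop002 and hadoop003.
```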
[19:13:08] ebernhardson: Was willing to try to help before leaving if still needed :)
[19:13:36] joal: so i think i found my problem eventually, for some reason hive didn't like my array>, changing it to array> made things stop complaining after a bunch of trial and error
[19:14:08] today i'll finally be testing the oozie workflow/coordinator, and hopefully be done with that step :)
[19:14:12] ebernhardson: I had in mind it would have been something with the struct
[19:14:29] Sorry for not having been able to help, but glad it worked !
[19:14:38] no worries, learned something new about hive corner cases :)
[19:15:18] ebernhardson: This corner-case spot is a neverending hole ;)
[19:16:33] :)
[19:17:53] Analytics-Kanban, Operations, Patch-For-Review, Puppet: Refactor eventlogging.pp role into multiple files (and maybe get rid of inheritance) - https://phabricator.wikimedia.org/T152621#2870027 (Dzahn) I see that we have role/common/eventlogging.yaml adding admin groups, eventlogging-admins and ev...
[19:19:56] Analytics-Kanban, Operations, Patch-For-Review, Puppet: Refactor eventlogging.pp role into multiple files (and maybe get rid of inheritance) - https://phabricator.wikimedia.org/T152621#2870028 (elukey) >>! In T152621#2870027, @Dzahn wrote: > I see that we have role/common/eventlogging.yaml adding...
[19:23:51] Analytics-Kanban, Operations, Patch-For-Review, Puppet: Refactor eventlogging.pp role into multiple files (and maybe get rid of inheritance) - https://phabricator.wikimedia.org/T152621#2870043 (Dzahn) ah, ok! (Let's keep using roles to add admin groups instead of host names though. It means les...
[19:30:01] ottomata: so 1001 is no more? we migrated?
[19:30:15] Analytics-Kanban, Operations, Patch-For-Review, Puppet: Refactor eventlogging.pp role into multiple files (and maybe get rid of inheritance) - https://phabricator.wikimedia.org/T152621#2870083 (elukey) I agree, it was a temporary measure to fix the immediate issue of the eventlogging role not pre...
[19:32:39] nuria: everything is served via thorium now
[19:32:41] so yes!
[19:32:43] 1001 is still around
[19:32:45] will wait a bit to decom
[19:32:58] very few little snaffoos, found a couple of thing sthat weren't puppetized before
[19:32:59] fixed those
[19:33:03] but mostly it all just works!
[19:51:53] jesus these download times for IntelliJ IDEA 🔥
[20:01:36] Analytics-Kanban, EventBus, Wikimedia-Stream: EventStreams documentation - https://phabricator.wikimedia.org/T153117#2870252 (Ottomata)
[20:02:01] Analytics-Kanban, EventBus, Wikimedia-Stream: EventStreams documentation - https://phabricator.wikimedia.org/T153117#2870268 (Ottomata) Created a disambigution page so folks don't get confused: https://wikitech.wikimedia.org/wiki/Event* Don't forget to update https://www.mediawiki.org/wiki/API:Recen...
[20:05:22] ottomata: o/
[20:05:41] about T149451 - would spark streaming or flink not great to use?
T149451: Get 5xx logs into kibana/logstash - https://phabricator.wikimedia.org/T149451
[20:06:14] *be great
[20:06:52] I agree that we could use kafkatee and kafkacat
[20:07:18] elukey: yea it would
[20:07:25] if we had all that set up
[20:07:33] but, we already do the grepping and filtering using kafkatee
[20:07:38] but right now we just write it to a file
[20:07:49] i think we should just do it in the near term and pipe it into kafka
[20:07:54] with kafkacat
[20:08:51] oook okk
[20:09:14] you are the expert and I will not disagree, just wanted to push you a bit :D
[20:12:53] jdlrobson: to double check, our block size in hadoop is 256MB?
[20:12:57] s/jdlrobson//
[20:15:08] ebernhardson: hola! did you solved your problems from yesterday
[20:15:09] ?
[20:15:32] ebernhardson: ah, yes, reading backscroll now.
[20:15:35] going afk team, byeeee
[20:15:36] o/
[20:15:43] ciaoo elukey
[20:16:01] ebernhardson: let me know if you are thinking of rewriting your query with with/as
[20:28:35] Analytics-Kanban, EventBus, Wikimedia-Stream, Patch-For-Review: Implement server side filtering (if we should) - https://phabricator.wikimedia.org/T152731#2858381 (GWicke) I think it would be helpful to start by considering the use cases we actually need to support. So far, the only obvious case...
[20:33:34] (PS4) Nuria: Fixing tests on master [analytics/dashiki] - https://gerrit.wikimedia.org/r/326152 (https://phabricator.wikimedia.org/T152631)
[20:41:12] a-team: i made a little disambiguation page, to help folks if they get confused: https://wikitech.wikimedia.org/wiki/Event*
[20:41:20] (PS5) Nuria: Fixing tests on master [analytics/dashiki] - https://gerrit.wikimedia.org/r/326152 (https://phabricator.wikimedia.org/T152631)
[20:49:24] (PS2) Phantom42: Monthly request stats per article title [analytics/aqs] - https://gerrit.wikimedia.org/r/326545 (https://phabricator.wikimedia.org/T139934)
[20:52:52] (CR) Phantom42: "Fixed those issues" [analytics/aqs] - https://gerrit.wikimedia.org/r/326545 (https://phabricator.wikimedia.org/T139934) (owner: Phantom42)
[20:56:24] nuria: yea i rewrote it using with, much cleaner now
[20:56:46] ebernhardson: ahhh, same place? let me take a look maybe now .. ahem , i understand .. something
[20:57:18] lol :)
[20:57:48] milimetric: still working on patch give me a sec and i will send final patch w/o formatting chnages. I still want to have a formater for indentation but it shoudl not interfere
[20:57:51] *should
[20:59:22] ebernhardson: what file was the query again?
[20:59:36] nuria: stat1002.eqiad.wmnet:/home/ebernhardson/query_clicks.hql
[21:01:04] ebernhardson: indeed, now i still do not understand the query but i understand teh intentio
[21:01:10] :)
[21:01:12] *the intention of wht is going on
[21:01:19] ebernhardson: thanks for refactoring it
[21:03:14] np. It also made it easier to think about breaking out the pieces like this
[21:03:29] i mean easier to think about what each individual part does, and then how it ties together
[21:04:50] ebernhardson: and on your webrequest portion don't you need to filter by time in the form of year/month/day to swap through less data?
( i might be reading this totally wrong)
[21:05:39] nuria: yes, thats what the ${row_timestamp} ${start_timestamp} and ${end_timestamp} is doing, the reason it's in there instead of more direct is because i need 1 hour from cirrus logs, and then i need that hour and the following hour from webrequest
[21:05:53] that way if a user searches at 1:59:59 i can still see the clicks that happened at 2:04:00
[21:06:17] ebernhardson: i see , so that translates to year=23016 and month=09.. etc not a column timestamp
[21:07:03] well, it uses deterministic functions to turn year/month/day/hour into a timestamp, such that the partitioning code can still choose apropriately. Verified with `explain dependency` that it indeed only selects 2 hourly partitions
[21:07:32] it has to do that because otherwise rolling over from 23:00 to 00:00 the next day is difficult
[21:08:14] ebernhardson: right. i see, ok, i would document that in query otherwise someone is going to think (like i did) you are combing teh whole dataset
[21:08:48] ebernhardson: it is documented up top kind of on teh variable set up
[21:09:17] yea a little bit, but it could be more explicit. I've been adding lots of comments to this one so i (or someone else) can figure out how it works down the road, can't hurt to add a couple more comments
[21:09:48] i suppose i should also add a comment at the very top explaining the goal, instead of jumping right into parameters
[21:11:20] ebernhardson: now it is loads better but yes that would be great , even a super brief one
[21:11:57] ebernhardson: do you have somewhere documented the data retention in the source_cirrus_table
[21:12:12] hmm, not sure it's documented anywhere
[21:14:34] one thing i was wondering though, this data would be useful to keep more than 90 days, i was thinking if i anonymize the identity field (it's an md5 of ip+xff+user agent) it should be reasonable, since different hours wouldn't connect together. I wasn't sure best practice there though
[21:15:04] it would seem another round of md5 with a random value used each hour would be sufficient, but wasn't sure how to get that random value into the hive query from oozie
[21:17:45] ebernhardson: we can only retain 90 days
[21:17:55] ebernhardson: so deleting would be best
[21:18:38] ebernhardson: anonymizing correctly is hard to do well so we only retain long term data that has been throughly agreggated
[21:18:54] hmm, ok
[21:19:41] ebernhardson: this is the practice we follow with all datasets now, it might change and part of that research is here:
[21:19:54] ebernhardson: https://wikitech.wikimedia.org/wiki/Analytics/Data/Pageview_hourly/Sanitization
[21:19:57] unfortunately i don't think this data can be aggregated in such a way that it would still be useful, it gets fed into a mathematical model that needs tuples of (query, pages shown to user, pages clicked)
[21:20:01] Analytics-Kanban, EventBus, Wikimedia-Stream, Documentation: EventStreams documentation - https://phabricator.wikimedia.org/T153117#2870581 (Reedy)
[21:20:03] so will have to live with 90 days for now
[21:20:29] ebernhardson: right.
[21:30:50] (PS6) Nuria: Fixing tests on master [analytics/dashiki] - https://gerrit.wikimedia.org/r/326152 (https://phabricator.wikimedia.org/T152631)
[21:31:26] Analytics-Kanban, MediaWiki-Vagrant: Cannot enable 'analytics' role on Labs instance - https://phabricator.wikimedia.org/T151861#2870663 (Ottomata) Ok! hadoop000 is up and running NameNode and ResourceManager.
hadoop00[23] are running DataNodes and NodeManagers. You should be able to log into any of t...
[21:32:08] ebernhardson: think that with the anonymization you suggested malicious search requests will be available like "search=credit-card-number-of-my-exboyfriend-up-forgrabs"
[21:34:07] nuria: indeed. I could probably partially resolve that by only keeping queries that happen more than N times per month from unique sessions, but that wouldn't stop someone determined
[21:34:40] our model actually only works when fed queries that happen fairly regularly (35 to 50 sessions per query is about the minimum it generates valid results for)
[21:34:46] ebernhardson: right, thus the "anonymization is hard" and k-anonymization k"factor"
[21:35:29] but that was also why i was hoping to keep longer data, because queries that are issued a dozen times a month might have enough data to then become useful over longer time periods. But can work with 90 days there is still >10k queries with > 35 unique sessions
[21:38:43] ebernhardson: ya, for now, let's work with 90 days per our privacy policy. our webrequest data is retain even shorter , normally 60 days tops
[21:41:53] (CR) Nuria: "Please look now, tried to undo all formatting changes." (3 comments) [analytics/dashiki] - https://gerrit.wikimedia.org/r/326152 (https://phabricator.wikimedia.org/T152631) (owner: Nuria)
[21:45:27] ebernhardson: so you have tested your stemmer udf and such with the super long query right?
[21:45:37] nuria: right, with a number of different languages
[21:46:15] ebernhardson: it is mazing that it would work for western and arabic languages out of teh box
[21:46:38] nuria: the one it doesn't work so well for is eastern languages :( but i think that's more about what lucene has contributors for
[21:48:29] (CR) Milimetric: [C: +2] "good work. I think we should split up those api tests sometime, that file's getting hard to read. But good stuff." [analytics/dashiki] - https://gerrit.wikimedia.org/r/326152 (https://phabricator.wikimedia.org/T152631) (owner: Nuria)
[21:48:31] (CR) Milimetric: [V: +2 C: +2] Fixing tests on master [analytics/dashiki] - https://gerrit.wikimedia.org/r/326152 (https://phabricator.wikimedia.org/T152631) (owner: Nuria)
[22:02:34] Pchelolo: got any more thoughts on https://gerrit.wikimedia.org/r/#/c/325836/ ?
[22:03:54] ottomata: +1d
[22:04:31] k
[22:04:32] danke
[22:08:25] right, closing for today, see you tomorrow! 👋
[22:12:39] laters!
[22:14:41] (CR) Nuria: [] Lucene Stemmer UDF (6 comments) [analytics/refinery/source] - https://gerrit.wikimedia.org/r/326168 (https://phabricator.wikimedia.org/T148811) (owner: EBernhardson)
[22:15:03] ebernhardson: ok, added comments let me know if they make sense, they are mostly about object instantiation
[22:15:11] fdansIsAway: ciaooo
[22:16:03] mforns: can you triple check (whenever) that dashiki tests run clean for you and i will go ahead and deploy changes? Code is merged but just to be super sure
[22:53:09] nuria: not sure directly how to go about ensuring the single analyzer instance, the problem is the initialize() method of GenericUDF doesn't have access to the lang parameter, it's provided per row. I could perhaps hold onto a HashMap to ensure single instantiation though
[22:53:48] initialize() only knows the type, not the exact value that's passed.
When i'm running this it will be provided with a variety of languages (depends on the row)
[22:56:47] ahh, i see your comment further down about having a separate object that returns singleton, seems plausible
[23:39:11] (PS2) EBernhardson: Lucene Stemmer UDF [analytics/refinery/source] - https://gerrit.wikimedia.org/r/326168 (https://phabricator.wikimedia.org/T148811)
[23:39:16] (CR) EBernhardson: [] Lucene Stemmer UDF (6 comments) [analytics/refinery/source] - https://gerrit.wikimedia.org/r/326168 (https://phabricator.wikimedia.org/T148811) (owner: EBernhardson)
[23:49:55] (PS2) MaxSem: Add stuff from logging phases 2 and 3 [analytics/discovery-stats] - https://gerrit.wikimedia.org/r/326831 (https://phabricator.wikimedia.org/T152559)
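The partition-pruning trick ebernhardson describes at 21:05–21:08 is worth a concrete sketch: the year/month/day/hour partition columns are folded into a single comparable number with deterministic arithmetic, so one start/end window covers the day rollover while Hive still prunes to exactly two hourly partitions (verifiable with EXPLAIN DEPENDENCY). The expression below illustrates the pattern only; it is not the exact one used in query_clicks.hql:

```sql
-- start_ts / end_ts would be yyyyMMddHH values supplied by Oozie, e.g.
-- 2016121323 and 2016121401 to cover the 23:00 hour plus the following hour.
SELECT
  dt,
  uri_path
FROM wmf.webrequest
WHERE webrequest_source = 'text'
  -- Deterministic arithmetic over partition columns only, so the
  -- partition pruner can still pick out the two matching hourly partitions.
  AND (year * 1000000 + month * 10000 + day * 100 + hour) >= ${start_ts}
  AND (year * 1000000 + month * 10000 + day * 100 + hour) <  ${end_ts};
```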
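The per-language analyzer caching nuria and ebernhardson converge on at 22:53–22:56 follows from how Hive's GenericUDF works: initialize() only sees argument types, so the language value has to be read per row in evaluate(), with a map guaranteeing that each Analyzer is built at most once per JVM. Below is a minimal sketch of that shape, not the actual refinery-source class; the class name and the analyzer/stemming helpers are placeholders:

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

import org.apache.hadoop.hive.ql.exec.UDFArgumentException;
import org.apache.hadoop.hive.ql.metadata.HiveException;
import org.apache.hadoop.hive.ql.udf.generic.GenericUDF;
import org.apache.hadoop.hive.serde2.objectinspector.ObjectInspector;
import org.apache.hadoop.hive.serde2.objectinspector.PrimitiveObjectInspector;
import org.apache.hadoop.hive.serde2.objectinspector.primitive.PrimitiveObjectInspectorFactory;
import org.apache.hadoop.hive.serde2.objectinspector.primitive.PrimitiveObjectInspectorUtils;
import org.apache.lucene.analysis.Analyzer;

public class StemSketchUDF extends GenericUDF {

    // One Analyzer per language code, shared by every row this task evaluates.
    private static final Map<String, Analyzer> ANALYZERS = new ConcurrentHashMap<>();

    private transient PrimitiveObjectInspector textOI;
    private transient PrimitiveObjectInspector langOI;

    @Override
    public ObjectInspector initialize(ObjectInspector[] arguments) throws UDFArgumentException {
        // Only argument *types* are visible here; the language value itself
        // arrives row by row in evaluate().
        if (arguments.length != 2) {
            throw new UDFArgumentException("expected (text STRING, language STRING)");
        }
        textOI = (PrimitiveObjectInspector) arguments[0];
        langOI = (PrimitiveObjectInspector) arguments[1];
        return PrimitiveObjectInspectorFactory.javaStringObjectInspector;
    }

    @Override
    public Object evaluate(DeferredObject[] arguments) throws HiveException {
        String text = PrimitiveObjectInspectorUtils.getString(arguments[0].get(), textOI);
        String lang = PrimitiveObjectInspectorUtils.getString(arguments[1].get(), langOI);
        if (text == null || lang == null) {
            return null;
        }
        // Build each language's Analyzer at most once, then reuse it.
        Analyzer analyzer = ANALYZERS.computeIfAbsent(lang, StemSketchUDF::newAnalyzerFor);
        return stem(analyzer, text);
    }

    @Override
    public String getDisplayString(String[] children) {
        return "stem_sketch(" + String.join(", ", children) + ")";
    }

    // Placeholder: choosing the right Lucene Analyzer for a language code is
    // the part the real UDF implements.
    private static Analyzer newAnalyzerFor(String lang) {
        throw new UnsupportedOperationException("no analyzer wired up for: " + lang);
    }

    // Placeholder: the real UDF would run the text through the analyzer's
    // TokenStream and join the resulting terms.
    private static String stem(Analyzer analyzer, String text) {
        return text;
    }
}
```

A ConcurrentHashMap plus computeIfAbsent keeps instantiation to one Analyzer per language even if a JVM ends up evaluating rows from more than one thread.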