[00:01:02] neilpquinn: are you trying from a terminal in Jupyterhub? [00:01:07] madhuvishy: I also just realized I don't have a key set up there anyway, but it seems like there's a problem before that :) [00:01:12] yep, exactly. [00:03:54] neilpquinn: we have https://wikitech.wikimedia.org/wiki/HTTP_proxy which is what enables pip etc [00:04:05] could you try git pushing to the http url of the repo? [00:04:18] ssh:// is probably the culprit [00:07:03] madhuvishy: ah, thanks, that worked! should I just do that from now on? SSH would be nice just to avoid password-typing but it's not super important :) [00:10:33] neilpquinn: yeah, i think the firewall blocks ssh. I'll look into whether we can have that possible but my instinct is there are some security considerations (which i'm not fully clear about right now) [00:12:15] madhuvishy: data exfiltration :) [00:12:26] madhuvishy: right, that makes sense—it is the production cluster after all :) Have a wonderful time in Hawaii! [00:12:32] and really http should be tightly controlled [00:13:28] anyone available to check yarn logs? I'm suspecting that job_1480065021448_58095 died due to the reducer OOM'ing, but unfortunately it doesn't report that it just gets to 82% and the reduce % along with Cumulative CPU stop increasing, until the reducer is started over (and after 3 failures the oozie workflow fails) [00:13:42] bd808: aah makes sense [00:15:34] neilpquinn: thank you :) [00:22:12] ahha, enough digging around in hue i found "Container killed by the ApplicationMaster." which sounds pretty likely memory went past the limits [00:50:32] (03PS3) 10Phantom42: Monthly request stats per article title [analytics/aqs] - 10https://gerrit.wikimedia.org/r/326545 (https://phabricator.wikimedia.org/T139934) [00:51:49] (03CR) 10Phantom42: [] "Now we take only full months in account" [analytics/aqs] - 10https://gerrit.wikimedia.org/r/326545 (https://phabricator.wikimedia.org/T139934) (owner: 10Phantom42) [01:56:25] (03PS1) 10MaxSem: WIP: add client usage of structured data [analytics/discovery-stats] - 10https://gerrit.wikimedia.org/r/327418 (https://phabricator.wikimedia.org/T153272) [02:07:16] (03PS2) 10MaxSem: WIP: add client usage of structured data [analytics/discovery-stats] - 10https://gerrit.wikimedia.org/r/327418 (https://phabricator.wikimedia.org/T153272) [02:19:10] (03PS4) 10Milimetric: [WIP] Port standard metrics to reconstructed history [analytics/refinery] - 10https://gerrit.wikimedia.org/r/322103 [02:49:24] (03PS3) 10MaxSem: Add client usage of structured data [analytics/discovery-stats] - 10https://gerrit.wikimedia.org/r/327418 (https://phabricator.wikimedia.org/T153272) [03:59:16] (03CR) 10Yurik: [V: 032 C: 032] Add client usage of structured data [analytics/discovery-stats] - 10https://gerrit.wikimedia.org/r/327418 (https://phabricator.wikimedia.org/T153272) (owner: 10MaxSem) [04:40:19] 10Analytics-EventLogging, 10ArchCom-RfC, 06Discovery, 10Graphs, and 9 others: RFC: Use YAML instead of JSON for structured on-wiki content - https://phabricator.wikimedia.org/T147158#2875408 (10Yurik) [07:15:05] (03PS1) 10MaxSem: Add client usage of structured data [analytics/discovery-stats] - 10https://gerrit.wikimedia.org/r/327456 (https://phabricator.wikimedia.org/T153272) [07:15:23] (03CR) 10MaxSem: [V: 032 C: 032] Add client usage of structured data [analytics/discovery-stats] - 10https://gerrit.wikimedia.org/r/327456 (https://phabricator.wikimedia.org/T153272) (owner: 10MaxSem) [08:17:42] 10Analytics-Tech-community-metrics: Panel Gerrit-Delays gets 
Gateway timeout - https://phabricator.wikimedia.org/T151751#2875650 (10Lcanasdiaz) 05Open>03Resolved [08:54:35] we definitely need stat1002 quotas [08:54:46] at least for the home directories [09:58:23] Hey, I'm part of this research project https://meta.wikimedia.org/wiki/Research:Understanding_Wikidata_Queries Basically we're analysing wikidata query server logs. We need some help from someone with a better understanding of how https://query.wikidata.org/ works than we have. We noticed that some prefixes like f.e. PREFIX wd: are automatically being added to each query which is being [10:06:02] Oh, looks like my message was too long and got truncated… [10:06:11] We noticed that some prefixes like f.e. PREFIX wd: are automatically being added to each query which is being executed. [10:06:16] Can someone please tell us which prefixes are being automatically added, or where else to ask about it? [10:07:39] jgonsior: Hi! Maybe somebody in https://www.wikidata.org/wiki/Wikidata:IRC could help?? [10:11:45] Thanks, I'll try it there! And sorry for almost using all disk space on stat1002.equid.wmnet by the way [10:18:14] ahhh Hi! Didn't know your IRC name! Don't worry, your home wasn't the biggest, and thanks a lot for following up so quickly [10:18:29] we'd need to advertise more the data-tank partition and set up some home quotas [10:43:25] * elukey early lunch! [10:49:19] 10Analytics-Tech-community-metrics, 06Developer-Relations (Oct-Dec-2016): Panel Gerrit-Delays gets Gateway timeout - https://phabricator.wikimedia.org/T151751#2875883 (10Aklapper) Confirming that also https://wikimedia.biterg.io/app/kibana#/dashboard/Data-Status looks good. Thank you! [10:52:33] 10Analytics-Tech-community-metrics: Kibana's Mailing List data sources do not include recent activity on wikitech-l mailing list - https://phabricator.wikimedia.org/T146632#2875908 (10Aklapper) [11:36:41] (03PS1) 10Addshore: Count number of revslider disables [analytics/wmde/scripts] - 10https://gerrit.wikimedia.org/r/327477 (https://phabricator.wikimedia.org/T152197) [11:44:54] (03PS1) 10Addshore: Count number of revslider disables [analytics/wmde/scripts] - 10https://gerrit.wikimedia.org/r/327479 (https://phabricator.wikimedia.org/T152197) [11:45:04] (03CR) 10Addshore: [V: 032 C: 032] Count number of revslider disables [analytics/wmde/scripts] - 10https://gerrit.wikimedia.org/r/327477 (https://phabricator.wikimedia.org/T152197) (owner: 10Addshore) [11:45:07] (03CR) 10Addshore: [V: 032 C: 032] Count number of revslider disables [analytics/wmde/scripts] - 10https://gerrit.wikimedia.org/r/327479 (https://phabricator.wikimedia.org/T152197) (owner: 10Addshore) [11:48:29] 06Analytics-Kanban, 10MediaWiki-Vagrant: Cannot enable 'analytics' role on Labs instance - https://phabricator.wikimedia.org/T151861#2876063 (10Physikerwelt) @mschwarzer Did you already manage to setup Flink. 
I recommend to document your setup here https://wikitech.wikimedia.org/wiki/Flink [11:54:57] (03CR) 10DCausse: [] Lucene Stemmer UDF (031 comment) [analytics/refinery/source] - 10https://gerrit.wikimedia.org/r/326168 (https://phabricator.wikimedia.org/T148811) (owner: 10EBernhardson) [12:24:10] (03PS2) 10Mforns: Mark Limn as deprecated by Analytics at WMF [analytics/limn] - 10https://gerrit.wikimedia.org/r/327216 (https://phabricator.wikimedia.org/T148058) [12:25:15] (03CR) 10Mforns: [] Mark Limn as deprecated by Analytics at WMF (031 comment) [analytics/limn] - 10https://gerrit.wikimedia.org/r/327216 (https://phabricator.wikimedia.org/T148058) (owner: 10Mforns) [12:35:17] getting some lunch! [12:47:13] 10Analytics, 06Operations, 10Ops-Access-Requests: Requesting access to Analytics production shell for Francisco Dans - https://phabricator.wikimedia.org/T153303#2875954 (10elukey) [13:09:43] 10Analytics, 06Operations, 10Ops-Access-Requests: Requesting access to Analytics production shell for Francisco Dans - https://phabricator.wikimedia.org/T153303#2875954 (10Krenair) What's the rationale for this including deployment access? [13:10:49] 10Analytics, 06Operations, 10Ops-Access-Requests: Requesting access to Analytics production shell for Francisco Dans - https://phabricator.wikimedia.org/T153303#2876223 (10elukey) >>! In T153303#2876221, @Krenair wrote: > What's the rationale for this including deployment access? My mistake, I was about to... [13:11:01] 10Analytics, 06Operations, 10Ops-Access-Requests: Requesting access to Analytics production shell for Francisco Dans - https://phabricator.wikimedia.org/T153303#2876224 (10elukey) [13:52:46] 10Analytics, 06Operations, 10Ops-Access-Requests: Requesting access to Analytics production shell for Francisco Dans - https://phabricator.wikimedia.org/T153303#2876304 (10Krenair) okay [13:57:10] mforns: ! [13:57:11] Hi :) [13:57:15] joal, hello! [13:57:26] some docker, or later? [13:57:49] joal, sure! [13:57:56] batcave? [13:58:03] OMW ! [13:59:33] hey yall [13:59:46] Hi milimetric [14:00:17] joal: if you haven't gotten too far with the metrics, I can run and test them this morning, make sure they'er ok [14:00:39] oh you two are talking - later, I'll busy myself otherwise [14:01:35] milimetric: Didn't start yet on metrics (morning with Alexandre on python, now on docker with mforns) [14:28:45] mforns: Got some dock on the docker.sock thing: https://raesene.github.io/blog/2016/03/06/The-Dangers-Of-Docker.sock/ [14:29:25] joal, yea, I also got there, but couldn't make it useful to me... [14:29:31] k [14:29:53] mforns: which version of docker do you use? [14:30:13] joal, you think we should use another docker kafka cluster? [14:30:22] mforns: not sure at all [14:30:25] joal, 1.12.3 [14:30:49] k mforns, seeing if I need to upgrade ;) [14:30:58] or maybe I do [14:30:59] ? [14:31:05] no no, me ;) [14:42:45] thanks to ema we might have found what is the issue with the clickhouse debian build failures [14:43:10] I am reporting it upstream atm, hopefully we'll get some progress [15:12:00] elukey: Wow, for a report upstream, must be something real ! 
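A quick sketch related to the query.wikidata.org prefix question earlier in the log: the service prepends a set of default namespace prefixes so that short forms like wd:Q42 work without declarations. The list below is only the commonly documented subset (wd, wdt, p, ps, pq, wikibase, bd, rdfs) and may be incomplete, so the authoritative set should still be confirmed with the WDQS/Discovery folks as suggested above.

```scala
// Sketch only: prepend WDQS-style default prefixes to a SPARQL query string.
// The prefix list is the commonly documented subset and may be incomplete;
// the authoritative list lives in the WDQS deployment, not here.
object WdqsPrefixes {
  val defaults: Map[String, String] = Map(
    "wd"       -> "http://www.wikidata.org/entity/",
    "wdt"      -> "http://www.wikidata.org/prop/direct/",
    "p"        -> "http://www.wikidata.org/prop/",
    "ps"       -> "http://www.wikidata.org/prop/statement/",
    "pq"       -> "http://www.wikidata.org/prop/qualifier/",
    "wikibase" -> "http://wikiba.se/ontology#",
    "bd"       -> "http://www.bigdata.com/rdf#",
    "rdfs"     -> "http://www.w3.org/2000/01/rdf-schema#"
  )

  /** Prepend any default prefix the query does not already declare. */
  def expand(query: String): String = {
    val missing = defaults.collect {
      case (pfx, iri) if !query.contains(s"PREFIX $pfx:") => s"PREFIX $pfx: <$iri>"
    }
    (missing.toSeq :+ query).mkString("\n")
  }
}
```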
[15:13:10] well I have been chatting with upstream during the past days https://github.com/yandex/ClickHouse/issues/228 [15:13:14] for several things [15:13:25] BUT the issue is solved, I found the missing piece [15:13:28] and now it builds [15:13:30] FINALLY [15:13:47] Wow [15:13:54] * joal claps for elukey 1 [15:18:30] 06Analytics-Kanban, 13Patch-For-Review, 15User-Elukey: Puppetize clickhouse - https://phabricator.wikimedia.org/T150343#2876579 (10elukey) Thanks to the awesome upstream support the evil json license is gone in the last stable tag, and I was able to build the clickhouse-server deb package with the following:... [15:59:41] a-team: sorry I am currently working on a puppetmaster issue, I need a bit of time, will be late at standup.. if I miss it I'll send escrum, sorry :( [16:00:02] elukey, np ;] [16:04:42] elukey: standuppp [16:04:49] elukey: ah sorry [16:04:59] elukey: just saw your note [16:07:52] ebernhardson: still working on you patch, sorry my meeting yesterday run over by a bunch [16:10:03] nuria: no worries. thanks! [16:40:16] 10Analytics, 10Analytics-EventLogging: Research Spike: Better support for Eventlogging data on hive - https://phabricator.wikimedia.org/T153328#2877030 (10Nuria) [16:41:37] mforns can you tell me again which were the universal breakdowns [16:41:42] so i can update the IA map [16:42:38] a-team: conversation on CI/jenkins and kubernetes: https://phabricator.wikimedia.org/T152684 [16:49:11] ashgrigas: project, language, article type, and user type [16:49:19] thanks! [16:50:33] (03CR) 10Nuria: [V: 032 C: 032] Mark Limn as deprecated by Analytics at WMF [analytics/limn] - 10https://gerrit.wikimedia.org/r/327216 (https://phabricator.wikimedia.org/T148058) (owner: 10Mforns) [17:03:47] 06Analytics-Kanban: Create 1-off tsv files that dashiki would source with standard metrics from datalake - https://phabricator.wikimedia.org/T152034#2877113 (10Nuria) Tasks: * Create dynamically partition table (https://cwiki.apache.org/confluence/display/Hive/DynamicPartitions) by wiki and metric. In this mo... [17:06:46] 06Analytics-Kanban: Create 1-off tsv files that dashiki would source with standard metrics from datalake - https://phabricator.wikimedia.org/T152034#2877144 (10Nuria) Path: run 1 small wiki, verify dynamic partitioning running, setup dashiki, run it for all wikis, make sure that dashiki can source all wikis [17:11:27] 06Analytics-Kanban, 06Reading-Web-Backlog: mobile-safari has very few internally-referred pageviews - https://phabricator.wikimedia.org/T148780#2877168 (10Nuria) a:03mforns [17:15:10] 06Analytics-Kanban: Clean up datasets.wikimedia.org - https://phabricator.wikimedia.org/T125854#2877178 (10Nuria) [17:18:18] 10Analytics-Dashiki: WMF Dashiki instance should have reasonable URL - https://phabricator.wikimedia.org/T88390#2877188 (10Nuria) [17:18:21] 10Analytics-Dashiki: Dashiki 404 - https://phabricator.wikimedia.org/T104545#2877186 (10Nuria) 05Open>03Resolved [17:32:22] 10Analytics, 10Analytics-EventLogging: Add user_agent_map field to EventCapsule - https://phabricator.wikimedia.org/T153207#2877270 (10Nuria) * for mysql/EL * install ua-parser python * proces raw ua on capsule, we need to cache ua string versus parsed blob for the life of the application * insert json blob in... [17:33:37] 10Analytics, 10Analytics-EventLogging: Add user_agent_map field to EventCapsule - https://phabricator.wikimedia.org/T153207#2872765 (10Nuria) Take a look at queries that use UA. 
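On the user_agent_map plan in T153207 just above (parse the raw UA once, cache the parsed result, store a small map per event): the EventLogging/MySQL side would do this in Python with ua-parser as noted, but the same parse-and-cache idea is sketched here in Scala against the Java ua-parser port that refinery already uses for Hive. The output keys are illustrative, not the final EventCapsule schema.

```scala
import ua_parser.{CachingParser, Client}

// Sketch: parse a raw User-Agent and return the small map that would go into
// the proposed user_agent_map field. CachingParser memoizes parse results,
// which covers the "cache ua string versus parsed blob" step from the task.
// Output keys are illustrative, not the final EventCapsule schema.
object UserAgentMap {
  private val parser = new CachingParser()

  def apply(rawUa: String): Map[String, String] = {
    val c: Client = parser.parse(rawUa)
    Map(
      "browser_family" -> c.userAgent.family,
      "browser_major"  -> Option(c.userAgent.major).getOrElse("-"),
      "os_family"      -> c.os.family,
      "os_major"       -> Option(c.os.major).getOrElse("-"),
      "device_family"  -> c.device.family
    )
  }
}
```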
[17:34:06] 10Analytics-EventLogging, 06Analytics-Kanban: Add user_agent_map field to EventCapsule - https://phabricator.wikimedia.org/T153207#2877285 (10Nuria) [17:37:00] 06Analytics-Kanban: Productionize loading of edit data into Druid (contingent on success of research spike) - https://phabricator.wikimedia.org/T141473#2877308 (10Nuria) a:03JAllemandou [17:37:33] 06Analytics-Kanban: Productionize loading of edit data into Druid (contingent on success of research spike) - https://phabricator.wikimedia.org/T141473#2499884 (10Nuria) ooziefying the druid loading [17:40:12] 06Analytics-Kanban, 06Operations, 10Ops-Access-Requests: Requesting access to Analytics production shell for Francisco Dans - https://phabricator.wikimedia.org/T153303#2877322 (10Nuria) [17:43:52] 10Analytics, 10Pageviews-API, 10RESTBase-API, 06Services (watching): Pageviews Data : removes 1000 limit in the most viewed articles for a given project and timespan API - https://phabricator.wikimedia.org/T153081#2869058 (10Nuria) At this time we do not have precalculated more than the first 1000, we will... [17:47:10] 06Analytics-Kanban, 10EventBus, 06Services (next): Create alerts on EventBus error rate - https://phabricator.wikimedia.org/T153034#2877336 (10Nuria) [18:10:14] mforns: doinf some tryal and error with docker-kafka, let me know when your meeting is over :) [18:10:45] joal, ok, the meeting ends in 50 mins, but then there's metrics meeting... [18:10:54] yop [18:11:02] mforns: if it ends early then ;) [18:11:08] ok [18:11:33] milimetric: thanks a lot for the Jeff Deans post, that's really fun :) [18:35:00] (03CR) 10Milimetric: [] "really great progress, a couple of small things and one tricky testing bug. The testing bug was there before but not as obvious until you" (0311 comments) [analytics/aqs] - 10https://gerrit.wikimedia.org/r/326545 (https://phabricator.wikimedia.org/T139934) (owner: 10Phantom42) [18:40:12] * elukey afk! byeeeee [18:40:19] Bye elukey ! [18:44:18] joal, docker? [18:44:30] Yessir ! [18:44:33] Batcave ? [18:44:38] yessir :] [18:47:41] due to connection issues I'm randomly unavailable, I'll just work on the metrics for the rest of the day, and read up about dynamic partitioning [18:48:06] jo: we can work tomorrow to set that up? [18:48:20] milimetric: Yay ! [18:48:28] Please ping me whenever you want :) [18:48:35] milimetric: --^ [18:49:31] will do [18:49:41] (how long you around for today?) [18:49:42] see you tomorrow ateam! [18:49:48] nite fdans [18:49:53] fdans: Bye ! [19:27:23] if anyone has a moment, i'm trying to figure out a problem where windowing over ~1GB of data OOM's with 4 reducers (default 4GB memory). I ended up having to push memory all the way to 16GB (14GB heap) to get it to consistently pass. An example failure is: https://hue.wikimedia.org/jobbrowser/jobs/application_1480065021448_58877 [19:28:08] the part that seems most suspicious to me is the counters, for a 1GB dataset with 3GB of reduce shuffle, the reducers wrote 3GB to local disk, and read back 63GB [19:29:39] and this is the query: https://phabricator.wikimedia.org/P4630 [19:41:33] (03PS4) 10Phantom42: Monthly request stats per article title [analytics/aqs] - 10https://gerrit.wikimedia.org/r/326545 (https://phabricator.wikimedia.org/T139934) [20:12:34] ebernhardson: let me finish teh steemmer patch ( 1 more test to go) and maybe we can look at that together? [20:15:05] (03CR) 10Milimetric: [V: 032 C: 032] "This is accepted. Good work. 
We now will have to update documentation and the puppet (infrastructure) version of the routes and docs. W" [analytics/aqs] - 10https://gerrit.wikimedia.org/r/326545 (https://phabricator.wikimedia.org/T139934) (owner: 10Phantom42) [20:17:04] nuria: sure, i'm going to run and grab a sandwich will be back in 20-30min [20:24:14] 06Analytics-Kanban, 07Easy, 03Google-Code-In-2016, 13Patch-For-Review: Add monthly request stats per article title to pageview api - https://phabricator.wikimedia.org/T139934#2878056 (10Milimetric) a:05Milimetric>03Phantom42 [20:28:06] joal: around? [20:37:08] (03PS3) 10Nuria: Lucene Stemmer UDF [analytics/refinery/source] - 10https://gerrit.wikimedia.org/r/326168 (https://phabricator.wikimedia.org/T148811) (owner: 10EBernhardson) [20:45:52] ebernhardson: code has a bit more bolierplate but as far as i know this is the way to do it if we wnat to use the more feature-full udf base class cc dcausse [20:47:30] (03PS4) 10Nuria: Lucene Stemmer UDF [analytics/refinery/source] - 10https://gerrit.wikimedia.org/r/326168 (https://phabricator.wikimedia.org/T148811) (owner: 10EBernhardson) [20:48:37] nuria: thanks! indeed that's a bit more boilerplate but seems reasonable [20:49:19] ebernhardson: do test it out just in hive just in case , test pass, if it were not to work let me know and i will add more tests [20:51:09] certainly, will do [22:02:23] ebernhardson: looking t query [22:05:11] ebernhardson: pardon my thickness but i do not even see the OOMs on logs [22:06:16] nuria: let me know when you have a sec to pick up that discussion about request logs [22:06:31] urandom: now is good [22:06:38] cool [22:06:41] urandom: you need los for plays [22:06:48] Re-plays [22:06:52] sorry, right? [22:07:00] yeah [22:07:21] ok, how do the urls that you are interested on look? [22:07:27] do you know? [22:08:10] they'd be any url going to services restbase, the lvs address i guess [22:09:05] * urandom is looking to make sure what that is [22:09:07] I'm pretty sure the logs have /api/rest_v1/ [22:09:51] yeah, that might make it easier [22:10:07] urandom: logs have /api/rest_v1/page as gwicke said but that is true of different restbase services [22:10:23] nuria: yup [22:11:01] that's all of the REST API [22:11:35] which would be a good start [22:11:45] nuria: or everything to restbase.svc.$dc.wmnet, if that is easier [22:11:52] can always filter further per entry point [22:13:03] urandom: we see the outside domain not the internal dns i think, let me look for asec and i will pull out a sample select you can modify and rerun [22:13:10] urandom: do you have access to 1002? [22:13:35] nuria: is that stat1002? [22:14:45] nuria: or kafka1002? [22:14:50] i have access to the latter [22:14:54] urandom: stat1002 [22:15:00] i don't have access there [22:15:05] urandom: let me know if you can ssh [22:15:18] i can't ssh to stat1002 [22:15:20] nuria: well, i suppose i'm doing a bit of an assumption the reducers are failing due to OOM, from my side all i see is that the application manager kills the reducer [22:15:31] why it was killed is only in the application manager logs, which i can't see [22:15:54] ebernhardson: where do you see teh app manager killing reducers in teh syslog? [22:15:55] but i was going to ask you if you thought it was reasonable to read this directly out kafka for the replays, is that where you were going? [22:15:56] *the [22:16:46] nuria: sec lemme dig one up from before i increased it to 8x16G [22:17:13] urandom: I think that hive would be a lot faster. 
query is done [22:17:38] urandom: kafka you are filtering the whole stream but you just want a snapshot right? [22:17:56] a snapshot would be a great start, yeah [22:18:05] we might need to refresh that from time to time [22:18:13] urandom: ok, here it is: [22:18:19] but a relatively recent snapshot would work [22:18:29] urandom: [22:18:35] https://www.irccloud.com/pastebin/nwKl010p/ [22:18:58] nuria: oh, i do not have access to stat1002 [22:19:14] urandom: ah , see that is a problem, try 1004 [22:19:29] nope [22:19:55] is this a matter of limiting where the data is? [22:19:58] urandom: ok, do file for access cause that is going to be handy [22:20:12] urandom: yes, data needs to stay on prod network [22:20:42] urandom: but you have access to other machines, let me zip it and move it there. So you know 1 hour of traffic is 1G of urls unzipped [22:20:50] urandom: to rest base only [22:21:04] nuria: great [22:21:13] urandom: most of what i see is render requests though: [22:21:26] urandom: wikimedia.org /api/rest_v1/media/math/render/svg/0d7a6432c4f00809d751c3b7bbc04c095ec41f7b [22:21:49] ok [22:22:28] yeah, the more like the workload we see on the production environment, the more confidence this will provide [22:23:12] the mix of varnish requests vs. RB requests is quite different however [22:23:24] the vast majority of those render requests never reach RB, for example [22:23:32] urandom: math requests make 50% of request looks like [22:23:43] 10Analytics, 10Analytics-EventLogging, 06Performance-Team: Stop using global eventlogging install on hafnium (and any other eventlogging lib user) - https://phabricator.wikimedia.org/T131977#2878547 (10Gilles) p:05Normal>03Low [22:24:34] gwicke: oh, i see [22:24:38] but, focusing on specific end points that have lowish hit rates should be quite realistic [22:25:14] i was hoping there would be some way of distinguishing a cache hit from the log entries [22:25:15] /api/rest_v1/page/html/*, for instance [22:25:40] urandom: there is but only if you know how varnish works real well [22:26:25] urandom: cache status is passed along [22:26:26] IIRC the logging happens in the frontend, so there is no way to really know [22:26:42] gwicke: no, there is [22:27:02] gwicke: you read the status of cache as passed on via varnishkafka [22:27:13] including second & third layer hits? [22:27:37] gwicke: yes [22:28:17] okay, that's handy [22:28:26] I guess something like x-cache is logged then? 
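On the hit/miss question just above (nuria's answer with the cache_status task and the VCL change follows): webrequest records carry the Varnish cache status via varnishkafka, so a replay snapshot can be split into hits and misses. A rough classification sketch is below; it assumes an X-Cache style value like "cp1065 miss, cp3040 hit/7", and the exact grammar should be checked against the current VCL rather than taken from here.

```scala
// Sketch: classify a webrequest X-Cache / cache-status value as hit or miss.
// Assumes entries like "cp1065 miss, cp3040 hit/7, cp3030 pass", where any
// layer reporting "hit" means the response came from cache; verify the exact
// grammar against the current Varnish VCL before relying on this.
object CacheStatus {
  def isHit(xCache: String): Boolean =
    Option(xCache).exists(_.toLowerCase.split(",").exists(_.contains("hit")))

  def label(xCache: String): String = if (isHit(xCache)) "hit" else "miss/pass"
}
```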
[22:29:06] gwicke: cache_status header info https://phabricator.wikimedia.org/T142410 cc urandom [22:29:10] (03PS2) 10MaxSem: Add another schema version [analytics/discovery-stats] - 10https://gerrit.wikimedia.org/r/325719 (https://phabricator.wikimedia.org/T152513) [22:30:27] (03CR) 10Yurik: [V: 032 C: 032] Add another schema version [analytics/discovery-stats] - 10https://gerrit.wikimedia.org/r/325719 (https://phabricator.wikimedia.org/T152513) (owner: 10MaxSem) [22:30:34] yah, that's based on x-cache [22:30:47] https://gerrit.wikimedia.org/r/#/c/303578/2/modules/varnish/templates/vcl/wikimedia-frontend.vcl.erb [22:30:52] urandom: wait, we alredy have a public dataset you can use [22:31:45] urandom: i just remembered, you just would need to filter anything taht is not restbase, we recently worked with some folks that needed a huge dataset to improve caching on their models 9used for jvm and others) [22:31:56] nuria: well, i can't find them :S Yesterday the 'Task diagnostic log' would say killed by application manager (never in syslog though), today those seem to be empty everywhere. But i do still have plenty of examples of hive trying the same reducer 4x and it failing each time. Other than the task diagnostic log i didn't find anything interesting in them yesterday [22:32:14] urandom: it is loads of data but it is going to have everything you need [22:32:35] nuria: sounds good; where can i access that? [22:32:44] urandom: let me remember [22:33:04] urandom: regardless do file for access ok? cause it will come handy, answering questions like these takes 2 minutes [22:33:19] urandom: https://datasets.wikimedia.org/public-datasets/analytics/caching/ [22:33:24] urandom: take a look at readme [22:33:55] urandom: 2 weeks of data from july, wait let me see, maybe resbase is not there [22:34:00] urandom: argh [22:34:37] urandom: no, it no work, cause urls were hashed [22:34:52] nuria: yeah, i was about to ask... [22:34:57] urandom: [22:35:03] https://www.irccloud.com/pastebin/uyDFhDcj/ [22:35:13] this is the select i just did [22:35:42] https://www.irccloud.com/pastebin/zA2KZDfP/ [22:36:37] gwicke: since you probably have access to 1002 can you move data somewhere in prod network where urandom can take a look? [22:37:17] I don't think I have access to stat* machines [22:37:53] nuria: is this what i'm requesting? https://wikitech.wikimedia.org/wiki/Analytics/Data_access#Production_access [22:38:10] urandom: yes [22:38:10] yeah, prompts me for a password [22:38:22] nuria: great [22:42:39] nuria: so should i just request access to stat1002, or the access groups in that table that correspond to 1002, or...? [22:43:37] urandom: acess to 1002 and hive will give access to all tables [22:43:46] urandom: we do not have per table restrictions [22:44:15] nuria: so it's enough to just state in the ticket that i need access to that machine for purposes of using hive? [22:44:39] so miss rate (removing math) is about 1% [22:44:45] cc gwicke [22:45:01] sorry 9% [22:45:15] urandom: yes [22:46:05] urandom: so about 90% cache-hit-ratio w/o math requests [22:46:25] nuria: https://phabricator.wikimedia.org/T153375 [22:47:00] nuria: thank you! 
[22:47:13] that's quite decent [22:47:50] urandom: https://wikitech.wikimedia.org/wiki/Analytics/Cluster/Hive/Queries#Use_wmf [22:48:47] gwicke: ya, doesn't seem bad [22:49:15] this reminds me of https://phabricator.wikimedia.org/T122245 [22:49:56] urandom: ok, you let me know but you have the data there and a way to select more by modifying the query I included on the folder , all is world readable. [22:50:17] nuria: this is great, thanks again [22:51:08] maybe we could set up those stats with a varnish consumer [22:51:23] err, kafka consumer [22:52:24] gwicke: ya, that use case seems better taylored for real time processing + graphana posting of data [22:52:47] gwicke: than by swaping masive amounts of data on cluster as really it is ops data [22:53:29] gwicke: but to be clear you do not need that ticket done to do cache hit ratio analysis on your end [22:54:01] I was just thinking about getting live hit rates per entry point [22:54:35] this would do the splitting, and recording hit vs. miss would be quite easy [23:01:11] gwicke: right , but anything in the cluster now is async ,*almost seems* that is is better done with something that filters the incoming stream for restbase and sends to graphite the metrics which are (endpoint, http-code) and (endpoint, cache_status). not sure , i think that is why we have not done it yet [23:01:41] gwicke:scala can do it just like we do for pageview api but seems inefficient, again, not super sure [23:02:25] changeprop should be able to do this as well [23:02:49] although performance could be an issue, as it would need to look at all logs [23:03:02] is the web log topic well partitioned? [23:03:04] ebernhardson: back to your queries [23:03:41] gwicke: partitioned , wait.. ? what hive/ kafka? [23:03:47] kafka [23:04:20] gwicke: consuming raw from kafka is not easy as you would need to comb tons of data , it should probably be an already refined stream [23:04:42] gwicke: not sure how changeprop is related, these are pageviews we are taking about 200.000 per sec [23:05:29] right, and IIRC last changeprop consumer measurement was about 20k json messages per second and core [23:05:49] a refined stream would certainly be nicer [23:05:55] gwicke: but this are kind of raw thoughts, refining data on hive and posting to graphite would work too, i will put that back on our radar but our current efforts are editing and streams [23:06:06] gwicke: ya, i wouldn't consider otherwise [23:06:35] /api/rest_v1/ is about 5k/s [23:07:15] gwicke: ya, that item was for this quarter but editing data was higher, sorry, that is why it wasn't done [23:07:47] no worries, we can live without it [23:08:12] but it would be great to have, as it would give us a lot more insight in how the API is actually used [23:08:18] gwicke: if you want to get it done sooner rather than later why don't we work with mobrovac or Pchelolo so they can setup the scala jobs? [23:08:38] gwicke: that would be the fastest [23:08:59] ah? what? where? scala jobs? [23:09:06] gwicke: and it is the type of collaboration we do with other teams [23:09:17] Pchelolo: is that a "run away" kind of reaction? [23:09:33] Pchelolo: you being super java man , what do you think of scala? [23:09:50] nuria: I think scala is an overkill to be honest [23:10:01] I mean as a language they just have too much [23:10:03] Pchelolo: in hype? 
FOR SURE [23:10:09] Pchelolo: on that we can gree [23:10:11] *agree [23:10:24] Pchelolo: context: https://phabricator.wikimedia.org/T122245 [23:11:21] Pchelolo gwicke : see: https://github.com/wikimedia/analytics-refinery-source/blob/master/refinery-job/src/main/scala/org/wikimedia/analytics/refinery/job/RESTBaseMetrics.scala [23:12:05] Pchelolo: i think that enhancing that job shoudl be sufficient to do the reporting, let me see [23:12:18] you could of course write in in java too, or python. they all have the same-ish spark apis (although java and scala seem to be ahead of python in functionality) [23:12:25] 10Analytics, 10RESTBase, 06Services: REST API entry point web request statistics at the Varnish level - https://phabricator.wikimedia.org/T122245#2878841 (10GWicke) IRC conversation on the topic: ``` gwicke maybe we could set up those stats with a varnish consumer err, kafka consumer nuria (IRC) gwicke: ya... [23:15:14] wait .. isn't this already reporting for all restbase endpoints.... [23:16:02] we have an aggregate metric [23:16:49] if we could turn that into a kafka topic of rest api log events, then we could take over splitting those [23:17:18] would be quite easy to do in fact, as we have the router code that consumes the API specs [23:17:30] and already does the logging [23:18:51] gwicke: wait gabriel, we do not need any of that if we enhance that bit of code i just sent you [23:20:21] splitting based on specs is not completely trivial, so there might be some value in reusing the existing implementation [23:21:31] gwicke: ya please, simplicity first [23:21:56] although, maybe there is a way to translate URI template to query patterns [23:21:59] gwicke: my idea before was for future world, if you want this done let's use this same idea and report to grphite [23:22:13] things like /{foo}/{+bar} [23:22:38] gwicke: let's not overcomplicate matters eh? [23:23:41] I'm not an expert on hive filters [23:23:50] maybe it can just be translated to a regexp query [23:24:13] using https://en.wikipedia.org/api/rest_v1/?spec [23:25:23] In the perfect world that filter would take the URI of the restbase spec, then convert every endpoint to a regex and then find which one was called for each request [23:26:47] basically this https://github.com/wikimedia/analytics-refinery-source/blob/master/refinery-job/src/main/scala/org/wikimedia/analytics/refinery/job/RESTBaseMetrics.scala#L76-L78 should not give out a single number, but instead a map 'endpoint -> number' [23:26:55] doesn't sound too hard to implement [23:27:30] gwicke, Pchelolo : an udf can do that, they are made for that purpose. It is no different that parsing a user agent. Makes sense? [23:28:12] if regexps are supported, then at least the matching shouldn't be too hard [23:29:24] gwicke: I am seeing already metrics for all restbase services in graphite that this code is reporting (under restbase) [23:29:45] {foo} -> [^/]; {/foo} -> (?:\/[^/]), and {+foo} -> .* [23:31:15] nuria: you mean the rest api metric matching all requests to /api/rest_v1/? [23:31:19] gwicke: but even, why woudl you need to split it at the reporting end? [23:31:26] or the RB reported metrics? 
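To make the spec-driven matching discussed above concrete: take the URI templates from the REST API spec (e.g. /page/html/{title} from the ?spec endpoint), turn them into regexes along the lines of the substitutions listed a few messages up ({foo} as one path segment, {/foo} as an optional segment, {+foo} as the rest of the path), and label each request with the first template that matches. This is purely illustrative and not the restbase router's actual implementation; the example templates are made up.

```scala
import scala.util.matching.Regex

// Sketch: map REST API URI templates to regexes and label request paths with
// the first matching template. The substitutions follow the mapping suggested
// in the discussion above; this is not the restbase router's real code.
object EndpointMatcher {
  def templateToRegex(template: String): Regex = {
    val pattern = template
      .replaceAll("""\{\+[^}]+\}""", ".*")           // {+foo} -> rest of the path
      .replaceAll("""\{/[^}]+\}""", "(?:/[^/]+)?")   // {/foo} -> optional segment
      .replaceAll("""\{[^}]+\}""", "[^/]+")          // {foo}  -> one segment
    ("^" + pattern + "$").r
  }

  /** Label a request path with the first template it matches, if any. */
  def endpointFor(templates: Seq[String], path: String): Option[String] =
    templates.find(t => templateToRegex(t).pattern.matcher(path).matches())
}

// Example with made-up templates:
// EndpointMatcher.endpointFor(
//   Seq("/page/html/{title}", "/page/summary/{title}", "/media/math/render/{format}/{hash}"),
//   "/page/html/Foo")   // => Some("/page/html/{title}")
```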
[23:31:50] gwicke: just report and after you can groupe a the graphana end [23:31:58] RB has fine-grained metrics for all requests it receives [23:32:14] that's only cache misses, though [23:32:20] hence the task [23:33:32] since hit rates are (expected to be) high, we don't have a very realistic picture of actual use [23:35:25] gwicke: no, wait, according to that code things should be reported at : restbase.requests.varnish_requests?... maybe i am missing something here [23:35:53] yes, but it's only the total rate [23:36:08] the blue line in https://grafana-admin.wikimedia.org/dashboard/db/api-summary [23:36:10] first graph [23:36:17] gwicke: ah sorry yes [23:37:05] btw, what are your plans for streaming? [23:37:29] gwicke: let 's work then on reporting totals 200/500 per endpoint that shoudl be start via modyfing that scala script [23:37:40] would it be hard for you to produce a filtered kafka topic containing only /api/rest_v1/ requests? [23:38:26] gwicke: we are not going to do that in the near term, no , we will start on projects such as that one when we have streaming hardware [23:39:06] okay, but longer term that's the direction you want to head in? [23:40:27] gwicke: once we know where we stand with public event streams RCFeed deprecation next quarter| kubernetes+ budget for hardware we will decide, it is going to take at least one more quarter to define the project [23:41:31] so do you think it would make sense to just wait for that? [23:42:06] it seems that it might potentially let us reuse more code & avoid the need to write custom translation / mapping code [23:42:22] also, it would give us realtime graphs [23:42:52] which is useful for operational debugging [23:42:58] gwicke: no, not really [23:44:35] 06Analytics-Kanban, 06Discovery, 06Discovery-Analysis (Current work), 03Interactive-Sprint: Add Maps tile usage counts as a Data Cube in Pivot - https://phabricator.wikimedia.org/T151832#2878921 (10mpopov) Okie dokie, I've modified the query above to detect geohack specifically and wmflabs in general: ```... [23:45:05] gwicke: current solution will not give you realtime graphs but I wouldn't wait months for the other. Not sure about code reuse argument, you would need to write a lot more code for streaming likely . data in kafka is not refined. [23:45:48] we don't need a lot of info [23:45:57] method, url, status, hit/miss [23:46:34] maybe response time, if that's available [23:47:22] writing some json with that info to a kafka topic should be less work than writing a spec-based filter engine [23:49:07] 06Analytics-Kanban, 06Discovery, 06Discovery-Analysis (Current work), 03Interactive-Sprint: Add Maps tile usage counts as a Data Cube in Pivot - https://phabricator.wikimedia.org/T151832#2878929 (10Nuria) @mpopov : do you know of the syntax with/ as in hive? Seems that this query could benefit from that o... [23:51:49] gwicke: i do not think you need to write a spec_based fikter engine to get your metrics at all , that overcomplicates matters abunch [23:52:47] keeping a manual matcher in sync with a bunch of specs does not sound like a lot of fun [23:53:14] we'd have to update that each time we deploy an API end point [23:53:48] gwicke:well start provinding value in small steps right? 
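To make the "endpoint -> number" idea concrete: the existing RESTBaseMetrics job filters webrequest for /api/rest_v1/ and emits one aggregate counter, so the per-endpoint, per-status variant is mostly a different grouping key. The sketch below uses a deliberately simple prefix-based endpoint label (the simpler matching option rather than the spec matcher) and assumes the wmf.webrequest field names and the text-cache partition; the partition values and the println stand-in for the statsd/graphite reporter are placeholders.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.hive.HiveContext

// Sketch: per-endpoint, per-status-class request counts for /api/rest_v1/,
// as a variant of the single aggregate the current RESTBaseMetrics job emits.
// The endpoint label is just the first two path segments after /api/rest_v1/,
// i.e. the "simpler matching" option, not the spec-driven matcher.
object RestApiEndpointCounts {
  def endpointLabel(uriPath: String): String =
    uriPath.stripPrefix("/api/rest_v1/").split("/").take(2).mkString("/")

  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("RestApiEndpointCounts"))
    val hc = new HiveContext(sc)

    // Placeholder partition; the real job would take year/month/day/hour as args.
    val rows = hc.sql(
      """SELECT uri_path, http_status FROM wmf.webrequest
        |WHERE webrequest_source = 'text'
        |  AND year = 2016 AND month = 12 AND day = 15 AND hour = 0
        |  AND uri_path LIKE '/api/rest_v1/%'""".stripMargin)

    val counts = rows.rdd
      .map { r =>
        val statusClass = Option(r.get(1)).map(_.toString.take(1) + "xx").getOrElse("unknown")
        ((endpointLabel(r.getString(0)), statusClass), 1L)
      }
      .reduceByKey(_ + _)
      .collect()

    counts.sortBy(-_._2).foreach { case ((endpoint, status), n) =>
      println(s"$endpoint\t$status\t$n")  // stand-in for the statsd/graphite reporter
    }
  }
}
```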
[23:55:09] gwicke: you can add 200/500 requests per endpoint, and it is not as if endpoint requests cannot be grouped [23:55:45] gwicke: with just a simpler matching than the detailed one that would need the spec [23:56:20] gwicke: that might be sufficient for an initial version and will provide loads of value. [23:59:01] maybe -- as I said, precise, spec-driven logging given a kafka topic should be relatively simple & little effort [23:59:07] ebernhardson: let me know when you have had time to test the stemmerUDF as it is now. are you still having problems with queries? [23:59:15] 06Analytics-Kanban, 06Discovery, 06Discovery-Analysis (Current work), 03Interactive-Sprint: Add Maps tile usage counts as a Data Cube in Pivot - https://phabricator.wikimedia.org/T151832#2878956 (10Yurik) Yes yes please :) [23:59:37] so imho it would be interesting to figure out if producing such a topic for requests matching /api/rest_v1/ would be possible [23:59:51] with limited effort
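Closing thought made concrete: a filtered topic like the one gwicke describes is essentially a small consume-filter-produce loop, reading the raw webrequest JSON, keeping only /api/rest_v1/ requests and republishing a slim record (method, uri, status, hit/miss, response time). A minimal sketch with the plain Kafka 0.9+ Java clients follows; the broker address, topic names, group id and the idea that a substring check is enough as a pre-filter are all assumptions to verify against the real varnishkafka topics.

```scala
import java.util.Properties
import scala.collection.JavaConverters._
import org.apache.kafka.clients.consumer.KafkaConsumer
import org.apache.kafka.clients.producer.{KafkaProducer, ProducerRecord}

// Sketch: consume raw webrequest JSON, keep only /api/rest_v1/ requests, and
// produce them to a dedicated topic. Broker, topic names and group id are
// placeholders; real code would parse the JSON and re-emit only the fields
// needed for replay / stats (method, uri, status, cache status, timing).
object RestV1TopicFilter {
  def main(args: Array[String]): Unit = {
    val props = new Properties()
    props.put("bootstrap.servers", "localhost:9092")  // placeholder broker
    props.put("group.id", "restv1-filter-sketch")
    props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer")
    props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer")
    props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer")
    props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer")

    val consumer = new KafkaConsumer[String, String](props)
    val producer = new KafkaProducer[String, String](props)
    consumer.subscribe(java.util.Arrays.asList("webrequest_text"))  // placeholder topic

    while (true) {
      for (record <- consumer.poll(1000).asScala) {
        val json = record.value()
        if (json.contains("/api/rest_v1/")) {          // cheap pre-filter only
          producer.send(new ProducerRecord[String, String]("restv1_requests", json))
        }
      }
    }
  }
}
```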