[00:44:02] is hue dead? https://hue.wikimedia.org/metastore/table/wmf/webrequest gives me 503: Request: GET http://hue.wikimedia.org/metastore/table/wmf/webrequest, from 10.64.0.171 via cp1043 cp1043 ([10.64.0.171]:80), Varnish XID 1001200157
[00:44:03] Forwarded for: 65.190.1.133, 10.64.0.171
[00:44:03] Error: 503, Service Unavailable at Tue, 14 Apr 2015 00:42:06 GMT
[00:58:06] well, it works for me
[01:01:45] yurik: you need to be added to the ldap userset hue pulls from i think
[01:03:12] Ironholds: for the session code i still do not get how you calculated geometric means before, did you do it with a numerical approximation in R?
[01:03:39] exp(sum(log(value)))
[01:03:55] well, if(value > 0){exp(sum(log(value)))}
[01:04:18] ack. Missed a bit in sanitising
[01:04:18] * Ironholds sighs at self
[01:04:19] exp(sum(log(x[x > 0]), na.rm = na_rm) / length(x))
[01:04:23] you can't do it value-by-value, unfortunately
[01:04:30] right, you cannot
[01:04:35] nuria, it was working before i think
[01:04:38] yesterday
[01:04:55] yurik: ahhh, wait then it might be domains, let me try
[01:05:13] amusingly, R doesn't actually have the geometric mean built in
[01:05:15] which: WTF.
[01:05:22] Ironholds: so you calculate the geomean of the log
[01:05:24] just got a 504 Gateway Time-out
[01:05:30] for https://hue.wikimedia.org/metastore/table/wmf/webrequest
[01:05:42] yurik: ya, "unavailable" same here
[01:06:08] yurik: but ahem, if that loads the table fully ...
[01:06:13] yurik: does it?
[01:06:13] nuria, indeed
[01:06:19] i really hope it doesn't :))
[01:06:27] yurik: jaja me too!
[01:06:49] well, i mean i guess we can sit back and watch some poor server getting 100TB ...
[01:06:52] into ram...
[01:07:06] oh wait, it's not 2025 yet
[01:07:07] oops
[01:08:01] err
[01:08:11] why would you need 100TB for the geometric mean?
[01:08:18] or are there overlapping threads?
[01:08:22] yep
[01:08:24] :))))
[01:08:32] it works for me, I promise. Want a screenshot? :P
[01:08:39] 100TB in RAM?
[01:08:59] * YuviPanda wonders how big a screenshot that would be
[01:09:27] Analytics, Analytics-Kanban: Onboard Madhu - https://phabricator.wikimedia.org/T92985#1205450 (kevinator) a:madhuvishy
[01:09:40] yurik: hue is kaput
[01:10:00] long live hue!
[01:25:59] Ironholds: even that numerical approximation will not work for this data volume i do not think
[01:27:19] how many values are you looking at?
[01:28:07] well, we will have millions
[01:28:14] and, would it help if I could come up with a way of never having more than...
[01:28:16] lesse.
[01:28:16] at least
[01:28:20] three numbers in memory at a time?
[01:28:32] let me throw up a prototype in a gist
[01:28:37] Ironholds: I think the geo mean is not an appropriate measure in this case
[01:28:57] Ironholds: it is not how much memory you hold at "one" time
[01:29:05] well, the statistical distributions disagree. You can use the quantiles if that would work better, or the median, to avoid the problem with the arithmetic mean, but..
[01:29:06] Ironholds: it's that the number is too large
[01:29:12] ..the arithmetic mean would be bad news bears
[01:29:14] what do you mean?
[01:29:22] Ironholds: you cannot multiply ad infinitum
[01:29:36] oh, the Inf rooting?
[01:29:50] Ironholds: quantiles we already have, that is no problem
[01:30:42] Ironholds: but you cannot multiply a series of numbers forever... makes sense?
[01:30:49] it does but you shouldn't have to.
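A minimal sketch of the log-space approach the conversation lands on, in Scala rather than the R snippets above, and not the actual refinery job: never multiply the raw values, keep only a running sum of logs and a count (the "three numbers in memory at a time" idea), then exponentiate once at the end.

```
object GeoMean {
  // Geometric mean in log space: exp(mean(log(x))). Only the current value,
  // the running log-sum and the count are ever held in memory.
  def geometricMean(values: Iterator[Double]): Double = {
    var logSum = 0.0
    var n = 0L
    for (v <- values if v > 0) { // skip non-positive values, like the R snippet above
      logSum += math.log(v)
      n += 1
    }
    if (n == 0) Double.NaN else math.exp(logSum / n)
  }

  def main(args: Array[String]): Unit = {
    // Values this large would overflow a double if multiplied directly
    // (their product is 4e1200), but their logs stay tiny.
    println(geometricMean(Iterator(1e300, 1e300, 1e300, 4e300))) // ~1.414e300
  }
}
```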
[01:31:21] you need to hold the result of your multiplication and that can be a number too large to hold
[01:31:40] that is, over the max_int limit?
[01:31:47] far far over
[01:31:57] huh. Otherwise I'd just suggest iterating through with the sum/length step, but..crap.
[01:32:16] and you can't use scientific notation to reduce the size until the final calculation?
[01:32:24] * Ironholds has no idea about Java's numeric types
[01:32:41] Ironholds: well, no matter the language right?
[01:32:55] if you multiply enough numbers you will get there in any language
[01:33:28] Well, yes
[01:33:38] but...how much data are you running across to get 2147483647 as a value?
[01:34:10] at least in R you get 9007199254740992 with double-precision.
[01:34:30] I can't see how we'd reach that threshold but if we're going to, we're going to; the quantiles or median are probably better for that, then.
[01:34:55] like, the point is to avoid the biasing the arithmetic mean introduces on log-scaled sets, and the median does that, or should. halfak thoughts if you're around?
[01:35:55] we have percentile50, which is not an aritmetic mean
[01:36:01] *arithmetic
[01:36:22] I will comment about this on the phab item
[01:36:23] indeed, it's the median, right?
[01:36:28] so yeah, should be totally fine
[01:38:06] Analytics-Cluster, Analytics-Kanban, Patch-For-Review: Mobile PMs has reports on session-related metrics from Wikipedia Apps - https://phabricator.wikimedia.org/T86535#1205502 (Nuria) I am not sure how the geometric calculation was done before as it seems like it's too big of a number and needs a numer...
[01:38:17] Ironholds: k
[01:40:31] nuria, thank you for working so late on this :)
[01:40:51] Ironholds: no merit on my part, cause i took the mandatory Spanish 2-hour lunch break!
[01:41:07] jajaja, if we're localising :D
[01:41:21] I spent my morning being scared by startup people
[01:41:46] RStudio invited me to chat to them about the ecosystem. The experience of the ColdFusion inventor bringing me drinks is...I don't understand people who get used to being around, like, TimBL
[01:42:59] jaja
[01:50:20] :(( i tried to get just one value out of hive with "select record_version from wmf.webrequest where ... limit 1", and it takes forever :(
[01:50:38] i was hoping to use it in scripts to quickly decide which query to run
[01:50:51] depending on the record version
[01:51:56] what was the "WHERE"?
[02:01:40] Ironholds, all cluster vars as constants
[02:01:44] *partition
[02:02:29] year, month, day, hour, webrequest_source?
[02:40:17] Ironholds, correct
[02:40:28] btw, is it normal for a job to use 451GB of RAM?!?
[02:40:52] ...no?
[02:41:04] I mean, it's loading the entire set you asked for into memory
[02:41:14] is this the same set of jobs that were using up the vast majority of the cluster memory last time?
[02:41:14] want to take a look at it?
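For the record, the limits being juggled in the thread are different things; pasted into a Scala REPL:

```
println(Int.MaxValue)              // 2147483647: the "max_int" figure quoted above
println(math.pow(2, 53).toLong)    // 9007199254740992: largest integer a double represents exactly
println(Double.MaxValue)           // ~1.7976931348623157E308: where a running product truly overflows
println(1e308 * 10)                // Infinity: multiplying raw values blows up here...
println(math.log(1e308) * 1000000) // ...but a million such factors keep the log-sum around 7.09e8
```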
it's another essential one :)
[02:41:22] i think so
[02:41:51] job_1424966181866_84802 and job_1424966181866_84815
[02:42:01] 200GB & 450GB
[02:42:41] 205824M and 6144M, according to hadoop job -list
[02:43:27] i think me looking at them scares them -- take a look at the graph - http://ganglia.wikimedia.org/latest/graph.php?r=hour&z=xlarge&hreg[]=analytics1012.eqiad.wmnet|analytics1018.eqiad.wmnet|analytics1021.eqiad.wmnet|analytics1022.eqiad.wmnet&mreg[]=kafka.server.BrokerTopicMetrics.%2B-BytesOutPerSec.OneMinuteRate&gtype=stack&title=kafka.server.BrokerTopicMetrics.%2B-BytesOutPerSec.OneMinuteRate&aggregate=1
[02:43:38] everything fun dead for a while
[02:44:31] this is perhaps something you should bring up with the analytics engineers during work hours, if you're worried about it
[02:44:36] job_1424966181866_84815 RUNNING 1428978513989 hdfs root.essential NORMAL 147 1 451584M 49152M 500736M http://analytics1001.eqiad.wmnet:8088/proxy/application_1424966181866_84815/
[02:45:13] work hours? i know not such thing. any hour is a work hour... as long as you enjoy it ))
[02:45:31] for *their* work hours, yurik :)
[02:45:33] not yours :)
[02:47:15] indeed
[02:47:19] because that way we don't, say
[02:47:29] do anything particularly damaging because we forgot it was a sunday afternoon for everyone else ;p
[02:47:37] *coughs discretely*
[02:51:02] Ironholds: at least you aren’t coughing continuously
[02:51:16] groooan
[02:51:26] :D
[02:56:12] LOLOLOL
[02:56:32] * yurik 's goal in life is to make ppl caugh
[02:56:43] coughing is good - clears up your lungs
[02:56:53] yurik: http://www.urbandictionary.com/define.php?term=Caugh
[02:56:58] > The sound a gopher makes when it simultaneously coughs and laughs while surrounded by future legal scholars.
[02:56:59] besides, i monitor scripts, it's not my fault they fail when ppl are not around )
[02:57:23] also http://www.urbandictionary.com/define.php?term=Caugh&defid=2282021 :D
[02:58:06] YuviPanda, the bottomless wisdom from you has enlightened me greatly :-P
[02:58:15] likewise, yurik :D
[02:58:16] o/
[02:58:46] now if we could increase the analytics cluster with some spare machines... :D
[02:59:13] * yurik wonders if there are any spares lying around that we can use for "hot plug"...
[02:59:28] e.g. - high load, the spare reformats itself and helps out :)
[02:59:38] (reformats = takes on a role)
[03:00:01] something like s3 cloud apparently does with its dynamic scaling
[03:00:22] we could get sooo many production machines from php to analytics in no time
[03:39:29] Analytics-EventLogging: agent_type field does not work for anything except last few hours - https://phabricator.wikimedia.org/T95806#1205709 (Yurik) @kevinator or @ottomata, is there an easy and *quick* way to check the version? I tried ``` hive -e "select record_version from wmf.webrequest...
[03:45:14] (PS8) Nuria: [WIP] Add Apps session metrics job [analytics/refinery/source] - https://gerrit.wikimedia.org/r/199935 (https://phabricator.wikimedia.org/T86535) (owner: Mforns)
[03:47:39] Analytics-Cluster, Analytics-Kanban: Report pageviews to the annual report - https://phabricator.wikimedia.org/T95573#1205724 (Nuria) @Yurik: these are not wiki pageviews, rather it is a totally different website: https://annual.wikimedia.org/2014/ whose data is not on refined tables (besides being a super-s...
[04:14:17] YUVIPANDA, don't raise your font at me!
[04:14:26] TOO LATE
[04:14:35] :)
[04:14:36] what's up with the name?
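As an aside on the "451GB" scare: per the hadoop job -list row above, that figure is the aggregate allocation across containers, not one JVM's heap. A quick sanity check (reading 147 as the used-container count is an assumption):

```
// 147 used containers sharing 451584M of allocated memory:
val containers = 147
val totalMb = 451584
println(totalMb / containers) // 3072, i.e. ordinary 3GB YARN containers
```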
[04:15:33] joke in another channel :)
[08:21:39] Analytics-Tech-community-metrics, ECT-April-2015: KPI pages in korma need horizontal margins - https://phabricator.wikimedia.org/T88670#1205914 (Qgil) The dialogs still can be cut sometimes. This is a screenshot of today. {F112669}
[08:38:17] Analytics-Tech-community-metrics, ECT-April-2015: Instructions to update user data in korma - https://phabricator.wikimedia.org/T88277#1205950 (Qgil) Thanks for the technical documentation. I have added [[ https://www.mediawiki.org/w/index.php?title=Community_metrics&diff=1534296&oldid=1534285 | instruction...
[08:40:30] Analytics-Tech-community-metrics, ECT-April-2015: KPI pages in korma need horizontal margins - https://phabricator.wikimedia.org/T88670#1205960 (Lcanasdiaz) Quim, I see the old margin + JS behaviour in the screenshot you attached. Could you please clean your cache and try again? You should see a wider marg...
[08:41:57] Analytics-Tech-community-metrics, ECT-April-2015: KPI pages in korma need horizontal margins - https://phabricator.wikimedia.org/T88670#1205961 (Qgil) Open>Resolved Sorry, this was a problem with my cache. Looks good now. This is Resolved Closed. Thank you!
[08:50:15] Analytics-Tech-community-metrics, ECT-April-2015: Instructions to update user data in korma - https://phabricator.wikimedia.org/T88277#1205978 (Dicortazar) That's a good point, thanks for the addition! I'd say that documentation regarding the user identities and affiliations management is good enough :).
[08:54:01] Analytics-Tech-community-metrics, ECT-April-2015: Instructions to update user data in korma - https://phabricator.wikimedia.org/T88277#1205986 (Dicortazar) Open>Resolved
[09:14:28] Analytics-Tech-community-metrics: "Who contributes code" page metrics are not updating - https://phabricator.wikimedia.org/T95166#1206014 (Dicortazar) The dashboard was successfully migrated. So we're now in a SortingHat version of the dashboard. The Gerrit part is now the one updated together with some oth...
[13:15:23] Analytics-EventLogging: agent_type field does not work for anything except last few hours - https://phabricator.wikimedia.org/T95806#1206373 (Ottomata) You should just use it (or the date) in a conditional. Make sure either the partition date > 2015-04-10, or that the record_version = "0.0.3". Otherwise, d...
[13:28:22] Analytics-Cluster: Installing Spark 1.3 on Cluster - https://phabricator.wikimedia.org/T95970#1206389 (Ottomata) Hey! Yeah, 1.3 is way better. It is a pain to officially install, it will be much easier to wait for a new release of CDH that includes this. However, it is very possible for you to run 1.3 on...
[13:35:55] Analytics-Cluster: Installing Spark 1.3 on Cluster - https://phabricator.wikimedia.org/T95970#1206394 (Ottomata) Open>Resolved a:Ottomata Actually, that was way easier than I remember it. I just downloaded the .tgz, set HADOOP_CONF_DIR in conf/spark-env.sh, and there we go. cd /home/otto/spark-1...
[14:21:35] Analytics-Kanban, Analytics-Visualization: Build Multi-tennant Dashiki (host different layouts) - https://phabricator.wikimedia.org/T88372#1206475 (Nuria) Please remember this also involves build changes as well as code changes.
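A sketch of the conditional Ottomata suggests in T95806 above, instead of the slow LIMIT 1 probe: branch on the partition date, since per the comment anything after 2015-04-10 carries record_version "0.0.3". The agent_type filter value and the dates are illustrative only.

```
import java.time.LocalDate // needs Java 8; substitute Joda-Time on older JVMs

val cutover = LocalDate.of(2015, 4, 10) // cutover date from the phab comment
val day = LocalDate.of(2015, 4, 13)     // the partition the script is about to query

val base = "SELECT COUNT(1) FROM wmf.webrequest WHERE year=2015 AND month=4 AND day=13"
val sql =
  if (day.isAfter(cutover)) base + " AND agent_type = 'user'" // field usable from 0.0.3 on
  else base                                                   // older partitions: skip agent_type
```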
[14:38:19] milimetric, mforns : master stats for EL: https://tendril.wikimedia.org/host/view/db1046.eqiad.wmnet/3306
[14:38:49] milimetric, mforns : i bet you are reading from a slave, let me see if i can find it
[14:38:56] nuria, ok
[14:40:21] mforns, milimetric: volume wise, when I backfilled, the db sustained 30,000 inserts per sec
[15:13:33] joal, you've used impala before?
[15:13:48] set it up at Fotolia with an ops
[15:13:49] ah pssh, we can talk in ops/analytics checkpoint
[15:14:17] yup
[15:15:04] arrf, Kevin wants me to stay
[15:15:08] later ;)
[15:27:09] hey nuria, when thinking about how to output the app session data
[15:27:16] try seeing if you can insert using HiveContext
[15:27:22] maybe you can just insert into a hive table
[15:27:23] ?
[15:27:27] that is text formatted?
[15:27:33] dunno, haven't tried it
[15:28:34] nuria: maybe we need 1.3?
[15:28:35] https://spark.apache.org/docs/latest/sql-programming-guide.html#saving-to-persistent-tables
[15:29:58] ottomata: Won't attend with jgage ...
[15:30:15] dawwwww but i wanted to talk to you about eventlogging!
[15:30:23] we talk later joal?
[15:30:36] for sure ;)
[15:30:49] ottomata: I will look into that, let me launch job with different serialization
[15:33:47] mforns: you can sync up with springle about slave issues, seems to me that we are reading from the EL slave if data is not there and it magically appears a couple hours after
[15:34:16] nuria, ok
[15:34:26] mforns: given the tz i can help with that and try to ping springle before i am going to bed
[15:34:51] nuria, do you know which hours he is around?
[15:35:44] mforns: puff no...but i would send him an e-mail cc-ing analytics with the problems you are seeing. seems that there are two:
[15:36:20] mforns: 1) small intervals of complete data loss across all tables (this points to replication issues, seems to me)
[15:36:28] mforns: 2) delayed reads
[15:36:40] nuria, ok
[15:36:43] will do
[15:38:39] mforns: if you give him timestamps that will be best, check outage report and sal log because when andrew and i deployed/upgraded EL there are intervals w/o data (box switch), disk full, etc., for example, you would expect that when disk was full nothing worked: https://wikitech.wikimedia.org/wiki/Incident_documentation/20150406-EventLogging
[15:38:57] all incidents of that type should be logged
[15:40:16] nuria, where is sal?
[15:40:34] madhuvishy: I am on PST tz (normally i take a big break mid day) please do let me know if you need anything
[15:40:38] mforns: https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:40:45] thx
[15:58:43] Analytics-Cluster: Make spark work well with webrequest Parquet data - https://phabricator.wikimedia.org/T93105#1206854 (Ottomata) Open>Resolved We can use SQL context for this, no prob!
[15:59:03] Analytics-Cluster: Setting up Ipython with Spark - https://phabricator.wikimedia.org/T92743#1206857 (Ottomata) Ellery, you've got this working, right? Can we resolve this task?
[16:00:26] Analytics-Cluster: Epic: AnalyticsEng has kafkatee running in lieu of varnishncsa and udp2log - https://phabricator.wikimedia.org/T70139#1206862 (Ottomata)
[16:00:29] Analytics-Cluster: Story: Vet the kafkatee generated files - https://phabricator.wikimedia.org/T70248#1206859 (Ottomata) Open>Resolved a:Ottomata Data was vetted long ago.
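Following the HiveContext idea above, a sketch of Spark 1.3's "saving to persistent tables" from the linked guide, run from a spark-shell where sc is predefined; the table and column names are made up:

```
import org.apache.spark.sql.hive.HiveContext

val hc = new HiveContext(sc)
import hc.implicits._

val sessions = sc.parallelize(Seq(("uuid1", 3, 542L), ("uuid2", 1, 60L)))
  .toDF("uuid", "pageviews", "session_seconds")

// Registers the table in the Hive metastore so the hive CLI can read it back.
// Note it is written in Spark's default format, not plain text; for an
// existing text-formatted table, an INSERT through SQL is closer:
sessions.saveAsTable("tmp_app_session_metrics")
// hc.sql("INSERT INTO TABLE some_text_table SELECT * FROM tmp_app_session_metrics")
```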
[16:00:44] Analytics-Cluster: Epic: AnalyticsEng has kafkatee running in lieu of varnishncsa and udp2log - https://phabricator.wikimedia.org/T70139#737578 (Ottomata)
[16:00:46] Analytics-Cluster: Story: AnalyticsEng generates new datafiles using kafkatee - https://phabricator.wikimedia.org/T70247#1206864 (Ottomata) Open>Invalid a:Ottomata We are using Hive to generate this data.
[16:01:28] Analytics-Cluster: Story: Transparently switch from udp2log datafiles over to kafkatee generated datafiles - https://phabricator.wikimedia.org/T70250#1206870 (Ottomata) Open>declined a:Ottomata We are using Hive to generate these files.
[16:01:29] Analytics-Cluster: Epic: AnalyticsEng has kafkatee running in lieu of varnishncsa and udp2log - https://phabricator.wikimedia.org/T70139#737578 (Ottomata)
[16:01:49] Analytics-Cluster: Epic: AnalyticsEng has kafkatee running in lieu of varnishncsa and udp2log - https://phabricator.wikimedia.org/T70139#737578 (Ottomata)
[16:01:51] Analytics-Cluster: Story: AnalyticsEng has kafkatee on analytics1003 - https://phabricator.wikimedia.org/T70246#1206877 (Ottomata) Open>declined a:Ottomata We are using Hive to do this.
[16:03:06] ottomata: yt ?
[16:03:39] yuo
[16:03:40] yup
[16:03:51] batcave ?
[16:04:27] there.
[16:11:16] ottomata, joal : TA_TATACHANNNNNN
[16:11:24] hu ?
[16:11:25] job without kryo took 14 mins
[16:11:27] https://yarn.wikimedia.org/cluster/app/application_1424966181866_85708
[16:11:35] hopefully not a fluke
[16:11:49] i am going to put the kryo serializer again and re-test
[16:33:29] ottomata, hi, thanks for all the good answers! And i have another one :) Any thoughts about the junk in URI_HOST field? https://phabricator.wikimedia.org/T95836
[16:35:32] yurik: , naw we can't do much about that, since those are real requests
[16:35:52] we want to capture that stuff
[16:35:55] even if it is dumb
[16:36:04] ottomata, but that makes no sense - how would any DNS server out there give those hosts...
[16:36:12] it's an error of some sort...
[16:36:34] btw, ottomata - i saw that a few times there were non-0 is_pageview counts
[16:37:06] there is a slight chance the is_pageview is broken, right?
[16:37:07] yuri
[16:37:08] curl -i -H 'Host: boogerballs' bits.wikimedia.org
[16:37:09] kevinator_: you there ?
[16:37:18] ?
[16:37:56] i did the count(is_pageview) - group by uri_host, and some weird host names had > 0
[16:38:06] which ones?
[16:38:21] i would have to research, thought to check with you if you knew anything about it
[16:38:32] naw, i also don't know much about how the actual definition works
[16:38:45] but, it is possible that it is wrong in some cases
[16:38:58] if the weird urls were just casing, it probably is ok with that
[16:39:34] is it possible to at least do the "lower()" on them?
[16:39:45] ottomata, nuria : for pageviews aggregation, I should only consider webrequest_source to be in [text, mobile] correct ?
[16:42:23] joal: mmm... "uploads"?
[16:42:31] is_pageview will tell you
[16:42:35] k
[16:43:34] just text and mobile
[16:43:42] nuria, why would you even see anything in uploads?
[16:43:44] I don't actually think webrequest_source is run over upload
[16:43:55] (if it is, that's unnecessary memory use. Stop that! :p)
[16:44:00] *is_pageview is run over
[16:44:07] that's what i thought
[16:44:11] Ironholds: well, easy then
[16:44:28] is_pageview is run over the full table I think
[16:44:35] that's....also unnecessary
[16:44:39] yeah.
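For the Kryo A/B test above, this is roughly what enabling it looks like, including the class registration and buffer questions that come up later in the evening; Visit is a hypothetical stand-in for whatever the session job actually shuffles:

```
import org.apache.spark.{SparkConf, SparkContext}

case class Visit(uuid: String, timestamps: Array[Long])

val conf = new SparkConf()
  .setAppName("AppSessionMetrics")
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  // Default is 2 (MB); only raise it if a single serialized object exceeds that.
  .set("spark.kryoserializer.buffer.mb", "16")
  // Unregistered classes still work, but ship the full class name with every object.
  .registerKryoClasses(Array(classOf[Visit], classOf[Array[Long]]))

val sc = new SparkContext(conf)
```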
[16:44:52] you can save some time and memory (a lot of it) by eliminating bits and upload and misc
[16:44:55] Ironholds, nuria, please set uri_host to "lower(uri_host)" everywhere - makes everyone's life soo much easier ))
[16:45:04] sorry, meant ottomata ^
[16:45:15] and remove the port
[16:45:38] that will get rid of the most weird cases that should be considered in most stats
[16:45:56] i'm still not sure why it shows 10.x.x.x as coming from GB :)
[16:47:10] you're confused as to why it doesn't know where your server is?
[16:47:32] it's your server.
[16:47:41] If MaxMind knew it was in Virginia I'd be worried.
[16:48:16] Analytics: Junk in wmf.webrequest.uri_host field - https://phabricator.wikimedia.org/T95836#1207012 (Yurik) At the very least, please make uri_host lower case, and remove the redundant port :80 string at the end, as it only obfuscates some results. Thanks!
[16:48:41] Ironholds, i'm talking about the webrequests data :))
[16:49:01] uri_host there is often 10.x.x.x, and the geo tag is in GB
[16:49:33] uri_host is 10.x.x.x?
[16:49:46] what's the IP and XFF?
[16:50:42] don't know - i was looking at the aggregates, grouped by uri_host
[16:50:49] and by country
[16:51:00] okay
[16:51:04] there are tons of tem
[16:51:06] them
[16:51:08] yeah, I'm sure
[16:51:22] but given that the geolocation is run over ip and xff, what the uri_host says is kind of irrelevant.
[16:51:36] and I know you're talking about the webrequests data, MaxMind provides our geolocation software.
[16:51:59] ah, sorry, misread :)
[16:52:00] funny
[16:52:10] * Ironholds shrugs
[16:52:17] people submit dummy requests or spoof requests all the time.
[16:52:32] because (1) we're a top-10 and (2) see the Greater Internet Fuckwad theorem.
[16:52:41] i doubt this is the case though - i think it might be some of our servers doing it
[16:53:01] passing in an invalid XFF? Oh, probably.
[16:53:23] Operations has a long and consistent track record of doing mind-bogglingly selfish things when it comes to monitoring software and how it interacts with the logs.
[16:53:35] see also "let's have Python ping $ARTICLE 10m times a day and not tell anyone"
[16:53:41] not even sure it is an invalid request at first - something else calls it - like internal service/heartbeat, etc
[16:53:47] indeed
[16:54:17] don't we have a token bucket qos?
[16:54:27] or something of that sort?
[16:59:00] Analytics: Make all wmf.webrequest.uri_host lower case, and remove ":80" at the end - https://phabricator.wikimedia.org/T96044#1207046 (Yurik) NEW
[16:59:18] Analytics: Make all wmf.webrequest.uri_host lower case, and remove ":80" at the end - https://phabricator.wikimedia.org/T96044#1207053 (Yurik)
[17:00:14] yurik, I don't know what that means
[17:00:26] Ironholds, which part?
[17:00:33] all I know is that every time I dug into this, ops talked down to me and -2d my patches, and I'm done dealing with them
[17:00:45] sigh
[17:00:47] i hear you
[17:00:59] very disheartening dealing with them on occasion
[17:02:12] Analytics-Cluster: Setting up Ipython with Spark - https://phabricator.wikimedia.org/T92743#1207060 (ellery) Yes, I'm all set.
[17:04:29] joal, ottomata by the looks of it, yes, kryo serialization is not so hot
[17:04:36] hmmm
[17:04:40] sounds weird
[17:04:41] Ironholds, do you think this can be done? https://phabricator.wikimedia.org/T96044
[17:05:02] sure, but I won't be the one to do it.
[17:05:21] (not "I don't see it as valuable": "I am not in any way formally attached to analytics and I accidentally signed up for four papers") [17:05:52] yurik: , i don' think so. [17:05:54] Ironholds: ah, so you are putting in your… four papers. [17:05:54] mayyybe lower [17:05:59] * YuviPanda slowly walks away [17:06:00] but i think that would be dishonest [17:06:03] maybe another field that is normalized [17:06:09] but i wouldn't replace that one [17:06:16] that field is supposed to be the HTTP Host header [17:06:23] ottomata, but we already have the raw requests in case anyone wants to do funky hosts research [17:06:41] whereas the web.requests is for the regular counting [17:06:42] i'd be ok with an additional field, i suppose [17:06:57] and if data is not normalized, we will underount [17:07:00] undercount [17:07:13] why? most of the requests are bogus anyway [17:07:23] just don't count them [17:07:25] as for lower/upper [17:07:31] just call lower() on them whne you count? [17:07:46] that's the probelm - filtering them out requires a complex and well tested/maintained regex [17:07:54] ksorry Kryyo [17:08:08] yurik: are you trying to count using is_pageview? [17:08:23] YuviPanda, I don't get it. [17:08:45] or just raw # of request? [17:08:49] ottomata, no, https://git.wikimedia.org/blob/analytics%2Fzero-sms/HEAD/scripts%2Fcountrycounts.hql [17:09:07] grr, git is slow again [17:09:09] haha [17:09:28] i am doing sum(size), count(is_pageview), count(*) group by uri_host [17:09:29] i just use github [17:09:30] https://github.com/wikimedia/analytics-zero-sms/blob/master/scripts/countrycounts.hql [17:09:32] joal, ottomata , this might be why "If your objects are large, you may also need to increase the spark.kryoserializer.buffer.mb config property. The default is 2, but this value needs to be large enough to hold the largest object you will serialize." [17:09:50] Ahhhh ! [17:09:56] nuria: Riiight [17:10:10] 2mb? [17:10:17] what object is that big? [17:10:29] ottomata: that seems large right? [17:10:44] ja, for what you are doing, i dunno what it has to serialize though [17:10:53] i would think it would be some object at the level of a single rdd record [17:10:56] ottomata: to reshuffle, right? [17:10:58] so, whatever columns you are selecting [17:11:11] ottomata: not selection, as that is distributed [17:11:22] right, but you have them in memory after that [17:11:29] it is computed on each "node" [17:11:35] is that buffer for per executor serialization, or per object? [17:11:44] i guess [17:11:51] per rdd partition or per object? [17:12:04] ottomata: sounds like it is per "execution" [17:12:30] ja but it says 'if your objects are large' [17:12:35] so the largest object it would be the size of the rdd on one partition [17:12:41] so i would think you would increase that if your individual objects were going to be larger than mb [17:12:42] yeah [17:12:50] the full rdd on a partition? [17:12:51] nawwwww [17:12:52] really? [17:12:54] mabye. [17:12:56] ottomata: agreed [17:13:02] i guess why else would they have it set at 2mb though.. [17:13:03] Not the size of rdd [17:13:09] WOW, MS is implementing .NET LLVM compiler! 
[17:13:15] MIT lic
[17:13:19] well, objects understood as objects full of data, not containers
[17:13:33] i would think the serialization would happen on the record level
[17:13:47] objects serialized with kryo is what is inside a rdd
[17:13:53] so, it should be java object overhead + size of data fields in between shuffles
[17:13:53] so if my rdd is 0.1M and i have 10 partitions and i need to reshuffle i have 1M of data
[17:14:24] a row
[17:14:42] nuria ??
[17:14:48] i think it shouldn't have to be much larger than the fields being worked with
[17:14:59] so, 2MB sounds big enough to me
[17:15:32] mmm... unless it's not per field
[17:15:32] let me read the kryo docs
[17:15:36] Except that we extract per id timestamp list
[17:15:45] This can be big
[17:15:46] oh.
[17:15:50] yes.
[17:15:55] joal: no, no, the timestamp list per uuid is small
[17:15:56] It's not per field, but per row
[17:16:01] oh ok
[17:16:07] Always ?
[17:16:34] joal: people do not look at 1000 wikipedia pages
[17:16:39] joal: yes, there are no crawlers here
[17:16:39] joal: this is apps traffic
[17:16:48] I agree for the principle, but, a well hidden robot ?
[17:16:54] joal: again, no crawlers
[17:16:56] hmm
[17:17:03] joal: it's mobile apps data
[17:17:56] I would double check the number though ;)
[17:18:00] joal: i alredy did
[17:18:04] *already
[17:18:41] oki :)
[17:18:49] madhuvishy: helllooooooo
[17:19:15] but even just thinking about use cases: the timestamp list is your list of hits in the app in a month
[17:19:42] that is less than 10,000 as every hit is a page load
[17:20:46] Analytics: Create new normalized uri_host field in refined webrequest table. - https://phabricator.wikimedia.org/T96044#1207115 (Ottomata)
[17:20:46] Analytics: Create new normalized uri_host field in refined webrequest table. - https://phabricator.wikimedia.org/T96044#1207046 (Ottomata) I don't think we should alter the data in this field, but I would be fine with adding a new field that contained normalized uri_host. 'uri_host_normalized'? 'uri_host_n...
[17:20:54] joal: so, not large, but the thing is that you have a lot of datapoints to shuffle through
[17:21:10] Analytics, Analytics-Cluster: Create new normalized uri_host field in refined webrequest table. - https://phabricator.wikimedia.org/T96044#1207046 (Ottomata)
[17:21:31] Yeah, but that should only put pressure on the network nuria ...
[17:21:36] Analytics, Analytics-Cluster: Create new normalized uri_host field in refined webrequest table. - https://phabricator.wikimedia.org/T96044#1207046 (Ottomata) I don't think we should alter the data in this field, but I would be fine with adding a new field that contained normalized uri_host. 'uri_host_no...
[17:22:02] joal: but there is where serialization is taking place, righhhtt??
[17:22:23] yes m'dame
[17:22:30] But serialization is at row level
[17:23:29] joal: mannnn....then why that 2M limit, that seems huge
[17:23:58] It is huge, but imagine we are talking about wikimedia revisions ?
[17:24:08] Not so big then
[17:25:21] joal: ya, true
[17:25:49] joal, ottomata : run test again: with kryo (~30 mins), without it (14 mins)
[17:27:26] ok, then don't use it :)
[17:27:48] sounds not right ;)
[17:28:02] How do you register your kryo classes ?
[17:30:13] nuria: also, how many partitions do you have and how many are you sending data to ?
[17:33:41] ottomata, ok, so how about a new field uri_host_clean (or similar), that only contains a value like en.wikipedia.org, and never anything that is not a WMF domain?
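Until any uri_host_normalized field exists, the query-time cleanup discussed above (lower-casing plus stripping a trailing ":80") is cheap enough to inline; a sketch against wmf.webrequest from a spark-shell where sc is predefined, with example partition values:

```
import org.apache.spark.sql.hive.HiveContext

val hc = new HiveContext(sc)
val byHost = hc.sql("""
  SELECT regexp_replace(lower(uri_host), ':80$', '') AS host,
         COUNT(1) AS requests
  FROM wmf.webrequest
  WHERE webrequest_source = 'text'
    AND year = 2015 AND month = 4 AND day = 13 AND hour = 0
  GROUP BY regexp_replace(lower(uri_host), ':80$', '')
""")
byHost.show()
```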
[17:33:57] why can't we just sanitise the hosts?
[17:35:43] we could, but ottomata prefers not to, even though we do have the original in the raw requests ))
[17:35:56] raw requests are less efficient to query
[17:36:06] Ironholds: because uri_host == HTTP Host header
[17:36:14] it would be dishonest for it not to be!
[17:36:24] but, i'm fine with uri_host_normalized or _clean or something
[17:36:25] well, we already strip things from it
[17:36:32] not in the table data...
[17:36:34] ?
[17:36:37] Analytics, Analytics-Cluster: Create new normalized uri_host field in refined webrequest table. - https://phabricator.wikimedia.org/T96044#1207212 (Yurik) funny dup. I think the new field should only contain "proper" uri hosts, like those controlled by wmf. All else should be blank.
[17:36:40] we remove http:// and https://. We remove www except for wikidata.
[17:36:52] no we don't remove that
[17:36:56] Ironholds, you remove www?
[17:36:58] it just doesn't show up?
[17:36:58] i saw them
[17:36:59] joal: well, a shuffle will be shuffling across all partitions if i understand this right, ....
[17:37:11] joal: now, how many partitions do i have i do not even know
[17:37:12] Correct
[17:37:21] nuria: You can also ask to get more / less partitions
[17:37:23] i think you can already safely remove the :80
[17:37:24] ottomata: Hi!! Just got to office etc.
[17:37:29] Ironholds: unless varnish does something with it that I don't know about
[17:37:39] but, uri_host should be HTTP host header from client
[17:37:54] %h
[17:37:54] Remote host. Defaults to '-' if not known.
[17:38:16] joal: are partitions defined by the context?
[17:38:26] sorry
[17:38:28] wrong field.
[17:38:44] %{Host@uri_host}i
[17:38:45] nuria: partitions are defined by how many blocks need to be read
[17:38:49] %{X}i
[17:38:49] The contents of request header X.
[17:38:59] ottomata, who is the target audience for this table? Debugging should go via logs and raw, and this table seems more suited for research
[17:39:23] yurik: if/when we build a 'research' table that is sanitized, then I think that could be worth doing
[17:39:26] that may or may not be this table
[17:39:35] but, that would be the one that had user_agent cleaned, and IPs removed
[17:39:37] joal: I should be able to see this on the job, let me launch it again
[17:39:40] nuria: would be interesting to check in sparkshell with a cached rdd, using the ui
[17:39:45] yup
[17:39:56] nuria: I go for dinner, will be back ;)
[17:40:03] madhuvishy: hiyaa, made this ticket specially for you: https://phabricator.wikimedia.org/T96053
[17:40:10] joal: but the shell doesn't have the capacity (memory) to load more than 1 bit of data
[17:40:14] research in the sense of internal stats, but yes, a general research table would not have those columns... per column security?
[17:40:19] joal|away: k
[17:42:40] ottomata: Yay thank you
[17:43:25] HAHAHAHhahahahaha
[17:43:32] aaww qchris is not here
[17:43:43] milimetric: qchris left me a joke in one of his many documents
[17:43:46] and didn't tell me about it
[17:43:53] (or if he did I forgot about it)
[17:43:58] Do not email the contents of new-htpasswd-line or paste in IRC. Ask the op to pick up the file from stat1001 by thonself
[17:43:59] "
[17:44:06] thonself!
[17:44:07] bwaaahhaa
[17:46:15] ottomata: i think our handsome qchris volunteer is kind of busy right now...
[17:47:13] Analytics, Ops-Access-Requests, operations: Grant Sati access to geowiki - https://phabricator.wikimedia.org/T95494#1207253 (Ottomata) Sati, this is done.
I need a way to get the password to you securely. Unless you know how to use gpg, probably the simplest thing to do would be for us to hop into a...
[17:47:33] nuria: ohHHHHhhh yes he probably is!
[18:18:46] joal|away, ottomata : so, ahem, from what i can see on spark ui we have 1500 partitions?
[18:18:58] ottomata: according to http://databricks.gitbooks.io/databricks-spark-knowledge-base/content/performance_optimization/how_many_partitions_does_an_rdd_have.html
[18:19:06] and https://yarn.wikimedia.org/proxy/application_1424966181866_85825/stages/
[18:21:52] hmm joal, whoopsie:
[18:21:56] https://issues.apache.org/jira/browse/HIVE-6394
[18:22:05] doesn't work with our current version of hive!
[18:22:12] good thing no one really wants it yet :p
[18:22:26] Analytics-Cluster, operations, ops-eqiad, Patch-For-Review: analytics1020 hardware failure - https://phabricator.wikimedia.org/T95263#1207402 (Cmjohnson) Open>Resolved 2 motherboards later and this is fixed. Resolving
[18:26:06] whoaaa
[18:26:08] spark-sql
[18:26:38] Analytics-Kanban, Analytics-Wikimetrics, Community-Wikimetrics, Patch-For-Review: Story: user wants to be able to re-run a failed report more easily [13 pts] - https://phabricator.wikimedia.org/T88610#1207418 (Capt_Swing) @mforns just checking in: what does 'paused' status mean on Kanban workboard?
[18:26:41] ottomata: oops indeed !
[18:27:00] ottomata, joal : so i am going to remove the kryo serialization then?
[18:27:20] nuria: 1500 partitions is, I think, the number of blocks read
[18:27:40] You would get the same number of mappers from a query in hive
[18:28:19] nuria: you can, but I would like to do some more testing on that aspect later
[18:28:28] ottomata, joal: if the 2mb limit is per row and basic spark classes are already registered with serializer..."seems" that serialization with kryo should work as is
[18:29:23] basic spark classes --> I am not sure
[18:30:08] joal: ah, ok, it says so on their docs
[18:30:17] joal: "Spark automatically includes Kryo serializers for the many commonly-used core Scala classes covered in the AllScalaRegistrar from the Twitter chill library."
[18:30:19] joal i would think they would be, seeing as the config setting nuria is tweaking is a spark one
[18:30:41] ok
[18:30:52] But the fact that it is slower is bizarre
[18:32:37] joal: ya, i also do not understand the partitions abstraction either if it doesn't relate to hdfs nodes
[18:32:54] It's linked to hdfs blocks
[18:34:03] nuria: want to discuss in batcave ?
[18:34:08] sure
[18:34:13] ottomata: want to join?
[18:35:27] hm, no, into reading stuff :)
[18:35:35] you can do it!
[18:37:23] milimetric: folks are trying to make mondrian work with spark sql :p
[18:39:05] whoaa, and nuria, milimetric, mforns, check this out: https://spark.apache.org/docs/1.3.0/api/scala/index.html#org.apache.spark.rdd.JdbcRDD
[18:39:19] you can query mysql from spark and get an RDD
[18:39:33] then you could join them in spark, coOooL
[18:39:48] huh!
[18:40:34] ottomata: but the main problem isn't loading the data from mysql. It's the fact that our current schema forces us to delete useful data
[18:40:37] so no sqoop needed?
[18:40:46] because we just overwrite instead of append
[18:40:48] ehhh?
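A sketch of the JdbcRDD linked above (Spark 1.3), pulling a MySQL table into an RDD so it can be joined against Hive-side data; the host, credentials and bounds are placeholders, and the MySQL JDBC driver has to be on the classpath:

```
import java.sql.{DriverManager, ResultSet}
import org.apache.spark.rdd.JdbcRDD

val pages = new JdbcRDD(
  sc,
  () => DriverManager.getConnection(
    "jdbc:mysql://db-host.example:3306/enwiki", "research", "secret"),
  // JdbcRDD requires exactly two '?' placeholders so it can split the key
  // range across partitions:
  "SELECT page_id, page_namespace FROM page WHERE ? <= page_id AND page_id <= ?",
  lowerBound = 1L, upperBound = 50000000L, numPartitions = 20,
  mapRow = (r: ResultSet) => (r.getLong(1), r.getInt(2))
)
println(pages.count())
```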
[18:41:04] mforns: just a thing, but ja, you could do it that way
[18:41:08] also, it isn't working well
[18:41:12] but spark 1.3 has a spark-sql cli
[18:41:17] that hooks up to hive metastore
[18:41:24] cool :)
[18:41:33] so, you can issue sql statements that query hive tables but are run as spark jobs
[18:42:03] milimetric: don't get this "ottomata: but the main problem isn't loading the data from mysql.  It's the fact that our current schema forces us to delete useful data"
[18:42:09] which schema?
[18:42:50] our mediawiki schema
[18:43:11] like, "update page set page_namespace = 2 where page_id = 1234;"
[18:43:17] that's the problem ^
[18:43:35] what we want is to eventually have:
[18:44:05] "insert into page (valid_from, valid_to, page_id, namespace) values (...)
[18:44:56] we as in data warehouse?
[18:45:08] yeah, we want a data warehouse
[18:45:15] so querying data directly from mysql doesn't help us with that
[18:45:45] ah, all i'm saying is it is coOOOool
[18:46:06] it's cool, i agree. I said "huh!" as in "cool!"
[18:46:08] :)
[18:46:10] haha
[18:46:30] and also wondering: if spark has jdbc connectors and *could* work with mondrian, and *could* work as sql cli
[18:46:35] maybe impala/drill not needed?
[18:47:11] naww, probably still
[18:47:12] dunno
[18:50:57] I guess we could try it. One great thing we get with Hive is instant familiarity for mysql folks
[18:51:30] when Oliver first started looking at Hive, that was a huge plus. And if we're taking data out of analytics-store and putting it in here, I think that's a requirement - not having a totally different tool
[18:51:44] and spark's a little different, you don't quite get to do things like "show tables;" etc.
[18:51:53] * Ironholds rises from the depths
[18:51:58] WHO SPAKETH MY NAME
[18:52:19] oh, Hive? yeah, easy-peasy
[18:52:30] just remember it's strict as all hell, joining between tables is a pain, and how partitions work
[18:52:34] milimetric: you do!
[18:52:36] spark-sql is a SQL CLI
[18:52:42] that uses hive-metastore
[18:52:58] just like hive CLI
[18:53:06] although it is barely working for me atm :)
[18:53:46] oh, well, then if it starts more than barely working, we're gold
[18:53:48] gold!
[19:00:10] yurik: ahem... could you run 1 job at a time so cluster has a little more resources to spare for other jobs?
[19:01:44] yurik: cause i am running now 1 and it is taking 50% longer than it did a couple hours ago
[19:12:11] Analytics-Kanban, Analytics-Wikimetrics, Community-Wikimetrics, Patch-For-Review: Story: user wants to be able to re-run a failed report more easily [13 pts] - https://phabricator.wikimedia.org/T88610#1207515 (Nuria) @Capt_Swing Means we are not working on it at the moment as other higher priority...
[19:31:39] mforns: regarding the mysql consumer, i can see how it could run into a runtime exception, shut down and restart, that would be secs/minutes and events might be dropped
[19:32:05] mforns: it is hard to see how that could happen for an hour if the process was actually running
[19:32:14] nuria, aha, one question, does consumer use buffers? or does zmq use buffers?
[19:32:28] I see
[19:32:49] it consumes a stream from zeromq, the same one as the logs (same code)
[19:33:24] so there is no "buffering" on our code
[19:34:12] but is the zmq stream pub/sub or is it buffered? I mean, does the data wait in zmq to be pulled, or does zmq send it to the consumer?
[19:35:07] nuria, ^
[19:38:27] nuria, today there was a 2-hour hole in the EL db
[19:39:32] nuria, I suspect it must have something to do with buffers, because otherwise, there would be a time hole in the consumer logs also.
[19:39:54] nuria, but the consumer continues to work normally at all times
[19:41:37] nuria, the processor logs do contain the missing data, and it is valid data, so.. I'd say the problem is in zmq or the consumer.
[19:43:01] halfak: Yt ?
[19:43:12] Yeah. What's up joal?
[19:43:20] connected to altiscale :)
[19:43:29] I have a few questions for you
[19:43:35] batcave for 5 minutes ?
[19:53:41] mforns: zmq is further up
[19:53:58] nuria, doesn't the consumer read from zmq?
[19:54:03] mforns: processor is consuming from zmq
[19:54:21] mforns: everything consumes from zmq upstream
[19:54:44] nuria, so where does the consumer read from?
[19:55:14] mforns: there are several consumers, mysql, logs
[19:55:34] mforns: that read from a stream of validated events
[19:56:02] mforns: if events are coming in and are persisted into logs, it means the incoming stream has them
[19:56:17] mforns: which means that mysql consumer saw them too
[19:56:56] nuria, I see
[19:56:56] mforns: if there are no errors and time interval is very well defined for a bunch of tables while processes on vanadium were running
[19:57:03] mforns: it could be the network
[19:57:34] mforns: checking whether processes were running can be done by checking /upstart/ logs
[19:57:46] mforns: sounds like you already have done that
[19:57:48] halfak: we'll talk tomorrow :)
[19:58:28] Woops. Sorry. Got distracted
[19:58:33] * halfak reads scrollback
[19:58:36] ottomata: are you trying something with the cluster ?
[19:58:40] no
[19:58:45] nuria, yes, upstart/consumer logs have records of writes at the time of the holes
[19:58:48] Sorry joal. tomorrow it is.
[19:58:49] wasup?
[19:58:50] halfak: np ;)
[19:59:03] cluster seems really behind
[19:59:09] ottomata: --^
[19:59:16] mforns: absolutely "zero" events for a time period in which the consumer was running w/o errors or a magic restart sounds like inserts are going into /dev/null due to network/db issues
[19:59:32] nuria, aha
[19:59:41] hm
[20:00:17] but nuria, wouldn't the consumer logs indicate a write error, instead of success?
[20:00:22] hm, ja actually yurik is running a lot of jobs, but still i don't think he should be delaying essential jobs
[20:00:53] bob is running a distcp
[20:00:55] dunno what that is
[20:00:58] sup?
[20:01:00] mforns: if it cannot reach db yes, you will see an error, if it is a silent fail on the other end, no,
[20:01:13] each one of my jobs is now running 6 hrs or so (
[20:01:31] mforns: breaking for lunch will be back in a couple hours (mandatory spanish lunch)
[20:01:35] i think they are low mem though, right?
[20:01:43] nuria, :] buen provecho!
[20:01:44] siesta!
[20:02:56] yurik: dunno, they have lots of mappers and reducers, dunno about mem. do you need to run 4 at once?
[20:03:26] 5 at once?
[20:04:46] whoa kevinator_
[20:04:49] what's this?
[20:04:50] https://yarn.wikimedia.org/proxy/application_1424966181866_85641/mapreduce/job/job_1424966181866_85641
[20:12:19] joal, is that bad? my jobs tend to run a very long time, any optimization suggestions are welcome -- https://github.com/wikimedia/analytics-zero-sms/blob/master/scripts/countrycounts.hql
[20:13:36] Analytics-Kanban, Analytics-Visualization: Build Multi-tennant Dashiki (host different layouts) - https://phabricator.wikimedia.org/T88372#1207679 (Milimetric) Agreed, nuria, here are my first thoughts for how to proceed. They're on line 644 in the tasking etherpad, we can iterate there and update here w...
[20:14:02] hm, joal, if you sort by Fair Share, you can see that only a couple of the essential jobs have any share allocated?
[20:14:04] i'm not really sure what that means
[20:14:25] I have seen that as well, and I don't know either
[20:14:26] but, one of them is taking a full steady/instantaneous fair share
[20:14:27] 982940
[20:14:29] whatever that means
[20:15:08] not sure I believe it though, that one is just the oozie launcher
[20:15:08] ?
[20:15:09] ottomata: go ahead and kill my query. I used "tablesample" but it didn't seem to make a difference
[20:15:23] ja, not sure what the original query was but
[20:15:27] https://yarn.wikimedia.org/proxy/application_1424966181866_85641/mapreduce/job/job_1424966181866_85641
[20:15:33] it wants to run >230K mappers
[20:16:55] !?! I didn't realize it decided for that many.
[20:17:20] but yes, kill it
[20:18:22] yurik: Do you really need every webrequest_source ?
[20:18:50] joal, sadly, yes
[20:19:03] need total traffic per country
[20:19:18] i wish we had that info for zero, but that task has stalled with ops :((
[20:19:49] Analytics-Kanban, Analytics-Wikimetrics, Community-Wikimetrics, Patch-For-Review: Story: user wants to be able to re-run a failed report more easily [13 pts] - https://phabricator.wikimedia.org/T88610#1207698 (mforns) @Capt_Swing, We just need to spend some time fixing other systems. When we are don...
[20:19:53] x_analytics containing zero is not enough I guess ...
[20:20:09] yurik: --^
[20:21:40] joal, ?? don't understand
[20:21:59] oh, sorry, was reading wikibug
[20:22:14] yurik: You want traffic per country for the full app, not for zero only, correct ?
[20:22:47] :)
[20:23:44] yes, x_analytics only contains it for the mobile cluster, not desktop nor other
[20:23:56] hmm
[20:23:56] most of our partners whitelist an entire ip range
[20:23:56] so they need to see how different their traffic is compared with country levels
[20:23:56] Yeah, but having a global count wouldn't help either, since you don't know for zero per se
[20:23:56] anyway
[20:23:56] a few comments on your code are arriving
[20:23:56] right - it's not very useful - we generate separate pageview reports for mobile pageviews
[20:23:57] thanks, always good )
[20:24:11] I mean, if it's not very useful, given the computation power it demands, it might be good to consider waiting for the accurate data, what do you think ?
[20:25:18] is_zero is not useful for this report - we need country level data
[20:25:30] but we also need pageviews per zero partner
[20:25:47] and for that we can't get total download size (because traffic is not marked)
[20:26:15] I understand you don't have precise enough data
[20:26:39] But does having an overall global number really help ...
[20:26:43] That's my question :)
[20:28:44] yes - different use case, different graph :)
[20:29:05] they want two graphs - my usage, by all sorts of sub-categories
[20:29:10] and overall per country
[20:29:52] btw, is there an easy way to get query result data now directly from python?
[20:30:06] i use /mnt text files of my table
[20:30:23] but sometimes it would be good to do a quick query on top of it
[20:31:04] reading the file is the easiest
[20:31:15] You can go for pyspark, but it's more complicated
[20:33:55] yurik, can you send a link to your code in gerrit, for me to comment easily ?
[20:34:16] joal, i'm not sure if it's possible to comment on non-patched code
[20:34:24] if it was a patch, then yes
[20:34:47] ah. if it has been pushed without code review, that's correct
[20:35:34] it's a small gerrit repo without anyone else working on it, with tons of scripts ))
[20:35:44] yurik: np
[20:36:58] Question: are you interested in having the sum when uri_host is undefined ?
[20:37:13] It could be filtered out instead of being left as '-'
[20:38:11] yurik: also, using x_analytics_map['https']=1 costs less than having to look for the string
[20:38:20] Same for proxy
[20:39:15] yurik: finally, having the distribute by is bad, you have a lot of reducers, and only one doing the job
[20:39:47] joal, sum for undefined - yes, as it shows how much i am skipping, and aids in explanation
[20:40:16] the way proxy & host are checked - they use identical condition, so they end up together
[20:40:33] https trick - good to know, will fix, didn't know we had that )
[20:40:44] btw, why wasn't is_zero done the same way?
[20:40:53] Same way :)
[20:41:06] is_zero is a separate field, isn't it?
[20:41:18] It is, built on top of x_analytics_map
[20:41:28] i see
[20:41:41] To facilitate access to this data without having to know what is inside each field
[20:41:57] And, to speed up requests via parquet
[20:42:07] as for distribute - i think i used to have it without it, but then it produced tons of text files, each a few kb. This way they collapse
[20:42:34] Yeah, of course --> everything goes to the same reducer
[20:42:53] SET mapred.reduce.tasks=1
[20:42:54] is there a way to reduce them to multiple, and then collapse that?
[20:43:00] The good way to do it
[20:43:10] ok, will fix
[20:43:36] Other thing: I would go for 5 requests, one per webrequest_source
[20:43:49] how so?
[20:44:27] Having 5 requests, one per webrequest_source, then another for final aggregation
[20:44:41] But it' a pain
[20:44:43] i don't want to hardcode the webreq_source constants, in case you add more
[20:45:05] in theory i could autocheck for that, but it is a massive pain )
[20:45:40] Those are the few tricks I can tell
[20:45:51] And in addition, don't run many at once
[20:45:58] Those queries are really big
[20:46:09] ok, what about line 42?
[20:46:25] IF is usually not used in sql
[20:46:32] usual syntax is case when
[20:47:18] SUM(CASE WHEN is_pageview THEN 1 ELSE 0 END) as count
[20:47:42] yes, IF is an HQL artifact. But can it be used directly - count(is_pageview) ?
[20:47:49] nope
[20:48:05] It would count everything, not only the TRUE value
[20:48:05] because if is_pageview was a null, it would not count
[20:48:17] not the null vals
[20:48:25] correct
[20:48:32] so here null !== false
[20:48:37] yeah
[20:48:38] sigh
[20:48:43] NULL means no value
[20:48:54] Here, there is one
[20:49:23] thx
[20:49:26] Oh yeah last thing: Please have 'COUNT(1)' instead of count(*)
[20:50:06] Prevents reading every field (hive is not dumb, but it's better syntax)
[20:51:34] About country code, there are exceptions when it's not two letters ?
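Folding joal's review comments into one place, a sketch of what the hot part of countrycounts.hql looks like afterwards (column names follow the chat; that x_analytics_map values are strings is an assumption). The same SQL works in the hive CLI, where `SET mapred.reduce.tasks=1;` replaces the DISTRIBUTE BY:

```
import org.apache.spark.sql.hive.HiveContext

val hc = new HiveContext(sc) // sc predefined in a spark-shell
val counts = hc.sql("""
  SELECT lower(uri_host) AS host,
         SUM(CASE WHEN is_pageview THEN 1 ELSE 0 END) AS pageviews, -- not COUNT(is_pageview), which also counts FALSE
         COUNT(1) AS requests,                                      -- COUNT(1) instead of COUNT(*)
         SUM(CASE WHEN x_analytics_map['https'] = '1' THEN 1 ELSE 0 END) AS https_requests -- map lookup, no string search
  FROM wmf.webrequest
  WHERE year = 2015 AND month = 4 AND day = 13 -- partition predicates as constants
  GROUP BY lower(uri_host)
""")
```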
[20:51:38] yurik: --^
[20:51:55] i have seen it as "false"
[20:52:03] probably not a string too
[20:52:08] hmm
[20:52:13] Really weirdo
[20:52:19] welcome to my world
[20:52:20] Should be '-' in any case
[20:52:27] '--'
[20:52:45] Another trick would be to do <> '--'
[20:52:52] Cheaper than regexp
[20:53:05] ?? the dash is any char?
[20:53:44] no, saying: you could check for length and being different than empty
[20:54:06] little optim, nevermind
[20:54:19] yeah, better safe here )
[20:54:25] The uri_host regexp is expensive
[20:54:37] arrf
[20:54:39] anyway
[20:54:58] that's it for me tonight
[20:55:06] See you all tomorrow
[20:55:24] Analytics-EventLogging, Analytics-Kanban: Troubleshoot EventLogging missing chunks of data - https://phabricator.wikimedia.org/T96082#1207872 (mforns) NEW a:mforns
[20:55:48] joal|night, yes it is, would be easier if it was sanitized beforehand ;)
[20:55:58] thanks!
[21:04:45] Analytics, Analytics-Cluster: Create new normalized uri_host field in refined webrequest table. - https://phabricator.wikimedia.org/T96044#1207903 (Yurik) I use this to convert, which does not catch all bad hosts ``` if (lower(uri_host) RLIKE '^([a-z0-9-]+\\.)*[a-z]*wik[it][a-z]*\\.[a-z]+(:80)?$'...
[21:15:00] nuria, milimetric, I'm poking around Event Logging for the first task from here (https://phabricator.wikimedia.org/T92985) and wondering how I can set up a dev environment for it
[21:21:49] (PS1) Yurik: Optimized countrycounts.hql [analytics/zero-sms] - https://gerrit.wikimedia.org/r/204153
[21:23:38] madhuvishy: have you set up mediawiki-vagrant ?
[21:23:49] https://www.mediawiki.org/wiki/MediaWiki-Vagrant
[21:23:54] ah no, not on this machine yet
[21:24:27] well, mw-vagrant was built partly to help with EL dev
[21:24:37] aah
[21:24:43] let me set that up
[21:26:34] once you set it up, you enable the eventlogging role and that'll give you a full development environment. https://www.mediawiki.org/wiki/Extension:EventLogging#Developer_setup
[21:31:26] ah cool
[21:32:15] milimetric: new laptop so lots of installations. sadly they gave me 15 inch, i asked for 13, and i'll get another one in a week and have to repeat all this
[21:32:26] (mwahahaha)
[21:32:37] YuviPanda: go away.
[21:33:03] madhuvishy: you should get the new macbook, that thing looks sweet
[21:33:25] * YuviPanda saves the taunting for later
[21:33:30] milimetric: Yeah! I wanted that too. But no 16gb ram even if you upgrade
[21:34:03] :) I have 4GB, but 640K should be enough for anyone
[21:34:52] ha ha I know 16gb is probably not even a requirement. Just seems like a good superpower of sorts to have and boast about
[21:35:23] also YuviPanda will taunt more if I have 8 :P
[21:35:36] tch tch :P
[21:35:40] I was doing android development
[21:35:44] I needed all the RAM in the world
[21:35:53] I suspect if you touch anything Java much 16G is not enough..
[21:36:03] IntelliJ + PyCharm + Vagrant eat up a lot
[21:37:11] ha ha right.
[21:54:41] 16gb of ram?
[21:55:11] can't hear you over the noise my work-from-home development desktop makes with its 64gb of ram, 4TB of HDD space, 1TB of SSD space and six processors
[21:58:06] * YuviPanda shouts louder
[22:01:55] Ironholds: :O
[22:02:20] madhuvishy: it’s ok, he writes code in one of the most inefficient languages known to man.
[22:03:43] C++?
[22:09:20] R
[22:14:10] YuviPanda: I don't know what you mean
[22:14:25] I can write an R package that runs faster than your Python ;)
[22:14:57] sorry, can’t hear you over my headphones :P
[22:49:48] milimetric: There seem to be dependencies for the eventlogging role.
[22:50:22] madhuvishy: you enabled it and then did vagrant provision?
[22:50:25] and you've got errors?
[22:50:27] yeah
[22:50:36] what are the errors
[22:51:47] http://dpaste.com/11ATT1K#wrap
[22:52:05] Skipping because of failed dependencies seems to be the highlight
[22:56:53] milimetric: ^
[22:58:59] madhuvishy: that looks like problems with mediawiki vagrant. This is ... not uncommon
[22:59:19] I'm not sure why, but people smarter than me work on that project and somehow it always has some issue
[22:59:32] milimetric: aah
[23:00:05] i'm not sure how to help you, because it always seems very random, like we uninstall and reinstall vagrant, virtualbox, re-clone everything, delete vms, run setup, and at some point it just works again
[23:00:15] are you on ubuntu or mac?
[23:01:22] milimetric: mac. doing a vagrant reload and trying again
[23:01:40] madhuvishy: nuria is on a mac, she might be able to help.
[23:04:20] madhuvishy: milimetric composer failure
[23:04:22] vagrant ss
[23:04:23] err
[23:04:24] vagrant ssh
[23:04:27] cd /vagrant/mediawiki
[23:04:29] composer install
[23:04:30] wait
[23:04:35] and then come back out and do vagrant provision again
[23:05:35] milimetric: I blame composer, mostly :)
[23:05:49] YuviPanda: milimetric, did not do composer install, but reload fixed it
[23:05:52] everyone who has had vagrant problems, it’s always been composer, at least as far as I saw
[23:05:56] ah nice.
[23:05:56] provision ran now
[23:06:13] yeah, I think it’s just that it takes two runs sometimes, and sometimes not
[23:06:21] and if you run composer install it gives a more useful error message if needed
[23:06:30] it’s all terrible PHP things I try to not be aware of
[23:06:47] I blame an ecosystem that's as sturdy as a Popsicle stick bridge :) (ruby that is)
[23:07:17] :D
[23:11:38] madhuvishy: need help?
[23:12:10] nuria: fixed for now :)
[23:12:26] madhuvishy: ok, lemme know
[23:36:44] Analytics-Kanban, Analytics-Visualization: Build Multi-tennant Dashiki (host different layouts) - https://phabricator.wikimedia.org/T88372#1208348 (Nuria) Updated tasking etherpad.
[23:43:33] milimetric: added comments in tasking etherpad about mutitetenant dashiki
[23:43:38] *multitenant
[23:53:01] Analytics-EventLogging, Analytics-Kanban, VisualEditor, Patch-For-Review: Wikitext events need to be sampled {lion} - https://phabricator.wikimedia.org/T93201#1208447 (Jdforrester-WMF) p:High>Normal
[23:54:00] Analytics-EventLogging, Analytics-Kanban, VisualEditor, WikiEditor, and 4 others: Wikitext events need to be sampled {lion} - https://phabricator.wikimedia.org/T93201#1208459 (Jdforrester-WMF) Open>Resolved a:Milimetric>Krenair
[23:54:04] Analytics-EventLogging, Analytics-Kanban, VisualEditor, WikiEditor, and 4 others: Wikitext events need to be sampled {lion} - https://phabricator.wikimedia.org/T93201#1131831 (Jdforrester-WMF)
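Finally, a quick sanity check of the host whitelist pattern Yurik posted on T96044 earlier in the evening; the regular expression is copied verbatim from that comment, and the sample hosts are just examples:

```
val wmfHost = "^([a-z0-9-]+\\.)*[a-z]*wik[it][a-z]*\\.[a-z]+(:80)?$".r

for (h <- Seq("en.wikipedia.org", "commons.wikimedia.org:80", "boogerballs", "10.64.0.171"))
  println(s"$h -> ${wmfHost.pattern.matcher(h.toLowerCase).matches}")
// en.wikipedia.org -> true, commons.wikimedia.org:80 -> true,
// boogerballs -> false, 10.64.0.171 -> false
```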