[00:44:02] is hue dead? https://hue.wikimedia.org/metastore/table/wmf/webrequest gives me 503: Request: GET http://hue.wikimedia.org/metastore/table/wmf/webrequest, from 10.64.0.171 via cp1043 cp1043 ([10.64.0.171]:80), Varnish XID 1001200157
[00:44:03] Forwarded for: 65.190.1.133, 10.64.0.171
[00:44:03] Error: 503, Service Unavailable at Tue, 14 Apr 2015 00:42:06 GMT
[00:58:06] well, it works for me
[01:01:45] yurik: you need to be added to the ldap userset hue pulls from i think
[01:03:12] Ironholds: for the session code i still do not get how you calculated geometric means before, did you do it with a numerical approximation in R?
[01:03:39] exp(sum(log(value)))
[01:03:55] well, if(value > 0){exp(sum(log(value)))}
[01:04:18] ack. Missed a bit in sanitising
[01:04:18] * Ironholds sighs at self
[01:04:19] exp(sum(log(x[x > 0]), na.rm = na_rm) / length(x))
[01:04:23] you can't do it value-by-value, unfortunately
[01:04:30] right, you cannot
[01:04:35] nuria, it was working before i think
[01:04:38] yesterday
[01:04:55] yurik: ahhh, wait then it might be domains, let me try
[01:05:13] amusingly, R doesn't actually have the geometric mean built in
[01:05:15] which: WTF.
[01:05:22] Ironholds: so you calculate the geomean of the log
[01:05:24] just got a 504 Gateway Time-out
[01:05:30] for https://hue.wikimedia.org/metastore/table/wmf/webrequest
[01:05:42] yurik: ya, "unavailable" same here
[01:06:08] yurik: but ahem, if that loads the table fully ...
[01:06:13] yurik: does it?
[01:06:13] nuria, indeed
[01:06:19] i really hope it doesn't :))
[01:06:27] yurik: jaja me too!
[01:06:49] well, i mean i guess we can sit back and watch some poor server getting 100TB ...
[01:06:52] into ram...
[01:07:06] oh wait, it's not 2025 yet
[01:07:07] oops
[01:08:01] err
[01:08:11] why would you need 100TB for the geometric mean?
[01:08:18] or are there overlapping threads?
[01:08:22] yep
[01:08:24] :))))
[01:08:32] it works for me, I promise. Want a screenshot? :P
[01:08:39] 100TB in RAM?
[01:08:59] * YuviPanda wonders how big a screenshot that would be
[01:09:27] Analytics, Analytics-Kanban: Onboard Madhu - https://phabricator.wikimedia.org/T92985#1205450 (kevinator) a:madhuvishy
[01:09:40] yurik: hue is kaput
[01:10:00] long live hue!
[01:25:59] Ironholds: even that numerical approximation will not work for this data volume i do not think
[01:27:19] how many values are you looking at?
[01:28:07] well, we will have millions
[01:28:14] and, would it help if I could come up with a way of never having more than...
[01:28:16] lesse.
[01:28:16] at least
[01:28:20] three numbers in memory at a time?
[01:28:32] let me throw up a prototype in a gist
[01:28:37] Ironholds: I think the geo mean is not an appropriate measure in this case
[01:28:57] Ironholds: it is not how much memory you hold at "one" time
[01:29:05] well, the statistical distributions disagree. You can use the quantiles if that would work better, or the median, to avoid the problem with the arithmetic mean, but..
[01:29:06] Ironholds: it's that the number is too large
[01:29:12] ..the arithmetic mean would be bad news bears
[01:29:14] what do you mean?
[01:29:22] Ironholds: you cannot multiply ad infinitum
[01:29:36] oh, the Inf rooting?
[01:29:50] Ironholds: quantiles we already have, that is no problem
[01:30:42] Ironholds: but you cannot multiply a series of numbers forever... makes sense?
[01:30:49] it does but you shouldn't have to.
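A minimal sketch of the log-space approach the conversation lands on, in Scala rather than the R snippets above, and not the actual refinery job: never multiply the raw values, keep only a running sum of logs and a count (the "three numbers in memory at a time" idea), then exponentiate once at the end.

```
object GeoMean {
  // Geometric mean in log space: exp(mean(log(x))). Only the current value,
  // the running log-sum and the count are ever held in memory.
  def geometricMean(values: Iterator[Double]): Double = {
    var logSum = 0.0
    var n = 0L
    for (v <- values if v > 0) { // skip non-positive values, like the R snippet above
      logSum += math.log(v)
      n += 1
    }
    if (n == 0) Double.NaN else math.exp(logSum / n)
  }

  def main(args: Array[String]): Unit = {
    // Values this large would overflow a double if multiplied directly
    // (their product is 4e1200), but their logs stay tiny.
    println(geometricMean(Iterator(1e300, 1e300, 1e300, 4e300))) // ~1.414e300
  }
}
```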
[01:31:21] you need to hold the result of your multiplication and that can be a number too large to hold
[01:31:40] that is, over the max_int limit?
[01:31:47] far far over
[01:31:57] huh. Otherwise I'd just suggest iterating through with the sum/length step, but..crap.
[01:32:16] and you can't use scientific notation to reduce the size until the final calculation?
[01:32:24] * Ironholds has no idea about Java's numeric types
[01:32:41] Ironholds: well, no matter the language right?
[01:32:55] if you multiply enough numbers you will get there in any language
[01:33:28] Well, yes
[01:33:38] but...how much data are you running across to get 2147483647 as a value?
[01:34:10] at least in R you get 9007199254740992 with double-precision.
[01:34:30] I can't see how we'd reach that threshold but if we're going to, we're going to; the quantiles or median are probably better for that, then.
[01:34:55] like, the point is to avoid the biasing the arithmetic mean introduces on log-scaled sets, and the median does that, or should. halfak thoughts if you're around?
[01:35:55] we have percentile50, which is not an aritmetic mean
[01:36:01] *arithmetic
[01:36:22] I will comment about this on the phab item
[01:36:23] indeed, it's the median, right?
[01:36:28] so yeah, should be totally fine
[01:38:06] Analytics-Cluster, Analytics-Kanban, Patch-For-Review: Mobile PMs has reports on session-related metrics from Wikipedia Apps - https://phabricator.wikimedia.org/T86535#1205502 (Nuria) I am not sure how the geometric calculation was done before as it seems like it's too big of a number and needs a numer...
[01:38:17] Ironholds: k
[01:40:31] nuria, thank you for working so late on this :)
[01:40:51] Ironholds: no merit on my part, cause i took the mandatory Spanish 2-hour lunch break!
[01:41:07] jajaja, if we're localising :D
[01:41:21] I spent my morning being scared by startup people
[01:41:46] RStudio invited me to chat to them about the ecosystem. The experience of the ColdFusion inventor bringing me drinks is...I don't understand people who get used to being around, like, TimBL
[01:42:59] jaja
[01:50:20] :(( i tried to get just one value out of hive with "select record_version from wmf.webrequest where ... limit 1", and it takes forever :(
[01:50:38] i was hoping to use it in scripts to quickly decide which query to run
[01:50:51] depending on the record version
[01:51:56] what was the "WHERE"?
[02:01:40] Ironholds, all cluster vars as constants
[02:01:44] *partition
[02:02:29] year, month, day, hour, webrequest_source?
[02:40:17] Ironholds, correct
[02:40:28] btw, is it normal for a job to use 451GB of RAM?!?
[02:40:52] ...no?
[02:41:04] I mean, it's loading the entire set you asked for into memory
[02:41:14] is this the same set of jobs that were using up the vast majority of the cluster memory last time?
[02:41:14] want to take a look at it?
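For the record, the limits being juggled in the thread are different things; pasted into a Scala REPL:

```
println(Int.MaxValue)              // 2147483647: the "max_int" figure quoted above
println(math.pow(2, 53).toLong)    // 9007199254740992: largest integer a double represents exactly
println(Double.MaxValue)           // ~1.7976931348623157E308: where a running product truly overflows
println(1e308 * 10)                // Infinity: multiplying raw values blows up here...
println(math.log(1e308) * 1000000) // ...but a million such factors keep the log-sum around 7.09e8
```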
it's another essential one :)
[02:41:22] i think so
[02:41:51] job_1424966181866_84802 and job_1424966181866_84815
[02:42:01] 200GB & 450GB
[02:42:41] 205824M and 6144M, according to hadoop job -list
[02:43:27] i think me looking at them scares them -- take a look at the graph - http://ganglia.wikimedia.org/latest/graph.php?r=hour&z=xlarge&hreg[]=analytics1012.eqiad.wmnet|analytics1018.eqiad.wmnet|analytics1021.eqiad.wmnet|analytics1022.eqiad.wmnet&mreg[]=kafka.server.BrokerTopicMetrics.%2B-BytesOutPerSec.OneMinuteRate&gtype=stack&title=kafka.server.BrokerTopicMetrics.%2B-BytesOutPerSec.OneMinuteRate&aggregate=1
[02:43:38] everything fun dead for a while
[02:44:31] this is perhaps something you should bring up with the analytics engineers during work hours, if you're worried about it
[02:44:36] job_1424966181866_84815 RUNNING 1428978513989 hdfs root.essential NORMAL 147 1 451584M 49152M 500736M http://analytics1001.eqiad.wmnet:8088/proxy/application_1424966181866_84815/
[02:45:13] work hours? i know not such thing. any hour is a work hour... as long as you enjoy it ))
[02:45:31] for *their* work hours, yurik :)
[02:45:33] not yours :)
[02:47:15] indeed
[02:47:19] because that way we don't, say
[02:47:29] do anything particularly damaging because we forgot it was a sunday afternoon for everyone else ;p
[02:47:37] *coughs discretely*
[02:51:02] Ironholds: at least you aren’t coughing continuously
[02:51:16] groooan
[02:51:26] :D
[02:56:12] LOLOLOL
[02:56:32] * yurik 's goal in life is to make ppl caugh
[02:56:43] coughing is good - clears up your lungs
[02:56:53] yurik: http://www.urbandictionary.com/define.php?term=Caugh
[02:56:58] > The sound a gopher makes when it simultaneously coughs and laughs while surrounded by future legal scholars.
[02:56:59] besides, i monitor scripts, it's not my fault they fail when ppl are not around )
[02:57:23] also http://www.urbandictionary.com/define.php?term=Caugh&defid=2282021 :D
[02:58:06] YuviPanda, the bottomless wisdom from you has enlightened me greatly :-P
[02:58:15] likewise, yurik :D
[02:58:16] o/
[02:58:46] now if we could increase the analytics cluster with some spare machines... :D
[02:59:13] * yurik wonders if there are any spares lying around that we can use for "hot plug"...
[02:59:28] e.g. - high load, the spare reformats itself and helps out :)
[02:59:38] (reformats = takes on a role)
[03:00:01] something like s3 cloud apparently does with its dynamic scaling
[03:00:22] we could get sooo many production machines from php to analytics in no time
[03:39:29] Analytics-EventLogging: agent_type field does not work for anything except last few hours - https://phabricator.wikimedia.org/T95806#1205709 (Yurik) @kevinator or @ottomata, is there an easy and *quick* way to check the version? I tried ``` hive -e "select record_version from wmf.webrequest...
[03:45:14] (PS8) Nuria: [WIP] Add Apps session metrics job [analytics/refinery/source] - https://gerrit.wikimedia.org/r/199935 (https://phabricator.wikimedia.org/T86535) (owner: Mforns)
[03:47:39] Analytics-Cluster, Analytics-Kanban: Report pageviews to the annual report - https://phabricator.wikimedia.org/T95573#1205724 (Nuria) @Yurik: these are not wiki pageviews, rather it is a totally different website: https://annual.wikimedia.org/2014/ whose data is not on refined tables (besides being a super-s...
[04:14:17] YUVIPANDA, don't raise your font at me!
[04:14:26] TOO LATE
[04:14:35] :)
[04:14:36] what's up with the name?
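As an aside on the "451GB" scare: per the hadoop job -list row above, that figure is the aggregate allocation across containers, not one JVM's heap. A quick sanity check (reading 147 as the used-container count is an assumption):

```
// 147 used containers sharing 451584M of allocated memory:
val containers = 147
val totalMb = 451584
println(totalMb / containers) // 3072, i.e. ordinary 3GB YARN containers
```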
[04:15:33] joke in another channel :)
[08:21:39] Analytics-Tech-community-metrics, ECT-April-2015: KPI pages in korma need horizontal margins - https://phabricator.wikimedia.org/T88670#1205914 (Qgil) The dialogs still can be cut sometimes. This is a screenshot of today. {F112669}
[08:38:17] Analytics-Tech-community-metrics, ECT-April-2015: Instructions to update user data in korma - https://phabricator.wikimedia.org/T88277#1205950 (Qgil) Thanks for the technical documentation. I have added [[ https://www.mediawiki.org/w/index.php?title=Community_metrics&diff=1534296&oldid=1534285 | instruction...
[08:40:30] Analytics-Tech-community-metrics, ECT-April-2015: KPI pages in korma need horizontal margins - https://phabricator.wikimedia.org/T88670#1205960 (Lcanasdiaz) Quim, I see the old margin + JS behaviour in the screenshot you attached. Could you please clean your cache and try again? You should see a wider marg...
[08:41:57] Analytics-Tech-community-metrics, ECT-April-2015: KPI pages in korma need horizontal margins - https://phabricator.wikimedia.org/T88670#1205961 (Qgil) Open>Resolved Sorry, this was a problem with my cache. Looks good now. This is Resolved Closed. Thank you!
[08:50:15] Analytics-Tech-community-metrics, ECT-April-2015: Instructions to update user data in korma - https://phabricator.wikimedia.org/T88277#1205978 (Dicortazar) That's a good point, thanks for the addition! I'd say that documentation regarding the user identities and affiliations management is good enough :).
[08:54:01] Analytics-Tech-community-metrics, ECT-April-2015: Instructions to update user data in korma - https://phabricator.wikimedia.org/T88277#1205986 (Dicortazar) Open>Resolved
[09:14:28] Analytics-Tech-community-metrics: "Who contributes code" page metrics are not updating - https://phabricator.wikimedia.org/T95166#1206014 (Dicortazar) The dashboard was successfully migrated. So we're now in a SortingHat version of the dashboard. The Gerrit part is now the one updated together with some oth...
[13:15:23] Analytics-EventLogging: agent_type field does not work for anything except last few hours - https://phabricator.wikimedia.org/T95806#1206373 (Ottomata) You should just use it (or the date) in a conditional. Make sure either the partition date > 2015-04-10, or that the record_version = "0.0.3". Otherwise, d...
[13:28:22] Analytics-Cluster: Installing Spark 1.3 on Cluster - https://phabricator.wikimedia.org/T95970#1206389 (Ottomata) Hey! Yeah, 1.3 is way better. It is a pain to officially install, it will be much easier to wait for a new release of CDH that includes this. However, it is very possible for you to run 1.3 on...
[13:35:55] Analytics-Cluster: Installing Spark 1.3 on Cluster - https://phabricator.wikimedia.org/T95970#1206394 (Ottomata) Open>Resolved a:Ottomata Actually, that was way easier than I remember it. I just downloaded the .tgz, set HADOOP_CONF_DIR in conf/spark-env.sh, and there we go. cd /home/otto/spark-1...
[14:21:35] Analytics-Kanban, Analytics-Visualization: Build Multi-tennant Dashiki (host different layouts) - https://phabricator.wikimedia.org/T88372#1206475 (Nuria) Please remember this also involves build changes as well as code changes.
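A sketch of the conditional Ottomata suggests in T95806 above, instead of the slow LIMIT 1 probe: branch on the partition date, since per the comment anything after 2015-04-10 carries record_version "0.0.3". The agent_type filter value and the dates are illustrative only.

```
import java.time.LocalDate // needs Java 8; substitute Joda-Time on older JVMs

val cutover = LocalDate.of(2015, 4, 10) // cutover date from the phab comment
val day = LocalDate.of(2015, 4, 13)     // the partition the script is about to query

val base = "SELECT COUNT(1) FROM wmf.webrequest WHERE year=2015 AND month=4 AND day=13"
val sql =
  if (day.isAfter(cutover)) base + " AND agent_type = 'user'" // field usable from 0.0.3 on
  else base                                                   // older partitions: skip agent_type
```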
[14:38:19] milimetric, mforns : master stats for EL: https://tendril.wikimedia.org/host/view/db1046.eqiad.wmnet/3306
[14:38:49] milimetric, mforns : i bet you are reading from a slave, let me see if i can find it
[14:38:56] nuria, ok
[14:40:21] mforns, milimetric: volume wise, when I backfilled, the db sustained 30,000 inserts per sec
[15:13:33] joal, you've used impala before?
[15:13:48] set it up at Fotolia with an ops
[15:13:49] ah pssh, we can talk in ops/analytics checkpoint
[15:14:17] yup
[15:15:04] arrf, Kevin wants me to stay
[15:15:08] later ;)
[15:27:09] hey nuria, when thinking about how to output the app session data
[15:27:16] try seeing if you can insert using HiveContext
[15:27:22] maybe you can just insert into a hive table
[15:27:23] ?
[15:27:27] that is text formatted?
[15:27:33] dunno, haven't tried it
[15:28:34] nuria: maybe we need 1.3?
[15:28:35] https://spark.apache.org/docs/latest/sql-programming-guide.html#saving-to-persistent-tables
[15:29:58] ottomata: Won't attend with jgage ...
[15:30:15] dawwwww but i wanted to talk to you about eventlogging!
[15:30:23] we talk later joal?
[15:30:36] for sure ;)
[15:30:49] ottomata: I will look into that, let me launch job with different serialization
[15:33:47] mforns: you can sync up with springle about slave issues, seems to me that we are reading from the EL slave if data is not there and it magically appears a couple hours after
[15:34:16] nuria, ok
[15:34:26] mforns: given the tz i can help with that and try to ping springle before i am going to bed
[15:34:51] nuria, do you know which hours he is around?
[15:35:44] mforns: puff no...but i would send him an e-mail cc-ing analytics with the problems you are seeing. seems that there are two:
[15:36:20] mforns: 1) small intervals of complete data loss across all tables (this points to replication issues, seems to me)
[15:36:28] mforns: 2) delayed reads
[15:36:40] nuria, ok
[15:36:43] will do
[15:38:39] mforns: if you give him timestamps that will be best, check outage report and sal log because when andrew and i deployed/upgraded EL there are intervals w/o data (box switch), disk full, etc., for example, you would expect that when disk was full nothing worked: https://wikitech.wikimedia.org/wiki/Incident_documentation/20150406-EventLogging
[15:38:57] all incidents of that type should be logged
[15:40:16] nuria, where is sal?
[15:40:34] madhuvishy: I am on PST tz (normally i take a big break mid day) please do let me know if you need anything
[15:40:38] mforns: https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:40:45] thx
[15:58:43] Analytics-Cluster: Make spark work well with webrequest Parquet data - https://phabricator.wikimedia.org/T93105#1206854 (Ottomata) Open>Resolved We can use SQL context for this, no prob!
[15:59:03] Analytics-Cluster: Setting up Ipython with Spark - https://phabricator.wikimedia.org/T92743#1206857 (Ottomata) Ellery, you've got this working, right? Can we resolve this task?
[16:00:26] Analytics-Cluster: Epic: AnalyticsEng has kafkatee running in lieu of varnishncsa and udp2log - https://phabricator.wikimedia.org/T70139#1206862 (Ottomata)
[16:00:29] Analytics-Cluster: Story: Vet the kafkatee generated files - https://phabricator.wikimedia.org/T70248#1206859 (Ottomata) Open>Resolved a:Ottomata Data was vetted long ago.
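Following the HiveContext idea above, a sketch of Spark 1.3's "saving to persistent tables" from the linked guide, run from a spark-shell where sc is predefined; the table and column names are made up:

```
import org.apache.spark.sql.hive.HiveContext

val hc = new HiveContext(sc)
import hc.implicits._

val sessions = sc.parallelize(Seq(("uuid1", 3, 542L), ("uuid2", 1, 60L)))
  .toDF("uuid", "pageviews", "session_seconds")

// Registers the table in the Hive metastore so the hive CLI can read it back.
// Note it is written in Spark's default format, not plain text; for an
// existing text-formatted table, an INSERT through SQL is closer:
sessions.saveAsTable("tmp_app_session_metrics")
// hc.sql("INSERT INTO TABLE some_text_table SELECT * FROM tmp_app_session_metrics")
```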
[16:00:44] Analytics-Cluster: Epic: AnalyticsEng has kafkatee running in lieu of varnishncsa and udp2log - https://phabricator.wikimedia.org/T70139#737578 (Ottomata)
[16:00:46] Analytics-Cluster: Story: AnalyticsEng generates new datafiles using kafkatee - https://phabricator.wikimedia.org/T70247#1206864 (Ottomata) Open>Invalid a:Ottomata We are using Hive to generate this data.
[16:01:28] Analytics-Cluster: Story: Transparently switch from udp2log datafiles over to kafkatee generated datafiles - https://phabricator.wikimedia.org/T70250#1206870 (Ottomata) Open>declined a:Ottomata We are using Hive to generate these files.
[16:01:29] Analytics-Cluster: Epic: AnalyticsEng has kafkatee running in lieu of varnishncsa and udp2log - https://phabricator.wikimedia.org/T70139#737578 (Ottomata)
[16:01:49] Analytics-Cluster: Epic: AnalyticsEng has kafkatee running in lieu of varnishncsa and udp2log - https://phabricator.wikimedia.org/T70139#737578 (Ottomata)
[16:01:51] Analytics-Cluster: Story: AnalyticsEng has kafkatee on analytics1003 - https://phabricator.wikimedia.org/T70246#1206877 (Ottomata) Open>declined a:Ottomata We are using Hive to do this.
[16:03:06] ottomata: yt ?
[16:03:39] yuo
[16:03:40] yup
[16:03:51] batcave ?
[16:04:27] there.
[16:11:16] ottomata, joal : TA_TATACHANNNNNN
[16:11:24] hu ?
[16:11:25] job without kryo took 14 mins
[16:11:27] https://yarn.wikimedia.org/cluster/app/application_1424966181866_85708
[16:11:35] hopefully not a fluke
[16:11:49] i am going to put the kryo serializer again and re-test
[16:33:29] ottomata, hi, thanks for all the good answers! And i have another one :) Any thoughts about the junk in URI_HOST field? https://phabricator.wikimedia.org/T95836
[16:35:32] yurik: , naw we can't do much about that, since those are real requests
[16:35:52] we want to capture that stuff
[16:35:55] even if it is dumb
[16:36:04] ottomata, but that makes no sense - how would any DNS server out there give those hosts...
[16:36:12] it's an error of some sort...
[16:36:34] btw, ottomata - i saw that a few times there were non-0 is_pageview counts
[16:37:06] there is a slight chance the is_pageview is broken, right?
[16:37:07] yuri
[16:37:08] curl -i -H 'Host: boogerballs' bits.wikimedia.org
[16:37:09] kevinator_: you there ?
[16:37:18] ?
[16:37:56] i did the count(is_pageview) - group by uri_host, and some weird host names had > 0
[16:38:06] which ones?
[16:38:21] i would have to research, thought to check with you if you knew anything about it
[16:38:32] naw, i also don't know much about how the actual definition works
[16:38:45] but, it is possible that it is wrong in some cases
[16:38:58] if the weird urls were just casing, it probably is ok with that
[16:39:34] is it possible to at least do the "lower()" on them?
[16:39:45] ottomata, nuria : for pageviews aggregation, I should only consider webrequest_source to be in [text, mobile] correct ?
[16:42:23] joal: mmm... "uploads"?
[16:42:31] is_pageview will tell you
[16:42:35] k
[16:43:34] just text and mobile
[16:43:42] nuria, why would you even see anything in uploads?
[16:43:44] I don't actually think webrequest_source is run over upload
[16:43:55] (if it is, that's unnecessary memory use. Stop that! :p)
[16:44:00] *is_pageview is run over
[16:44:07] that's what i thought
[16:44:11] Ironholds: well, easy then
[16:44:28] is_pageview is run over the full table I think
[16:44:35] that's....also unnecessary
[16:44:39] yeah.
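For the Kryo A/B test above, this is roughly what enabling it looks like, including the class registration and buffer questions that come up later in the evening; Visit is a hypothetical stand-in for whatever the session job actually shuffles:

```
import org.apache.spark.{SparkConf, SparkContext}

case class Visit(uuid: String, timestamps: Array[Long])

val conf = new SparkConf()
  .setAppName("AppSessionMetrics")
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  // Default is 2 (MB); only raise it if a single serialized object exceeds that.
  .set("spark.kryoserializer.buffer.mb", "16")
  // Unregistered classes still work, but ship the full class name with every object.
  .registerKryoClasses(Array(classOf[Visit], classOf[Array[Long]]))

val sc = new SparkContext(conf)
```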
[16:44:52] you can save some time and memory (a lot of it) by eliminating bits and upload and misc
[16:44:55] Ironholds, nuria, please set uri_host to "lower(uri_host)" everywhere - makes everyone's life soo much easier ))
[16:45:04] sorry, meant ottomata ^
[16:45:15] and remove the port
[16:45:38] that will get rid of the most weird cases that should be considered in most stats
[16:45:56] i'm still not sure why it shows 10.x.x.x as coming from GB :)
[16:47:10] you're confused as to why it doesn't know where your server is?
[16:47:32] it's your server.
[16:47:41] If MaxMind knew it was in Virginia I'd be worried.
[16:48:16] Analytics: Junk in wmf.webrequest.uri_host field - https://phabricator.wikimedia.org/T95836#1207012 (Yurik) At the very least, please make uri_host lower case, and remove the redundant port :80 string at the end, as it only obfuscates some results. Thanks!
[16:48:41] Ironholds, i'm talking about the webrequests data :))
[16:49:01] uri_host there is often 10.x.x.x, and the geo tag is in GB
[16:49:33] uri_host is 10.x.x.x?
[16:49:46] what's the IP and XFF?
[16:50:42] don't know - i was looking at the aggregates, grouped by uri_host
[16:50:49] and by country
[16:51:00] okay
[16:51:04] there are tons of tem
[16:51:06] them
[16:51:08] yeah, I'm sure
[16:51:22] but given that the geolocation is run over ip and xff, what the uri_host says is kind of irrelevant.
[16:51:36] and I know you're talking about the webrequests data, MaxMind provides our geolocation software.
[16:51:59] ah, sorry, misread :)
[16:52:00] funny
[16:52:10] * Ironholds shrugs
[16:52:17] people submit dummy requests or spoof requests all the time.
[16:52:32] because (1) we're a top-10 and (2) see the Greater Internet Fuckwad theorem.
[16:52:41] i doubt this is the case though - i think it might be some of our servers doing it
[16:53:01] passing in an invalid XFF? Oh, probably.
[16:53:23] Operations has a long and consistent track record of doing mind-bogglingly selfish things when it comes to monitoring software and how it interacts with the logs.
[16:53:35] see also "let's have Python ping $ARTICLE 10m times a day and not tell anyone"
[16:53:41] not even sure it is an invalid request at first - something else calls it - like internal service/heartbeat, etc
[16:53:47] indeed
[16:54:17] don't we have a token bucket qos?
[16:54:27] or something of that sort?
[16:59:00] Analytics: Make all wmf.webrequest.uri_host lower case, and remove ":80" at the end - https://phabricator.wikimedia.org/T96044#1207046 (Yurik) NEW
[16:59:18] Analytics: Make all wmf.webrequest.uri_host lower case, and remove ":80" at the end - https://phabricator.wikimedia.org/T96044#1207053 (Yurik)
[17:00:14] yurik, I don't know what that means
[17:00:26] Ironholds, which part?
[17:00:33] all I know is that every time I dug into this, ops talked down to me and -2d my patches, and I'm done dealing with them
[17:00:45] sigh
[17:00:47] i hear you
[17:00:59] very disheartening dealing with them on occasion
[17:02:12] Analytics-Cluster: Setting up Ipython with Spark - https://phabricator.wikimedia.org/T92743#1207060 (ellery) Yes, I'm all set.
[17:04:29] joal, ottomata by the looks of it, yes, kryo serialization is not so hot
[17:04:36] hmmm
[17:04:40] sounds weird
[17:04:41] Ironholds, do you think this can be done? https://phabricator.wikimedia.org/T96044
[17:05:02] sure, but I won't be the one to do it.
[17:05:21] (not "I don't see it as valuable": "I am not in any way formally attached to analytics and I accidentally signed up for four papers") [17:05:52] yurik: , i don' think so. [17:05:54] Ironholds: ah, so you are putting in your… four papers. [17:05:54] mayyybe lower [17:05:59] * YuviPanda slowly walks away [17:06:00] but i think that would be dishonest [17:06:03] maybe another field that is normalized [17:06:09] but i wouldn't replace that one [17:06:16] that field is supposed to be the HTTP Host header [17:06:23] ottomata, but we already have the raw requests in case anyone wants to do funky hosts research [17:06:41] whereas the web.requests is for the regular counting [17:06:42] i'd be ok with an additional field, i suppose [17:06:57] and if data is not normalized, we will underount [17:07:00] undercount [17:07:13] why? most of the requests are bogus anyway [17:07:23] just don't count them [17:07:25] as for lower/upper [17:07:31] just call lower() on them whne you count? [17:07:46] that's the probelm - filtering them out requires a complex and well tested/maintained regex [17:07:54] ksorry Kryyo [17:08:08] yurik: are you trying to count using is_pageview? [17:08:23] YuviPanda, I don't get it. [17:08:45] or just raw # of request? [17:08:49] ottomata, no, https://git.wikimedia.org/blob/analytics%2Fzero-sms/HEAD/scripts%2Fcountrycounts.hql [17:09:07] grr, git is slow again [17:09:09] haha [17:09:28] i am doing sum(size), count(is_pageview), count(*) group by uri_host [17:09:29] i just use github [17:09:30] https://github.com/wikimedia/analytics-zero-sms/blob/master/scripts/countrycounts.hql [17:09:32] joal, ottomata , this might be why "If your objects are large, you may also need to increase the spark.kryoserializer.buffer.mb config property. The default is 2, but this value needs to be large enough to hold the largest object you will serialize." [17:09:50] Ahhhh ! [17:09:56] nuria: Riiight [17:10:10] 2mb? [17:10:17] what object is that big? [17:10:29] ottomata: that seems large right? [17:10:44] ja, for what you are doing, i dunno what it has to serialize though [17:10:53] i would think it would be some object at the level of a single rdd record [17:10:56] ottomata: to reshuffle, right? [17:10:58] so, whatever columns you are selecting [17:11:11] ottomata: not selection, as that is distributed [17:11:22] right, but you have them in memory after that [17:11:29] it is computed on each "node" [17:11:35] is that buffer for per executor serialization, or per object? [17:11:44] i guess [17:11:51] per rdd partition or per object? [17:12:04] ottomata: sounds like it is per "execution" [17:12:30] ja but it says 'if your objects are large' [17:12:35] so the largest object it would be the size of the rdd on one partition [17:12:41] so i would think you would increase that if your individual objects were going to be larger than mb [17:12:42] yeah [17:12:50] the full rdd on a partition? [17:12:51] nawwwww [17:12:52] really? [17:12:54] mabye. [17:12:56] ottomata: agreed [17:13:02] i guess why else would they have it set at 2mb though.. [17:13:03] Not the size of rdd [17:13:09] WOW, MS is implementing .NET LLVM compiler! 
[17:13:15] MIT lic
[17:13:19] well, objects understood as objects full of data, not containers
[17:13:33] i would think the serialization would happen on the record level
[17:13:47] objects serialized with kryo is what is inside a rdd
[17:13:53] so, it should be java object overhead + size of data fields in between shuffles
[17:13:53] so if my rdd is 0.1M and i have 10 partitions and i need to reshuffle i have 1M of data
[17:14:24] a row
[17:14:42] nuria ??
[17:14:48] i think it shouldn't have to be much larger than the fields being worked with
[17:14:59] so, 2MB sounds big enough to me
[17:15:32] mmm... unless it's not per field
[17:15:32] let me read the kryo docs
[17:15:36] Except that we extract per id timestamp list
[17:15:45] This can be big
[17:15:46] oh.
[17:15:50] yes.
[17:15:55] joal: no, no, the timestamp list per uuid is small
[17:15:56] It's not per field, but per row
[17:16:01] oh ok
[17:16:07] Always ?
[17:16:34] joal: people do not look at 1000 wikipedia pages
[17:16:39] joal: yes, there are no crawlers here
[17:16:39] joal: this is apps traffic
[17:16:48] I agree for the principle, but, a well hidden robot ?
[17:16:54] joal: again, no crawlers
[17:16:56] hmm
[17:17:03] joal: it's mobile apps data
[17:17:56] I would double check the number though ;)
[17:18:00] joal: i alredy did
[17:18:04] *already
[17:18:41] oki :)
[17:18:49] madhuvishy: helllooooooo
[17:19:15] but even just thinking about use cases: the timestamp list is your list of hits in the app in a month
[17:19:42] that is less than 10,000 as every hit is a page load
[17:20:46] Analytics: Create new normalized uri_host field in refined webrequest table. - https://phabricator.wikimedia.org/T96044#1207115 (Ottomata)
[17:20:46] Analytics: Create new normalized uri_host field in refined webrequest table. - https://phabricator.wikimedia.org/T96044#1207046 (Ottomata) I don't think we should alter the data in this field, but I would be fine with adding a new field that contained normalized uri_host. 'uri_host_normalized'? 'uri_host_n...
[17:20:54] joal: so, not large, but the thing is that you have a lot of datapoints to shuffle through
[17:21:10] Analytics, Analytics-Cluster: Create new normalized uri_host field in refined webrequest table. - https://phabricator.wikimedia.org/T96044#1207046 (Ottomata)
[17:21:31] Yeah, but that should only put pressure on the network nuria ...
[17:21:36] Analytics, Analytics-Cluster: Create new normalized uri_host field in refined webrequest table. - https://phabricator.wikimedia.org/T96044#1207046 (Ottomata) I don't think we should alter the data in this field, but I would be fine with adding a new field that contained normalized uri_host. 'uri_host_no...
[17:22:02] joal: but there is where serialization is taking place, righhhtt??
[17:22:23] yes m'dame
[17:22:30] But serialization is at row level
[17:23:29] joal: mannnn....then why that 2M limit, that seems huge
[17:23:58] It is huge, but imagine we are talking about wikimedia revisions ?
[17:24:08] Not so big then
[17:25:21] joal: ya, true
[17:25:49] joal, ottomata : run test again: with kryo (~30 mins), without it (14 mins)
[17:27:26] ok, then don't use it :)
[17:27:48] sounds not right ;)
[17:28:02] How do you register your kryo classes ?
[17:30:13] nuria: also, how many partitions do you have and how many are you sending data to ?
[17:33:41] ottomata, ok, so how about a new field uri_host_clean (or similar), that only contains a value like en.wikipedia.org, and never anything that is not a WMF domain?
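Until any uri_host_normalized field exists, the query-time cleanup discussed above (lower-casing plus stripping a trailing ":80") is cheap enough to inline; a sketch against wmf.webrequest from a spark-shell where sc is predefined, with example partition values:

```
import org.apache.spark.sql.hive.HiveContext

val hc = new HiveContext(sc)
val byHost = hc.sql("""
  SELECT regexp_replace(lower(uri_host), ':80$', '') AS host,
         COUNT(1) AS requests
  FROM wmf.webrequest
  WHERE webrequest_source = 'text'
    AND year = 2015 AND month = 4 AND day = 13 AND hour = 0
  GROUP BY regexp_replace(lower(uri_host), ':80$', '')
""")
byHost.show()
```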
[17:33:57] why can't we just sanitise the hosts?
[17:35:43] we could, but ottomata prefers not to, even though we do have the original in the raw requests ))
[17:35:56] raw requests are less efficient to query
[17:36:06] Ironholds: because uri_host == HTTP Host header
[17:36:14] it would be dishonest for it not to be!
[17:36:24] but, i'm fine with uri_host_normalized or _clean or something
[17:36:25] well, we already strip things from it
[17:36:32] not in the table data...
[17:36:34] ?
[17:36:37] Analytics, Analytics-Cluster: Create new normalized uri_host field in refined webrequest table. - https://phabricator.wikimedia.org/T96044#1207212 (Yurik) funny dup. I think the new field should only contain "proper" uri hosts, like those controlled by wmf. All else should be blank.
[17:36:40] we remove http:// and https://. We remove www except for wikidata.
[17:36:52] no we don't remove that
[17:36:56] Ironholds, you remove www?
[17:36:58] it just doesn't show up?
[17:36:58] i saw them
[17:36:59] joal: well, a shuffle will be shuffling across all partitions if i understand this right, ....
[17:37:11] joal: now, how many partitions do i have i do not even know
[17:37:12] Correct
[17:37:21] nuria: You can also ask to get more / less partitions
[17:37:23] i think you can already safely remove the :80
[17:37:24] ottomata: Hi!! Just got to office etc.
[17:37:29] Ironholds: unless varnish does something with it that I don't know about
[17:37:39] but, uri_host should be HTTP host header from client
[17:37:54] %h
[17:37:54] Remote host. Defaults to '-' if not known.
[17:38:16] joal: are partitions defined by the context?
[17:38:26] sorry
[17:38:28] wrong field.
[17:38:44] %{Host@uri_host}i
[17:38:45] nuria: partitions are defined by how many blocks need to be read
[17:38:49] %{X}i
[17:38:49] The contents of request header X.
[17:38:59] ottomata, who is the target audience for this table? Debugging should go via logs and raw, and this table seems more suited for research
[17:39:23] yurik: if/when we build a 'research' table that is sanitized, then I think that could be worth doing
[17:39:26] that may or may not be this table
[17:39:35] but, that would be the one that had user_agent cleaned, and IPs removed
[17:39:37] joal: I should be able to see this on the job, let me launch it again
[17:39:40] nuria: would be interesting to check in sparkshell with a cached rdd, using the ui
[17:39:45] yup
[17:39:56] nuria: I go for dinner, will be back ;)
[17:40:03] madhuvishy: hiyaa, made this ticket specially for you: https://phabricator.wikimedia.org/T96053
[17:40:10] joal: but the shell doesn't have the capacity (memory) to load more than 1 bit of data
[17:40:14] research in the sense of internal stats, but yes, a general research table would not have those columns... per column security?
[17:40:19] joal|away: k
[17:42:40] ottomata: Yay thank you
[17:43:25] HAHAHAHhahahahaha
[17:43:32] aaww qchris is not here
[17:43:43] milimetric: qchris left me a joke in one of his many documents
[17:43:46] and didn't tell me about it
[17:43:53] (or if he did I forgot about it)
[17:43:58] Do not email the contents of new-htpasswd-line or paste in IRC. Ask the op to pick up the file from stat1001 by thonself
[17:43:59] "
[17:44:06] thonself!
[17:44:07] bwaaahhaa
[17:46:15] ottomata: i think our handsome qchris volunteer is kind of busy right now...
[17:47:13] Analytics, Ops-Access-Requests, operations: Grant Sati access to geowiki - https://phabricator.wikimedia.org/T95494#1207253 (Ottomata) Sati, this is done.
I need a way to get the password to you securely. Unless you know how to use gpg, probably the simplest thing to do would be for us to hop into a...
[17:47:33] nuria: ohHHHHhhh yes he probably is!
[18:18:46] joal|away, ottomata : so, ahem, from what i can see on spark ui we have 1500 partitions?
[18:18:58] ottomata: according to http://databricks.gitbooks.io/databricks-spark-knowledge-base/content/performance_optimization/how_many_partitions_does_an_rdd_have.html
[18:19:06] and https://yarn.wikimedia.org/proxy/application_1424966181866_85825/stages/
[18:21:52] hmm joal, whoopsie:
[18:21:56] https://issues.apache.org/jira/browse/HIVE-6394
[18:22:05] doesn't work with our current version of hive!
[18:22:12] good thing no one really wants it yet :p
[18:22:26] Analytics-Cluster, operations, ops-eqiad, Patch-For-Review: analytics1020 hardware failure - https://phabricator.wikimedia.org/T95263#1207402 (Cmjohnson) Open>Resolved 2 motherboards later and this is fixed. Resolving
[18:26:06] whoaaa
[18:26:08] spark-sql
[18:26:38] Analytics-Kanban, Analytics-Wikimetrics, Community-Wikimetrics, Patch-For-Review: Story: user wants to be able to re-run a failed report more easily [13 pts] - https://phabricator.wikimedia.org/T88610#1207418 (Capt_Swing) @mforns just checking in: what does 'paused' status mean on Kanban workboard?
[18:26:41] ottomata: oops indeed !
[18:27:00] ottomata, joal : so i am going to remove the kryo serialization then?
[18:27:20] nuria: 1500 partitions is, I think, the number of blocks read
[18:27:40] You would get the same number of mappers from a query in hive
[18:28:19] nuria: you can, but I would like to do some more testing on that aspect later
[18:28:28] ottomata, joal: if the 2mb limit is per row and basic spark classes are already registered with serializer..."seems" that serialization with kryo should work as is
[18:29:23] basic spark classes --> I am not sure
[18:30:08] joal: ah, ok, it says so on their docs
[18:30:17] joal: "Spark automatically includes Kryo serializers for the many commonly-used core Scala classes covered in the AllScalaRegistrar from the Twitter chill library."
[18:30:19] joal i would think they would be, seeing as the config setting nuria is tweaking is a spark one
[18:30:41] ok
[18:30:52] But the fact that it is slower is bizarre
[18:32:37] joal: ya, i also do not understand the partitions abstraction either if it doesn't relate to hdfs nodes
[18:32:54] It's linked to hdfs blocks
[18:34:03] nuria: want to discuss in batcave ?
[18:34:08] sure
[18:34:13] ottomata: want to join?
[18:35:27] hm, no, into reading stuff :)
[18:35:35] you can do it!
[18:37:23] milimetric: folks are trying to make mondrian work with spark sql :p
[18:39:05] whoaa, and nuria, milimetric, mforns, check this out: https://spark.apache.org/docs/1.3.0/api/scala/index.html#org.apache.spark.rdd.JdbcRDD
[18:39:19] you can query mysql from spark and get an RDD
[18:39:33] then you could join them in spark, coOooL
[18:39:48] huh!
[18:40:34] ottomata: but the main problem isn't loading the data from mysql. It's the fact that our current schema forces us to delete useful data
[18:40:37] so no sqoop needed?
[18:40:46] because we just overwrite instead of append
[18:40:48] ehhh?
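A sketch of the JdbcRDD linked above (Spark 1.3), pulling a MySQL table into an RDD so it can be joined against Hive-side data; the host, credentials and bounds are placeholders, and the MySQL JDBC driver has to be on the classpath:

```
import java.sql.{DriverManager, ResultSet}
import org.apache.spark.rdd.JdbcRDD

val pages = new JdbcRDD(
  sc,
  () => DriverManager.getConnection(
    "jdbc:mysql://db-host.example:3306/enwiki", "research", "secret"),
  // JdbcRDD requires exactly two '?' placeholders so it can split the key
  // range across partitions:
  "SELECT page_id, page_namespace FROM page WHERE ? <= page_id AND page_id <= ?",
  lowerBound = 1L, upperBound = 50000000L, numPartitions = 20,
  mapRow = (r: ResultSet) => (r.getLong(1), r.getInt(2))
)
println(pages.count())
```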
[18:41:04] mforns: just a thing, but ja, you could do it that way
[18:41:08] also, it isn't working well
[18:41:12] but spark 1.3 has a spark-sql cli
[18:41:17] that hooks up to hive metastore
[18:41:24] cool :)
[18:41:33] so, you can issue sql statements that query hive tables but are run as spark jobs
[18:42:03] milimetric: don't get this "ottomata: but the main problem isn't loading the data from mysql.  It's the fact that our current schema forces us to delete useful data"
[18:42:09] which schema?
[18:42:50] our mediawiki schema
[18:43:11] like, "update page set page_namespace = 2 where page_id = 1234;"
[18:43:17] that's the problem ^
[18:43:35] what we want is to eventually have:
[18:44:05] "insert into page (valid_from, valid_to, page_id, namespace) values (...)
[18:44:56] we as in data warehouse?
[18:45:08] yeah, we want a data warehouse
[18:45:15] so querying data directly from mysql doesn't help us with that
[18:45:45] ah, all i'm saying is it is coOOOool
[18:46:06] it's cool, i agree. I said "huh!" as in "cool!"
[18:46:08] :)
[18:46:10] haha
[18:46:30] and also wondering: if spark has jdbc connectors and *could* work with mondrian, and *could* work as sql cli
[18:46:35] maybe impala/drill not needed?
[18:47:11] naww, probably still
[18:47:12] dunno
[18:50:57] I guess we could try it. One great thing we get with Hive is instant familiarity for mysql folks
[18:51:30] when Oliver first started looking at Hive, that was a huge plus. And if we're taking data out of analytics-store and putting it in here, I think that's a requirement - not having a totally different tool
[18:51:44] and spark's a little different, you don't quite get to do things like "show tables;" etc.
[18:51:53] * Ironholds rises from the depths
[18:51:58] WHO SPAKETH MY NAME
[18:52:19] oh, Hive? yeah, easy-peasy
[18:52:30] just remember it's strict as all hell, joining between tables is a pain, and how partitions work
[18:52:34] milimetric: you do!
[18:52:36] spark-sql is a SQL CLI
[18:52:42] that uses hive-metastore
[18:52:58] just like hive CLI
[18:53:06] although it is barely working for me atm :)
[18:53:46] oh, well, then if it starts more than barely working, we're gold
[18:53:48] gold!
[19:00:10] yurik: ahem... could you run 1 job at a time so cluster has a little more resources to spare for other jobs?
[19:01:44] yurik: cause i am running now 1 and it is taking 50% longer than it did a couple hours ago
[19:12:11] Analytics-Kanban, Analytics-Wikimetrics, Community-Wikimetrics, Patch-For-Review: Story: user wants to be able to re-run a failed report more easily [13 pts] - https://phabricator.wikimedia.org/T88610#1207515 (Nuria) @Capt_Swing Means we are not working on it at the moment as other higher priority...
[19:31:39] mforns: regarding the mysql consumer, i can see how it could run into a runtime exception, shut down and restart, that would be secs/minutes and events might be dropped
[19:32:05] mforns: it is hard to see how that could happen for an hour if the process was actually running
[19:32:14] nuria, aha, one question, does consumer use buffers? or does zmq use buffers?
[19:32:28] I see
[19:32:49] it consumes a stream from zeromq, the same one as the logs (same code)
[19:33:24] so there is no "buffering" on our code
[19:34:12] but is the zmq stream pub/sub or is it buffered? I mean, does the data wait in zmq to be pulled, or does zmq send it to the consumer?
[19:35:07] nuria, ^
[19:38:27] nuria, today there was a 2-hour hole in the EL db
[19:39:32] nuria, I suspect it must have something to do with buffers, because otherwise, there would be a time hole in the consumer logs also.
[19:39:54] nuria, but the consumer continues to work normally at all times
[19:41:37] nuria, the processor logs do contain the missing data, and it is valid data, so.. I'd say the problem is in zmq or the consumer.
[19:43:01] halfak: Yt ?
[19:43:12] Yeah. What's up joal?
[19:43:20] connected to altiscale :)
[19:43:29] I have a few questions for you
[19:43:35] batcave for 5 minutes ?
[19:53:41] mforns: zmq is further up
[19:53:58] nuria, doesn't the consumer read from zmq?
[19:54:03] mforns: processor is consuming from zmq
[19:54:21] mforns: everything consumes from zmq upstream
[19:54:44] nuria, so where does the consumer read from?
[19:55:14] mforns: there are several consumers, mysql, logs
[19:55:34] mforns: that read from a stream of validated events
[19:56:02] mforns: if events are coming in and are persisted into logs, it means the incoming stream has them
[19:56:17] mforns: which means that mysql consumer saw them too
[19:56:56] nuria, I see
[19:56:56] mforns: if there are no errors and time interval is very well defined for a bunch of tables while processes on vanadium were running
[19:57:03] mforns: it could be the network
[19:57:34] mforns: checking whether processes were running can be done by checking /upstart/ logs
[19:57:46] mforns: sounds like you already have done that
[19:57:48] halfak: we'll talk tomorrow :)
[19:58:28] Woops. Sorry. Got distracted
[19:58:33] * halfak reads scrollback
[19:58:36] ottomata: are you trying something with the cluster ?
[19:58:40] no
[19:58:45] nuria, yes, upstart/consumer logs have records of writes at the time of the holes
[19:58:48] Sorry joal. tomorrow it is.
[19:58:49] wasup?
[19:58:50] halfak: np ;)
[19:59:03] cluster seems really behind
[19:59:09] ottomata: --^
[19:59:16] mforns: absolutely "zero" events for a time period in which the consumer was running w/o errors or a magic restart sounds like inserts are going into /dev/null due to network/db issues
[19:59:32] nuria, aha
[19:59:41] hm
[20:00:17] but nuria, wouldn't the consumer logs indicate a write error, instead of success?
[20:00:22] hm, ja actually yurik is running a lot of jobs, but still i don't think he should be delaying essential jobs
[20:00:53] bob is running a distcp
[20:00:55] dunno what that is
[20:00:58] sup?
[20:01:00] mforns: if it cannot reach db yes, you will see an error, if it is a silent fail on the other end, no,
[20:01:13] each one of my jobs is now running 6 hrs or so (
[20:01:31] mforns: breaking for lunch will be back in a couple hours (mandatory spanish lunch)
[20:01:35] i think they are low mem though, right?
[20:01:43] nuria, :] buen provecho!
[20:01:44] siesta!
[20:02:56] yurik: dunno, they have lots of mappers and reducers, dunno about mem. do you need to run 4 at once?
[20:03:26] 5 at once?
[20:04:46] whoa kevinator_
[20:04:49] what's this?
[20:04:50] https://yarn.wikimedia.org/proxy/application_1424966181866_85641/mapreduce/job/job_1424966181866_85641
[20:12:19] joal, is that bad? my jobs tend to run a very long time, any optimization suggestions are welcome -- https://github.com/wikimedia/analytics-zero-sms/blob/master/scripts/countrycounts.hql
[20:13:36] Analytics-Kanban, Analytics-Visualization: Build Multi-tennant Dashiki (host different layouts) - https://phabricator.wikimedia.org/T88372#1207679 (Milimetric) Agreed, nuria, here are my first thoughts for how to proceed. They're on line 644 in the tasking etherpad, we can iterate there and update here w...
[20:14:02] hm, joal, if you sort by Fair Share, you can see that only a couple of the essential jobs have any share allocated?
[20:14:04] i'm not really sure what that means
[20:14:25] I have seen that as well, and I don't know either
[20:14:26] but, one of them is taking a full steady/instantaneous fair share
[20:14:27] 982940
[20:14:29] whatever that means
[20:15:08] not sure I believe it though, that one is just the oozie launcher
[20:15:08] ?
[20:15:09] ottomata: go ahead and kill my query. I used "tablesample" but it didn't seem to make a difference
[20:15:23] ja, not sure what the original query was but
[20:15:27] https://yarn.wikimedia.org/proxy/application_1424966181866_85641/mapreduce/job/job_1424966181866_85641
[20:15:33] it wants to run >230K mappers
[20:16:55] !?! I didn't realize it decided for that many.
[20:17:20] but yes, kill it
[20:18:22] yurik: Do you really need every webrequest_source ?
[20:18:50] joal, sadly, yes
[20:19:03] need total traffic per country
[20:19:18] i wish we had that info for zero, but that task has stalled with ops :((
[20:19:49] Analytics-Kanban, Analytics-Wikimetrics, Community-Wikimetrics, Patch-For-Review: Story: user wants to be able to re-run a failed report more easily [13 pts] - https://phabricator.wikimedia.org/T88610#1207698 (mforns) @Capt_Swing, We just need to spend some time fixing other systems. When we are don...
[20:19:53] x_analytics containing zero is not enough I guess ...
[20:20:09] yurik: --^
[20:21:40] joal, ?? don't understand
[20:21:59] oh, sorry, was reading wikibug
[20:22:14] yurik: You want traffic per country for the full app, not for zero only, correct ?
[20:22:47] :)
[20:23:44] yes, x_analytics only contains it for the mobile cluster, not desktop nor other
[20:23:56] hmm
[20:23:56] most of our partners whitelist an entire ip range
[20:23:56] so they need to see how different their traffic is compared with country levels
[20:23:56] Yeah, but having a global count wouldn't help either, since you don't know for zero per se
[20:23:56] anyway
[20:23:56] a few comments on your code are arriving
[20:23:56] right - it's not very useful - we generate separate pageview reports for mobile pageviews
[20:23:57] thanks, always good )
[20:24:11] I mean, if it's not very useful, given the computation power it demands, it might be good to consider waiting for the accurate data, what do you think ?
[20:25:18] is_zero is not useful for this report - we need country level data
[20:25:30] but we also need pageviews per zero partner
[20:25:47] and for that we can't get total download size (because traffic is not marked)
[20:26:15] I understand you don't have precise enough data
[20:26:39] But does having an overall global number really help ...
[20:26:43] That's my question :)
[20:28:44] yes - different use case, different graph :)
[20:29:05] they want two graphs - my usage, by all sorts of sub-categories
[20:29:10] and overall per country
[20:29:52] btw, is there an easy way to get query result data now directly from python?
[20:30:06] i use /mnt text files of my table
[20:30:23] but sometimes it would be good to do a quick query on top of it
[20:31:04] reading the file is the easiest
[20:31:15] You can go for pyspark, but it's more complicated
[20:33:55] yurik, can you send a link to your code in gerrit, for me to comment easily ?
[20:34:16] joal, i'm not sure if it's possible to comment on non-patched code
[20:34:24] if it was a patch, then yes
[20:34:47] ah. if it has been pushed without code review, that's correct
[20:35:34] it's a small gerrit repo without anyone else working on it, with tons of scripts ))
[20:35:44] yurik: np
[20:36:58] Question: are you interested in having the sum when uri_host is undefined ?
[20:37:13] It could be filtered out instead of being left as '-'
[20:38:11] yurik: also, using x_analytics_map['https']=1 costs less than having to look for the string
[20:38:20] Same for proxy
[20:39:15] yurik: finally, having the distribute by is bad, you have a lot of reducers, and only one doing the job
[20:39:47] joal, sum for undefined - yes, as it shows how much i am skipping, and aids in explanation
[20:40:16] the way proxy & host are checked - they use identical condition, so they end up together
[20:40:33] https trick - good to know, will fix, didn't know we had that )
[20:40:44] btw, why wasn't is_zero done the same way?
[20:40:53] Same way :)
[20:41:06] is_zero is a separate field, isn't it?
[20:41:18] It is, built on top of x_analytics_map
[20:41:28] i see
[20:41:41] To facilitate access to this data without having to know what is inside each field
[20:41:57] And, to speed up requests via parquet
[20:42:07] as for distribute - i think i used to have it without it, but then it produced tons of text files, each a few kb. This way they collapse
[20:42:34] Yeah, of course --> everything goes to the same reducer
[20:42:53] SET mapred.reduce.tasks=1
[20:42:54] is there a way to reduce them to multiple, and then collapse that?
[20:43:00] The good way to do it
[20:43:10] ok, will fix
[20:43:36] Other thing: I would go for 5 requests, one per webrequest_source
[20:43:49] how so?
[20:44:27] Having 5 requests, one per webrequest_source, then another for final aggregation
[20:44:41] But it' a pain
[20:44:43] i don't want to hardcode the webreq_source constants, in case you add more
[20:45:05] in theory i could autocheck for that, but it is a massive pain )
[20:45:40] Those are the few tricks I can tell
[20:45:51] And in addition, don't run many at once
[20:45:58] Those queries are really big
[20:46:09] ok, what about line 42?
[20:46:25] IF is usually not used in sql
[20:46:32] usual syntax is case when
[20:47:18] SUM(CASE WHEN is_pageview THEN 1 ELSE 0 END) as count
[20:47:42] yes, IF is an HQL artifact. But can it be used directly - count(is_pageview) ?
[20:47:49] nope
[20:48:05] It would count everything, not only the TRUE value
[20:48:05] because if is_pageview was a null, it would not count
[20:48:17] not the null vals
[20:48:25] correct
[20:48:32] so here null !== false
[20:48:37] yeah
[20:48:38] sigh
[20:48:43] NULL means no value
[20:48:54] Here, there is one
[20:49:23] thx
[20:49:26] Oh yeah last thing: Please have 'COUNT(1)' instead of count(*)
[20:50:06] Prevents reading every field (hive is not dumb, but it's better syntax)
[20:51:34] About country code, there are exceptions when it's not two letters ?
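Folding joal's review comments into one place, a sketch of what the hot part of countrycounts.hql looks like afterwards (column names follow the chat; that x_analytics_map values are strings is an assumption). The same SQL works in the hive CLI, where `SET mapred.reduce.tasks=1;` replaces the DISTRIBUTE BY:

```
import org.apache.spark.sql.hive.HiveContext

val hc = new HiveContext(sc) // sc predefined in a spark-shell
val counts = hc.sql("""
  SELECT lower(uri_host) AS host,
         SUM(CASE WHEN is_pageview THEN 1 ELSE 0 END) AS pageviews, -- not COUNT(is_pageview), which also counts FALSE
         COUNT(1) AS requests,                                      -- COUNT(1) instead of COUNT(*)
         SUM(CASE WHEN x_analytics_map['https'] = '1' THEN 1 ELSE 0 END) AS https_requests -- map lookup, no string search
  FROM wmf.webrequest
  WHERE year = 2015 AND month = 4 AND day = 13 -- partition predicates as constants
  GROUP BY lower(uri_host)
""")
```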
[20:51:38] yurik: --^
[20:51:55] i have seen it as "false"
[20:52:03] probably not a string too
[20:52:08] hmm
[20:52:13] Really weirdo
[20:52:19] welcome to my world
[20:52:20] Should be '-' in any case
[20:52:27] '--'
[20:52:45] Another trick would be to do <> '--'
[20:52:52] Cheaper than regexp
[20:53:05] ?? the dash is any char?
[20:53:44] no, saying: you could check for length and being different than empty
[20:54:06] little optim, nevermind
[20:54:19] yeah, better safe here )
[20:54:25] The uri_host regexp is expensive
[20:54:37] arrf
[20:54:39] anyway
[20:54:58] that's it for me tonight
[20:55:06] See you all tomorrow
[20:55:24] Analytics-EventLogging, Analytics-Kanban: Troubleshoot EventLogging missing chunks of data - https://phabricator.wikimedia.org/T96082#1207872 (mforns) NEW a:mforns
[20:55:48] joal|night, yes it is, would be easier if it was sanitized beforehand ;)
[20:55:58] thanks!
[21:04:45] Analytics, Analytics-Cluster: Create new normalized uri_host field in refined webrequest table. - https://phabricator.wikimedia.org/T96044#1207903 (Yurik) I use this to convert, which does not catch all bad hosts ``` if (lower(uri_host) RLIKE '^([a-z0-9-]+\\.)*[a-z]*wik[it][a-z]*\\.[a-z]+(:80)?$'...
[21:15:00] nuria, milimetric, I'm poking around Event Logging for the first task from here (https://phabricator.wikimedia.org/T92985) and wondering how I can set up a dev environment for it
[21:21:49] (PS1) Yurik: Optimized countrycounts.hql [analytics/zero-sms] - https://gerrit.wikimedia.org/r/204153
[21:23:38] madhuvishy: have you set up mediawiki-vagrant ?
[21:23:49] https://www.mediawiki.org/wiki/MediaWiki-Vagrant
[21:23:54] ah no, not on this machine yet
[21:24:27] well, mw-vagrant was built partly to help with EL dev
[21:24:37] aah
[21:24:43] let me set that up
[21:26:34] once you set it up, you enable the eventlogging role and that'll give you a full development environment. https://www.mediawiki.org/wiki/Extension:EventLogging#Developer_setup
[21:31:26] ah cool
[21:32:15] milimetric: new laptop so lots of installations. sadly they gave me 15 inch, i asked for 13, and i'll get another one in a week and have to repeat all this
[21:32:26] (mwahahaha)
[21:32:37] YuviPanda: go away.
[21:33:03] madhuvishy: you should get the new macbook, that thing looks sweet
[21:33:25] * YuviPanda saves the taunting for later
[21:33:30] milimetric: Yeah! I wanted that too. But no 16gb ram even if you upgrade
[21:34:03] :) I have 4GB, but 640K should be enough for anyone
[21:34:52] ha ha I know 16gb is probably not even a requirement. Just seems like a good superpower of sorts to have and boast about
[21:35:23] also YuviPanda will taunt more if I have 8 :P
[21:35:36] tch tch :P
[21:35:40] I was doing android development
[21:35:44] I needed all the RAM in the world
[21:35:53] I suspect if you touch anything Java much 16G is not enough..
[21:36:03] IntelliJ + PyCharm + Vagrant eat up a lot
[21:37:11] ha ha right.
[21:54:41] 16gb of ram?
[21:55:11] can't hear you over the noise my work-from-home development desktop makes with its 64gb of ram, 4TB of HDD space, 1TB of SSD space and six processors
[21:58:06] * YuviPanda shouts louder
[22:01:55] Ironholds: :O
[22:02:20] madhuvishy: it’s ok, he writes code in one of the most inefficient languages known to man.
[22:03:43] C++?
[22:09:20] R
[22:14:10] YuviPanda: I don't know what you mean
[22:14:25] I can write an R package that runs faster than your Python ;)
[22:14:57] sorry, can’t hear you over my headphones :P
[22:49:48] milimetric: There seem to be dependencies for the eventlogging role.
[22:50:22] madhuvishy: you enabled it and then did vagrant provision?
[22:50:25] and you've got errors?
[22:50:27] yeah
[22:50:36] what are the errors
[22:51:47] http://dpaste.com/11ATT1K#wrap
[22:52:05] Skipping because of failed dependencies seems to be the highlight
[22:56:53] milimetric: ^
[22:58:59] madhuvishy: that looks like problems with mediawiki vagrant. This is ... not uncommon
[22:59:19] I'm not sure why, but people smarter than me work on that project and somehow it always has some issue
[22:59:32] milimetric: aah
[23:00:05] i'm not sure how to help you, because it always seems very random, like we uninstall and reinstall vagrant, virtualbox, re-clone everything, delete vms, run setup, and at some point it just works again
[23:00:15] are you on ubuntu or mac?
[23:01:22] milimetric: mac. doing a vagrant reload and trying again
[23:01:40] madhuvishy: nuria is on a mac, she might be able to help.
[23:04:20] madhuvishy: milimetric composer failure
[23:04:22] vagrant ss
[23:04:23] err
[23:04:24] vagrant ssh
[23:04:27] cd /vagrant/mediawiki
[23:04:29] composer install
[23:04:30] wait
[23:04:35] and then come back out and do vagrant provision again
[23:05:35] milimetric: I blame composer, mostly :)
[23:05:49] YuviPanda: milimetric, did not do composer install, but reload fixed it
[23:05:52] everyone who has had vagrant problems, it’s always been composer, at least as far as I saw
[23:05:56] ah nice.
[23:05:56] provision ran now
[23:06:13] yeah, I think it’s just that it takes two runs sometimes, and sometimes not
[23:06:21] and if you run composer install it gives a more useful error message if needed
[23:06:30] it’s all terrible PHP things I try to not be aware of
[23:06:47] I blame an ecosystem that's as sturdy as a Popsicle stick bridge :) (ruby that is)
[23:07:17] :D
[23:11:38] madhuvishy: need help?
[23:12:10] nuria: fixed for now :)
[23:12:26] madhuvishy: ok, lemme know
[23:36:44] Analytics-Kanban, Analytics-Visualization: Build Multi-tennant Dashiki (host different layouts) - https://phabricator.wikimedia.org/T88372#1208348 (Nuria) Updated tasking etherpad.
[23:43:33] milimetric: added comments in tasking etherpad about mutitetenant dashiki
[23:43:38] *multitenant
[23:53:01] Analytics-EventLogging, Analytics-Kanban, VisualEditor, Patch-For-Review: Wikitext events need to be sampled {lion} - https://phabricator.wikimedia.org/T93201#1208447 (Jdforrester-WMF) p:High>Normal
[23:54:00] Analytics-EventLogging, Analytics-Kanban, VisualEditor, WikiEditor, and 4 others: Wikitext events need to be sampled {lion} - https://phabricator.wikimedia.org/T93201#1208459 (Jdforrester-WMF) Open>Resolved a:Milimetric>Krenair
[23:54:04] Analytics-EventLogging, Analytics-Kanban, VisualEditor, WikiEditor, and 4 others: Wikitext events need to be sampled {lion} - https://phabricator.wikimedia.org/T93201#1131831 (Jdforrester-WMF)
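Finally, a quick sanity check of the host whitelist pattern Yurik posted on T96044 earlier in the evening; the regular expression is copied verbatim from that comment, and the sample hosts are just examples:

```
val wmfHost = "^([a-z0-9-]+\\.)*[a-z]*wik[it][a-z]*\\.[a-z]+(:80)?$".r

for (h <- Seq("en.wikipedia.org", "commons.wikimedia.org:80", "boogerballs", "10.64.0.171"))
  println(s"$h -> ${wmfHost.pattern.matcher(h.toLowerCase).matches}")
// en.wikipedia.org -> true, commons.wikimedia.org:80 -> true,
// boogerballs -> false, 10.64.0.171 -> false
```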