[08:45:38] hi halfak
[08:45:47] Hi joal!
[08:45:54] How are you ?
[08:46:12] It's fun to be able to talk with you in the morning :)
[08:47:40] Good! :D Just hanging out in the Lyon airport.
[08:47:48] Waiting for my flight to Copenhagen.
[08:48:11] Starting tomorrow morning, I'll be a real, live coworker again
[08:48:18] *AND* in the same timezone as you.
[08:48:20] :)
[08:51:51] Cool !
[08:52:09] for how long in Copenhagen ?
[08:57:15] joal, I'll be in full-time-work-mode until Friday afternoon. Then I have a workshop over the weekend. I fly back to the states on Wednesday, June 3rd.
[09:08:18] Woops. Airport wifi tanked a few minutes ago.
[09:32:54] Analytics-Tech-community-metrics, ECT-May-2015: Provide list of oldest open Gerrit changesets without code review - https://phabricator.wikimedia.org/T94035#1153222 (Dicortazar) Ok, I think this is fixed. Closing the task.
[09:33:02] Analytics-Tech-community-metrics, ECT-May-2015: Provide list of oldest open Gerrit changesets without code review - https://phabricator.wikimedia.org/T94035#1312097 (Dicortazar) Open>Resolved
[09:33:19] Analytics-Tech-community-metrics, ECT-May-2015: Provide list of open Gerrit changesets with most activity which aren't -1/-2'ed - https://phabricator.wikimedia.org/T94036#1153232 (Dicortazar) I think this is fixed. Closing the task.
[09:33:26] Analytics-Tech-community-metrics, ECT-May-2015: Provide list of open Gerrit changesets with most activity which aren't -1/-2'ed - https://phabricator.wikimedia.org/T94036#1312099 (Dicortazar) Open>Resolved
[09:59:17] Analytics-Tech-community-metrics, ECT-May-2015: Community Metrics for IRC channels not updated since 09/2013 - https://phabricator.wikimedia.org/T96371#1312109 (Dicortazar) I've been checking what's going on here. From the initial list of IRC channels whose logs were downloaded and analyzed, I'm missing...
[10:31:40] (CR) Ricordisamoa: "It has been updated, thank you."
[analytics/quarry/web] - https://gerrit.wikimedia.org/r/211294 (owner: Ricordisamoa)
[11:04:49] Analytics-Cluster, Analytics-Kanban: Compute pageviews aggregates daily and monthly from April {wren} - https://phabricator.wikimedia.org/T96067#1312187 (MeganHernandez_WMF) Checking back on my question about how to access this info. Any update? Really wonderful that this is happening. I'm very interes...
[12:19:26] Analytics-Cluster, Analytics-Kanban: Compute pageviews aggregates daily and monthly from April {wren} - https://phabricator.wikimedia.org/T96067#1312352 (JAllemandou) @MeganHernandez_WMF This data is currently being generated in a not yet productionized fashion. Data from April 1st to May 25th is accessi...
[12:25:51] Hi folks, there is some lag in cluster computing refined webrequests
[12:26:21] There are 2 nodes out and 6 unhealthy, making our computatui
[12:26:30] computation power limited
[12:27:21] The balancer is currently running, but if you could postpone jobs for the cluster to catch up, that would be great
[13:16:42] morning!
[13:19:33] Hi milimetric :)
[13:19:40] Had a good weekend ?
[13:19:42] hey joal :)
[13:19:47] yes, but the closet's still not done
[13:19:54] we ended up doing two other major projects
[13:19:59] Arf ... Started
[13:20:01] ?
[13:20:20] Prioritization is always at the core of everything :)
[13:20:23] hehe
[13:20:27] those others were a priority!
[13:20:53] huhu
[13:21:03] Do you have a few minutes for me ?
[13:21:21] Would like some feedback on what I missed last Friday
[13:22:01] For those interested: http://twitter.github.io/effectivescala/
[13:25:09] joal: just give me another minute, getting through my email
[13:25:15] no problemo :)
[13:28:32] joal: ok, batcave?
[13:28:36] OMW !
[13:34:35] MOorniiing
[13:35:33] Hi ottomata :)
[13:35:51] ottomata: cluster issues today :(
[13:35:56] unhealthy nodes
[13:36:14] yeah saw your email
[13:36:15] looking
[13:36:20] balancer is running...but seems stuck?
[13:36:23] or really slow
[13:36:27] hm
[13:36:52] gonna kill and restart it
[13:36:55] k
[13:37:53] joal: what is annoying about this, is that we aren't actually that low on space
[13:37:57] it's just that certain nodes are low on spae
[13:37:58] space
[13:38:03] so they don't get assigned work
[13:38:18] guess that is what we get for having a heterogeneous cluster :(
[13:38:27] hmm
[13:39:09] we have 166.42 TB total free
[13:42:01] joal: i'm going to increase this: yarn.nodemanager.disk-health-checker.max-disk-utilization-per-disk-percentage
[13:42:12] how much currently ?
[13:42:15] 80% ?
[13:42:22] it is at 90
[13:42:30] It's not bad already !
[13:43:54] ?
[13:44:00] fairly high !
[13:44:09] well, they are being marked as unhealthy with 10% space left
[13:44:14] that's 2 TB!
[13:44:20] yeah true
[13:44:40] 99% would give them > 200G free
[13:45:25] ok
[13:45:25] and > 400G on the newer nodes
[13:45:28] gonna set it at that
[13:54:25] ok phew, joal applied that and restarted nodemanagers on the unhealthy nodes
[13:54:36] gonna move to a cafe, back shortly :)
[13:55:14] Awesome :)
[13:55:16] Thx ottomata
[13:59:34] hmmmMM
[14:00:00] now some are alarming about actual physical disks being full..>.>.>>.>>.>>>..>... DUHH because 200GB is not much / 12
[14:00:01] disks
[14:00:20] crap crackers
[14:00:43] HMmm, it's ok
[14:00:51] they will still have 18G free on each disk
[14:00:56] which is fine i think
[14:00:59] we should use the space
[14:01:00] hm.
[14:01:04] we just need to adjust alarms
[14:54:39] Analytics-Tech-community-metrics, ECT-May-2015: Community Metrics for IRC channels not updated since 09/2013 - https://phabricator.wikimedia.org/T96371#1312646 (Dicortazar) IRC metrics are now updated. I'll work on its automation. I'll close this ticket, but let me know extra channels you may be interested.
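The disk-space arithmetic from the 13:42 to 14:01 exchange can be sketched roughly as follows. YARN's `yarn.nodemanager.disk-health-checker.max-disk-utilization-per-disk-percentage` applies per disk, so the free space left when a disk gets marked bad is the disk capacity times the remaining percentage. The 12 disks per node and the thresholds of 90 and 99 come from the log; the 1800 GB per-disk capacity is only an assumption, back-computed from the "18G free on each disk" remark, and the function name is made up for illustration.

```python
# Minimal sketch of the free-space arithmetic discussed above.
# ASSUMED_DISK_GB is an assumption inferred from "18G free on each disk"
# at a 99% threshold; it is not stated in the log.

DISKS_PER_NODE = 12      # from the log ("200GB is not much / 12 disks")
ASSUMED_DISK_GB = 1800   # assumption, see lead-in


def free_gb_per_disk(disk_capacity_gb, max_utilization_pct):
    """Free space left on one disk when it reaches the utilization threshold."""
    return disk_capacity_gb * (100 - max_utilization_pct) / 100


for pct in (90, 99):
    per_disk = free_gb_per_disk(ASSUMED_DISK_GB, pct)
    per_node = per_disk * DISKS_PER_NODE
    print(f"threshold {pct}%: {per_disk:.0f} GB free per disk, "
          f"{per_node:.0f} GB free per node")
```

Under that assumption, a 90% threshold leaves about 2 TB unused per node, while 99% leaves only about 18 GB per disk, which would explain why the per-disk alarms started firing after the change.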
[14:55:15] Analytics-Tech-community-metrics, ECT-May-2015: Ensure that most basic Community Metrics are in place and how they are presented - https://phabricator.wikimedia.org/T94578#1312649 (Dicortazar)
[14:55:16] Analytics-Tech-community-metrics, ECT-May-2015: Community Metrics for IRC channels not updated since 09/2013 - https://phabricator.wikimedia.org/T96371#1312648 (Dicortazar) Open>Resolved
[14:57:00] Analytics-Tech-community-metrics: Tech metrics missing IRC channels - https://phabricator.wikimedia.org/T56230#1312651 (Dicortazar) Once {T96371} is closed, we may want to work on this task again. We can discuss this during this Friday conf call and leave it as June task.
[14:58:08] ottomata: Do we have a beta-labs cluster ?
[14:59:08] Analytics-Tech-community-metrics, ECT-May-2015: Maniphest backend for Metrics Grimoire - https://phabricator.wikimedia.org/T96238#1312653 (Dicortazar) An update about this. We kept working on the following steps till we get a better approach to deal with the retrieval process. Thus, we only have data goi...
[14:59:41] joal: a hadoop cluster?
[14:59:42] no
[14:59:44] i haven't set one up anyway
[14:59:52] ok
[14:59:56] kafka is running in deployment project
[15:00:00] which i think is beta labs
[15:00:05] yup
[15:00:06] but i don't maintain it much
[15:00:09] k
[15:00:36] Wanted to test camus on eventlogging data, therefore beta-labs was easier
[15:01:37] aye, i mean, you could set one up pretty easily i think
[15:01:46] or, you can test in prod pretty easily,
[15:01:51] you can just change import paths in camus, ja know?
[15:01:52] Right
[15:02:10] With resource bottleneck, I would have preferred not to test in prod ;)
[15:17:58] aye
[15:18:03] because of delay?
[15:18:09] eventlogging data is so small relatively though
[15:18:10] no biggy
[15:21:59] yeah, you're right
[15:37:48] https://gist.github.com/ottomata/f91ea76cece97444e269
[15:40:46] Analytics-EventLogging, Analytics-Kanban, Traffic, operations: EventLogging query strings are truncated to 1014 bytes by ?(varnishncsa? or udp packet size?) - https://phabricator.wikimedia.org/T91347#1312720 (BBlack) We probably want to bump up the shm_workspace as well (which already occasionally...
[15:47:50] ottomata: https://issues.apache.org/jira/browse/SPARK-6904
[15:53:03] ahh, yeah joal, that's a duplicate of the one that was referred to in the response to my email
[15:53:03] https://issues.apache.org/jira/browse/SPARK-6910
[15:54:03] Target is Spark 1.5. In Spark 1.3 a workaround is to do something like: sqlContext.table("tableName").registerTempTable(...) which caches the list of partitions in memory on the driver. The initial pull is expensive but it is much faster after that.
[15:54:06] will try that, but that still sounds bad
[16:01:15] ottomata, so are there meant to be a kazillion production hadoop jobs running?
[16:01:28] I'm adding <1S of processing time every 1S for a legal thing
[16:06:46] hang on Ironholds, meeting
[16:07:21] kk
[16:15:26] Ironholds: there are 28 running
[16:15:28] which isn't really that many
[16:16:17] okay. And yet, <1S of processing time for 1S of time ;p
[16:16:34] in fact, <1S of processing time a minute
[16:17:11] not sure what you are saying
[16:19:07] ottomata, I am running a job for the lawyers, and for every minute of real time that passes, the process is spending between 0 and 5 seconds actually computing on the data, according to the reporter.
[16:20:37] ottomata: tried it already --> no way to load everything, fails after 20 mins, even with a lot of driver memory
[16:21:07] hm, aye k
[16:21:46] ottomata: let me know when ready for ops news ;)
[16:22:14] ?
[16:22:37] my meeting is finished, and I'd love to have feedback on Impala :)
[16:23:16] OH
[16:23:25] i haven't tried it since we made those mem changes
[16:23:29] k
[16:23:34] focused on those spark issues and reqstats on friday
[16:23:43] yup ok
[16:24:18] About that spark stuff: I am going to continue to use raw parquet files, and update when hive context works
[16:24:44] yes.
[16:24:47] i think that is the way to go joal :/
[17:27:44] joal: i'm going to delete some old wmf_raw data again
[17:27:57] ok
[17:28:01] :(
[17:29:29] And in fact: I don't really mind ;)
[17:35:47] Analytics-Cluster, operations, procurement: Hadoop worker node procurement - 2015 - https://phabricator.wikimedia.org/T100442#1312883 (Ottomata) NEW
[17:37:19] yeah
[18:12:43] (running home from this cafe, back on shortly)
[18:34:38] Analytics-Cluster, Analytics-Kanban: Ooziefy and parquetize pageview intermediate aggregation using refined table fields [13 pts] {wren} - https://phabricator.wikimedia.org/T99931#1312937 (kevinator) p:Triage>Normal
[18:49:48] joal: impala looking good so far
[18:49:50] for simple test
[18:49:55] NIIIIIIICE :)
[18:50:06] I wanna TEST :)
[18:52:16] Analytics-Cluster, Analytics-Kanban: Create new normalized uri_host field in refined webrequest table {hawk} [13 pts] - https://phabricator.wikimedia.org/T96044#1312973 (kevinator) p:Triage>Normal
[18:55:40] Analytics-EventLogging, Analytics-Kanban: Event Logging doesn't seem to handle unicode strings - https://phabricator.wikimedia.org/T99572#1312988 (kevinator) We could write a unit test to validate this is true/false. Is it at the processor or consumer level? Need to try to replicate error and fix if it e...
[18:58:05] Analytics-EventLogging, Analytics-Kanban: Event Logging doesn't seem to handle unicode strings {oryx} [8 pts] - https://phabricator.wikimedia.org/T99572#1312991 (kevinator)
[19:01:33] Analytics-Cluster, Analytics-Kanban: Create new normalized uri_host field in refined webrequest table {hawk} [13 pts] - https://phabricator.wikimedia.org/T96044#1313004 (JAllemandou) So, I think we could add a field lowering the host, removing the port if any, and splitting the domain into a map like that...
[19:03:53] Analytics-Cluster, Analytics-Kanban: Create new normalized uri_host field in refined webrequest table {hawk} [13 pts] - https://phabricator.wikimedia.org/T96044#1313005 (JAllemandou) So, I think we could add a field lowering the host, removing the port if any, and splitting the domain into a map like that...
[19:04:08] Analytics, MediaWiki-extensions-ContentTranslation: Limn language dashboard: eswiki graph is wrong/stuck - https://phabricator.wikimedia.org/T99074#1313006 (Milimetric) Kartik, I ran that query in the database a few times and that's the number I get. So the replication may be messed up for that single t...
[19:05:12] joal: https://gerrit.wikimedia.org/r/#/c/169346/1/refinery-core/src/main/java/org/wikimedia/analytics/refinery/core/IdentifyMediaFileUrl.java
[19:05:26] that was the media file normalization that I pinged you on Phabricator about, in case you didn't see it
[19:05:34] (the one I was talking about this morning)
[19:05:55] I think you're talking about the "PercentDecoder" bit of it :)
[19:06:02] I looked into that this morning ;)
[19:07:07] milimetric: --^
[19:09:02] joal: right, along with the trimming and maybe other stuff, I didn't look too closely
[19:09:04] good that you looked it over
[19:09:13] thx for that :)
[19:19:48] milimetric: Is there a reason to decode percent this way instead of classic URL decoding ?
[19:29:09] See you tomorrow guys, I'm going for dinner !
[19:37:57] llaters
[20:07:25] Analytics-Cluster, Analytics-Kanban: Create new normalized uri_host field in refined webrequest table {hawk} [13 pts] - https://phabricator.wikimedia.org/T96044#1313125 (Ottomata) That would make the implementation much easier, as we wouldn't have to code in WMF project/subdomain/mobile/zero/whateverelse...
[20:31:18] Analytics-Cluster, Analytics-Kanban: Create new normalized uri_host field in refined webrequest table {hawk} [13 pts] - https://phabricator.wikimedia.org/T96044#1313170 (Yurik) Lower and 80 is good, but i think we should really solve the common problem here of unusual domains - let's only allow something...
[21:18:33] Analytics-Cluster, operations, procurement: Hadoop worker node procurement - 2015 - https://phabricator.wikimedia.org/T100442#1313237 (Dzahn) p:Triage>Normal
[21:21:31] Analytics-Cluster, operations: Kafka Broker disk usage is imbalanced - https://phabricator.wikimedia.org/T99105#1313252 (Dzahn) p:Triage>Normal
[21:51:30] Analytics-Engineering, operations: Honor DNT header for access logs & varnish logs - https://phabricator.wikimedia.org/T98831#1313270 (Dzahn) p:Triage>Normal
[22:24:11] Analytics-Cluster, Analytics-Kanban: Create new normalized uri_host field in refined webrequest table {hawk} [13 pts] - https://phabricator.wikimedia.org/T96044#1313322 (madhuvishy) +1 to what @Yurik said.
[23:16:44] Analytics-EventLogging, Analytics-Kanban: Code to write a new Camus consumer and store the data in two Hive tables [21 pts] {oryx} - https://phabricator.wikimedia.org/T98784#1313411 (kevinator)
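The uri_host normalization floated in the T96044 comments (lowercase the host, drop the port if there is one, split the domain into parts) might look roughly like the sketch below. It is only an illustration in Python, not the refinery code: the function names are hypothetical, and because the comment describing the target map is truncated in the log, the split here returns a plain list of labels and does not attempt the map layout.

```python
# Hypothetical sketch of the normalization idea from the T96044 comments.
# Function names and the list-of-labels output are assumptions; the map
# layout proposed in the ticket is cut off in the log and not reproduced.


def normalize_uri_host(uri_host):
    """Lowercase the host and strip a trailing numeric port, if any."""
    host = uri_host.strip().lower()
    name, sep, port = host.rpartition(":")
    if sep and port.isdigit():
        host = name
    return host


def split_labels(host):
    """Split a normalized host into its dot-separated labels."""
    return [label for label in host.split(".") if label]


normalized = normalize_uri_host("EN.Wikipedia.org:80")
print(normalized, split_labels(normalized))
```

Which labels count as project, subdomain, or mobile/zero qualifier is the open question in the ticket, so this sketch deliberately stops at the generic split.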