[00:14:22] kevinator: phabricator, a task tagged "Analytics", but you know we have like 200 tasks in our backlog right? :) [00:20:52] Only 200? Jealous. [00:33:05] milimetric: yeah, I know... Good news: I don't have any new tasks to add. I just thought it may good to add to the wiki page how to request-a-feature ( https://wikitech.wikimedia.org/wiki/Analytics/PageviewAPI ) [00:46:39] Analytics-Kanban: Investigate adding piwik to transparency report - https://phabricator.wikimedia.org/T125175#1980488 (Nuria) NEW [00:47:33] Analytics-Kanban: Investigate adding piwik to transparency report - https://phabricator.wikimedia.org/T125175#1980497 (Nuria) [01:31:12] (PS4) Madhuvishy: Development environment for wikimetrics using docker [analytics/wikimetrics] - https://gerrit.wikimedia.org/r/267172 (https://phabricator.wikimedia.org/T123749) [03:01:01] (PS11) Alex Monk: [WIP] Database selection [analytics/quarry/web] - https://gerrit.wikimedia.org/r/266925 (https://phabricator.wikimedia.org/T76466) [03:37:12] (PS5) Madhuvishy: Development environment for wikimetrics using docker [analytics/wikimetrics] - https://gerrit.wikimedia.org/r/267172 (https://phabricator.wikimedia.org/T123749) [06:29:05] Analytics-Kanban, Services, RESTBase-API: RESTBase pageview data not updated - https://phabricator.wikimedia.org/T125048#1980936 (Alexdruk) At 6:30 UTC the same problem for https://wikimedia.org/api/rest_v1/metrics/pageviews/top/en.wikipedia/all-access/2016/01/28 [08:49:39] Analytics-Kanban: analytics specific icinga alerts should ping in our IRC channel. - https://phabricator.wikimedia.org/T125128#1981001 (elukey) a:elukey [10:08:06] Analytics: wmit-* account creation campaigns totals - https://phabricator.wikimedia.org/T123059#1981068 (FedericoLeva-WMIT) >>! In T123059#1978846, @Milimetric wrote: > We don't keep the raw data that far back. If you'd like to do this sort of query, you'd have to have access to the cluster and be ready to... 
[10:10:38] Analytics-Tech-community-metrics, DevRel-February-2016: Key performance indicator: Top contributors: Find good Ranking algorithm fix bugs on page - https://phabricator.wikimedia.org/T64221#1981074 (Qgil) This KPI is requiring a lot more hours than I expected, when I proposed a couple of years ago or so. W... [10:12:49] Analytics-Tech-community-metrics, DevRel-January-2016: Improve Key performance indicator: code contributors new / gone - https://phabricator.wikimedia.org/T63563#1981079 (Qgil) If the current page kind of works for your purposes, then maybe we can just resolve this task? I don't see the benefit to keep it... [10:21:08] joal: just seen the 1002 email.. ouch [10:29:05] indeed [10:29:09] I'm on it [10:29:20] it's a mess [10:37:53] let me know if I can help.. still catching up from yesterday [10:37:54] ? [10:44:43] np elukey [10:46:33] /o\ [11:04:05] elukey: problem identified, but I don't know how to solve :( [11:05:40] I saw the errors in webrequest.log, is it still camus banging its head against the wall? [11:06:12] kind of, it's having problems with one partition [11:08:31] wow elukey, didn't notice the end of the file [11:10:30] ?? [11:10:35] elukey: can you please stop puppet and camus ? [11:10:49] yes sir [11:10:55] only webreq?
[11:11:17] I haven't checked the others, let me see [11:11:55] elukey: only webrequest, yes :) [11:13:46] done :) [11:13:51] Thanks :) [11:13:55] Same as yesterday [11:14:16] But with something else in the middle (issues to import some text and upload partitions) [11:15:08] I hope to be able to do these things in the future :) [11:15:18] You will, for sure :) [11:17:45] elukey: I have some confirmation of stuff: when kafka1012 went back online yesterday, it had corrupted data [11:18:05] since that moment, one partition import through camus failed [11:18:25] (one text for sure, I suspect the same for upload) [11:18:58] this morning I cut a ticket for SMART errors on 1012 for sdf [11:19:11] ticket == phab task [11:19:24] sdf ? [11:19:51] in France, sdf means homeless person, so I guess that's not what you are about :) [11:20:51] ahhahah nono /dev/sdf [11:21:11] not sure if the drive was homeless before being inserted in the rack [11:29:30] Ah, so that means yesterday's issue on kafka1012 was due to a disk being homeless :) [11:29:33] I get it [11:30:46] so yesterday's issue seemed to be related to weirdness in fstab, and failure to mount partitions.. Andrew fixed it manuall [11:30:49] manually [11:30:58] the disk issue is something new from this morning :D [11:30:59] ok elukey [11:31:13] We should have monitored better on camus :( [11:33:56] hi, I'd like to reboot bohrium (the piwik server) for the kernel update, ok to go ahead? afaict it's used for analysing the ios app, but 2-3 mins of missing data are probably ok? [11:38:09] moritzm: good question! Not really sure.. joal?
[11:38:21] moritzm: I think it's ok (even if since I don't manage the piwik thing :) [11:38:32] I just hope it'll come back alive :) [11:39:29] ok, I'll reboot in an hour or so (and give a brief headsup here before) [11:39:41] thanks moritzm [11:55:11] elukey: kafka1012 has corrupted data and makes camus fail [11:55:14] I confirm that [11:55:20] I don't know what we should do [11:55:49] We could ensure kafka1012 is not leader on any partition [11:56:35] joal: theoretically I can stop it, it is not in the mediawiki config and it shouldn't break the world again [11:56:49] elukey: would it break EL ? [11:56:49] joal, yt? [11:56:55] hi mforns [11:56:56] yes you are :] [11:56:58] I'm here yes [11:57:00] :) [11:57:06] joal: probably we'll need to kick EL again [11:57:15] until the pykafka bug is solved [11:57:26] hm, not sure what kicking means [11:57:33] restarting it :D [11:57:52] EL seems fine right now, sorry if I lack context [11:58:15] mforns: camus broken since yesterday because of corrupted data in kafka1012 [11:58:39] So I'm looking for a way to remove kafka1012 as leader in kafka quorum [11:58:45] I see [11:59:04] elukey says we can take it down, and I wonder if that wouldn't break EL again as it did yesterday [12:00:02] having a partition corrupted, even if not leader, might not be great [12:00:21] so stopping the service while waiting for the disk replacement might be better [12:00:28] but I need to check with ops [12:00:32] joal, elukey, EL may get some errors, but I think it will be fine [12:00:47] mforns: yeah but shouldn't a restart be enough? [12:01:00] hm mforns, yesterday, when 1012 was down, EL was stuck because it couldn't write, right ? [12:01:33] joal, yes, but we didn't lose any data, for some reason, the forwarder could write normally [12:02:11] makes sense --> we received data from client, and the forwarder worked, so stuck but no data loss [12:02:14] makes sense [12:03:01] which does not ensure that will work today...
but if we need to stop kafka1012, we have to do this at some point anyway, so... [12:03:09] elukey: if ops agrees, I'd go for taking 1012 down [12:03:10] we might lose some server-side events [12:03:20] mforns: you ok with that ? [12:03:36] mforns: we could manually change EL forwarder config to remove 1012 in the writing list [12:03:39] I'll ask them [12:04:09] Cause for the moment data is not imported correctly on hadoop, and we are later and later as time passes [12:04:13] joal, are we planning to remove 1012 from all clients? [12:04:20] aha [12:04:38] mforns: I don't think so, if it gets back in shape, it's fine to have it [12:04:52] The concern is now and as long as it's not fixed [12:05:15] joal, but how long will EL try to connect to it? [12:05:52] mforns: If we want EL not to be stuck, better to remove kafka1012 globally from the broker list [12:06:10] Doing it on puppet, then merge from ops [12:06:11] yes, because, how long are we planning to take it down? [12:06:27] I don't know ! [12:06:29] yes, makes sense [12:06:54] Ok mforns, let's go for that : remove kafka1012 from EL broker's list [12:07:04] cool [12:07:15] in puppet, then take the machine down, and hopefully camus'll restart :S [12:07:41] joal: what brokers list? [12:07:45] ah EL [12:07:50] didn't see it [12:08:11] joal, elukey, batcave? [12:08:15] but shouldn't be enough to just stop it?
Then after EL will get the correct metadata we'll be fine [12:08:16] sure [12:08:18] sure [12:39:23] joal, mforns: https://github.com/wikimedia/operations-puppet/blob/81b610d0fe43dc051c23a8e6ec78b8cc6af5c050/modules/role/manifests/kafka/analytics/broker.pp#L30 [12:40:06] I can theoretically remove on disk only on kafka1012 but I'd need to disable puppet [12:40:16] and then restart kafka [12:40:19] a bit nasty [12:40:27] hm elukey [12:41:04] the cleanest solution is to bring it down [12:41:11] re-checking with ops [12:42:45] elukey: I agree about taking it down first: Before thinking of fixing the machine itself, I'd prefer to have the systems dependent on kafka catching up :) [12:48:51] yep definitely [12:48:59] elukey, is there a task I can link the EL puppet change to? [12:51:33] mforns: there is https://phabricator.wikimedia.org/T125199 [12:51:41] elukey, thanks! [12:51:49] ah forgot to [12:52:01] ? [12:52:20] !log analytics1027 - disabled puppet and camus for webrequest log [12:53:23] elukey: the logbot isn't active here, you need to log that in -operations [12:53:41] wwwhhhhaaaatt?? [12:53:48] I thought it was active O_O [12:53:54] sigh [12:54:06] all right I am going to lunch, back in 30mins [12:54:53] untrue moritzm : https://tools.wmflabs.org/sal/analytics [12:55:01] elukey: you can log in both :) [12:55:32] ah, sorry for the confusion. I thought that was meant to go into https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:56:01] moritzm: We should probably log those infos in both, but it's still useful for us to keep track of our systems :) [12:56:16] yeah, indeed [12:57:53] I logged in both! [12:58:10] thx elukey :) [13:00:43] what joal???? the log isn't active here?? [13:01:03] ah! ok, sorry lazy reader [13:01:07] :P) [13:02:19] joal, we should create a task for handling the kafka connection exceptions in EL code so that it rotates hosts, no? [13:02:31] mforns: you're right ! 
[13:02:38] this will increase the probabilities of our hack getting merged [13:02:50] ok I'll do [13:03:31] mforns: I think we shouldn't cheat, and say that the code you are asking to merge is hack, and relate to a proper task as you say :) [13:04:10] But I don't know how to word it though: Ensure pykafka doesn't fail to write when the first instance of the writing quorum is dead ? [13:04:15] yes, I've done that, but I think if I mention the proper long term fix, this will be better no? nobody wants to merge temporary code no? [13:04:25] hehe [13:04:26] I agree :) [13:05:00] Handle pykafka connection errors gracefully? [13:05:02] Or more something like: Ensure pykafka actually works as a library interacting with distributed system ;) [13:05:17] I like yours better mforns :) [13:05:22] ok [13:05:24] Less argumentative :) [13:18:13] Analytics: Handle EventLogging's pykafka connection errors gracefully {oryx} - https://phabricator.wikimedia.org/T125207#1981370 (mforns) NEW [13:40:30] elukey: I've installed the openjdk-7 security on the kafka* hosts (so they'll pick up with the next reboots) [13:44:21] (PS1) Mforns: Disable reports until database is able to respond queries [analytics/limn-edit-data] - https://gerrit.wikimedia.org/r/267245 (https://phabricator.wikimedia.org/T124383) [13:45:27] Analytics-Kanban, Editing-Analysis, Patch-For-Review: Queries for the edit analysis dashboard failing since December 2015 [5 pts] - https://phabricator.wikimedia.org/T124383#1981440 (mforns) If everybody agrees with disabling reports until the database can respond to the queries, there is the change. [13:45:38] good morning! WHAAaa more broken stuff!? [13:51:55] elukey: joal, just read backlog, i think yall did the right things, nice! mking bfast, coffee... :) [14:07:25] ottomata: o/ [14:09:01] all right a-team, it seems that we have the green light from the ops team to try the 1012 stop [14:09:09] whenever you are ready, I'll try it [14:09:47] moritzm: all right!
[14:11:25] I'll reboot bohrium in about 10 mins [14:12:53] ready elukey :) [14:13:22] all right, let me inform the ops team [14:13:38] mforns_lunch: are we ready on EL :) [14:14:34] ? [14:17:56] ops team alerted, standby [14:18:05] I'll wait mforns :) [14:18:38] makes sense [14:18:45] ottomata: so far, nothing soved yet [14:18:54] solved [14:19:21] oh thought yall already did that [14:19:22] k [14:19:26] elukey: i'll check on EL [14:20:03] ready [14:20:08] ottomata: we planned, agreed on the solution but didn't act (waiting for merging on EL and for ops green light on taking k1012 down) [14:20:31] oh merging? [14:20:59] hack the removal of k1012 in the writers lists to prevent yesterday's issue [14:21:11] oh [14:21:14] hmmm [14:21:19] mforns_lunch: was to code that one [14:21:20] that bug is in pykafka too? [14:21:21] ? [14:21:27] not sure why that would matter for pykafka [14:21:33] it only uses that on startup [14:21:35] ottomata: batcave for a minute ? [14:21:37] sure [14:26:52] clarification from the ops channel (something that wasn't clear for me): stopping kafka on one host without rebooting/shutdown *shouldn't* trigger the bug in hhvm since the clients would get a connection refused. [14:27:04] the problem was with the timeout [14:27:17] since the host was down and the socket connection didn't fail fast [14:30:39] oh weird [14:31:36] elukey: let's killllllll it ! [14:31:49] i'm watching EL logs and am ready to kick it [14:32:04] all right [14:35:06] elukey: let us know when [14:35:14] bohrium/piwik rebooted, I don't have any credentials for piwik.wikimedia.org, so if anyone can check whether it's all well, that would be nice [14:38:22] kafka1012 stopped :) [14:38:33] !log stopped kafka on kafka1012 [14:39:41] !log launching manual runs of camus to try to fix state [14:40:04] ok el looks fine [14:40:06] i didn't restart it [14:40:49] elukey: do we have someone in eqiad able to swap the bad disk?
[14:40:59] this is a good question, not sure [14:41:11] ok so no ticket about the disk yet, right? [14:41:18] ah no I cut one [14:41:22] ok [14:41:24] https://phabricator.wikimedia.org/T125199 [14:41:29] elukey: just checking, did you stop puppet on kafka1012? [14:42:02] ah no good point [14:42:22] done [14:42:42] Analytics-Kanban, DBA, Patch-For-Review: Pending maintenance on the eventlogging databases (db1046, db1047, dbstore1002, other dbstores) - https://phabricator.wikimedia.org/T120187#1981555 (jcrespo) All tables except Edit should be synced now. Edit is still pending. [14:44:08] Analytics: Handle EventLogging's pykafka connection errors gracefully {oryx} - https://phabricator.wikimedia.org/T125207#1981556 (Ottomata) @elukey just stopped kafka on kafka1012. I watched logs for all eventlogging processors and the eventlogging mysql consumer. As far as I can tell, everything is fin... [14:45:56] !log disabled icinga on kafka1012 until Feb 07 [14:55:46] elukey: successful camus run :) [14:56:00] elukey: can you restart cron and puppet on analytics1027 please ? [14:56:59] joal: sure thing! \o/ [14:59:55] !log analytics1027 - puppet re-enabled, camus restarted [15:01:13] thx elukey [15:15:10] joal: things looking ok? [15:15:34] waiting to see how stuff catch up [15:15:42] aye k [15:15:44] ottomata: mobile is good, but others ... [15:15:47] ja [15:16:21] hehe, hey a-team [15:16:26] it's time for choose your own adventure [15:16:43] choice A: https://phabricator.wikimedia.org/T96331 [15:16:43] choice B: https://phabricator.wikimedia.org/T125141 [15:16:44] ? [15:17:00] hehe :) [15:17:16] i'm starting to think about cdh5.5 upgrade, and i think i should do this first [15:17:31] i'd prefer choice B :) [15:19:21] ottomata: going for druid for fast aggregated data, I agree :) [15:20:58] very ignorant mode on: what is the main gain that we could have with Impala?
[15:25:23] impala is a 'fast querying tool' developed by cloudera to do faster hive on hadoop [15:26:18] joal: yep yep I mean for us [15:26:33] makes sense, faster queries for our users [15:26:41] elukey: we were going to try it for things like pageview data [15:26:45] could it be a good candidate for a post-mysql era? (if any) [15:26:46] for internal use [15:26:48] sorry elukey, lazy reader :) [15:26:52] or maybe use it to feed cassandra for the api [15:26:59] elukey: true! [15:27:09] ottomata: use it instead of hive everywhere [15:27:18] esp for el data it could be good, who knows [15:27:45] ottomata: but we still haven't figured out if resource management works well with yarn [15:28:44] yeah [15:28:49] i'm inclined to just uninstall it [15:28:49] for now [15:28:55] just to make the CDH upgrade simpler [15:29:00] and we can reinstall later if we want [15:29:43] ottomata: works for me :) [15:30:00] ottomata: yep if it is not an immediate need I would vote for removing it, but it might re-enter in the game if we decide to propose an alternative for mysql in the next X months.. does it sound reasonable? [15:30:08] KISS on a friday, I can't say no! [15:31:27] moritzm: rebooting piwik should be fine anytime, it's like ... a tier-20 service [15:31:41] aye [15:31:44] i won't get rid of the puppetization stuff [15:31:46] * elukey is imagining joal dancing in his room [15:31:53] just add comments that it hasn't been tested with 5.5 [15:32:09] ottomata: APPROVED [15:32:31] elukey: Next offsite, YOU'LL SEE THAT !
[15:32:38] :D [15:32:56] hahahaah [15:32:56] ottomata: I don't know what happenned to EL yesterday :( [15:33:05] looking forward for the next offsite [15:33:06] Seems to have caught up easy this time [15:35:08] yeah, dunno either [15:35:11] maybe its a similar bug [15:35:13] we just stopped kafka [15:35:14] not the whole box [15:35:19] yesterday the box was down [15:35:29] i haven't tested the kafka libs with stopping the box [15:35:31] just stopping kafka [15:39:54] ottomata: this is a good point.. Could it be a timeout issue like the hhvm one? [15:40:00] could be ja [15:40:09] hey a-team, I was having lunch, I read the scrollback, I guess operations still didn't get to the EL puppet patch... [15:40:15] but it seems EL is fine, no? [15:41:02] yeah, mforns, Maybe there is an issue when the box goes down [15:41:08] but i don't think there's a problem with broker restarts [15:41:08] well, processors and consumers are failing [15:41:16] now they are?! [15:41:18] ? [15:41:23] mmm, if you look at grafana, you'll see [15:41:35] i havent seen the logs yet, but I bet they have errors [15:42:34] ok, let's figure this out [15:42:37] byw ottomata, this is the patch i pushed to avoid this: https://gerrit.wikimedia.org/r/#/c/267245/ [15:42:39] because something is fishy [15:42:54] looking for a phab ticket, i wrote down the logs i saw... 
[15:43:25] mforns: https://phabricator.wikimedia.org/T125207#1981556 [15:43:33] sorry bad patch [15:43:51] ottomata, this is the one: https://gerrit.wikimedia.org/r/#/c/267243/ [15:43:53] ja [15:44:07] mforns: btw, that would break EL in beta [15:44:18] mmm of course [15:44:40] but ja, there are log messages about kafka1012 being down [15:44:43] but, they are normal looking log messages [15:45:01] but you are right, it looks like some of the processors aren't doing their job [15:45:27] yes, sure, if EL will be in this state for a couple hours, that's fine, but it can not be like this for days [15:46:00] ottomata, should I change the patch to do the hardcoding only for production env? [15:46:01] yeah [15:46:13] mforns: i want to figure out what's happening, because this isn't right [15:46:17] ok [15:46:20] i've tested this before, and EL works with broker changes. [15:46:26] let's hold on removing ka12 [15:46:26] oh! [15:46:31] ok ok [15:46:45] there are valid-mixed messages coming [15:46:46] in [15:46:51] just not all it looks like [15:46:56] like maybe one of the processors is stuck? [15:47:16] maybe [15:47:21] I'll check [15:47:33] yeah hmm, according to burrow they are not consuming? [15:50:34] mforns: i'm going to just restart el and see what happens, ok? [15:50:38] ok [15:50:46] watching logs [15:51:05] also [15:51:22] !log restarting eventlogging [15:52:12] HUH [15:52:29] mforns: it seems not to work! [15:52:55] processor-05 tried to connect to ka12 and failed, and after that tried ka13 with no error messages, but still, does not consume [15:52:55] gonna try it with kafka1012 removed [15:53:18] ok [15:54:50] yeah now is fine [15:54:51] hm [15:54:53] mforns: weird [15:54:58] totally not good [15:55:11] how did you remove ka12?
[15:55:22] stopped puppet [15:55:30] sed -i -e 's@kafka1012.eqiad.wmnet:9092,@@g' /etc/eventlogging.d/*/* [15:56:01] cool [15:56:03] ok mforns ja lets do as you say and remove kafka1012 in prod [15:56:18] do like you did, but do an if $::realm == 'production' conditional there [15:56:25] ok [15:56:44] i'm more and more thinking that lvs for kafka client bootstrap is a good idea [15:56:50] * mforns looks [16:01:48] ottomata, can I do in puppet: x = if 1=1 {} else {} ? [16:02:16] I know in ruby I can, but.. [16:02:26] ? [16:02:32] oh ternary? [16:02:34] mforns: you want [16:02:56] https://docs.puppetlabs.com/puppet/latest/reference/lang_conditional.html#selectors [16:03:02] thx [16:03:22] elukey: ^^ relevant to your other q too [16:03:45] yep yep I was reading [16:04:15] el-valid-mixed seems to be working now from grafana :D [16:07:42] PROBLEM - Difference between raw and validated EventLogging overall message rates on graphite1001 is CRITICAL: CRITICAL: 46.67% of data above the critical threshold [30.0] [16:10:26] ottomata, https://gerrit.wikimedia.org/r/#/c/267243/2 [16:14:07] Thanks mforns for explaining to ottomata :) [16:14:19] ottomata: I didn't manage to convey the thing :) [16:14:32] naw, joal you did, i just didn't believe it [16:14:36] something still is fishy [16:14:37] hehe [16:14:54] it looks to me like without a restart [16:15:00] some of the processors continued to work [16:15:02] but not all [16:15:06] there was data flowing through valid-mixed [16:15:07] looks like so, agreed [16:15:13] but, with a restart [16:15:18] and with ka12 still in the list [16:15:19] nothing worked [16:15:37] right [16:15:47] That's what we experienced yesterday [16:16:06] ottomata: therefore the bad hack in puppet [16:16:10] aye [16:16:14] :S [16:16:25] this is so bad though.
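The emergency fix settled on above (sed kafka1012 out of the EventLogging config files, then guard the change in puppet with `$::realm == 'production'` so beta is unaffected) boils down to filtering one host out of a comma-separated broker list. A minimal sketch of that filtering step, assuming a hypothetical helper name and realm handling (this is not the actual EventLogging or puppet code):

```python
def drop_broker(broker_list, dead_host, realm="production"):
    """Remove a dead broker from a comma-separated Kafka broker list.

    Mirrors the one-off hack discussed above: only applied in the
    production realm, so EL in beta keeps its original broker list.
    """
    if realm != "production":
        return broker_list
    kept = [b for b in broker_list.split(",") if not b.startswith(dead_host)]
    return ",".join(kept)

brokers = ("kafka1012.eqiad.wmnet:9092,"
           "kafka1013.eqiad.wmnet:9092,"
           "kafka1014.eqiad.wmnet:9092")
print(drop_broker(brokers, "kafka1012.eqiad.wmnet"))
# kafka1013.eqiad.wmnet:9092,kafka1014.eqiad.wmnet:9092
```

The realm guard is the part mforns's puppet patch adds via a conditional (see the selectors link above); the string surgery itself is what the one-line sed did on analytics hosts.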
i tested this in labs many times, and had none of these problems [16:17:23] It's weird to me that a kafka lib fails if one of the writers is not up [16:17:36] Analytics: Handle EventLogging's pykafka connection errors gracefully {oryx} - https://phabricator.wikimedia.org/T125207#1981737 (Ottomata) I just tested again with @mforns, and we saw that on a restart of EL with kafka1012 down, EL did not consume or produce properly. This will need investigation for sure! [16:17:45] yeah, it shouldn't [16:22:02] Analytics, Analytics-Cluster: Productionize Impala {hawk} - https://phabricator.wikimedia.org/T96331#1981754 (Ottomata) Open>declined Declining and doing T125141 instead. We can revisit this in the future if we decide we want to try impala again. [16:25:02] RECOVERY - Difference between raw and validated EventLogging overall message rates on graphite1001 is OK: OK: Less than 20.00% above the threshold [20.0] [16:42:19] Analytics, Analytics-Cluster, Patch-For-Review: Uninstall Impala [3 pts] - https://phabricator.wikimedia.org/T125141#1981828 (Ottomata) [16:42:39] Analytics-Cluster, Analytics-Kanban, Patch-For-Review: Uninstall Impala [3 pts] - https://phabricator.wikimedia.org/T125141#1979381 (Ottomata) [16:42:53] Analytics-Cluster, Analytics-Kanban, Patch-For-Review: Uninstall Impala [3 pts] - https://phabricator.wikimedia.org/T125141#1979381 (Ottomata) a:Ottomata [16:45:03] elukey: we can move this to done, ja? [16:45:03] https://phabricator.wikimedia.org/T124644 [16:47:04] yep! [16:48:39] Analytics-Tech-community-metrics, DevRel-February-2016: Key performance indicator: Top contributors: Find good Ranking algorithm fix bugs on page - https://phabricator.wikimedia.org/T64221#1981850 (MarkAHershberger) >>! In T64221#1979612, @Aklapper wrote: > 2) without a tool ({T60585}) that allows users t...
[16:58:17] Analytics-Kanban: Prepare presentation quaterly review [3 pts] - https://phabricator.wikimedia.org/T123528#1981878 (Nuria) Open>Resolved [16:58:21] Analytics-Kanban: Quaterly review 2016/01/22 (slides due on 19th) - https://phabricator.wikimedia.org/T120844#1981879 (Nuria) [17:00:01] Analytics-Kanban: Add wm:BE-Wikimedia-Belgium to Wikimetrics tags {dove} [1 pts] - https://phabricator.wikimedia.org/T124492#1981889 (Nuria) This tag is now available at: https://metrics.wmflabs.org/ [17:00:10] Analytics-Kanban: Add wm:BE-Wikimedia-Belgium to Wikimetrics tags {dove} [1 pts] - https://phabricator.wikimedia.org/T124492#1981890 (Nuria) Open>Resolved [17:09:30] elukey: just fixed that param in the module, you should be able to do the hiera thing now [17:11:46] ottomata: ok! git submodule add URL:puppet-kafka for the module's code right? [17:11:49] just to have it [17:16:47] if you don't have it locally [17:16:47] do [17:16:51] git submodule update --init [17:17:08] that should be it [17:20:41] Analytics-Kanban: Lower parallelization on EventLogging to 1 consumer - https://phabricator.wikimedia.org/T125225#1981933 (Nuria) NEW [17:20:48] Analytics-Kanban: Lower parallelization on EventLogging to 1 consumer - https://phabricator.wikimedia.org/T125225#1981940 (Nuria) p:Triage>High [17:20:58] Analytics-EventLogging, Analytics-Kanban: Add autoincrement id to EventLogging MySQL tables. - https://phabricator.wikimedia.org/T125135#1981942 (Ottomata) [17:21:04] ahahhhhhhh didn't know it! [17:21:06] thanks! [17:41:47] Analytics-Kanban: Eventlogging should start with one bad kafka broker, retest that is the case - https://phabricator.wikimedia.org/T125228#1982053 (Nuria) NEW [18:10:14] gettin lunch, back in a bit [18:10:34] mforns: just to be sure your docker version is 1.9? [18:10:48] madhuvishy, 1.9.1 [18:10:52] ok cool [18:13:28] mforns: do you know if there is a way to see all pages in https://meta.wikimedia.org/wiki/Config:* [18:14:06] nuria, no.. 
I tried that when I was working with schema pages, but couldn't manage to do that [18:14:15] but maybe there's a way! [18:14:30] mforns: so for the pages that dashiki uses you just have to know they are there, right? [18:14:31] nuria, do you mean in the search results? [18:14:42] mforns: well... any way really [18:14:47] aha [18:14:53] * mforns thinks [18:15:22] nuria, maybe with an api call [18:15:27] (CR) Nuria: "Could you add to commit message & readme some more info as to how to execute the tabular layout?" [analytics/dashiki] - https://gerrit.wikimedia.org/r/267045 (https://phabricator.wikimedia.org/T118329) (owner: Milimetric) [18:18:26] nuria: we can do it if all of them belonged to a category on MediaWiki [18:20:30] nuria, actually, you CAN use the search box in meta with "config:" and see the suggestions [18:20:37] mforns: i did [18:21:09] mforns: and .. ahem... they are not super helpful [18:21:14] but the search results are... yea [18:21:24] ottomata, joal: reading from some kafka docs it seems that they use the sendfile syscall, to basically move data quickly from disk to sockets [18:21:51] and Kafka relies heavily on the page cache to store the log [18:22:16] nuria, yes, we should do as madhu says and have a category for that and a list like with eventlogging incidents [18:22:39] my suspicion is that if the log gets corrupted then kafka just sends it to the consumer, that needs to do the checksum [18:23:08] I've read other people getting InvalidMessageExceptions in their consumers trying to skip those messages [18:23:17] mforns: (cc madhuvishy ) I think we probably want a namespace (like event logging) rather than a category but it is less easy to create than it seems [18:23:50] I don't believe that Kafka does checksums before sending data [18:23:56] aha [18:24:06] Isn't Config a namespace? [18:25:18] mmmmmm, nuria, madhuvishy, and I think there are other dashiki configs in Dashiki:....
[18:25:29] nuria: we don't have a namespace for event logging either. It's just all Schema: pages. Even with it you can't find all of them I think. Categories are the easiest way afaik [18:25:37] madhuvishy: no, it is a "fake" namespace [18:25:43] Ahh [18:25:50] madhuvishy: EL is a namespace [18:26:05] madhuvishy: that is why they render as json nicely formatted... [18:26:19] Right Schema is a namespace [18:26:25] I thought config was too [18:26:37] madhuvishy: but Config is Low tech namespace [18:26:46] We can maybe ask for it to be a namespace [18:26:48] elukey: aye all sounds right [18:26:54] curious about those invalid message exceptions [18:27:09] But that still doesn't give us the power to see all pages in one place I think [18:27:10] i wonder what the sendfile call returns if there is an error [18:27:14] I can check [18:27:18] maybe no error, eh? [18:28:23] the sendfile call should prevent data from going from disk to kernel space to userspace and then back again [18:28:30] Analytics-Kanban: Investigate adding piwik to transparency report - https://phabricator.wikimedia.org/T125175#1982178 (Nuria) Looks like puppet deploys this from latest: bblack> nuria: puppet controls the software deployment 10:26 AM and in turn, the puppetization indicates the content comes from a... [18:28:32] that would be the only way to do CRC [18:28:36] before sending [18:28:40] but the overhead would be massive [18:29:11] so the data is handled in kernel space, moving it from disk to socket directly [18:29:44] and delegating to the consumer the CRC, falling back (i presume) to other brokers in case of faulty data [18:29:50] it would make sense [18:30:13] not super familiar with pykafka et al. to know if this is admissible [18:30:16] :( [18:30:26] nuria, madhuvishy, maybe with something like: https://meta.wikimedia.org/w/api.php?action=query&list=search&format=json&srsearch=config: ? [18:31:22] mforns: some false matches no?
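elukey's hypothesis above — that a broker using `sendfile` ships log segments straight from the page cache to the socket, so corruption on disk is only caught when the consumer verifies each message's CRC — can be sketched with a toy producer/consumer pair. This uses `zlib.crc32` as a stand-in; Kafka's actual wire format (a CRC32 field in the message header) is not reproduced here:

```python
import zlib

def make_message(payload: bytes):
    """Producer side: attach a checksum of the payload to the message."""
    return (zlib.crc32(payload) & 0xFFFFFFFF, payload)

def validate(message) -> bool:
    """Consumer side: recompute the checksum. A corrupt on-disk segment
    relayed verbatim by the broker (zero-copy sendfile, no re-check)
    fails here, on the consumer, not on the broker."""
    crc, payload = message
    return (zlib.crc32(payload) & 0xFFFFFFFF) == crc

msg = make_message(b"webrequest event")
assert validate(msg)

# Simulate the corrupted-disk scenario: the broker sends bytes that no
# longer match the stored CRC (the InvalidMessageException case).
crc, _ = msg
corrupted = (crc, b"Xebrequest event")
assert not validate(corrupted)
```

This matches the behavior people reported in the channel: the broker happily serves the corrupt partition, and it is camus (the consumer) that keeps failing on it.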
[18:32:30] also mforns, on the docker thing - the only thing we had different was that linux-kernel-extra thingy - if you have time today/anytime can you try installing that, restarting docker and seeing if the same issue shows up? As far as I'm reading this is a known issue but was already fixed. [18:34:24] madhuvishy, false matches, sure, we'd need to do some greping afterwards [18:34:48] madhuvishy, makes sense, I had forgotten, will do [18:35:37] Analytics-Kanban: Investigate adding piwik to transparency report - https://phabricator.wikimedia.org/T125175#1982244 (Nuria) Added transparency site to piwiki, siteId is '4'. Need to add snippet to transparency depot [18:37:06] mforns: yeah - i've been thinking when we have the new dashiki deployment complete - as part of the deploy build to publish a html page of all the dashboards and their configs [18:37:09] nuria: ^ [18:37:17] the deploy config has all the details [18:37:37] we can easily build it and it could be at dashiki-dashboards.wmflabs.org or sth [18:37:50] awesome idea [18:37:52] madhuvishy: in puppet? [18:37:57] fabric [18:38:06] madhuvishy: ah, ok [18:38:20] that task is paused for nw [18:38:40] i'll work on it after your work to move dashiki to use the pageview api is done [18:39:22] https://gerrit.wikimedia.org/r/#/c/259437/ [18:39:24] madhuvishy: as a standalone app (w/o puppet, usable by anyone) it will be ideal to have a way to search for other peoples config as dashiki as a platform it is useful on its own, can live behind a cdn domain w/o any puppet at all [18:40:00] ottomata: https://gerrit.wikimedia.org/r/#/c/267295/1 [18:40:22] madhuvishy: but everything on its own time [18:41:10] nuria: i don't understand - I agree that it can - but we are going too far ahead to make a feature for other people. 
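The search-then-grep approach mforns and madhuvishy agree on above (query the API with `srsearch=config:`, then filter out full-text false matches afterwards) could look like the following sketch. The API parameters are the ones quoted in the channel; the helper names and the example page title are hypothetical:

```python
from urllib.parse import urlencode

API = "https://meta.wikimedia.org/w/api.php"

def build_search_url(prefix: str = "Config:") -> str:
    """Build the list=search query quoted above; the caller fetches it."""
    params = {"action": "query", "list": "search",
              "format": "json", "srsearch": prefix.lower()}
    return API + "?" + urlencode(params)

def grep_config_titles(search_results, prefix: str = "Config:"):
    """Post-filter ('some grepping afterwards'): keep only titles that
    really start with the pseudo-namespace prefix, dropping pages that
    merely mention 'config:' in their text."""
    return [r["title"] for r in search_results if r["title"].startswith(prefix)]

# Example with a mocked response shape (a list of {"title": ...} dicts,
# as query.search returns them); the titles are illustrative only.
mocked = [{"title": "Config:Dashiki:VitalSigns"},
          {"title": "Research talk:Page mentioning config: in passing"}]
print(grep_config_titles(mocked))
# ['Config:Dashiki:VitalSigns']
```

Since Config: is only a "fake" namespace, the prefix filter is doing the work a real namespace (or a category, as madhuvishy suggests) would do server-side.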
the dashiki deployment task to build a html of all the pages can be used by anyone anyway [18:41:26] madhuvishy, that linux-image-extra was already installed and last version in my machine [18:41:29] nuria: it would be nice to have an onwiki way [18:41:32] mforns: aah [18:41:36] :/ [18:41:44] madhuvishy: no, I am not saying we need to build anything [18:41:55] madhuvishy: but dashiki is used by 3 other teams beside us [18:41:59] madhuvishy: w/o puppet [18:43:54] Nuria: but the plan is to move all of it [18:44:38] madhuvishy: all usages? [18:45:01] madhuvishy: i do not think so, we want people to be able to use it by dropping it anywhere [18:45:02] mforns: bummer. I read all the issues and they are closed! I'll try to ask YuviPanda later today [18:45:10] ok [18:45:23] madhuvishy: let me see .. maybe we are talking about different things? [18:45:24] hm elukey do you need the debdeploy things? [18:45:29] nuria: ya maybe [18:45:33] I'll try from scratch again on monday morning and see if I forgot something [18:45:46] ottomata: it was already there.. [18:46:00] OH! [18:46:01] it is! [18:46:03] didn't realize [18:46:06] :D [18:46:06] that the file existed [18:46:07] ok cool [18:46:10] Thanks mforns :) [18:46:47] yeah 1012 seems to be canary for kafka deploy somehow [18:46:53] not really sure how it works [18:47:01] madhuvishy: one thing are "layouts" other "instances" [18:47:12] madhuvishy: there are several dashiki instances we donot control [18:47:27] elukey: i'm not sure about deb deploy either [18:47:37] nuria: which ones? [18:48:01] madhuvishy: the one used by content translation and MarkTraceur for ex [18:48:15] 08ee54f2 (Moritz Muehlenhoff 2015-10-21 14:27:06 +0200 2) debdeploy-kafka: [18:48:16] nuria: language and mutimedia? 
[18:48:21] madhuvishy: yes
[18:48:22] we need to ask moritzm :)
[18:49:11] elukey: aye, but if it was there, it'll be fine
[18:49:18] nuria: they are all set up within the analytics project as far as i know
[18:49:27] nuria: the thing is - we don't need to own anything
[18:49:34] even with the puppet setup
[18:49:52] ottomata: I don't have the +2 so I can't do a proper merge :(
[18:50:06] a-team, have a nice weekend! see you on monday :]
[18:50:06] aye, (you don't?) i'll get to it, 1 min
[18:50:08] madhuvishy: right, what i am saying is that dashiki can stand on its own w/o puppet
[18:50:09] laters!
[18:50:15] madhuvishy: puppet is a convenience for us
[18:50:15] nuria: the puppet module just provides a simple way to serve static sites.
[18:50:17] ciao!
[18:50:29] bye :]
[18:50:30] it just has a static apache thing
[18:50:43] ottomata: yeah, not sure why.. I need to investigate next week :(
[18:50:51] madhuvishy: but if you already have apache and you want to put dashiki behind it, it should be just a drop-in (which is the way it is now)
[18:50:58] nuria: yes
[18:51:00] do you mind also doing the puppet-merge on palladium?
[18:51:09] but the fabric code doesn't depend on puppet either
[18:51:18] and everybody can use it
[18:51:20] to deploy
[18:51:31] madhuvishy: yes, understood
[18:52:04] ok elukey just in case, to double check that we don't make a boo boo
[18:52:09] nuria: millimetric is my dashiki guy
[18:52:19] I ain't control shit
[18:52:25] we are gonna have a "dashiki" project on labs - and an instance that can house all these dashboards. since they are just a bunch of html css and js, all people will need to do to get it on labs is fab edit-analysis deploy or something
[18:52:34] i'm going to manually edit the kafka systemd file on 1012 so that if we are wrong, puppet won't be able to start kafka
[18:52:35] this is not how it is
[18:52:37] MarkTraceur: ya, i know, milimetric and i wrote dashiki a couple of years ago
[18:52:53] nuria: but it's the future
[18:52:55] ottomata: all right!
[18:53:49] a-team: logging off, have a nice weekend!
[18:54:20] MarkTraceur: :) nuria the idea is to make it easy for everyone to deploy. currently all of them are housed in limn1
[18:54:39] laters elukey thanks for all your help!
[18:55:08] nuria: No, I mean, I never set up a dashiki instance, millimetric did it for me
[18:55:10] ottomata: thank you for all the patience, same thing for joal.. really learning tons of things with you guys!
[18:55:18] I owe you a lot of beers :)
[18:56:12] btw, patch looks good elukey, puppet didn't start kafka
[19:00:41] Analytics-Cluster, Analytics-Kanban: Create and maintain an Analytics Cluster in Beta Cluster in labs. - https://phabricator.wikimedia.org/T109859#1982355 (Ottomata)
[19:30:43] (PS1) Milimetric: Update with January 2016 data [analytics/reportcard/data] - https://gerrit.wikimedia.org/r/267306
[19:31:00] (CR) Milimetric: [C: 2 V: 2] Update with January 2016 data [analytics/reportcard/data] - https://gerrit.wikimedia.org/r/267306 (owner: Milimetric)
[19:40:41] milimetric: Hmm, https://gerrit.wikimedia.org/r/#/c/267306/1/datafiles/rc_active_editors_count.csv changed quite a lot of old data. Did we change the metric?
[19:41:09] (Pretty much all of the files do, FWIW.)
[19:44:48] James_F: no, all the files always do, Erik's metrics are re-evaluated for all of history to take deleted pages into consideration
[19:45:45] Ah, OK.
[19:45:47] this is why I think our current approach to analytics is busted and we need append-only streams and things like Druid
[20:59:36] Analytics-Kanban, Editing-Analysis, Patch-For-Review: Queries for the edit analysis dashboard failing since December 2015 [5 pts] - https://phabricator.wikimedia.org/T124383#1982749 (Neil_P._Quinn_WMF) From my perspective, that seems reasonable since I'm working on T118063. Any objections, @Jdforrester-...
[21:24:33] madhuvishy: I got a ping here!
[21:24:50] YuviPanda: yeahhh
[21:25:01] so I was asking marcel to test the docker stuff
[21:25:25] and he got this Cannot link to a non running container error
[21:25:43] YuviPanda: since you have linux, was going to ask if you have run into it with docker compose
[21:26:00] the source is: there are services like create_db that just run and exit
[21:26:09] ah
[21:26:11] ok
[21:26:15] will do later :D
[21:26:19] for me, it runs and exits, but doesn't block the services from starting up
[21:26:20] madhuvishy: can you open a bug with details?
[21:26:22] * YuviPanda is at UCB
[21:26:27] in ubuntu it seems to
[21:26:42] YuviPanda: ya np
[21:26:55] i see a number of similar bugs filed at docker
[21:27:00] they claim to be fixed
[21:27:18] fun
[21:27:21] YuviPanda: like https://github.com/docker/compose/issues/1193
[21:29:52] joal: thanks for the update; is it ok to run queries on hive again?
[21:30:40] HaeB: I think the cluster is still catching up - I think joal meant to say to wait until monday for the go-ahead
[21:30:43] nope HaeB, cluster is about 20 hours behind on computation, so it's better to leave it alone for the weekend :)
[21:31:03] indeed madhuvishy
[21:31:05] thx :)
[21:31:08] :)
[21:31:31] ok :( understood, thanks for your work to get it back up
[21:32:40] Thanks HaeB :)
[21:32:44] Have a good weekend !
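The docker-compose failure madhuvishy describes can be reproduced with a compose file shaped like the fragment below: a one-shot setup service runs and exits, and on some Docker/Compose versions (as in the linked issue) anything that `links` to it then fails with "Cannot link to a non running container". This is an illustrative sketch only; the service names and commands are hypothetical, not the actual wikimetrics setup.

```yaml
# Illustrative docker-compose (v1 format, 2016-era) fragment.
db:
  image: mysql:5.5
create_db:
  # One-shot service: creates the schema, then exits.
  image: mysql:5.5
  links:
    - db
  command: mysql -h db -e 'CREATE DATABASE IF NOT EXISTS wikimetrics'
web:
  build: .
  links:
    - db
    - create_db   # linking to an already-exited container is what triggers the error
```

Dropping the link to the one-shot service (and only ordering it before `web`) is one way to sidestep the problem on affected versions.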
[21:32:56] * joal is back planning the backfilling jobs
[21:39:04] Analytics-Tech-community-metrics, DevRel-January-2016: Improve Key performance indicator: code contributors new / gone - https://phabricator.wikimedia.org/T63563#1982837 (Nemo_bis) > If the current page kind of works for your purposes, then maybe we can just resolve this task? Probably. Maybe remove "Imp...
[21:52:02] madhuvishy: sorry, juggling too many things. I can take a look later
[21:55:55] ottomata: anywhere near ?
[21:58:19] joal: ja
[21:58:21] am here
[21:58:24] how goes?
[21:58:26] Heya
[21:58:33] Backfilling ...
[21:58:56] Was wondering if there was a way to check the mysql db behind oozie (response times are soooooo bad ...)
[21:59:13] lookin
[21:59:21] on an1027 IIRC
[21:59:27] ?
[21:59:35] ja
[21:59:39] long query time!
[21:59:41] I tried mysql -u oozie -p from there, but it refused :(
[21:59:42] 509 secs
[22:00:12] Like I have started an API call, something like 10 minutes ago ...
[22:01:31] I want to double-check there are indices on that DB (I assume yes, but who knows ...)
[22:01:41] yeah machine is really overloaded
[22:02:10] joal hm
[22:02:11] you know
[22:02:20] i do have a mysql server running on analytics1015
[22:02:21] that is a slave of this
[22:02:30] i did that in anticipation of moving hive and oozie there
[22:02:36] Right
[22:02:41] Switch ?
[22:02:42] could make it master
[22:02:58] difficult ottomata?
[22:03:00] probably a bad idea at the end of a friday
[22:03:03] shouldn't be
[22:03:04] I'm no good at mysql
[22:03:06] easier than moving hive/oozie
[22:03:11] for sure !
[22:03:12] and would probably help a little
[22:03:17] right, bad idea :)
[22:03:28] let's forget about that, this too shall pass :)
[22:03:31] ok
[22:03:50] I'm gonna keep fighting
[22:04:45] hue is actually taking a lot of memory
[22:05:07] disk is pretty utilized though
[22:05:12] that is probably most of the slowdown
[22:05:16] hm
[22:07:29] looks like mysql is doing most of the io
[22:07:36] k
[22:07:46] so switching masters probably would help...
[22:08:01] ottomata: in the DB, no indexes on name for instance
[22:08:27] YuviPanda: no problem - whenever you can
[22:09:22] Hey ottomata, my API call finished, after more than 10 minutes !
[22:09:25] :D
[22:09:34] uh oh, but slave says
[22:09:35] Last_IO_Error: error connecting to master 'root@10.64.36.127:3306' - retry-time: 60 retries: 86400 message: Can't connect to MySQL server on '10.64.36.127' (110 "Connection timed out")
[22:09:58] Arf, when was that ?
[22:10:38] oof two days ago
[22:11:44] i think the io on an27 is causing the slave to timeout
[22:11:45] :(
[22:11:58] arf, nevermind
[22:12:15] I'll go for manual ;)
[22:13:42] (PS3) Milimetric: Implement a Tabular Layout [analytics/dashiki] - https://gerrit.wikimedia.org/r/267045 (https://phabricator.wikimedia.org/T118329)
[22:14:14] hmm
[22:14:32] oo no
[22:14:35] firewall change?
[22:14:46] ouch
[22:21:18] a-team, I'm off for today !
[22:21:23] Have a good weekend all :)
[22:21:45] laters!
[22:21:58] ottomata: don't bother with the mysql stuff, we'll do that later :)
[22:22:03] Ciao !
[22:22:07] ok
[22:22:12] i want the slave working though
[22:22:15] so i'm fixing at least that
[23:10:23] (PS1) Milimetric: Clean up syntax errors and bad names [analytics/dashiki] - https://gerrit.wikimedia.org/r/267389
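The two MySQL checks joal and ottomata discuss (missing indexes on the oozie DB, and the broken slave) could be run roughly as follows. This is a sketch: WF_JOBS and app_name follow the stock Oozie schema, but verify names against the live database before adding any index.

```sql
-- On the oozie DB host (an1027): see whether the columns behind the
-- slow API queries are indexed; Oozie's workflow table is WF_JOBS.
SHOW INDEX FROM WF_JOBS;
-- If a frequently-filtered column such as app_name has no index
-- (example only, check the real schema first):
-- CREATE INDEX i_wf_jobs_app_name ON WF_JOBS (app_name);

-- On the slave (analytics1015): check replication health; the
-- Last_IO_Error field is where the "Can't connect to master"
-- message above shows up.
SHOW SLAVE STATUS\G
```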