[02:09:13] Analytics-Kanban, Mobile-Content-Service, Wikipedia-Android-App-Backlog, Patch-For-Review, Spike: [Feed] Establish criteria for blacklisting likely bot-inflated most-read articles - https://phabricator.wikimedia.org/T143990#2733662 (Mholloway) @JMinor I took a stab at incorporating your heuri...
[05:52:57] !log created 0001009-161020124223818-oozie-oozi-C to run webrequest-load-check_sequence_statistics-wf-upload-2016-10-21-3 (oozie errors)
[07:20:26] FYI I am rebooting stat100[234]
[07:26:50] all done, not sure if there are specific checks to do after a boot
[07:27:46] Hi elukey :)
[07:28:48] o/
[07:30:36] What's up this morning?
[07:31:08] nothing on fire for the moment :)
[07:31:16] I rebooted all the stats, including 1001
[07:31:23] (so a brief outage for our websites)
[07:31:44] okay, from your message I understood everything went smoothly?
[07:32:17] elukey: Do you know the kernel change we are deploying with this reboot?
[07:32:57] yes, it is to fix a vulnerability (11 years old!) discovered in the kernel
[07:33:09] Maaaaaan .... 11 years old !
[07:34:13] Riccardo gave me the link to the actual git commit yesterday: https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/commit/?id=19be0eaffa3ac7d8eb6784ad9bdbc7d67ed8e619
[07:34:14] elukey: upgrading my debian: indeed, new linux image
[07:35:44] joal: what would be great to do today is to figure out whether the kafka mess that happened yesterday caused any data drop
[07:36:22] elukey: We restarted the jobs with a higher threshold with mforns, but didn't check
[07:36:44] elukey: I guess looking for timestamp = '-' in webrequest upload is a good start
[07:36:48] long story short: firewall connection tracking was saturated due to a race condition between puppet and the OS, and connections were dropped
[07:37:13] I think it could have been worse: https://grafana.wikimedia.org/dashboard/db/varnishkafka
[07:37:13] elukey: And, since the problems only occurred on webrequest upload, I think there is no big deal
[07:37:25] no no, not only upload
[07:37:35] ALL the brokers were at some point misbehaving
[07:37:36] yes, but oozie errors = only upload
[07:37:38] at the firewall level
[07:38:08] so, since there was no email from oozie, it means less than 1% data loss, if any
[07:38:14] right, I understand
[07:38:44] but what if we dropped half of the connections from vk, without causing holes? Is it possible?
[07:38:52] I am thinking out loud
[07:39:18] if you look at https://grafana.wikimedia.org/dashboard/db/varnishkafka?panelId=20&fullscreen
[07:39:42] from 15:00 to 17:30
[07:39:58] these are librdkafka failed delivery reports
[07:40:03] now we have retries
[07:40:32] (s/now//)
[07:40:48] buuuut I am paranoid you know :)
[07:40:59] it feels weird that we didn't drop data with that mess
[07:42:50] I am checking pivot atm
[07:45:08] elukey: sorry, phone
[07:45:50] elukey: If there had been data loss, it would have happened between vk and kafka, not in vk - meaning, sequence numbers are not messed up
[07:46:49] And with our checks in load, had there been (enough) problems with missing sequence numbers, we would have known it
[07:46:54] elukey: --^
[07:47:09] So I'm confident that if data has been lost, it's very small
[07:48:06] elukey: does what I'm thinking make sense?
[07:49:14] sure sure
[07:49:20] ok to reboot bohrium now, the server hosting piwik?
[07:49:59] I think so moritzm!
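A minimal sketch of the data-loss check joal suggests above (looking for timestamp = '-' in webrequest upload over the affected window). The table name (wmf_raw.webrequest), the dt field and the partition layout are assumptions based on the chat, not confirmed in this log:

```bash
# Count webrequest upload rows whose timestamp came through as '-' during the
# window visible in the varnishkafka dashboard (2016-10-20, ~15:00-17:30).
# Table, field and partition names are assumed, not verified here.
hive -e "
  SELECT hour, COUNT(*) AS missing_timestamps
  FROM wmf_raw.webrequest
  WHERE webrequest_source = 'upload'
    AND year = 2016 AND month = 10 AND day = 20
    AND hour BETWEEN 15 AND 18
    AND dt = '-'
  GROUP BY hour;
"
```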
[07:50:55] mobrovac: ciao! Whenever you want to do the kafka100[12] reboots, let me know
[07:50:59] even this afternoon
[07:55:05] it's back up, if anyone uses it, I'd welcome a quick functionality check. but at least it's showing me the mod_auth prompt :-)
[08:00:02] we'll ask nuria and milimetric, they are the masters of piwik :)
[08:00:19] ok :-)
[08:00:46] but I can see data flowing in apache
[09:24:25] ok to reboot krypton? it runs burrow, so I wanted to double-check
[09:26:12] elukey: i'm good to go
[09:26:28] moritzm: go ahead, thanks!
[09:26:35] mobrovac: all right, let's do it
[09:27:00] so what I am going to do is (and then you tell me what extra steps we'll need)
[09:27:11] 1) depool kafka1001
[09:27:49] 2) check kafka topic replication, and if good stop kafka (extra check afterwards)
[09:27:56] 3) reboot kafka1001
[09:28:21] 4) wait for it to come up again, check topics, launch a preferred replica election, check topics again
[09:28:24] 5) re-pool
[09:28:33] and then the same thing with kafka1002
[09:28:39] yup
[09:28:51] in step 4 we also need to check that the proxy service is up before repooling
[09:31:47] sure sure
[09:31:57] all right, proceeding with kafka1001
[09:35:51] krypton back up, let me know if anything is fishy
[09:36:14] sure
[09:36:33] rebooting kafka1001
[09:40:38] all good, up and running
[09:40:42] replicas are in sync
[09:40:48] and mirror maker is up as well
[09:40:57] proxy is up too
[09:41:03] (checking traffic with httpry)
[09:41:55] cool
[09:42:10] repooling
[09:43:02] ok I can see 201s flowing
[09:43:11] let's wait a bit and then do 1002
[09:43:26] (a bit == some datapoints in grafana)
[09:43:45] i will restart cp now then
[09:43:47] for sanity
[09:43:58] and redo the same after kafka1002
[09:45:23] ok, restarted, let's wait at least 60 seconds now to confirm messages are being handled by cp, elukey
[09:46:28] sure sure
[09:46:39] mirror maker on 1002 didn't like this reboot
[09:46:45] I am going to open a phab task
[09:50:46] piwik is working fine elukey
[09:53:53] \o/
[09:54:38] elukey: ok, we can proceed to kafka1002
[09:55:01] super
[09:57:16] hi!
[09:57:38] joal: I took the liberty of adding you as a reviewer of https://gerrit.wikimedia.org/r/#/c/316931 :)
[09:59:36] it involves some oozie and hql and I'm never very confident with this, so when you have some time it'd be much appreciated :)
[10:03:31] mobrovac: kafka1002 back serving traffic
[10:03:54] ok, restarting cp again for sanity
[10:07:28] joal: I am investigating 3776-21 Oct 2016 00:00:00, probably I'll fail but let's see if I manage to solve the issue (also the same thing for maps)
[10:09:03] moritzm: analytics reboots should be completed
[10:09:19] thanks mobrovac!
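A condensed sketch of the per-broker procedure elukey walks through above (09:27). The ZooKeeper chroot, the pool/depool helpers and the proxy check are assumptions, not the exact commands used on this cluster; kafka-topics.sh and kafka-preferred-replica-election.sh are the standard Kafka CLI tools of that era.

```bash
# Per-broker reboot steps as listed above; connection string, pool/depool
# helpers and the proxy URL are assumed placeholders.
ZK="zookeeper-host:2181/kafka"   # assumed ZooKeeper chroot

# 1) depool the broker from the service (assumed site-specific helper)
sudo depool

# 2) empty output = no under-replicated partitions, safe to stop the broker
kafka-topics.sh --zookeeper "$ZK" --describe --under-replicated-partitions
sudo service kafka stop

# 3) reboot the host and wait for it to come back up

# 4) re-check replication, move leadership back, check once more,
#    and make sure the proxy answers before repooling (port assumed)
kafka-topics.sh --zookeeper "$ZK" --describe --under-replicated-partitions
kafka-preferred-replica-election.sh --zookeeper "$ZK"
kafka-topics.sh --zookeeper "$ZK" --describe --under-replicated-partitions
curl -sf http://localhost:8085/ >/dev/null && echo "proxy up"

# 5) repool, then repeat on the next broker
sudo pool
```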
[10:14:32] the jobs seem to have failed while waiting for the _IMPORTED flag
[10:15:14] trying to re-run them
[10:15:29] going also to check the data on HDFS
[10:15:37] (I should have done it before :P)
[10:16:35] dcausse: Sure, today I'm a bit sick, so probably on monday :)
[10:16:42] yes, data is in /mnt/hdfs/wmf/data/raw/webrequest/webrequest_text/hourly/2016/10/21/00 with the IMPORTED flag
[10:16:47] joal: sure, thanks for your help!
[10:25:34] elukey: Have you found the error with 2016-10-21T00:00 ?
[10:26:54] joal: from the coordinator logs (hue goes into 500, not sure why) I saw only "waiting for _IMPORTED"
[10:27:02] but they have it now so I re-ran them
[10:27:11] only maps has _SUCCESS though
[10:27:19] I don't remember when that gets put in there
[10:29:18] the maps coordinator seems to have finished
[10:29:22] text is in progress
[10:30:09] elukey: awesome. what about the druid cluster? it's still on the wmf2 4.4 kernel
[10:30:19] aaaarghhh
[10:30:24] I forgot Druid
[10:30:41] elukey: 500 from hue is weird, no ?
[10:31:26] only for the two failed ones, not sure why :(
[10:31:56] joal: before you go, my understanding is that I can safely reboot one druid host at a time
[10:32:13] since there is no special one doing coordination like in hadoop
[10:32:47] in the meantime the failed text job is in refine
[10:32:53] elukey: Correct, except for clickhouse running on those machines as well
[10:33:18] whattt
[10:33:23] I didn't know that
[10:33:32] elukey: If the issue was on IMPORTED, then rerunning it with the files present is a good idea
[10:33:39] and should solve it
[10:33:40] !log re-run webrequest-load-wf-text-2016-10-21-00 and webrequest-load-wf-maps-2016-10-21-00
[10:33:52] elukey: no need to rerun, just relaunch, right?
[10:34:20] elukey: test only
[10:34:24] (for clickhouse)
[10:34:26] is there a difference? I just hit "re-run" a while ago on both in hue
[10:35:16] elukey: this is re-launch :)
[10:35:26] elukey: or re-run, as you wish
[10:35:38] all right, didn't do anything stupid then :)
[10:35:40] elukey: but no need to restart a coord as we do for upload, for instance
[10:35:46] :)
[10:35:54] ahhh yes yes ok
[10:35:59] I got what you meant
[10:36:08] elukey: nothing to bother about anyway, just multiple refine jobs on the same data, no prob :)
[10:36:09] so about clickhouse, all good if I reboot the druid hosts?
[10:36:31] elukey: I don't think so, I think ottomata has them running in screens or something
[10:36:45] bad ottomata is bad :D :D
[10:36:50] :D
[10:36:56] You know him, hmmmmmm ?
[10:37:13] So, I have a loading job currently running, should be done by end of afternoon
[10:37:33] Would you mind waiting for end-of-day to reboot them, maybe with ottomata to restart clickhouse after?
[10:37:50] sure, no problem, I'll do them this evening
[10:38:07] Thanks a million elukey, sorry for bothering :(
[10:38:28] elukey: I can't access hue anymore, don't know why
[10:39:01] I can, what error do you see?
[10:39:19] elukey: nothing loads
[10:39:22] blank page
[10:39:38] elukey: might be related to the chrome update
[10:39:46] elukey: will reboot later on today
[10:39:55] works fine for me :(
[10:40:00] np
[10:40:07] elukey: leaving for now, ok for you?
[10:40:17] sure! get some rest :)
[10:40:23] Thanks mate :)
[10:40:28] Later a-team
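For the stuck 2016-10-21T00:00 load jobs discussed above, a hedged sketch of checking the _IMPORTED flag on HDFS and re-running the coordinator action from the CLI instead of Hue. The Oozie server URL and coordinator id are placeholders; the action number (3776) is the one elukey mentions at 10:07.

```bash
# Verify the raw hour was fully imported: expect the hour's data files plus an
# _IMPORTED flag file in the directory listing.
hdfs dfs -ls /wmf/data/raw/webrequest/webrequest_text/hourly/2016/10/21/00

# Re-run the stuck coordinator action (equivalent to hitting "re-run" in Hue).
# OOZIE_HOST and the coordinator id are placeholders, not taken from this log.
oozie job -oozie http://OOZIE_HOST:11000/oozie \
  -rerun WEBREQUEST_LOAD_TEXT_COORD_ID -action 3776
```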
[12:29:49] !log created 0001387-161020124223818-oozie-oozi-C for webrequest-load-check_sequence_statistics-wf-upload-2016-10-21-11 (oozie errors)
[13:33:47] mmmm
[13:34:14] I suspect that the date/time that I have for the interview with ottomata is not right
[13:55:53] (PS2) Milimetric: [WIP] Migrate from bower to npm instead of yarn [analytics/dashiki] - https://gerrit.wikimedia.org/r/316904 (https://phabricator.wikimedia.org/T147884)
[14:02:14] elukey: in an interview just noticed https://stats.wikimedia.org/ has some problems with ssl (check out the console)
[14:03:38] milimetric: I think it is only a matter of replacing references to 'http://stats.wikimedia.org/cgi-bin/search_portal.pl'
[14:03:58] yeah
[14:04:55] is there a git repo for this?
[14:05:24] https://gerrit.wikimedia.org/r/#/admin/projects/analytics/wikistats
[14:06:12] I think that a grep for "http://" would be enough, checking
[14:10:27] well.. there are a lot of links to change :P
[14:30:50] elukey: did something change in the config for that server after the restart?
[14:31:13] (just got out of my interview)
[14:31:27] nope
[14:34:49] hm... wonder why it's failing all of a sudden then
[14:51:00] milimetric: are you using chrome?
[14:51:08] might be a browser-specific thing
[14:51:09] yep
[14:51:16] elukey: you don't see the problems?
[14:51:21] yep I do
[14:51:26] but I don't have firefox
[14:51:48] oh yeah, no problem in ff
[14:51:51] cool
[14:52:04] IIRC joseph told me about a chrome update today
[14:52:12] mforns: back, wanna hang out in the cave for a bit before standup
[14:52:13] so this might be the reason
[14:52:18] milimetric, sure
[14:52:22] aha, ughhhh security is sux
[14:52:26] but we'd need to update the repo asap milimetric :)
[14:52:28] maybe they fixed Dirty COW
[14:52:46] elukey: yeah, I emailed ezachte about it
[14:55:10] super
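A small sketch of the cleanup discussed above: find the hard-coded http:// links in the analytics/wikistats repo so they stop being blocked as mixed content on the https site. The rewrite shown is illustrative only, not a tested migration of the repo:

```bash
# In a checkout of analytics/wikistats: locate the hard-coded http:// links
grep -rn "http://stats.wikimedia.org" .

# One possible (untested) rewrite: make the links protocol-relative so the
# browser matches the scheme of the page that embeds them
grep -rl "http://stats.wikimedia.org" . \
  | xargs sed -i 's|http://stats\.wikimedia\.org|//stats.wikimedia.org|g'
```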
[15:09:36] Analytics-Kanban, ChangeProp, EventBus, Wikimedia-Stream, Services (watching): Write node-rdkafka event.stats callback that reports stats to statsd - https://phabricator.wikimedia.org/T145099#2734517 (Nuria) a:Ottomata>Nuria
[15:28:06] Analytics: Put wikistats latest in gerrit - https://phabricator.wikimedia.org/T148842#2734545 (Nuria)
[15:28:32] Analytics: Put wikistats latest in gerrit - https://phabricator.wikimedia.org/T148842#2734558 (Nuria)
[15:31:57] Analytics, Research-and-Data, Research-management: GPU upgrade for stats machine - https://phabricator.wikimedia.org/T148843#2734568 (DarTar)
[15:32:12] mforns: I am going to restart the last failed oozie job :(
[15:32:30] elukey, ok
[15:33:26] !log created 0001564-161020124223818-oozie-oozi-C to re-run webrequest-load-check_sequence_statistics-wf-upload-2016-10-21-14 (oozie errors)
[16:01:33] milimetric: can we get these changes merged before the karma npm refactor? https://gerrit.wikimedia.org/r/#/c/314622/
[16:02:04] (Abandoned) Nuria: Making link to new browser reports more prominent [analytics/wikistats] - https://gerrit.wikimedia.org/r/279965 (https://phabricator.wikimedia.org/T129101) (owner: Nuria)
[16:03:23] nuria: actually I can abandon everything but the piwik config changes there
[16:03:40] milimetric: but we agree piwik has no effect on perf, right?
[16:03:44] the new way htmlreplace works (with the new version) means it makes sense to move the piwik script into the head the way you pointed out
[16:04:14] milimetric: ok, then should we abandon that patch?
[16:04:38] um... you can leave it and I can strip out the build process changes and leave the piwik in there. That way I don't forget
[16:06:06] I gotta run to the doctor's, be back later
[17:08:21] Analytics-Kanban, Mobile-Content-Service, Wikipedia-Android-App-Backlog, Patch-For-Review, Spike: [Feed] Establish criteria for blacklisting likely bot-inflated most-read articles - https://phabricator.wikimedia.org/T143990#2585510 (Niedzielski) ^ This is pretty neat {icon thumbs-up}
[17:32:17] github is having problems
[17:32:21] going afk team, have a good weekend!
[17:42:09] bye elukey !
[18:53:39] nuria, can we schedule an hour for the first meeting? I'm sure there will be plenty to discuss
[18:54:02] yurik: can you send an agenda? that way we can see whether we truly need an hour
[18:54:53] nuria, hm, well, i was hoping for a demo of all the possible ways available from analytics - and judging by our chat here yesterday, there is plenty :)
[18:55:14] yurik: That would take 3 hours yurik, not 1
[18:55:26] nuria, hehe, i am hoping for a short one :)
[18:55:55] yurik: that is why i was trying to see what you are working on / what needs there are, and then other more concrete meetings can be scheduled
[18:55:57] we need to figure out how our existing tech and requirements can be solved with what your team has been working on, where we duplicate our efforts, and try to eliminate that
[18:56:22] i guess we could make it short at first and schedule an extra one later in the week
[18:56:23] "all possible ways" of .. ahem ... doing what thing...
[18:56:47] visualizing analytical data :)
[18:57:05] which means - everything from gathering, to filtering/sorting/aggregating, to visualizing
[18:57:19] in other words - the analytics data pipeline
[18:57:20] ok, that is one thing that, as i said, has several solutions
[18:57:33] depends on the use case
[18:57:52] right - and we should understand the differences between the solutions, and present our use cases :)
[18:58:02] it works the other way around
[18:58:03] anyway, i guess we can start with 30min, and move on from there
[18:58:10] we learn about use cases 1st
[18:58:16] oki
[18:58:30] understand what we are trying to do and then we work on tech
[18:58:38] technology is not the solution to all problems
[18:58:52] ^ yurik
[18:59:12] sorry
[18:59:22] "technology is not the solution to all problems"
[19:00:02] yurik: Example: This dataset: https://wikitech.wikimedia.org/wiki/Analytics/Data/Browser_general
[19:00:17] sorry, prod issue :(
[19:01:06] yurik: could have greatly helped with this project: https://github.com/wikimedia-research/Discovery-Portal-Adhoc-JavaScriptSupport
[19:09:49] seems that would be perfectly aligned with what you do. Sorry, seems like all the cassandra servers just died on us, working on switching the maps backend servers
[19:10:41] yurik: k
[19:44:22] Analytics-Kanban, ChangeProp, EventBus, Wikimedia-Stream, Services (watching): Write node-rdkafka event.stats callback that reports stats to statsd - https://phabricator.wikimedia.org/T145099#2735127 (Nuria) @ottomata: in what repo does this code go, I imagine it no longer goes into kasocki
[20:07:03] bye a-team have a nice weekend!
[20:21:22] back