[05:56:17] Analytics-Kanban, RESTBase, Services, RESTBase-API, User-mobrovac: Enable rate limiting on pageview api - https://phabricator.wikimedia.org/T135240#2352096 (GWicke) @Antigng_, request rates are limited per IP address, so multiple bots running on the same host share the quota.
[08:09:27] joal: o/
[08:34:33] !log rebooting kafka1012 for kernel upgrades.
[08:34:34] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log, Master
[08:38:39] !log event logging restarted on eventlog1001
[08:38:41] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log, Master
[08:40:11] PROBLEM - Check status of defined EventLogging jobs on eventlog1001 is CRITICAL: CRITICAL: Stopped EventLogging jobs: forwarder/legacy-zmq
[08:44:09] RECOVERY - Check status of defined EventLogging jobs on eventlog1001 is OK: OK: All defined EventLogging jobs are runnning.
[08:52:39] PROBLEM - Difference between raw and validated EventLogging overall message rates on graphite1001 is CRITICAL: CRITICAL: 33.33% of data above the critical threshold [30.0]
[09:06:50] RECOVERY - Difference between raw and validated EventLogging overall message rates on graphite1001 is OK: OK: Less than 20.00% above the threshold [20.0]
[09:07:56] a big mess with kafka1012 restart, fs issue. Then kafka1022 stopped :/
[09:08:19] I suspect that it was me running preferred replica election
[09:33:31] elukey: ouch
[09:33:37] elukey: sorry, joined late today
[09:33:43] elukey: Can I help with anything ?
[09:35:20] hellooo
[09:35:23] nono all good
[09:35:41] I think I triggered some unexpected behavior
[09:35:51] or maybe something that I didn't expect
[09:36:03] I rebooted kafka1012 and it came up with a disk issue
[09:36:09] /dev/sdf1
[09:36:14] :(
[09:36:18] so i did a fsck and rebooted again
[09:36:26] Man, those disk issues seem to happen regularly lately
[09:36:38] in the meantime, there were partition replica lag alerts
[09:36:53] so I tried on 1022 and 1018 to run the preferred replica election
[09:36:56] to rebalance
[09:37:07] but kafka1022 started a graceful sto
[09:37:09] *stop
[09:37:57] :(
[09:37:58] almost for sure running the preferred replica election wasn't right on other brokers, but a shutdown?
[09:38:11] and I re-checked, I am almost sure that I didn't issue the command :D
[09:38:19] elukey: preferred replica election is broker independent
[09:38:19] I mean, the stop command
[09:38:32] elukey: you run it in one place and it applies to the full cluster
[09:38:45] elukey: weird :(
[09:43:14] joal: yep but I wanted to run it from different brokers just in case, maybe two times in a row with a broker down was not good?
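(For context, a minimal sketch of what a preferred replica election looks like with the stock Kafka 0.9 CLI tools; the ZooKeeper connect string below is a placeholder rather than the production value, and on the brokers this is presumably what the local kafka wrapper mentioned in the log drives.)

  # the election request is written to ZooKeeper once and the controller applies it
  # to the whole cluster, which is why it does not matter which broker it is run from
  kafka-preferred-replica-election.sh --zookeeper zookeeper.example.org:2181/kafka
  # afterwards, check that leadership and replica lag have settled
  kafka-topics.sh --zookeeper zookeeper.example.org:2181/kafka --describe --under-replicated-partitions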
[09:43:25] or maybe I missed a stop command from my keyboard
[09:43:51] I can only see a graceful shutdown from the logs
[09:43:58] plus other stuff that is a bit cryptic :D
[09:44:17] elukey: I think 2 times in a row is not a good idea: it might have taken time for the brokers to rearrange themselves
[09:45:40] probably, I am not disputing that I did something wrong but I want to know what :D
[09:45:45] at some point I see [2016-06-03 08:35:10,659] INFO [Kafka Server 22], shutting down (kafka.server.KafkaServer)
[09:56:51] elukey: I'm not saying you did something wrong, I'm trying to guess what could have triggered the shutdown ;)
[10:03:08] I found it
[10:03:10] Jun 3 08:35:10 kafka1022 sudo: elukey : TTY=pts/1 ; PWD=/home/elukey ; USER=root ; COMMAND=/usr/sbin/service kafka stop
[10:03:27] I might have had it in the history and executed it by mistake
[10:03:35] really embarrassing
[10:03:54] elukey: It happens, nothing to worry about
[10:04:01] elukey: At least you know where it comes from !
[10:04:20] elukey: Remember me losing one year of data ;)
[10:06:21] joal: thanks for the support, but I have no recollection of executing that command
[10:06:34] really maybe it was me running commands too quickly
[10:06:36] lesson learned
[10:07:30] hey team :]
[10:08:19] o/
[10:08:50] Hi mforns :)
[10:20:07] * elukey away from keyboard for a bit!
[10:43:01] Analytics: Investigate where records will al null fields are coming from - https://phabricator.wikimedia.org/T136844#2352631 (elukey)
[10:44:29] this one is weird --^
[10:44:41] because not all the fields are null
[10:46:51] Analytics: Investigate where Kafka records will almost all null fields are coming from - https://phabricator.wikimedia.org/T136844#2352634 (elukey)
[11:18:36] Analytics, Operations, ops-eqiad: Smartctl disk defects on kafka1012 - https://phabricator.wikimedia.org/T136933#2352719 (elukey)
[11:19:41] kafka1012 disk to replace :)
[11:19:47] * elukey lunch!
[11:29:48] a-team, I'm AFK for a while
[11:30:05] ok elukey and joal ttyl!
[13:50:37] elukey: hi morning! I just realized I forgot to tell you that I rebooted kafka1014 yesterday
[13:51:07] ottomata: morning!
[13:51:29] today I rebooted the last one, kafka1012, and then 1022 by mistake
[13:51:44] how it went, let me recap :)
[13:52:51] rebooted kafka1012 and found fs issues, umounted the partition, fsck, back in service.
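(That kind of offline filesystem repair, sketched out; hedged: the device name is the one from the log above, the mount point is assumed to be listed in /etc/fstab, and the broker is stopped first so the data partition is not in use.)

  service kafka stop          # don't fsck a mounted, in-use data partition
  umount /dev/sdf1
  fsck -y /dev/sdf1           # -y: repair what it finds without prompting
  mount /dev/sdf1             # remount (assumes an /etc/fstab entry for the device)
  service kafka start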
Not sure what happened to my brain, but I was convinced I had only issued a kafka-preferred-replica election on kafka1020 to clear some graphs, because I wanted to work on kafka1012
[13:53:06] but probably because I had a kafka stop in my recent history, kafka1020 got stopped
[13:53:09] :/
[13:53:21] no big deal, but a page was fired
[13:53:25] hm ok
[13:53:32] good ol' kafka just keeps on truckin though :)
[13:53:57] really sorry, it was a stupid mistake, probably because I wanted to do too many things at once
[13:54:31] the main thing that I noticed is that partition lag recovery lasts a lot longer with 0.9
[13:54:31] nawww no problem at all :)
[13:54:32] :)
[13:55:01] also, since the day wasn't too good, I rebooted eventbus codfw
[13:55:27] I issued kafka-preferred replica only at the end, maybe this was the error
[13:55:30] and god
[13:55:32] *got
[13:55:32] [KafkaApi-2002] Produce request with correlation id 2 from client kafka-python on partition [codfw.test.event,0] failed due to Leader not local for partition [codfw.test.event,0] on broker 200
[13:56:14] elukey: i think that's a normal error when leadership changes
[13:56:34] the client should get a failed response, update metadata, find the new leader, and retry producing again
[13:56:35] ottomata: yeah but it triggered
[13:56:36] PROBLEM - eventlogging-service-eventbus endpoints health on kafka2002 is CRITICAL: /v1/events (Produce a valid test event) is CRITICAL: Test Produce a valid test event returned the unexpected status 500
[13:56:40] hmm
[13:56:41] ok
[13:56:45] that's not good then
[13:56:50] restarted EL and all good
[13:56:59] hm
[13:57:10] in the logs I found a stacktrace with something about the kafka producer
[13:57:21] you can find them if you ssh
[13:59:10] before re-pooling each server I tested the /v1/topics GET call from palladium
[13:59:21] and it was working fine
[13:59:51] the el logs on kafka2002?
[14:00:36] yep..
I also checked the kafka ones
[14:00:38] and found
[14:00:39] WARN kafka.server.ReplicaManager - [Replica Manager on Broker 2002]: Fetch request with correlation id 105 from client ReplicaFetcherThread-0-2002 on partition [codfw.change-prop.retry.mediawiki.revision_visibility_set,0] failed due to Leader not local for partition [codfw.change-prop.retry.mediawiki.revision_visibility_set,0] on broker 2002
[14:01:02] didn't correlate the timings
[14:01:05] let me check
[14:01:41] those errors started at [2016-06-03 13:26:36,623] on kafka2002
[14:02:06] and the 500
[14:02:06] Jun 3 13:26:46 kafka2002 eventlogging-service-eventbus[1996]: (MainThread) 500 POST /v1/events (10.192.16.169) 6.90ms
[14:04:45] that log warn also makes sense if you restarted a broker, that also just sounds like leadership change
[14:04:52] but, el should recover from that
[14:04:58] and it also shouldn't 500
[14:05:37] this is the first time that happened to me in many eventbus restarts, and the funny thing is that it went away after the first EL restart
[14:05:47] I then re-executed kafka-preferred-replica
[14:05:52] just to be sure
[14:06:01] but the problem cleared before that
[14:06:07] not sure how frequently we check though
[14:06:48] ottomata: also opened https://phabricator.wikimedia.org/T136933 for kafka1012
[14:10:38] cool, nice find
[14:22:53] oozie seems to be behind schedule
[14:22:55] Missing hdfs://analytics-hadoop/wmf/data/raw/webrequest/webrequest_text/hourly/2016/06/03/14/_IMPORTED
[14:30:18] mmmm webrequest-load-coord-text for 13:00 is refining
[14:30:55] * elukey was checking yesterday's oozie email
[14:31:16] * elukey is waiting for joal to laugh
[14:31:24] heh, oh?
[14:37:48] nono it's me being stupid, today I don't connect very well
[14:39:41] ottomata: do you have a minute for a puppet/nagios question?
[14:42:47] sho
[14:42:49] what's up?
[14:44:45] soo aqs100[456] are throwing icinga alerts and I wanted to clear them. One is trivial, namely check_procs checks for only one java proc while we have two
[14:44:59] then there is monitoring::service { 'cassandra-analytics-cql':
[14:45:12] from https://gerrit.wikimedia.org/r/#/c/250439/3/modules/cassandra/manifests/instance.pp it seems that we already have an instance-aware check
[14:45:17] by default with cassandra
[14:45:27] buuuut I am struggling to figure out how it works
[14:45:56] since ${listen_address} in aqs1004.yaml for example gets a different value for each instance
[14:46:19] mforns: Heya
[14:46:32] I checked the complete list of checks in icinga and it seems that we have alerts for both instances
[15:04:09] going to send a code review to remove those AQS alarms, I think that we already have cassandra ones
[15:04:53] elukey: they should be false positives, right?
[15:07:07] joal, hi! sorry no headphones
[15:07:38] np mforns :)
[15:07:53] what's up? :]
[15:07:56] do you want some brainbounce or is it not needed now?
[15:08:03] joal: I thought they were from today :P
[15:08:14] joal, yes please!
[15:08:17] batcave?
[15:08:20] mforns: sure !
[15:08:34] let me grab a banana, no batcave without food :P
[15:08:35] elukey: I have not seen them, do you mind showing me?
[15:08:42] sure mforns !
[15:09:43] joal: sorry I wrote a lot of things to you, what are you referring to? oozie or cassandra? I think the latter but I'd like to double check before starting to write :P
[15:09:52] sure elukey :)
[15:10:09] I left oozie behind, as you can guess :)
[15:10:17] elukey: cassandra errors in icinga?
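(The "change the number of procs" idea that comes up a bit further down amounts to giving check_procs a range instead of expecting exactly one java process; a hedged sketch, where the plugin path and thresholds are assumptions rather than the actual aqs puppet config.)

  # require exactly two cassandra JVMs on the multi-instance aqs hosts
  /usr/lib/nagios/plugins/check_procs -C java -c 2:2
  # or, more loosely, alert only when no java process is left at all
  /usr/lib/nagios/plugins/check_procs -C java -c 1: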
[15:10:55] joal: https://gerrit.wikimedia.org/r/#/c/292568/
[15:13:19] elukey: in batcave with mforns, I'll need some more explanation ;)
[15:13:43] joal: ops session atm, maybe after standup??
[15:13:48] I'll show everything to you
[15:13:52] sure elukey
[15:13:59] if Eric or Marko or Andrew don't -1 me before :P
[15:14:05] huhuhu
[15:18:16] oh elukey i'm sorry i missed your q.... reading now
[15:18:42] ottomata: don't worry, just sent a CR for you :P
[15:21:18] hmmm, elukey
[15:21:24] i think it would be good to keep the java checks
[15:21:29] you can change the number of procs
[15:21:32] and even make it say at least one
[15:21:55] ottomata: I thought the same but CQL will not be available if the java process is down
[15:22:11] each time we get two alerts for the same thing IMHO
[15:22:28] hm
[15:22:44] this is why I am proposing to remove them
[15:22:52] since we already have the cassandra CQL ones
[15:22:54] true. i dunno though, i think both are good. would be better if there were a way to make dependencies
[15:22:58] (multi-instance aware)
[15:23:10] I checked in icinga and they are working fine for all the aqs* hosts
[15:23:14] yeah
[15:23:18] it's probably ok
[15:23:53] elukey: better would be http://docs.icinga.org/latest/en/dependencies.html#definition
[15:23:56] but i don't know much about it
[15:24:12] ah nice!
[15:24:29] let's see what Erik and Marko think about these
[15:24:40] I am, like you, super skeptical about removing alerts
[15:24:44] I got your point :)
[15:25:14] elukey: also not sure if we do that at all anywhere else. not sure if our puppet icinga stuff supports it
[15:43:06] a-team: me and Andrew are following the ops sessions so we might join the standup with a couple of minutes of delay
[16:02:15] madhuvishy: standup!
[16:02:53] Analytics-Kanban: Enable pivot ui so non analytics engineers can query druid pageview data (poc) - https://phabricator.wikimedia.org/T136331#2353332 (Ottomata) a:Ottomata
[17:38:50] a-team: byyyeeeeeeee o/
[17:39:58] Bye elukey !
[17:40:02] Have a good weekend :)
[17:40:08] you too!! :)
[17:44:21] (PS1) Maven-release-user: Add refinery-source jars for v0.0.30 to artifacts [analytics/refinery] (jenkins-test) - https://gerrit.wikimedia.org/r/292594
[17:45:07] (CR) Madhuvishy: [C: 2 V: 2] "Merging to test branch" [analytics/refinery] (jenkins-test) - https://gerrit.wikimedia.org/r/292594 (owner: Maven-release-user)
[17:51:40] byyyeee
[17:52:04] ottomata: don't think the permissions are working somehow
[17:52:30] git review works with the user - branch exists
[17:52:36] but it can't push
[17:54:02] ok
[17:54:04] let's seeee
[17:54:29] madhuvishy: on refinery/source?
[17:54:44] haven't tested that one yet - let me try
[17:54:51] but refinery isn't working
[17:54:51] oh refinery ok
[17:55:06] what does it say when you push?
[17:55:35] https://www.irccloud.com/pastebin/6lRNqCMw/
[17:56:14] error: src refspec jenkins-test does not match any.
[17:56:17] you sure that's a perms problem?
[17:56:18] yeah
[17:56:19] what's the push command?
[17:56:27] git push origin jenkins-test
[17:56:53] the error seems to be saying that jenkins-test doesn't exist, dunno if that means locally or remote
[17:56:55] i guess remote?
[17:56:58] the branch exists - i merged things to it too
[17:57:02] hm
[17:58:03] i'm trying with my user from local
[17:58:14] k, i just added perms for analytics group to do that
[17:58:17] to push
[17:58:18] ja try
[17:58:44] oh actually, just did now
[17:59:00] ya cool trying
[17:59:15] madhuvishy: does jenkins-test exist locally?
[17:59:16] for you?
[17:59:28] yeah
[17:59:30] i fetched
[17:59:35] also exists in jenkins
[17:59:44] a-team, see you on monday! nice weekend to you :]
[17:59:52] bye mforns! happy weekend :)
[18:00:19] ottomata: yeah worked with my user
[18:00:44] madhuvishy: that error is from where? your local?
[18:00:52] or from the server where jenkins runs?
[18:00:56] ottomata: no from jenkins
[18:00:56] yeah
[18:01:05] aye, i think that the jenkins-test branch doesn't exist there locally
[18:01:11] it probably needs to be checked out first or something
[18:01:18] hmmm checking
[18:01:50] i got the same error locally
[18:01:58] before I made a local jenkins-test branch
[18:02:28] you might have to do
[18:02:34] git checkout -b jenkins-test origin/jenkins-test
[18:02:35] first
[18:02:41] or actually
[18:02:45] hmmm
[18:02:51] madhuvishy: no
[18:02:52] you don't need to
[18:02:59] because you want to push master to the jenkins-test branch, right?
[18:03:15] hmmm i guess
[18:03:23] let me try that
[18:05:12] ottomata: hmmm same thing
[18:05:55] madhuvishy: how are you pushing
[18:06:03] maybe you can try
[18:06:08] ottomata: https://integration.wikimedia.org/ci/job/analytics-refinery-update-jars/configure
[18:06:09] git push origin master:jenkins-test
[18:06:35] hmmm it's just running the bin/update-refinery-jars script
[18:06:56] i run it from my local, as myself, and it works fine
[18:07:08] let me try running from local as maven-release-user
[18:07:21] madhuvishy: https://git-scm.com/book/en/v2/Git-Branching-Remote-Branches#Pushing
[18:07:31] it is because you have jenkins-test locally
[18:07:37] jenkins does not have that locally
[18:07:51] in your push command
[18:08:24] jenkins-test is implicitly refs/heads/jenkins-test:refs/heads/jenkins-test
[18:08:25] you want
[18:08:33] ottomata: hmmm i'm configuring it to check the branch out
[18:08:36] oh?
[18:08:46] where?
[18:09:09] at least that is what i think it's supposed to do
[18:09:20] Configuration -> SCM -> Branches to build
[18:09:36] i set it to master now to see if it'd do master -> jenkins-test
[18:09:43] but it was $GIT_BRANCH before
[18:09:45] aye ok, hmmm
[18:09:57] a-team: sorry for the late arrival
[18:10:19] hm, madhuvishy but that is just building on that branch
[18:10:20] ja?
[18:10:23] not actually doing the push
[18:10:25] right?
[18:10:31] the push is done by the update jars script?
[18:10:32] ja?
[18:10:43] true
[18:10:50] but it's working off of the same workspace
[18:11:01] so i assumed it'd have the branch checked out
[18:11:05] aye, so you think by putting jenkins-test there it would check it out
[18:11:13] yes
[18:11:15] if that were the case I would think it would work
[18:11:20] git review works
[18:11:25] but, the error is pretty clearly indicating that jenkins-test doesn't exist locally
[18:11:34] madhuvishy: your git review command specifies HEAD
[18:11:40] yeah
[18:11:43] as the local ref
[18:11:48] so, your review is doing
[18:11:49] push
[18:11:54] HEAD to refs/for/jenkins-test
[18:12:00] but your normal push
[18:12:00] yup
[18:12:01] is doing
[18:12:26] push refs/heads/jenkins-test to refs/heads/jenkins-test
[18:12:31] right
[18:12:35] maybe
[18:12:42] and, i think it is pretty clearly saying that refs/heads/jenkins-test doesn't exist locally
[18:12:47] yup
[18:13:04] but it should. going to try specifying explicitly
[18:13:08] refs/heads/jenkins-test
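(Spelling out the refspec behaviour being discussed here; the first two commands are the ones from the channel, the HEAD variant is an extra illustration, not something that was run in the log.)

  # implicit form: works only if refs/heads/jenkins-test exists in the local clone,
  # because it expands to refs/heads/jenkins-test:refs/heads/jenkins-test
  git push origin jenkins-test
  # explicit source:destination forms that do not need a local jenkins-test branch
  git push origin master:jenkins-test
  git push origin HEAD:refs/heads/jenkins-test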
[18:13:59] hmm, i dunno, ok you can try, but i think you will get the same error
[18:16:51] yeah
[18:17:24] weird though because the other job has the same setup
[18:17:38] aaah
[18:17:47] ottomata: figured it out i think
[18:18:39] ah ja?
[18:20:22] ottomata: there's another setting to check out to a specific local branch
[18:20:25] i'd missed it
[18:20:48] https://github.com/wikimedia/analytics-refinery/commit/494ef55b52848fdac1e294e76cefc2536d5f9a14
[18:20:52] phew
[18:22:33] nice!
[18:24:47] ottomata: should probably remove analytics push from refinery though
[18:24:51] it's a broader group
[18:24:55] Analytics-devs is us
[18:25:17] 'k
[18:25:31] thanks a ton!
[18:25:32] done
[18:38:44] HaeB: yt?
[18:39:53] hi
[18:40:21] Analytics, Wikipedia-Android-App-Backlog: Investigate recent decline in views and daily users - https://phabricator.wikimedia.org/T132965#2353918 (Tbayer) Open>Resolved a:JAllemandou Cool, thanks! I agree that the Android DAUs look plausible now, so let's close this ticket. (For the record, t...
[18:46:54] HaeB: regarding projects like lazy loading or others that might have a performance impact (not sure if this one does, just an example)
[18:47:39] HaeB: i was thinking that given that all over the web faster loading times are linked to higher conversion and longer sessions (in terms of pageviews)
[18:48:11] HaeB: we could measure that with a daily average of pageviews/project per country (desktop/mobile split)
[18:48:56] HaeB: We can track that daily average for a while; projects with a major impact on perf or loading times should then be visible per project and country as longer sessions (in pageviews, not time)
[18:50:27] HaeB: kind of a condensed explanation, hopefully it makes sense
[18:54:20] nuria_: yes absolutely, the problem though is how to assess whether a change was due to performance increases, or random variation, or other events
[18:54:55] HaeB: random would be easy to see if daily averages are calculated for a period of month
[18:54:58] *months
[18:55:04] there are some statistical techniques that could help with that, but one would need to invest some work to explore that (i intend to spend a bit of time on this relatively soon). see also e.g. https://meta.wikimedia.org/wiki/Research:Newsletter/2015/September#Predicting_Wikimedia_pageviews_with_98.25_accuracy
[18:55:20] HaeB: as in our series is daily averages per project per country (desktop/mobile split)
[18:55:39] HaeB: ya, i have seen that
[18:55:41] (i'm not focusing on the lazy loading analysis though, i think dr0ptp4kt and/or jdlrobson will look at that)
[18:55:53] HaeB: ya, that was just an example
[18:56:12] HaeB: I have seen the "predicting pageviews" but that is actually (i think) a harder problem
[18:56:47] HaeB: pageviews *I think* would fluctuate much more than pageviews by device by country, as teh latter speaks to connection speeds and browsing habits
[18:57:04] *the
[18:57:34] ... and in any case it will almost always be inferior to actual A/B testing, so i appreciate your proposal at https://phabricator.wikimedia.org/T135762 ;)
[18:57:51] HaeB: inferior meaning ..?
[18:58:22] less likely to produce reliable insights
[18:58:57] HaeB: less likely yes, but actually quite likely, because in a way that metric "normalizes" pageviews
[18:59:28] HaeB: well... normalizes...
it reduces the effect on pageviews that a viral link on NYT will have
[19:00:38] HaeB: my larger point is that I think this metric would be a lot more useful than this one: https://office.wikimedia.org/wiki/Engagement_metrics#Pageviews_per_session
[19:02:39] nuria_: i think there are separate questions: what metric to measure (metrics, or something more session-based that speaks more to reader retention), and how to determine if changes had an effect on that metric
[19:02:57] we also don't want to optimize for the wrong metric
[19:03:44] HaeB: agreed, but "pageviews per session" without a country split will just give you a histogram that will correlate with loading times
[19:04:08] HaeB: needs to be proven but there is plently literature about that
[19:04:12] *plenty
[19:04:39] (also, as mentioned in the meeting this week, that page contains preliminary notes by kevin from a while ago.. it might not be precisely what Reading will go with in the end)
[19:05:21] ok, it may be useful to have that country split if we go with that metric, i agree
[19:06:04] HaeB: also (and i just thought about that)
[19:06:26] HaeB: the pageviews/unique devices per country will help you measure teh impact of hovercards on pageviews
[19:06:28] *the
[19:06:47] HaeB: especially since you only need to look at desktop data
[19:07:13] HaeB: pageviews/unique devices per project per country, that is
[19:07:23] HaeB: as hovercards are launched only in specific projects
[19:08:36] well yes, the per project data would of course be more directly relevant to that than the per country data
[19:08:49] joal and nuria_ : please check https://meta.wikimedia.org/w/index.php?title=Dashiki:PageviewsAnnotations&diff=15672604&oldid=15595518
[19:08:50] HaeB: per project per device
[19:09:10] HaeB: looks good
[20:42:49] madhuvishy: did you talk to teh outreach folks?
[20:42:50] *the
[21:05:45] (PS1) Nuria: Updating vital-signs dashboard on analytics.wikimedia.org [analytics/analytics.wikimedia.org] - https://gerrit.wikimedia.org/r/292636
[21:06:21] (CR) Nuria: [C: 2 V: 2] "Self-merging update of build" [analytics/analytics.wikimedia.org] - https://gerrit.wikimedia.org/r/292636 (owner: Nuria)
[23:11:17] nuria_: No, i emailed him today
[23:11:30] the query doesn't work with the filter for pagetitles
[23:11:34] it works without
[23:11:53] not sure why it won't work with the filter, but I guess it's ok as a starting point for them