[15:11:32] morning guys
[15:11:45] ottomata, i am going to write an email to announce the log changes, http://etherpad.wikimedia.org/logchange
[15:12:34] mooorniiing
[15:12:44] ok cool
[15:12:49] this patchset does not have tabs in it
[15:13:23] i'm writing a big email about missing data from udp2log kafka producers
[15:13:26] :(
[15:19:28] k
[15:20:43] i know the patch set you are referring to introduces the changes, but we should also update the filters and change the output filename so we can easily detect whether a file uses the old or new format
[15:24:11] ok
[15:50:21] geocode now appends to the end of the line?
[15:50:37] drdee?
[15:50:58] not yet, but would like to enable this
[15:51:06] oh wait
[15:51:11] in udp-filter?
[15:51:11] yes
[15:51:12] it does
[15:51:21] stefan fixed that
[15:51:54] so, if geocoding is on, it will be the 17th field?
[15:52:32] yes
[15:52:34] i think i'd also like to avoid doing the geocode change right now, sounds like it will cause more trouble too
[15:52:54] does that also mean that if we add another field in the future
[15:53:05] then the geo field will change to the 18th field?
[15:53:08] yes
[15:53:15] so it will always be the last field on a line
[15:53:24] hmm, okkkkkkkkkk, seems weird but ok
[15:53:28] it is weird
[15:54:06] or maybe only enable it for sampled-1000, because it gives a massive performance improvement for wikistats
[15:56:52] or, with the geocoding: do not enable it in the filter
[15:57:15] but do it right before the cronjob that copies the file to stat1
[15:57:34] and then only for sampled-1000
[15:57:57] that way we don't have issues with the filters but we still get the performance gains for wikistats
[16:06:31] morning milimetric
[16:06:39] the cron is copying the zipped files
[16:06:43] but we could geocode on stat1
[16:06:52] in a cron, and rezip and save each file
[16:07:04] maybe that's the 'best' way of doing it
[16:07:35] in that case it's probably easiest to enable it for all log files, or not?
[16:08:20] could be
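A minimal sketch of the post-hoc approach discussed above: geocode the sampled-1000 file on stat1, appending the country code as the last field, then rezip. The file paths, the client-IP field position, and the lookup_country helper are all hypothetical placeholders, not the real udp-filter logic:

```python
import gzip

def lookup_country(ip):
    # Hypothetical placeholder for a real GeoIP lookup
    # (e.g. a MaxMind country database).
    return 'XX'

src = 'sampled-1000.tsv.log-20121228.gz'      # hypothetical input file
dst = 'sampled-1000.geo.tsv.log-20121228.gz'  # hypothetical output file

with gzip.open(src, 'rt') as fin, gzip.open(dst, 'wt') as fout:
    for line in fin:
        fields = line.rstrip('\n').split(' ')
        client_ip = fields[4]  # assumed position of the client IP field
        fields.append(lookup_country(client_ip))  # geo stays the last field
        fout.write(' '.join(fields) + '\n')
```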
[16:11:14] are you guys doing fireworks for new year's eve?
[16:11:33] milimetric, ottomata ^^
[16:11:41] hey :)
[16:11:49] no
[16:11:53] naw, prob not
[16:11:55] I'm gonna be doing miracle berries
[16:12:01] haha
[16:12:16] http://en.wikipedia.org/wiki/Synsepalum_dulcificum
[16:12:43] i've got big plans to grow some
[16:12:56] and help them get FDA approved
[16:13:29] here in canada you can't have fireworks
[16:13:38] which is a massive disappointment for me
[16:13:42] i don't think i ever heard of fireworks on new year's
[16:13:49] new year's eve == fireworks for me
[16:13:56] REALLY??????
[16:14:03] yea
[16:14:06] OHHH MANN!
[16:14:10] is it a Dutch thing?
[16:14:16] you should go to holland for new year's eve
[16:14:27] it's pretty insane
[16:14:30] cool, will remember that
[16:14:40] hey, if you like fireworks, I got a place for you
[16:14:47] Weehawken, NJ
[16:15:03] ok, going back to my parenting day :D
[16:15:08] it's right across the river from Manhattan and it's the most beautiful fireworks show on earth
[16:15:10] i'll check it out on the webz
[16:15:15] COOL!
[16:15:17] (July 4th)
[16:15:32] ottomata can probably testify to that, though maybe there are some spots in brooklyn too
[16:15:53] have fun parenting :)
[17:29:10] Any chance someone might be able to get a most-viewed list of articles on English Wikinews for 2012?
[17:29:18] * purplepopple can't recall where that data is found
[17:36:11] this might be a good place to look:
[17:36:12] http://stats.wikimedia.org/wikinews/EN/Sitemap.htm
[17:36:17] not sure if what you are looking for is there though
[17:38:24] ottomata: Looking for an article list. :/
[17:43:16] http://stats.grok.se/en.n/top Something like that, but more recent.
[17:44:53] ottomata, is tomorrow your birthday?
[17:45:11] yessuh!
[17:45:13] big ol' 30!
[17:45:26] happy birthday, sir!
[17:45:47] i have to say this now, before i create some enormous, awful pile of work for you
[17:45:56] brb a moment
[17:46:37] hmm, purplepopple, this might be helpful:
[17:46:37] http://dumps.wikimedia.org/other/pagecounts-ez/
[17:47:00] i think the data you are looking for should be in those files, but you'd have to crunch it yourself
[17:47:10] it'll contain pageview counts per url
[17:48:18] * purplepopple nod nods
[17:49:10] not really sure what you're looking for, but there's also http://stats.wikimedia.org/wikimedia/requests/reports/2012-11/most-requested-pages-2012-11_wikipedia-EN.html
[17:49:18] http://dumps.wikimedia.org/other/pagecounts-ez/monthly/ is missing 2012 entirely.
[17:49:31] beautiful.
[17:49:32] that is it. :)
[17:49:35] Thank you dschoon
[17:49:37] np
[17:49:45] i believe there are comparable pages for other wikis
[17:49:50] just replace the language code at the end
[17:49:55] i believe it's top 1000
[17:50:25] Thanks muchly. :)
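For the "crunch it yourself" route, a rough sketch of tallying top English Wikinews pages from pagecounts files, assuming the usual "project page count bytes" line format and the 'en.n' project code; the filename is a hypothetical example:

```python
import gzip
from collections import Counter

# Hypothetical list of hourly pagecounts files to aggregate.
files = ['pagecounts-20121228-000000.gz']
counts = Counter()

for path in files:
    with gzip.open(path, 'rt', errors='replace') as f:
        for line in f:
            parts = line.split(' ')
            # 'en.n' is English Wikinews; format: project page count bytes
            if len(parts) == 4 and parts[0] == 'en.n':
                counts[parts[1]] += int(parts[2])

for page, views in counts.most_common(25):
    print(views, page)
```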
[17:50:56] drdee: do you know of any method to get view counts from the commons players?
[17:52:30] mwalker: players?
[17:52:45] sorry -- that would be singular player
[17:52:53] (i think drdee is more or less perma-afk for the next week. he's on vacation, and has family visiting from the netherlands)
[17:53:00] aaah
[17:53:02] you mean the thankyou videos?
[17:53:05] yep :)
[17:53:29] we're attempting to see if people are actually watching them in numbers that mean anything
[17:53:46] *nod*
[17:53:54] well, first and most obvious question
[17:54:14] does playing a video entail a specific pageview?
[17:54:19] or does it use JS or whatever?
[17:54:27] (i assume the former, but gotta check)
[17:54:54] not actually sure how it works -- it might be counted as a pageview on commons/upload
[17:55:10] well, it's easy to test
[17:55:16] go watch a video
[17:55:20] and see if the URL changes :)
[17:58:17] ah -- duh. from the banner we don't load the commons player directly because we didn't know if that would break the servers :p so! we're looking for clicks to http://wikimediafoundation.org/wiki/Thank_You_Main?commons=true
[17:58:47] thanks for the hammer to the head -- I'll go bug jeff_green about getting that
[18:00:04] gotcha :)
[18:00:15] we can discuss more in a few, after scrum
[18:00:16] brb
[18:00:43] https://plus.google.com/hangouts/_/2e8127ccf7baae1df74153f25553c443bd351e90 ottomata
[18:21:45] brb 10 avec coffee/bagel
[18:37:37] back
[18:40:10] gonna try to finish maps in limn while waiting on ottomata
[18:41:56] i'm actually talking with robla
[18:46:36] hey ottomata, excellent detective work.. on which machine did the restarts happen?
[18:49:48] hey, in 1on1 with robla, we're kind of looking at this too
[18:49:49] but
[18:49:55] on analytics1003, 1004, 1005, and 1006
[18:51:39] i don't have access to those machines :/
[19:05:42] you do!
[19:05:51] analytics1003.eqiad.wmnet
[19:06:00] dschoon, i'm sorta done with robla 1on1
[19:06:04] we talked a bit about this problem
[19:06:10] got some ideas...
[19:06:12] looking
[19:06:15] okay.
[19:06:36] should we hangout to collab?
[19:09:20] I think Andrew's got some data to gather at this point
[19:09:54] yeah, hangout. i just looked, robla, and it's not convincing: I see data in the zero logs where we thought there'd be none
[19:10:02] dschoon, will hangout and explain where I am
[19:10:10] ori-l, you are welcome to join if you want to
[19:10:43] cool.
[19:10:44] https://plus.google.com/hangouts/_/2e8127ccf7baae1df74153f25553c443bd351e90
[19:11:06] milimetric: you want to have 1:1 now, or would you rather I chip in on the udp2log thing?
[19:11:20] oh now's good
[19:11:26] i was on the hangout
[19:11:27] oh!
[19:11:31] you sent a different link
[19:11:32] sry
[19:12:44] you coming dan?
[19:12:46] er, milimetric
[19:13:14] i'm meeting with robla
[19:13:28] ok
[19:22:09] DarTar: try logging into hue again
[19:22:28] hey
[19:22:32] hold on
[19:23:52] still no permission to access specific components (like the shell) after I sign in
[19:24:06] hey
[19:24:09] lemme see
[19:24:12] hey ori-l
[19:24:17] ottomata, that is
[19:24:20] hey
[19:24:25] DarTar, can you log out again
[19:24:29] and try to log in with your lowercase name?
[19:24:33] the same one you use for shell login
[19:24:34] ?
[19:24:36] sure
[19:25:08] hm, signing out gives a 500 right now
[19:25:15] hah, i think that's cause I deleted your user
[19:25:17] just try to sign in then
[19:25:54] http://hue.analytics.wikimedia.org/ gives a 500 too
[19:25:59] after logging out
[19:26:24] hahaha
[19:26:25] oops
[19:26:29] um
[19:27:01] and now?
[19:27:03] DarTar?
[19:27:17] better, hold on
[19:27:20] (I should not have deleted your user while you were logged in)
[19:27:58] logged in as dartar (lowercase), still no permission to access the shell
[19:28:39] I can only access the file browser
[19:31:55] dartar, are you using hue.analytics name-based access, or the internal analytics1027.eqiad proxy access?
[19:33:15] fyi, for those playing along at home: http://etherpad.wmflabs.org/pad/p/MagicalMissingKrakenData
[19:35:18] ottomata: I am using my labs login, and I have hue.analytics.wikimedia.org set up in my /etc/hosts file
[19:36:19] ottomata: not sure that answers your question
[19:36:55] ok yes
[19:36:56] ok
[19:37:08] try this one
[19:37:08] http://hue.analytics.wikimedia.org/filebrowser/view/wmf/raw
[19:37:12] can you view that?
[19:37:47] 1 min, brb
[19:38:36] * DarTar shaking a fist at those captchas
[19:39:28] 4th attempt, no seriously
[19:39:35] oops, wrong channel
[19:39:37] :)
[19:43:58] ottomata, dschoon:
[19:44:10] ?
[19:44:11] else if (e.code == EPIPE) {
[19:44:11] +   std::cerr << "Pipe terminated, suspending output: " << command << std::endl;
[19:44:11] +   restart = true;
[19:44:13] + }
[19:44:22] so, udp2log is getting a SIGPIPE
[19:44:27] SIGPIPE
[19:44:32] everyone's favourite signal!
[19:44:36] right
[19:44:42] guess whose fault
[19:44:42] so uh
[19:44:49] hmm.
[19:44:50] hint: early modernist bohemian writer
[19:45:04] i vote we blame the french
[19:45:08] ok :)
[19:46:13] http://stackoverflow.com/questions/8369506/why-does-sigpipe-exist
[19:46:13] heh
[19:47:36] ori-l: that code snippet is from udp2log?
[19:47:46] yeah
[19:48:01] i'm writing in the etherpad
[19:48:07] making sure i understand -- you're saying that it's being sent SIGPIPE because its downstream FD is being closed
[19:48:14] ah, awesome
[19:55:39] ty ori-l
[19:55:50] to follow up on the PIPE thing
[19:56:00] java has weird semantics around stdin/stdout
[19:56:09] it tends to be more aggressive about timeouts while waiting for streaming data
[19:56:14] so i don't think it's due to a crash
[19:56:41] but either way, i don't think the producer being restarted would cause an hour's worth of data to vanish
[19:56:42] on which machine did this error occur?
[19:56:49] ottomata?
[19:56:59] analytics1003, 1004, 1005 and 1006
[19:57:03] it's not just an hour, it's large chunks, usually more than 5 hours
[19:57:31] exactly.
[19:57:54] a restarted process does not result in 5 hours of data loss unless there's a bigger, systemic problem
[19:58:06] and that problem also needs to repeat ~daily at 7:30 UTC
[20:00:45] actually
[20:00:46] hm!
[20:01:10] ori-l: do you think it's possible for the FD to be closed but not signal the upstream?
[20:01:40] i'm not sure what you mean
[20:02:09] i'm thinking that "restarting the downstream process" isn't an atomic step
[20:03:04] the downstream could be non-responsive (like waiting to close or something) but not yet closed
[20:03:22] udp2log would be waiting to restart it, and would be sending data to... what? the void? a closed FD?
[20:08:27] sorry, debugging eventlogging issue on #wm-e3, bbiab
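For reference, a tiny Python demonstration of the failure mode in the udp2log snippet above: a writer hits EPIPE once its downstream pipe reader goes away. Here 'head -n 1' stands in for a dead kafka-produce child; this is an illustration, not udp2log's actual code:

```python
import errno
import subprocess

# 'head -n 1' reads one line and exits, standing in for a dead
# downstream kafka-produce process.
child = subprocess.Popen(['head', '-n', '1'], stdin=subprocess.PIPE)
try:
    while True:
        child.stdin.write(b'a log line\n')
        child.stdin.flush()
except OSError as e:
    if e.errno == errno.EPIPE:
        # The condition udp2log maps to "Pipe terminated ...; restart = true".
        # (Python ignores SIGPIPE, so the write surfaces as EPIPE instead.)
        print('got EPIPE: downstream is gone; udp2log would restart it here')
    else:
        raise
```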
[20:55:24] dschoon, ottomata: sorry for not following up on hue, the temporary conclusion is that pv data obtained via stats.grok.se (which aggregates domas' sampled pv data) are totally unreliable for some special pages
[21:09:54] or maybe not
[21:12:00] hm
[21:12:16] that's annoying. we'll see what we can do once we get to the bottom of the 0-byte imports
[21:12:24] brb lunch + post office
[21:39:56] DarTar, i don't think that webstatscollector (which powers stats.grok.se) collects special pages
[21:40:17] hoi drdee
[21:40:31] it actually does
[21:40:46] but we unpacked the internal logic of account creations
[21:40:56] and the points at which the squid servers are hit
[21:41:06] so problem solved
[21:42:16] ok
[21:54:34] phew, man, dschoon
[21:54:50] TOOK me long enough to find a consumer client I could specify the offsets directly to
[21:54:51] but
[21:59:22] https://gist.github.com/4402327
[22:01:13] cool -- i'll check it out!
[22:04:07] k, i wrote this up in the etherpad
[22:10:21] awesome
[22:10:38] eating atm, then i'll check it out.
[22:12:35] just emailed about it
[22:12:37] i'm outty
[22:12:42] latas boys, thanks for the sleuthing help today
[22:12:44] sounds good.
[22:12:49] have a great birthday!
[22:13:33] happy b-day, ottomata
[22:14:19] dschoon:
[22:14:21] [2012-12-18 16:43:11,655] 86721113 [ProducerSendThread-742094445] WARN kafka.producer.async.DefaultEventHandler - Error sending messages, 0 attempts remaining
[22:14:21] [2012-12-18 16:43:11,656] 86721114 [ProducerSendThread-742094445] ERROR kafka.producer.async.ProducerSendThread - Error in handling batch of 4 events
[22:14:21] [2012-12-18 16:43:11,657] 86721115 [ProducerSendThread--1291124682] WARN kafka.producer.async.DefaultEventHandler - Error sending messages, 0 attempts remaining
[22:14:23] [2012-12-18 16:43:11,658] 86721116 [ProducerSendThread--1291124682] ERROR kafka.producer.async.ProducerSendThread - Error in handling batch of 4 events
[22:14:30] interesting.
[22:14:36] which host is that?
[22:14:42] analytics1003:/var/log/kafka/kafka.log
[22:14:45] oh, that's 12-18
[22:14:57] ignore everything before 12/21-ish
[22:15:21] we made a bunch of configuration changes to support new streams and to mod things for udp2log
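A quick sketch for checking whether these producer errors recur in the suspect windows (e.g., on the 27th): bucket ERROR/WARN lines in kafka.log by hour. The timestamp format is taken from the paste above; everything else is an assumption:

```python
import re
import sys
from collections import Counter

# Usage: python kafka_err_hours.py < /var/log/kafka/kafka.log
# Timestamps look like: [2012-12-18 16:43:11,655]
ts = re.compile(r'^\[(\d{4}-\d{2}-\d{2} \d{2}):\d{2}:\d{2}')
per_hour = Counter()

for line in sys.stdin:
    if ' ERROR ' in line or ' WARN ' in line:
        m = ts.match(line)
        if m:
            per_hour[m.group(1)] += 1

# Any hour here that matches a 0-byte import window is worth a close look.
for hour, n in sorted(per_hour.items()):
    print(hour, n)
```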
[22:17:23] ah
[22:18:33] that's a good find, though
[22:19:01] if you see any of that on the 27th, that'd be a big deal
[22:19:03] can you chmod a+r syslog?
[22:19:11] syslog.* rather
[22:19:22] or copy them somewhere and make the copies world-readable?
[22:19:27] sure
[22:19:28] sec
[22:20:39] Opened pipe with factor 1: /usr/bin/awk '{if ($2 % 4 == 1) print $0; }' | /usr/bin/udp-filter -i |/opt/kraken/bin/kafka-produce webrequest-wikipedia-zero 9953 > /dev/null
[22:20:40] terminate called after throwing an instance of 'libc_error'
[22:20:40] what(): Socket::Bind: Address already in use
[22:20:41] Error: Exception thrown by the agent : java.rmi.server.ExportException: Port already in use: 9950; nested exception is:
[22:20:44] java.net.BindException: Address already in use
[22:20:59] when was that?
[22:21:14] that's the last line of analytics1004:/var/log/udp2log/udp2log.log
[22:21:20] not sure when it's from; it isn't timestamped.
[22:21:27] all syslogs should be a+r now.
[22:21:31] hrm
[22:21:33] thanks
[22:21:43] (we really need to be better about timestamps)
[22:22:00] er, lots of java errors in the udp2log log
[22:22:18] i think ottomata went through that, though
[22:22:22] olivneh@analytics1004:/var/log/udp2log$ grep java udp2log.log | wc -l
[22:22:22] 1002
[22:22:36] gross.
[22:23:36] (eating -- will look in a bit)
[22:24:48] they group into two errors: can't bind to port, can't write to kafka.log
[22:24:52] so probably unrelated
[22:25:06] syslog: Dec 28 22:20:37 analytics1004 nslcd[19342]: [7ca2a2] error writing to client: Broken pipe
[22:25:33] that's LDAP though, lord knows if that's related.
[22:25:53] i'm just pasting stuff here that's potentially relevant; much will be noise, most likely.
[22:26:12] *nod*
[22:26:21] hm.
[22:26:26] oh, syslog?
[22:26:37] yeah, there's just going to be a metric shitton of random things in there
[22:27:41] oh, I've been forgetting to say it all day
[22:27:46] Happy Birthday ottomata!
[22:27:47] :)
[22:28:42] ottomata's diligent about logging tasks in asana, so you could look at the recently completed stuff to get a feel for where some of the loglines/exceptions are coming from
[22:28:43] https://app.asana.com/0/828917834272/828917834272
[22:29:23] LDAP-related stuff is from the HDFS integration he got working earlier in the week
[22:30:56] is storm potentially implicated in this?
[22:31:01] just trying to wrap my head around your setup
[22:31:02] no.
[22:31:14] i believe it's set up but not running.
[22:31:28] (and even if it does have a process, it's idle.)
[22:31:34] storm is screwing with us *from the future* :-)
[22:31:38] can you explain the basic data flow?
[22:31:42] howdy robla
[22:31:43] haha
[22:31:54] the current dataflow?
[22:32:11] squid -> udp -> ....
[22:32:54] let's put it in the etherpad
[22:33:15] * ori-l heads to the ether
[22:33:23] starting line 69
[22:51:21] dschoon: is there a hole in /var/log/udp2log/webrequest/packet-loss.log ?
[22:51:37] a hole?
[22:52:11] that should have a continuous record of stuff delivered to analytics100x, no?
[22:52:39] andrew didn't get around to giving me access to analytics1001 so I can't look
[22:57:00] i think that's a good idea, ori
[22:57:13] inject synthetic traffic into oxygen
[22:57:31] i might even be able to dig up the script i used for eventlogging :)
[22:57:37] woo
[22:57:50] umm....don't we have other things relying on oxygen?
[22:58:24] robla: looking
[22:58:40] we do.
[22:59:08] but the filters exist to ensure that only data fitting the pattern ends up in the various client files
[22:59:34] so a synthetic URL will only end up in the 1:100 sampled stream heading to kraken
[22:59:45] my hunch is that this is what is going on:
[23:00:06] something like, say, http://kraken.analytics.wikimedia.org/$SEQ_ID
[23:00:18] (remember we have to overwhelm the sampling)
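A hedged sketch of that synthetic-traffic idea: request a greppable URL per sequence ID, repeating each one enough times to overwhelm the 1:100 sampling, then grep the imported files for gaps in the IDs. The repeat count is illustrative only, and this should not be pointed at production without coordinating:

```python
import urllib.request

BASE = 'http://kraken.analytics.wikimedia.org/%d'  # marker URL from above
REPEATS = 200  # comfortably above the 1:100 sampling rate

for seq in range(100):
    for _ in range(REPEATS):
        try:
            urllib.request.urlopen(BASE % seq, timeout=2).close()
        except OSError:
            pass  # 404s and timeouts are fine; the request still gets logged

# Afterwards: grep the imported hourly files for the base URL, collect the
# sequence IDs that made it through, and any gap marks a lossy window.
```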
[23:00:25] kafka can't keep up, udp2log restarts it, kafka doesn't shut down the socket or set SO_REUSEADDR or whatever, and the new instance can't bind
[23:00:48] you have a lot of this all over the logs:
[23:00:50] Opened pipe with factor 1: /opt/kraken/bin/kafka-produce event-unknown 9940 > /dev/null > /dev/null
[23:00:50] terminate called after throwing an instance of 'libc_error'
[23:00:50] what(): Socket::Bind: Address already in use
[23:00:52] my suspicion is that we're not going to see another 0-byte event
[23:01:28] ori-l: that's a really good theory
[23:01:47] since it appears to have happened regularly for a while, and stopped once we switched the mobile import to 15m windows
[23:02:05] *nod*
[23:02:12] if the socket is in use, that's definitely a problem
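For what it's worth, the usual fix for that symptom is setting SO_REUSEADDR before bind(), so a restarted process can rebind its port even while the old socket lingers in TIME_WAIT. A minimal illustration, not the actual kafka-produce code:

```python
import socket

sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
# Without this, a bind() right after a crash/restart can fail with
# "Address already in use" while the old socket sits in TIME_WAIT.
sock.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
sock.bind(('0.0.0.0', 9953))  # port taken from the udp2log pipe config above
sock.listen(5)
print('bound ok; an immediate restart would now succeed too')
```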
[23:02:27] i wish we had timestamps on those exceptions
[23:02:36] there are
[23:02:41] on analytics1001
[23:02:45] when was the most recent one?
[23:02:48] well, you have log lines immediately above and below
[23:02:54] ...an01 doesn't run anything related to this.
[23:03:03] three hours ago
[23:03:33] it doesn't even run a data node
[23:04:06] ok, let me look at one of the other machines
[23:04:20] though i'm not sure what you mean by "doesn't run anything related to this" when the error cites kafka and udp2log
[23:04:24] an0{3-6} are the udp2log -> kafka bridge
[23:04:41] dunno man.
[23:04:48] this is almost certainly a problem, whether or not it's THE problem
[23:04:57] you also see the kafka bind errors are bunched together
[23:05:04] but unless andrew changed something, an01 shouldn't be involved at all.
[23:05:16] which further supports the idea that it's dying repeatedly until the OS frees the socket
[23:05:27] *nod*
[23:05:29] oh.
[23:05:49] that's the pixel-service prototype logger
[23:06:08] i bet one of the recent conf changes broke it
[23:06:19] and now it's stuck in some awful zombie state
[23:06:19] analytics1004 has this as well
[23:06:21] from a few hours ago
[23:06:43] probably when we bounced things to change the mobile cron to 15m
[23:07:03] that was around 11am
[23:07:35] you should add your theory to the "to investigate" section at the top of the pad
[23:07:45] (about the socket errors, et al)
[23:09:10] there are some from 9 hours ago, so 6 AM PST ish
[23:09:54] hmm
[23:10:10] okay, we'll definitely need to pursue that, then
[23:10:25] brb
[23:10:34] i'll continue with this in an hour or so -- need to run an errand
[23:20:41] https://gist.github.com/4403011
[23:21:04] ^ quick-and-ugly python script you can pipe udp2log.log files into to get the timestamp immediately preceding each exception
[23:22:36] and output, when run against analytics1004:/var/log/udp2log/udp2log.log: http://dpaste.de/Cyn7Z/
[23:24:20] robla: i'm about to dash; are there log files you want from the an machines?
[23:26:01] ori-l: that's ok. I realized after the fact that the packet loss processor would write to the log regardless
[23:28:11] we may want to consider adding some sort of absolute-number logging to the packet-loss monitor
[23:29:00] right now it writes "[2012-12-28T07:54:24] amssq37.esams.wikimedia.org lost: (-0.41464 +/- 0.75163)%", which is only its estimate. we're not getting the raw numbers used to derive that calculation
[23:31:49] ah, yes
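Given the packet-loss.log format quoted above, a small sketch that both answers the earlier "is there a hole?" question and pulls out the loss estimates; the 10-minute gap and 5% loss thresholds are arbitrary assumptions:

```python
import re
import sys
from datetime import datetime

# Usage: python check_loss.py < packet-loss.log
# Lines: [2012-12-28T07:54:24] amssq37.esams.wikimedia.org lost: (-0.41464 +/- 0.75163)%
pat = re.compile(r'^\[(\S+)\] (\S+) lost: \(([-\d.]+) \+/- [\d.]+\)%')
last = None

for line in sys.stdin:
    m = pat.match(line)
    if not m:
        continue
    t = datetime.strptime(m.group(1), '%Y-%m-%dT%H:%M:%S')
    # Entries are per-host, so a long silence across all hosts is a real hole.
    if last is not None and (t - last).total_seconds() > 600:
        print('HOLE: no entries between', last, 'and', t)
    last = t
    if float(m.group(3)) > 5.0:  # arbitrary alert threshold
        print('high loss:', line.rstrip())
```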