[05:57:13] (CR) Nuria: [C: 2 V: 2] "Really looking forward to see this data." [analytics/limn-mobile-data] - https://gerrit.wikimedia.org/r/155637 (owner: BearND)
[05:57:18] (Merged) jenkins-bot: Add SQL for Apps - Impressions on edit pencil (per day) [analytics/limn-mobile-data] - https://gerrit.wikimedia.org/r/155637 (owner: BearND)
[06:07:02] holaaa YuviPanda!
[11:41:44] nuria: heya! sorry, was away
[13:32:46] (PS1) Ottomata: Modify check_ip to consider internal proxied requests as valid [analytics/webstatscollector] (kafka) - https://gerrit.wikimedia.org/r/155709
[13:33:58] (PS1) Ottomata: Modify check_ip to consider internal proxied requests as valid [analytics/webstatscollector] - https://gerrit.wikimedia.org/r/155710
[13:34:20] (Abandoned) Ottomata: Modify check_ip to consider internal proxied requests as valid [analytics/webstatscollector] - https://gerrit.wikimedia.org/r/155710 (owner: Ottomata)
[13:35:45] GAHHH
[13:36:24] (PS2) Ottomata: Modify check_ip to consider internal proxied requests as valid [analytics/webstatscollector] (kafka) - https://gerrit.wikimedia.org/r/155709
[13:36:39] that worked, phew
[13:38:26] hehe
[13:38:40] ottomata, MAY be coming up to NY to cowork when Leila does, btw
[13:38:43] or down, I guess?
[13:49:18] (CR) Milimetric: "I see this is merged but I don't see the output datafile here: http://datasets.wikimedia.org//limn-public-data/mobile/datafiles/ so I assu" (3 comments) [analytics/limn-mobile-data] - https://gerrit.wikimedia.org/r/155637 (owner: BearND)
[13:52:47] eh, you may be? Ironholds?
[14:17:57] (CR) QChris: [C: -1] Modify check_ip to consider internal proxied requests as valid (1 comment) [analytics/webstatscollector] (kafka) - https://gerrit.wikimedia.org/r/155709 (owner: Ottomata)
[14:23:13] (PS1) Milimetric: Update for July - follow up [analytics/reportcard/data] - https://gerrit.wikimedia.org/r/155719
[14:23:31] ah, qchris, that was leftover debugging, thanks
[14:23:41] (CR) Milimetric: [C: 2] Update for July - follow up [analytics/reportcard/data] - https://gerrit.wikimedia.org/r/155719 (owner: Milimetric)
[14:23:49] (CR) Milimetric: [V: 2] Update for July - follow up [analytics/reportcard/data] - https://gerrit.wikimedia.org/r/155719 (owner: Milimetric)
[14:26:30] (PS3) Ottomata: Modify check_ip to consider internal proxied requests as valid [analytics/webstatscollector] (kafka) - https://gerrit.wikimedia.org/r/155709
[14:29:02] (CR) QChris: [C: 2 V: 2] "Since it's a one off, we do not care about style at" [analytics/webstatscollector] (kafka) - https://gerrit.wikimedia.org/r/155709 (owner: Ottomata)
[14:29:04] (Merged) jenkins-bot: Modify check_ip to consider internal proxied requests as valid [analytics/webstatscollector] (kafka) - https://gerrit.wikimedia.org/r/155709 (owner: Ottomata)
[14:31:00] hashar_: I really like the new layout of the Jenkins comments. Thanks!
[14:31:11] So pretty!
[14:33:04] qchris: and thanks a lot for the regex review!
[14:34:02] yw. That was the easy part. But the jenkins comments are so much more readable now. It's great.
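
For context: the diff for "Modify check_ip to consider internal proxied requests as valid" (r155709) is not quoted in this log, and per the "regex review" remark above the real change matched addresses with a regex. Below is only a minimal sketch of the idea, assuming a hypothetical is_internal_ip helper doing plain RFC 1918 range checks, not the actual webstatscollector code:

    /* Sketch only: r155709's actual regex-based check is not shown in the
     * log. is_internal_ip is a hypothetical helper; it treats private
     * (RFC 1918) source addresses, i.e. requests proxied by internal
     * caches, as valid. */
    #include <arpa/inet.h>
    #include <stdint.h>
    #include <stdio.h>

    static int is_internal_ip(const char *ip)
    {
        struct in_addr a;
        if (inet_pton(AF_INET, ip, &a) != 1)
            return 0;
        uint32_t h = ntohl(a.s_addr);
        return (h & 0xFF000000u) == 0x0A000000u    /* 10.0.0.0/8     */
            || (h & 0xFFF00000u) == 0xAC100000u    /* 172.16.0.0/12  */
            || (h & 0xFFFF0000u) == 0xC0A80000u;   /* 192.168.0.0/16 */
    }

    int main(void)
    {
        /* 10.64.21.103 is an internal eqiad address seen later in this log. */
        printf("%d\n", is_internal_ip("10.64.21.103"));  /* 1: internal  */
        printf("%d\n", is_internal_ip("208.80.154.224")); /* 0: public   */
        return 0;
    }
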
[14:34:04] (PS1) Milimetric: Tests no longer require high worker count [analytics/wikimetrics] - https://gerrit.wikimedia.org/r/155721
[14:34:47] (PS2) Milimetric: Lower test worker count [analytics/wikimetrics] - https://gerrit.wikimedia.org/r/155721
[16:26:06] (CR) Milimetric: [C: 2] Drop unused imports [analytics/wikimetrics] - https://gerrit.wikimedia.org/r/155644 (owner: QChris)
[16:37:00] Analytics / Tech community metrics: Wrong data at "Update time for pending reviews waiting for reviewer in days" - https://bugzilla.wikimedia.org/68436#c10 (Quim Gil) (In reply to Jeroen De Dauw from comment #7) > DataValues still has a copy on Gerrit?! This stuff was moved to GitHub ages > ago https:/...
[16:54:05] (PS6) Milimetric: Stop scheduling new recurrent runs if databases lag [analytics/wikimetrics] - https://gerrit.wikimedia.org/r/155003 (https://bugzilla.wikimedia.org/68507) (owner: QChris)
[16:54:13] (CR) Milimetric: Stop scheduling new recurrent runs if databases lag (1 comment) [analytics/wikimetrics] - https://gerrit.wikimedia.org/r/155003 (https://bugzilla.wikimedia.org/68507) (owner: QChris)
[17:17:24] (CR) Milimetric: [C: -1] "again, sorry for the rebase - I thought I was going to merge it but found a small issue with testing." (1 comment) [analytics/wikimetrics] - https://gerrit.wikimedia.org/r/155003 (https://bugzilla.wikimedia.org/68507) (owner: QChris)
[17:17:35] ottomata: bits + upload hosts handle
[17:17:46] geoiplookup.wikimedia.org,
[17:17:56] bits.wikimedia.org,
[17:18:07] upload.wikimedia.org.
[17:18:17] All those three are ignored by webstatscollector.
[17:18:19] But!
[17:18:47] Some browsers are stupid and ask the wrong caches for the wrong domain.
[17:19:07] Hive queries are just running to see how much that would affect us.
[17:19:31] ? some browsers? why would browsers do that?
[17:19:31] (One query is at 73%, one at 31%)
[17:19:52] browsers, faked browsers, bots, ...
[17:20:26] Mobile browsers (or things disguising themselves as such) sometimes do crazy things.
[17:22:28] what, set the HTTP Host differently when requesting a bits IP?
[17:22:44] yup.
[17:23:31] weird
[17:23:32] Like requesting things from en.wikipedia.org from a bits cache.
[17:28:22] hm, ok strange
[17:30:06] hmm, qchris, there is a udp2log instance running on an03 too
[17:30:06] hmm
[17:30:09] sqstat
[17:30:16] yup.
[17:30:16] i put it there a while ago for lack of a better place
[17:30:28] That's where I thought the udp log errors were coming from
[17:30:39] when you first mentioned the errors.
[17:31:39] I am not using that instance.
[17:32:28] hm, maybe i'll try to move that over to an26, since an26 isn't doing anything right now
[17:32:28] hm
[17:33:11] Sounds good.
[17:45:06] (PS1) BearND: New SQL for "Apps - Impressions on edit pencil (per day)" [analytics/limn-mobile-data] - https://gerrit.wikimedia.org/r/155763
[17:50:11] (CR) Milimetric: New SQL for "Apps - Impressions on edit pencil (per day)" (1 comment) [analytics/limn-mobile-data] - https://gerrit.wikimedia.org/r/155763 (owner: BearND)
[17:58:20] (PS2) BearND: New SQL for "Apps - Impressions on edit pencil (per day)" [analytics/limn-mobile-data] - https://gerrit.wikimedia.org/r/155763
[17:59:28] qchris: queries return?
[17:59:40] Currently at 89%.
[17:59:52] 90%!
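
The filtering qchris describes above: webstatscollector's filter drops requests for those three hosts before anything reaches collector. A minimal sketch of that kind of host check (the real filter parses whole cache log lines; host_is_ignored is an illustrative name, not the actual function):

    /* Sketch: drop hits for the hosts webstatscollector ignores, per the
     * discussion above. The real filter parses full squid-format log
     * lines; host_is_ignored is illustrative, not the actual code. */
    #include <stdio.h>
    #include <string.h>

    static const char *ignored_hosts[] = {
        "geoiplookup.wikimedia.org",
        "bits.wikimedia.org",
        "upload.wikimedia.org",
    };

    static int host_is_ignored(const char *host)
    {
        for (size_t i = 0; i < sizeof ignored_hosts / sizeof *ignored_hosts; i++)
            if (strcmp(host, ignored_hosts[i]) == 0)
                return 1;
        return 0;
    }

    int main(void)
    {
        printf("%d\n", host_is_ignored("bits.wikimedia.org")); /* 1: dropped */
        printf("%d\n", host_is_ignored("en.wikipedia.org"));   /* 0: counted */
        return 0;
    }

The catch in the exchange above: a request served by a bits cache can still carry Host: en.wikipedia.org, so a Host-based filter counts it anyway — hence the Hive queries to size that traffic.
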
[18:00:04] ok ok ok
[18:02:26] (CR) Milimetric: New SQL for "Apps - Impressions on edit pencil (per day)" (1 comment) [analytics/limn-mobile-data] - https://gerrit.wikimedia.org/r/155763 (owner: BearND)
[18:13:06] hm, qchris, even so, i'm not entirely sure why there would be udp recv errors in this case
[18:13:36] it's not like collector is receiving more traffic than it is in production
[18:13:42] and this machine is way beefier...
[18:14:02] filter should be sending about the same amount of udp traffic to collector as it does in production
[18:14:06] maybe because it's running on the same machine?
[18:14:10] send -> recv on the same box?
[18:14:12] But tcp traffic increased drastically.
[18:14:17] true
[18:14:21] and udp + tcp share the same network card.
[18:14:22] overall networking traffic maybe then?
[18:14:24] yeah
[18:14:25] hm
[18:14:37] The graph in ganglia seems to max out around 80MB/s
[18:14:43] we saw that on other machines too.
[18:15:34] hmmm, yeah?
[18:15:37] if so that is very likely, eh?
[18:16:09] hmm, well, since removing udp2log from an03
[18:16:17] bytes received is down to 30M / s
[18:16:49] oh! and i think recv errors have stopped too!
[18:16:52] ah
[18:16:52] nope
[18:16:54] just slowed
[18:17:02] or not...(i'm checking /proc/net/udp manually)
[18:17:17] naw, nm
[18:17:18] still the same
[18:17:19] http://ganglia.wikimedia.org/latest/graph_all_periods.php?c=Miscellaneous%20eqiad&h=analytics1003.eqiad.wmnet&r=4hr&z=default&jr=&js=&st=1408731349&v=147394545&m=UDP_InErrors&vl=packets&ti=UDP%20Packet%20Receive%20Errors&z=large
[18:19:45] http://dpaste.com/0T4H1PM
[18:19:58] ottomata: ^ percentages of domains
[18:20:26] ok awesome, i'm going to stop consuming bits and upload then
[18:20:27] thanks qchris
[18:20:39] i mean, that's good to know that we can cut out a huge portion of the stream for webstatscollector
[18:20:45] and for your hive replacement query too!
[18:20:58] Yup.
[18:21:42] That should lower network traffic on analytics1003 and hopefully allow udp to recover.
[18:28:02] rats, nope, qchris, still recv errors :/
[18:28:33] Mhmm.
[18:28:59] cat /proc/net/udp | grep 0EE7
[18:29:03] analytics1003, if you are curious
[18:29:16] dropped packets is the last column
[18:31:14] (PS3) BearND: New SQL for "Apps - Impressions on edit pencil (per day)" [analytics/limn-mobile-data] - https://gerrit.wikimedia.org/r/155763
[18:31:40] (CR) BearND: New SQL for "Apps - Impressions on edit pencil (per day)" (1 comment) [analytics/limn-mobile-data] - https://gerrit.wikimedia.org/r/155763 (owner: BearND)
[18:32:30] (CR) Milimetric: [C: 2] New SQL for "Apps - Impressions on edit pencil (per day)" [analytics/limn-mobile-data] - https://gerrit.wikimedia.org/r/155763 (owner: BearND)
[18:41:03] ottomata: kafka has a buffer of messages (several days?). Since text was added,
[18:41:26] does kafkatee now consume all the buffered data, or just the live, new data?
[18:41:54] HMMMMM
[18:41:56] good q
[18:42:00] does it start from the beginning or the end!?
[18:42:02] let's see...
[18:42:28] # If the request offset was not found on broker, or there is no
[18:42:28] # initial offset known (no stored offset), then reset the offset according
[18:42:28] # to this configuration.
[18:42:28] # Values: smallest (oldest/beginning), largest (newest/end), error (fail)
[18:42:28] # Default: largest
[18:42:29] kafka.topic.auto.offset.reset = largest
[18:42:33] should be end
[18:43:05] Ok.
[18:43:20] Also, kafkatee is very bursty on cpu
[18:43:32] which makes filter, and log2udp bursty.
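
The `cat /proc/net/udp | grep 0EE7` check above works because /proc/net/udp lists the local port in hex (0x0EE7 = 3815, the port log2udp is seen sending to later in this log) and the per-socket drop counter in the last column. A small sketch that reads the same numbers programmatically (column layout as documented for Linux's /proc/net/udp; the hard-coded port is just this log's):

    /* Report rx_queue and drops for one UDP port from /proc/net/udp.
     * Lines look like:
     *   sl local_address rem_address st tx_queue:rx_queue ... drops
     * with addresses as HEXIP:HEXPORT and drops as the last column.
     * 0x0EE7 == 3815, the collector port seen in this log. */
    #include <stdio.h>
    #include <string.h>

    int main(void)
    {
        FILE *f = fopen("/proc/net/udp", "r");
        if (!f) { perror("fopen"); return 1; }

        char line[512];
        if (!fgets(line, sizeof line, f)) return 1;   /* skip header row */
        while (fgets(line, sizeof line, f)) {
            unsigned port, tx, rx;
            if (sscanf(line, " %*d: %*x:%x %*x:%*x %*x %x:%x",
                       &port, &tx, &rx) != 3 || port != 0x0EE7)
                continue;
            /* drops is the last whitespace-separated field on the line */
            char *end = line + strlen(line);
            while (end > line && (end[-1] == '\n' || end[-1] == ' '))
                end--;
            *end = '\0';
            char *p = strrchr(line, ' ');
            printf("port %u: rx_queue=%u bytes, drops=%s\n",
                   port, rx, p ? p + 1 : "?");
        }
        fclose(f);
        return 0;
    }
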
[18:43:39] Far more bursty than on oxygen.
[18:44:36] Like going from 0% CPU to >100% CPU several times per second (for kafkatee)
[18:45:33] While that's different to oxygen, I cannot immediately see whether or not that would explain things.
[18:45:48] hmm
[18:45:54] Meh. Probably not.
[18:49:53] HMM, why can't i see this traffic in tcpdump...
[18:49:55] i can see it fine on oxygen...
[18:50:08] collector is definitely getting the traffic
[18:50:20] i'm going to stop collector for a bit
[18:50:21] Are you tcpdumping lo?
[18:50:27] hmmm ah, maybe not!
[18:50:34] thank you!
[18:50:34] :)
[18:50:40] yw.
[18:51:51] hmm, qchris, maybe loopback interface recv queue behaves differently...
[18:52:07] I am just reading up on the same topic.
[18:52:10] :-)
[19:04:53] qchris, i haven't found anything of interest yet, i'm thinking of sending to eth0 instead, just to see what happens
[19:05:10] I am just preparing a patch to bump the receive buffer to 512MB.
[19:05:41] for collector's socket?
[19:06:12] Yes.
[19:06:44] (PS1) QChris: Bump collector's receive buffer to 512MB [analytics/webstatscollector] (kafka) - https://gerrit.wikimedia.org/r/155782
[19:07:13] ottomata: ^
[19:08:46] hoookay
[19:09:00] Not sure if that would fix it.
[19:09:08] But might be worth a try.
[19:09:11] qchris, let's try it before merging, got a compiled collector?
[19:09:14] copy it to an03 and i'll run it
[19:09:23] ok.
[19:10:12] ottomata: /home/qchris/collector
[19:10:22] (on analytics1003)
[19:12:34] hm, qchris ./collector: relocation error: ./collector: symbol db_create, version libdb-4.8.so not defined in file libdb-4.8.so with link time reference
[19:13:22] I only looked at ldd which had no "not found" :-)
[19:13:57] Let me build on an Ubuntu system then ...
[19:14:02] am doing that now...
[19:14:02] :)
[19:14:22] Thanks.
[19:15:19] ok, it's running
[19:15:41] still see the drops :/
[19:17:08] :-(
[19:17:52] qchris: i think you are right about the burstiness though, as i watch /proc/net/udp, i can see the rx_queue fill, drops increase, and then all of a sudden it resets to 0, and for a second or two drops doesn't increase
[19:18:44] collector is only using a small buffer to read from the socket.
[19:18:56] And it's polling.
[19:19:27] Would it be worth trying to increase the collector's read buffer?
[19:19:48] Meh. Let's try.
[19:20:02] sure
[19:20:03] :)
[19:21:42] uhh qchris
[19:21:50] ?
[19:21:50] i just switched log2udp to send to eth0 instead
[19:21:53] drops seem to have stopped
[19:22:04] \o/
[19:22:11] (PS1) QChris: Increase collector's read buffer [analytics/webstatscollector] (kafka) - https://gerrit.wikimedia.org/r/155787
[19:22:23] that is very strange, i would think loopback would be faster!
[19:23:00] Faster it is for sure :-)
[19:23:11] But a bit different too.
[19:23:56] Not sure. ... Should we try the above buffer increase nonetheless,
[19:24:01] or just leave it on eth0?
[19:24:46] um, i'd say let's leave the base webstatscollector as is, unless we need to fix it
[19:25:13] Ok. So eth0 it is :-)
[19:25:54] (Abandoned) QChris: Bump collector's receive buffer to 512MB [analytics/webstatscollector] (kafka) - https://gerrit.wikimedia.org/r/155782 (owner: QChris)
[19:26:21] (Abandoned) QChris: Increase collector's read buffer [analytics/webstatscollector] (kafka) - https://gerrit.wikimedia.org/r/155787 (owner: QChris)
[19:27:44] hokay, cool!
[19:27:56] let's hope this just runs fine over the weekend, and then we can compare some numbers on monday, eh?
[19:28:47] Great plan!
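
The abandoned "Increase collector's read buffer" change (r155787) is not quoted in the log either; the general shape of the idea — a larger user-space buffer when reading datagrams off the socket — looks like the sketch below. The 64KB size, the port, and the blocking loop are illustrative assumptions, not collector's actual code:

    /* Sketch of the "increase collector's read buffer" idea (r155787 was
     * abandoned and its diff is not in the log). A buffer big enough for
     * a maximum-size datagram guarantees nothing gets truncated; the
     * real collector polls rather than blocking as this sketch does. */
    #include <errno.h>
    #include <netinet/in.h>
    #include <stdio.h>
    #include <sys/socket.h>
    #include <unistd.h>

    #define READ_BUF (64 * 1024)   /* a UDP datagram can be up to ~64KB */

    int main(void)
    {
        int s = socket(AF_INET, SOCK_DGRAM, 0);
        if (s < 0) { perror("socket"); return 1; }

        struct sockaddr_in addr = { 0 };
        addr.sin_family = AF_INET;
        addr.sin_port = htons(3815);   /* collector's port in this log */
        addr.sin_addr.s_addr = htonl(INADDR_ANY);
        if (bind(s, (struct sockaddr *)&addr, sizeof addr) < 0) {
            perror("bind");
            return 1;
        }

        static char buf[READ_BUF];
        for (;;) {
            ssize_t n = recv(s, buf, sizeof buf, 0);
            if (n < 0) {
                if (errno == EINTR)
                    continue;
                perror("recv");
                break;
            }
            /* ... hand the datagram to the counting code here ... */
        }
        close(s);
        return 0;
    }

Note that each recv() returns exactly one datagram, so a bigger user-space buffer mostly prevents truncation; drops while collector is busy are governed by the kernel-side receive queue, which is exactly what the next exchange chases.
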
[19:37:33] ottomata: UDP Packets Received went down to 0 when the errors stopped:
[19:37:36] http://ganglia.wikimedia.org/latest/graph_all_periods.php?c=Miscellaneous%20eqiad&h=analytics1003.eqiad.wmnet&r=hour&z=default&jr=&js=&st=1408736087&v=1184245155&m=UDP_InDatagrams&vl=packets&ti=UDP%20Packets%20Received&z=large
[19:37:52] Is the thing still working?
[19:40:31] hhaha
[19:40:32] uhhh
[19:40:41] what, you want me to check if it is working too!?
[19:40:42] GEEZ
[19:40:49] Hahaha :-)
[19:40:54] Processes still exist.
[19:41:19] but collector is at 0% CPU :-/
[19:45:42] wha?
[19:45:42] udp 0 0 10.64.21.103:42315 10.64.21.104:3815 ESTABLISHED 6624/log2udp
[19:45:51] .104 is analytics1004
[19:45:53] where'd it get that from?
[19:46:37] Try pinging analytics1003.eqiad.wmnet :-)
[19:46:48] haha
[19:46:48] HMMM
[19:47:05] analytics1003.eqiad.wmnet. 3600 IN A 10.64.21.103
[19:47:13] cached dns somehow?
[19:47:21] hosts!
[19:47:22] WTF
[19:47:23] gah
[19:47:56] /etc/hosts looks good.
[19:48:14] i just deleted an entry
[19:48:17] that would do exactly that
[19:48:24] :-D
[19:48:32] must have been some crappy thing i did for testing a looong time ago
[19:48:33] i guess
[19:48:34] geez
[19:48:36] Hahahahaha.
[19:49:21] ok, collector is busier now
[19:50:12] There are the drops again :-(
[19:50:21] No "eth0 fixes the issue"
[19:50:23] :-((
[19:52:35] rx_queue is stuck at 4MB.
[19:52:59] rats
[19:53:18] hmm yup
[19:53:19] ok welp
[19:53:26] Can we bump the default and max rx_queue?
[19:53:33] (System wide)
[19:53:52] sure, but i think udp2log puppetization bumps it pretty high
[19:54:09] sysctl::parameters { 'big rmem':
[19:54:09] values => {
[19:54:09] 'net.core.rmem_max' => 536870912,
[19:54:09] 'net.core.rmem_default' => 4194304,
[19:54:09] },
[19:54:10] }
[19:54:22] hmm,
[19:54:52] /proc/sys/net/core/rmem_default is still at 4MB.
[19:54:57] Right.
[19:55:13] udp2log manually sets it in the code too
[19:55:23] in init script
[19:55:23] rmax=$(cat /proc/sys/net/core/rmem_max)
[19:55:23] queue=$((rmax/1024))
[19:55:28] --recv-queue=${queue}"
[19:55:32] udp2log is the producer, not the receiver.
[19:55:38] collector is the receiver.
[19:55:45] i'm saying, in the usual udp2log case
[19:55:54] so, the system limits are higher on this box already
[19:56:12] because udp2log was installed via puppet here
[19:56:23] Right. But the default of 4MB is not too high.
[19:56:24] so, we can try to up it in code too, but i guess that's what you already had done, right?
[19:57:09] I thought so :-)
[19:57:09] i will turn your recv buffer collector back on
[19:57:16] let's see if it helps on eth0?
[19:57:19] Ok.
[19:58:19] k, it's running
[19:58:30] still drops
[19:58:45] Still the rx_buffer does not go higher than 4MB.
[19:59:41] So my patch is probably wrong.
[19:59:54] Can we manually bump net.core.rmem_default?
[20:00:06] Just to test that the buffer is the issue?
[20:00:54] sure
[20:01:52] ok done, collector restarted...
[20:02:00] Nice!
[20:02:05] It goes way above 4MB now.
[20:02:07] And no drops.
[20:02:15] no drops, and it looks like it clears it out too
[20:02:30] oo, gettin full...
[20:02:33] ah, cleared
[20:02:34] cool
[20:03:12] So SO_RCVBUF obviously does not do what I thought it would.
[20:03:18] hm
[20:03:21] this is what udp2log does
[20:03:21] https://github.com/wikimedia/analytics-udplog/blob/master/udp2log/udp2log.cpp#L209
[20:03:23] Do you know the correct SO_*?
[20:03:36] looks like the right one
[20:03:45] That's where I took it from :-)
[20:06:31] Whoa. It's filling up.
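
Some background on the SO_RCVBUF puzzle above: on Linux, setsockopt(SO_RCVBUF) requests are silently capped at net.core.rmem_max, the kernel doubles the accepted value to account for bookkeeping overhead, and getsockopt() reports that effective (doubled) size. Reading the value back right after setting it is the standard diagnostic: it distinguishes a capped request from one that never reached the socket at all (qchris's "So my patch is probably wrong"). A sketch, with the 512MB figure mirroring the abandoned r155782:

    /* Diagnostic: set SO_RCVBUF and read back the effective size.
     * The kernel caps the request at net.core.rmem_max and doubles the
     * accepted value; getsockopt reports the doubled figure. If got is
     * ~2 * rmem_max rather than ~2 * want, the request was capped;
     * SO_RCVBUFFORCE (requires CAP_NET_ADMIN) bypasses the cap without
     * touching the sysctl. */
    #include <netinet/in.h>
    #include <stdio.h>
    #include <sys/socket.h>
    #include <unistd.h>

    int main(void)
    {
        int s = socket(AF_INET, SOCK_DGRAM, 0);
        if (s < 0) { perror("socket"); return 1; }

        int want = 512 * 1024 * 1024;   /* 512MB, as in r155782 */
        int got = 0;
        socklen_t len = sizeof got;

        if (setsockopt(s, SOL_SOCKET, SO_RCVBUF, &want, sizeof want) < 0)
            perror("setsockopt");
        getsockopt(s, SOL_SOCKET, SO_RCVBUF, &got, &len);
        printf("asked for %d, effective buffer %d\n", want, got);

        close(s);
        return 0;
    }
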
[20:06:39] Once the buffer is full, we see drops again.
[20:08:00] ah yeah, it is fuuul
[20:08:04] rats ok
[20:08:04] hm
[20:10:10] hm
[20:10:37] So collector is too slow?
[20:11:43] guess so, but still, don't know why this would make a difference, seeing as it runs ok on gadolinium
[20:11:51] and the amount of data it's sent is about the same
[20:11:57] i guess burstiness could cause it
[20:12:02] dunno
[20:12:48] Not sure yet either. :-/
[20:22:35] (PS1) Milimetric: Fix config and sql [analytics/limn-mobile-data] - https://gerrit.wikimedia.org/r/155803
[20:23:33] (CR) Milimetric: [C: 2] Fix config and sql [analytics/limn-mobile-data] - https://gerrit.wikimedia.org/r/155803 (owner: Milimetric)
[20:33:34] qchris: not sure, i have to run pretty soon
[20:33:35] hm
[20:33:43] No worries.
[20:33:49] I'll toy around a bit more.
[20:34:28] We could try increasing the buffer for reading from the socket.
[20:35:11] (Like the second gerrit change that we have not yet tested)
[20:35:28] yeah,
[20:35:45] If that does not help, I am out of clues for now :-)
[20:35:55] qchris, can I make it so you can swap out collector binaries?
[20:36:02] i'll make /usr/local/bin writeable, i guess
[20:36:05] that's the easiest
[20:36:14] That would be nice!!!!
[20:36:50] Wait ... wouldn't puppet get in the way and everything?
[20:37:01] Meh. Let's try.
[20:37:01] naw, it doesn't mess with the package stuff
[20:37:03] the package installed it
[20:37:14] puppet only manages the install of the package
[20:37:34] k.
[20:37:37] if you messed with /etc/kafkatee.d files, puppet will get in the way
[20:37:45] but, ja, there you go
[20:37:53] /usr/local/bin/ is group writeable by analytics-admins
[20:38:02] try to make sure you can write there
[20:38:32] Works.
[20:39:07] cool
[20:39:11] welp, good luck! :)
[20:39:14] But I guess I cannot kill processes you started.
[20:39:19] ah, hm
[20:39:20] right
[20:39:24] you'd have to have sudo ahhhh
[20:39:36] If you kill the collector, I can restart it.
[20:39:55] it's run by init
[20:40:00] and puppet
[20:40:00] Ha :-D
[20:40:08] Well.
[20:40:13] hm
[20:40:13] it's ok
[20:40:14] Let's continue on Monday then.
[20:40:15] we can kill it
[20:40:18] i'll stop puppet
[20:40:19] it's fine
[20:40:23] you can just start it as your user
[20:40:25] should be fine
[20:41:33] ok
[20:41:42] you should be able to just fire up your own collector now
[20:41:51] Works.
[20:41:51] hm, i will make /srv/log/webstats writable by analytics-admins too
[20:42:08] You rock!
[20:42:21] ok, i think that should do it
[20:42:26] Thanks!
[20:42:33] Now off into your weekend! :-D
[20:45:45] ha, ok thanks qchris, laterrrs
[20:45:50] Have fun!
[21:45:30] (PS1) Milimetric: Set up layout and basic pieces [analytics/dashiki] - https://gerrit.wikimedia.org/r/155826
[21:49:19] (CR) Milimetric: "Nuria I worked on your patch for a bit but I thought it had a bit too much detail and un-necessary code. So I used a separate commit to d" [analytics/dashiki] - https://gerrit.wikimedia.org/r/155826 (owner: Milimetric)
[21:56:27] (PS1) BearND: Add Apps tab with the first app specific graph [analytics/limn-mobile-data] - https://gerrit.wikimedia.org/r/155830