[04:26:21] 10Traffic, 10Android-app-feature-Feeds, 10Mobile-Content-Service, 06Operations, and 3 others: Investigate Android app API request latency regression - https://phabricator.wikimedia.org/T138010#2387918 (10ori)
[04:57:50] 10Traffic, 10Android-app-feature-Feeds, 10Mobile-Content-Service, 06Operations, and 3 others: Investigate Android app API request latency regression - https://phabricator.wikimedia.org/T138010#2387932 (10GWicke) Varnish -> RB GET p95 latency: {F4173733}
[06:32:31] 10Traffic, 10Android-app-feature-Feeds, 10Mobile-Content-Service, 06Operations, and 3 others: Investigate Android app API request latency regression - https://phabricator.wikimedia.org/T138010#2386622 (10ori) From reading the schema, it looks like you're reporting an average for all requests made in the co...
[09:36:58] 10netops, 06Operations, 13Patch-For-Review: block labs IPs from sending data to prod ganglia - https://phabricator.wikimedia.org/T115330#2388249 (10akosiaris) So, this is happening due to carbon having a public IP and somewhat relax firewall rules. The premise is that hosts with public IPs have that due to t...
[09:41:52] 10netops, 06Operations, 13Patch-For-Review: block labs IPs from sending data to prod ganglia - https://phabricator.wikimedia.org/T115330#2388256 (10Krenair) phab-01.phabricator.eqiad.wmflabs should no longer be doing it
[10:27:50] https://grafana.net/plugins/grafana-worldmap-panel is kind of interesting :)
[10:51:00] bblack: I have more logs for the timeout issue on cp1061:/home/elukey/(varnishkafka.log|varnishlog).txt
[10:51:29] they are about timeouts for websocket upgrade requests from what I can see
[10:51:35] no Timestamp:Pipe
[10:52:09] but VSL: timeout and synth as we observed yesterday
[10:53:24] Now this messes up a lot of our consistency checks on Hadoop, resolvable by adding 'VSL ne "timeout"' to vk's filter
[10:53:40] I thought about the pros/cons of doing/not doing it
[10:54:13] and even though it would be nice to have vk's logs for timeouts, they may not contain enough information to take action or to identify a root cause
[10:54:54] so it may be better to filter them at vk's level and then log/identify/alarm on them on each cp* host
[10:56:52] does that sound reasonable? The alternative, if you want logs on hadoop, could be to change our consistency checks to not count logs with timestamp "-" and to statistics about how many requests with inconsistent timestamps are logged for each hour (but it might require a bit of time)
[10:59:59] s/to statistics/do statistics/
[11:00:12] 10Traffic, 10Android-app-feature-Feeds, 10Mobile-Content-Service, 06Operations, and 3 others: Investigate Android app API request latency regression - https://phabricator.wikimedia.org/T138010#2388438 (10BBlack) I don't even understand the initial thing the ticket is complaining about. Can you explain to...
[11:07:31] elukey: no, we can't just skip timeouts from the vk set. the vk set is basically the only authoritative data we have. if it doesn't get logged by vk, it doesn't exist at all for a lot of other purposes.
[11:07:57] ah ok makes sense
[11:08:19] if some kind of regression was suddenly causing 5 or 10 percent of all requests to suffer some timeout-related perf regression we'd never be able to figure it out. and syslog on caches of anything related to individual requests has the possibility of being way too spammy and filling disks with noise, etc
[11:08:51] we don't even log all the requests to nginx, either, for basically the same log-spam reasons.
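(Editorial aside: the "do statistics about how many requests with inconsistent timestamp are logged for each hour" idea floated above could look roughly like the sketch below. The record layout, field names and function name are made up for illustration; the real check lives in the analytics refinery jobs, not in this snippet.)

```python
# Hypothetical sketch: per-hour stats on webrequest records whose end timestamp is
# missing ("-"), instead of letting them trip the loss check outright. Assumes each
# record is a dict with 'dt' (the logged timestamp, '-' when End: was missing) and
# 'hour' (the hourly bucket); the input is a stand-in for whatever query actually
# feeds the check.
from collections import Counter

def missing_end_stats(records):
    total = Counter()
    missing = Counter()
    for rec in records:
        total[rec['hour']] += 1
        if rec['dt'] == '-':
            missing[rec['hour']] += 1
    return {
        hour: {'total': total[hour],
               'missing_end': missing[hour],
               'missing_pct': 100.0 * missing[hour] / total[hour]}
        for hour in total
    }
```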
[11:09:06] which is actually another big hole, that we can't/don't want to fix at this time.
[11:09:31] I was hoping that we could have sent metrics to graphite from one of the python scripts that use the Varnish Log
[11:09:59] (as in, due to some bug or config regression, if 10% of all user requests were being dropped on the floor or refused by nginx, we could only really tell from user reports. We don't start analyzing until the request makes it from nginx to varnish).
[11:10:17] in theory catchpoint and icinga would see major public-facing issues with nginx losing requests, but they might miss minor ones.
[11:11:06] elukey: the python scripts face all the same problems vk does. if we can't figure out how to categorize it in vk, we're not going to figure it out in python either.
[11:12:31] well there is a specific tag for timeouts, so varnishreqstats might be instructed to send metrics to graphite when it sees it in the logs? Not super familiar with varnishreqstats so this might be completely stupid
[11:18:10] let's step back a sec, because I only really half-understand what we do with vk anyways
[11:18:23] all of this is about logging timestamps to webrequest, right?
[11:18:41] or is this about one of the other vk loggers?
[11:19:58] because when I look at the webrequest output, the only place a timestamp is used is the "dt" field, which isn't even tracking any kind of time differential like end-start, right? it's just recording a single datestamp for the request?
[11:20:12] in which case, why don't we just use the start timestamp and be done with it, since that's always there?
[11:22:52] (rewinding to your last question: we could get metrics on timeouts, but it wouldn't link them to actual requests)
[11:24:47] correct, we want to have a timestamp for each webrequest.. We switched from Start to End since it was messing up our consistency checks on Hadoop.
[11:25:19] we bucket data per hour and check sequence number diffs vs the actual number of reqs logged
[11:26:22] (we used to log the end timestamp with the vk 3 version)
[11:27:05] sometimes a request can have a start timestamp in one hour and an end timestamp in another, for example if it starts at XX:59 and ends at YY:01
[11:28:11] but if you're only logging one of the two anyways, how does that affect consistency checks? either way the rate of reqs/hr is about the same, and the hour rollover is arbitrary?
[11:29:25] yeah but the sequence number is generated at the very end of the vk processing
[11:29:40] after reading the log
[11:29:44] how does the timestamp affect that?
[11:30:30] nono it doesn't affect that, using the end timestamp facilitates bucketing the data correctly in hadoop
[11:30:33] either way you're just looking at #reqs/hr, vs start/end seqnos in a given hour, right?
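(Editorial aside: for readers unfamiliar with the consistency check being referenced here, a minimal sketch of the idea follows — per host and hour bucket, compare the span of vk sequence numbers against the rows actually present, and alarm past a loss threshold. The input shape, names and the 5% constant are illustrative, not the actual refinery job.)

```python
# Illustrative only: 'records' is assumed to be an iterable of (hostname, hour, seqno)
# tuples for one webrequest partition.
from collections import defaultdict

LOSS_THRESHOLD_PCT = 5.0  # the 5% (later 10%) alarm threshold mentioned in the discussion

def check_hour(records):
    seqs = defaultdict(list)
    for host, hour, seq in records:
        seqs[(host, hour)].append(seq)
    alarms = []
    for (host, hour), s in seqs.items():
        expected = max(s) - min(s) + 1      # seqnos vk handed out within this bucket
        actual = len(s)
        loss_pct = 100.0 * (expected - actual) / expected
        if loss_pct > LOSS_THRESHOLD_PCT:
            alarms.append((host, hour, loss_pct))
    return alarms
```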
[11:31:03] I still don't see how it affects the correctness of the bucketing
[11:32:36] there may be some sequence numbers related to other requests between the start and end of a slow req, and you may get a weird sequence number in the wrong hour "bucket" messing up our checks, since the script will think that there are missing seq numbers
[11:32:45] in reality they are all in another hour bucket
[11:33:33] let me find an explanation that andrew and I wrote
[11:33:39] in a phab task
[11:34:13] ok, so you're saying that basically you want the order of actual events from varnish->vk to match the ordering of the timestamps
[11:34:29] https://phabricator.wikimedia.org/T136314#2341538
[11:34:33] yes
[11:35:24] using format seqno:time, you don't want to see: 1:1 2:2 3:5 4:4 5:3
[11:35:56] (which is the kind of pattern you might see logging start timestamps, but with log emissions happening at the end of requests)
[11:36:04] yep!
[11:36:16] if 1:1 2:2 and 3:5 get bucketed in the same hour,
[11:36:29] then you think you've lost 2 requests
[11:36:32] then our script might think that 4 is missing
[11:36:42] yeah sorry also 3
[11:36:47] exactly
[11:37:10] even with end timestamps, is that actually reliable?
[11:37:41] we solved the issue, Joseph double-checked with the stats that we store for errors etc..
[11:38:12] the main issue is also that the consistency check can stop the data processing pipeline if it thinks that more than 5% of the reqs are lost
[11:38:58] we increased the threshold to 10% but even with it we are seeing many false positives (related to timeouts) and we have to manually fix them to restart the data crunching pipeline
[11:40:26] well, fundamentally, apparently there are many cases in varnish4 that emit no final timestamp
[11:40:44] here's the thing, though.
[11:41:31] other than the arbitrariness of the consistency check, the datestamp we use is pretty fuzzy anyways. we've switched from start to end before, and there's lots of fuzz around how either of those actually relates to the client or various servers or the exact time of logging with small lags there, etc...
[11:42:37] one thing we do know is that a normal request emits to the shmlog for vk pickup around the same time as its End timestamp (at least, in sequence with its End timestamp), and that our varnish->vk lag is probably never very big, since it's a shm buffer and we don't have routine overruns or anything...
[11:43:26] what harm would it cause to either (a) completely replace the varnish end timestamp with the local system time from vk at the time vk picks up the shm log entry, or (b) only do that in cases of no end timestamp?
[11:44:30] also, either way, the loss% you're tracking never covered realworld->varnish-shm-log->vk loss, only loss from vk onwards, since vk is generating the seqnos
[11:45:14] if vk's the seqno source and we don't expect any real delay in varnish_shm->vk, why wouldn't we just have vk generate the timestamp with it?
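(Editorial aside: a toy demonstration of the failure mode described above. vk assigns sequence numbers in emission — i.e. end-of-request — order, so bucketing on a start timestamp can drag a late seqno into an earlier hour and fake a gap. The numbers are arbitrary, not real webrequest data.)

```python
# A slow request starts at 00:59 but ends at 01:01, so it is emitted (and seqno'd)
# after ~50 faster requests that ended in between. Bucketing by *start* hour then
# puts seqno 150 into hour 0 alongside seqnos 1..100, and max-min+1 overcounts.
slow = {'seq': 150, 'start_hour': 0, 'end_hour': 1}
fast = [{'seq': s, 'start_hour': 0, 'end_hour': 0} for s in range(1, 101)]

hour0_by_start = [r['seq'] for r in fast + [slow] if r['start_hour'] == 0]
expected = max(hour0_by_start) - min(hour0_by_start) + 1   # 150
actual = len(hour0_by_start)                               # 101
print(f"apparent loss: {expected - actual} requests")      # 49 phantom "missing" seqnos

# Bucketing on the end timestamp instead keeps seqno 150 in hour 1, and the gap vanishes.
```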
[11:48:35] it could be an option yes, maybe with a different formatter
[11:49:37] I didn't want to do it since I preferred to use varnish timestamps, fearing small delays in the shm->vk processing, but there should be none
[11:54:20] I think the maximum possible shm->vk delay is going to be smaller than things like start-vs-end fuzz, or the fuzz of correlating either timestamp to other things in the world
[11:54:27] yeah
[11:54:35] also we could just log both the start and end timestamps
[11:54:43] if the shm->vk delay was substantial at all it would just cause completely-lost events and we'd drop events from shm recycling (and hopefully log that)
[11:54:47] and modify the consistency checks accordingly
[11:55:09] yeah that's a good idea too
[11:55:36] log the varnish start timestamp, and then use a separate fuzzier end timestamp for consistency-checking
[11:55:52] you can either use End if it's avail and replace it with vk's local time if it's not, or use vk's local time all the time
[11:56:32] the latter is simpler, but then again if End's available people could use it to make end-start times available (total request times). and in the cases it's missing, the total time is probably larger anyways, large enough that the small fuzz doesn't matter.
[12:13:34] but in the latter case, how would we find that timeouts have occurred?
[12:14:29] for example, Start: date/time and End: - indicates that something went really wrong, whereas a correct Start/End pair would not have alerted us..
[12:14:59] it would resolve the consistency check problem, but it may introduce other problems..
[12:16:38] 10Traffic, 10Analytics-Cluster, 06Operations: Respect X-Forwarded-For only from trustworthy sources - https://phabricator.wikimedia.org/T56783#2388494 (10BBlack) nice find in the old tickets! See also: T120121
[12:19:47] add a flag field
[12:21:37] so basically replace today's "dt" with "sdt" (start time), "dt" (end time, which is the End timestamp or a local replacement if missing, and still gets used for the consistency check), and "tstype:X", where X is various codes for conditions to filter on
[12:22:21] X could be: N for normal, T for timeout (no End:, saw timeout tag?), P for pipe (no End:, saw a pipe flag somewhere?), U for unknown new cases?
[12:23:04] it doesn't fit well with the model of how tag format is configured and emitted, though.
[12:23:47] I guess pipes timeout too, so maybe multiple flags are possible
[12:24:14] I guess we don't necessarily need a start timestamp, it just seemed like a good idea. maybe too much data.
[12:24:20] we may start only with sdt: and edt: (start/end) and watch all the stats related to reqs without a proper end timestamp. This may be good for the moment, and then we'll see how it goes when we start switching Upload
[12:24:40] well
[12:24:55] sdt/edt as raw start/end from shm doesn't give you your consistency fix
[12:25:20] ah nono this would go along with a change in the script that does the consistency check
[12:25:25] we could ignore start for now (we don't have end-start capabilities, why add them as part of fixing this)
[12:25:31] removing the logs with edt: =
[12:25:34] sorry -
[12:25:47] removing where?
[12:26:07] that would leave holes in the seqnos unless you mean removing them at the vk level, at which point we're losing requests to logging completely
[12:27:33] sorry, removing was not the right word, I wanted to say that the logs will be stored in hadoop but the script will not alarm if the 5%ish data loss is related to reqs with End: "-"
[12:27:49] how would it even know?
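(Editorial aside: the "tstype" idea proposed above — pick a dt value and attach a code for how it was obtained — is summarized in the sketch below; the next messages refine it into a multi-character flags field. The record structure, helper name and field names are hypothetical, and the real implementation would be C inside varnishkafka, not Python.)

```python
# Hypothetical sketch of the proposed dt + type-code scheme.
# 'txn' is a made-up stand-in for what vk sees for one transaction.
import datetime

def classify(txn):
    end = txn.get('end_ts')                 # varnish End timestamp from the shmlog, or None
    now = datetime.datetime.utcnow().isoformat()
    if end is not None:
        return end, 'N'                     # normal: trust varnish's own End timestamp
    if txn.get('saw_timeout'):
        return now, 'T'                     # no End:, a timeout was seen in the log
    if txn.get('piped'):
        return now, 'P'                     # no End:, request was pipe-mode
    return now, 'U'                         # no End:, no known reason yet
```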
[12:28:02] if you're binning by time, what bin does end:- go in?
[12:28:50] if we really want to have the right data, I think we need some kind of flags thing about these kinds of states
[12:29:03] I think that Camus (which reads from kafka and stores in hadoop, bucketing per hour) is the one that buckets "-"
[12:29:16] ah yeah the flag would be awesome
[12:29:21] flags can be empty on normal response, P-flag for pipe-mode, T-flag for saw-timeout, E-flag for "no true end timestamp, we synthesized it"
[12:29:26] add more flags as necessary
[12:30:14] and then DT always has a value, and it's probably always a good approximation of what End: should've been if it was missing
[12:30:22] and the flags can be used to dive deeper on related issues if we need to
[12:31:50] "dt":"2016-06-16T16:37:44","flags":"PTE" -> dt is artificial because there was no End:, and we saw log indicators that this was a pipe-mode request and it timed out.
[12:32:09] I like it
[12:32:29] "dt":"2016-06-16T16:37:44","flags":"P" -> pipe mode, but no timeout (closed in some normal way)
[12:33:08] "dt":"2016-06-16T16:37:44","flags":"E" -> synthetic end timestamp, and we have no freaking idea why. pointer to look for more flag-cases to add in the future :)
[12:33:46] oh pipe never has end, so I guess the second example would probably always be "PE"
[12:33:47] :)
[12:34:16] unless we someday see a pipe-flag and an end-timestamp together, at which point we'll see "P" and have a new mystery.
[12:34:40] so I am going to ask my team for a short-term solution to temporarily patch the problem with our consistency check, and think about a proposal for vk
[12:34:50] sounds good!
[12:35:16] I also wanted to ask you for a suggestion about adding "tests" to vk
[12:35:41] I thought about unit tests for some functions, for example all the parsing ones, and maybe functional/integration tests?
[12:35:53] like log replaying and checking of results
[12:36:21] each time I deploy vk, even if it has been carefully reviewed by multiple people, it is like taking a leap of faith
[12:37:07] but I have no idea if it is worth it to just write some test C files or to use a framework
[12:37:16] never done testing in C :)
[12:37:50] it would be a good idea to add them, yeah
[12:38:10] unit tests are easier, because they don't involve mocking or invoking varnish to create real shm log entries
[12:38:38] still, I really think in the long run we'll dump vk.c for something else, so there's a question of time investment
[12:39:21] I think ema's old hack was varnishncsa-based, the other day he pointed out https://code.uplex.de/uplex-varnish/varnishevent too
[12:39:40] basically, we can use some other better-maintained varnish4 logging solution and just transform it into kafka output
[12:40:34] in any case, there are several small libraries for C unit testing, mostly about macros to make it easier to say ok() or whatever, and TAP or some other standard output for consuming tools
[12:40:45] but there's no well-defined standard practice there, and they all have tradeoffs
[12:41:33] for some C unit testing in gdnsd, I used http://ccodearchive.net/info/tap.html
[12:44:38] thanks!
[12:45:10] link saved, will check later on. Now I need to get to the airport :)
[12:45:20] have fun!
[12:47:18] all the libgdmaps unit tests in here use libtap, but the libtap calls are wrapped up in libgdmaps' own unit testing wrappers: https://github.com/gdnsd/gdnsd/tree/master/t/libgdmaps
[12:47:36] so e.g. https://github.com/gdnsd/gdnsd/blob/master/t/libgdmaps/t00_v4db.c is one unit test file
[12:47:53] (which generates one executable that outputs TAP output for several tests, consumed by a larger framework)
[12:48:23] and the gdmaps_test_*() calls in there come from: https://github.com/gdnsd/gdnsd/blob/master/t/libgdmaps/gdmaps_test.h
[12:49:06] implemented in e.g.: https://github.com/gdnsd/gdnsd/blob/master/t/libgdmaps/gdmaps_test.c#L59 (where you can see the calls to libtap ok() and such)
[12:49:59] thanks!
[12:59:08] 10Traffic, 10Analytics-Cluster, 06Operations: Respect X-Forwarded-For only from trustworthy sources - https://phabricator.wikimedia.org/T56783#2388543 (10Yurik) We should move all proxies out of the zerowiki and into metawiki. This will allow a much more transparent management of the proxy ips, and won't as...
[13:02:20] 10Traffic, 10Analytics-Cluster, 06Operations: Respect X-Forwarded-For only from trustworthy sources - https://phabricator.wikimedia.org/T56783#2388560 (10BBlack) >>! In T56783#2388543, @Yurik wrote: > We should move all proxies out of the zerowiki and into metawiki. This will allow a much more transparent ma...
[14:29:03] 10Traffic, 10Android-app-feature-Feeds, 10Mobile-Content-Service, 06Operations, and 3 others: Investigate Android app API request latency regression - https://phabricator.wikimedia.org/T138010#2388656 (10Mholloway) @ori, thank you, those are excellent points and I've created a new task to address them....
[14:30:27] 10Traffic, 10Android-app-feature-Feeds, 10Mobile-Content-Service, 06Operations, and 3 others: Investigate Android app API request latency regression - https://phabricator.wikimedia.org/T138010#2388660 (10BBlack) @mholloway - any chance this has interactions with the lazy-images/lazy-refs experiments, which...
[14:41:20] bblack: from the same authors, https://code.uplex.de/uplex-varnish/trackrdrd seems more relevant
[14:41:25] The source distribution for trackrdrd includes implementations of the MQ interface for Kafka, ActiveMQ and for file output (the latter for testing and debugging); see libtrackrdr-kafka(3), libtrackrdr-activemq(3) and libtrackrdr-file(3) for details.
[14:41:32] The trackrdrd demon reads from the shared memory log of a running instance of Varnish, aggregates data logged in a specific format for requests and ESI subrequests, and forwards the data to a messaging system, such as ActiveMQ or Kafka.
[14:41:55] interesting
[14:41:59] not json-formatted, though
[14:42:56] well if they've solved the hard parts right and it's good code, we could always send them a patch for a json flag or something
[14:43:08] why do you hate vk?
[14:43:14] or dislike, maybe :)
[14:43:48] because it's ridiculously complicated and we keep running into bugs and problems
[14:43:52] and it's our code to maintain
[14:44:24] I'd rather waste 5% more cpu cycles (that we have plenty of anyways) and use some standard solution from elsewhere if we can
[14:44:33] or at least hack together standard stuff from elsewhere using python and/or kafkacat
[14:45:30] ok :)
[14:45:32] regardless of all this bigger-picture stuff, what I really like is their approach to the logging
[14:45:48] which approach?
[14:45:51] we're trying to extract all these crazy fields from the bulk raw varnishlog
[14:46:15] they're doing std.log("Track: ...... relevant stuff extracted in VCL and formatted ..."), and just grabbing that one Track: entry from the shm log
[14:46:35] we could re-arrange what we're doing similarly regardless of the kafka-ization tool
[14:47:47] if it's too hard to formulate the output in raw VCL, we could even just make a local vmod that does nothing but format and output that one analytics varnishlog-output line
[14:48:23] nod
[14:48:49] having downstream tools just need to grab one specific shmlog entry for each request is a nice win, it kills a ton of complexity on the shmlog-reading side
[14:50:07] if we had done such a thing in the past, the bulk of our problems with the varnish4ization of varnishkafka would've been resolved by porting the VCL and just making sure that the std.log() output still looked the same
[14:50:24] which is much easier than handling the massive change in what raw varnishlog requests look like
[14:52:17] (also, it's kinda cool that the exact log entry being sent to kafka would also be visible when debugging with varnishlog-like tools too)
[14:54:44] (also, if we went so far as to have it outputting std.log(json-stuff), we really could just do: varnishlog -I somefield | kafkacat)
[15:00:06] 10Traffic, 06Operations: unix domain socket listening for varnish4 - https://phabricator.wikimedia.org/T138084#2388691 (10BBlack)
[15:38:51] 10Traffic, 06Operations, 06Community-Liaisons (Jul-Sep-2016): Help contact bot owners about the end of HTTP access to the API - https://phabricator.wikimedia.org/T136674#2388786 (10BBlack) New usernames in the past 24H: ``` Raboe001 ``` (I figure no point repeating the already-notified list every day, but c...
[16:44:56] 10Traffic, 10MediaWiki-General-or-Unknown, 06Operations, 06Services: Investigate query parameter normalization for MW/services - https://phabricator.wikimedia.org/T138093#2388961 (10BBlack)
[16:49:16] 10Traffic, 10MediaWiki-General-or-Unknown, 06Operations, 06Services: Investigate query parameter normalization for MW/services - https://phabricator.wikimedia.org/T138093#2388975 (10BBlack) For reference, here's the raw data at the end of some munging of a 5-minute sample of received URLs in a single cache...
[19:47:44] 10netops, 06Operations: Notify transits of new esams prefixes - https://phabricator.wikimedia.org/T81989#2390595 (10faidon) 05Open>03Resolved a:03faidon I verified via looking glasses that those routes are being successfully propagated by all of our Amsterdam transits but one. I've mailed them for comple...
[20:20:05] 10Traffic, 10Android-app-feature-Feeds, 10Mobile-Content-Service, 06Operations, and 3 others: Investigate Android app API request latency regression - https://phabricator.wikimedia.org/T138010#2390668 (10Mholloway) Mystery solved: due to a session logging schema change, the Android app began sending sessio...
[20:20:22] 10Traffic, 10Android-app-feature-Feeds, 10Mobile-Content-Service, 06Operations, and 3 others: Investigate Android app API request latency regression - https://phabricator.wikimedia.org/T138010#2390669 (10Mholloway) 05Open>03Resolved
[22:10:15] 10Traffic, 10Android-app-feature-Feeds, 10Mobile-Content-Service, 06Operations, and 3 others: Investigate Android app API request latency regression - https://phabricator.wikimedia.org/T138010#2390885 (10GWicke) This is the [updated latency graph](https://docs.google.com/spreadsheets/d/1ZcaXdxMhaAEFMferqiR...
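(Editorial aside, circling back to the std.log("Track: ...") idea discussed above: a minimal sketch of the consumer side, under the assumption that std.log() output shows up under the VCL_Log tag and that the payload after a "Track: " prefix is already the JSON line to ship. Piping into kafkacat is the delivery mechanism mentioned in the conversation; nothing here is a real varnishkafka replacement.)

```python
# Sketch: read `varnishlog -g request -i VCL_Log` output from stdin, keep only lines
# whose payload starts with "Track: ", and emit the payload so it can be piped into
# kafkacat. The tag name and prefix are assumptions based on the discussion, not a
# deployed format.
import sys

PREFIX = 'Track: '

def main():
    for line in sys.stdin:
        # varnishlog lines look roughly like: "--  VCL_Log        Track: {...json...}"
        parts = line.split(None, 2)
        if len(parts) == 3 and parts[1] == 'VCL_Log' and parts[2].startswith(PREFIX):
            sys.stdout.write(parts[2][len(PREFIX):])
            sys.stdout.flush()

if __name__ == '__main__':
    main()

# Usage sketch (topic name is hypothetical):
#   varnishlog -g request -i VCL_Log | python track_extract.py | kafkacat -P -t webrequest
```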