[10:28:14] ema: hi, would you have time to help me debug my timeout issues?
[10:33:09] dcausse: yup!
[10:33:22] ema: nice!
[10:34:07] ema: working with one of the mwdebug mw instances will work for you?
[10:34:22] dcausse: sure, can you re-apply the patch to mwdebug1002?
[10:34:51] ema: sure I can but I need to ask permission for that, one sec
[10:35:03] ok
[11:12:19] ema: I just realized that I can patch mwdebug1002 if it helps
[11:12:43] dcausse: that would help indeed!
[11:12:47] ok
[11:14:53] dcausse: so far I've tried to repro with a script sleeping 80s to no avail https://pinkunicorn.wikimedia.org/dcausse/eighty
[11:16:21] (as expected, we have seen requests taking much longer than that and responses come back just fine)
[11:16:31] and it would make sense given the proxy_read_timeout set to 180s right?
[11:17:14] y
[11:17:19] I was trying to check yesterday if dcausse's change was hitting another 60s proxy limit
[11:17:39] (that is the default afaics)
[11:20:07] * elukey blames dcausse :P
[11:20:37] hm :)
[13:20:14] proxy_send_timeout?
[13:20:22] it's an odd one, but possibly applicable
[13:20:30] and it's 60s
[13:24:01] bblack: super ema already solved the mystery, it was the nginx on hassaleh.codfw.wmnet with default proxy_read_timeout
[13:54:41] haha
[13:54:47] so many proxies!
[14:40:49] Traffic, DNS, Operations, Mobile: Many misc wikis lack mobile domains - https://phabricator.wikimedia.org/T152882#2872576 (Dzahn)
[17:06:50] netops, Discovery, Operations, Wikidata, and 2 others: wdqs2003 switch port configuration - https://phabricator.wikimedia.org/T153094#2873019 (RobH) Open>Resolved network port enabled, description set, and put in the internal vlan.
[17:27:48] Traffic, Citoid, Operations, RESTBase, and 5 others: Set-up Citoid behind RESTBase - https://phabricator.wikimedia.org/T108646#2873094 (Mvolz)
[18:38:25] Traffic, MediaWiki-General-or-Unknown, Operations, Release-Engineering-Team, and 4 others: Make sure we're not relying on HTTP_PROXY headers - https://phabricator.wikimedia.org/T140658#2873344 (Aklapper)
[18:38:28] bblack: question re: gdnsd rcode status counters, is their sum meaningful in any way? e.g. the total number of rcodes sent?
[18:38:45] IOW if summed together there won't be any double accounting
[18:40:45] godog: yeah, the rcode ones can be summed
[18:41:13] you can consider "dropped" a virtual rcode as well, even though it implicitly means no response and no rcode
[18:41:33] but there are multiple ways to slice that and think of it
[18:41:51] you could also just total up tcp_reqs + udp_reqs to get total inbound reqs. It should sum the same as all rcodes including "dropped"
[18:42:38] (although now that I think about that, I'd have to look again to know if the _recvfail or _sendfail ones count in rcode as dropped or not. possibly not)
[18:45:10] bblack: thanks! yeah I was wondering in https://gerrit.wikimedia.org/r/#/c/325975/5/modules/prometheus/files/usr/local/bin/prometheus-gdnsd-stats to have each rcode as a separate metric e.g. gdnsd_rcode_noerror or as key/value e.g. gdnsd_rcode{status="noerror"}
[18:45:21] ATM it is the former
[18:47:23] oh I see
[18:47:35] does that make a diff to offering a total?
[18:47:43] only the latter can be easily summed?
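
A minimal sketch of the two naming schemes raised at 18:45:10, written in Python without any client library. The function names, the sample counts, and the choice of the `nxdomain` and `refused` rcodes alongside `noerror` are illustrative assumptions; this is not the code in the gerrit change.

```python
# Hypothetical subset of rcodes, used only for illustration.
RCODES = ("noerror", "nxdomain", "refused")


def render_flat(counts):
    """One metric name per rcode, e.g. 'gdnsd_rcode_noerror 5'."""
    return [f"gdnsd_rcode_{name} {counts[name]}" for name in RCODES]


def render_labeled(counts):
    """One metric name with the rcode as a label value,
    e.g. 'gdnsd_rcode{status="noerror"} 5'."""
    return [f'gdnsd_rcode{{status="{name}"}} {counts[name]}' for name in RCODES]


if __name__ == "__main__":
    sample = {"noerror": 5, "nxdomain": 2, "refused": 1}  # made-up counts
    print("\n".join(render_flat(sample)))
    print("\n".join(render_labeled(sample)))
```

With the labeled form, a single PromQL aggregation such as `sum(gdnsd_rcode)` covers every status value; with the flat form, each metric name has to be added explicitly in the query.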
[18:57:05] godog:
[18:57:07] const stats_uint_t this_reqs = l_noerror + l_refused + l_nxdomain
[18:57:07] + l_notimp + l_badvers + l_formerr + l_dropped;
[18:57:31] if(this_stats->is_udp) {
[18:57:32] statio.udp_reqs += this_reqs;
[18:57:51] else {
[18:57:53] statio.tcp_reqs += this_reqs;
[18:58:07] so those are the summable ones and how they're summed in gdnsd
[18:58:35] 6x actual rcodes (noerror, refused, nxdomain, notimp, badvers, formerr) + "dropped" which is a virtual rcode meaning we didn't respond at all
[18:58:58] you can sum all of those directly for a total request-count, or you can sum udp_reqs+tcp_reqs to get the same
[18:59:27] the per-protocol sendfail/recvfail stats are independent of all of that
[18:59:53] bblack: ah ok, so yeah summing all non-dropped rcodes is the number of answers given
[19:00:09] kinda :)
[19:00:10] so it could be gdnsd_answers{rcode="noerror"}
[19:00:28] are you trying to see successful answers sent over the network?
[19:01:03] if so, that would be all rcodes other than dropped, - (udp_sendfail + tcp_sendfail)
[19:01:09] because those locally failed to send out
[19:01:31] (they weren't dropped by gdnsd, the network call to send them failed or whatever)
[19:01:52] whereas dropped is because a request was so malformed we couldn't even send a formerr
[19:01:53] ah ok, yeah in this case noerror was an example but your explanation is useful too
[19:02:23] noerror doesn't imply that we sent it successfully (could still sendfail)
[19:03:28] but mostly I don't think it's useful to try to do math against sendfail/recvfail
[19:03:49] those are error counters which should probably just be displayed independently to look for spikes/issues, should normally be very small
[19:04:29] ah ok, yeah that makes sense, I guess "logical" answers as far as only gdnsd is concerned
[19:04:57] in terms of useful stats info to stare at, I would tend to think of it this way:
[19:05:32] 1) It's good to see a total on inbound requests, which can be derived two ways: sum all 7x rcodes (including dropped), or sum tcp_reqs+udp_reqs.
[19:06:19] 2) It's good to see an rcode breakout on those including "dropped" (or maybe do 1+2 in a single graph with stacking and a total shown)
[19:07:25] 3) The idea of a separate graph for output is kind of redundant at that point. All the requests that aren't "dropped" imply logical response-attempts
[19:08:35] Traffic, DNS, Operations, Mobile: Many misc wikis lack mobile domains - https://phabricator.wikimedia.org/T152882#2873430 (Dzahn)
[19:08:58] 4) v6, edns, and edns_clientsub should be able to make a graph making them percentage-relative to total requests as in (1)
[19:09:49] thanks! that's useful, I'll paste that in T147426 so it doesn't get lost, if that's ok
[19:09:49] T147426: Port gdnsd statistics from ganglia to prometheus - https://phabricator.wikimedia.org/T147426
[19:09:56] 5) that leaves udp_tc, udp_edns_bug, and udp_edns_tc .... udp_tc would be a fraction of udp_reqs
[19:10:20] well all 3 of those are fractions of udp_reqs I guess
[19:10:45] yeah
[19:10:50] there's a million ways to slice stats! :)
[19:11:06] we can always do the math in grafana and play with useful views
[19:11:24] the important thing is it should be easy to sum the 7x rcodes as a total in grafana
[19:11:29] (including dropped)
[19:12:36] ok that answers my question re: key/value for rcodes, I'll send the review your way too
[19:13:16] yeah I'd tend to do minimal processing when exporting the metrics and do the math in prometheus and/or grafana
[19:15:31] Traffic, Operations, Patch-For-Review, Prometheus-metrics-monitoring: Port gdnsd statistics from ganglia to prometheus - https://phabricator.wikimedia.org/T147426#2873453 (fgiunchedi) Pasting a conversation in `#wikimedia-traffic` re: status codes and dashboarding ``` 19:04 in terms of...
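
The accounting described between 18:57 and 19:10 can be sketched in Python as follows. The flat dict layout, the function names, and all sample numbers are assumptions made for illustration; only the counter names and the relationships between them come from the conversation.

```python
# The six real rcodes named in the log; "dropped" is the virtual seventh.
REAL_RCODES = ("noerror", "refused", "nxdomain", "notimp", "badvers", "formerr")


def total_requests(stats):
    """Total inbound requests: all 7 rcode counters, including dropped."""
    return sum(stats[r] for r in REAL_RCODES) + stats["dropped"]


def logical_answers(stats):
    """Response attempts: every request that was not dropped.
    This does not subtract sendfail, so it is not 'successfully sent'."""
    return sum(stats[r] for r in REAL_RCODES)


def share_of_total(stats, key):
    """Fraction of all requests for counters such as v6 or edns (point 4)."""
    return stats[key] / total_requests(stats)


def share_of_udp(stats, key):
    """Fraction of UDP requests for counters such as udp_tc (point 5)."""
    return stats[key] / stats["udp_reqs"]


# Made-up sample values, chosen so the two totals agree.
sample = {
    "noerror": 900, "refused": 20, "nxdomain": 60, "notimp": 1,
    "badvers": 0, "formerr": 4, "dropped": 15,
    "udp_reqs": 950, "tcp_reqs": 50,
    "v6": 250, "udp_tc": 19,
}

if __name__ == "__main__":
    print(total_requests(sample))                    # sum of the 7 rcodes
    print(sample["udp_reqs"] + sample["tcp_reqs"])   # same total, other route
    print(logical_answers(sample))
    print(share_of_total(sample, "v6"))
```

In line with the closing remarks in the log, this kind of arithmetic would more likely live in Prometheus or Grafana queries than in the exporter; the sketch only makes the relationships between the counters explicit.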
[19:17:20] Traffic, DNS, Operations, Mobile: Many misc wikis lack mobile domains - https://phabricator.wikimedia.org/T152882#2873463 (Krenair)
[23:19:16] netops, Operations, ops-eqiad: asw-a2-eqiad PEM 0 not powered - https://phabricator.wikimedia.org/T153273#2874808 (faidon)