[06:04:09] 10netops, 10Operations, 10Patch-For-Review: LibreNMS improvements - https://phabricator.wikimedia.org/T164911#3334660 (10ayounsi) [06:48:33] 10netops, 10Operations, 10Patch-For-Review: Rancid improvements - https://phabricator.wikimedia.org/T167288#3334694 (10ayounsi) For reference, this is now possible: git clone ssh://netmon1001.wikimedia.org:/var/lib/rancid/core/ rancid-configs Devices also now have a "config" tab in LibreNMS. [07:37:39] 10Wikimedia-Apache-configuration, 10Mobile: Accessing zh-classical.wikipedia.org on a mobile device does not redirect to zh-classical.m.wikipedia.org - https://phabricator.wikimedia.org/T167492#3334778 (10Aklapper) [09:21:16] 10netops, 10Operations, 10Patch-For-Review: Rancid improvements - https://phabricator.wikimedia.org/T167288#3334923 (10ayounsi) 05Open>03Resolved All done. [09:23:19] 10netops, 10Operations, 10Patch-For-Review: LibreNMS improvements - https://phabricator.wikimedia.org/T164911#3334929 (10ayounsi) [09:25:02] 10netops, 10Operations, 10Patch-For-Review: LibreNMS improvements - https://phabricator.wikimedia.org/T164911#3250758 (10ayounsi) 05Open>03Resolved List items commented in the description. No real value for now for http basic auth, will reopen task if needed. [09:32:37] 10Traffic, 10netops, 10Operations: LLDP on cache hosts - https://phabricator.wikimedia.org/T165614#3334946 (10ema) 05Open>03Resolved Looks good! [10:47:48] ema: out of curiosity, any idea what can cause the CP* servers to receive ton of "ICMP Destination Unreachable" ? https://grafana.wikimedia.org/dashboard/db/network-performances?orgId=1&var-server=cp1073&var-datasource=eqiad%20prometheus%2Fops [10:52:31] XioNoX: interesting [10:53:13] I can run a packet capture if you want some sample data [10:53:30] staring at tcpdump -n 'icmp[0] = 3' already [10:54:25] 2/s or so? [10:55:00] for the amount of traffic those boxes are handling that doesn't seem a ton [10:56:28] mark: I'm looking at your task from 2011 :) https://phabricator.wikimedia.org/T79755 prometheus/graphana seems a better location to tackle it rather than netflow [10:56:43] mark: up to ~30/s on busier machines (like cp3038) [10:56:54] yeah, it's not much but unexpected still [10:56:57] 2011 is when I installed our first ops ticketing system, RT [10:57:09] sorry I didn't foresee prometheus and grafana at the time ;p [10:57:28] 2011 is when I streamed a lot of ideas in my head into tasks/tickets for the first time [10:57:45] we suddenly had a bit of a team forming, and they couldn't read my mind ;) [10:58:42] i guess that ticket was a bit later [10:58:58] that was an interesting problem [12:16:45] 10Wikimedia-Apache-configuration, 10Chinese-Sites, 10Mobile: Accessing zh-classical.wikipedia.org on a mobile device does not redirect to zh-classical.m.wikipedia.org - https://phabricator.wikimedia.org/T167492#3335129 (10LNDDYL) [13:19:44] 10Wikimedia-Apache-configuration, 10Mobile: Accessing zh-classical.wikipedia.org on a mobile device does not redirect to zh-classical.m.wikipedia.org - https://phabricator.wikimedia.org/T167492#3335215 (10Liuxinyu970226) [13:20:05] 10Wikimedia-Apache-configuration, 10Mobile: Accessing zh-classical.wikipedia.org on a mobile device does not redirect to zh-classical.m.wikipedia.org - https://phabricator.wikimedia.org/T167492#3334698 (10Liuxinyu970226) @LNDDYL this is not about zh.wikipedia.org, thx [14:23:32] ema: I still had PS1 loaded from earlier when I commented, but the comment on PS1 applies to PS4 too :) [14:24:03] bblack: sounds good! [14:26:00] bblack: I thought of calling the rate limit code from vcl_miss and vcl_pass instead of vcl_backend_fetch to be able to call synth without going crazy [14:26:34] ema: yeah makes sense! [14:29:10] 10Traffic, 10Wikimedia-Apache-configuration, 10Operations, 10Mobile, 10Patch-For-Review: Accessing zh-classical.wikipedia.org on a mobile device does not redirect to zh-classical.m.wikipedia.org - https://phabricator.wikimedia.org/T167492#3335419 (10BBlack) I think this is because our mobile-redirect log... [14:39:13] 10Traffic, 10Wikimedia-Apache-configuration, 10DNS, 10Operations: Add an redirect lzh.wikipedia to zh-classical.wikipedia - https://phabricator.wikimedia.org/T167513#3335451 (10Liuxinyu970226) [14:39:30] 10Traffic, 10Wikimedia-Apache-configuration, 10DNS, 10Operations: Add an redirect lzh.wikipedia to zh-classical.wikipedia - https://phabricator.wikimedia.org/T167513#3335387 (10Liuxinyu970226) Following T105999 tags [14:40:32] 10Traffic, 10Wikimedia-Apache-configuration, 10DNS, 10Operations: Add an redirect lzh.wikipedia to zh-classical.wikipedia - https://phabricator.wikimedia.org/T167513#3335456 (10vjudge404) [14:47:26] bblack: amended using 25/5s this time. The hardest part was fixing the vtc :) [14:51:19] :) [15:07:32] fascinating, varnish returns "HTTP/1.1 429 Unknown HTTP Status" instead of too many requests [15:10:59] sigh, that's because it's not in include/tbl/http_response.h [15:19:19] we've done arbitrary reason codes before though [15:19:31] I mean, reason texts [15:20:12] mmh [15:20:52] none of the old ones work? maybe we never checked post-varnish4 heh [15:21:25] yeah they're all overwritten from varnish internal stuff [15:21:42] https redirect says "HTTP/1.1 301 Moved Permanently" [15:22:37] it used to say "HTTP/1.1 301 TLS Redirect" [15:24:12] maybe we have to set resp.reason directly now [15:24:57] let's see [15:25:08] well, wait where is synth() documented, or is that still our macro? [15:25:43] no it's not our macro [15:26:07] synth(status code, reason) [15:26:08] Transition to vcl_synth with resp.status and resp.reason being preset to the arguments of synth(). [15:26:16] ^ is the docs on it [15:26:47] maybe it's a regression in some point update since our transition, too [15:28:11] possibly, for sure it doesn't seem to work as advertised :) [15:29:35] bin/varnishtest/tests/c00066.vtc tests the behavior when synth is called from vcl_recv [15:30:14] lol, and fails [15:32:52] oh wait, I've got v5 installed on my workstation [15:33:13] it passes on 4.1.6-1wm1 [15:34:05] maybe it only works from recv then [15:34:17] but we do call it from recv, don't we? [15:34:40] does manually setting resp.reason during cluster_fe_err_synth() work? [15:34:51] let me try [15:35:00] I mean inside of the block like: if (resp.status == 742) { set resp.status = 429; [15:36:54] it does [15:37:57] interesting [15:38:07] I wonder where the text from synth() went off to :P [15:52:57] oh it's broken from vcl_pass probably [15:59:20] uh [15:59:38] it seems to be because of what we do in vcl_synth [16:00:28] yep, the problem seems to be triggered by re-writing status in vcl_synth [16:01:03] Today's "what the how the" [16:03:28] this test passes: https://phabricator.wikimedia.org/P5562 [16:04:48] commenting out `set resp.status = 201`, msg is "Synth test" [16:07:55] ok so I kinda get that [16:08:06] but isn't the whole reason we use artificial codes so that we can tell things apart? [16:08:20] I guess we could invent (and later delete) more req headers for it [16:09:36] but basically the reason we do: return (synth(751, "TLS Redirect")); ... (instead of 301 directly there) [16:09:55] is so we can later do this block only for that case: [16:09:56] if (resp.status == 751) { [16:09:56] set resp.http.Location = req.http.Location; [16:09:56] set resp.status = 301; [16:09:56] set resp.http.Content-Length = "0"; // T64245 [16:09:56] T64245: "DoCoMo/2.0" feature-phone browsers get 502 errors from ja.wikipedia.org - https://phabricator.wikimedia.org/T64245 [16:09:58] return(deliver); [16:10:13] to avoid the custom error page, right? [16:10:15] (otherwise it would arrive there with 301 already and we couldn't tell it from the many other sources of 301s [16:10:18] ) [16:10:31] oooh right [16:11:44] so rewinding a bit: the basic problem is? synth() does set status+reason correctly, but anytime you re-set resp.status, that automatically overwrites resp.reason with a new built-in string for that code as well? [16:12:02] every time you set resp.status in vcl_synth, at least [16:12:37] and yes, synth does set status+reason correctly [16:16:19] that's really annoying to fix in any clean way [16:16:32] yeah [16:17:09] if only user-defined functions could return stuff [16:17:13] and take parameters [16:17:22] wouldn't that a crazy good world to live in? [16:19:48] oh god, there's a hack, sort-of [16:20:20] so.... [16:20:33] I don't know that it's any cleaner than any of the other obvious ways around this [16:20:50] but after some counter-intuitive varnish diving I found this: [16:21:00] /* [16:21:00] * We allow people to use top digits for internal VCL [16:21:00] * signalling, but strip them from the ASCII version. [16:21:00] */ [16:21:24] and then the limit in the code that gets invoked (but can hardly be found) for VRT_l_resp_status limits the upper end to 65535 [16:21:38] and the cutoff for legit codes is done via n%1000 [16:21:47] ewwww [16:21:48] * ema is scared [16:22:03] so what that means is, using the HTTPS redirect example above [16:22:36] we could do: return (synth(11301, "TLS Redirect")); [16:22:44] and if (resp.status == 11301) [16:23:03] and not set resp.status = 301, because it will be 301 on the wire (digits >1k are stripped) [16:23:20] we get 01-65, basically as 65 possible variations on each return code [16:23:26] this is what vcl does to people [16:24:20] the real question is what did somethign to the people that inflicted vcl on the other people? [16:24:26] decades of heroin abuse, I imagine [16:25:24] what if we `return (synth(751))` instead of `return (synth(751, "TLS Redirect"))` and then set resp.reason = 'TLS Redirect' in the if (resp.status==751) branch? [16:26:57] and pretend we've never seen any of this? :) [16:27:10] is the second arg optional? [16:27:14] I mean I guess we can use "" regardless [16:27:21] I think it's optional, yes [16:27:43] that seems like a reasonable approach :) [16:28:36] another option is we could put the correct return code in the synth() call and do our vcl_synth if-conditions on the reason text [16:28:53] if (resp.reason == "TLS Redirect") { do other magic like Location and CL=0 } [16:29:02] I like that! [16:29:30] but then that may open up some window where the application layer can return "200 TLS Redirect" on a response, and then we synth() afterwards for some reason, and the logic gets confused? [16:29:45] I donno, would application-set headers and synth() happen together? [16:31:03] resp.reason == "HATE YOU - TLS Redirect" and then remove 'HATE YOU -' with resub? :) [16:31:10] no that doesn't make sense, if we ended up calling synth() somewhere legit after parsing the application-sent headers, we'd be re-setting reason in the synth() call anyways [16:31:28] does varnish ever call vcl_synth() for us in cases we didn't explicitly return(synth())? [16:32:32] the nice thing about the resp.reason-testing version, is that in the case where there's no vcl_synth magic to do, we don't need a matching block there at all [16:32:44] (e.g. the current 429 block) [16:33:07] we only need the resp.reason-testing block if we have to muck with other response headers like the Location field or whatever [16:33:34] but how about the error page? [16:33:44] oh, right [16:33:53] hmmm [16:34:11] should we be emitting an error page anyways in these kinds of cases, for when an actual UA sees the response? [16:34:22] well actually yes :) [16:34:52] it does seem wasteful in some common cases [16:35:09] I donno, have to look at all the existing ones where we short-circuit with deliver I guess [16:35:41] wait a sec, this example here probably doesn't even work does it? [16:35:42] http://book.varnish-software.com/4.0/chapters/VCL_Subroutines.html#example-redirecting-requests-with-vcl-synth [16:36:00] you'd end up with Moved permanently instead of Permanently moved [16:36:08] right, I'm pretty sure not just examples but even built-in VCL is borked like this [16:37:23] oh no, none of the builtin vcl uses it, just the common examples [16:37:23] so, two things [16:37:42] 1) the resp.reason-testing idea seems nicer than the artificial status codes thing [16:38:10] 2) isn't it a bug that setting status in vcl_synth overwrites resp.reason? [16:38:52] well I donno. it might be a bug if it was an unintentional regression [16:39:20] but it runs deep. at first I assumed it was just a VCL thing (e.g. re-setting resp.reason on resp.status might've been done in VRT_l_resp_reason()) [16:39:46] but it's actually in http_SetStatus() that gets called from lots of places I think [16:40:19] * Setting the status will also set the Reason appropriately [16:40:19] */ [16:40:19] void [16:40:19] http_SetStatus(struct http *to, uint16_t status) [16:40:19] { [16:40:21] char buf[4]; [16:40:24] CHECK_OBJ_NOTNULL(to, HTTP_MAGIC); [16:40:26] /* [16:40:30] * We allow people to use top digits for internal VCL [16:40:31] * signalling, but strip them from the ASCII version. [16:40:34] */ [16:40:37] to->status = status; [16:40:39] status %= 1000; [16:40:41] assert(status >= 100); [16:40:44] bprintf(buf, "%03d", status); [16:40:46] http_PutField(to, HTTP_HDR_STATUS, buf); [16:40:49] http_SetH(to, HTTP_HDR_REASON, http_Status2Reason(status)); [16:41:12] although now that I say that, when I look at it, http_SetStatus() currently only has one actual callsite, which is VRT_l_resp_reason basically [16:41:23] err I meant VRT_l_resp_status [16:41:41] in any case, it seems pretty intentional [16:42:22] it just goes against the grain of established legacy examples using status>599 and then re-setting status later and expecting reason to stay custom [16:42:54] right [16:43:52] 3) why aren't all the RFC-defined status codes and reasons included in include/tbl/http_response.h :) [16:44:14] even if they were, custom ones should be an allowed case [16:44:48] obviously, yes, I was just thinking of the specific case that prompted all this [16:45:25] yeah [16:45:35] well they're missing older ones than that even, so clearly they're being selective [16:45:57] "418 I'm a teapot" is RFC-defined back in 1998 in https://www.ietf.org/rfc/rfc2324.txt [16:46:04] but not in varnish's table :P [16:46:34] lame! [16:53:44] so yeah, hmmm [16:54:08] maybe go through and evaluate which ones really need custom header-mangling and/or lack-of-output, basically [16:54:20] as part of switching to resp.reason-testing [16:54:51] yep I was doing that :) [16:55:09] we already do: [16:55:10] if (resp.status >= 400) { [16:55:10] call synth_errorpage; [16:55:17] so I guess no real question on the sub-400 ones [16:56:44] (in other words, e.g. the https 301+deliver block doesn't really need its return(deliver), it doesn't really do anything) [16:58:14] and the default vcl won't get called anyways as we do return deliver in our vcl_synth [16:58:45] right [16:59:06] and cases like upload's CORS thing which use synth(667,"OK") then later turn into a 200 [16:59:34] we could do instead synth(200, "CORS Preflight") and then check resp.reason == "CORS Preflight" and re-set it to "OK" in that block (which is still necc) [16:59:41] it fits the overall model better and it's clearer [16:59:58] * ema nods [17:07:27] setting CL:0 seems important though https://phabricator.wikimedia.org/T64245 [17:08:29] (I was thinking whether the whole sub https_error_redirect could be removed) [17:10:14] well we have to set the response Location regardless [17:11:04] ah, also quite important :) [17:12:07] but the CL:0 thing I kind of hate [17:13:06] for reasons: [17:13:36] 1) It's a proxy and/or a custom carrier UA -specific issue. Nothing in the relevant standards requires CL on anything. [17:13:59] 2) this originally came up in a pre-HTTPS era (with mobile redirects, not our HTTPS redirects), we've just been cargo-culting it around a bit since then. So like, I don't know that we really want to help them try to proxy HTTPS with fake certs if that's what they're still trying to do with their customers [17:14:59] 3) last updates about it were ~3y ago, and the reference link to DoCoMo documentation now gives a 404 on their site [17:15:48] 4) We probably have lots of other 30[12] cases (esp coming from applayer?) where we're not setting CL:0 on the empty response and nobody's specifically complaining about it. [17:16:47] if anything, if it's even still an issue that affects users, the bug and workaround belongs on docomo's end, not ours :P [17:24:39] I've gotta run, untested WIP here https://gerrit.wikimedia.org/r/#/c/358057/ [17:25:48] bblack: uh, 503s spike on text [17:26:58] looks to have been brief [17:27:03] but significant [17:27:32] I see wikiScrapeFull/0.0.5 and wikiScrape/0.0.2 in 5xx.json [17:27:46] :/ [17:28:02] cp1055 [17:28:19] seems to have generated it, as a backend [17:28:28] "cp1055 int, cp3043 miss, cp3040 miss" [17:28:28] "cp1055 int, cp1055 miss" [17:28:29] "cp1055 int, cp2010 miss, cp4016 miss, cp4010 miss" [17:28:38] lots of them end up in cp1055 int for 503s during the peak minute [17:29:35] I donno [17:32:12] and there's some stats shifts on cp1055's machine stats around then [17:32:53] "cached" mem dropped by 10GB suddenly, disk i/o slowed off briefly, etc [17:33:01] around 17:19 [17:33:25] object expiration rate spiked too [17:33:40] another one is starting I think [17:35:18] maybe, too early to tell [17:35:32] something is odd there, but it's hard to say if varnishd is at fault or it's traffic-driven somehow, indirectly? [17:45:37] and now we've got a smaller spike, but it looks like cp1066 in the logs, same kind of pattern [17:45:40] odd... [17:45:53] is something rolling through hosts and blipping them [17:46:04] (or some delayed effect of the expiration of a ton of objects at a single timestamp?) [17:48:07] Jun 9 17:36:46 cp1066 dbus[1097]: [system] Activating via systemd: service name='org.freedesktop.timedate1' unit='dbus-org.freedesktop.timedate1.service' [17:48:10] Jun 9 17:36:46 cp1066 systemd[1]: Starting Time & Date Service... [17:48:12] Jun 9 17:36:46 cp1066 dbus[1097]: [system] Successfully activated service 'org.freedesktop.timedate1' [17:48:15] Jun 9 17:36:46 cp1066 systemd-timedated[11587]: /etc/localtime should be a symbolic link to a time zone data file in /usr/share/zoneinfo/. [17:48:18] Jun 9 17:36:46 cp1066 systemd[1]: Started Time & Date Service. [17:48:21] why did this just start recently? [17:50:00] hmmm apparently that's normal, every ~30m? [17:50:01] Jun 9 06:29:51 cp1066 systemd[1]: Starting Time & Date Service... [17:50:01] Jun 9 06:37:51 cp1066 systemd[1]: Starting Time & Date Service... [17:50:02] Jun 9 06:59:51 cp1066 systemd[1]: Starting Time & Date Service... [17:51:19] funny how that unit/service name doesn't exist in systemctl :P [17:51:20] it's using systemd timesyncd? [17:51:45] I have no idea, I didn't follow those changes closely [17:51:59] me neither, but I remember we said we were migrating to it [17:52:25] I kind of thought that meant we were moving to some kind of systemd-bundled ntp client [17:52:32] if (os_version('debian >= jessie')) { [17:52:36] include ::standard::ntp::timesyncd [17:52:37] yes [17:52:40] that is the systemd one [17:53:03] at least is what I thought [17:53:07] looking at puppet code [17:53:43] and seems so: /etc/systemd/timesyncd.conf [17:54:14] bblack: the unit is systemd-timesyncd.service [17:54:44] yeah I wonder what timedate1 is? [17:54:47] on cp1066: Active: active (running) since Wed 2017-05-31 13:25:22 UTC; 1 weeks 2 days ago [17:55:19] apparently it controls timesyncd and other related things, I think [17:55:41] yeah, https://www.freedesktop.org/wiki/Software/systemd/timedated/ [17:55:45] ok whatever, in any case not 503-related I don't think [17:55:49] just happened to catch it in the logs [17:56:01] seems red herring [17:56:06] and unrelated [17:56:54] while we're on that though, timedatectl's output "RTC in local TZ: no" is actually false [17:57:13] I think they mean that to mean "RTC fails to be in UTC like it should be: no" [17:57:36] but since our local TZ is UTC, the correct answer to the question as stated is "RTC in local TZ: yes" :P [17:58:33] lol [18:04:08] hmm cp1066 just did its weekly backend restart [18:04:22] confusing, as that caused a big stats blip right after the one I was looking at [18:05:07] I was about to ask if the restart was before or after the spike mentioned above [19:24:41] the timedate issue is T157797 [19:24:41] T157797: Run Icinga check for systemd-timedated less often - https://phabricator.wikimedia.org/T157797 [19:24:59] not really an issue [19:25:02] just socket activation [21:25:23] 10Traffic, 10Operations, 10ops-esams: Degraded RAID on lvs3001 - https://phabricator.wikimedia.org/T166965#3313139 (10Dzahn) importing text from (almost) duplicate ticket T166964 (merging into this ticket) ``` TASK AUTO-GENERATED by Nagios/Icinga RAID event handler A degraded RAID (md) was detected on hos... [21:25:59] 10Traffic, 10Operations, 10ops-esams: Degraded RAID on lvs3001 - https://phabricator.wikimedia.org/T166964#3313134 (10Dzahn) 05Open>03Invalid closing as duplicate of T166965 imported text over there [21:26:42] 10Traffic, 10DNS, 10Operations: Redirect status.wikipedia.org to status.wikimedia.org - https://phabricator.wikimedia.org/T167239#3336503 (10Dzahn) p:05Triage>03Normal [21:31:20] 10Traffic, 10Operations, 10ops-esams: Degraded RAID on lvs3001 - https://phabricator.wikimedia.org/T166965#3336512 (10Dzahn) [21:31:23] 10Traffic, 10Operations, 10ops-esams: Degraded RAID on lvs3001 - https://phabricator.wikimedia.org/T166964#3336514 (10Dzahn)