[00:43:55] netops, Operations, Goal, Patch-For-Review: Decomission palladium - https://phabricator.wikimedia.org/T147320#2747305 (Dzahn) Everything is gone, including DNS, just that i can't get it out of Icinga, even though i ran puppet node deactivate more than once, and even on both masters.
[00:44:44] netops, Operations, Goal: Decomission palladium - https://phabricator.wikimedia.org/T147320#2747307 (Dzahn)
[07:57:38] and all the TFO stats are also there, that's nice
[07:58:15] curl -s localhost:9100/metrics|grep -i tcpf
[09:09:33] Traffic, Operations, ops-eqiad: cp1066.mgmt.eqiad.wmnet is unreachable - https://phabricator.wikimedia.org/T149217#2747798 (ema)
[10:04:11] so, in order to write varnish-version agnostic test cases for text we can pass -Dallow_inline_c='-p vcc_allow_inline_c=true' and -Dcc_command='-p cc_command="exec cc -fpic -shared -Wl,-x -L/usr/local/lib/ -o %o %s -lmaxminddb"' to varnishtest and then use varnish v1 -arg "${cc_command} ${allow_inline_c} ..." in the test case
[10:04:33] however, on v3 I think the default varnishtest buffer space is not enough and the VCL fails to compile :(
[10:05:05] starting with v4 they added a nice -b parameter to change that, but of course it's not there on v3
[10:05:46] on v3 the failure looks like this:
[10:05:47] **** v1 0.8 CLI RX| (That was just a warning)\n
[10:05:47] **** v1 0.8 CLI RX| Message f...
[10:05:47] ---- v1 0.8 FAIL VCL does not compile
[10:06:53] tl;dr: I propose to just write the tests for v4 and stop wasting time on this
[11:05:49] Traffic, Operations, ops-esams: cp3021 failed disk sdb - https://phabricator.wikimedia.org/T148983#2747977 (ema) p:Triage>Low
[11:37:08] sounds fine to me :)
[13:04:06] // normalize to boolean post-netmapper (varnish-3.0.4...)
[13:06:17] heh
[13:06:34] I remember that, but it's been so long I can't remember which way is now-correct
[13:07:03] basically, in the midst of the varnish 3.0.x release cycle, somewhere along the way they changed the VCL language subtly without any changelog info
[13:07:32] then in a later version they documented what they changed in a past version, and it was something crazy about boolean evaluation of 0 and/or ""
[13:07:43] and our VCL had to survive through it all
[13:10:01] there are lots of interesting things in our text vcl :)
[13:10:52] like XFP only being trusted from our networks, should be the TLS terminator only but internal apps set it to fake HTTPS
[13:12:39] yes
[13:12:51] we should probably audit who's left doing that
[13:13:07] there's no good reason to support it unless some legacy internal code cannot make outbound https connections at all
[13:13:14] I think parsoid may have been a past case
[13:18:32] finally found the only changelog mention:
[13:18:35] "Note: In between 3.0.3 and 3.0.4 the VCL truth value for empty strings changed. Please see Bug #1218 and Bug #1406 for the details."
[13:18:56] thanks for the explanation :P
[13:21:36] so I think the way this works out is:
[13:21:52] netmapper's map() function returns STRING. It can return NULL at the C-level
[13:23:14] when you do set req.http.foo = netmapper.map(...), if it returns NULL, it's equivalent to: set req.http.foo = ""
[13:23:28] which, depending on varnish version, is different than: unset req.http.foo
[13:23:37] at least as far as boolean interpretation goes, not sure about header output
[13:25:29] I guess the right way to be sure is to test it
[14:10:11] netops, Operations, ops-eqiad, Patch-For-Review: Decommission psw1-eqiad - https://phabricator.wikimedia.org/T149224#2745777 (faidon) Removed from LibreNMS, rancid, smokeping, torrus, Icinga & DNS.
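Editor's sketch of the 13:21-13:25 exchange above: a toy Python model of why `set req.http.foo = ""` and `unset req.http.foo` can differ in a boolean test. It is not Varnish or netmapper code. The helper names and the `empty_string_is_true` switch are hypothetical, and the log does not say which way the truth value went between 3.0.3 and 3.0.4, so the switch is left as a parameter rather than guessed.

```python
from typing import Dict, Optional


def set_header(headers: Dict[str, str], name: str, value: Optional[str]) -> None:
    """Model `set req.http.<name> = <value>`; None stands in for a C-level NULL."""
    # Per the chat, assigning a NULL result is equivalent to assigning "".
    headers[name] = "" if value is None else value


def unset_header(headers: Dict[str, str], name: str) -> None:
    """Model `unset req.http.<name>`: the header is removed entirely."""
    headers.pop(name, None)


def header_is_true(headers: Dict[str, str], name: str,
                   empty_string_is_true: bool) -> bool:
    """Model `if (req.http.<name>)`.

    empty_string_is_true is a hypothetical switch for the version-dependent
    behaviour; the changelog note only says the truth value changed.
    """
    if name not in headers:
        return False  # an absent header tests false
    if headers[name] == "":
        return empty_string_is_true
    return True


if __name__ == "__main__":
    for flag in (False, True):
        headers: Dict[str, str] = {}
        set_header(headers, "foo", None)  # a map() call that returned NULL
        after_set = header_is_true(headers, "foo", flag)
        unset_header(headers, "foo")
        after_unset = header_is_true(headers, "foo", flag)
        print(f"empty_string_is_true={flag}: set-to-NULL={after_set}, unset={after_unset}")
```

Only one setting of the switch makes the two cases diverge, which is why an explicit "normalize to boolean" step after the map() call is the safe option across versions.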
[14:39:52] HTTPS, Traffic, Operations, Performance-Team, and 2 others: HTTPS-only for stream.wikimedia.org - https://phabricator.wikimedia.org/T140128#2748382 (BBlack) >>! In T140128#2684764, @BBlack wrote: > We could perhaps enable apache logging of the X-Client-IP header to see through the caches for this...
[14:40:19] HTTPS, Traffic, Operations, Performance-Team, and 2 others: HTTPS-only for stream.wikimedia.org - https://phabricator.wikimedia.org/T140128#2748398 (BBlack) At a glance, it seems like the bulk of the query traffic comes from GCE and AWS, and the bulk of it's still not HTTPS.
[14:42:32] Traffic, Operations, Wikimedia-Stream, Patch-For-Review: Move rcstream to an LVS service - https://phabricator.wikimedia.org/T147845#2748403 (BBlack) I've briefly reviewed the python code at https://github.com/wikimedia/mediawiki-services-rcstream/blob/master/rcstream/rcstream and I don't see whe...
[15:48:18] the python-varnishapi -c issue has been fixed https://github.com/xcir/python-varnishapi/issues/65
[15:48:37] I guess we can import the latest version soon
[15:49:20] elukey: ^
[15:51:22] \o/
[15:51:50] ah he removed the whole block
[15:51:51] nice
[17:04:18] Wikimedia-Apache-configuration, Operations, Patch-For-Review: Font list resource doesn't have a "Content-type: text/plain;charset=utf-8" header - https://phabricator.wikimedia.org/T146421#2749070 (elukey) Open>Resolved a:elukey Just deployed, now https://noc.wikimedia.org/conf/fc-list loo...
[17:57:14] http://patchwork.ozlabs.org/patch/687814/
[18:04:34] SSL_read() failed (SSL: error:1408F119:SSL routines:ssl3_get_record:decryption failed or bad rec
[18:04:37] ord mac) while processing HTTP/2 connection, client:
[18:04:55] that's the actual error message on the nginx side, on an actual repro of the connection reset
[18:05:08] (it's logged at level "info", so it doesn't show up in our normal logs)
[18:05:30] (and that was for my client IP, the count of the messages and their timing matched up with my client-side disconnects)
[18:06:26] also notable in this round of testing: the package is up-to-the-minute nginx master codebase, so we're not missing any new bugfixes that are relevant. And I took out the cloudflare dynamic record sizing patch JIC
[18:07:05] when I first tried this repro, I had ssl_buffer_size on the server at 4k. it seemed harder to repro than before, but still did.
[18:07:25] then I upped it to 16k and it got easier to repro (and I got the bad record mac errors on the client side, not just disconnects)
[18:07:45] then I dropped it down to "1300", and it got much harder to repro, with the disconnects being rarer
[18:08:01] the disconnects at 1300 are what generated the nginx/openssl log output above
[18:11:58] https://trac.nginx.org/nginx/ticket/215
[18:12:12] ^ apparently someone reopened an ancient ticket with the same error message, but recently
[18:12:51] note the final two comments, the reporter is using the same basic software revs we are
[18:13:13] also, nginx's response isn't quite right: it's logged at the "info" level, not "error" level
[18:13:44] nginx says it's an openssl regression
[18:13:50] (categorically)
[18:14:11] but I think it's still possible that the way nginx is using/abusing the OpenSSL API may be valid in 1.0.x but invalid in 1.1.x
[18:14:26] because I'm not finding similar reports with other server software yet
[18:57:45] Traffic, Operations, Wikimedia-Stream, Patch-For-Review: Move rcstream to an LVS service - https://phabricator.wikimedia.org/T147845#2705418 (Krinkle) >>! In T147845#2748403, @BBlack wrote: > I've briefly reviewed the python code at https://github.com/wikimedia/mediawiki-services-rcstream/blob/ma...
[19:00:10] Traffic, Operations, Wikimedia-Stream, Patch-For-Review: Move rcstream to an LVS service - https://phabricator.wikimedia.org/T147845#2749511 (BBlack) Ok, I was only considering the websockets case. Still, since the python code is unaware of X-Client-IP... what is it tying the session to internal...
[19:07:27] Traffic, Operations, Wikimedia-Stream, Patch-For-Review: Move rcstream to an LVS service - https://phabricator.wikimedia.org/T147845#2749547 (BBlack) I guess really the answer to that doesn't matter either way. We'd still like to get this service to conform to the pattern of every other service...
[22:02:20] cp2003:~$ curl localhost:9131/metrics -s | grep -v ^# -c
[22:02:20] 384
[22:02:29] cp2003:~$ curl localhost:9331/metrics -s | grep -v ^# -c
[22:02:29] 358
[22:02:32] \o/
[22:04:55] double negative
[22:05:06] oh nevermind
[22:05:09] the ^ is anchoring
[22:05:11] sorry!
[22:05:49] ori: tut tut! interview question :P
[22:05:54] haha
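Editor's sketch of the `grep -v ^# -c` point above: the `^` anchors the match to the start of the line, `-v` inverts the selection and `-c` counts, so the command reports how many lines do not begin with `#`. Below is a rough Python equivalent. The sample text and its metric names are made up for illustration and are not output from cp2003.

```python
def count_non_comment_lines(text: str) -> int:
    """Count lines that do not start with '#', like `grep -v '^#' -c`."""
    return sum(1 for line in text.splitlines() if not line.startswith("#"))


# Hypothetical metrics-style sample: two comment lines, two sample lines.
SAMPLE = """\
# HELP example_requests_total Hypothetical counter.
# TYPE example_requests_total counter
example_requests_total{code="200"} 12
example_requests_total{code="500"} 1
"""

if __name__ == "__main__":
    print(count_non_comment_lines(SAMPLE))  # prints 2
```

As with grep, a blank line counts as a non-comment line here, since it does not start with `#`.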