[08:07:35] it looks like our If-None-Match regex on varnishrls.mtail is too weak
[08:08:00] nice catch by Krinkle @ https://gerrit.wikimedia.org/r/#/c/431608/
[08:16:25] also.. our TLS info logging format needs some love
[08:17:51] take this as a sample "/ cache_status int-front http_status 301 http_method HEAD cache_control - inm - h2 tls_version session_reused key_exchange auth cipher full_cipher"
[08:18:12] empty values for cache_control or inm are represented by "-"
[08:18:30] but not for h2/tls_version/session_reused/key_exchange/auth/cipher/full_cipher
[08:21:14] I'm glad we put tests in place for mtail, madness otherwise
[08:21:43] yup
[08:59:23] Traffic, Operations, Goal, Patch-For-Review, User-fgiunchedi: Deprecate python varnish cachestats - https://phabricator.wikimedia.org/T184942#4189645 (Vgutierrez) >>! In T184942#4187502, @Krinkle wrote: > @Vgutierrez @ema I'm working on using the Prometheus metrics for the ResourceLoader dash...
[09:12:11] Traffic, Operations: Consider adding expect-CT: header to enforce certificate transparency - https://phabricator.wikimedia.org/T193521#4189695 (Vgutierrez) p:Triage>Normal
[09:16:15] nice!
[09:16:21] and yes, god save the tests
[09:38:08] ema: (morning!) let me know what you think about https://gerrit.wikimedia.org/r/#/c/431712/
[09:41:18] I've tried to be as flexible as possible with inm values cause basically it can contain almost any char https://tools.ietf.org/html/rfc7232#section-2.3
[09:48:08] yeah looks good!
[11:03:23] yay!
grafana annotation for varnish restarts: https://grafana.wikimedia.org/dashboard/db/varnish-machine-stats
[11:04:12] the idea is that you need to come up with a prometheus query returning 1 whenever you want the annotation to be displayed
[11:04:42] in the case of varnish-fe restarts, for instance: `delta(varnish_main_uptime{instance=~"$server:.*", layer="frontend"}[10m]) < bool 0`
[11:09:35] it's far from perfect: by using 10m for the delta function range and choosing a short enough timeframe in the visualization you'll get multiple annotations for one single restart event
[11:13:09] 5m seems to work better
[12:39:53] neat!
[12:41:07] I think you can also use resets()
[12:41:13] to achieve the same effect that is
[12:41:27] actually resets() > 0
[12:46:43] resets(varnish_main_uptime{instance=~"$server:.*", layer="backend"}[5m]) > bool 0
[12:46:55] yup that seems to do the trick too ^
[12:49:32] dashboard updated
[12:50:13] we should probably aggregate that metric so that we can see cross-DC restarts
[12:51:34] and keep track of puppet-merges somehow!
[12:52:05] option (1) would be exposing that to prometheus, option (2) sending the information to grafana
[12:52:39] (2) http://docs.grafana.org/http_api/annotations/#create-annotation
[12:53:12] indeed, for (1) we'd need sth like a "git_exporter" to e.g. export the timestamp of HEAD as a metric
[12:54:06] does git even really track the right timestamp? we'd want to know when it was merged into master, not when it was last rebased or merged into gerrit's repo.
[12:54:35] I guess we can emit the timestamp event manually on "puppet-merge" with a list of commit shas merged at that time
[12:55:03] yeah that'd be more accurate
[13:12:22] netops, Operations, ops-eqiad, Patch-For-Review: Rack/cable/configure asw2-c-eqiad switch stack - https://phabricator.wikimedia.org/T187962#4190310 (Marostegui) @ayounsi today we have failed over x1 master which was in row C, to a new host in row A.
The x1 blocker is now gone and you should be go...
[13:22:47] bblack: I'm trying to fill the offsite trip form sent by Karen, the budget holder is Mark, right?
[13:23:33] vgutierrez: yes
[13:23:41] thx ema :D
[13:24:22] category for travel expenses: "All staff Meeting" or "Other: Staff meetings"?
[13:24:55] I think the former only applies to all-hands
[13:25:29] yup.. checking office wiki, "Staff meetings" fits perfectly
[13:25:53] * ema nods
[13:25:57] hmm but we have another one for team offsites
[13:26:06] "Staff convening and offsites travel and event expense"
[13:26:30] (╯°□°)╯︵ ┻━┻
[13:26:35] :)
[13:27:45] hopefully they're gonna send us to Prague even if we get it wrong!
[13:27:53] hahahah
[13:28:18] well.. I've my doubts regarding my full name as it is in the WMF VS my passport
[13:29:16] oh that *is* important. Passport version IMHO
[13:29:41] right :D
[13:29:43] team offsite seems the most-appropriate label here
[13:29:55] "all staff meeting" is about our annual one for the whole org in january
[13:30:14] bblack: right, but they link an office wiki page as reference
[13:30:15] "other: staff meetings" is probably for one-offs where some subset of people travel to SF to meet up cross-team? I donno
[13:30:53] and "Staff convening and offsites travel and event expense" looks like it's the one including team offsites xD
[13:31:22] (it feels harder than it should be)
[13:33:19] I personally went for "Other: team offsite" and mentioned "SRE Offsite 2018"
[13:33:45] the latter as "What is the purpose of your travel"
[13:34:43] submitted -- alea jacta est
[13:34:59] haha
[13:35:27] in case of second-thoughts, you can actually re-send the form btw :)
[13:36:06] let's fuzz our lovely travel department <3
[13:38:12] I always thought (in english) that "the die is cast" was referring to a https://en.wikipedia.org/wiki/Die_(manufacturing), as in you've cast a new metal-shaping die in a mold, and that die will be used to shape future things.
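Going back to the varnish restart annotations discussed earlier (`delta(...) < bool 0` versus `resets(...) > bool 0`): a rough Python sketch of why counting decreases can catch a restart that a first-to-last delta misses. `count_resets` and `simple_delta` are hypothetical helpers written for this note, not Prometheus code, and `simple_delta` ignores the extrapolation that the real `delta()` applies at the window edges.

```python
from typing import Sequence


def count_resets(samples: Sequence[float]) -> int:
    """Count decreases between consecutive samples, in the spirit of PromQL resets()."""
    return sum(1 for prev, cur in zip(samples, samples[1:]) if cur < prev)


def simple_delta(samples: Sequence[float]) -> float:
    """Last sample minus first sample (simplified: no extrapolation)."""
    return samples[-1] - samples[0]


# Uptime in seconds scraped once a minute; the process restarts after the third sample.
long_running = [3600, 3660, 3720, 12, 72, 132]
# Same restart, but the process had only been up for 30s at the start of the window.
short_lived = [30, 90, 5, 65, 125]

for name, window in (("long_running", long_running), ("short_lived", short_lived)):
    print(name, "resets>0:", count_resets(window) > 0, "delta<0:", simple_delta(window) < 0)
```

In the second window the uptime at the end is already higher than at the start, so a delta-based check stays quiet while the decrease-counting check still flags the restart.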
[13:38:37] apparently according to https://en.wikipedia.org/wiki/Alea_iacta_est the "die" here is actually dice, as in you have cast the dice in some hypothetical game.
[13:38:49] TIL! :)
[13:40:37] probably the confusion is only possible in english, and I have no idea where I came up with my alternate incorrect interpretation, probably just randomly in my own head when I first heard it.
[13:41:21] ha! funny that the saying still works like that
[13:42:12] in Italian we also use "crossing the Rubicon" as an expression to mean that some hard decision has been taken, possibly with no way back
[13:42:33] referring to the same historical event as "the die is cast"
[13:43:52] in spanish we use the latin one
[13:45:02] or the spanish translation "la suerte está echada"
[13:45:49] ema: hmm do you speak German? :P
[13:46:00] vgutierrez: yup
[13:46:36] nice, I'm checking the well behaved bots list using AES128-SHA
[13:47:03] I've one spanish owned bot, several germans, one from vietnam...
[13:47:43] I guess that we should hit their discussion page asking for an update
[14:00:25] yeah, basically try to find a user talk page for the owner/bot to ask them about it. it could be they just need some simple platform/library update or whatever. in many cases specific bots are using some upstream MW API library which needs upstream updates to its code (or maybe already has them in newer versions, but they copied in an old one)
[14:00:58] importantly we're not blocking on them, we're just trying to give them a heads up that they need to update or they'll get broken.
[14:01:10] (it might be helpful to have a timeline already established so you can give them an idea how long they have)
[14:02:25] we've run into one major case in the past (during some of the original switch to HTTPS-only) where a bot was both community-crucial and had effectively lost its code maintainership, which can get tricky...
[14:15:22] yup..
#1 in terms of traffic is using https://metacpan.org/pod/MediaWiki::Bot, that still says that the default transport is http instead of https /o\
[14:22:54] regarding the timeline... I do believe that we can be a little bit aggressive on this one... checking the top #20 deprecated human UAs, almost everything is non-updateable (Playstation 3 is top #1) and we have a lot of deprecated phones and similar devices
[14:23:19] to some Windows XP / Windows Server 2003 users we can suggest upgrading to Firefox 52 ESR
[14:31:30] IE8/XP users are already-gone in this sense, from when we killed DES-CBC3-SHA
[14:31:43] if they're still connecting, it's because of some proxy and thus not an IE/XP issue
[14:31:59] (so probably upgrading to FF doesn't help them)
[14:32:12] I've some Chrome 49 on Windows XP
[14:32:32] Chrome49 (or anything recent) on XP should do better than AES128-SHA, unless proxied
[14:32:53] right.. so they're being MiTM victims and not deprecated :D
[14:33:08] (also there's another proxy case to keep in mind, sometimes it's a one-host proxy software on the user's machine itself, part of some "antivirus" defender thing, with horrid outbound TLS)
[14:33:43] so... the only "updateable" UA should be "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.2; .NET CLR 1.0.3705;)" that's featuring on Top #18
[14:33:55] IE6.0 in Windows Server 2003
[14:34:09] IE8 was the last version of IE that worked on XP, and only supported DES-CBC3-SHA with us when we killed it, no AES-based ciphers.
[14:34:42] but basically anything-else you can install in the modern era to replace that IE (any downloadable FF/Chrome/Opera/etc you can manage to get working) will work
[14:35:00] we recommended FF52 ESR during that deprecation, as the only one that was still actively-supported and easily available/installable.
[14:35:13] yep
[14:35:51] Chrome-on-XP is a nice option too, but you have to go find a backdated version from when they still supported it, and no future updates.
There might be some unofficial chromium builds though.
[14:36:16] ESR52 is still available and supported for a few more months, even
[14:36:27] yep, is still there
[14:36:43] until 2018-09-05
[14:36:48] also, another thing to think about when you see odd low-probability corner cases:
[14:37:50] *technically* it actually is possible to get IE8/XP to negotiate better-than-3DES (I think just AES128-SHA though). There's a DLL you can hack in manually to do it. Microsoft released an updated DLL for certain commercial variants (POSReady Terminal) that are based on XP's core, but never released the update for actual WinXP/2003.
[14:38:06] but some users have figured out how to get that DLL from some update server and copy it around and make it work on XP to upgrade their TLS
[14:38:14] it's just so off-the-beaten-path we can't recommend it
[14:38:33] that reminds me of a similar hack to get RAID-5 software support on Windows XP
[14:40:10] hmmm if we trust https://www.ssllabs.com/ssltest/viewClient.html?name=IE&version=8&platform=XP&key=101
[14:40:29] IE8 will be out of business in June with PCI deprecating TLS 1.0 support
[14:40:47] right, at least for commercial sites that take CCs
[14:41:28] vendors are supposed to be getting on top of that even earlier, I imagine many have already moved past it at this point with the deadline so close. All the PCI-auditing tools have already been red-flagging it and requiring specific exceptions/waivers for it even ahead of the firm deadline.
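The cipher names that keep coming up here (DES-CBC3-SHA, AES128-SHA) follow OpenSSL's naming, where the key exchange is visible in the prefix. A small Python sketch of sorting such names into rough buckets; `is_forward_secret` and `uses_3des` are hypothetical helpers for illustration only, based purely on the name, and say nothing about what a given server or client actually negotiates.

```python
# Prefixes OpenSSL uses for ephemeral (EC)DH key exchange in pre-TLS-1.3 suite names.
FORWARD_SECRET_PREFIXES = ("ECDHE-", "DHE-", "EDH-")


def is_forward_secret(cipher_name: str) -> bool:
    """Guess from an OpenSSL-style cipher name whether the key exchange is ephemeral."""
    # TLS 1.3 suite names (TLS_AES_128_GCM_SHA256, ...) do not encode the key
    # exchange; assumption: it is (EC)DHE, which is the common case.
    if cipher_name.startswith("TLS_"):
        return True
    return cipher_name.startswith(FORWARD_SECRET_PREFIXES)


def uses_3des(cipher_name: str) -> bool:
    """True for 3DES suites such as DES-CBC3-SHA."""
    return "DES-CBC3" in cipher_name


for name in ("DES-CBC3-SHA", "AES128-SHA", "ECDHE-RSA-AES128-GCM-SHA256"):
    print(f"{name}: forward_secret={is_forward_secret(name)} 3des={uses_3des(name)}")
```

Both DES-CBC3-SHA and AES128-SHA land in the non-forward-secret bucket because neither name carries an ECDHE/DHE prefix, which is the property at issue when a client can negotiate nothing better.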
[14:41:54] but there will probably be a number of them that hold out for the last minute :)
[14:42:42] (and there will probably be users that don't care and don't do any business online with PCI sites and hate our deprecation, we just have to have thick skin in this case)
[14:45:16] from https://www.alexa.com/topsites/category/Top/Shopping, top 5 sites won't pass PCI in a few days (etsy.com being the exception)
[14:45:56] s/days/weeks/
[14:46:29] an interesting point is also the top 3 us banks for end-user consumer accounts: wellsfargo gets an A+ and has killed TLSv1.0 already and has (non-preloaded) HSTS.
[14:47:30] chase bank gets a B-rating on ssllabs lol, has horrible ciphersuite ordering, fails to establish forward secrecy with some modern browsers, still has tlsv1.0
[14:48:31] bank of america also B-rating, and is even worse. they support TLSv1.0-1.2 but they *only* offer non-forward-secret ciphers (but they did kill 3des heh)
[14:48:49] hmm that must be a feature :P
[14:49:02] yo, disabling pybal on lvs2004, please don't do anything related to LVS in codfw :)
[14:49:10] probably makes it easier for them to sniff+log+audit all the traffic on their side, yeah. no logging FS session keys.
[14:49:14] XioNoX: go for it
[14:49:32] you'd think banks would be on top of this shit :P
[14:50:01] (if they really need internal logging without session keys, they could always do that after their first public-facing tier of SSL-terminating proxy :P)
[14:50:13] bblack: well..
here (Spain) I've seen huge differences between the bank's main page and their "online service" subdomain
[14:52:31] XioNoX: BTW, downtime it on icinga or expect some "noise" :)
[14:52:46] it's downtimed on icinga
[14:52:50] <3
[14:52:51] nice
[14:53:28] for chase, I checked mfasa.chase.com which is where their login form submits to, separately from www, still the same :/
[14:53:44] oh wow
[14:54:35] www.ing.es --> A+ on ssllabs, ing.ingdirect.es (login hostname) --> F
[14:54:58] vulnerable to ROBOT
[14:55:00] lovely
[14:55:24] bblack: can I failover traffic to lvs2004 to check if the issue is solved?
[14:57:39] FWIW we didn't see any impact when lvs2001 was reimaged (and lvs2004 took the traffic)
[14:58:47] vgutierrez: you mean user issues?
[14:58:53] XioNoX: right
[14:59:48] vgutierrez: yeah, the amount of errors was low enough to not be an issue for now, but needs to be fixed
[15:00:28] XioNoX: yeah if some work has been done, we can failover to look at the logged interface error rate, but
[15:00:56] well, no real but I guess, nevermind
[15:01:06] Papaul re-seated the DAC
[15:01:16] need to figure out if we need to do more than that
[15:01:34] just puppet-disable on 2001 and stop pybal, and log about it, and then start pybal + re-enable puppet when done observing for some period
[15:01:56] will do, thx!
[15:25:20] bblack (or someone else): Any chance you could glance at https://gerrit.wikimedia.org/r/#/c/431659/ and make sure that I'm correctly changing the backend hosts in a director? Seems straightforward to me, but I don't know what I don't know...
[15:27:28] bblack: Was troubleshooting peering with an ISP in Thailand, but realised Thailand is still going to ulsfo because of Zero. Can I do more than pointing them to https://phabricator.wikimedia.org/T189250 ?
[15:28:08] marlier: looks basically-sane. note that when defined like that, perf requests will route active/active to 1001+2001 for various users.
I assume that's ok with the applayer and it doesn't have just 1 active primary at any given time or whatever...
[15:29:06] (users will tend to be consistently on one side or the other for all their requests, though, unless they travel or we have cache-site outage events)
[15:29:33] That'll work fine, yes -- ironically performance will suffer slightly if they're routed to 2001, but not significantly, and it doesn't matter anyway. For my own edification, is there a way to define it such that 1001 is active, and 2001 is failover?
[15:30:08] marlier: yeah, basically just comment-out the line for 2001, and we'd make further commits there (commenting out one side or the other) to switch later
[15:30:32] the switching for active/passive, at this level, is manual
[15:31:07] Gotcha. I'll just leave it as is, then -- don't want anyone to have to make changes related to this in the event of a dc-level event.
[15:31:46] XioNoX: yeah there's nothing we can do. I wouldn't even bother pointing them at that ticket, it will only confuse them further since none of it makes sense. Just say we'd like to get peering working so we can better serve thailand, but we don't expect to actually send users to the new site for TH for another month or two
[15:32:22] (or something, or ignore it and hope they don't notice the current pointlessness)
[15:32:58] (or, we can dig into things and try to figure out if this specific ISP routes for the problematic carrier's DNS at all and make exceptions, but that gets pretty tricky for a temporary fix)
[15:33:19] yeah, it's good that we have a timeline
[15:33:35] mutante: good timing on that CR comment, see bblack a couple of lines above :-)
[15:33:58] I can't ignore them as they have been asking why the traffic is not going through peering
[15:34:43] yeah if they seem like they can understand you can try
[15:35:02] but understanding why zero works the way it does and why it's temporarily blocking and then going away soon, etc...
it's a deep dive in a strange direction :)
[15:37:11] marlier: heh, nice timing indeed. let me just get out of the breakfast cafe and on faster internet
[15:37:26] Whenever works, no rush
[15:51:04] marlier: merging and running puppet on the "misc:web" hosts
[15:51:26] mutante: rockin
[15:54:36] Traffic, DC-Ops, Operations, monitoring, and 2 others: memory errors not showing in icinga - https://phabricator.wikimedia.org/T183177#4191082 (fgiunchedi) The correctable errors check has been deployed and it is yielding some results already. Myself and @herron took at the list of hosts and ther...
[15:54:59] i have an alias for this. currently there are 11 hosts
[15:55:58] 10/11 11/11 done
[15:56:17] marlier: all requests should go to the new backend now
[15:56:36] testing now
[15:56:45] maybe tail -f the apache log on old and new
[15:57:53] seems to work and i see nothing on graphite1001 (anymore, as expected)
[15:59:09] the Flame Graphs seem to have an issue though
[16:04:15] Traffic, netops, Operations: cr1-eqsin 4 onboard interfaces down - https://phabricator.wikimedia.org/T193897#4191107 (ayounsi) Open>Resolved
[16:05:19] Traffic, DBA, Operations: Framework to transfer files over the LAN - https://phabricator.wikimedia.org/T156462#4191109 (jcrespo)
[16:06:59] Traffic, DBA, Operations: Framework to transfer files over the LAN - https://phabricator.wikimedia.org/T156462#4191122 (jcrespo) @Vgutierrez suggested using https://github.com/vstakhov/hpenc , which I don't think is a bad idea at all- it would just change some of the executions of openssl and netcat...
[16:41:09] Traffic, netops, Operations, ops-codfw: Interface errors on asw-d-codfw:xe-2/0/47 - https://phabricator.wikimedia.org/T193677#4191293 (ayounsi) Open>Resolved No more errors.
[16:45:09] mutante: pretty sure there's a firewall issue. https://gerrit.wikimedia.org/r/#/c/431779/ I think might be right, but I haven't dealt with our ferm configs at all.
[16:47:06] marlier: oh yea, of course, heh.
[16:47:17] that is right
[16:50:12] marlier: seems fine
[16:52:41] marlier: applied on 1001
[16:52:41] [webperf1001:~] $ sudo iptables -L | grep dpt:http
[16:52:48] Boom
[16:52:50] Looks great
[16:52:54] Site is working!
[16:53:08] great
[16:53:21] runs puppet on 2001
[16:54:03] marlier: i'll write a wiki page how to setup a site with httpd module and caching.. then link to it from README or so
[16:54:21] mutante: That would be pretty great :-)
[16:54:27] forgot about the ferm rule as well
[16:54:28] Thanks so much for the help on this!
[16:54:34] you're quite welcome
[17:26:34] Traffic, DBA, Operations: Framework to transfer files over the LAN - https://phabricator.wikimedia.org/T156462#4191453 (jcrespo) The recommended cipher, which is an easier change, is chacha20 or, alternatively, AES-GCM rather than the randomly selected one on the commit.
[18:29:30] Traffic, Discovery, Maps, Maps-Sprint, and 2 others: Decide on Cache-Control headers for map tiles - https://phabricator.wikimedia.org/T186732#4191642 (jmatazzoni)
[21:38:50] netops, Cloud-Services, Operations: Allocate public v4 IPs for Neutron setup in eqiad - https://phabricator.wikimedia.org/T193496#4192431 (ayounsi) @chasemp Can you provide an ETA for returning the /25?
[22:04:07] Traffic, Fundraising-Backlog, Operations, fundraising-tech-ops: SSL cert for links.email.wikimedia.org - https://phabricator.wikimedia.org/T188561#4192548 (cwdent) The problems I see are: - content served over http - weak DH supported (https://weakdh.org/) resulting in "B" grade from Qualys I d...
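On the TLS logging-format point from the start of the day (08:17): a Python sketch of why fields without a "-" placeholder are awkward to parse. A naive key/value pairing of the sample line would read `tls_version` as the value of `h2`. The parser below is a hypothetical illustration, not the mtail program; `KNOWN_KEYS` is assumed from the sample line, and values are assumed to be single tokens that never collide with a key name.

```python
from typing import Dict, Optional

# Key names taken from the sample log line; assumed to be the complete set.
KNOWN_KEYS = {
    "cache_status", "http_status", "http_method", "cache_control", "inm",
    "h2", "tls_version", "session_reused", "key_exchange", "auth",
    "cipher", "full_cipher",
}


def parse_fields(line: str) -> Dict[str, Optional[str]]:
    """Parse 'key value' pairs, treating '-' or a missing value as None."""
    tokens = line.split()
    fields: Dict[str, Optional[str]] = {}
    i = 0
    while i < len(tokens):
        key = tokens[i]
        if key not in KNOWN_KEYS:
            i += 1  # skip stray tokens such as the leading "/"
            continue
        nxt = tokens[i + 1] if i + 1 < len(tokens) else None
        if nxt is None or nxt in KNOWN_KEYS:
            fields[key] = None  # no placeholder was logged for this key
            i += 1
        else:
            fields[key] = None if nxt == "-" else nxt
            i += 2
    return fields


sample = ("/ cache_status int-front http_status 301 http_method HEAD "
          "cache_control - inm - h2 tls_version session_reused key_exchange "
          "auth cipher full_cipher")
print(parse_fields(sample))
```

Logging "-" for every empty field, as already done for cache_control and inm, would make the known-keys lookahead unnecessary.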