[08:13:41] HTTPS, Traffic, Analytics-Kanban, Operations: OCG failing with new GlobalSign intermediate workaround - https://phabricator.wikimedia.org/T148076#2715534 (MoritzMuehlenhoff) Beside ocg we have a other precise/trusty systems not using nodejs 4: - sca1* still has it installed, but the only remainin...
[08:52:48] https://goo.gl/P8Pqc9 - this is what pivot.w.o shows for the past days of Mac OS Sierra pageviews
[08:56:02] times are in UTC right?
[08:56:38] afaics yes
[08:56:46] sorry afaik :)
[08:57:00] funny the different patterns between safari and chrome users :)
[08:57:19] 6am-11am :D
[09:08:42] HTTPS, Traffic, Analytics-Kanban, Operations: OCG failing with new GlobalSign intermediate workaround - https://phabricator.wikimedia.org/T148076#2715606 (akosiaris) >>! In T148076#2714544, @Volans wrote: > FYI it's worth noticing that the upgrade of NodeJS for this service looks a bit broken by...
[09:23:39] elukey: nice data! thanks
[09:52:54] Traffic, Operations: Deploy redundant unified certs - https://phabricator.wikimedia.org/T148131#2715740 (BBlack)
[09:53:15] Traffic, Operations: Deploy redundant unified certs - https://phabricator.wikimedia.org/T148131#2715756 (BBlack)
[09:53:18] HTTPS, Traffic, Operations: implement Public Key Pinning (HPKP) for Wikimedia domains - https://phabricator.wikimedia.org/T92002#1101271 (BBlack)
[09:53:31] Traffic, Operations: Deploy redundant unified certs - https://phabricator.wikimedia.org/T148131#2715740 (BBlack) p:Triage>High
[09:54:20] Traffic, Operations: Deploy redundant unified certs - https://phabricator.wikimedia.org/T148131#2715740 (BBlack)
[10:02:51] interesting, https://gerrit.wikimedia.org/r/#/c/315920/ does not seem to compile properly
[10:02:59] https://puppet-compiler.wmflabs.org/4363/cp1008.wikimedia.org/change.cp1008.wikimedia.org.err
[10:03:15] I've tried a very similar change in labs on my self-hosted puppetmaster and it works fine there
[10:03:26] Traffic, Operations: OCSP Stapling: support truly-independent ECC/RSA Certs+Staples - https://phabricator.wikimedia.org/T148132#2715770 (BBlack)
[10:03:33] the hiera lookup part seems ok:
[10:03:39] ./utils/hiera_lookup -v --site=eqiad --fqdn=cp1008.wikimedia.org --roles=cache::text cache::text::nodes
[10:03:45] {"eqiad"=>["cp1008.wikimedia.org"]}
[10:04:04] Traffic, Operations: Deploy redundant unified certs - https://phabricator.wikimedia.org/T148131#2715784 (BBlack)
[10:04:07] HTTPS, Traffic, Operations, Patch-For-Review, Wikimedia-Incident: Make OCSP Stapling support more generic and robust - https://phabricator.wikimedia.org/T93927#2715785 (BBlack)
[10:04:10] Traffic, Operations: OCSP Stapling: support truly-independent ECC/RSA Certs+Staples - https://phabricator.wikimedia.org/T148132#2715783 (BBlack)
[10:08:55] Traffic, Operations: OCSP Stapling for Intermediates - https://phabricator.wikimedia.org/T148134#2715807 (BBlack)
[10:11:26] Traffic, Operations: Deploy redundant unified certs - https://phabricator.wikimedia.org/T148131#2715822 (BBlack)
[10:11:29] HTTPS, Traffic, Operations: implement Public Key Pinning (HPKP) for Wikimedia domains - https://phabricator.wikimedia.org/T92002#2715821 (BBlack)
[10:11:43] Traffic, Operations: Deploy redundant unified certs - https://phabricator.wikimedia.org/T148131#2715740 (BBlack)
[10:11:46] HTTPS, Traffic, Operations: implement Public Key Pinning (HPKP) for Wikimedia domains - https://phabricator.wikimedia.org/T92002#1101271 (BBlack)
[10:13:05] Traffic, Operations: OCSP Stapling: support truly-independent ECC/RSA Certs+Staples - https://phabricator.wikimedia.org/T148132#2715828 (BBlack)
[10:13:08] HTTPS, Traffic, Operations, Patch-For-Review, Wikimedia-Incident: Make OCSP Stapling support more generic and robust - https://phabricator.wikimedia.org/T93927#2715831 (BBlack)
[10:13:32] Traffic, Operations: OCSP Stapling for Intermediates - https://phabricator.wikimedia.org/T148134#2715807 (BBlack)
[10:13:35] Traffic, Operations: Deploy redundant unified certs - https://phabricator.wikimedia.org/T148131#2715835 (BBlack)
[10:13:38] HTTPS, Traffic, Operations, Patch-For-Review, Wikimedia-Incident: Make OCSP Stapling support more generic and robust - https://phabricator.wikimedia.org/T93927#1149975 (BBlack)
[10:13:41] Traffic, Operations: OCSP Stapling: support truly-independent ECC/RSA Certs+Staples - https://phabricator.wikimedia.org/T148132#2715770 (BBlack)
[10:14:10] Traffic, Operations: OCSP Stapling: support truly-independent ECC/RSA Certs+Staples - https://phabricator.wikimedia.org/T148132#2715770 (BBlack) p:Triage>Normal
[10:14:26] Traffic, Operations: OCSP Stapling for Intermediates - https://phabricator.wikimedia.org/T148134#2715807 (BBlack) p:Triage>Normal
[10:15:59] HTTPS, Traffic, Analytics-Kanban, Operations, Patch-For-Review: Windows 10 & MacOS Sierra Cert errors - https://phabricator.wikimedia.org/T148045#2715846 (BBlack)
[10:16:03] HTTPS, Traffic, Analytics-Kanban, Operations: GlobalSign intermediate updates for one-offs - https://phabricator.wikimedia.org/T148069#2715843 (BBlack) Open>Resolved a:BBlack Resolving for now, as we've covered what we can cover here in Ops. We'll need this ticket as a reference if w...
[10:19:50] ema: I think it has to do with all the other complex bits around backend definitions, e.g. in instances.pp
[10:20:18] probably it's that there's no codfw backends defined for cp1008
[10:20:23] I think?
[10:20:27] something related to all that mess
[10:20:38] oh that's easy to verify
[10:20:46] I'll just add cp1008 for codfw too
[10:21:07] yeah, or add it to the production exclusion bit
[10:21:42] modules/role/manifests/cache/instances.pp, the part that filters out codfw if $::realm != 'production'
[10:22:07] but maybe it's less hacky in the net to just have eqiad+codfw defined in cp1008's file, with both pointing at self
[10:22:10] I donno
[10:26:31] awesome, that was it
[10:27:24] <_joe_> bblack, ema where is the "restart varnish randomly if it was running for N days" cron?
[10:27:46] <_joe_> I might want to use it for hhvm as well
[10:28:09] modules/varnish/templates/varnish-backend-restart.cron.erb
[10:28:40] <_joe_> btw, you might find https://github.com/wikimedia/operations-puppet/blob/production/modules/conftool/manifests/scripts/service.pp interesting
[10:29:13] <_joe_> specifically, pooler-loop could be interesting: https://github.com/wikimedia/operations-puppet/blob/production/modules/conftool/files/pooler_loop.rb
[10:29:47] <_joe_> this not only does run depool/pool, but it verifies on the LVSs if pybal picked up the action
[10:29:49] _joe_: the interesting bits are here https://github.com/wikimedia/operations-puppet/blob/production/modules/role/manifests/cache/upload.pp#L128
[10:29:57] <_joe_> ema: thanks
[10:30:43] nice
[10:30:53] <_joe_> I know it's ruby
[10:31:02] <_joe_> that's Marko's fault
[10:31:03] can we get the loop behavior on the all-services ones that exist already? "pool" and "depool"
[10:31:30] <_joe_> it works both for pooling and depooling
[10:31:41] <_joe_> but I'm not sure I understood your question
[10:31:43] I mean, the existing scripts that pool or depool all services on the host
[10:31:51] that we're hooked into for host shutdown/reboot
[10:32:00] <_joe_> well you can just run pooler-loop instead
[10:32:30] <_joe_> pooler-loop without the selector in the call will work for all services on the host IIRC
[10:33:12] ok
[10:33:44] <_joe_> so e.g. for restbase in codfw we have
[10:33:58] <_joe_> uhm
[10:34:02] <_joe_> now that I think about it
[10:34:27] <_joe_> you'd probably need to have multiple services checked, not the one of the pool_name you give
[10:34:43] <_joe_> which won't happen
[10:34:53] yeah
[10:35:02] <_joe_> yeah let me think about that later :P
[10:35:02] or we can break up what we're doing and make it per-service
[10:35:10] <_joe_> that too
[10:35:31] probably we should get back to that sort of idea anyways
[10:35:49] <_joe_> you could just do conftool::script::service['varnish','nginx','varnish-fe'...]:
[10:35:59] making it a fixed thing in systemd terms that pool/depool always happens around the state of a given service
[10:36:04] <_joe_> and then add all of the generated convenience scripts there
[10:36:14] right now we don't have that hard hook in there
[10:36:48] <_joe_> actually I'm thinking of creating a default systemd sidekick to pool/depool services
[10:36:49] either via ExecStartPost/ExecStopPre, or dependency magic
[10:37:07] <_joe_> bblack: we have systemd::sidekick that /should/ work
[10:37:17] <_joe_> but this is all an abstraction I can't work on now :P
[10:37:19] if you do, make sure it can be parameterized to include some additional fixed sleep or something after confirming depool
[10:37:25] to drain at least some of existing connections
[10:38:05] <_joe_> yes
[10:38:09] in some theoretically ideal world, we could poll the LVS's ipvsadm output and know when 0 connections are left too
[10:38:15] <_joe_> actually we should test how to do that correctly
[10:38:17] but that sometimes takes forever and it's unrealistic
[10:38:38] <_joe_> bblack: once I've got my patches merged, we could expose ipvs status from pybal
[10:39:06] the bottom line is TCP connections can sometimes fail and that's considered "normal" in some sense. if we've switched things up for fresh connections and we interrupt a few lingerers, it shouldn't be a big deal operationally.
[10:39:36] as long as we provide some kind of buffer to let the common/short ones drain so the effect isn't huge
[10:40:18] if some service or UA or User can't deal with a statistically-unlikely occasional need to retry or reconnect, the problem lies there, not in the depooling process :P
[10:41:51] ema: defining codfw worked?
[10:42:40] bblack: it worked as far as pcc was concerned, but now puppet fails because the backends used in directors.frontend.vcl are not defined
[10:42:49] hah
[10:42:51] so I guess we need to add cp1008 to the !production hack
[10:43:02] yeah I think so
[10:43:08] let's try
[10:43:20] maybe cp1008 needs its own realm, I'm sure that would cause fun fallout for everyone :)
[10:44:15] really we should also give cp1008 its own IP in the eqiad public text subnet and put it behind LVS so it's not the only .wikimedia.org cache, too
[10:57:06] HTTPS, Traffic, Analytics-Kanban, Operations, Patch-For-Review: Windows 10 & MacOS Sierra Certificate errors due to GlobalSign - https://phabricator.wikimedia.org/T148045#2715888 (Aklapper)
[11:00:58] gaah that wasn't enough we're using dynamic directors, be_cp1052_eqiad_wmnet and friends can't be found
[11:01:08] ok food is needed at this point
[11:01:18] :)
[11:04:28] heh
[11:04:42] yeah there are a lot of assumptions built in about the production clusters
[11:05:31] well, maybe that's just a race condition?
[11:05:40] (with confd)
[11:06:50] oh, right
[11:06:59] confd gets its list directly from etcd
[11:07:28] trying to "fix" cp1008 is going to lead to a lot of pain on that front
[11:07:44] it might be best left alone and revert all this
[11:07:48] all things considered
[11:08:25] the number of hacks and special cases will be ugly, and they'll get in the way of future simplification efforts, too
[12:00:13] Traffic, netops, DNS, Operations, ops-esams: eeden ethernet outage - https://phabricator.wikimedia.org/T146391#2659577 (akosiaris) Noting that there are no errors in TX or RX on the interface of neither the host nor the switch.
[12:28:46] Traffic, Operations: Deploy redundant unified certs - https://phabricator.wikimedia.org/T148131#2715740 (faidon) I was actually thinking the same for keeping both certs live. One way to get around the subtle differences/coalesce issue etc. is to deploy them in different regions. esams/ulsfo could get ven...
[12:42:58] Traffic, Operations: OCSP Stapling for Intermediates - https://phabricator.wikimedia.org/T148134#2715807 (faidon) I was researching this a little bit last night. The tradeoff you mention (inflating response size) is definitely real and it looks like [[ https://www.ietf.org/mail-archive/web/tls/current/ms...
[12:47:54] Traffic, Operations: OCSP Stapling for Intermediates - https://phabricator.wikimedia.org/T148134#2716295 (BBlack)
[12:50:21] Traffic, netops, DNS, Operations, ops-esams: eeden ethernet outage - https://phabricator.wikimedia.org/T146391#2716372 (faidon) This happened twice yesterday, unfortunately during the GlobalSign event. I investigated it both times, but in both times the downtime was brief which limited my tro...
[12:53:17] bblack: perhaps we could try setting dynamic_directors to false in hiera for cp1008 as a last attempt?
[12:54:59] ema: it's worth a shot, but if it works I reserve the right to break it again in the relatively-near future
[12:55:08] deal
[13:04:20] pcc output seems promising :) https://puppet-compiler.wmflabs.org/4370/cp1008.wikimedia.org/
[13:07:41] yup, it worked
[13:10:18] now it would be good to find out why the frontend always misses, my money is on the geoip stuff
[13:11:10] the frontend always misses?
[13:11:16] I've been getting hits
[13:11:30] really?
[13:11:33] cp1054 hit/4, cp1008 hit/12
[13:11:47] you're using https://pinkunicorn.wikimedia.org/ ?
[13:11:51] (or hacking other names to it?)
[13:11:51] yep
[13:11:58] may be your cookies
[13:12:29] oh, I was trying HEAD requests with curl
[13:12:31] the geoip cookie shouldn't have any impact though, I mean the various mediawiki-level session/token cookies
[13:12:38] oh ok
[13:16:25] bblack: I have an openssl 1.1 ready on copper, running some tests will upload to carbon once completed
[13:18:34] Traffic, netops, DNS, Operations, ops-esams: eeden ethernet outage - https://phabricator.wikimedia.org/T146391#2716561 (grin) The time the link went away has there been any VRRP change? (Either .1 didn't get/accept the arp req or havent answered it, or answered it on a different interface, I...
[13:24:52] HTTPS, Traffic, Analytics-Kanban, Operations, Patch-For-Review: Windows 10 & MacOS Sierra Certificate errors due to GlobalSign - https://phabricator.wikimedia.org/T148045#2716594 (BBlack) Resolving this. The mitigation deployed yesterday (alternate intermediate->root chain) seems to have wor...
[13:28:22] HTTPS, Traffic, Analytics-Kanban, Operations, Patch-For-Review: Windows 10 & MacOS Sierra Certificate errors due to GlobalSign - https://phabricator.wikimedia.org/T148045#2716627 (faidon) Open>Resolved a:BBlack
[15:46:45] HTTPS, Traffic, Analytics-Kanban, Operations, and 2 others: Windows 10 & MacOS Sierra Certificate errors due to GlobalSign - https://phabricator.wikimedia.org/T148045#2716979 (Aklapper)
[16:24:16] Traffic, Operations: Deploy redundant unified certs - https://phabricator.wikimedia.org/T148131#2717174 (BBlack) Yeah, regional split might make sense. We probably don't want to mix within the US, where we might see "bouncy" GeoIP resolution. Perhaps one for all US sites and one for all non-US sites (i...
[16:36:57] Traffic, Operations, Wikimedia-Incident: Deploy redundant unified certs - https://phabricator.wikimedia.org/T148131#2717232 (greg)
[16:43:17] Traffic, Operations, Wikimedia-Incident: Deploy redundant unified certs - https://phabricator.wikimedia.org/T148131#2717255 (BBlack)
[17:55:52] HTTPS, Traffic, Operations, Patch-For-Review, Wikimedia-Incident: Make OCSP Stapling support more generic and robust - https://phabricator.wikimedia.org/T93927#2717425 (BBlack) I've re-evaluated nginx's stapling support today. We last evaluated it deeply many versions ago and found that for...
[21:36:08] 23
[23:51:32] Platonides?
[23:54:52] yes?
[23:55:06] tell me, Krenair
[23:59:42] 23
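
Editor's note on the depool discussion (roughly 10:36 to 10:40 above): the idea of confirming a depool, sleeping for a fixed period so short connections can drain, and optionally polling the remaining connection count could be sketched roughly as below. This is only an illustration in Python, not the pooler-loop or systemd::sidekick implementation. Every name in it (`depool_and_drain`, `is_depooled`, `active_connections`, `drain_sleep`, `max_wait`) is hypothetical, and the bounded wait is an assumption added here because the log notes that waiting for 0 connections "sometimes takes forever".

```python
import time
from typing import Callable


def depool_and_drain(
    depool: Callable[[], None],
    is_depooled: Callable[[], bool],
    active_connections: Callable[[], int],
    drain_sleep: float = 5.0,
    max_wait: float = 30.0,
    poll_interval: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> int:
    """Depool, confirm it, then give existing connections a bounded chance to drain.

    All callables are hypothetical hooks supplied by the caller; nothing here
    talks to conftool, pybal or ipvsadm directly.
    Returns the number of connections still active when waiting stops.
    """
    depool()

    # Wait until the load balancer side confirms the depool was picked up.
    waited = 0.0
    while not is_depooled():
        if waited >= max_wait:
            raise TimeoutError("depool was not confirmed in time")
        sleep(poll_interval)
        waited += poll_interval

    # Fixed sleep so the common, short-lived connections can finish.
    sleep(drain_sleep)

    # Bounded poll of the connection count: stop at zero or after max_wait,
    # accepting that a few lingering connections may be interrupted.
    waited = 0.0
    remaining = active_connections()
    while remaining > 0 and waited < max_wait:
        sleep(poll_interval)
        waited += poll_interval
        remaining = active_connections()
    return remaining
```

A caller would pass its own hooks for the three callables and treat a non-zero return value as "lingerers were cut off", which the discussion above considers acceptable operationally.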