[11:12:20] 10Traffic, 10Operations, 10PAWS, 10Pywikibot-Commons, and 2 others: Server error (500) while trying to download files from Commons from PAWS - https://phabricator.wikimedia.org/T178567#3702993 (10Chicocvenancio) While @BBlack's response does seem to make sense to me, I am wondering why pywikibot sends thes... [12:07:48] 10Traffic, 10Operations, 10PAWS, 10Pywikibot-Commons, and 2 others: Server error (500) while trying to download files from Commons from PAWS - https://phabricator.wikimedia.org/T178567#3703068 (10BBlack) Yeah there's a few different layers of issue wrapped up in this `Authorization` mess: 1. Pywikibot pro... [12:41:45] in some broader sense, I wonder if we shouldn't be completely replacing all the UA's headers on cacheable requests to backends. [12:42:38] if it's an "anonymous" request from a UA for cacheable content, which would normally be servicing by a cache hit if the hit is present in storage.... basically if anything on the backend applayer side is looking at any client-specifics in the headers, that's Wrong anyways. 
[12:43:37] so you could make the argument that varnish's req to the backend should be whitened in those cases (strip out all spurious client-sent headers/cookies/etc, making all varnish->app requests in such cases look truly client-neutral, as if they originated at Varnish, basically) [12:44:22] (part of the problem is we don't always know if a URI will end up cacheable or not at initial request time, of course) [13:20:03] 10Traffic, 10Operations, 10hardware-requests, 10ops-ulsfo: Decom cp4009,10,17,18 (4 nodes) - https://phabricator.wikimedia.org/T178801#3703144 (10BBlack) [14:49:30] 10Traffic, 10Operations, 10ops-ulsfo: cp4024 kernel errors - https://phabricator.wikimedia.org/T174891#3703355 (10BBlack) 05Open>03Resolved [14:52:08] 10Traffic, 10Operations, 10PAWS, 10Pywikibot-Commons, and 2 others: Server error (500) while trying to download files from Commons from PAWS - https://phabricator.wikimedia.org/T178567#3703369 (10fgiunchedi) >>! In T178567#3700598, @BBlack wrote: > The original request did have an `Authorization` header fu... [14:52:24] 10Traffic, 10Operations, 10ops-ulsfo: replace ulsfo aging servers - https://phabricator.wikimedia.org/T164327#3703371 (10BBlack) [14:52:25] 10Traffic, 10Operations, 10Patch-For-Review: setup/install cp402[5-8].ulsfo.wmnet - https://phabricator.wikimedia.org/T172198#3703370 (10BBlack) 05Open>03Resolved [14:53:05] 10Traffic, 10Analytics-Kanban, 10Operations, 10Patch-For-Review: Invalid "wikimedia" family in unique devices data due to misplaced WMF-Last-Access-Global cookie - https://phabricator.wikimedia.org/T174640#3703376 (10BBlack) 05Open>03Resolved I'm assuming there's nothing left to do here, re-open otherw...
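Editor's note on the header-whitening idea from 12:41-12:44 above: a minimal Python sketch of what "whitening" a cacheable request could look like. This is not VCL and not what the caches actually run. The function name `whiten_request_headers` and the allowlist `NEUTRAL_HEADERS` are hypothetical, chosen only for illustration.

```python
# Hypothetical allowlist: headers a backend might legitimately need in order
# to produce a cacheable response. The real set would depend on the applayer.
NEUTRAL_HEADERS = {"host", "accept-encoding", "x-forwarded-proto"}


def whiten_request_headers(headers, cacheable):
    """Return the headers a cache would forward to the backend.

    For cacheable requests, drop everything not on the allowlist
    (Cookie, Authorization, User-Agent, ...) so the request looks
    client-neutral. Non-cacheable requests pass through unchanged.
    """
    if not cacheable:
        return dict(headers)
    return {
        name: value
        for name, value in headers.items()
        if name.lower() in NEUTRAL_HEADERS
    }


if __name__ == "__main__":
    client_request = {
        "Host": "commons.wikimedia.org",
        "Authorization": "Bearer example-token",
        "Cookie": "session=example",
        "Accept-Encoding": "gzip",
    }
    print(whiten_request_headers(client_request, cacheable=True))
```

The `cacheable` flag is passed in rather than computed because, as noted at 12:44:22, cacheability is not always known at initial request time; deciding it is the hard part and is outside this sketch.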
[14:53:41] 10Traffic, 10Operations, 10Patch-For-Review: setup/install cp4022 - https://phabricator.wikimedia.org/T171967#3481679 (10BBlack) 05Open>03Resolved [14:53:43] 10Traffic, 10Operations, 10ops-ulsfo: replace ulsfo aging servers - https://phabricator.wikimedia.org/T164327#3229950 (10BBlack) [14:53:59] 10Traffic, 10Operations, 10ops-ulsfo: replace ulsfo aging servers - https://phabricator.wikimedia.org/T164327#3229950 (10BBlack) [14:54:02] 10Traffic, 10Operations, 10Patch-For-Review: setup/install cp402[34] - https://phabricator.wikimedia.org/T171966#3481656 (10BBlack) 05Open>03Resolved [14:59:28] 10Traffic, 10netops, 10Operations: Recurring varnish-be fetch failures in codfw - https://phabricator.wikimedia.org/T170131#3420148 (10BBlack) Are these fetch-failure spikes still happening? [15:00:35] 10Traffic, 10netops, 10Operations: High amount of unexpected ICMP dest unreachable toward esams cache clusters - https://phabricator.wikimedia.org/T167691#3703417 (10BBlack) Anything to do here? Should we look further at whether most of these ICMPs seem related to real TCP connections to our services? [15:01:53] 10Traffic, 10Analytics, 10Operations: Artificial spike in offset of unique devices from November to February 6th on wikidata - https://phabricator.wikimedia.org/T165560#3270089 (10BBlack) Is this something we still need answers for, or have we just moved past it into a new normal? [15:05:38] 10Traffic, 10Operations: Standardize varnish applayer backend definitions - https://phabricator.wikimedia.org/T147844#3703428 (10BBlack) 05Open>03Resolved a:03BBlack Resolved long ago! 
[15:07:03] 10Traffic, 10Operations, 10Patch-For-Review: Planning for phasing out non-Forward-Secret TLS ciphers - https://phabricator.wikimedia.org/T118181#3703435 (10BBlack) [15:07:05] 10Traffic, 10MediaWiki-extensions-CentralNotice, 10Operations: Varnish-triggered CN campaign about browser security - https://phabricator.wikimedia.org/T144194#3703432 (10BBlack) 05Open>03Resolved a:03BBlack Not much left to discuss in this stale ticket. We have the information we need, and were able... [15:07:57] 10Traffic, 10Analytics, 10Operations: A/B Testing solid framework - https://phabricator.wikimedia.org/T135762#3703436 (10BBlack) We abandoned the original intent of this ticket, I think? [15:10:28] 10Traffic, 10netops, 10Operations: Recurring varnish-be fetch failures in codfw - https://phabricator.wikimedia.org/T170131#3703437 (10ema) 05Open>03Resolved a:03ema [[https://grafana.wikimedia.org/dashboard/db/varnish-failed-fetches?orgId=1&var-datasource=codfw%20prometheus%2Fops&var-cache_type=text&v... [15:11:24] 10Traffic, 10Operations, 10Patch-For-Review: Content purges are unreliable - https://phabricator.wikimedia.org/T133821#3703456 (10BBlack) [15:11:27] 10Traffic, 10Operations: Decrease max object TTL in varnishes - https://phabricator.wikimedia.org/T124954#3703453 (10BBlack) 05Open>03Resolved a:03BBlack Closing this ticket as it's getting rather long in the tooth. We did reduce our TTL caps down to 1d across the board at all layers, with up to ~7d kee... [15:11:43] 10Traffic, 10Operations: Age header reset to 0 after 24 hours on varnish frontends - https://phabricator.wikimedia.org/T141373#3703459 (10BBlack) Anything left to look at here? 
[15:12:25] 10Traffic, 10Operations, 10codfw-rollout: Enable VCL applayer datacenter-switch via confd - https://phabricator.wikimedia.org/T127485#3703467 (10BBlack) [15:12:28] 10Traffic, 10Operations, 10codfw-rollout: Enable VCL source-DC switching via confd - https://phabricator.wikimedia.org/T127482#3703468 (10BBlack) [15:12:31] 10Traffic, 10Operations, 10Patch-For-Review, 10codfw-rollout: Varnish support for active:active backend services - https://phabricator.wikimedia.org/T134404#3703465 (10BBlack) 05Open>03Resolved a:03BBlack [15:16:12] 10Traffic, 10Operations, 10Patch-For-Review: Strong cipher preference ordering for cache terminators - https://phabricator.wikimedia.org/T144626#3703485 (10BBlack) 05Open>03Resolved a:03BBlack [15:16:47] 10HTTPS, 10Traffic, 10Operations, 10WMF-Communications, 10Security-Other: Server certificate is classified as invalid on government computers - https://phabricator.wikimedia.org/T128182#3703488 (10BBlack) 05stalled>03Resolved a:03BBlack No updates in 11 months. I assume this is just the new normal... [15:17:50] 10HTTPS, 10Traffic, 10Operations: HTTPS RFC5077 session tickets encryption key rollovers - https://phabricator.wikimedia.org/T86671#3703493 (10BBlack) We still haven't had time to work on doing this "right". Most likely the effort is better invested doing similar things on the TLSv1.3 side at this point, ra... [15:18:03] 10HTTPS, 10Traffic, 10Operations: HTTPS RFC5077 session tickets encryption key rollovers - https://phabricator.wikimedia.org/T86671#3703498 (10BBlack) [15:18:06] 10Traffic, 10Operations, 10Patch-For-Review: Support TLSv1.3 - https://phabricator.wikimedia.org/T170567#3703496 (10BBlack) [15:18:29] 10Traffic, 10Operations, 10Wikimedia-Logstash: Varnish does not vary elasticsearch query by request body - https://phabricator.wikimedia.org/T174960#3578041 (10ema) @dbarratt can you please provide some examples, including request/response headers and body, the behavior you're seeing and the one you'd expect... 
[15:18:58] 10HTTPS, 10Traffic, 10Operations: Track/notify cert expiries better - https://phabricator.wikimedia.org/T112521#3703501 (10BBlack) It seems like we've made a lot of progress on this front since late 2015. Should we consider this resolved now? @Robh? [15:19:19] 10Traffic, 10Operations: more robust certificate chain creation in puppet - https://phabricator.wikimedia.org/T84543#3703502 (10BBlack) 05Open>03Resolved a:03BBlack [15:19:50] sorry for the spam, trying to at least make a quick pass through all the tickets and find easily-killable ones :) [15:20:36] 10HTTPS, 10Traffic, 10Operations: Track/notify cert expiries better - https://phabricator.wikimedia.org/T112521#3703517 (10RobH) All certificates are now tracked in icinga so I think this can indeed be resolved. (We've also transitioned over to LE for the bulk of non-wildcards!) [15:33:46] 10Traffic, 10Analytics, 10Operations: A/B Testing solid framework - https://phabricator.wikimedia.org/T135762#3703590 (10Nuria) Ticket can be closed. [15:40:55] 10HTTPS, 10Traffic, 10Operations: implement Public Key Pinning (HPKP) for Wikimedia domains - https://phabricator.wikimedia.org/T92002#3703623 (10BBlack) [15:41:01] 10Traffic, 10Operations, 10Wikimedia-Incident: Deploy redundant unified certs - https://phabricator.wikimedia.org/T148131#3703620 (10BBlack) 05Open>03Resolved a:03BBlack https://wikitech.wikimedia.org/wiki/HTTPS/Unified_Certificates [15:42:30] 10Traffic, 10Operations: certspotter: Error retrieving STH from log - https://phabricator.wikimedia.org/T159137#3057797 (10BBlack) Was this resolved or are we still getting failures here? [15:43:34] 10Traffic, 10Operations: Build nginx without image filter support - https://phabricator.wikimedia.org/T164456#3703643 (10BBlack) This came up again recently. We really should make the switch to `nginx-light` (carefully, to avoid mass-restart!) 
[15:46:42] 10Domains, 10HTTPS, 10Traffic, 10Operations, 10Wikimedia-Site-requests: SSL error for https://wikispecies.org/ - https://phabricator.wikimedia.org/T164868#3703654 (10BBlack) 05Open>03Resolved a:03BBlack The immediate error noted at the start of this ticket is expected. wikispecies.org is not in ou... [15:47:15] 10Traffic, 10Operations: certspotter: Error retrieving STH from log - https://phabricator.wikimedia.org/T159137#3703659 (10faidon) We get occasional rare failures depending on the availability of the CT log servers. I don't see a way around this unless we make our cronjobs quite a bit more sophisticated (e.g.... [15:47:49] 10Traffic, 10Operations, 10Patch-For-Review: Improve OCSP fetching and monitoring strategies - https://phabricator.wikimedia.org/T172116#3703660 (10BBlack) 05Open>03Resolved a:03BBlack Seems pretty robust as of the changes above. I don't think it's worth pursuing this further at this time. We might r... [15:48:47] 10Traffic, 10Operations: Evaluate requesting a rate limit change from Letsencrypt - https://phabricator.wikimedia.org/T176905#3703664 (10BBlack) 05Open>03Resolved a:03BBlack So far in all the cases I've seen, when we've hit the ratelimit it's been a useful signal to tell us we've got broken software and/... 
[15:48:53] 10Traffic, 10Operations: Migrate to nginx-light - https://phabricator.wikimedia.org/T164456#3703668 (10MoritzMuehlenhoff) a:05MoritzMuehlenhoff>03None [15:50:22] 10HTTPS, 10Traffic, 10Operations: Track/notify cert expiries better - https://phabricator.wikimedia.org/T112521#3703670 (10BBlack) 05Open>03Resolved a:03BBlack [15:50:34] 10Traffic, 10Analytics, 10Operations: A/B Testing solid framework - https://phabricator.wikimedia.org/T135762#3703672 (10BBlack) 05Open>03Resolved a:03BBlack [15:51:23] 10Traffic, 10Operations: certspotter: Error retrieving STH from log - https://phabricator.wikimedia.org/T159137#3703677 (10BBlack) 05Open>03Resolved a:03BBlack Ok I'm gonna say it's not a pressing issue for now then. To revisit the next time it really bothers us! [15:58:11] 10netops, 10Operations, 10monitoring, 10Patch-For-Review, 10User-Elukey: pmacct should be upgraded to 1.6.2 on Stretch - https://phabricator.wikimedia.org/T173489#3703701 (10faidon) pmacct 1.7.0-1 (with GeoIP2 support too!) was uploaded to sid yesterday. This should be as easy as a backport-and-install now. [16:01:04] 10netops, 10Operations, 10monitoring, 10Patch-For-Review, 10User-Elukey: pmacct should be upgraded to 1.6.2 on Stretch - https://phabricator.wikimedia.org/T173489#3703711 (10elukey) Could be a good candidate for the Kafka Jumbo cluster! In this case it could use librdkafka 0.11 to negotiate API without c... [16:03:57] 10netops, 10Operations, 10Patch-For-Review, 10Performance-Team (Radar), 10Performance-Team-notice: Merge AS14907 with AS43821 - https://phabricator.wikimedia.org/T167840#3703725 (10Multichill) Did AS1126: maartend@vancis-asd01-r01> show bgp summary | match 14907 80.249.209.176 14907 6... 
[16:18:08] 10Traffic, 10Operations, 10ops-ulsfo: lvs4002 power supply failure - https://phabricator.wikimedia.org/T151273#3703753 (10BBlack) [16:19:03] 10Traffic, 10Operations, 10ops-codfw: lvs2002 random shut down - https://phabricator.wikimedia.org/T162099#3703755 (10BBlack) 05Open>03Resolved [16:20:02] 10Traffic, 10Operations: cp2017 froze and stopped serving traffic - https://phabricator.wikimedia.org/T159056#3703766 (10BBlack) 05Open>03Resolved a:03BBlack No recurrence AFAIK, closing. [16:21:43] 10Traffic, 10Operations, 10ops-ulsfo: replace ulsfo aging servers - https://phabricator.wikimedia.org/T164327#3703779 (10BBlack) [16:21:46] 10Traffic, 10Operations, 10ops-ulsfo: lvs4002 power supply failure - https://phabricator.wikimedia.org/T151273#3703775 (10BBlack) 05Open>03Resolved At this point, we'll just do the new 3-server setup on the new lvs400[567] systems in T178436 and ignore this until decom, basically. [16:23:01] 10netops, 10Operations, 10Patch-For-Review, 10Performance-Team (Radar), 10Performance-Team-notice: Merge AS14907 with AS43821 - https://phabricator.wikimedia.org/T167840#3703782 (10Multichill) I was updating RIPE db and I noticed some of the records are still lagging. * Old AS record is at https://apps.d... [16:25:39] 10Traffic, 10Operations, 10ops-eqiad: cp1053 possible hardware issues - https://phabricator.wikimedia.org/T165252#3703786 (10BBlack) Apparently this machine is back in service (since when I'm not sure, but it's been a while I think). It's still showing temp alerts in dmesg.... [16:27:04] 10Traffic, 10Operations, 10ops-eqiad: cp1053 possible hardware issues - https://phabricator.wikimedia.org/T165252#3703789 (10BBlack) Interestingly, the IPMI sensors check in icinga is showing this machine as being fine. I wonder what the discrepancy is between that and the MCEs and dmesg? 
[16:32:10] 10Traffic, 10Operations, 10ops-ulsfo: decom cp40(09|1[078]) - https://phabricator.wikimedia.org/T178815#3703799 (10RobH) [16:32:42] 10Traffic, 10Operations, 10ops-ulsfo: decom cp40(09|1[078]) - https://phabricator.wikimedia.org/T178815#3703799 (10RobH) [16:33:49] 10Traffic, 10Operations, 10ops-ulsfo: decom cp40(09|1[078]) - https://phabricator.wikimedia.org/T178815#3703799 (10BBlack) [16:33:52] 10Traffic, 10Operations, 10hardware-requests, 10ops-ulsfo, 10Patch-For-Review: Decom cp4009,10,17,18 (4 nodes) - https://phabricator.wikimedia.org/T178801#3703820 (10BBlack) [16:54:52] ema: so I saw you restarted 4021 for mailbox lag which was a little surprising, and I see 4024 starting to warn now [16:55:20] we should look into this, since we were thinking it should be mostly-fixed, and we haven't reverted any workarounds, and it used to be mostly just in eqiad for upload... [16:55:28] is something new causing this? [17:00:26] bblack: mmh no, I don't think we've deployed anything potentially related lately [17:00:59] this morning's lag on 4021 wasn't causing errors FTR [17:01:22] 10HTTPS, 10Traffic, 10Operations: Track/notify cert expiries better - https://phabricator.wikimedia.org/T112521#3703905 (10Dzahn) There are (seperate) Icinga checks for the *.planet.wikimedia.org and the *.wmfusercontent.org cert that recently alerted on upcoming expiry of the main unified cert. They have be... [17:04:43] bblack: I'll take a closer look tomorrow :) [17:08:15] ok [17:08:36] I might let this one on 4024 ride a bit and see what happens, maybe it self-recovers [17:13:09] 10Traffic, 10Operations: Evaluate requesting a rate limit change from Letsencrypt - https://phabricator.wikimedia.org/T176905#3703942 (10Dzahn) Yep, at the time of writing this ticket we weren't aware that the rate-limiting issue was ultimately caused by a software issue specific to stretch machines (openssl o... 
[17:56:11] 10Traffic, 10Operations: Collect Google IPs pinging the load balancers - https://phabricator.wikimedia.org/T165651#3272649 (10BBlack) I don't think anything has changed since on Google's end. Do we try harder or just accept it? [18:08:55] 10Domains, 10HTTPS, 10Traffic, 10Operations, 10Wikimedia-Site-requests: SSL error for https://wikispecies.org/ - https://phabricator.wikimedia.org/T164868#3704153 (10Framawiki) 05Resolved>03declined (the problem is not resolved, so I change the status of this task) [18:20:30] 10Traffic, 10netops, 10Operations, 10Pybal: Frequent RST returned by appservers to LVS hosts - https://phabricator.wikimedia.org/T163674#3704183 (10BBlack) Bump, I had to re-dig into this ticket a bit to catch myself up, so, re-summarizing: 1) In `src/http/ngx_http_request.c` at the top of `ngx_http_ssl_... [18:25:04] 10HTTPS, 10Traffic, 10Operations, 10Patch-For-Review: Add CAA records to our domains - https://phabricator.wikimedia.org/T155806#3704189 (10BBlack) 05Open>03Resolved a:03BBlack [18:39:32] 10Traffic, 10DNS, 10Operations: Consider DNSSec - https://phabricator.wikimedia.org/T26413#3704222 (10BBlack) 05Open>03stalled I've tried in the past to keep myself fairly open to the eventual inevitability of DNSSEC and keep my comments even-handed on the matter. I was willing to capitulate to mass op... [20:21:32] bblack: about T167691 (ICMP dest unreachable), my main interogation is why is esams sending a lot more of them compared to the other sites [20:21:33] T167691: High amount of unexpected ICMP dest unreachable toward esams cache clusters - https://phabricator.wikimedia.org/T167691 [22:07:06] bblack: if you're still around: https://phabricator.wikimedia.org/T178841#3704955 [22:19:17] I am [22:22:54] ok so, the current issue is that basically the LetsEncrypt puppetization is self-dependent? it fails and then can't renew itself? [22:22:57] anyways, looking [22:24:17] Krenair: ping? it looks like it got (auto?)-upgraded to varnish5? 
[22:24:35] hi [22:24:54] So far I've found that varnish won't start due to a libvmod-vslp incompatibility [22:25:04] that's because it's upgraded to varnish5 [22:25:24] varnish has upgraded but libvmod-vslp hasn't? [22:25:27] (which is do-able, but we haven't done it yet for text/upload in prod, just misc) [22:25:40] I don't think the upgrade was intentional, on our end [22:25:53] hm [22:25:55] in any case: there's hieradata related to varnish version compatibility [22:26:12] now that varnish5 is installed, those nodes need: "profile::cache::base::varnish_version: 5" [22:26:24] which will change the puppetization and get rid of the dependency on vslp [22:26:44] let's see if I can remember how to do that without having admin on the labs project [22:26:51] (vslp is only for varnish4, it's replaced by "shard" on v5) [22:27:06] yeah I think I can do it, I just don't recall offhand where it is in the horizon web UI stuff [22:27:58] relying on puppet for this may be problematic since puppet won't work due to LE failing due to Varnish being down [22:28:15] might have to comment some stuff out [22:29:17] we'll see [22:29:27] I'm in horizon now digging [22:29:49] I just made a hieradata/labs/deployment-prep/host/deployment-cache-text04.yaml file on the puppetmaster [22:29:56] heh [22:29:57] with that line in you gave [22:30:29] I've also commented out the letsencrypt stuff I put in tlsproxy::localssl, so puppet runs [22:30:39] 10Traffic, 10Beta-Cluster-Infrastructure, 10Operations: Beta cluster is down - https://phabricator.wikimedia.org/T178841#3704999 (10hashar) Puppet fails with: Notice: tlsproxy::localssl instance unified with server name beta.wmflabs.org is the default server. /usr/local/sbin/acme_tiny.py --account-ke... 
[22:30:48] that killed the vslp import [22:30:51] neat [22:30:52] there shouldn't be any need to comment that stuff out [22:30:54] Krenair: my summary is https://phabricator.wikimedia.org/T178841#3704999 [22:31:22] bblack, well it relies on LE being able to do a challenge, so [22:31:30] Krenair: apparently traffic to port 80 is blocked somewhere and the letsencrypt challenge can not complete as a result [22:31:57] puppet saying something like: couldn't download http://beta.wmflabs.org/.well-known/acme-challenge/XXXXX [22:31:58] nothing seems to be running on port 80 [22:31:59] Krenair: nothing should be circular-dependent like that, I think if we fix varnish side of things, it will fix itself [22:32:03] hey wait that should be nginx shouldn't it? [22:32:07] anyways, one problem at a time! [22:32:19] let's fix varnish5 puppetization/install issues first, the rest will probably solve itself [22:32:43] alright now we've got [22:32:43] if you remove the letsencrypt part, maybe that will unblock puppet [22:32:52] Message from VCC-compiler: [22:32:52] Incompatible VMOD netmapper [22:32:54] File name: /usr/lib/x86_64-linux-gnu/varnish/vmods/libvmod_netmapper.so [22:32:54] VMOD version 3.2 [22:32:54] varnishd version 6.0 [22:33:01] hasharDinner, yeah I did and it partially unblocked some of it, not all [22:33:08] 10Traffic, 10Operations, 10Wikimedia-Logstash: Varnish does not vary elasticsearch query by request body - https://phabricator.wikimedia.org/T174960#3705007 (10dbarratt) >>! In T174960#3703499, @ema wrote: > @dbarratt can you please provide some examples, including request/response headers and body, the beha... [22:33:22] Krenair: yeah I just installed it, re-running puppet agent now [22:33:35] varnish package was upgraded, but not libvmod-netmapper... [22:33:37] that's something puppet would take care of if it weren't broken right? 
[22:33:38] and varnishd is version 5 but I guess with obsolete/stale VCL [22:33:57] no, varnish version upgrades are "special", there's manual steps in doing them, and hieradata changes, etc [22:34:09] I'm not sure why varnish was partially upgraded, that's the real question [22:34:16] in any case, puppet runs clean now [22:34:19] (on text) [22:34:22] we might have unattended upgrade [22:34:25] I grepped /var/log/apt/history.log [22:34:26] Commandline: /usr/bin/apt-get -q -y -o DPkg::Options::=--force-confold install varnish [22:34:26] Upgrade: varnish:amd64 (4.1.8-1wm1, 5.1.3-1wm1), varnish-dbg:amd64 (4.1.8-1wm1, 5.1.3-1wm1) [22:34:49] our v5 packages are in the experimental repo, too [22:34:55] looks like varnish is up now [22:34:59] yes, it is [22:35:13] I think it's working [22:35:23] can you re-enable the LE stuff? [22:35:26] yeah [22:35:44] running puppet with LE now [22:36:05] so basically, there's two things wrong here in the net: [22:36:05] FWIW the stuff I commented out was the letsencrypt::cert::integrated block in modules/tlsproxy/manifests/localssl.pp [22:36:13] 1) Random Varnish v5 package installs [22:36:26] 2) Hieradata in Horizon is not being kept up to date with prod in general [22:36:41] okay now puppet fails due to nginx reload failure [22:37:00] looking... [22:37:04] 10Traffic, 10Beta-Cluster-Infrastructure, 10Operations: Beta cluster is down - https://phabricator.wikimedia.org/T178841#3705021 (10hashar) @Krenair / @BBlack are looking into it. They both know about Letsencrypt/Varnish.
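Editor's note on the /var/log/apt/history.log grep at 22:34:25 above: a rough Python sketch of pulling the same information out of the log programmatically. It assumes the usual layout of that file (blank-line-separated stanzas of `Key: value` lines, with `Upgrade:` entries of the form `name:arch (old, new)`). The function name `find_upgrades` is hypothetical, and `SAMPLE` holds only the two lines pasted in the channel, so it has no `Start-Date`.

```python
import re

# One "name:arch (old, new)" entry from an apt "Upgrade:" line.
UPGRADE_ENTRY = re.compile(r"([^\s:,]+):(\S+) \(([^,]+), ([^)]+)\)")

# The two lines pasted in the channel; a real stanza also carries
# Start-Date/End-Date lines, which were not quoted for this upgrade.
SAMPLE = (
    "Commandline: /usr/bin/apt-get -q -y -o DPkg::Options::=--force-confold install varnish\n"
    "Upgrade: varnish:amd64 (4.1.8-1wm1, 5.1.3-1wm1), varnish-dbg:amd64 (4.1.8-1wm1, 5.1.3-1wm1)\n"
)


def find_upgrades(log_text, package):
    """Return (start_date, commandline, old_version, new_version) tuples
    for every stanza that upgraded exactly `package`."""
    results = []
    for stanza in log_text.strip().split("\n\n"):
        fields = {}
        for line in stanza.splitlines():
            key, sep, value = line.partition(": ")
            if sep:
                fields[key] = value
        for name, _arch, old, new in UPGRADE_ENTRY.findall(fields.get("Upgrade", "")):
            if name == package:
                results.append(
                    (fields.get("Start-Date"), fields.get("Commandline"), old, new)
                )
    return results


if __name__ == "__main__":
    for row in find_upgrades(SAMPLE, "varnish"):
        print(row)
```

Matching on the exact package name keeps `varnish-dbg` out of a query for `varnish`. That matters here because the question was which packages moved to v5 and which (the vmods) did not.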
[22:37:23] krenair@deployment-cache-text04:~$ sudo /usr/sbin/nginx -g 'daemon on; master_process on;' -s reload [22:37:23] nginx: [emerg] unknown directive "lua_shared_dict" in /etc/nginx/sites-enabled/tlsproxy-prometheus:3 [22:37:35] yeah that's a package problem again [22:37:38] where have I seen this before [22:37:43] ah I got that one as well and just deleted the /etc/nginx/sites-enabled/tlsproxy-prometheus file :D [22:37:52] or, install the lua vmod :P [22:37:56] s/vmod/mod/ [22:37:58] you just deleted the file hasharDinner? [22:38:04] 10Traffic, 10Beta-Cluster-Infrastructure, 10Operations: Beta cluster is down - https://phabricator.wikimedia.org/T178841#3705026 (10Paladox) If it uses ferm, it will not take notice of security group settings. I found that out with jenkins-slave-01 and other instances. try this sudo iptables -A INPUT -p tc... [22:38:09] ok [22:38:18] but puppet probably added it back? [22:38:20] I guess you could theoretically delete the file, make nginx work, then run puppet [22:38:31] apt-get install libnginx-mod-http-lua libnginx-mod-http-ndk [22:38:38] ^ that's what makes it work [22:38:49] my idea was to try to get http://beta.wmflabs.org/ to serve the LE challenge. But I never got port 80 reacheable from outside :( [22:39:08] anyways, that stuff is fixed now [22:39:29] well, almost. upgrade restarts failed due to bad config... [22:40:22] it looks like those packages have been installed, but nginx still fails to start [22:41:33] the install didn't finish because of the broken config, technically [22:41:43] what precipitated all this carnage? [22:43:15] so [22:43:24] apt-get won't install the module [22:43:28] because nginx fails to start [22:43:33] because the module isn't installed? [22:43:36] have I got this right? [22:44:57] 10Traffic, 10Beta-Cluster-Infrastructure, 10Operations: Beta cluster is down - https://phabricator.wikimedia.org/T178841#3704733 (10Dzahn) > sudo iptables -A INPUT -p tcp --dport 80 -j ACCEPT please don't. 
it will just conflict / be reverted by ferm or ferm service will be stopped leading to more manual thi... [22:46:29] So I removed the file and made apt-get finish installing the package (which it did), but nginx still won't reload [22:46:38] (after returning the file) [22:48:44] Krenair: you solved it previously apparently: https://phabricator.wikimedia.org/T174746 [22:49:06] hey see I thought I recognised this problem from somewhere [22:49:15] and some magic https://gerrit.wikimedia.org/r/#/c/375772/5/modules/thumbor/templates/nginx.conf.erb [22:49:24] Start-Date: 2017-10-04 18:05:41 [22:49:24] Commandline: /usr/bin/apt-get -q -y -o DPkg::Options::=--force-confold install libvarnishapi1 [22:49:29] not even that long ago [22:49:40] so probably the same fix would apply [22:49:40] ^ so I guess the upgrade was intentional, ~19 days ago. has this been broken since then? [22:51:59] 13, I believe [22:53:01] I don't see messages in syslog about the vslp issue though, until today [22:53:27] I guess something caused varnish to restart today? [22:53:28] maybe the package got upgraded (but without other accompanying changes) back on 10-04, but today was the first restart since the package install? [22:53:41] (although I could've sworn the package install also restarts the daemon) [22:53:41] most probably yes [22:54:02] and maybe puppet kindly upgraded it [22:54:06] or, the hieradata was fixed back when the package was upgraded, but something got changed/reverted with the hieradata for deployment-prep? [22:54:18] puppet doesn't upgrade varnish on its own [22:54:29] (or handle the fallout of a partial upgrade process) [22:55:03] 10Traffic, 10Operations, 10Wikimedia-Logstash: Varnish does not vary elasticsearch query by request body - https://phabricator.wikimedia.org/T174960#3578041 (10EBernhardson) I suppose i can add that the reason it has to be GET, rather than POST, is because the kibana application that receives these requests... 
[22:58:30] 10Traffic, 10Operations, 10Wikimedia-Logstash: Varnish does not vary elasticsearch query by request body - https://phabricator.wikimedia.org/T174960#3578041 (10BBlack) I doubt Varnish in default config does anything about GET request bodies, they're a fairly non-standard thing. I think our current versions... [22:58:58] why is it nginx remains completely unable to handle that lua_shared_dict line? [22:59:24] 10Traffic, 10Operations, 10Wikimedia-Logstash: Varnish does not vary elasticsearch query by request body - https://phabricator.wikimedia.org/T174960#3705072 (10EBernhardson) Actually on closer review, kibana is allowing some POST requests, but not your _search endpoint: https://github.com/elastic/kibana/blo... [23:00:15] oh [23:00:16] hieradata/role/common/cache/text.yaml:cache::lua_support: true [23:00:19] that won't apply [23:00:48] grhgh [23:00:57] Krenair: yeah the hieradata/role are not applied on labs :( [23:01:05] heh [23:01:06] gotta copy pasta to horizon [23:01:14] more hieradata disconvergence! [23:01:22] +load_module modules/ndk_http_module.so; [23:01:22] +load_module modules/ngx_http_lua_module.so; [23:01:49] alright [23:02:08] https://phabricator.wikimedia.org/T136080 (closed) and https://phabricator.wikimedia.org/T120165 to make labs role aware [23:02:22] nginx now reloads properly [23:02:43] btw, hasharDinner [23:02:50] krenair@deployment-cache-text04:~$ sudo lsof -i :80 [23:02:53] COMMAND PID USER FD TYPE DEVICE SIZE/OFF NODE NAME [23:02:53] varnishd 25058 varnish 3u IPv4 519741739 0t0 TCP *:http (LISTEN) [23:02:53] varnishd 25058 varnish 5u IPv6 519741740 0t0 TCP *:http (LISTEN) [23:02:53] cache-mai 25066 vcache 3u IPv4 519741739 0t0 TCP *:http (LISTEN) [23:02:53] cache-mai 25066 vcache 5u IPv6 519741740 0t0 TCP *:http (LISTEN) [23:03:46] I can curl http://beta.wmflabs.org [23:04:17] !!!!!!!!!!!!!!!!!!!!!!!!!! [23:04:33] hasharDinner, how was it that you tried to set up port 80 for nginx? 
[23:04:40] I have no idea why with nginx it did not work [23:05:19] Krenair: I was trying to unblock LE challenge ( https://phabricator.wikimedia.org/T178841#3704999 ) :D [23:05:51] but obviously past 11pm and not knowing anything about LE .. that was prone to failure [23:06:10] yeah the LE challenge failing was a red herring [23:06:24] just a symptom of the problem [23:06:27] $ curl https://en.wikipedia.beta.wmflabs.org/ [23:06:27] curl: (51) SSL: no alternative certificate subject name matches target host name 'en.wikipedia.beta.wmflabs.org' [23:06:31] well it is not happy still :) [23:06:34] 10Traffic, 10Beta-Cluster-Infrastructure, 10Operations: Beta cluster is down - https://phabricator.wikimedia.org/T178841#3705082 (10Krenair) Between the three of us it's been brought back up. [23:06:57] thanks Krenair bblack and hasharDinner [23:07:01] en.wikipedia.beta.wmflabs.org uses an invalid security certificate. The certificate is only valid for beta.wmflabs.org [23:07:12] I guess it is missing a few certs? [23:07:14] hmph, wtf [23:07:30] I thought this was working just now [23:08:18] Is it possible I broke the list of certs by creating hieradata/labs/deployment-prep/host/deployment-cache-text04.yaml in puppet? [23:08:39] what is the hiera key? [23:10:11] I look them up from the puppet master using: /var/lib/git/operations/puppet/utils/hiera_lookup -v --fqdn=deployment-cache-text04.deployment-prep.eqiad.wmflabs [23:10:36] example: ssh deployment-puppetmaster02.deployment-prep.eqiad.wmflabs /var/lib/git/operations/puppet/utils/hiera_lookup -v --fqdn=deployment-cache-text04.deployment-prep.eqiad.wmflabs classes [23:12:24] maybe there is a glitch in puppet and it fails to regenerate them all [23:12:34] or nginx needs a restart to catch them [23:12:51] Krenair: sorry but I am falling asleep. Thank you for the fix up! [23:12:56] ok [23:13:03] (and thanks bblack for the assistance!)
[23:13:16] It does actually have a broken cert [23:13:34] issued like today [23:13:44] ema: re: cp4024, there was a small spike of 503s around 16:15-16:45 or so, but I left it going to see what happened. the 503s went away but the lag kept spiraling out. So I restarted the backend eventually (because I've gotta take off for little league stuff and can't stare anymore) [23:15:10] ema: it's annoying, there must be some change that led to this new behavior? there's more vhtcpd reconnects, but I think even varnish-fe already gave it plenty of tcp reconnects... [23:24:18] greg-g, okay so now https seems to be back to normal [23:24:35] trick was to set profile::cache::ssl::unified::le_subjects to the big list you see in https://wikitech.wikimedia.org/wiki/Hiera:Deployment-prep/host/deployment-cache-text04 [23:24:46] should probably commit hieradata/labs/deployment-prep/host/deployment-cache-text04.yaml instead of leaving it lying on the puppetmaster [23:25:41] 10Traffic, 10Beta-Cluster-Infrastructure, 10Operations: Beta cluster is down - https://phabricator.wikimedia.org/T178841#3705126 (10Krenair) HTTPS should now work again too. Need to commit hieradata/labs/deployment-prep/host/deployment-cache-text04.yaml on the puppetmaster: ```profile::cache::base::varnish_v...
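Editor's note on the certificate mismatch seen at 23:06-23:07 above: a small Python sketch of why a cert valid only for beta.wmflabs.org is rejected for en.wikipedia.beta.wmflabs.org, and why extending the subject list fixes it. The matching is deliberately simplified (exact match, or a leading `*.` standing for exactly one DNS label). The function name `hostname_covered` and the wildcard entry in the example are hypothetical; the real `le_subjects` list is on the wikitech page linked in the log and is not reproduced here.

```python
def hostname_covered(hostname, subjects):
    """True if `hostname` matches any certificate subject name.

    Simplified matching: exact, case-insensitive comparison, plus a
    leading "*." wildcard that stands for exactly one DNS label.
    """
    host = hostname.lower().rstrip(".")
    for subject in subjects:
        name = subject.lower().rstrip(".")
        if name == host:
            return True
        if name.startswith("*."):
            # Drop the leftmost label of the host and compare the rest
            # against the part of the subject after "*.".
            _first_label, _, rest = host.partition(".")
            if rest and rest == name[2:]:
                return True
    return False


if __name__ == "__main__":
    host = "en.wikipedia.beta.wmflabs.org"
    # The cert as described at 23:07:01: only valid for beta.wmflabs.org.
    print(hostname_covered(host, ["beta.wmflabs.org"]))
    # Hypothetical extended list; the wildcard entry is an assumption.
    print(hostname_covered(host, ["beta.wmflabs.org", "*.wikipedia.beta.wmflabs.org"]))
```

A check like this against the configured subject list would surface the missing names before the certificate is reissued, instead of after clients start failing.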