[11:12:20] <wikibugs>	 Traffic, Operations, PAWS, Pywikibot-Commons, and 2 others: Server error (500) while trying to download files from Commons from PAWS - https://phabricator.wikimedia.org/T178567#3702993 (Chicocvenancio) While @BBlack's response does seem to make sense to me, I am wondering why pywikibot sends thes...
[12:07:48] <wikibugs>	 Traffic, Operations, PAWS, Pywikibot-Commons, and 2 others: Server error (500) while trying to download files from Commons from PAWS - https://phabricator.wikimedia.org/T178567#3703068 (BBlack) Yeah there's a few different layers of issue wrapped up in this `Authorization` mess:  1. Pywikibot pro...
[12:41:45] <bblack>	 in some broader sense, I wonder if we shouldn't be completely replacing all the UA's headers on cacheable requests to backends.
[12:42:38] <bblack>	 if it's an "anonymous" request from a UA for cacheable content, which would normally be serviced by a cache hit if the hit is present in storage.... basically if anything on the backend applayer side is looking at any client-specifics in the headers, that's Wrong anyways.
[12:43:37] <bblack>	 so you could make the argument that varnish's req to the backend should be whitened in those cases (strip out all spurious client-sent headers/cookies/etc, making all varnish->app requests in such cases look truly client-neutral, as if they originated at Varnish, basically)
[12:44:22] <bblack>	 (part of the problem is we don't always know if a URI will end up cacheable or not at initial request time, of course)
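[editor's note] The "whitening" idea above can be illustrated with a minimal Python sketch. It is not Wikimedia's VCL; the allowlist contents and the function name `whiten_request_headers` are assumptions made purely for illustration, since the log does not say which headers a backend would legitimately need.

```python
# Hypothetical allowlist: the log does not specify which headers a
# backend actually needs, so these three names are an assumption.
NEUTRAL_HEADERS = {"host", "accept-encoding", "x-forwarded-proto"}


def whiten_request_headers(headers, allowlist=NEUTRAL_HEADERS):
    """Return a copy of `headers` keeping only allowlisted names.

    Header names are compared case-insensitively; everything else
    (Authorization, Cookie, User-Agent, ...) is dropped so the request
    looks client-neutral to the backend.
    """
    return {
        name: value
        for name, value in headers.items()
        if name.lower() in allowlist
    }


# Example client request carrying client-specific headers.
client_headers = {
    "Host": "commons.wikimedia.org",
    "Authorization": "Bearer abc",
    "Cookie": "session=1",
    "Accept-Encoding": "gzip",
    "User-Agent": "pywikibot",
}
backend_headers = whiten_request_headers(client_headers)
```

As bblack notes, the hard part is not the stripping itself but knowing at request time whether the URI will turn out to be cacheable, which this sketch does not attempt to decide.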
[13:20:03] <wikibugs>	 Traffic, Operations, hardware-requests, ops-ulsfo: Decom cp4009,10,17,18 (4 nodes) - https://phabricator.wikimedia.org/T178801#3703144 (BBlack)
[14:49:30] <wikibugs>	 Traffic, Operations, ops-ulsfo: cp4024 kernel errors - https://phabricator.wikimedia.org/T174891#3703355 (BBlack) Open>Resolved
[14:52:08] <wikibugs>	 Traffic, Operations, PAWS, Pywikibot-Commons, and 2 others: Server error (500) while trying to download files from Commons from PAWS - https://phabricator.wikimedia.org/T178567#3703369 (fgiunchedi) >>! In T178567#3700598, @BBlack wrote: > The original request did have an `Authorization` header fu...
[14:52:24] <wikibugs>	 Traffic, Operations, ops-ulsfo: replace ulsfo aging servers - https://phabricator.wikimedia.org/T164327#3703371 (BBlack)
[14:52:25] <wikibugs>	 Traffic, Operations, Patch-For-Review: setup/install cp402[5-8].ulsfo.wmnet - https://phabricator.wikimedia.org/T172198#3703370 (BBlack) Open>Resolved
[14:53:05] <wikibugs>	 Traffic, Analytics-Kanban, Operations, Patch-For-Review: Invalid "wikimedia" family in unique devices data due to misplaced WMF-Last-Access-Global cookie - https://phabricator.wikimedia.org/T174640#3703376 (BBlack) Open>Resolved I'm assuming there's nothing left to do here, re-open otherw...
[14:53:41] <wikibugs>	 Traffic, Operations, Patch-For-Review: setup/install cp4022 - https://phabricator.wikimedia.org/T171967#3481679 (BBlack) Open>Resolved
[14:53:43] <wikibugs>	 Traffic, Operations, ops-ulsfo: replace ulsfo aging servers - https://phabricator.wikimedia.org/T164327#3229950 (BBlack)
[14:53:59] <wikibugs>	 Traffic, Operations, ops-ulsfo: replace ulsfo aging servers - https://phabricator.wikimedia.org/T164327#3229950 (BBlack)
[14:54:02] <wikibugs>	 Traffic, Operations, Patch-For-Review: setup/install cp402[34] - https://phabricator.wikimedia.org/T171966#3481656 (BBlack) Open>Resolved
[14:59:28] <wikibugs>	 Traffic, netops, Operations: Recurring varnish-be fetch failures in codfw - https://phabricator.wikimedia.org/T170131#3420148 (BBlack) Are these fetch-failure spikes still happening?
[15:00:35] <wikibugs>	 Traffic, netops, Operations: High amount of unexpected ICMP dest unreachable toward esams cache clusters - https://phabricator.wikimedia.org/T167691#3703417 (BBlack) Anything to do here? Should we look further at whether most of these ICMPs seem related to real TCP connections to our services?
[15:01:53] <wikibugs>	 Traffic, Analytics, Operations: Artificial spike in offset of unique devices  from November to February 6th on wikidata - https://phabricator.wikimedia.org/T165560#3270089 (BBlack) Is this something we still need answers for, or have we just moved past it into a new normal?
[15:05:38] <wikibugs>	 Traffic, Operations: Standardize varnish applayer backend definitions - https://phabricator.wikimedia.org/T147844#3703428 (BBlack) Open>Resolved a:BBlack Resolved long ago!
[15:07:03] <wikibugs>	 Traffic, Operations, Patch-For-Review: Planning for phasing out non-Forward-Secret TLS ciphers - https://phabricator.wikimedia.org/T118181#3703435 (BBlack)
[15:07:05] <wikibugs>	 Traffic, MediaWiki-extensions-CentralNotice, Operations: Varnish-triggered CN campaign about browser security - https://phabricator.wikimedia.org/T144194#3703432 (BBlack) Open>Resolved a:BBlack Not much left to discuss in this stale ticket.  We have the information we need, and were able...
[15:07:57] <wikibugs>	 Traffic, Analytics, Operations: A/B Testing solid framework - https://phabricator.wikimedia.org/T135762#3703436 (BBlack) We abandoned the original intent of this ticket, I think?
[15:10:28] <wikibugs>	 Traffic, netops, Operations: Recurring varnish-be fetch failures in codfw - https://phabricator.wikimedia.org/T170131#3703437 (ema) Open>Resolved a:ema [[https://grafana.wikimedia.org/dashboard/db/varnish-failed-fetches?orgId=1&var-datasource=codfw%20prometheus%2Fops&var-cache_type=text&v...
[15:11:24] <wikibugs>	 Traffic, Operations, Patch-For-Review: Content purges are unreliable - https://phabricator.wikimedia.org/T133821#3703456 (BBlack)
[15:11:27] <wikibugs>	 Traffic, Operations: Decrease max object TTL in varnishes - https://phabricator.wikimedia.org/T124954#3703453 (BBlack) Open>Resolved a:BBlack Closing this ticket as it's getting rather long in the tooth.  We did reduce our TTL caps down to 1d across the board at all layers, with up to ~7d kee...
[15:11:43] <wikibugs>	 Traffic, Operations: Age header reset to 0 after 24 hours on varnish frontends - https://phabricator.wikimedia.org/T141373#3703459 (BBlack) Anything left to look at here?
[15:12:25] <wikibugs>	 Traffic, Operations, codfw-rollout: Enable VCL applayer datacenter-switch via confd - https://phabricator.wikimedia.org/T127485#3703467 (BBlack)
[15:12:28] <wikibugs>	 Traffic, Operations, codfw-rollout: Enable VCL source-DC switching via confd - https://phabricator.wikimedia.org/T127482#3703468 (BBlack)
[15:12:31] <wikibugs>	 Traffic, Operations, Patch-For-Review, codfw-rollout: Varnish support for active:active backend services - https://phabricator.wikimedia.org/T134404#3703465 (BBlack) Open>Resolved a:BBlack
[15:16:12] <wikibugs>	 Traffic, Operations, Patch-For-Review: Strong cipher preference ordering for cache terminators - https://phabricator.wikimedia.org/T144626#3703485 (BBlack) Open>Resolved a:BBlack
[15:16:47] <wikibugs>	 HTTPS, Traffic, Operations, WMF-Communications, Security-Other: Server certificate is classified as invalid on government computers - https://phabricator.wikimedia.org/T128182#3703488 (BBlack) stalled>Resolved a:BBlack No updates in 11 months.  I assume this is just the new normal...
[15:17:50] <wikibugs>	 HTTPS, Traffic, Operations: HTTPS RFC5077 session tickets encryption key rollovers - https://phabricator.wikimedia.org/T86671#3703493 (BBlack) We still haven't had time to work on doing this "right".  Most likely the effort is better invested doing similar things on the TLSv1.3 side at this point, ra...
[15:18:03] <wikibugs>	 HTTPS, Traffic, Operations: HTTPS RFC5077 session tickets encryption key rollovers - https://phabricator.wikimedia.org/T86671#3703498 (BBlack)
[15:18:06] <wikibugs>	 Traffic, Operations, Patch-For-Review: Support TLSv1.3 - https://phabricator.wikimedia.org/T170567#3703496 (BBlack)
[15:18:29] <wikibugs>	 Traffic, Operations, Wikimedia-Logstash: Varnish does not vary elasticsearch query by request body - https://phabricator.wikimedia.org/T174960#3578041 (ema) @dbarratt can you please provide some examples, including request/response headers and body, the behavior you're seeing and the one you'd expect...
[15:18:58] <wikibugs>	 HTTPS, Traffic, Operations: Track/notify cert expiries better - https://phabricator.wikimedia.org/T112521#3703501 (BBlack) It seems like we've made a lot of progress on this front since late 2015.  Should we consider this resolved now? @Robh?
[15:19:19] <wikibugs>	 Traffic, Operations: more robust certificate chain creation in puppet - https://phabricator.wikimedia.org/T84543#3703502 (BBlack) Open>Resolved a:BBlack
[15:19:50] <bblack>	 sorry for the spam, trying to at least make a quick pass through all the tickets and find easily-killable ones :)
[15:20:36] <wikibugs>	 HTTPS, Traffic, Operations: Track/notify cert expiries better - https://phabricator.wikimedia.org/T112521#3703517 (RobH) All certificates are now tracked in icinga so I think this can indeed be resolved.  (We've also transitioned over to LE for the bulk of non-wildcards!)
[15:33:46] <wikibugs>	 Traffic, Analytics, Operations: A/B Testing solid framework - https://phabricator.wikimedia.org/T135762#3703590 (Nuria) Ticket can be closed.
[15:40:55] <wikibugs>	 HTTPS, Traffic, Operations: implement Public Key Pinning (HPKP) for Wikimedia domains - https://phabricator.wikimedia.org/T92002#3703623 (BBlack)
[15:41:01] <wikibugs>	 Traffic, Operations, Wikimedia-Incident: Deploy redundant unified certs - https://phabricator.wikimedia.org/T148131#3703620 (BBlack) Open>Resolved a:BBlack https://wikitech.wikimedia.org/wiki/HTTPS/Unified_Certificates
[15:42:30] <wikibugs>	 Traffic, Operations: certspotter: Error retrieving STH from log - https://phabricator.wikimedia.org/T159137#3057797 (BBlack) Was this resolved or are we still getting failures here?
[15:43:34] <wikibugs>	 Traffic, Operations: Build nginx without image filter support - https://phabricator.wikimedia.org/T164456#3703643 (BBlack) This came up again recently.  We really should make the switch to `nginx-light` (carefully, to avoid mass-restart!)
[15:46:42] <wikibugs>	 Domains, HTTPS, Traffic, Operations, Wikimedia-Site-requests: SSL error for https://wikispecies.org/ - https://phabricator.wikimedia.org/T164868#3703654 (BBlack) Open>Resolved a:BBlack The immediate error noted at the start of this ticket is expected.  wikispecies.org is not in ou...
[15:47:15] <wikibugs>	 Traffic, Operations: certspotter: Error retrieving STH from log - https://phabricator.wikimedia.org/T159137#3703659 (faidon) We get occasional rare failures depending on the availability of the CT log servers. I don't see a way around this unless we make our cronjobs quite a bit more sophisticated (e.g....
[15:47:49] <wikibugs>	 Traffic, Operations, Patch-For-Review: Improve OCSP fetching and monitoring strategies - https://phabricator.wikimedia.org/T172116#3703660 (BBlack) Open>Resolved a:BBlack Seems pretty robust as of the changes above.  I don't think it's worth pursuing this further at this time.  We might r...
[15:48:47] <wikibugs>	 Traffic, Operations: Evaluate requesting a rate limit change from Letsencrypt - https://phabricator.wikimedia.org/T176905#3703664 (BBlack) Open>Resolved a:BBlack So far in all the cases I've seen, when we've hit the ratelimit it's been a useful signal to tell us we've got broken software and/...
[15:48:53] <wikibugs>	 Traffic, Operations: Migrate to nginx-light - https://phabricator.wikimedia.org/T164456#3703668 (MoritzMuehlenhoff) a:MoritzMuehlenhoff>None
[15:50:22] <wikibugs>	 HTTPS, Traffic, Operations: Track/notify cert expiries better - https://phabricator.wikimedia.org/T112521#3703670 (BBlack) Open>Resolved a:BBlack
[15:50:34] <wikibugs>	 Traffic, Analytics, Operations: A/B Testing solid framework - https://phabricator.wikimedia.org/T135762#3703672 (BBlack) Open>Resolved a:BBlack
[15:51:23] <wikibugs>	 Traffic, Operations: certspotter: Error retrieving STH from log - https://phabricator.wikimedia.org/T159137#3703677 (BBlack) Open>Resolved a:BBlack Ok I'm gonna say it's not a pressing issue for now then.  To revisit the next time it really bothers us!
[15:58:11] <wikibugs>	 netops, Operations, monitoring, Patch-For-Review, User-Elukey: pmacct should be upgraded to 1.6.2 on Stretch - https://phabricator.wikimedia.org/T173489#3703701 (faidon) pmacct 1.7.0-1 (with GeoIP2 support too!) was uploaded to sid yesterday. This should be as easy as a backport-and-install now.
[16:01:04] <wikibugs>	 netops, Operations, monitoring, Patch-For-Review, User-Elukey: pmacct should be upgraded to 1.6.2 on Stretch - https://phabricator.wikimedia.org/T173489#3703711 (elukey) Could be a good candidate for the Kafka Jumbo cluster! In this case it could use librdkafka 0.11 to negotiate API without c...
[16:03:57] <wikibugs>	 netops, Operations, Patch-For-Review, Performance-Team (Radar), Performance-Team-notice: Merge AS14907 with AS43821 - https://phabricator.wikimedia.org/T167840#3703725 (Multichill) Did AS1126:  maartend@vancis-asd01-r01> show bgp summary | match 14907 80.249.209.176        14907          6...
[16:18:08] <wikibugs>	 Traffic, Operations, ops-ulsfo: lvs4002 power supply failure - https://phabricator.wikimedia.org/T151273#3703753 (BBlack)
[16:19:03] <wikibugs>	 Traffic, Operations, ops-codfw: lvs2002 random shut down - https://phabricator.wikimedia.org/T162099#3703755 (BBlack) Open>Resolved
[16:20:02] <wikibugs>	 Traffic, Operations: cp2017 froze and stopped serving traffic - https://phabricator.wikimedia.org/T159056#3703766 (BBlack) Open>Resolved a:BBlack No recurrence AFAIK, closing.
[16:21:43] <wikibugs>	 Traffic, Operations, ops-ulsfo: replace ulsfo aging servers - https://phabricator.wikimedia.org/T164327#3703779 (BBlack)
[16:21:46] <wikibugs>	 Traffic, Operations, ops-ulsfo: lvs4002 power supply failure - https://phabricator.wikimedia.org/T151273#3703775 (BBlack) Open>Resolved At this point, we'll just do the new 3-server setup on the new lvs400[567] systems in T178436 and ignore this until decom, basically.
[16:23:01] <wikibugs>	 netops, Operations, Patch-For-Review, Performance-Team (Radar), Performance-Team-notice: Merge AS14907 with AS43821 - https://phabricator.wikimedia.org/T167840#3703782 (Multichill) I was updating RIPE db and I noticed some of the records are still lagging. * Old AS record is at https://apps.d...
[16:25:39] <wikibugs>	 Traffic, Operations, ops-eqiad: cp1053 possible hardware issues - https://phabricator.wikimedia.org/T165252#3703786 (BBlack) Apparently this machine is back in service (since when I'm not sure, but it's been a while I think).  It's still showing temp alerts in dmesg....
[16:27:04] <wikibugs>	 Traffic, Operations, ops-eqiad: cp1053 possible hardware issues - https://phabricator.wikimedia.org/T165252#3703789 (BBlack) Interestingly, the IPMI sensors check in icinga is showing this machine as being fine.  I wonder what the discrepancy is between that and the MCEs and dmesg?
[16:32:10] <wikibugs>	 Traffic, Operations, ops-ulsfo: decom cp40(09|1[078]) - https://phabricator.wikimedia.org/T178815#3703799 (RobH)
[16:32:42] <wikibugs>	 Traffic, Operations, ops-ulsfo: decom cp40(09|1[078]) - https://phabricator.wikimedia.org/T178815#3703799 (RobH)
[16:33:49] <wikibugs>	 Traffic, Operations, ops-ulsfo: decom cp40(09|1[078]) - https://phabricator.wikimedia.org/T178815#3703799 (BBlack)
[16:33:52] <wikibugs>	 Traffic, Operations, hardware-requests, ops-ulsfo, Patch-For-Review: Decom cp4009,10,17,18 (4 nodes) - https://phabricator.wikimedia.org/T178801#3703820 (BBlack)
[16:54:52] <bblack>	 ema: so I saw you restarted 4021 for mailbox lag which was a little surprising, and I see 4024 starting to warn now
[16:55:20] <bblack>	 we should look into this, since we were thinking it should be mostly-fixed, and we haven't reverted any workarounds, and it used to be mostly just in eqiad for upload...
[16:55:28] <bblack>	 is something new causing this?
[17:00:26] <ema>	 bblack: mmh no, I don't think we've deployed anything potentially related lately
[17:00:59] <ema>	 this morning's lag on 4021 wasn't causing errors FTR
[17:01:22] <wikibugs>	 HTTPS, Traffic, Operations: Track/notify cert expiries better - https://phabricator.wikimedia.org/T112521#3703905 (Dzahn) There are (separate) Icinga checks for the *.planet.wikimedia.org and the *.wmfusercontent.org cert that recently alerted on upcoming expiry of the main unified cert. They have be...
[17:04:43] <ema>	 bblack: I'll take a closer look tomorrow :)
[17:08:15] <bblack>	 ok
[17:08:36] <bblack>	 I might let this one on 4024 ride a bit and see what happens, maybe it self-recovers
[17:13:09] <wikibugs>	 Traffic, Operations: Evaluate requesting a rate limit change from Letsencrypt - https://phabricator.wikimedia.org/T176905#3703942 (Dzahn) Yep, at the time of writing this ticket we weren't aware that the rate-limiting issue was ultimately caused by a software issue specific to stretch machines (openssl o...
[17:56:11] <wikibugs>	 Traffic, Operations: Collect Google IPs pinging the load balancers - https://phabricator.wikimedia.org/T165651#3272649 (BBlack) I don't think anything has changed since on Google's end.  Do we try harder or just accept it?
[18:08:55] <wikibugs>	 Domains, HTTPS, Traffic, Operations, Wikimedia-Site-requests: SSL error for https://wikispecies.org/ - https://phabricator.wikimedia.org/T164868#3704153 (Framawiki) Resolved>declined (the problem is not resolved, so I change the status of this task)
[18:20:30] <wikibugs>	 Traffic, netops, Operations, Pybal: Frequent RST returned by appservers to LVS hosts - https://phabricator.wikimedia.org/T163674#3704183 (BBlack) Bump, I had to re-dig into this ticket a bit to catch myself up, so, re-summarizing:  1) In `src/http/ngx_http_request.c` at the top of  `ngx_http_ssl_...
[18:25:04] <wikibugs>	 HTTPS, Traffic, Operations, Patch-For-Review: Add CAA records to our domains - https://phabricator.wikimedia.org/T155806#3704189 (BBlack) Open>Resolved a:BBlack
[18:39:32] <wikibugs>	 Traffic, DNS, Operations: Consider DNSSec - https://phabricator.wikimedia.org/T26413#3704222 (BBlack) Open>stalled I've tried in the past to keep myself fairly open to the eventual inevitability of DNSSEC and keep my comments even-handed on the matter.   I was willing to capitulate to mass op...
[20:21:32] <XioNoX>	 bblack: about T167691 (ICMP dest unreachable), my main question is why esams is sending a lot more of them compared to the other sites
[20:21:33] <stashbot>	 T167691: High amount of unexpected ICMP dest unreachable toward esams cache clusters - https://phabricator.wikimedia.org/T167691
[22:07:06] <greg-g>	 bblack: if you're still around: https://phabricator.wikimedia.org/T178841#3704955
[22:19:17] <bblack>	 I am
[22:22:54] <bblack>	 ok so, the current issue is that basically the LetsEncrypt puppetization is self-dependent? it fails and then can't renew itself?
[22:22:57] <bblack>	 anyways, looking
[22:24:17] <bblack>	 Krenair: ping? it looks like it got (auto?)-upgraded to varnish5?
[22:24:35] <Krenair>	 hi
[22:24:54] <Krenair>	 So far I've found that varnish won't start due to a libvmod-vslp incompatibility
[22:25:04] <bblack>	 that's because it's upgraded to varnish5
[22:25:24] <Krenair>	 varnish has upgraded but libvmod-vslp hasn't?
[22:25:27] <bblack>	 (which is do-able, but we haven't done it yet for text/upload in prod, just misc)
[22:25:40] <bblack>	 I don't think the upgrade was intentional, on our end
[22:25:53] <Krenair>	 hm
[22:25:55] <bblack>	 in any case: there's hieradata related to varnish version compatibility
[22:26:12] <bblack>	 now that varnish5 is installed, those nodes need: "profile::cache::base::varnish_version: 5"
[22:26:24] <bblack>	 which will change the puppetization and get rid of the dependency on vslp
[22:26:44] <Krenair>	 let's see if I can remember how to do that without having admin on the labs project
[22:26:51] <bblack>	 (vslp is only for varnish4, it's replaced by "shard" on v5)
[22:27:06] <bblack>	 yeah I think I can do it, I just don't recall offhand where it is in the horizon web UI stuff
[22:27:58] <Krenair>	 relying on puppet for this may be problematic since puppet won't work due to LE failing due to Varnish being down
[22:28:15] <Krenair>	 might have to comment some stuff out
[22:29:17] <bblack>	 we'll see
[22:29:27] <bblack>	 I'm in horizon now digging
[22:29:49] <Krenair>	 I just made a hieradata/labs/deployment-prep/host/deployment-cache-text04.yaml file on the puppetmaster
[22:29:56] <bblack>	 heh
[22:29:57] <Krenair>	 with that line in you gave
[22:30:29] <Krenair>	 I've also commented out the letsencrypt stuff I put in tlsproxy::localssl, so puppet runs
[22:30:39] <wikibugs>	 Traffic, Beta-Cluster-Infrastructure, Operations: Beta cluster is down - https://phabricator.wikimedia.org/T178841#3704999 (hashar) Puppet fails with:       Notice: tlsproxy::localssl instance unified with server name beta.wmflabs.org is the default server.   /usr/local/sbin/acme_tiny.py --account-ke...
[22:30:48] <Krenair>	 that killed the vslp import
[22:30:51] <hasharDinner>	 neat
[22:30:52] <bblack>	 there shouldn't be any need to comment that stuff out
[22:30:54] <hasharDinner>	 Krenair: my summary is https://phabricator.wikimedia.org/T178841#3704999
[22:31:22] <Krenair>	 bblack, well it relies on LE being able to do a challenge, so
[22:31:30] <hasharDinner>	 Krenair: apparently traffic to port 80 is blocked somewhere and the letsencrypt  challenge can not complete as a result
[22:31:57] <hasharDinner>	 puppet saying something like:   couldn't download http://beta.wmflabs.org/.well-known/acme-challenge/XXXXX
[22:31:58] <Krenair>	 nothing seems to be running on port 80
[22:31:59] <bblack>	 Krenair: nothing should be circular-dependent like that, I think if we fix varnish side of things, it will fix itself
[22:32:03] <Krenair>	 hey wait that should be nginx shouldn't it?
[22:32:07] <bblack>	 anyways, one problem at a time!
[22:32:19] <bblack>	 let's fix varnish5 puppetization/install issues first, the rest will probably solve itself
[22:32:43] <Krenair>	 alright now we've got
[22:32:43] <hasharDinner>	 if you remove the letsencrypt part, maybe that will unblock puppet
[22:32:52] <Krenair>	 Message from VCC-compiler:
[22:32:52] <Krenair>	 Incompatible VMOD netmapper
[22:32:54] <Krenair>	 	File name: /usr/lib/x86_64-linux-gnu/varnish/vmods/libvmod_netmapper.so
[22:32:54] <Krenair>	 	VMOD version 3.2
[22:32:54] <Krenair>	 	varnishd version 6.0
[22:33:01] <Krenair>	 hasharDinner, yeah I did and it partially unblocked some of it, not all
[22:33:08] <wikibugs>	 Traffic, Operations, Wikimedia-Logstash: Varnish does not vary elasticsearch query by request body - https://phabricator.wikimedia.org/T174960#3705007 (dbarratt) >>! In T174960#3703499, @ema wrote: > @dbarratt can you please provide some examples, including request/response headers and body, the beha...
[22:33:22] <bblack>	 Krenair: yeah I just installed it, re-running puppet agent now
[22:33:35] <bblack>	 varnish package was upgraded, but not libvmod-netmapper...
[22:33:37] <Krenair>	 that's something puppet would take care of if it weren't broken right?
[22:33:38] <hasharDinner>	 and varnishd is version 5 but I guess with obsolete/stale VCL
[22:33:57] <bblack>	 no, varnish version upgrades are "special", there's manual steps in doing them, and hieradata changes, etc
[22:34:09] <bblack>	 I'm not sure why varnish was partially upgraded, that's the real question
[22:34:16] <bblack>	 in any case, puppet runs clean now
[22:34:19] <bblack>	 (on text)
[22:34:22] <hasharDinner>	 we might have unattended upgrade
[22:34:25] <Krenair>	 I grepped /var/log/apt/history.log
[22:34:26] <Krenair>	 Commandline: /usr/bin/apt-get -q -y -o DPkg::Options::=--force-confold install varnish
[22:34:26] <Krenair>	 Upgrade: varnish:amd64 (4.1.8-1wm1, 5.1.3-1wm1), varnish-dbg:amd64 (4.1.8-1wm1, 5.1.3-1wm1)
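[editor's note] The `Upgrade:` line Krenair pasted from /var/log/apt/history.log can be checked mechanically for major-version jumps like this one. The following Python sketch is illustrative only: the function names are hypothetical, and the regex assumes the `package:arch (old, new)` layout seen in the pasted line.

```python
import re

# Matches one "package:arch (old-version, new-version)" entry, as seen
# in the "Upgrade:" line quoted above.
UPGRADE_ENTRY = re.compile(
    r"(?P<pkg>[^\s:,]+):(?P<arch>[^\s(]+) \((?P<old>[^,]+), (?P<new>[^)]+)\)"
)


def parse_upgrades(line):
    """Return (package, old_version, new_version) tuples from an Upgrade: line."""
    prefix = "Upgrade: "
    if not line.startswith(prefix):
        return []
    return [
        (m["pkg"], m["old"], m["new"])
        for m in UPGRADE_ENTRY.finditer(line[len(prefix):])
    ]


def major_version_bumps(upgrades):
    """Keep only upgrades whose leading version component changed."""
    return [
        (pkg, old, new)
        for pkg, old, new in upgrades
        if old.split(".")[0] != new.split(".")[0]
    ]


line = (
    "Upgrade: varnish:amd64 (4.1.8-1wm1, 5.1.3-1wm1), "
    "varnish-dbg:amd64 (4.1.8-1wm1, 5.1.3-1wm1)"
)
upgrades = parse_upgrades(line)
bumps = major_version_bumps(upgrades)
```

Here both `varnish` and `varnish-dbg` are flagged as 4.x to 5.x jumps, while the VMOD packages mentioned in the conversation do not appear in the line at all, which matches the partial upgrade described above.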
[22:34:49] <bblack>	 our v5 packages are in the experimental repo, too
[22:34:55] <Krenair>	 looks like varnish is up now
[22:34:59] <bblack>	 yes, it is
[22:35:13] <Krenair>	 I think it's working
[22:35:23] <bblack>	 can you re-enable the LE stuff?
[22:35:26] <Krenair>	 yeah
[22:35:44] <Krenair>	 running puppet with LE now
[22:36:05] <bblack>	 so basically, there's two things wrong here in the net:
[22:36:05] <Krenair>	 FWIW the stuff I commented was the letsencrypt::cert::integrated block in modules/tlsproxy/manifests/localssl.pp
[22:36:13] <bblack>	 1) Random Varnish v5 package installs
[22:36:26] <bblack>	 2) Hieradata in Horizon is not being kept up to date with prod in general
[22:36:41] <Krenair>	 okay now puppet fails due to nginx reload failure
[22:37:00] <bblack>	 looking...
[22:37:04] <wikibugs>	 Traffic, Beta-Cluster-Infrastructure, Operations: Beta cluster is down - https://phabricator.wikimedia.org/T178841#3705021 (hashar) @Krenair / @BBlack are looking into it. They both know about Letsencrypt/Varnish.
[22:37:23] <Krenair>	 krenair@deployment-cache-text04:~$ sudo /usr/sbin/nginx -g 'daemon on; master_process on;' -s reload
[22:37:23] <Krenair>	 nginx: [emerg] unknown directive "lua_shared_dict" in /etc/nginx/sites-enabled/tlsproxy-prometheus:3
[22:37:35] <bblack>	 yeah that's a package problem again
[22:37:38] <Krenair>	 where have I seen this before
[22:37:43] <hasharDinner>	 ah I got that one as well and just deleted the  /etc/nginx/sites-enabled/tlsproxy-prometheus file :D
[22:37:52] <bblack>	 or, install the lua mod :P
[22:37:58] <Krenair>	 you just deleted the file hasharDinner?
[22:38:04] <wikibugs>	 Traffic, Beta-Cluster-Infrastructure, Operations: Beta cluster is down - https://phabricator.wikimedia.org/T178841#3705026 (Paladox) If it uses ferm, it will not take notice of security group settings. I found that out with jenkins-slave-01 and other instances.  try this  sudo iptables -A INPUT -p tc...
[22:38:09] <Krenair>	 ok
[22:38:18] <hasharDinner>	 but puppet probably added it back?
[22:38:20] <Krenair>	 I guess you could theoretically delete the file, make nginx work, then run puppet
[22:38:31] <bblack>	 apt-get install libnginx-mod-http-lua libnginx-mod-http-ndk
[22:38:38] <bblack>	 ^ that's what makes it work
[22:38:49] <hasharDinner>	 my idea was to try to get http://beta.wmflabs.org/ to serve the LE challenge.  But I never got port 80 reachable from outside :(
[22:39:08] <bblack>	 anyways, that stuff is fixed now
[22:39:29] <bblack>	 well, almost.  upgrade restarts failed due to bad config...
[22:40:22] <Krenair>	 it looks like those packages have been installed, but nginx still fails to start
[22:41:33] <bblack>	 the install didn't finish because of the broken config, technically
[22:41:43] <bblack>	 what precipitated all this carnage?
[22:43:15] <Krenair>	 so
[22:43:24] <Krenair>	 apt-get won't install the module
[22:43:28] <Krenair>	 because nginx fails to start
[22:43:33] <Krenair>	 because the module isn't installed?
[22:43:36] <Krenair>	 have I got this right?
[22:44:57] <wikibugs>	 Traffic, Beta-Cluster-Infrastructure, Operations: Beta cluster is down - https://phabricator.wikimedia.org/T178841#3704733 (Dzahn) > sudo iptables -A INPUT -p tcp --dport 80 -j ACCEPT  please don't. it will just conflict / be reverted by ferm or ferm service will be stopped leading to more manual thi...
[22:46:29] <Krenair>	 So I removed the file and made apt-get finish installing the package (which it did), but nginx still won't reload
[22:46:38] <Krenair>	 (after returning the file)
[22:48:44] <hasharDinner>	 Krenair: you solved it previously apparently: https://phabricator.wikimedia.org/T174746
[22:49:06] <Krenair>	 hey see I thought I recognised this problem from somewhere
[22:49:15] <hasharDinner>	 and some magic https://gerrit.wikimedia.org/r/#/c/375772/5/modules/thumbor/templates/nginx.conf.erb
[22:49:24] <bblack>	 Start-Date: 2017-10-04  18:05:41
[22:49:24] <bblack>	 Commandline: /usr/bin/apt-get -q -y -o DPkg::Options::=--force-confold install libvarnishapi1
[22:49:29] <Krenair>	 not even that long ago
[22:49:40] <hasharDinner>	 so probably the same fix would apply
[22:49:40] <bblack>	 ^ so I guess the upgrade was intentional, ~19 days ago.  has this been broken since then?
[22:51:59] <greg-g>	 13, I believe
[22:53:01] <bblack>	 I don't see messages in syslog about the vslp issue though, until today
[22:53:27] <Krenair>	 I guess something caused varnish to restart today?
[22:53:28] <bblack>	 maybe the package got upgraded (but without other accompanying changes) back on 10-04, but today was the first restart since the package install?
[22:53:41] <bblack>	 (although I could've sworn the package install also restarts the daemon)
[22:53:41] <hasharDinner>	 most probably yes
[22:54:02] <hasharDinner>	 and maybe puppet kindly upgraded it
[22:54:06] <bblack>	 or, the hieradata was fixed back when the package was upgraded, but something got changed/reverted with the hieradata for deployment-prep?
[22:54:18] <bblack>	 puppet doesn't upgrade varnish on its own
[22:54:29] <bblack>	 (or handle the fallout of a partial upgrade process)
[22:55:03] <wikibugs>	 Traffic, Operations, Wikimedia-Logstash: Varnish does not vary elasticsearch query by request body - https://phabricator.wikimedia.org/T174960#3578041 (EBernhardson) I suppose i can add that the reason it has to be GET, rather than POST, is because the kibana application that receives these requests...
[22:58:30] <wikibugs>	 Traffic, Operations, Wikimedia-Logstash: Varnish does not vary elasticsearch query by request body - https://phabricator.wikimedia.org/T174960#3578041 (BBlack) I doubt Varnish in default config does anything about GET request bodies, they're a fairly non-standard thing.  I think our current versions...
[22:58:58] <Krenair>	 why is it nginx remains completely unable to handle that lua_shared_dict line?
[22:59:24] <wikibugs>	 Traffic, Operations, Wikimedia-Logstash: Varnish does not vary elasticsearch query by request body - https://phabricator.wikimedia.org/T174960#3705072 (EBernhardson) Actually on closer review, kibana is allowing some POST requests, but not your _search endpoint:  https://github.com/elastic/kibana/blo...
[23:00:15] <Krenair>	 oh
[23:00:16] <Krenair>	 hieradata/role/common/cache/text.yaml:cache::lua_support: true
[23:00:19] <Krenair>	 that won't apply
[23:00:48] <hasharDinner>	 grhgh
[23:00:57] <hasharDinner>	 Krenair: yeah the hieradata/role are not applied on labs :(
[23:01:05] <bblack>	 heh
[23:01:06] <hasharDinner>	 gotta copy pasta to horizon 
[23:01:14] <bblack>	 more hieradata disconvergence!
[23:01:22] <Krenair>	 +load_module modules/ndk_http_module.so;
[23:01:22] <Krenair>	 +load_module modules/ngx_http_lua_module.so;
[23:01:49] <Krenair>	 alright
[23:02:08] <hasharDinner>	 https://phabricator.wikimedia.org/T136080  (closed)  and  https://phabricator.wikimedia.org/T120165  to make labs role aware
[23:02:22] <Krenair>	 nginx now reloads properly
[23:02:43] <Krenair>	 btw, hasharDinner 
[23:02:50] <Krenair>	 krenair@deployment-cache-text04:~$ sudo lsof -i :80
[23:02:53] <Krenair>	 COMMAND     PID    USER   FD   TYPE    DEVICE SIZE/OFF NODE NAME
[23:02:53] <Krenair>	 varnishd  25058 varnish    3u  IPv4 519741739      0t0  TCP *:http (LISTEN)
[23:02:53] <Krenair>	 varnishd  25058 varnish    5u  IPv6 519741740      0t0  TCP *:http (LISTEN)
[23:02:53] <Krenair>	 cache-mai 25066  vcache    3u  IPv4 519741739      0t0  TCP *:http (LISTEN)
[23:02:53] <Krenair>	 cache-mai 25066  vcache    5u  IPv6 519741740      0t0  TCP *:http (LISTEN)
[23:03:46] <Krenair>	 I can curl http://beta.wmflabs.org
[23:04:17] <hasharDinner>	 !!!!!!!!!!!!!!!!!!!!!!!!!!
[23:04:33] <Krenair>	 hasharDinner, how was it that you tried to set up port 80 for nginx?
[23:04:40] <hasharDinner>	 I have no idea why with nginx it did not work
[23:05:19] <hasharDinner>	 Krenair: I was trying to unblock LE challenge ( https://phabricator.wikimedia.org/T178841#3704999 ) :D
[23:05:51] <hasharDinner>	 but obviously past 11pm and not knowing anything about LE .. that was prone to failure
[23:06:10] <Krenair>	 yeah the LE challenge failing was a red herring
[23:06:24] <Krenair>	 just a symptom of the problem
[23:06:27] <hasharDinner>	 $ curl https://en.wikipedia.beta.wmflabs.org/
[23:06:27] <hasharDinner>	 curl: (51) SSL: no alternative certificate subject name matches target host name 'en.wikipedia.beta.wmflabs.org'
[23:06:31] <hasharDinner>	 well it is not happy still :)
[23:06:34] <wikibugs>	 10Traffic, 10Beta-Cluster-Infrastructure, 10Operations: Beta cluster is down - https://phabricator.wikimedia.org/T178841#3705082 (10Krenair) Between the three of us it's been brought back up.
[23:06:57] <greg-g>	 thanks Krenair bblack and<tab> hasharDinner 
[23:07:01] <hasharDinner>	 en.wikipedia.beta.wmflabs.org uses an invalid security certificate. The certificate is only valid for beta.wmflabs.org
[23:07:12] <hasharDinner>	 I guess it is missing a few certs?
[23:07:14] <Krenair>	 hmph, wtf
[23:07:30] <Krenair>	 I thought this was working just now
[23:08:18] <Krenair>	 Is it possible I broke the list of certs by creating hieradata/labs/deployment-prep/host/deployment-cache-text04.yaml in puppet?
[23:08:39] <hasharDinner>	 what is the hiera key?
[23:10:11] <hasharDinner>	 I look them up from the puppet master using:  /var/lib/git/operations/puppet/utils/hiera_lookup -v  --fqdn=deployment-cache-text04.deployment-prep.eqiad.wmflabs  <somekey>
[23:10:36] <hasharDinner>	 example: ssh deployment-puppetmaster02.deployment-prep.eqiad.wmflabs  /var/lib/git/operations/puppet/utils/hiera_lookup -v  --fqdn=deployment-cache-text04.deployment-prep.eqiad.wmflabs classes
[23:12:24] <hasharDinner>	 maybe there is a glitch in puppet and it fails to regenerate them all
[23:12:34] <hasharDinner>	 or nginx needs a restart to catch them
[23:12:51] <hasharDinner>	 Krenair: sorry but I am falling asleep. Thank you for the fix up!
[23:12:56] <Krenair>	 ok
[23:13:03] <hasharDinner>	 (and thanks bblack for the assistance! )
[23:13:16] <Krenair>	 It does actually have a broken cert
[23:13:34] <Krenair>	 issued like today
[23:13:44] <bblack>	 ema: re: cp4024, there was a small spike of 503s around 16:15-16:45 or so, but I left it going to see what happened.  the 503s went away but the lag kept spiraling out.  So I restarted the backend eventually (because I've gotta take off for little league stuff and can't stare anymore)
[23:15:10] <bblack>	 ema: it's annoying, there must be some change that led to this new behavior? there's more vhtcpd reconnects, but I think even varnish-fe already gave it plenty of tcp reconnects...
[23:24:18] <Krenair>	 greg-g, okay so now https seems to be back to normal
[23:24:35] <Krenair>	 trick was to set profile::cache::ssl::unified::le_subjects to the big list you see in https://wikitech.wikimedia.org/wiki/Hiera:Deployment-prep/host/deployment-cache-text04
[23:24:46] <Krenair>	 should probably commit hieradata/labs/deployment-prep/host/deployment-cache-text04.yaml instead of leaving it lying on the puppetmaster
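(Editor's note: the curl error 51 above comes from the certificate listing only beta.wmflabs.org as a subject name, so en.wikipedia.beta.wmflabs.org had nothing to match against until le_subjects was set to the full list. Below is a minimal sketch of that matching rule, assuming simplified exact and leftmost-wildcard matching. The function names and the subject lists are hypothetical illustrations, not the real contents of the hiera key or the logic of any particular TLS library.)

```python
from typing import Iterable


def hostname_matches(hostname: str, pattern: str) -> bool:
    """Return True if hostname matches one certificate subject name.

    Simplified rule (an assumption for illustration): names match label by
    label, and a pattern may use "*" only as its entire leftmost label.
    """
    host_labels = hostname.lower().rstrip(".").split(".")
    pat_labels = pattern.lower().rstrip(".").split(".")
    # A wildcard stands in for exactly one label, so label counts must agree.
    if len(host_labels) != len(pat_labels):
        return False
    if pat_labels[0] == "*":
        return host_labels[1:] == pat_labels[1:]
    return host_labels == pat_labels


def covered(hostname: str, subjects: Iterable[str]) -> bool:
    """Return True if any subject name on the certificate covers hostname."""
    return any(hostname_matches(hostname, s) for s in subjects)


# Hypothetical subject lists, standing in for the certificate before and
# after the le_subjects change; the real list lives on the wikitech page.
before = ["beta.wmflabs.org"]
after = ["beta.wmflabs.org", "en.wikipedia.beta.wmflabs.org"]

print(covered("en.wikipedia.beta.wmflabs.org", before))
print(covered("en.wikipedia.beta.wmflabs.org", after))
```

(With the first list the lookup fails the same way curl did; once the hostname is added to the subject list it is covered.)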
[23:25:41] <wikibugs>	 10Traffic, 10Beta-Cluster-Infrastructure, 10Operations: Beta cluster is down - https://phabricator.wikimedia.org/T178841#3705126 (10Krenair) HTTPS should now work again too. Need to commit hieradata/labs/deployment-prep/host/deployment-cache-text04.yaml on the puppetmaster: ```profile::cache::base::varnish_v...