[08:31:37] Traffic, Analytics-Kanban, Operations, Patch-For-Review: Verify why varnishkafka stats and webrequest logs count differs - https://phabricator.wikimedia.org/T136314#2340456 (elukey) I discussed with @ema the inconsistency that we are seeing and we came to the conclusion that this change could be...
[10:06:15] HTTPS, Traffic, Operations, Browser-Support-Firefox: Secure connection failed when attempting to send POST request (if connection has been idle for a while; disabling HTTP/2 helps) - https://phabricator.wikimedia.org/T134869#2340659 (Aklapper)
[10:11:25] HTTPS, Traffic, Operations, Browser-Support-Firefox: Secure connection failed when attempting to send POST request using HTTP/2 (if connection has been idle for a certain time) - https://phabricator.wikimedia.org/T134869#2340670 (Danny_B)
[15:06:43] ema, bblack: FYI, there's a new libgd security update. nginx links against libgd, but only as part of the image_filter module, which we don't use. so I'll install these for completeness, but we don't need a restart
[15:10:29] ok
[15:21:46] moritzm: does nginx link against libgd?
[15:22:04] ah, yeah, next time I'll read the whole sentence :)
[15:22:11] thanks!
[15:24:15] ema: re: reboots, I'll try one of the esams today depooled and see if I can get more info and/or confirm whether the new kernel fixes it for future reboots
[15:24:39] (and maybe look at the drac firmware issue, if there is one)
[15:25:04] unless you already are or are planning to soon
[15:25:21] bblack: I'll say two words only: latex beamer
[15:25:27] so yeah, please go ahead :)
[15:25:31] ok :)
[15:26:46] I've left T131961 open waiting for the 4.4 reboots BTW
[15:26:46] T131961: Boot time race condition when assembling root raid device on cp1052 - https://phabricator.wikimedia.org/T131961
[15:26:57] if machines come up fine wrt raid we can close it
[15:29:40] Traffic, Labs, Labs-Infrastructure, Operations: Move californium to an internal host? - https://phabricator.wikimedia.org/T133149#2341254 (chasemp) p:Triage>Normal
[15:30:12] ema: ok
[15:30:41] I think my next meeting and the frontend restarts will finish up around the same time and then I'll start in on the cp30xx mess
[16:10:31] latest 4.4 kernel is linux-image-4.4.0-1-amd64 4.4.2-3+wmf2
[16:10:52] I think we have that installed on them all
[16:10:57] just not rebooted to it yet
[16:11:01] ok, nice
[16:11:32] we're going to get the bulk of the 4.4 reboots done this week I think, but looking at the esams reboot issue first
[16:12:55] two of the cp1 and a few of the cp3 hosts have the +wmf1, I'll update these
[16:13:33] although, these are likely the to-be-decom systems
[16:14:04] indeed they are, so all are up-to-date
[16:18:36] Traffic, DC-Ops, Operations, ops-esams: cp30[34]x hw/firmware/BMC issues - https://phabricator.wikimedia.org/T126062#2341497 (BBlack) Supporting the theory that these need firmware updates.... cp2001 racadm getversion: ``` Bios Version = 1.2.10 iDRAC Version...
[16:27:55] Traffic, Analytics-Kanban, Operations, Patch-For-Review: Verify why varnishkafka stats and webrequest logs count differs - https://phabricator.wikimedia.org/T136314#2341533 (Ottomata) > This is a problem in the way we check data integrity rather than in vk itself, so we should fix our calculation...
[16:29:29] Traffic, Analytics-Kanban, Operations, Patch-For-Review: Verify why varnishkafka stats and webrequest logs count differs - https://phabricator.wikimedia.org/T136314#2341538 (elukey) >>! In T136314#2340456, @elukey wrote: > 1) vk is correctly adding the start timestamp to our logs but this trigger...
[16:33:38] Traffic, DC-Ops, Operations, ops-esams: cp30[34]x hw/firmware/BMC issues - https://phabricator.wikimedia.org/T126062#2341551 (BBlack) Latest on Dell's site seems to be http://www.dell.com/support/home/us/en/19/Drivers/DriversDetails?driverId=5GCHC - going to reconfirm we still have issues, then t...
[16:40:51] Traffic, DC-Ops, Operations, ops-esams: cp30[34]x hw/firmware/BMC issues - https://phabricator.wikimedia.org/T126062#2341578 (BBlack) So... cp3032 rebooted fine via software, after I had done a preemptive `racadm racreset`. Will move on to a few others that were known-problems in the past and se...
[16:45:32] <_joe_> bblack: I would love some feedback on https://gerrit.wikimedia.org/r/291949
[16:45:45] <_joe_> I think it *might* be of interest for you too
[16:45:52] <_joe_> but consider this a WiP
[17:07:11] _joe_: yeah looks interesting :)
[17:14:43] provider => systemd is pretty evil
[17:15:15] there's a "debian" provider that is more specific to debian systems
[17:15:35] also I kinda hate having this (and service_unit) in base really
[17:15:53] can't we just have service::unit or something
[17:16:28] although unit has always been a misnomer, the name is systemd-specific ;)
[17:16:56] linux is systemd-specific
[17:17:01] increasingly, anyhow
[17:18:23] paravoid: also, here's another one to make you twitch: we have ./hieradata and ./conftool-data
[17:18:38] haha
[17:23:37] * bblack advocates for camelCase
[17:29:30] bblack: hi! Has varnishd been restarted during the past hours by any chance?
[17:30:42] yup
[17:31:02] which os the ~202 varnishd are you interested in and why?
[17:31:05] s/os/of/
[17:31:26] https://grafana.wikimedia.org/dashboard/db/varnishkafka?from=now-24h&to=now
[17:31:34] I was watching the sequence numbers :)
[17:31:50] just wanted to double check that vk didn't go on fire
[17:32:20] ah jemalloc changes
[17:33:05] I put in the dashboard all the metrics that we were discussing the other day (txerr, etc..)
[17:38:06] <_joe_> paravoid: feel free to rename both :P
[17:40:42] elukey: do you have any idea if sequence numbers are even supposed to not-skip with v4+varnishkafka?
[17:44:42] bblack: sorry didn't get the "not-skip" part. What do you mean?
[17:47:04] if you mean not having holes it should be true
[18:11:19] (going afk but I'll read later!)
[20:34:30] Traffic, DC-Ops, Operations, ops-esams: cp30[34]x hw/firmware/BMC issues - https://phabricator.wikimedia.org/T126062#2342590 (BBlack) Open>Resolved a:BBlack All of cache_text in esams (8/12 of the nodes considered affected) have rebooted into 4.4.2-3+wmf1 today without issue. It coul...
[21:27:04] Traffic, Commons, MediaWiki-File-management, Multimedia, and 2 others: Image cache issue when 'over-writing' an image on commons - https://phabricator.wikimedia.org/T119038#2342765 (BBlack)
[21:30:04] Traffic, Wikimedia-Apache-configuration, DNS, Operations: Create moon.wikimedia.org and redirect it to https://meta.wikimedia.org/wiki/Wikipedia_to_the_Moon - https://phabricator.wikimedia.org/T136557#2342785 (BBlack) Unclear from the description: Is it intended that moon always redirects to this...
[21:30:24] Traffic, Varnish, Operations: Upgrade all cache clusters to Varnish 4 - https://phabricator.wikimedia.org/T131499#2342795 (BBlack)
[21:30:26] Traffic, Operations, Patch-For-Review: Sort out vcl_deliver vs vcl_synth mess with v4 VCL - https://phabricator.wikimedia.org/T135696#2342793 (BBlack) Open>Resolved a:BBlack
[21:32:26] Traffic, Commons, Operations, media-storage, and 2 others: upload-lb.ulsfo.wikimedia.org still allow access to some deleted files - https://phabricator.wikimedia.org/T133819#2342800 (BBlack)
[21:32:29] Traffic, Operations: Content purges are unreliable - https://phabricator.wikimedia.org/T133821#2342802 (BBlack)
[21:33:27] Traffic, Commons, Operations, media-storage, and 2 others: Deleted files sometimes remain visible to non-privileged users if permanently linked - https://phabricator.wikimedia.org/T109331#2342807 (BBlack)
[21:33:31] Traffic, Operations: Content purges are unreliable - https://phabricator.wikimedia.org/T133821#2245593 (BBlack)
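A minimal sketch of the "no holes" check discussed at 17:40-17:47, assuming the sequence numbers for a single host have already been collected into a list of integers. The function name and the input shape are hypothetical and not taken from varnishkafka itself, and the sketch does not account for counters restarting after a daemon restart.

```python
from typing import Iterable, List, Tuple


def find_sequence_holes(seqs: Iterable[int]) -> List[Tuple[int, int]]:
    """Return inclusive (first_missing, last_missing) ranges.

    Hypothetical helper: assumes all values come from one host and that
    the counter did not reset during the observed window.
    """
    # Sort and de-duplicate so out-of-order or repeated delivery
    # is not reported as a hole.
    ordered = sorted(set(seqs))
    holes: List[Tuple[int, int]] = []
    for prev, cur in zip(ordered, ordered[1:]):
        if cur - prev > 1:
            holes.append((prev + 1, cur - 1))
    return holes


if __name__ == "__main__":
    # Made-up sample values, for illustration only.
    print(find_sequence_holes([1, 2, 3, 5, 9]))
```

An empty result would correspond to "not having holes"; a restart that resets the counter would need separate handling, for example by splitting the input per run before calling the helper.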
[21:37:01] Traffic, Varnish, Operations: Varnish: the lower the Age value, the slower the request - https://phabricator.wikimedia.org/T84980#2342830 (BBlack) Open>Resolved No movement in over a year, and is more an observation than a question.
[21:40:53] HTTPS, Traffic, Wikimedia-Apache-configuration, Operations: HTTP->HTTPS redirects need to unconditional send Vary header - https://phabricator.wikimedia.org/T98990#2342859 (BBlack) Open>declined Varnish is now doing all the redirects directly rather than the applayer.
[21:46:46] HTTPS, Traffic, Operations: Getting ssl_error_inappropriate_fallback_alert very rarely - https://phabricator.wikimedia.org/T108579#2342885 (BBlack) Open>Resolved a:BBlack Assuming not, re-open if so.
[21:50:05] HTTPS, Traffic, Operations: When user is logging out via HTTPS, insecure HTTP cookies keeping logged in state should be cleared as well - https://phabricator.wikimedia.org/T34144#2342891 (BBlack) Open>Resolved a:BBlack Assuming this is no longer an issue, since login via HTTP is impossible.
[21:51:24] Traffic, Operations, Beta-Cluster-reproducible: PHP fatal errors causing Varnish to return 503 - "Junk after gzip data" - https://phabricator.wikimedia.org/T125938#2342896 (BBlack) Is this still reproducible? Did we decide whether varnish or hhvm was at fault?
[21:51:57] Traffic, Operations: 3 Varnish cache_upload servers crashed in a short time window - https://phabricator.wikimedia.org/T125401#2342897 (BBlack) Open>Resolved a:BBlack Haven't seen much of this since, and 4.4.x upgrades are in-progress this week.
[21:53:25] Traffic, Operations: Varnish leaks memory - https://phabricator.wikimedia.org/T122455#2342900 (BBlack) Open>Resolved a:BBlack We've kept TBF reverted ever since. At this point the VCL wouldn't un-revert easily anyways, so we'll look again at TBF or similar post-Varnish4, and we don't have an...
[21:54:23] Traffic, Operations, Puppet: Clean up nginx / nginx::ssl classes and usage - https://phabricator.wikimedia.org/T118078#2342904 (BBlack) Open>Resolved a:BBlack eh, this is a "refactor things better" ticket. We're always doing that and we're never done.
[21:55:11] Traffic, Varnish, Operations: Reintroduce rejection for requests with null user agents - https://phabricator.wikimedia.org/T111140#2342912 (BBlack) Open>declined
[21:56:59] Traffic, MediaWiki-extensions-CentralNotice, Operations, Wikimedia-Fundraising: Provide location, logged-in status and device information in ResourceLoaderContext - https://phabricator.wikimedia.org/T103695#2342913 (BBlack) This ticket is getting stale, is it still relevant and up-to-date with cu...
[21:58:03] Traffic, Operations: Varnish Assert error in VGZ_Ibuf() - https://phabricator.wikimedia.org/T122462#2342915 (BBlack) Open>Resolved a:BBlack It hasn't been a huge issue over the past several months, and everything about this will change with Varnish4 which is in the process of being deployed.
[22:03:03] still 173 open backlog tickets, after running through the ones I could trivially resolve/reject on the spot.
[22:50:27] ema: re 4.4 upgrades: I did cp30[34][0123] and no reboot issues, assuming for now the rest are ok too. I did cp1008 too. The rest all need reboots still if you want to start (even the ones test-upgraded before are on wmf1 not wmf2).
[22:51:32] ema: keep in mind the traffic-pool service still works for reboots. it will auto-depool on shutdown, and if you touch /var/lib/traffic-pool/pool-once before rebooting it will self-repool afterwards. We can probably be fairly aggressive so long as no more than ~2 per cluster per site are down at a time.
[22:52:09] (I was going to start tonight, but I won't be around enough to keep an eye on it)
[22:53:40] (also, after the 4.4 upgrade, we should experiment with getting rid of the vm compaction cron. it's probably not necessary anymore, but need to test that on some upload caches for an extended period to be sure)
[23:07:03] Traffic, Operations: Content purges are unreliable - https://phabricator.wikimedia.org/T133821#2343168 (MZMcBride) Related: * {T56902} * {T130901} * {T135964}
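The reboot constraint from the 22:51:32 message (no more than ~2 hosts per cluster per site down at a time) can be sketched as a small batch planner. This is only an illustration: the function name, the host tuple layout, and the sample host names are hypothetical. Only the pool-once path comes from the log, and the sketch does not touch any host.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

# Path quoted in the log; touching it before a reboot is said to make
# the host repool itself afterwards. Shown for reference only.
POOL_ONCE = "/var/lib/traffic-pool/pool-once"

Host = Tuple[str, str, str]  # (hostname, cluster, site), hypothetical layout


def plan_reboot_batches(hosts: Iterable[Host], max_down: int = 2) -> List[List[str]]:
    """Group hosts into batches with at most max_down per (cluster, site)."""
    groups: Dict[Tuple[str, str], List[str]] = defaultdict(list)
    for name, cluster, site in hosts:
        groups[(cluster, site)].append(name)

    batches: List[List[str]] = []
    round_idx = 0
    while True:
        batch: List[str] = []
        for names in groups.values():
            # Take the next slice of up to max_down hosts from each group.
            start = round_idx * max_down
            batch.extend(names[start:start + max_down])
        if not batch:
            break
        batches.append(batch)
        round_idx += 1
    return batches


if __name__ == "__main__":
    # Made-up inventory, not the real cluster layout.
    sample = [
        ("host-a1", "text", "site-x"),
        ("host-a2", "text", "site-x"),
        ("host-a3", "text", "site-x"),
        ("host-b1", "upload", "site-x"),
    ]
    for i, batch in enumerate(plan_reboot_batches(sample), start=1):
        print(i, batch)
```

Each batch would be rebooted and allowed to repool before the next one starts; how long to wait between batches is left out because the log does not say.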