[08:42:45] so the 503 spikes caused by eqiad varnish backends seem to happen more and more frequently... if https://gerrit.wikimedia.org/r/#/c/376751/ doesn't fix the issue, I'm afraid we'd have to go from weekly to daily restarts, at least in eqiad
[08:43:48] in the specific case of text, we don't even get alerted by icinga before fetches start failing
[08:44:53] mbox lag goes from 0 to 200k in a couple of minutes, with fetches already failing, and then goes back to 0 (and fetches succeed again)
[08:45:07] https://grafana.wikimedia.org/dashboard/db/varnish-machine-stats?orgId=1&var-server=cp1067&var-datasource=eqiad%20prometheus%2Fops&from=1505621186913&to=1505629471109
[08:45:51] didn't know about https://gerrit.wikimedia.org/r/#/c/376751, really nice
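(For context: the "mbox lag" discussed here is the backlog between expiry notifications that worker threads mail to the varnish expiry thread and the ones it has actually processed, which can be derived from the MAIN.exp_mailed and MAIN.exp_received counters. Below is a minimal sketch, not the production check, of watching that number from the command line; it assumes varnishstat's JSON output exposes those counters either at the top level, as in Varnish 4/5, or nested under a "counters" key as in newer releases, and that the default varnishd instance is the one of interest.)

    #!/usr/bin/env python3
    # Rough mailbox-lag check: diff the MAIN.exp_mailed and MAIN.exp_received
    # counters from `varnishstat -j`. Handles both the Varnish 4/5 JSON layout
    # (counters at the top level) and newer layouts ("counters" key).
    import json
    import subprocess

    def mailbox_lag(instance=None):
        cmd = ['varnishstat', '-j']
        if instance:
            # optional varnishd instance name; omit to use the default instance
            cmd += ['-n', instance]
        stats = json.loads(subprocess.check_output(cmd))
        counters = stats.get('counters', stats)
        mailed = counters['MAIN.exp_mailed']['value']
        received = counters['MAIN.exp_received']['value']
        return mailed - received

    if __name__ == '__main__':
        print('mailbox lag: %d' % mailbox_lag())

A steadily climbing value in the tens or hundreds of thousands matches the failure pattern described above, where lag shoots from 0 to ~200k while fetches are already failing.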
[09:54:42] bblack: miss2pass patch amended adding backend_warming to vcl_config based on the cache::backend_warming hiera flag
[09:54:51] pcc output seems sane https://puppet-compiler.wmflabs.org/compiler02/7905/
[14:29:06] ema: one thing I noticed about the new cache_text issues is that I think the triggering change was https://gerrit.wikimedia.org/r/#/c/364605/ (we didn't have these before it, but they've started since, and it's reasonable that the change in text's keep times could affect how the lag thing plays out...)
[14:30:59] with upload, the shift from keep=1d to keep=7d was expected to help (but perhaps given all the complexities in this, it doesn't, or it hurts? we might be able to stare at graphs or alert logs and get a sense of whether it got better or worse)
[14:31:39] with text, the keep changed from 7d (capped by current TTL) to 1d (fixed)
[14:32:04] yeah I was thinking of the keep changes too
[14:32:07] which, depending on how things play out, could've gone either way on increasing or decreasing the avg object expiry rate
[14:32:10] the patch was merged on the 6th
[14:32:37] there was a first spike on the 7th (cp1066)
[14:33:35] and then the huge ones affecting cp1053 and friends some days later
[14:33:37] in any case, after staring a little harder today, we can revert that patch and maybe replace it with something better (perhaps keep it effectively reverted on text but just bump upload to 7d? or move both to fixed 7d and see how the 304 issue plays out)
[14:34:47] so the 304 issues on text are due to MW saying not modified when in fact the resource has been modified, right?
[14:35:57] yeah, IIRC it's basically that a plain (unconditional) GET might fetch a newly-edited copy of an article, but conditionals could continue answering 304 Not Modified for a while longer and suppress refreshes from seeing the new copy
[14:36:20] because MW's conditional-request semantics aren't quite right
[14:37:06] there was some deeper back-and-forth on it not too long ago in a related ticket, with max.sem supplying the MW-side info
[14:37:58] err, no, it was Krink.le :)
[14:38:02] https://phabricator.wikimedia.org/T124954#3421257
[14:41:17] ok so actual content changes are reflected by MW's IMS handling, while config changes, skins etc. are not
[14:42:00] I think!
[14:42:24] I also *think* this issue doesn't affect the other, short-lived objects like actual resourceloader components (e.g. skin js/css fetches)
[14:42:59] but the output of a content page contains versioned links to those, which is what gets false-refreshed in a 304 on a skin-only change
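(To make the false-refresh mechanism above concrete, here is a rough outside-in sketch: fetch a page unconditionally, revalidate it with If-Modified-Since the way a cache would, and compare against a second unconditional fetch. If the conditional request keeps answering 304 while the unconditional body has already changed, e.g. because the versioned resourceloader links in the HTML moved on after a skin-only change, that is the pattern being discussed. This is only an illustration, not how anyone diagnosed it; the URL is a placeholder and the `requests` library is assumed.)

    #!/usr/bin/env python3
    # Sketch of probing the 304 false-refresh pattern: revalidate a page with
    # If-Modified-Since the way a cache would, and compare with a fresh
    # unconditional fetch of the same URL.
    import hashlib
    import requests

    URL = 'https://en.wikipedia.org/wiki/Example'  # placeholder URL

    def fingerprint(resp):
        return hashlib.sha1(resp.content).hexdigest()

    first = requests.get(URL)
    last_modified = first.headers.get('Last-Modified')
    if last_modified is None:
        raise SystemExit('no Last-Modified header, nothing to revalidate against')
    original = fingerprint(first)

    # How a cache refreshes an object it already holds: conditional request.
    conditional = requests.get(URL, headers={'If-Modified-Since': last_modified})

    # How a cache fills a miss: unconditional request.
    fresh = requests.get(URL)

    if conditional.status_code == 304 and fingerprint(fresh) != original:
        print('origin said 304, but the unconditional body already differs: false refresh')
    else:
        print('conditional and unconditional fetches agree')

In practice MediaWiki HTML carries some per-request noise (render timestamps, served-by comments and the like), so a real probe would need to normalize the body before comparing; the sketch only shows the shape of the problem.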
[15:16:58] so CPU usage seems to have gone up (at least on cp1053) starting on the 6th
[15:17:03] https://grafana.wikimedia.org/dashboard/db/varnish-machine-stats?orgId=1&var-server=cp1053&var-datasource=eqiad%20prometheus%2Fops&from=now-30d&to=now
[15:19:08] various graphs have changed, "Varnish Backend Storage N large free smf" in particular
[15:19:30] https://grafana.wikimedia.org/dashboard/db/varnish-machine-stats?orgId=1&var-server=cp1053&var-datasource=eqiad%20prometheus%2Fops&from=now-30d&to=now&panelId=36&fullscreen
[15:19:35] ^ that graph is telling, too
[15:20:03] there are some perturbations (possibly most of them from out-of-cycle restarts), but otherwise the general pattern is a huge instant upwards spike of free storage space about once a day
[15:20:27] which probably means the fixed 1d keeps are all hitting together because the TTLs coming from MW work the way they do...
[15:21:22] oh wait, I'm reading that all wrong, as it's not really 1/day, and it goes on before the change, too
[15:21:27] but it still looks creepy
[15:22:28] probably the more important thing to investigate is whether the upload change from 1d to 7d moved us in a good or bad direction in the rate of mailbox lags and/or related alerts/restarts
[15:23:13] anecdotally, seems worse
[15:23:18] yeah
[15:23:20] more of them popping up in sync (like now)
[15:23:44] gonna restart some of those
[15:23:52] ok
[15:25:29] if the pattern fixup you amended helps (it seems like a more solid theory than other things lately), and the revert of the keep change undoes recent damage, we might end up in a decent place
[15:28:04] the revert still rebases cleanly, staged up at https://gerrit.wikimedia.org/r/#/c/378731/
[15:32:12] +1 :)
[15:33:31] 10Traffic, 10Operations, 10monitoring, 10Patch-For-Review: prometheus -> grafana stats for per-numa-node meminfo - https://phabricator.wikimedia.org/T175636#3615011 (10fgiunchedi) @bblack your patch to add `meminfo_numa` seems to be working! Anything left to do?
[15:38:05] 10Traffic, 10Operations, 10monitoring, 10Patch-For-Review: prometheus -> grafana stats for per-numa-node meminfo - https://phabricator.wikimedia.org/T175636#3615020 (10BBlack) yeah, put it somewhere useful in grafana :)
[15:51:48] 10Traffic, 10Operations, 10Wikidata, 10wikiba.se, 10Wikidata-Sprint-2016-11-08: [Task] move wikiba.se webhosting to wikimedia misc-cluster - https://phabricator.wikimedia.org/T99531#3615093 (10Dzahn) a:03Dzahn
[16:07:29] 10Traffic, 10Operations: cp1066 unexplained 503 spikes - https://phabricator.wikimedia.org/T175319#3615165 (10BBlack)
[16:07:33] 10Traffic, 10Operations, 10Patch-For-Review: Text eqiad varnish 503 spikes - https://phabricator.wikimedia.org/T175803#3615163 (10BBlack)
[16:54:30] 10Traffic, 10netops, 10Operations, 10ops-eqiad: Upgrade BIOS/RBSU/etc on lvs1007 - https://phabricator.wikimedia.org/T167299#3615327 (10RobH) a:05Cmjohnson>03RobH Ok, since the firmware updates, the host lvs1007 won't pxe boot. I'll investigate today/tomorrow and try to make it so this host will pxe b...
[17:19:15] 10Traffic, 10netops, 10Operations, 10ops-eqiad: Upgrade BIOS/RBSU/etc on lvs1007 - https://phabricator.wikimedia.org/T167299#3615489 (10RobH) a:05RobH>03Cmjohnson This is an HP gen8, so I cannot actually load the bios remotely and check the PXE settings for the cards. This issue sounds like the NIC c...
[18:34:25] 10Traffic, 10netops, 10Operations, 10ops-eqiad: Upgrade BIOS/RBSU/etc on lvs1007 - https://phabricator.wikimedia.org/T167299#3615765 (10BBlack) I did the NIC card bios check last week when I first found the PXE booting problem. It is enabled there. My guess is either something else in BIOS settings got c...
[20:17:50] 10netops, 10Operations, 10fundraising-tech-ops, 10ops-codfw: connect second ethernet interface for fundraising codfw hosts - https://phabricator.wikimedia.org/T176175#3616118 (10Jgreen)
[21:19:56] 10netops, 10Operations, 10fundraising-tech-ops, 10ops-codfw: connect second ethernet interface for fundraising codfw hosts - https://phabricator.wikimedia.org/T176175#3616118 (10faidon) All of them? Wasn't the plan to only do it for the few hosts that are important SPOFs? Again, I fear that this gives a fa...
[22:25:35] 10netops, 10Operations, 10fundraising-tech-ops, 10ops-codfw: connect second ethernet interface for fundraising codfw hosts - https://phabricator.wikimedia.org/T176175#3616465 (10Jgreen) >>! In T176175#3616318, @faidon wrote: > All of them? Wasn't the plan to only do it for the few hosts that are important...
[22:33:04] 10netops, 10Operations, 10fundraising-tech-ops, 10ops-codfw: connect second ethernet interface for fundraising codfw hosts - https://phabricator.wikimedia.org/T176175#3616476 (10faidon) >>! In T176175#3616465, @Jgreen wrote: >> Again, I fear that this gives a false sense of redundancy > > This does not co...