[07:15:03] 10netops, 10Operations, 10ops-eqsin: cp5010 - no link on primary ethernet port - https://phabricator.wikimedia.org/T187158#3966190 (10BBlack) [08:08:47] morning hashar! [08:10:05] hashar: I'm getting a 403 from Jenkins API when I try to run puppet utils/pcc, what are the permissions I need to ask for to get that fixed? [08:41:23] vgutierrez: hello, depends on the URL? some are blacklisted intentionally (such as the build history) [08:41:48] most would require to be under NDA, and for Jenkins administration it is locked down to just a few people [08:42:02] what is the 403ing URL? [08:45:33] vgutierrez: did you use the --api-token parameter? You can get one in your user panel in the Jenkins UI [08:57:44] elukey: I'm using JENKINS_USERNAME & JENKINS_API_TOKEN env variables [08:59:36] hashar: that's it then, I'm not in the NDA group (yet) [09:02:01] super :) [09:02:27] are we talking about the NDA group in LDAP or Phab? [09:04:56] LDAP. Valentín should be in cn=wmf, not cn=nda (since he's staff), I'm adding him to LDAP now [09:05:17] thx moritzm :) [09:08:06] I've added you to cn=wmf and cn=ops, can you re-try PCC? [09:08:50] 10Traffic, 10Operations, 10Ops-Access-Requests, 10Patch-For-Review: Ops Onboarding for Valentín Gutiérrez - https://phabricator.wikimedia.org/T187035#3962498 (10MoritzMuehlenhoff) Added to cn=ops and cn=wmf LDAP groups. [09:09:02] 10Traffic, 10Operations, 10Ops-Access-Requests, 10Patch-For-Review: Ops Onboarding for Valentín Gutiérrez - https://phabricator.wikimedia.org/T187035#3966318 (10MoritzMuehlenhoff) [09:09:40] 10Traffic, 10Operations, 10Ops-Access-Requests, 10Patch-For-Review: Ops Onboarding for Valentín Gutiérrez - https://phabricator.wikimedia.org/T187035#3962498 (10MoritzMuehlenhoff) [09:13:03] moritzm: awesome, it's working now <3 [09:13:26] good :-) [09:16:18] <_joe_> vgutierrez: pcc is just a quick cli client, if you want to do fancier things (like running at higher concurrency, or running the compiler on most hosts) you'll need to go to jenkins itself [09:16:30] <_joe_> but pcc is usually enough [09:38:22] looks like pybal on lvs2003 isn't depooling ms-fe2005 even though it is depooled according to https://config-master.wikimedia.org/pybal/codfw/swift [09:38:43] I suspect this might be the culprit? [09:38:45] Feb 13 06:33:27 lvs2003 pybal[1924]: Unhandled error in Deferred: [09:38:45] Feb 13 06:33:27 lvs2003 pybal[1924]: Traceback (most recent call last): [09:38:49] Feb 13 06:33:27 lvs2003 pybal[1924]: Failure: twisted.internet.error.ConnectionLost: Connection to the other side was lost in a non-clean fashion. [09:38:52] Feb 13 06:33:37 lvs2003 pybal[1924]: Unhandled error in Deferred: [09:38:54] Feb 13 06:33:37 lvs2003 pybal[1924]: Traceback (most recent call last): [09:38:57] Feb 13 06:33:37 lvs2003 pybal[1924]: Failure: twisted.internet.error.ConnectionLost: Connection to the other side was lost in a non-clean fashion. [09:41:49] godog: interesting [09:42:17] I'd expect the pybal_ipvs_diff alert to trigger if that were the case [09:44:04] heh, how long should it take?
I depooled ~15 min ago [09:44:36] I've just tried running the icinga command by hand and it does not find any differences [09:44:52] 09:43:08 ema@lvs2003.codfw.wmnet:~ [09:44:52] $ /usr/local/lib/nagios/plugins/check_pybal_ipvs_diff --prometheus-url=http://lvs2003:9100/metrics [09:44:55] OK: no difference between hosts in IPVS/PyBal [09:45:31] but indeed I do see ms-fe2005 as pooled in ipvs [09:46:48] and pybal thinks it's pooled too [09:46:50] 09:46:32 ema@lvs2003.codfw.wmnet:~ [09:46:50] $ curl -s http://localhost:9090/pools/swift_80 | grep 2005 [09:46:53] ms-fe2005.codfw.wmnet: enabled/up/pooled [09:47:02] (hence no pybal_ipvs_diff) [09:48:25] godog: could you please try pooling/depooling the host once again? [09:49:25] ema: yeah, just pooled and depooled again [09:50:31] moritzm: thank you for vgutierrez ldap fix :] [09:50:31] godog: no luck, it's still pooled according to pybal [09:53:47] ok I just tried with confctl on neodymium as opposed to 'depool' on the host, just because [09:55:13] same situation on lvs2006 btw [09:55:53] on both lvs200[36] connections to etcd are established as expected [09:58:04] !log restart pybal on lvs2006 [09:58:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:59:25] heh, after a pybal restart on lvs2006 the state of ms-fe2005 is reflected properly [10:01:23] interestingly, on lvs2006 there are now 30 established connections to etcd, while before the restart (and on lvs2003) there were only 6 connections [10:02:41] so this is a new failure scenario AFAIU: no difference between ipvs/pybal, (some) connections to etcd established, and still changes to data in etcd are not picked up by pybal [10:03:24] did it pick up the change after the restart? [10:03:29] yep [10:03:39] godog: I'm gonna bounce pybal on lvs2003 too so you can carry on with your day [10:03:49] !log restart pybal on lvs2003 [10:03:58] ema: thanks, btw did you see the same traceback on lvs2006 logs too? [10:04:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:04:29] looks like some thread/coroutine from twisted never recovered, from the symptoms so far [10:06:31] godog: yes, same error on lvs2006, same time [10:06:48] (roughly same time that is) [10:07:06] Feb 13 06:33:22 lvs2006 pybal[39461]: Unhandled error in Deferred: [10:07:06] Feb 13 06:33:22 lvs2006 pybal[39461]: Traceback (most recent call last): [10:07:09] Feb 13 06:33:22 lvs2006 pybal[39461]: Failure: twisted.internet.error.ConnectionLost: Connection to the other side was lost in a non-clean fashion. [10:08:12] the time makes me think of cron daily + post logrotate [10:09:49] thing is: we need to ensure there's one established etcd connection per pool [10:10:19] ema: do you know to which conf host the two lvses were connected?
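To make the invariant ema states here concrete (each configured pool should hold at least one established connection to etcd), here is a minimal sketch of the kind of connection-counting check discussed just below, written as a nagios-style plugin reading /proc/net/tcp; the etcd client port (2379), the pool count passed as an argument, and the exit-code convention are illustrative assumptions, not the actual WIP icinga check:

```python
#!/usr/bin/env python
"""Sketch: alert if established TCP connections to etcd < configured pools."""
import sys

ETCD_PORT = 2379    # assumed etcd client port
ESTABLISHED = '01'  # TCP_ESTABLISHED state code in /proc/net/tcp

def established_to_port(port):
    """Count established connections whose remote port matches `port`."""
    count = 0
    for path in ('/proc/net/tcp', '/proc/net/tcp6'):
        try:
            with open(path) as f:
                next(f)  # skip the header line
                for line in f:
                    fields = line.split()
                    # rem_address is HEXIP:HEXPORT; st is field 3
                    rem_port = int(fields[2].rsplit(':', 1)[1], 16)
                    if rem_port == port and fields[3] == ESTABLISHED:
                        count += 1
        except IOError:
            continue
    return count

if __name__ == '__main__':
    npools = int(sys.argv[1])  # e.g. 30 on lvs200[36]
    conns = established_to_port(ETCD_PORT)
    if conns < npools:
        print('CRITICAL: %d etcd connections < %d pools' % (conns, npools))
        sys.exit(2)
    print('OK: %d etcd connections >= %d pools' % (conns, npools))
    sys.exit(0)
```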
[10:10:37] there are 30 pools defined on lvs200[36], the icinga check (WIP) should alert if there's less than 30 established connections [10:11:13] volans: yes, conf2001 [10:11:33] Feb 13 06:33:26 conf2001 etcd[646]: start to snapshot (applied: 66206635, lastsnap: 66196634) [10:11:36] Feb 13 06:33:26 conf2001 etcd[646]: saved snapshot at index 66206635 [10:11:39] Feb 13 06:33:26 conf2001 etcd[646]: compacted raft log at 66201635 [10:11:41] Feb 13 06:33:38 conf2001 etcd[646]: purged file /var/lib/etcd/codfw.wmnet/member/snap/0000000000005cb1-0000000003f17856.snap successfully [10:12:14] same log on 2003 but not 2002 (it might be the master) [10:13:15] ok yeah the timing seems suspicious - though lvs2006 already went belly up a few seconds before the etcd snapshot got logged, at 06:33:22 [10:14:20] no keys replicated by etcdmirror between 05:57 and 06:36 [10:14:26] on conf2002 [10:18:13] and I agree, pybal should reconnect no matter the error [10:30:33] <_joe_> ema: I don't think that monitoring is the way to go to find out such errors [10:30:46] <_joe_> but it could be useful while we work on a proper patch [10:31:25] <_joe_> https://gerrit.wikimedia.org/r/c/365549/ I should really try to work on that [10:32:34] <_joe_> ema: did you check if you do get false positives with that alert? [10:32:37] yeah well if we don't monitor it we're flying blind and only find out when depooling a production host (just like today) [10:33:03] <_joe_> yeah I agree, I just fear it could cause a lot of false positives [10:34:50] grr, I was naively thinking that size($lvs_services) would have been enough to know the number of configured lvs services [10:42:03] yup.. I thought the same as _joe_, looks like it could lead to a lot of false positives [10:42:04] <_joe_> :) [10:42:14] (newbie POV) [10:46:41] _joe_: am I correct in saying that the number of established TCP connections to etcd must at all times be >= the number of configured pools? [10:47:52] <_joe_> ema: uhm, intuitively I'd say yes [10:55:42] _joe_: ok, so if that is true, why should we get any false positives? [10:56:53] <_joe_> ema: because I'm not sure that's reliable, given you're doing things in an eventloop, there might be times when one or more etcdconfig are not connected but still in a healthy state [10:57:02] <_joe_> the deferred not having fired up still [10:57:41] <_joe_> that's a very tiny race condition, but still, I guess our best shot is to test the script for now [10:57:54] <_joe_> and invest some time in making that damn code more reliable [10:58:01] <_joe_> I still am not sure why it fails at all [10:58:25] oh, ok, then my assumption is not true :) [10:59:04] <_joe_> ema: look at the code, I might be wrong, I'm saying this from memory :) [11:02:00] I've just tried counting the number of connections vs. the number of services on all LVSs and indeed we do have cases where they differ [11:02:23] eg: lvs3002 has 0 established connections vs. 12 configured services :P [11:02:43] and lvs1003 has 12 established connections vs. 40 services /o\ [11:02:59] they don't seem like false positives to me though [11:09:21] <_joe_> they're not?
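For reference while reading the reconnection discussion that follows: the stock Twisted answer to "pybal should reconnect no matter the error" is ReconnectingClientFactory, where clientConnectionLost must delegate to the parent class for a retry to be scheduled at all. A minimal sketch with a placeholder watch protocol; this is not pybal's actual EtcdConfigurationObserver, and the host/port are assumptions:

```python
"""Sketch of the stock Twisted reconnection pattern (NOT pybal's code)."""
from twisted.internet import reactor
from twisted.internet.protocol import Protocol, ReconnectingClientFactory

class EtcdWatch(Protocol):
    """Placeholder protocol standing in for an etcd watch client."""
    def connectionMade(self):
        # reset the exponential backoff once we're connected again;
        # a real client would re-issue its etcd watch request here
        self.factory.resetDelay()

class EtcdWatchFactory(ReconnectingClientFactory):
    protocol = EtcdWatch
    maxDelay = 30  # cap the backoff at 30s

    def clientConnectionLost(self, connector, reason):
        # fires for clean AND non-clean losses; delegating to the parent
        # class is what actually schedules the reconnect attempt
        ReconnectingClientFactory.clientConnectionLost(self, connector, reason)

    def clientConnectionFailed(self, connector, reason):
        ReconnectingClientFactory.clientConnectionFailed(self, connector, reason)

# host/port are placeholders
reactor.connectTCP('conf2001.codfw.wmnet', 2379, EtcdWatchFactory())
reactor.run()
```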
[11:09:31] <_joe_> you can test it I guess [11:10:06] <_joe_> esp on lvs3002 [11:13:23] !log depool cp3007 to test pybal's behavior on lvs3002 [11:13:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:13:55] oh man [11:14:22] !log repool cp3007 [11:14:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:15:04] pybal is misbehaving on lvs3002 too [11:15:35] <_joe_> heh [11:15:58] <_joe_> ok I guess you might want to take a look at that patch I half-baked [11:16:14] <_joe_> or better, I can make it better, I guess this can be considered UBN! [11:16:33] yep [11:16:35] <_joe_> vgutierrez: sorry, acronym alert. UBN = UnBreak Now! [11:16:41] currently misbehaving hosts: [11:16:42] lvs[3001,3003].esams.wmnet,lvs4005.ulsfo.wmnet [11:16:51] lvs[3002,3004].esams.wmnet [11:17:00] <_joe_> ema: so all of esams basically [11:17:02] lvs[1003,1006].wikimedia.org [11:17:11] <_joe_> I guess any network glitch can cause such a failure [11:17:26] <_joe_> ok it's time to fix that damn driver [11:17:37] <_joe_> it's been a thorn in our side since its inception [11:17:48] <_joe_> ema: can you trace the errors in the logs, btw? [11:18:02] _joe_: yes [11:18:04] Failure: twisted.internet.error.ConnectionLost: Connection to the other side was lost in a non-clean [11:18:15] <_joe_> and then no reconnection, wtf [11:18:17] (see backlog) [11:18:27] which I don't understand given that we should be reconnecting? [11:18:43] <_joe_> anyways, I'll finish what I am doing and I'll take a stab at that patch afterwards [11:19:30] in the sense that we're implementing clientConnectionLost in EtcdConfigurationObserver, so it's not clear to me why the error is unhandled [11:19:47] <_joe_> ema: same here [11:20:00] <_joe_> 10:58 < _joe_> I still am not sure why it fails at all [11:20:07] right :) [11:20:23] I'll begin the restarts meanwhile [11:21:09] !log esams secondary LVSs: restart pybal to properly reconnect to etcd [11:21:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:23:08] !log esams primary LVSs: restart pybal to reconnect to etcd [11:23:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:25:52] 10netops, 10Operations, 10ops-eqsin: cp5010 - no link on primary ethernet port - https://phabricator.wikimedia.org/T187158#3966644 (10fgiunchedi) p:05Triage>03Normal [11:26:15] !log lvs4005: restart pybal to reconnect to etcd [11:26:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:27:49] !log lvs1006/1010: restart pybal to reconnect to etcd [11:28:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:29:49] !log lvs1003: restart pybal to reconnect to etcd [11:30:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:30:45] restarts done [12:37:58] I too wonder about that time of morning [12:38:51] 10:08 < volans> the time makes me think of cron daily + post logrotate [12:39:12] it's also similar timing to when we have our daily minor esams traffic issues that eventually self-resolve... [12:41:07] bblack: although the logrotate for etcd is just doing: service rsyslog rotate [12:41:49] I thought rsyslog had its own global logrotate rule? 
[12:45:40] yes it does too, invoke-rc.d rsyslog rotate [12:49:16] maybe we're rotating during our rotation heh [13:03:54] rotation inception :D anyway it shouldn't affect etcd per se I guess [13:04:38] _joe_: FYI we have some leftover rotate files and logdirs in conf hosts for the mirrors that I think might need manual cleanup [13:04:45] like /etc/logrotate.d/etcdmirror--eqiad-wmnet [13:04:58] <_joe_> volans: hah, yeah [13:05:10] <_joe_> it's still going to be back in the future, so.. [13:05:20] <_joe_> well, not getting there right now [13:11:22] 10netops, 10Operations, 10ops-eqsin: cp5010 - no link on primary ethernet port - https://phabricator.wikimedia.org/T187158#3966190 (10BBlack) [13:13:38] 10netops, 10Operations, 10ops-eqsin: cp5010 - no link on primary ethernet port - https://phabricator.wikimedia.org/T187158#3966974 (10BBlack) Switch config says xe-1/0/4, but that port appears to be missing (no SFP?) [13:14:56] 10Traffic, 10netops, 10Operations, 10ops-eqsin: cp5010 - no link on primary ethernet port - https://phabricator.wikimedia.org/T187158#3967012 (10BBlack) [13:15:05] 10Traffic, 10Operations, 10ops-eqsin: cp5006 unresponsive - https://phabricator.wikimedia.org/T187157#3967013 (10BBlack) [13:15:30] 10Traffic, 10Operations, 10ops-eqsin: dns5002 mgmt console unreachable - https://phabricator.wikimedia.org/T186902#3967014 (10BBlack) [13:15:34] 10Traffic, 10Operations, 10ops-eqsin: rack/setup/install lvs500[123] - https://phabricator.wikimedia.org/T182171#3967015 (10BBlack) [13:15:47] 10Traffic, 10Operations, 10ops-eqsin: rack/setup scs-eqsin.mgmt.eqsin.wmnet - https://phabricator.wikimedia.org/T181569#3967016 (10BBlack) [13:16:08] 10Traffic, 10Operations, 10ops-eqsin: rack/setup/install cp50(0[1-9]|1[0-2]) - https://phabricator.wikimedia.org/T181557#3967017 (10BBlack) [13:16:18] 10Traffic, 10Operations, 10ops-eqsin: rack/setup/install dns500[12] - https://phabricator.wikimedia.org/T181556#3967018 (10BBlack) [13:16:34] 10Traffic, 10Operations, 10ops-eqsin: rack/setup/install bast5001 - https://phabricator.wikimedia.org/T181554#3967022 (10BBlack) [13:16:46] 10Traffic, 10netops, 10Operations, 10ops-eqsin: Setup eqsin RIPE Atlas anchor - https://phabricator.wikimedia.org/T179042#3967023 (10BBlack) [13:17:00] 10Traffic, 10DC-Ops, 10Operations, 10ops-eqsin: singapore caching center: eqiad staging tracking task - https://phabricator.wikimedia.org/T166179#3967024 (10BBlack) [13:23:15] 10Traffic, 10Operations, 10ops-eqsin: rack/setup scs-eqsin.mgmt.eqsin.wmnet - https://phabricator.wikimedia.org/T181569#3794643 (10BBlack) >>! In T181569#3901023, @ayounsi wrote: > The other atlas don't seem to be connected to a scs so I can't compare. Do we even have a use for scs on atlas, if it's not use... [13:27:09] 10Traffic, 10Operations, 10ops-eqsin: rack/setup/install lvs500[123] - https://phabricator.wikimedia.org/T182171#3967086 (10BBlack) [13:27:11] 10Traffic, 10Operations, 10ops-eqsin: rack/setup/install cp50(0[1-9]|1[0-2]) - https://phabricator.wikimedia.org/T181557#3967087 (10BBlack) [14:40:42] 10Traffic, 10Operations, 10Page-Previews, 10RESTBase, and 2 others: Cached page previews not shown when refreshed - https://phabricator.wikimedia.org/T184534#3967535 (10BBlack) Well, the above solution would still leave you with your short-age problem, if you didn't also zero out the Age. Do we want to
[15:06:21] 10Traffic, 10Operations, 10ops-ulsfo, 10Patch-For-Review: decommission lvs400[1-4].ulsfo.wmnet - https://phabricator.wikimedia.org/T178535#3967622 (10BBlack) a:05BBlack>03RobH These have been spared-out for a while now and we're fine on the new ones, please kill. [15:09:25] 10Traffic, 10Operations, 10Patch-For-Review: Refactor varnish puppet config - https://phabricator.wikimedia.org/T96847#3967627 (10BBlack) [15:09:26] 10Traffic, 10Operations, 10Patch-For-Review: Create globally-unique varnish cache cluster port/instancename mappings - https://phabricator.wikimedia.org/T119396#3967625 (10BBlack) 05Open>03declined We did some interesting refactoring here that helped with some other related issues, but the main goal here... [15:10:57] 10Traffic, 10Operations: cache_upload: uncompressed images with Content-Encoding: gzip cause content decoding issues - https://phabricator.wikimedia.org/T148830#3967629 (10BBlack) 05Open>03Resolved AFAIK with the resolution of T162035 we haven't had further reports. [16:16:12] 10Traffic, 10netops, 10Operations, 10ops-eqsin: cp5010 - no link on primary ethernet port - https://phabricator.wikimedia.org/T187158#3967940 (10ayounsi) I also double-checked the connection tracking spreadsheet and that should be the proper port. Switch says > error: device xe-1/0/4 not found And the tran... [16:42:14] 10Traffic, 10Operations, 10ops-eqsin: rack/setup scs-eqsin.mgmt.eqsin.wmnet - https://phabricator.wikimedia.org/T181569#3968028 (10ayounsi) It's a nice to have, just in case, but clearly not a blocker. [17:16:06] Feb 13 17:10:53 cp4031 systemd[1]: Started varnish (Varnish HTTP Accelerator). [17:16:16] Feb 13 17:08:54 cp4031 systemd[1]: Stopped varnish (Varnish HTTP Accelerator) [17:16:36] so icinga picked it up presumably because there was a 2-minute gap there (while depooled), which I think sounds a little longer than we expect [17:17:24] all that happens in the script between the systemd-level stop->start is: "rm -f /srv/sd*/varnish*; sync; sleep 10" [17:19:12] yeah [17:19:23] for comparison, cp4023's restart: [17:19:31] Jan 14 14:16:36 cp4023 systemd[1]: Stopping varnish (Varnish HTTP Accelerator)... [17:19:37] Jan 14 14:17:32 cp4023 systemd[1]: Started varnish (Varnish HTTP Accelerator). [17:19:58] that one took a bit less than a minute [17:20:02] I guess the wipe/sync got slow, I mean there's no absolute expectation there [17:20:56] in any case, so long as it was depooled it's not a huge issue, so long as we don't end up spamming icinga alerts to IRC all the time (if this is e.g.
common post-v5-upgrade, we might want to tweak icinga alerting) [17:21:27] ideally, we'd disable icinga while the host is depooled and re-enable it upon repool :) [17:21:45] (or someone could take on the amazing task of linking up icinga + etcd, so when we define a monitor we can define a matching etcd depool state that temporarily suppresses it) [17:21:50] heh [17:22:10] technically hosts don't get depooled, but services do [17:22:16] (would've worked in this particular case) [17:22:24] yeah [17:24:21] some grepping under /var/log/icinga/ seems to suggest that this was an isolated case [17:32:12] so I was starting to look at how our backend restart cron_splay stuff works (re: mysterious 160s thing) [17:32:39] to recap intentions from the code comments: [17:32:41] The idea here is to ensure each host in the set executes the cron once per time [17:32:45] period, and also ensure the time between hosts is consistent (no edge cases [17:32:48] much closer than the average) by splaying them as evenly as possible with [17:32:51] rounding errors. For the case of hosts with NNNN numbers indicating the [17:32:54] datacenter in the first digit, we also maximize the period between any two [17:32:57] hosts in a given datacenter by interleaving sorted per-DC lists of hosts before [17:33:00] splaying. [17:33:44] so the idea there is if you say "weekly", there are 10080 minutes in a week. and if your global cluster list is cp1001, cp1002, cp2001, cp2002, cp3001, cp3002 [17:34:25] it should interleave them something like cp1001, cp2001, cp3001, cp1002, cp2002, cp3002 [17:35:01] there's also some persistent randomization of the order within a DC, so it might well be e.g. cp1002, cp2001, cp3001, cp1001, cp2002, cp3002 [17:35:28] and then spread those out to have, in this case, 10080/6 = 1680 minutes between each restart [17:35:51] which seems like a fairly legit strategy for avoiding timing worries of various kinds, all things considered. [17:36:34] but then I go look at just text@eqiad (which has 8 of the 35 global text cluster nodes) [17:37:27] 6/8 of text@eqiad are all at exactly 02:44 of some day, using days 0, 1, 2, 3, 4, 5. [17:37:41] which seems like a very odd coincidence [17:38:03] maybe cron_splay is buggy, or maybe we just have some bad coincidences between cluster node counts and minutes-per-day/week. [17:40:06] <_joe_> occam's razor? [17:40:19] <_joe_> I mean it could be after all [17:40:33] yeah I think it's just unrealized coincidences [17:40:36] <_joe_> in some ways it's nice to have machines at the same time on different days [17:41:00] 10080 minutes per week is evenly divisible by a lot of related time-divisions and cluster counts [17:41:17] the pattern is apparently common, though [17:41:39] e.g. text@esams has 6/8 of its nodes also aligned on the same time on different days [17:42:16] when looking at a daily fault pattern it makes some things hard, though. [17:42:27] (the fault doesn't directly align with any of the crons FWIW) [17:42:47] (but could be related to them in time, e.g. X hours after Y and Z restart shortly apart) [17:43:44] oh, right, so we actually have 36 nodes because cp1008 gets counted in this list for this purpose [17:43:47] that makes more sense [17:44:10] 10080/36 = 280 [17:44:42] which means every 25 minutes if it's cycling through 5x interleaved datacenters [17:45:07] I donno, need to stare more [17:45:32] (not 25 minutes, more like 23 hours) [17:51:03] it is probably still too early to draw conclusions about the probabilistic ttl variation patch [17:51:24] ?
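A toy reconstruction of the cron_splay behaviour walked through above (stably shuffle each DC's hosts, interleave the per-DC lists, then splay evenly over the period), based only on the code comments quoted in the log; the md5-based seeding and the crontab formatting are assumptions, not the actual puppet function:

```python
"""Toy cron_splay: interleave stably-shuffled per-DC lists, splay evenly."""
import hashlib
from itertools import zip_longest

MINS_PER_WEEK = 10080

def cron_splay(hosts, seed=''):
    # group hosts by DC, taken from the first digit of the NNNN number
    by_dc = {}
    for host in hosts:
        digits = ''.join(c for c in host if c.isdigit())
        by_dc.setdefault(digits[0], []).append(host)
    # persistent pseudo-random order within each DC (stable across runs)
    for dc in by_dc:
        by_dc[dc].sort(key=lambda h: hashlib.md5((seed + h).encode()).hexdigest())
    # interleave: one host from each DC in turn
    interleaved = [h for round_ in zip_longest(*by_dc.values())
                   for h in round_ if h is not None]
    # splay: spread the interleaved list evenly over the week
    step = MINS_PER_WEEK // len(interleaved)
    return {h: i * step for i, h in enumerate(interleaved)}

# the 6-host example from the log: 10080/6 = 1680 minutes apart
for host, m in sorted(cron_splay(['cp1001', 'cp1002', 'cp2001',
                                  'cp2002', 'cp3001', 'cp3002']).items(),
                      key=lambda kv: kv[1]):
    print('%s: day %d at %02d:%02d' % (host, m // 1440, m % 1440 // 60, m % 60))
```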
[17:51:34] the "swizzle" thing :) [17:51:36] I'd have thought it'd have had impact by now [17:51:46] I'm kinda meh on its impact so far on the evidence [17:52:07] but I guess in theory, it could take a full restart cycle (week) to have the fullest effect on various stampedes on storage [17:52:54] right, so end of this week give or take? [17:53:56] we merged it on the 8th (Thursday) [17:56:04] right [17:56:26] but I really would've thought we'd see most of the impact (if any) on this particular issue within a day or two [17:56:36] but that pattern keeps recurring, unaffected [18:00:25] it would be really interesting to have stats about "successful" purges and be able to separate those from evictions/ttl+keep expirations [18:03:28] yeah, it's tricky IIRC because we may send the same status code either way [18:03:33] I think just the text after the status code changes [18:09:14] maybe we used to anyways [18:09:15] 10Traffic, 10Maps-Sprint, 10Operations: Decide on Cache-Control headers for map tiles - https://phabricator.wikimedia.org/T186732#3968439 (10Gehel) [18:09:23] seems like with v4 it's unconditional "204 Purged"? [18:09:38] I wonder if vcl_purge has access to any data that can tell it hit/miss [18:10:32] - RespStatus 204 [18:10:32] - RespReason No Content [18:10:32] - RespReason Purged [18:10:45] so yeah, '204 Purged' is the final one [18:11:39] the No Content is just the defaulting from the 204 I think [18:11:57] I don't think there's any call-path changes on hit or miss for purge [18:12:04] it's just vcl_recv->vcl_hash->vcl_purge [18:12:32] so the only place we might be able to make a difference is if vcl_purge can observe something dumb like obj.hits or similar [18:13:06] https://varnish-cache.org/lists/pipermail/varnish-misc/2015-March/thread.html [18:17:10] interesting how this poor soul was using inline C before v4 for this? https://varnish-cache.org/lists/pipermail/varnish-misc/2015-March/024324.html [18:17:37] why? :) [18:18:26] "What do you use it for? Why do you care if the purge call purged something?" [18:18:30] yeah [18:18:34] lol - to debug your performance issues :P [18:19:14] we had similar without inline C before, and ignored the Vary-issue [18:20:15] who knows what the reason was to stop going through vcl_hit/miss from v3 to v4 [18:21:47] if I had to guess, I'd say they'd do that to enable the possibility of deferring the actual purge work for perf [18:21:56] which they may or may not have taken advantage of in the end [18:22:38] (by using a ban-like optimization, they could return "purge done" accurately very quickly by storing a timestamp and purge info, and then some other code could asynchronously actually execute the purge, in the style of a ban) [18:23:18] hey, there's MAIN.n_obj_purged though [18:24:04] MAIN.n_purges 379380341 . Number of purge operations executed [18:24:07] MAIN.n_obj_purged 2586573 . Number of purged objects [18:24:55] oh interesting, do we track that in varnish-machine-stats? [18:25:02] now if the former is the number of PURGE requests and the latter is the number of objects actually purged, we'd have some interesting data here :) [18:25:22] also, against my better judgement, I peeked at HSH_Purge() [18:25:54] it seems that when it's iterating the vary-slot objects within the object head to purge them all, if any of them is "busy" it just skips them and doesn't bother purging? :P [18:26:01] we don't plot it yet, let me do that [18:26:12] what? [18:26:22] VTAILQ_FOREACH(oc, &oh->objcs, hsh_list) { [18:26:25] [...]
[18:26:31] if (oc->flags & OC_F_BUSY) { [18:26:31] /* [18:26:31] * We cannot purge busy objects here, because [18:26:34] * their owners have special rights to them, [18:26:37] * and may nuke them without concern for the [18:26:40] * refcount, which by definition always must [18:26:43] * be one, so they don't check. [18:26:46] */ [18:26:48] continue; [18:27:07] I think at least one common case of OC_F_BUSY would be inserting a new object [18:27:16] I'm not sure if there are others on older objects? [18:27:30] if it's just insertion, then basically it's a purge-vs-insert race and insert wins [18:28:54] hmm yes, simple code search seems to indicate new objects that are still in the process of inserting are the only ones with OC_F_BUSY [18:29:55] ema: we have the stat logged, just not grafana'd? [18:30:09] bblack: yup, adding it now [18:30:13] neat [18:30:27] I gotta run, dr appt in a bit, I'll peek at it later and see how it lines up [18:31:42] ok [18:31:45] https://grafana.wikimedia.org/dashboard/db/varnish-machine-stats?panelId=67&fullscreen&orgId=1&var-server=cp4023&var-datasource=ulsfo%20prometheus%2Fops [19:54:21] 10Traffic, 10Analytics, 10Operations, 10Research, and 6 others: Referrer policy for browsers which only support the old spec - https://phabricator.wikimedia.org/T180921#3968921 (10Nuria) As of me checking today referral groups are mostly unchanged from the graphs i pasted above. That is, safari sessions ap... [20:05:31] 10Traffic, 10Analytics, 10Operations, 10Research, and 6 others: Referrer policy for browsers which only support the old spec - https://phabricator.wikimedia.org/T180921#3968983 (10Tgr) If nothing seems more broken than before that's good enough for me :) The main goal was to prevent Edge and Safari from se... [21:43:35] 10Traffic, 10Operations, 10Ops-Access-Requests, 10Patch-For-Review: Ops Onboarding for Valentín Gutiérrez - https://phabricator.wikimedia.org/T187035#3969465 (10Dzahn) added to "WMF-NDA" in Phabricator https://phabricator.wikimedia.org/project/members/61/ This should let you see all the private tickets. [22:01:24] 10Traffic, 10Operations, 10Ops-Access-Requests, 10Patch-For-Review: Ops Onboarding for Valentín Gutiérrez - https://phabricator.wikimedia.org/T187035#3969497 (10Dzahn) subscribed you to both ops mailing lists (others like wikitech-l are optional and self-service) https://lists.wikimedia.org/mailman/listi... [22:04:58] 10Traffic, 10Operations, 10Ops-Access-Requests, 10Patch-For-Review: Ops Onboarding for Valentín Gutiérrez - https://phabricator.wikimedia.org/T187035#3969513 (10Dzahn) please take a look at https://office.wikimedia.org/wiki/Office_IT/Calendars#Human_calendars and check that you can see the "Ops Maintena... [22:06:11] 10Traffic, 10Operations, 10Ops-Access-Requests, 10Patch-For-Review: Ops Onboarding for Valentín Gutiérrez - https://phabricator.wikimedia.org/T187035#3969514 (10Dzahn) @Robh could you do one more Racktables user? thanks! [22:09:23] 10Traffic, 10Operations, 10Ops-Access-Requests, 10Patch-For-Review: Ops Onboarding for Valentín Gutiérrez - https://phabricator.wikimedia.org/T187035#3969518 (10Dzahn) [22:23:09] 10Traffic, 10Operations, 10Ops-Access-Requests, 10Patch-For-Review: Ops Onboarding for Valentín Gutiérrez - https://phabricator.wikimedia.org/T187035#3962498 (10Dzahn)
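Following up on the n_purges vs. n_obj_purged idea above: a quick sketch of pulling both counters out of varnishstat's JSON output and computing objects-purged-per-operation, the "interesting data" mentioned in the log. The flat JSON layout (counter names as top-level keys with a "value" field) matches varnish 4/5; treat the exact shape as an assumption to verify:

```python
"""Sketch: ratio of objects actually purged to purge ops, via varnishstat -j."""
import json
import subprocess

stats = json.loads(subprocess.check_output(['varnishstat', '-j']))

n_purges = stats['MAIN.n_purges']['value']    # purge operations executed
n_obj = stats['MAIN.n_obj_purged']['value']   # objects actually purged

print('purge ops:       %d' % n_purges)
print('objects purged:  %d' % n_obj)
if n_purges:
    # with the figures pasted in the log: 2586573 / 379380341 ~= 0.007
    print('objects per op:  %.4f' % (float(n_obj) / n_purges))
```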