[07:15:03] 10netops, 10Operations, 10ops-eqsin: cp5010 - no link on primary ethernet port - https://phabricator.wikimedia.org/T187158#3966190 (10BBlack) [08:08:47] morning hashar! [08:10:05] hashar: I'm getting a 403 from Jenkins API when I try to run puppet utils/pcc, what are the permissions I need to ask for to get that fixed? [08:41:23] vgutierrez: hello, depends on the URL? some are blacklisted intentionally (such as the build history) [08:41:48] most would require to be under NDA, and for Jenkins administration it is locked down to just a few people [08:42:02] what is the 403ing URL? [08:45:33] vgutierrez: did you use the --api-token parameter? You can get one in your user panel in the Jenkins UI [08:57:44] elukey: I'm using JENKINS_USERNAME & JENKINS_API_TOKEN env variables [08:59:36] hashar: that's it then, I'm not in the NDA group (yet) [09:02:01] super :) [09:02:27] are we talking about the NDA group in LDAP or Phab? [09:04:56] LDAP. Valentín should be in cn=wmf, not cn=nda (since he's staff), I'm adding him to LDAP now [09:05:17] thx moritzm :) [09:08:06] I've added you to cn=wmf and cn=ops, can you re-try PCC? [09:08:50] 10Traffic, 10Operations, 10Ops-Access-Requests, 10Patch-For-Review: Ops Onboarding for Valentín Gutiérrez - https://phabricator.wikimedia.org/T187035#3962498 (10MoritzMuehlenhoff) Added to cn=ops and cn=wmf LDAP groups. [09:09:02] 10Traffic, 10Operations, 10Ops-Access-Requests, 10Patch-For-Review: Ops Onboarding for Valentín Gutiérrez - https://phabricator.wikimedia.org/T187035#3966318 (10MoritzMuehlenhoff) [09:09:40] 10Traffic, 10Operations, 10Ops-Access-Requests, 10Patch-For-Review: Ops Onboarding for Valentín Gutiérrez - https://phabricator.wikimedia.org/T187035#3962498 (10MoritzMuehlenhoff) [09:13:03] moritzm: awesome, it's working now <3 [09:13:26] good :-) [09:16:18] <_joe_> vgutierrez: pcc is just a quick cli client, if you want to do fancier things (like running at higher concurrency, or running the compiler on most hosts) you'll need to go to jenkins itself [09:16:30] <_joe_> but pcc is usually enough [09:38:22] looks like pybal on lvs2003 isn't depooling ms-fe2005 even though it is depooled according to https://config-master.wikimedia.org/pybal/codfw/swift [09:38:43] I suspect this might be the culprit? [09:38:45] Feb 13 06:33:27 lvs2003 pybal[1924]: Unhandled error in Deferred: [09:38:45] Feb 13 06:33:27 lvs2003 pybal[1924]: Traceback (most recent call last): [09:38:49] Feb 13 06:33:27 lvs2003 pybal[1924]: Failure: twisted.internet.error.ConnectionLost: Connection to the other side was lost in a non-clean fashion. [09:38:52] Feb 13 06:33:37 lvs2003 pybal[1924]: Unhandled error in Deferred: [09:38:54] Feb 13 06:33:37 lvs2003 pybal[1924]: Traceback (most recent call last): [09:38:57] Feb 13 06:33:37 lvs2003 pybal[1924]: Failure: twisted.internet.error.ConnectionLost: Connection to the other side was lost in a non-clean fashion. [09:41:49] godog: interesting [09:42:17] I'd expect the pybal_ipvs_diff alert to trigger if that were the case [09:44:04] heh, how long should it take?
I depooled ~15 min ago [09:44:36] I've just tried running the icinga command by hand and it does not find any differences [09:44:52] 09:43:08 ema@lvs2003.codfw.wmnet:~ [09:44:52] $ /usr/local/lib/nagios/plugins/check_pybal_ipvs_diff --prometheus-url=http://lvs2003:9100/metrics [09:44:55] OK: no difference between hosts in IPVS/PyBal [09:45:31] but indeed I do see ms-fe2005 as pooled in ipvs [09:46:48] and pybal thinks it's pooled too [09:46:50] 09:46:32 ema@lvs2003.codfw.wmnet:~ [09:46:50] $ curl -s http://localhost:9090/pools/swift_80 | grep 2005 [09:46:53] ms-fe2005.codfw.wmnet: enabled/up/pooled [09:47:02] (hence no pybal_ipvs_diff) [09:48:25] godog: could you please try pooling/depooling the host once again? [09:49:25] ema: yeah, just pooled and depooled again [09:50:31] moritzm: thank you for vgutierrez ldap fix :] [09:50:31] godog: no luck, it's still pooled according to pybal [09:53:47] ok I just tried with confctl on neodymium as opposed to 'depool' on the host, just because [09:55:13] same situation on lvs2006 btw [09:55:53] on both lvs200[36] connections to etcd are established as expected [09:58:04] !log restart pybal on lvs2006 [09:58:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:59:25] heh, after a pybal restart on lvs2006 the state of ms-fe2005 is reflected properly [10:01:23] interestingly, on lvs2006 there are now 30 established connections to etcd, while before the restart (and on lvs2003) there were only 6 connections [10:02:41] so this is a new failure scenario AFAIU: no difference between ipvs/pybal, (some) connections to etcd established, and still changes to data in etcd are not picked up by pybal [10:03:24] did it pick up the change after the restart? [10:03:29] yep [10:03:39] godog: I'm gonna bounce pybal on lvs2003 too so you can carry on with your day [10:03:49] !log restart pybal on lvs2003 [10:03:58] ema: thanks, btw did you see the same traceback on lvs2006 logs too? [10:04:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:04:29] looks like some thread/coroutine from twisted never recovered, from the symptoms so far [10:06:31] godog: yes, same error on lvs2006, same time [10:06:48] (roughly same time that is) [10:07:06] Feb 13 06:33:22 lvs2006 pybal[39461]: Unhandled error in Deferred: [10:07:06] Feb 13 06:33:22 lvs2006 pybal[39461]: Traceback (most recent call last): [10:07:09] Feb 13 06:33:22 lvs2006 pybal[39461]: Failure: twisted.internet.error.ConnectionLost: Connection to the other side was lost in a non-clean fashion. [10:08:12] the time makes me think of cron daily + post logrotate [10:09:49] thing is: we need to ensure there's one established etcd connection per pool [10:10:19] ema: do you know to which conf host the two lvses were connected?
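To make the invariant ema states here concrete (each configured pool should hold at least one established connection to etcd), here is a minimal sketch of the kind of connection-counting check discussed just below, written as a nagios-style plugin reading /proc/net/tcp; the etcd client port (2379), the pool count passed as an argument, and the exit-code convention are illustrative assumptions, not the actual WIP icinga check:

```python
#!/usr/bin/env python
"""Sketch: alert if established TCP connections to etcd < configured pools."""
import sys

ETCD_PORT = 2379    # assumed etcd client port
ESTABLISHED = '01'  # TCP_ESTABLISHED state code in /proc/net/tcp

def established_to_port(port):
    """Count established connections whose remote port matches `port`."""
    count = 0
    for path in ('/proc/net/tcp', '/proc/net/tcp6'):
        try:
            with open(path) as f:
                next(f)  # skip the header line
                for line in f:
                    fields = line.split()
                    # rem_address is HEXIP:HEXPORT; st is field 3
                    rem_port = int(fields[2].rsplit(':', 1)[1], 16)
                    if rem_port == port and fields[3] == ESTABLISHED:
                        count += 1
        except IOError:
            continue
    return count

if __name__ == '__main__':
    npools = int(sys.argv[1])  # e.g. 30 on lvs200[36]
    conns = established_to_port(ETCD_PORT)
    if conns < npools:
        print('CRITICAL: %d etcd connections < %d pools' % (conns, npools))
        sys.exit(2)
    print('OK: %d etcd connections >= %d pools' % (conns, npools))
    sys.exit(0)
```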
[10:10:37] there are 30 pools defined on lvs200[36], the icinga check (WIP) should alert if there's less than 30 established connections [10:11:13] volans: yes, conf2001 [10:11:33] Feb 13 06:33:26 conf2001 etcd[646]: start to snapshot (applied: 66206635, lastsnap: 66196634) [10:11:36] Feb 13 06:33:26 conf2001 etcd[646]: saved snapshot at index 66206635 [10:11:39] Feb 13 06:33:26 conf2001 etcd[646]: compacted raft log at 66201635 [10:11:41] Feb 13 06:33:38 conf2001 etcd[646]: purged file /var/lib/etcd/codfw.wmnet/member/snap/0000000000005cb1-0000000003f17856.snap successfully [10:12:14] same log on 2003 but not 2002 (it might be the master) [10:13:15] ok yeah the timing seems suspicious - though lvs2006 already went belly up a few seconds before the etcd snapshot got logged, at 06:33:22 [10:14:20] no keys replicated by etcdmirror between 05:57 and 06:36 [10:14:26] on conf2002 [10:18:13] and I agree, pybal should reconnect no matter the error [10:30:33] <_joe_> ema: I don't think that monitoring is the way to go to find out such errors [10:30:46] <_joe_> but it could be useful while we work on a proper patch [10:31:25] <_joe_> https://gerrit.wikimedia.org/r/c/365549/ I should really try to work on that [10:32:34] <_joe_> ema: did you check if you do get false positives with that alert? [10:32:37] yeah well if we don't monitor it we're flying blind and only find out when depooling a production host (just like today) [10:33:03] <_joe_> yeah I agree, I just fear it could cause a lot of false positives [10:34:50] grr, I was naively thinking that size($lvs_services) would have been enough to know the number of configured lvs services [10:42:03] yup.. I thought the same as _joe_, looks like it could lead to a lot of false positives [10:42:04] <_joe_> :) [10:42:14] (newbie POV) [10:46:41] _joe_: am I correct in saying that the number of established TCP connections to etcd must at all times be >= the number of configured pools? [10:47:52] <_joe_> ema: uhm, intuitively I'd say yes [10:55:42] _joe_: ok, so if that is true, why should we get any false positives? [10:56:53] <_joe_> ema: because I'm not sure that's reliable, given you're doing things in an eventloop, there might be times when one or more etcdconfig are not connected but still in a healthy state [10:57:02] <_joe_> the deferred not having fired up still [10:57:41] <_joe_> that's a very tiny race condition, but still, I guess our best shot is to test the script for now [10:57:54] <_joe_> and invest some time in making that damn code more reliable [10:58:01] <_joe_> I still am not sure why it fails at all [10:58:25] oh, ok, then my assumption is not true :) [10:59:04] <_joe_> ema: look at the code, I might be wrong, I'm saying this from memory :) [11:02:00] I've just tried counting the number of connections vs. the number of services on all LVSs and indeed we do have cases where they differ [11:02:23] eg: lvs3002 has 0 established connections vs. 12 configured services :P [11:02:43] and lvs1003 has 12 established connections vs. 40 services /o\ [11:02:59] they don't seem like false positives to me though [11:09:21] <_joe_> they're not?
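For reference while reading the reconnection discussion that follows: the stock Twisted answer to "pybal should reconnect no matter the error" is ReconnectingClientFactory, where clientConnectionLost must delegate to the parent class for a retry to be scheduled at all. A minimal sketch with a placeholder watch protocol; this is not pybal's actual EtcdConfigurationObserver, and the host/port are assumptions:

```python
"""Sketch of the stock Twisted reconnection pattern (NOT pybal's code)."""
from twisted.internet import reactor
from twisted.internet.protocol import Protocol, ReconnectingClientFactory

class EtcdWatch(Protocol):
    """Placeholder protocol standing in for an etcd watch client."""
    def connectionMade(self):
        # reset the exponential backoff once we're connected again;
        # a real client would re-issue its etcd watch request here
        self.factory.resetDelay()

class EtcdWatchFactory(ReconnectingClientFactory):
    protocol = EtcdWatch
    maxDelay = 30  # cap the backoff at 30s

    def clientConnectionLost(self, connector, reason):
        # fires for clean AND non-clean losses; delegating to the parent
        # class is what actually schedules the reconnect attempt
        ReconnectingClientFactory.clientConnectionLost(self, connector, reason)

    def clientConnectionFailed(self, connector, reason):
        ReconnectingClientFactory.clientConnectionFailed(self, connector, reason)

# host/port are placeholders
reactor.connectTCP('conf2001.codfw.wmnet', 2379, EtcdWatchFactory())
reactor.run()
```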
[11:09:31] <_joe_> you can test it I guess [11:10:06] <_joe_> esp on lvs3002 [11:13:23] !log depool cp3007 to test pybal's behavior on lvs3002 [11:13:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:13:55] oh man [11:14:22] !log repool cp3007 [11:14:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:15:04] pybal is misbehaving on lvs3002 too [11:15:35] <_joe_> heh [11:15:58] <_joe_> ok I guess you might want to take a look at that patch I half-baked [11:16:14] <_joe_> or better, I can make it better, I guess this can be considered UBN! [11:16:33] yep [11:16:35] <_joe_> vgutierrez: sorry, acronym alert. UBN = UnBreak Now! [11:16:41] currently misbehaving hosts: [11:16:42] lvs[3001,3003].esams.wmnet,lvs4005.ulsfo.wmnet [11:16:51] lvs[3002,3004].esams.wmnet [11:17:00] <_joe_> ema: so all of esams basically [11:17:02] lvs[1003,1006].wikimedia.org [11:17:11] <_joe_> I guess any network glitch can cause such a failure [11:17:26] <_joe_> ok it's time to fix that damn driver [11:17:37] <_joe_> it's been a thorn in our side since its inception [11:17:48] <_joe_> ema: can you trace the errors in the logs, btw? [11:18:02] _joe_: yes [11:18:04] Failure: twisted.internet.error.ConnectionLost: Connection to the other side was lost in a non-clean [11:18:15] <_joe_> and then no reconnection, wtf [11:18:17] (see backlog) [11:18:27] which I don't understand given that we should be reconnecting? [11:18:43] <_joe_> anyways, I'll finish what I am doing and I'll take a stab at that patch afterwards [11:19:30] in the sense that we're implementing clientConnectionLost in EtcdConfigurationObserver, so it's not clear to me why the error is unhandled [11:19:47] <_joe_> ema: same here [11:20:00] <_joe_> 10:58 < _joe_> I still am not sure why it fails at all [11:20:07] right :) [11:20:23] I'll begin the restarts meanwhile [11:21:09] !log esams secondary LVSs: restart pybal to properly reconnect to etcd [11:21:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:23:08] !log esams primary LVSs: restart pybal to reconnect to etcd [11:23:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:25:52] 10netops, 10Operations, 10ops-eqsin: cp5010 - no link on primary ethernet port - https://phabricator.wikimedia.org/T187158#3966644 (10fgiunchedi) p:05Triage>03Normal [11:26:15] !log lvs4005: restart pybal to reconnect to etcd [11:26:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:27:49] !log lvs1006/1010: restart pybal to reconnect to etcd [11:28:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:29:49] !log lvs1003: restart pybal to reconnect to etcd [11:30:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:30:45] restarts done [12:37:58] I too wonder about that time of morning [12:38:51] 10:08 < volans> the time makes me think of cron daily + post logrotate [12:39:12] it's also similar timing to when we have our daily minor esams traffic issues that eventually self-resolve... [12:41:07] bblack: although the logrotate for etcd is just doing: service rsyslog rotate [12:41:49] I thought rsyslog had its own global logrotate rule? 
[12:45:40] yes it does too, invoke-rc.d rsyslog rotate [12:49:16] maybe we're rotating during our rotation heh [13:03:54] rotation inception :D anyway it shouldn't affect etcd per se I guess [13:04:38] _joe_: FYI we have some leftover rotate files and logdirs in conf hosts for the mirrors that I think might need manual cleanup [13:04:45] like /etc/logrotate.d/etcdmirror--eqiad-wmnet [13:04:58] <_joe_> volans: hah, yeah [13:05:10] <_joe_> it's still going to be back in the future, so.. [13:05:20] <_joe_> well, not getting there right now [13:11:22] 10netops, 10Operations, 10ops-eqsin: cp5010 - no link on primary ethernet port - https://phabricator.wikimedia.org/T187158#3966190 (10BBlack) [13:13:38] 10netops, 10Operations, 10ops-eqsin: cp5010 - no link on primary ethernet port - https://phabricator.wikimedia.org/T187158#3966974 (10BBlack) Switch config says xe-1/0/4, but that port appears to be missing (no SFP?) [13:14:56] 10Traffic, 10netops, 10Operations, 10ops-eqsin: cp5010 - no link on primary ethernet port - https://phabricator.wikimedia.org/T187158#3967012 (10BBlack) [13:15:05] 10Traffic, 10Operations, 10ops-eqsin: cp5006 unresponsive - https://phabricator.wikimedia.org/T187157#3967013 (10BBlack) [13:15:30] 10Traffic, 10Operations, 10ops-eqsin: dns5002 mgmt console unreachable - https://phabricator.wikimedia.org/T186902#3967014 (10BBlack) [13:15:34] 10Traffic, 10Operations, 10ops-eqsin: rack/setup/install lvs500[123] - https://phabricator.wikimedia.org/T182171#3967015 (10BBlack) [13:15:47] 10Traffic, 10Operations, 10ops-eqsin: rack/setup scs-eqsin.mgmt.eqsin.wmnet - https://phabricator.wikimedia.org/T181569#3967016 (10BBlack) [13:16:08] 10Traffic, 10Operations, 10ops-eqsin: rack/setup/install cp50(0[1-9]|1[0-2]) - https://phabricator.wikimedia.org/T181557#3967017 (10BBlack) [13:16:18] 10Traffic, 10Operations, 10ops-eqsin: rack/setup/install dns500[12] - https://phabricator.wikimedia.org/T181556#3967018 (10BBlack) [13:16:34] 10Traffic, 10Operations, 10ops-eqsin: rack/setup/install bast5001 - https://phabricator.wikimedia.org/T181554#3967022 (10BBlack) [13:16:46] 10Traffic, 10netops, 10Operations, 10ops-eqsin: Setup eqsin RIPE Atlas anchor - https://phabricator.wikimedia.org/T179042#3967023 (10BBlack) [13:17:00] 10Traffic, 10DC-Ops, 10Operations, 10ops-eqsin: singapore caching center: eqiad staging tracking task - https://phabricator.wikimedia.org/T166179#3967024 (10BBlack) [13:23:15] 10Traffic, 10Operations, 10ops-eqsin: rack/setup scs-eqsin.mgmt.eqsin.wmnet - https://phabricator.wikimedia.org/T181569#3794643 (10BBlack) >>! In T181569#3901023, @ayounsi wrote: > The other atlas don't seem to be connected to a scs so I can't compare. Do we even have a use for scs on atlas, if it's not use... [13:27:09] 10Traffic, 10Operations, 10ops-eqsin: rack/setup/install lvs500[123] - https://phabricator.wikimedia.org/T182171#3967086 (10BBlack) [13:27:11] 10Traffic, 10Operations, 10ops-eqsin: rack/setup/install cp50(0[1-9]|1[0-2]) - https://phabricator.wikimedia.org/T181557#3967087 (10BBlack) [14:40:42] 10Traffic, 10Operations, 10Page-Previews, 10RESTBase, and 2 others: Cached page previews not shown when refreshed - https://phabricator.wikimedia.org/T184534#3967535 (10BBlack) Well, the above solution would still leave you with your short-age problem, if you didn't also zero out the Age. Do we want to
[15:06:21] 10Traffic, 10Operations, 10ops-ulsfo, 10Patch-For-Review: decommission lvs400[1-4].ulsfo.wmnet - https://phabricator.wikimedia.org/T178535#3967622 (10BBlack) a:05BBlack>03RobH These have been spared-out for a while now and we're fine on the new ones, please kill. [15:09:25] 10Traffic, 10Operations, 10Patch-For-Review: Refactor varnish puppet config - https://phabricator.wikimedia.org/T96847#3967627 (10BBlack) [15:09:26] 10Traffic, 10Operations, 10Patch-For-Review: Create globally-unique varnish cache cluster port/instancename mappings - https://phabricator.wikimedia.org/T119396#3967625 (10BBlack) 05Open>03declined We did some interesting refactoring here that helped with some other related issues, but the main goal here... [15:10:57] 10Traffic, 10Operations: cache_upload: uncompressed images with Content-Encoding: gzip cause content decoding issues - https://phabricator.wikimedia.org/T148830#3967629 (10BBlack) 05Open>03Resolved AFAIK with the resolution of T162035 we haven't had further reports. [16:16:12] 10Traffic, 10netops, 10Operations, 10ops-eqsin: cp5010 - no link on primary ethernet port - https://phabricator.wikimedia.org/T187158#3967940 (10ayounsi) I also double-checked the connection tracking spreadsheet and that should be the proper port. Switch says > error: device xe-1/0/4 not found And the tran... [16:42:14] 10Traffic, 10Operations, 10ops-eqsin: rack/setup scs-eqsin.mgmt.eqsin.wmnet - https://phabricator.wikimedia.org/T181569#3968028 (10ayounsi) It's a nice to have, just in case, but clearly not a blocker. [17:16:06] Feb 13 17:10:53 cp4031 systemd[1]: Started varnish (Varnish HTTP Accelerator). [17:16:16] Feb 13 17:08:54 cp4031 systemd[1]: Stopped varnish (Varnish HTTP Accelerator) [17:16:36] so icinga picked it up presumably because there was a 2-minute gap there (while depooled), which I think sounds a little longer than we expect [17:17:24] all that happens in the script between the systemd-level stop->start is: "rm -f /srv/sd*/varnish*; sync; sleep 10" [17:19:12] yeah [17:19:23] for comparison, cp4023's restart: [17:19:31] Jan 14 14:16:36 cp4023 systemd[1]: Stopping varnish (Varnish HTTP Accelerator)... [17:19:37] Jan 14 14:17:32 cp4023 systemd[1]: Started varnish (Varnish HTTP Accelerator). [17:19:58] that one took a bit less than a minute [17:20:02] I guess the wipe/sync got slow, I mean there's no absolute expectation there [17:20:56] in any case, so long as it was depooled it's not a huge issue, so long as we don't end up spamming icinga alerts to IRC all the time (if this is e.g.
common post-v5-upgrade, we might want to tweak icinga alerting) [17:21:27] ideally, we'd disable icinga while the host is depooled and re-enable it upon repool :) [17:21:45] (or someone could take on the amazing task of linking up icinga + etcd, so when we define a monitor we can define a matching etcd depool state that temporarily suppresses it) [17:21:50] heh [17:22:10] technically hosts don't get depooled, but services do [17:22:16] (would've worked in this particular case) [17:22:24] yeah [17:24:21] some grepping under /var/log/icinga/ seems to suggest that this was an isolated case [17:32:12] so I was starting to look at how our backend restart cron_splay stuff works (re: mysterious 160s thing) [17:32:39] to recap intentions from the code comments: [17:32:41] The idea here is to ensure each host in the set executes the cron once per time [17:32:45] period, and also ensure the time between hosts is consistent (no edge cases [17:32:48] much closer than the average) by splaying them as evenly as possible with [17:32:51] rounding errors. For the case of hosts with NNNN numbers indicating the [17:32:54] datacenter in the first digit, we also maximize the period between any two [17:32:57] hosts in a given datacenter by interleaving sorted per-DC lists of hosts before [17:33:00] splaying. [17:33:44] so the idea there is if you say "weekly", there are 10080 minutes in a week. and if your global cluster list is cp1001, cp1002, cp2001, cp2002, cp3001, cp3002 [17:34:25] it should interleave them something like cp1001, cp2001, cp3001, cp1002, cp2002, cp3002 [17:35:01] there's also some persistent randomization of the order within a DC, so it might well be e.g. cp1002, cp2001, cp3001, cp1001, cp2002, cp3002 [17:35:28] and then spread those out to have, in this case, 10080/6 = 1680 minutes between each restart [17:35:51] which seems like a fairly legit strategy for avoiding timing worries of various kinds, all things considered. [17:36:34] but then I go look at just text@eqiad (which has 8 of the 35 global text cluster nodes) [17:37:27] 6/8 of text@eqiad are all at exactly 02:44 of some day, using days 0, 1, 2, 3, 4, 5. [17:37:41] which seems like a very odd coincidence [17:38:03] maybe cron_splay is buggy, or maybe we just have some bad coincidences between cluster node counts and minutes-per-day/week. [17:40:06] <_joe_> occam's razor? [17:40:19] <_joe_> I mean it could be after all [17:40:33] yeah I think it's just unrealized coincidences [17:40:36] <_joe_> in some ways it's nice to have machines at the same time on different days [17:41:00] 10080 minutes per week is evenly divisible by a lot of related time-divisions and cluster counts [17:41:17] the pattern is apparently common, though [17:41:39] e.g. text@esams has 6/8 of its nodes also aligned on the same time on different days [17:42:16] when looking at a daily fault pattern it makes some things hard, though. [17:42:27] (the fault doesn't directly align with any of the crons FWIW) [17:42:47] (but could be related to them in time, e.g. X hours after Y and Z restart shortly apart) [17:43:44] oh, right, so we actually have 36 nodes because cp1008 gets counted in this list for this purpose [17:43:47] that makes more sense [17:44:10] 10080/36 = 280 [17:44:42] which means every 25 minutes if it's cycling through 5x interleaved datacenters [17:45:07] I donno, need to stare more [17:45:32] (not 25 minutes, more like 23 hours) [17:51:03] it is probably still too early to draw conclusions about the probabilistic ttl variation patch [17:51:24] ?
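A toy reconstruction of the cron_splay behaviour walked through above (stably shuffle each DC's hosts, interleave the per-DC lists, then splay evenly over the period), based only on the code comments quoted in the log; the md5-based seeding and the crontab formatting are assumptions, not the actual puppet function:

```python
"""Toy cron_splay: interleave stably-shuffled per-DC lists, splay evenly."""
import hashlib
from itertools import zip_longest

MINS_PER_WEEK = 10080

def cron_splay(hosts, seed=''):
    # group hosts by DC, taken from the first digit of the NNNN number
    by_dc = {}
    for host in hosts:
        digits = ''.join(c for c in host if c.isdigit())
        by_dc.setdefault(digits[0], []).append(host)
    # persistent pseudo-random order within each DC (stable across runs)
    for dc in by_dc:
        by_dc[dc].sort(key=lambda h: hashlib.md5((seed + h).encode()).hexdigest())
    # interleave: one host from each DC in turn
    interleaved = [h for round_ in zip_longest(*by_dc.values())
                   for h in round_ if h is not None]
    # splay: spread the interleaved list evenly over the week
    step = MINS_PER_WEEK // len(interleaved)
    return {h: i * step for i, h in enumerate(interleaved)}

# the 6-host example from the log: 10080/6 = 1680 minutes apart
for host, m in sorted(cron_splay(['cp1001', 'cp1002', 'cp2001',
                                  'cp2002', 'cp3001', 'cp3002']).items(),
                      key=lambda kv: kv[1]):
    print('%s: day %d at %02d:%02d' % (host, m // 1440, m % 1440 // 60, m % 60))
```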
[17:51:34] the "swizzle" thing :) [17:51:36] I'd have thought it'd have had impact by now [17:51:46] I'm kinda meh on its impact so far on the evidence [17:52:07] but I guess in theory, it could take a full restart cycle (week) to have the fullest effect on various stampedes on storage [17:52:54] right, so end of this week give or take? [17:53:56] we merged it on the 8th (Thursday) [17:56:04] right [17:56:26] but I really would've thought we'd see most of the impact (if any) on this particular issue within a day or two [17:56:36] but that pattern keeps recurring, unaffected [18:00:25] it would be really interesting to have stats about "successful" purges and be able to separate those from evictions/ttl+keep expirations [18:03:28] yeah, it's tricky IIRC because we may send the same status code either way [18:03:33] I think just the text after the status code changes [18:09:14] maybe we used to anyways [18:09:15] 10Traffic, 10Maps-Sprint, 10Operations: Decide on Cache-Control headers for map tiles - https://phabricator.wikimedia.org/T186732#3968439 (10Gehel) [18:09:23] seems like with v4 it's unconditional "204 Purged"? [18:09:38] I wonder if vcl_purge has access to any data that can tell it hit/miss [18:10:32] - RespStatus 204 [18:10:32] - RespReason No Content [18:10:32] - RespReason Purged [18:10:45] so yeah, '204 Purged' is the final one [18:11:39] the No Content is just the defaulting from the 204 I think [18:11:57] I don't think there's any call-path changes on hit or miss for purge [18:12:04] it's just vcl_recv->vcl_hash->vcl_purge [18:12:32] so the only place we might be able to make a difference is if vcl_purge can observe something dumb like obj.hits or similar [18:13:06] https://varnish-cache.org/lists/pipermail/varnish-misc/2015-March/thread.html [18:17:10] interesting how this poor soul was using inline C before v4 for this? https://varnish-cache.org/lists/pipermail/varnish-misc/2015-March/024324.html [18:17:37] why? :) [18:18:26] "What do you use it for? Why do you care if the purge call purged something?" [18:18:30] yeah [18:18:34] lol - to debug your performance issues :P [18:19:14] we had similar without inline C before, and ignored the Vary-issue [18:20:15] who knows what the reason was to stop going through vcl_hit/miss from v3 to v4 [18:21:47] if I had to guess, I'd say they'd do that to enable the possibility of deferring the actual purge work for perf [18:21:56] which they may or may not have taken advantage of in the end [18:22:38] (by using a ban-like optimization, they could return "purge done" accurately very quickly by storing a timestamp and purge info, and then some other code could asynchronously actually execute the purge, in the style of a ban) [18:23:18] hey, there's MAIN.n_obj_purged though [18:24:04] MAIN.n_purges 379380341 . Number of purge operations executed [18:24:07] MAIN.n_obj_purged 2586573 . Number of purged objects [18:24:55] oh interesting, do we track that in varnish-machine-stats? [18:25:02] now if the former is the number of PURGE requests and the latter is the number of objects actually purged, we'd have some interesting data here :) [18:25:22] also, against my better judgement, I peeked at HSH_Purge() [18:25:54] it seems that when it's iterating the vary-slot objects within the object head to purge them all, if any of them is "busy" it just skips them and doesn't bother purging? :P [18:26:01] we don't plot it yet, let me do that [18:26:12] what? [18:26:22] VTAILQ_FOREACH(oc, &oh->objcs, hsh_list) { [18:26:25] [...]
[18:26:31] if (oc->flags & OC_F_BUSY) { [18:26:31] /* [18:26:31] * We cannot purge busy objects here, because [18:26:34] * their owners have special rights to them, [18:26:37] * and may nuke them without concern for the [18:26:40] * refcount, which by definition always must [18:26:43] * be one, so they don't check. [18:26:46] */ [18:26:48] continue; [18:27:07] I think at least one common case of OC_F_BUSY would be inserting a new object [18:27:16] I'm not sure if there are others on older objects? [18:27:30] if it's just insertion, then basically it's a purge-vs-insert race and insert wins [18:28:54] hmm yes, simple code search seems to indicate new objects that are still in the process of inserting are the only ones with OC_F_BUSY [18:29:55] ema: we have the stat logged, just not grafana'd? [18:30:09] bblack: yup, adding it now [18:30:13] neat [18:30:27] I gotta run, dr appt in a bit, I'll peek at it later and see how it lines up [18:31:42] ok [18:31:45] https://grafana.wikimedia.org/dashboard/db/varnish-machine-stats?panelId=67&fullscreen&orgId=1&var-server=cp4023&var-datasource=ulsfo%20prometheus%2Fops [19:54:21] 10Traffic, 10Analytics, 10Operations, 10Research, and 6 others: Referrer policy for browsers which only support the old spec - https://phabricator.wikimedia.org/T180921#3968921 (10Nuria) As of me checking today referral groups are mostly unchanged from the graphs i pasted above. That is, safari sessions ap... [20:05:31] 10Traffic, 10Analytics, 10Operations, 10Research, and 6 others: Referrer policy for browsers which only support the old spec - https://phabricator.wikimedia.org/T180921#3968983 (10Tgr) If nothing seems more broken than before that's good enough for me :) The main goal was to prevent Edge and Safari from se... [21:43:35] 10Traffic, 10Operations, 10Ops-Access-Requests, 10Patch-For-Review: Ops Onboarding for Valentín Gutiérrez - https://phabricator.wikimedia.org/T187035#3969465 (10Dzahn) added to "WMF-NDA" in Phabricator https://phabricator.wikimedia.org/project/members/61/ This should let you see all the private tickets. [22:01:24] 10Traffic, 10Operations, 10Ops-Access-Requests, 10Patch-For-Review: Ops Onboarding for Valentín Gutiérrez - https://phabricator.wikimedia.org/T187035#3969497 (10Dzahn) subscribed you to both ops mailing lists (others like wikitech-l are optional and self-service) https://lists.wikimedia.org/mailman/listi... [22:04:58] 10Traffic, 10Operations, 10Ops-Access-Requests, 10Patch-For-Review: Ops Onboarding for Valentín Gutiérrez - https://phabricator.wikimedia.org/T187035#3969513 (10Dzahn) please take a look at https://office.wikimedia.org/wiki/Office_IT/Calendars#Human_calendars and check that you can see the "Ops Maintena... [22:06:11] 10Traffic, 10Operations, 10Ops-Access-Requests, 10Patch-For-Review: Ops Onboarding for Valentín Gutiérrez - https://phabricator.wikimedia.org/T187035#3969514 (10Dzahn) @Robh could you do one more Racktables user? thanks! [22:09:23] 10Traffic, 10Operations, 10Ops-Access-Requests, 10Patch-For-Review: Ops Onboarding for Valentín Gutiérrez - https://phabricator.wikimedia.org/T187035#3969518 (10Dzahn) [22:23:09] 10Traffic, 10Operations, 10Ops-Access-Requests, 10Patch-For-Review: Ops Onboarding for Valentín Gutiérrez - https://phabricator.wikimedia.org/T187035#3962498 (10Dzahn)
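Following up on the n_purges vs. n_obj_purged idea above: a quick sketch of pulling both counters out of varnishstat's JSON output and computing objects-purged-per-operation, the "interesting data" mentioned in the log. The flat JSON layout (counter names as top-level keys with a "value" field) matches varnish 4/5; treat the exact shape as an assumption to verify:

```python
"""Sketch: ratio of objects actually purged to purge ops, via varnishstat -j."""
import json
import subprocess

stats = json.loads(subprocess.check_output(['varnishstat', '-j']))

n_purges = stats['MAIN.n_purges']['value']    # purge operations executed
n_obj = stats['MAIN.n_obj_purged']['value']   # objects actually purged

print('purge ops:       %d' % n_purges)
print('objects purged:  %d' % n_obj)
if n_purges:
    # with the figures pasted in the log: 2586573 / 379380341 ~= 0.007
    print('objects per op:  %.4f' % (float(n_obj) / n_purges))
```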