[06:14:22] <wikibugs>	 Traffic, Operations, Wikimedia-General-or-Unknown, User-DannyS712: Pages whose title ends with semicolon (;) are intermittently inaccessible - https://phabricator.wikimedia.org/T238285 (Billinghurst) >>! In T238285#5821312, @MusikAnimal wrote: > I ran into this when I was unable to block https://...
[06:58:46] <wikibugs>	 Traffic, Core Platform Team, Operations: Move wikitech purges to kafka - https://phabricator.wikimedia.org/T254828 (ema) > @ema please confirm if we need to keep using HTCP or we can switch to kafka for these caches. In case kafka needs to be used, we need to enable EventBus on wikitech, which would...
[06:58:54] <wikibugs>	 Traffic, Core Platform Team, Operations: Move wikitech purges to kafka - https://phabricator.wikimedia.org/T254828 (ema) p:Triage→Medium
[07:10:54] <wikibugs>	 Traffic, Core Platform Team, Operations: Configure purged in deployment-prep - https://phabricator.wikimedia.org/T254844 (ema)
[07:11:15] <wikibugs>	 Traffic, Core Platform Team, Operations: Configure purged in deployment-prep - https://phabricator.wikimedia.org/T254844 (ema) p:Triage→Medium
[07:14:59] <wikibugs>	 Traffic, Core Platform Team, Operations: Configure purged in deployment-prep - https://phabricator.wikimedia.org/T254844 (ema) purged is now running in deployment-prep instead of vhtcpd:  ` ema@deployment-cache-text06:~$ systemctl status purged.service  ● purged.service - Purger for ATS and Varnish...
[07:42:29] <wikibugs>	 Traffic, Core Platform Team, Operations: Configure purged in deployment-prep - https://phabricator.wikimedia.org/T254844 (Aklapper)
[08:46:01] <jayme>	 o/ I would like to https://wikitech.wikimedia.org/wiki/LVS#Configure_the_load_balancers for https://gerrit.wikimedia.org/r/c/operations/puppet/+/603437 - wiki says I should check here if it's "a good time for PyBal restart". Can someone help me out with this?
[09:08:46] <jayme>	 vgutierrez: maybe (EU timezone)
[09:22:02] <ema>	 jayme: hey!
[09:22:29] <ema>	 jayme: I'm around, let's see
[09:23:54] <jayme>	 ema: cool, thanks!
[09:25:04] <ema>	 jayme: alright so your service is in class "low-traffic", which as you can see from modules/lvs/manifests/configuration.pp is handled by lvs1015 and lvs1016 in eqiad, lvs2009 and lvs2010 in codfw
[09:26:32] * jayme nods
[09:27:59] <ema>	 also as you can see from $lvs_class, lvs1015 and lvs2009 are the primaries (those actually handling traffic)
[09:27:59] <jayme>	 ema: are the higher numbered ones the backups?
[09:28:38] <jayme>	 ah :)
[09:28:52] <jayme>	 (so "yes in this case")
[09:30:31] <ema>	 correct, you can always double-check when in doubt by looking at the value of bgp-med with: sudo cumin 'lvs[1015-1016].eqiad.wmnet' 'grep ^bgp-med /etc/pybal/pybal.conf' 
[09:31:15] <ema>	 lower values are preferred: the primary has bgp-med 0, the backup has 100 
[09:31:37] <ema>	 for a triple sanity check there's https://grafana.wikimedia.org/d/000000343/load-balancers-lvs?orgId=1 too :)
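A minimal sketch of the bgp-med comparison described above, assuming the per-host grep results have already been collected into a dict. The host names come from the log; the `bgp-med = N` line format, the sample values, and the helper names `parse_bgp_med` and `classify_lvs` are assumptions for illustration, not part of PyBal or cumin.

```python
import re


def parse_bgp_med(conf_line):
    """Extract the integer from a line such as 'bgp-med = 100' (assumed format)."""
    match = re.match(r"^bgp-med\s*=\s*(\d+)\s*$", conf_line.strip())
    if match is None:
        raise ValueError(f"not a bgp-med line: {conf_line!r}")
    return int(match.group(1))


def classify_lvs(grep_output):
    """Map each host to 'primary' or 'backup'; the lowest bgp-med is preferred."""
    meds = {host: parse_bgp_med(line) for host, line in grep_output.items()}
    lowest = min(meds.values())
    return {
        host: ("primary" if med == lowest else "backup")
        for host, med in meds.items()
    }


# Hypothetical grep results for the two eqiad low-traffic hosts.
sample = {
    "lvs1015.eqiad.wmnet": "bgp-med = 0",
    "lvs1016.eqiad.wmnet": "bgp-med = 100",
}
roles = classify_lvs(sample)
print(roles)
```

Comparing against the lowest value rather than hard-coding 0 keeps the sketch usable if the configured values ever change.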
[09:33:55] <ema>	 jayme: what you should do at this point is run pcc against the 4 hosts and see if the change looks reasonable
[09:34:16] <ema>	 something like this:
[09:34:16] <ema>	 ./utils/pcc 603437 lvs1015.eqiad.wmnet,lvs1016.eqiad.wmnet,lvs2009.codfw.wmnet,lvs2010.codfw.wmnet,lvs1013.eqiad.wmnet
[09:34:38] <ema>	 I've added lvs1013 to the mix just for kicks
[09:35:54] <jayme>	 haha :)
[09:36:08] <jayme>	 ema: pcc is run on puppetmaster I guess?
[09:37:02] <ema>	 nope, from your workstation
[09:37:31] <ema>	 I've done it for now so that we don't slow down unnecessarily to configure it, can deal with that later
[09:37:35] <ema>	 https://puppet-compiler.wmflabs.org/compiler1002/23086/
[09:38:15] <jayme>	 ah, sorry...I already did that once during onboarding but forgot about it again
[09:38:34] <ema>	 unacceptable
[09:39:01] <ema>	 as a punishment you will be assigned a random traffic bug
[09:39:24] <ema>	 jayme: does the pcc diff above seem sane to you?
[09:39:32] <volans>	 ema: pick something simple, like adding support for hit for pass to ATS
[09:39:50] <volans>	 ;)
[09:43:04] <ema>	 jayme: TCP port looks good to me, the check returns 200 according to `curl -v -k https://kubernetes1001.eqiad.wmnet:4004/_info`
[09:43:05] <jayme>	 ema: let's say it looks somewhat sane (taking into consideration that I do not exactly know what to expect) :)
[09:43:34] <jayme>	 ema: yeah. That part I've checked ofc
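The `curl -v -k` check mentioned above could be scripted roughly as follows. This is only a sketch: port 4004 and the `/_info` path come from the log, while the helper names `info_url`, `is_healthy` and `probe` are hypothetical. As with `curl -k`, certificate verification is deliberately disabled.

```python
import ssl
import urllib.error
import urllib.request


def info_url(host, port=4004, path="/_info"):
    """Build the health-check URL for a backend host."""
    return f"https://{host}:{port}{path}"


def is_healthy(status_code):
    """Only an HTTP 200 counts as a passing check in this sketch."""
    return status_code == 200


def probe(host, timeout=5.0):
    """Fetch the _info endpoint without verifying the certificate (like curl -k)."""
    context = ssl.create_default_context()
    context.check_hostname = False
    context.verify_mode = ssl.CERT_NONE
    try:
        with urllib.request.urlopen(
            info_url(host), timeout=timeout, context=context
        ) as response:
            return is_healthy(response.status)
    except urllib.error.HTTPError as err:
        # urlopen raises for 4xx/5xx responses; treat them as failed checks.
        return is_healthy(err.code)


print(info_url("kubernetes1001.eqiad.wmnet"))
```

`probe()` is not called here because it needs network access to the host; connection errors other than HTTP error responses would propagate to the caller.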
[09:44:16] <jayme>	 (bit hard to argue about the "Class[Pybal::Configuration]" diffs, though)
[09:44:17] <ema>	 +1, let's merge
[09:44:33] <ema>	 ah no luckily we don't have to argue about that one
[09:46:28] <jayme>	 ema: okay, merged
[09:47:26] <ema>	 ack, now follow point 2. (run puppet on the LVSs)
[09:48:07] <jayme>	 wilco
[09:54:29] <jayme>	 ema: looks fine (besides an unexpected change to /etc/ssh/ssh_known_hosts updating db1141.eqiad.wmnet key - but that's 06a47f1cc4f45ab1d701f072e090485231010cb8 I guess)
[09:54:37] <ema>	 yeah
[09:54:53] <ema>	 ok now point (3): systemctl restart pybal on lvs1016 and lvs2010 -- the two backups. Make sure to !log what you're doing on #wikimedia-operations including the task number
[09:57:10] <ema>	 there are some (expected) icinga errors due to the discrepancies between the number of services in etcd and those actually known to ipvs, I'm going to ack them
[09:57:27] <jayme>	 thanks
[09:57:52] <ema>	 we should probably add a note about this to https://wikitech.wikimedia.org/wiki/LVS#Configure_the_load_balancers  
[09:59:42] <ema>	 the alerts related to the backup LVSs are recovering as expected
[10:00:18] <ema>	 (by restarting PyBal you have fixed the discrepancy between what IPVS knows and what etcd knows about the services)
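The discrepancy behind those alerts can be pictured as a set difference between the services defined in etcd and those currently known to IPVS. The sketch below is not the actual Icinga check: the function name `service_discrepancy` and the sample service lists are assumptions for illustration.

```python
def service_discrepancy(etcd_services, ipvs_services):
    """Return services present on one side but not the other."""
    etcd_set = set(etcd_services)
    ipvs_set = set(ipvs_services)
    return {
        "missing_in_ipvs": sorted(etcd_set - ipvs_set),
        "stale_in_ipvs": sorted(ipvs_set - etcd_set),
    }


# Hypothetical state before the PyBal restart: the new service is in etcd only.
before = service_discrepancy(["svc-a", "svc-b", "termbox"], ["svc-a", "svc-b"])
# Hypothetical state after the restart: both sides agree.
after = service_discrepancy(["svc-a", "svc-b", "termbox"], ["svc-a", "svc-b", "termbox"])
print(before)
print(after)
```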
[10:00:37] <jayme>	 okay so this is because PyBal makes the changes on restaaaa... okay :)
[10:01:39] <ema>	 right
[10:05:49] <ema>	 jayme: alright, now the primaries
[10:06:01] <jayme>	 So, something like 'curl -v -k https://termbox.svc.eqiad.wmnet:4004/_info' still fails
[10:06:34] <jayme>	 (following the wiki) ... but I probably would need to make sure to use a secondary there
[10:09:33] <ema>	 I don't think point 4. is correct, we haven't restarted the primaries yet so that curl cannot work yet
[10:10:33] <jayme>	 Yeah...what about this magic 120 second wait?
[10:11:01] <ema>	 it's probably there mostly to give us enough time to knock on wood
[10:12:08] <jayme>	 okay. I'll do the primaries then
[10:12:17] <ema>	 ack
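The restart order followed in this walkthrough (backups first, primaries only afterwards) can be derived from the bgp-med values. A rough sketch, assuming the values 0 and 100 quoted earlier apply to all four hosts; the helper name `restart_batches` is hypothetical.

```python
def restart_batches(bgp_med_by_host):
    """Return two batches of hosts: backups first, then primaries.

    Primaries are the hosts with the lowest (most preferred) bgp-med.
    """
    lowest = min(bgp_med_by_host.values())
    backups = sorted(h for h, med in bgp_med_by_host.items() if med != lowest)
    primaries = sorted(h for h, med in bgp_med_by_host.items() if med == lowest)
    return [backups, primaries]


# Assumed bgp-med values for the four low-traffic hosts named in the log.
batches = restart_batches({
    "lvs1015": 0,
    "lvs1016": 100,
    "lvs2009": 0,
    "lvs2010": 100,
})
print(batches)
```

Restarting the backups first means the hosts actually handling traffic are only touched once the change has been seen to work on the standby hosts.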
[10:13:46] <XioNoX>	 ema: hahahah, this should be saved in some IRC quotes archives
[10:15:10] <ema>	 :)
[10:15:13] <ema>	 jayme: nice, https://termbox.svc.eqiad.wmnet:4004/_info works
[10:15:42] <ema>	 and codfw too
[10:15:48] <volans>	 XioNoX: there is bash ;)
[10:15:55] <jayme>	 yay
[10:16:29] <ema>	 jayme: and wikipedia is still up. Congratulations!
[10:17:33] <jayme>	 Thanks for guiding me through :)
[10:18:25] <ema>	 jayme: TODO at this point is: replacing point 4. with something like "check the output of `sudo ipvsadm -L` and look for the newly added service", and point 5. with like "wait 120s by looking at https://icinga.wikimedia.org/alerts"
[10:18:34] <ema>	 s/by looking/while looking/ 
[10:19:16] <ema>	 then move the check with curl to point 7. -- stressing that the check needs to be run from another host, not the LVS 
[10:19:24] <jayme>	 ema: yeah .. and something about ack'ing the alerts (or preventing them from firing?)
[10:19:59] <ema>	 good point, yes. Right after point 2. "run puppet", we need something like "ACK the alerts" :)
[10:20:18] <jayme>	 Did you just do that via icinga web?
[10:20:22] <ema>	 indeed
[10:21:00] <ema>	 we could perhaps also think about increasing the check timeout and spare the poor operator a manual step
[10:22:19] <ema>	 but in any case, it could happen that a careful admin spends an amount of time gt timeout before restarting pybal, so worth mentioning the alerts on wikitech
[10:23:31] <ema>	 oh, and you should configure pcc! :)
[10:24:53] <jayme>	 Did so already. Just forgot about its existence because I managed to dodge puppet for a while
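The proposed replacement for point 4. (look for the newly added service in the `ipvsadm` output) could be automated along these lines. This is a sketch under assumptions: it expects numeric output as printed by `sudo ipvsadm -L -n`, the sample text uses made-up documentation addresses, and the helper names `virtual_services` and `has_service` are hypothetical.

```python
def virtual_services(ipvsadm_output):
    """Collect (protocol, address:port) pairs from virtual-service lines."""
    services = []
    for line in ipvsadm_output.splitlines():
        fields = line.split()
        # Virtual services start with the protocol; real servers start with '->'.
        if len(fields) >= 2 and fields[0] in ("TCP", "UDP"):
            services.append((fields[0], fields[1]))
    return services


def has_service(ipvsadm_output, port):
    """True if any virtual service listens on the given port."""
    return any(
        address.rsplit(":", 1)[-1] == str(port)
        for _, address in virtual_services(ipvsadm_output)
    )


# Made-up sample output; the addresses are from the documentation range.
SAMPLE = """\
IP Virtual Server version 1.2.1 (size=4096)
Prot LocalAddress:Port Scheduler Flags
  -> RemoteAddress:Port           Forward Weight ActiveConn InActConn
TCP  192.0.2.10:4004 wrr
  -> 192.0.2.21:4004              Route   10     0          0
"""

print(virtual_services(SAMPLE))
print(has_service(SAMPLE, 4004))
```

Without `-n`, ipvsadm may print host and service names instead of numbers, in which case matching on the port number would not work.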
[13:51:39] <wikibugs>	 Traffic, Core Platform Team, Operations: Configure purged in deployment-prep - https://phabricator.wikimedia.org/T254844 (Pchelolo) > @Pchelolo: let me know which kafka topics we should read from in deployment-prep!   @ema - it would be `eqiad.resource-purge`
[14:06:08] <wikibugs>	 Traffic, Analytics, Analytics-Kanban, EventStreams, and 2 others: EventStreams drops the connection after 15 minutes, which makes it unreliable - https://phabricator.wikimedia.org/T242767 (BBlack) Looking at that other ticket T250912 - would an in-band service ping or NOP event of some kind addre...
[14:16:20] <vgutierrez>	 _joe_: it looks like Id0986258178e60b8931ab58feaf94439c78300ab is breaking our beloved ATS instances on labs
[14:16:34] <vgutierrez>	 what would be a proper value for etcd::autogen_pwd_seed on those instances?
[14:16:41] <_joe_>	 "lol"
[14:16:54] <_joe_>	 also it should be in labs/private
[14:17:02] <_joe_>	 I don't think they actually talk to etcd
[14:17:14] <_joe_>	 so instead just define the etcd_user as "root"
[14:17:19] <_joe_>	 and be done with it
[14:17:59] <vgutierrez>	 well.. etcd::autogen_pwd_seed is being accessed by lookup()
[14:18:10] <vgutierrez>	 and lookup() breaks the compilation if there is no value
[14:36:33] <_joe_>	 so it should be in labs/private, if it's not I messed something up
[14:36:35] <_joe_>	 lemme check
[14:39:02] <_joe_>	 Code/WMF/labs/private (master=)$ git grep autogen_pwd_seed
[14:39:04] <_joe_>	 hieradata/common/etcd.yaml:etcd::autogen_pwd_seed: "21}@/"
[14:39:05] <_joe_>	 it's there....
[14:39:36] <_joe_>	 I suspect beta is somehow broken in new creative ways
[14:48:58] <vgutierrez>	 hmmm
[14:49:39] <vgutierrez>	 yeah.. we have it on our local checkout as well
[14:49:50] <vgutierrez>	 vgutierrez@traffic-puppetmaster-buster:/var/lib/git/labs/private$ cat hieradata/common/etcd.yaml
[14:49:50] <vgutierrez>	 etcd::autogen_pwd_seed: "21}@/"
[14:50:07] <vgutierrez>	 so wtf
[14:55:55] <vgutierrez>	 andrewbogott: are you familiar with this kind of issue?
[14:56:56] <andrewbogott>	 vgutierrez: local patches to /labs/private were lost last week.  In some cases we were able to recover the original values and in some cases not
[14:57:04] <andrewbogott>	 I think for that one I may have inserted a new dummy value
[14:57:32] <andrewbogott>	 if you know a more correct seed value you can update that patch as needed
[14:57:37] <andrewbogott>	 (lmk if that's unclear)
[14:58:01] <andrewbogott>	 some context at https://phabricator.wikimedia.org/T254491
[15:40:21] <wikibugs>	 netops, Operations, vm-requests, Patch-For-Review: esams,ulsfo,eqsin: one VM request each for install_servers - https://phabricator.wikimedia.org/T254157 (akosiaris) I just had a quick look into the 3 PoP ganeti clusters and it seems they aren't ready to serve public IPs VMs. /etc/network/interfa...
[15:50:54] <wikibugs>	 netops, Operations, vm-requests, Patch-For-Review: esams,ulsfo,eqsin: one VM request each for install_servers - https://phabricator.wikimedia.org/T254157 (ayounsi) Sure I can do it, but do they need internet access? DHCP/TFTP shouldn't need internet access afaik? Are there other services running...
[15:54:21] <wikibugs>	 netops, Operations, vm-requests, Patch-For-Review: esams,ulsfo,eqsin: one VM request each for install_servers - https://phabricator.wikimedia.org/T254157 (BBlack) @Ayounsi - Yes, we're going to have some outbound recursive DNS needs from some ganeti-hosted services
[15:59:17] <wikibugs>	 netops, Operations, vm-requests, Patch-For-Review: esams,ulsfo,eqsin: one VM request each for install_servers - https://phabricator.wikimedia.org/T254157 (ayounsi) a:ayounsi
[16:01:56] <wikibugs>	 netops, Operations, vm-requests, Patch-For-Review: esams,ulsfo,eqsin: one VM request each for install_servers - https://phabricator.wikimedia.org/T254157 (Dzahn) >>! In T254157#6206559, @ayounsi wrote: > Sure I can do it, but do they need internet access? DHCP/TFTP shouldn't need internet access...
[18:03:24] <cdanis>	 vgutierrez: ema: do you know if anything presently uses the /check healthcheck URL we define for ATS's healthcheck.so? 
[18:04:29] <vgutierrez>	 anything besides varnish-fe?
[18:04:38] <cdanis>	 yeah (and also curious where in varnishfe that's configured)
[18:05:06] <cdanis>	 ah looks like modules/varnish/templates/vcl/wikimedia-frontend.vcl.erb line 85
[18:05:33] <cdanis>	 something interesting: ATS seems to eat that path /check for any host whatsoever, even ones it doesn't actually 'serve' :)
[18:18:30] <cdanis>	 where does the /monitoring/backend path that gets probed for upload.wikimedia.org come from?  swift?
[18:52:48] <cdanis>	 yeah, from swift
[18:53:02] <cdanis>	 interestingly, you get the 'bug' cache status for hitting the /check path on ats-be, which I hadn't seen before :)
[18:54:01] <wikibugs>	 Traffic, Analytics, Analytics-Kanban, EventStreams, and 2 others: EventStreams drops the connection after 15 minutes, which makes it unreliable - https://phabricator.wikimedia.org/T242767 (Ottomata) Hm, I'm pretty sure the connection is terminated even when there are events being sent.   `  time...
[20:04:04] <wikibugs>	 Traffic, Operations, Core Platform Team Workboards (Clinic Duty Team): Move wikitech purges to kafka - https://phabricator.wikimedia.org/T254828 (Pchelolo)