[08:49:58] 10Traffic, 10Operations, 10Patch-For-Review: Merge cache_misc into cache_text functionally - https://phabricator.wikimedia.org/T164609 (10ema) [08:57:50] soooo shall we talk about removing IPSec from cp to jumbo? :) [08:59:30] elukey: yup! What's to be done? [09:01:19] ema: need to triple check with Valentin but IIUC only removing IPsec, vk should be ok now [09:01:32] elukey: nice. Is there a task somewher? [09:01:41] *somewhere [09:02:07] https://phabricator.wikimedia.org/T182993 [09:03:06] not sure about the procedure to follow though (depool one cp, remove ipsec config, repool, etc..) [09:03:09] ha! I was searching on the traffic board for ipsec / varnishkafka / jumbo to no avail :) [09:05:11] elukey: when merging https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/425550/ I just had to run puppet on the kafka nodes to avoid icinga spam from the strongswan alerts [09:06:15] vk is already using TLS right? We can confirm that no traffic is going through the ipsec tunnel any longer I guess and that should be it [09:06:51] ema: yes it is using TLS, but I thought over the IPsec channel.. if this is not the case, ok :) [09:07:01] I was worried about outstanding tcp conns [09:10:47] yep, vk traffic *is* going through the tunnel [09:12:13] so I'd say: disable puppet on all affected cache hosts, merge the change removing ipsec config, depool one node and enable puppet, repool, repeat [09:13:26] while we are at it we might as well reboot given that we still need to apply the microcode updates to most cache hosts [09:13:47] +1 [09:14:02] let's wait for valentin though, he should be online soon [09:14:09] we can schedule it for tomorrow if you want, and then wait for Brandon/Valentin's green light later on [09:14:26] (or even later on this week, no rush) [09:14:42] ema: there'll soon be an update to 4.9.102 for jessie, if you do reboots we can combine those [09:16:17] moritzm: ok, rough estimate of when $soon is gonna be? [09:17:33] you've my green light elukey, but I'd feel safer with Brandon's approval O:) [09:18:56] <3 [09:19:17] everybody feels safer with Brandon's approval :) [09:20:16] $someone needs to rebuild the 4.9.102 from stretch-proposed-updates for jessie-wikimedia, I was planning to do that on "don't set the cluster on fire Friday", but I can also look into it later the day, so should be availabe tomorrow [09:21:26] moritzm: perfect, tell $someone that there's no need to hurry! [09:22:48] ack! [09:27:59] ema: do you have some time to discuss and give some love to T184715? [09:28:00] T184715: pybal's "can-depool" logic only takes downServers into account - https://phabricator.wikimedia.org/T184715 [09:29:25] apparently pybal caused some troubles: https://wikitech.wikimedia.org/wiki/Incident_documentation/20180626-LoadBalancers [09:38:11] ok so https://gerrit.wikimedia.org/r/#/c/operations/debs/pybal/+/404680/ did not work as expected? We should try to come up with a test case reproducing whatever pybal did when the outage happened [09:38:28] yup [09:38:37] I have in my TODO write a test case for that [09:53:07] pybal logs are now under lvs1016.eqiad.wmnet:~ema/20180626-LoadBalancers_logs for our grepping pleasure [09:53:16] there's a lot of noise there [09:53:48] as long as it's everything there... ¬¬ I'm talking to you journald [09:56:33] vgutierrez: we should be fine. 
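A minimal sketch of the per-host rollout ema describes above (disable puppet on all affected cache hosts, merge the ipsec-removal change, then depool, run puppet, optionally reboot, repool, one node at a time). The host list and the `depool`/`pool` wrapper names are assumptions, not verified interfaces:

```python
import subprocess
import time

# Placeholder list: really "every cache host that still has an ipsec association to kafka-jumbo".
CACHE_HOSTS = ["cp1071.eqiad.wmnet", "cp1072.eqiad.wmnet"]

def ssh(host, *cmd, ok=(0,)):
    # Run one command on the remote host; abort the rollout on an unexpected exit code.
    rc = subprocess.run(("ssh", host) + cmd).returncode
    if rc not in ok:
        raise RuntimeError(f"{host}: {' '.join(cmd)} exited {rc}")

# Assumed to have happened already, outside this loop: puppet disabled on the affected
# cache hosts, the ipsec/strongswan removal merged, and a puppet run on the kafka brokers.
for host in CACHE_HOSTS:
    ssh(host, "sudo", "depool")                                # 'depool'/'pool' wrapper names are assumptions
    ssh(host, "sudo", "puppet", "agent", "--enable")
    ssh(host, "sudo", "puppet", "agent", "--test", ok=(0, 2))  # exit 2 just means "changes were applied"
    # Optionally reboot here to pick up the pending microcode/kernel updates, then wait for the host to return.
    time.sleep(60)                                             # crude settle time before repooling
    ssh(host, "sudo", "pool")
```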
those are from pybal.log.7.gz, not from journalctl [09:56:47] <3 [09:59:32] vgutierrez: we haven't merged/cherry-picked https://gerrit.wikimedia.org/r/#/c/operations/debs/pybal/+/418866/ yet BTW [09:59:53] ema: that should be abandoned already [10:01:05] how so? [10:01:29] https://gerrit.wikimedia.org/r/#/c/operations/debs/pybal/+/420299/ [10:01:33] we went with that [10:02:08] making less noise pybal startup regarding logging [10:02:22] mmh but that doesn't fix the general issue, does it [10:02:23] s/less noise/quieter/g [10:03:22] ema: the default journald rate limiting should be enough.. it doesn't make sense to flood the logs with noise [10:04:48] 10:04:05 ema@lvs1016.eqiad.wmnet:~ [10:04:48] $ sudo journalctl -u systemd-journald.service | grep Suppre [10:04:48] Jun 26 09:32:45 lvs1016 systemd-journald[570]: Suppressed 994 messages from /system.slice/pybal.service [10:05:10] and six more lines like that, all around the time of the outage ^ [10:05:27] yep.. mass depooling / Repooling it's still noisy as hell [10:05:57] IMHO we should work in that.. we should be able to provide meaningful & easy to debug logs [10:06:01] not the mess that we've right now [10:07:50] I agree that pybal logging needs to be worked on, but I also think that disabling journald's rate limiting for pybal is a good idea [10:08:11] it's not that we need to save a few megabytes of disk space on the LVSs :) [10:08:33] yup, that's right [10:09:01] ...and at least you can retrace completely what it did [10:09:02] but journald works in RAM.. and that could be a potential issue [10:09:12] we could summarize it as well, but get less info of course [10:14:44] oh right, by default journald *does* actually store data persistently on disk, but only if /var/log/journal/ exists (it doesn't on the LVSs) [10:15:00] (Storage=auto is the default) [10:15:01] i don't know systemd/journald at all [10:15:16] is there a way to ship it full logging, but it only shows above a certain logging level by default? [10:15:33] it would be nice not to be bogged down with full logging by default, e.g. during an issue [10:15:44] but have the ability to review it all afterwards when needed [10:16:48] [api_80] depool-threshold = .7 [10:17:02] yeah journalctl does allow to filter by priority ranges (journalctl -p) [10:17:11] so that should mean that we should always have online a 70% of the servers? [10:17:24] vgutierrez: yes [10:17:47] but how "of the servers" is defined is a bit loose, does that mean all servers in the pool, just enabled servers, etc [10:18:18] and the reason why pybal originally had servers in the pool but disabled is exactly because of depool threshold [10:18:45] otherwise removing a server completely (as was done by commenting them in a file) achieved the same [10:22:38] hmm there is something in the depool logic that bothers me [10:23:10] a server monitored as down needs to pass that canDepool() logic in order to be depooled [10:24:48] to see if that makes it drop below the depool threshold yes [10:25:29] hmmm so it's better to have a faulty server on the pool that fail to meet the depool threshold? [10:25:57] yeah that is the point of it - it's better to have some (possibly) faulty servers in the pool than... 
too few or none at all [10:26:03] so it's meant to guard against issues like that outage [10:26:46] there's not much point in pybal depooling all servers it thinks are faulty if that means there aren't enough (or any) to serve the traffic we have [10:26:57] at that point we could benefit from some kind of fake backend server that returns 503s IMHO [10:27:08] what do you mean? [10:27:52] oh instead of sending traffic to down servers that likely just timeout, we have a server generating error pages? [10:27:53] right now we don't want to go below the depool treshold because the traffic would overload the remaining servers [10:28:04] mark: yup [10:28:15] well [10:28:22] yeah, that could be a good idea in some cases, [10:28:30] mark: or even a 500, and let the pool rest for a while [10:28:33] but what we've found often is that this feature actually saves us [10:28:51] that there are some issues which are enough to cause pybal to depool stuff but due to this not many servers get depooled and the site stays mostly up [10:28:59] which is still much better than just serving 503s [10:29:09] and what happened last week is a bit different [10:29:16] now it wasn't pybal detecting servers as down [10:29:25] but scap actually depooling servers from pybal (marking them as disabled) [10:29:33] and, supposedly, in the past pybal also protected against that [10:29:46] (which I'm honestly not sure about atm, but _joe_ seems certain) [10:29:57] well.. a % of the traffic is still failing, but IMHO it's better to return a 503 that not return anything at all [10:30:20] yeah but you don't know if it won't return anything at all [10:30:26] it might just be a bit slow, e.g. more than 5s [10:30:35] it's not like pybal sees the complete behaviour of the server [10:30:39] mark: from the report I understand that scap depooling a 90% of the servers is something normal? [10:30:41] it just knows what its own health checks tell it [10:30:59] vgutierrez: well, that's a bug in scap, which supposedly previously was caught by pybal [10:31:07] which is still not a good thing at all, pybal shouldn't need to save us that way [10:31:08] but yeah [10:31:14] from " [10:31:15] 2018-06-26 09:33: It is noticed that the worst-case-scenario behaviour from scap is happening. Scap is abruptly stopped, but by this time, it has removed more than 90% of all servers in both clusters from the pool" [10:31:27] so it depools a 90%.. does some magic and then repools it [10:31:36] that's not normal behavior [10:31:44] "expected" [10:32:20] depool(90%); do_stuff(); repool(90%);.. and someboby hit Ctrl+C after depool(90%) [10:32:30] that's not how it should work no :) [10:32:46] i believe there's just a bug in scap where a restart fails and then the repool doesn't happen [10:32:51] but probably server by server, not all at once [10:33:05] ack [10:33:07] so probably supposed behavior is, in small batches or one by one, depool server, restart hhvm, repool server, continue with next server [10:33:19] but restart hhvm fails, repool doesn't happen, continue with next server does happen? :) [10:33:29] so that's obviously very broken [10:33:38] but supposedly pybal still caught this by not depooling too many servers, in the past [10:35:50] we can reconstruct scap orders by parsing the pybal log... 
and diffing the number of enabled and disabled servers on every block [10:37:36] yes [10:41:29] vgutierrez@lvs1016:~$ fgrep "Could not depool server" 20180626-LoadBalancers_logs |wc -l [10:41:32] vgutierrez@lvs1016:~$ fgrep "Could not depool server" 20180626-LoadBalancers_logs |wc -l [10:41:37] 11 [10:42:06] we've some depooling blocked by pybal [10:42:20] mainly in apaches_80 and api-https_443 [10:43:14] the three services have a 70% (.7) depool threshold configured [10:43:42] interesting... I'll continue after lunch :D [10:44:09] :) [12:40:31] ema / vgutierrez: It looks like i'm going to esams tomorrow [12:40:35] looking at broken cp servers there... [12:40:47] cp3043 has a bad SSD [12:41:10] now, that batch did get the SSDs supplied by Dell, but not part of their standard specification, they came with 60 GB Dell SSDs [12:41:23] and the systems are over 3 years old, so out of support [12:41:35] which means that our chances of getting that SSD replaced by Dell are minimal [12:42:02] the SSDs have 5 year warranty by manufacturer Intel, but also an "expected endurance" of 5 years when doing 10 writes a day :P [12:42:19] which really means, we're on our own [12:42:29] and those Intel S3700 ssds are no longer sold [12:42:49] mark: just to keep it in your radar, there is bast3002 too ;) [12:43:03] volans: i lost track which one is the actually active/used bastion :P [12:43:06] it changed like 3 times over the year [12:43:08] is that it? :P [12:43:14] yup [12:43:23] well.. at least it's the one I'm currently using [12:43:24] ok [12:43:28] then i will look at it [12:43:36] really we need to replace those systems soon [12:44:25] we have quotes in phab I think [12:44:50] the issue is no space to put them until we carry out all the old crap [12:46:45] cp3034 has memory issues... [12:47:08] i could either reseat it or swap with another box [12:47:44] cp3048 (same batch) has CPU1 machine check errors [12:48:06] so we have 3 servers out of the same batch with issues [12:48:16] plus cp3037 which we were able to revive with smarthands [12:52:41] hardware is awesome™ [12:53:35] yeah [12:53:36] true dat! [12:53:40] so what I could do is sacrifice one host [12:53:44] steal its ssds, memory ;p [12:53:56] and then later see if we can still fix it up [12:53:59] while the other 2 are up again [12:56:58] frankenhost! [13:00:10] yes [13:07:10] so for the SSD [13:07:17] the Intel S3700 is no longer sold, but the S3710 is [13:07:21] and it has a 400 GB option [13:07:34] I could try to order and get one delivered by tomorrow [13:07:42] but it would be a different model [13:07:50] alternatively I can steal one from the frankenhost [13:08:04] that's probably a better option this time [13:11:55] mark: nice! at what time do you think you're going to be there approximatively? [13:12:17] probably around noon [13:12:35] i will leave as soon as morning congestion starts to desolve [13:12:41] and then stay until 7-8 pm or so [13:12:53] so until evening congestion starts to desolve ;) [13:13:10] good, I'll be around [13:13:13] thanks :) [13:14:03] so, looks like i'll have to make cp3048 the frankenhost [13:14:39] ok [13:14:46] what SSDs are we using in our newest cp's btw? [13:16:01] i bet it'll be hot in the esams hot rows ;) [13:16:36] it's 30C here [13:18:16] so for the eqiad cache refresh we ended up choosing these SSDs: https://phabricator.wikimedia.org/T193911#4232088 [13:18:54] Samsung NVMe, ok [13:19:38] BTW do we have the option to apply thermal paste on cp3037? 
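Going back to the depool-threshold thread above, and to T184715 ("pybal's can-depool logic only takes downServers into account"): a simplified model of the guard being discussed, assuming a .7 threshold over the full configured server list. This is not pybal's actual code, just the arithmetic.

```python
def can_depool(total_servers: int, currently_pooled: int, threshold: float = 0.7) -> bool:
    # Only allow taking one more server out if at least `threshold` of the
    # configured servers would still be pooled afterwards.
    return (currently_pooled - 1) / total_servers >= threshold

# With a 46-server pool (the api_80 size seen later in this log) and a .7 threshold,
# this kind of guard stops depooling at 33 servers, i.e. roughly 72% still pooled:
pooled = 46
while can_depool(46, pooled):
    pooled -= 1
print(pooled)  # -> 33
```

The subtlety the task is about: if the guard only counts servers that pybal's own monitors marked down, servers disabled through the API (as scap does) presumably never hit this check at all, which would fit the "api_80: 0 enabled 46 disabled" line further down.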
It's been behaving fine since the repool on Monday, but when it crashed it did show signs of thermal issues [13:20:49] do these servers even support NVMe? [13:21:06] I'd guess not :) [13:21:27] probably not but would be good to check [13:21:37] hmm thermal paste [13:21:47] i have no thermal paste that isn't years old at least... [13:21:51] wondering where I could get that quickly [13:22:26] so maybe not tomorrow but we could order it for my next visit, which could happen in the next 1-2 weeks [13:22:55] we use the S3710 for 800G drives on all but that latest eqiad order, so an S3710 400 as a replacement should be fine. [13:22:59] i'll put in a ticket to order some now [13:23:03] bblack: ah cool [13:23:21] so i will use one from the frankenhost I think, but we can consider that for any further failures [13:23:29] and then maybe swap both in a given box? [13:23:34] instead of having two different ones [13:23:39] what's the frankenhost? [13:23:48] see above [13:23:55] the machines are all out of support now :( [13:24:00] and there's one with a bad mainboard [13:24:03] ok [13:24:11] so I think i will use it to get DIMMs and SSDs for the other 2 broken ones in the same batch [13:24:14] so sacrifice that one [13:24:31] yeah this is the whole warranty-vs-refresh thing. I think all of these cp30[34]x are due for replacement in FY1819 cycle anyways. [13:24:50] makes sense [13:24:50] * mark checks that [13:24:52] I don't think so [13:25:00] are they 1920? [13:25:05] might be [13:25:08] 1920? [13:25:10] given they just went out of warranty too [13:25:10] oh FY [13:25:23] I don't see anything in the FY18-19 budget for that [13:25:29] yeah nothing [13:29:57] cp3048: Description: The system board PS2 PG Fail voltage is outside of range. [13:30:04] probably a bad capacitor on the mainboard somewhere... [13:30:10] yeah esams and codfw were together, I guess that was one where I was originally writing down 1819 but we pushed to 1920 for expiry+1 [13:30:51] probably better this way anyways, it will give more time to adapt hardware plans to ATS results before we do more big orders. [13:31:05] if it gets dire we can see if we can do some order [13:31:06] there's also a SMART alert in icinga for cp3048, how timely! (since 12 minutes ago) [13:31:12] oh really :P [13:31:21] that SSD realized i was going to steal it tomorrow [13:31:32] so I need to pick the other one I guess [13:31:36] but it may be that they're all going bad now [13:31:41] they're over 3 years old and heavily used [13:31:52] bblack: [13:31:53] 14:20:38 our varnish server SSDs: [13:31:53] 14:20:39 Endurance Rating (Lifetime Writes) 10 drive writes per day for 5 years [13:31:59] that's the DC S3700 ;) [13:32:15] i think they've been doing a bit more than 10 writes per day in the past 3+ years [13:32:25] how many SSD failures are we at now on the cp30[34]x? there's 40 total drives there [13:32:45] "drive writes" means "overwrite the entire contents of the drive" [13:32:46] i think just 2-3 [13:32:58] bblack: aaaah hehe ;) [13:33:03] it seemed just bad wording [13:33:26] then they may be in warranty in fact [13:33:28] it's just an annoying term. so for a 400GB disk, 10 dwpd means ~4TB written per day [13:33:29] they have 5 yr warranty [13:34:44] do we have stats somewhere on how much data they've actually written? [13:34:50] they probably track that themselves [13:35:14] 225 Host_Writes_32MiB 0x0032 100 100 000 Old_age Always - 11449008 [13:35:16] yeah that's possible some drive stat we can pull. 
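A back-of-the-envelope conversion of that Host_Writes_32MiB raw value, assuming the attribute counts 32 MiB units and that the drive is one of the 400 GB S3700s with the "10 drive writes per day for 5 years" rating quoted earlier:

```python
# Assumptions: the raw SMART value counts 32 MiB units, and the drive is a 400 GB
# Intel S3700 rated at 10 drive writes per day (DWPD) for 5 years.
host_writes_32mib = 11_449_008                          # raw Host_Writes_32MiB value from above

written_tb = host_writes_32mib * 32 * 1024**2 / 1e12    # ~384 TB written (~350 TiB)
rated_tb_per_day = 400e9 * 10 / 1e12                    # 10 full writes of a 400 GB drive = 4 TB/day

print(f"~{written_tb:.0f} TB written, ~{written_tb / rated_tb_per_day:.0f} days of the rated DWPD budget used")
```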
I don't think we could get an accurate running total from logged OS stats (if they even went back that far!) [13:35:35] SMART stats probably work [13:36:19] if that Host_writes_32MiB is what I think it is, that's ~350TB [13:36:27] so ballpark 100 days worth of their dwpd [13:36:50] then they should be good for another while [13:37:25] ema: ehhh device sdc for cp3048? [13:37:31] these have only 2 [13:37:40] to confirm, there's also "Media_Wearout_Indicator" which converts that into a percentage of usefil write life consumed [13:38:17] mark: yes, sdc [13:38:18] set to 0 on these [13:38:49] and there's: [13:38:50] 226 Workld_Media_Wear_Indic 0x0032 100 100 000 Old_age Always - 3706 [13:38:50] 227 Workld_Host_Reads_Perc 0x0032 100 100 000 Old_age Always - 66 [13:39:24] yeah another random cp4 I checked has the wear indicator at 96% [13:39:30] odd that doesn't match up with TB written [13:40:23] cp3046 has 96% for both drives, showing 449TB and 467TB written on them. [13:40:48] yeah the one I was checking has been depooled for a while [13:40:52] that's the one with the bad mainboard [13:40:55] the upcoming frankenhost [13:41:05] so if you guys need to decom that box before I rip its contents out, please do so today ;) [13:41:10] or tomorrow morning [13:41:22] cp3048 [13:41:35] I'm guessing the discrepancy might be to do with the fact that our drive writes don't evenly walk the whole disk replacing whole blocks. [13:41:48] the drive's excess capacity and wear-levelling can only do so much if we have hotspots, which we likely do. [13:45:44] maybe we should be monitoring that wearout indicator universally [13:46:24] Sensor Type : PROCESSOR [13:46:24] [13:46:24] CPU1 Status Failed Presence Detected NA NA [13:46:24] CPU2 Status Ok Presence Detected NA NA [13:46:25] ha [13:46:25] oh! maybe I'm reading the percentage backwards, because even some very new ones are showing 100% [13:46:40] weird... at some point scap asked for a complete depool of api_80 [13:46:43] yes, the wearout indicator starts at 100 and counts down to zero [13:46:46] Jun 26 09:33:35 service: api_80: 0 enabled 46 disabled: 0% enabled [13:47:30] vgutierrez: i'm not sure it's reasonable to let depool-threshold prevent that tbh [13:47:56] even if it did in the past, this is something that should be prevented in scap [13:48:09] mark: right now, it looks like we are abusing the depool threshold :) [13:48:19] so, they're all fine on wear indicator %. The only ones that are even sub-100 in the whole fleet are of course esams, codfw, and eqiad (the oldest ones in service) [13:48:34] bblack: we had graphite machines at 0% [13:48:44] those were really poorly abusing the ssd [13:48:53] the very worst of the cache boxes is cp1071-4,1099 at 94% heh [13:49:02] godog had a fun evening at FOSDEM with that [13:52:49] sigh... I'll continue the debugging later.. let's fix my left shoulder /o\ [14:05:04] lol, fun indeed! friday night outages FTW [14:05:09] vgutierrez: take care! [14:05:38] mark: glad I wasn't alone fighting with it tho! [14:20:29] bblack, vgutierrez: sanity check on adding misc vcl files to text hosts please: https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/440157/ [14:37:43] bblack: I'm going to ulsfo this morning. let me know if there is anything to do [14:38:12] (other than plugging that cable for JTAC tests) [14:41:22] ema: looks sane? 
assuming the rest of it works out that's already deployed :) [14:43:00] XioNoX: I think there's a bunch of decom work last I heard, but I donno the status (pulling most of the old machines, except maybe bast4001) [14:43:12] rob would know :) [14:44:02] I mean some low hanging fruits :) [15:36:07] hi, fellow traffic people, could you check if you have any unusual rate of cache misses since today? I have seen an increase of traffic at lower levels and wanted to ask if you saw something strange at higher ones? [15:36:44] this would be text cache [15:36:57] jynus: the Equinix Ashburn link is saturating, like last week [15:37:00] for normal browsing (not api calls) [15:37:14] wasn't last week on uploads? [15:37:22] yeah [15:37:38] what I am getting is higher webrequests on enwiki text [15:37:49] it started about 15min ago, not sure if similar issue [15:37:55] bblack: ^ [15:38:29] (note it could be cache misses at other layers- but asking here first) [15:39:22] this is what we are seeing at db layer, in case you can corralate on yours: https://grafana.wikimedia.org/dashboard/db/mysql-aggregated?orgId=1&from=1530545942418&to=1530632342418&var-dc=eqiad%20prometheus%2Fops&var-group=core&var-shard=s1&var-role=All [15:39:32] traffic, connections and reads increases [15:40:24] in the 2x range [15:42:37] jynus: graphs don't match with the traffic spike [15:42:44] yeah [15:42:47] https://librenms.wikimedia.org/graphs/to=1530632400/id=11600/type=port_bits/from=1530546000/ [15:43:03] and re pure traffic, not anomaline on cache misses? [15:43:06] neither do the edge cache graphs. there's some minor perturbances, but no huge moves [15:43:09] *anomaly [15:43:13] in a meeting, but I don't see any significant change in hitrate https://grafana.wikimedia.org/dashboard/db/varnish-caching-last-week-comparison?refresh=15m&orgId=1&var-cluster=text&var-site=All&var-status=1&var-status=2&var-status=3&var-status=4&var-status=5&from=now-12h&to=now [15:43:18] ok, thanks [15:43:24] I will check other layers too [15:43:49] there was a reqrate bump further back though [15:44:07] roughly 14:55 -> 15:05, so nearly 45 minutes back now [15:44:32] Equinix Ashburn spike seems to have stopped [15:44:34] and it was mostly pass-traffic in the increase [15:44:50] https://grafana.wikimedia.org/dashboard/db/varnish-caching?refresh=15m&orgId=1&from=now-1h&to=now&var-cluster=cache_text&var-site=eqiad&var-status=1&var-status=2&var-status=3&var-status=4&var-status=5 [15:44:57] yeah, but unless hit ratio changed, our "bump" is 10x less impacting always [15:45:03] ^ there's the bump I'm talking about, and you can see the pass fraction increases to match [15:45:12] specially for webrequests thata are not api calls [15:46:12] I would say looks self-inflicted by changes, but then the patterns I see (even at our level) do seem to be eqiad-specific, so more likely client-induced. [15:46:56] there was also a dip in upload hitrate (like before) in eqiad, which trailed off around the time your problem window started [15:46:59] https://grafana.wikimedia.org/dashboard/db/varnish-caching?refresh=15m&orgId=1&from=now-1h&to=now&var-cluster=cache_upload&var-site=eqiad&var-status=1&var-status=2&var-status=3&var-status=4&var-status=5 [15:47:36] perhaps this same EC2 instance(s) just scanned through some more large-volume commons downloads, and then moved over to text to rip a bunch of related/meta -data for it. [15:47:49] :-) [15:48:43] OR [15:49:00] millions of americans are planning 4 july on wikipedia instead of working :-) [15:50:12] true! 
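One way to check the "mostly pass-traffic" observation against a sampled webrequest dump: bucket requests per minute and split them by cache disposition. This is only a sketch; the field names (`dt`, `cache_status`, `x_cache`) and the sampled-log filename are assumptions to adjust against whatever the logs on oxygen actually contain.

```python
import json
import sys
from collections import Counter

per_minute = Counter()
for line in sys.stdin:
    try:
        req = json.loads(line)
    except ValueError:
        continue                                           # skip truncated/garbled lines
    minute = req.get("dt", "")[:16]                        # e.g. "2018-07-03T15:01"
    status = req.get("cache_status") or req.get("x_cache", "")
    bucket = "pass" if "pass" in status else "hit" if "hit" in status else "other"
    per_minute[(minute, bucket)] += 1

for (minute, bucket), count in sorted(per_minute.items()):
    print(minute, bucket, count)
```

Fed with something like `zcat sampled-1000.json.gz | python3 pass_bump.py` (filename assumed), a pass-heavy bump shows up as the pass bucket growing while the hit bucket stays flat.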
[15:50:51] both effects at the cache layer are completely absent in codfw+ulsfo though [15:51:01] it's very client-specific like last time I assume [15:51:45] if it is pure traffic and maybe not self inflicted, there could be not much we can do anyway [15:51:58] I overprovision the dbs precisely to sustain this [15:52:53] but I think we just beat our top Queries per second ever [15:54:34] that's why it seems so odd [15:54:37] 400 K QPS over a 1 minute average [15:54:50] we're not setting records on raw text reqs, nor pass/miss-rate, in the window there [15:54:50] on the main production metadata servers [15:55:04] that is actually normal- [15:55:17] (but we did earlier from 14:55-15:05 have spikes of both.... maybe some delayed effect?) [15:55:42] yeah, I don't know about that- maybe it is correlated, maybe has nothing to do [15:56:46] one thing that stands out in oxygen logs: seems like an unusually-high level of labs traffic to text [15:57:59] I would expect high api calls, not that much non-api webrequests [15:58:30] cyberbot-exec-01 is a Cyberbot Database Node (cyberbot::db) [15:58:30] cyberbot-exec-01 is a Cyberbot Exec Node (cyberbot::exec) [15:58:39] ^ this in labs is the unusual request volume to text [15:59:04] and taxonbot [15:59:28] both friends of us DBAs on wikireplicas :-) [16:00:17] don't spend much time on this for me, ok? [16:00:29] I got all the help I needed [16:15:45] bblack: hey, I'd love to hear your insights regarding replacing baham with authdns2001 :) [16:35:49] vgutierrez: I think it's basically: [16:36:55] 1) Add to role::authdns::data nameservers list (this should make authdns-update start syncing to it properly), and make sure it is in sync (probably easiest way is push some natural or no-op dns repo change and watch it updates there as well [16:38:59] 2) The actual routing of the ns1.wikimedia.org IP -> baham is handled by the actual router config, XioNoX can help switch it to route towards authdns2001 instead of baham. [16:39:12] ack :) [16:39:17] 3) Once that's switched over and everything looks sane, can pull baham from the data.pp list, decom, etc... [17:02:12] Ticket#2018062510000561 is someone asking about a security requirement change on 1st August, anyone know what that's about? [17:02:59] I assume we're dropping support for an SSL protocol/algorithm or something but they didn't provide a link [17:03:21] that is bad security browsers being blocked to not compromise the others that are using the right security config [17:03:29] let me search the tickets and meta page [17:04:04] https://phabricator.wikimedia.org/T192559 [17:04:37] it will also happen on other sites, not only wikimedia (most site that handle credit cards) [17:04:59] figured it'd be something like this, thanks jynus [17:05:24] there is https://en.wikipedia.org/sec-warning [17:05:39] is that the one for 1st august? [17:05:40] and this is the canonical help page: https://wikitech.wikimedia.org/wiki/HTTPS/Browser_Recommendations [17:05:51] ah that sec-warning page is [17:06:20] see the bottom part [17:06:38] huh we're linking readers to wikitech? ok [17:06:45] most peopel affected nokia old browsers, playstation 3 and bad corporate networks [17:08:21] "This message will remain until Aug 1, 2018. After that date, your browser will not be able to establish a connection to our servers at all." 
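On the sec-warning topic, a small probe for which TLS protocol versions an endpoint will still negotiate. A sketch only: whether the old versions can even be offered depends on the local OpenSSL build and security level, and the exact protocols/ciphers being dropped are whatever T192559 says, not guessed here.

```python
import socket
import ssl

HOST = "en.wikipedia.org"

for version in (ssl.TLSVersion.TLSv1, ssl.TLSVersion.TLSv1_1, ssl.TLSVersion.TLSv1_2):
    ctx = ssl.create_default_context()
    ctx.minimum_version = version       # pin the handshake to exactly one protocol version
    ctx.maximum_version = version
    try:
        with socket.create_connection((HOST, 443), timeout=5) as sock:
            with ctx.wrap_socket(sock, server_hostname=HOST) as tls:
                print(f"{version.name}: accepted ({tls.version()})")
    except (ssl.SSLError, OSError) as exc:
        print(f"{version.name}: rejected ({exc.__class__.__name__})")
```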
[17:08:44] yeah this particular case is an old phone [17:09:58] even some old phones could probably install newer browsers like firefox [20:55:33] 10netops, 10Operations, 10fundraising-tech-ops: NAT and DNS for fundraising monitor host - https://phabricator.wikimedia.org/T198516 (10Jgreen) Looks like .4, .9, and .15 are available. .9 was tellurium and still has crufty DNS so my suggestion is we use that, and clean up the cruft in the process.
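For the "crufty DNS" cleanup mentioned in T198516, a quick way to see whether candidate addresses still carry stale PTR records before reuse. The subnet prefix below is a placeholder (TEST-NET-3), not the real fundraising network:

```python
import socket

SUBNET = "203.0.113"           # placeholder prefix, substitute the actual network
CANDIDATES = (4, 9, 15)        # the last octets mentioned as available in T198516

for last_octet in CANDIDATES:
    ip = f"{SUBNET}.{last_octet}"
    try:
        name, _, _ = socket.gethostbyaddr(ip)
        print(f"{ip}: PTR still points at {name}, clean that up before reuse")
    except (socket.herror, socket.gaierror):
        print(f"{ip}: no PTR record, looks free")
```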