[07:57:28] <wikibugs__>	 10Traffic, 10DBA, 10MediaWiki-API, 06Operations: Someone is parsing all enwiki pages using the action api at a rate of ~2M pages/hour - https://phabricator.wikimedia.org/T162129#3174291 (10fgiunchedi) p:05Triage>03High I'm triaging as high since there's potential for an outage. Did the block or rate li...
[07:58:34] <wikibugs__>	 10Traffic, 06Discovery, 06Maps, 06Operations, 03Interactive-Sprint: Make maps active / active - https://phabricator.wikimedia.org/T162362#3174294 (10fgiunchedi) p:05Triage>03Normal
[08:05:30] <wikibugs>	 10Traffic, 06Discovery, 06Maps, 06Operations, 03Interactive-Sprint: Make maps active / active - https://phabricator.wikimedia.org/T162362#3160389 (10Pnorman) Unless you take special measures two tile servers with the same style and data may render labels differently. Generally this is caused by queries w...
[08:05:39] <wikibugs__>	 10Traffic, 10DBA, 10MediaWiki-API, 06Operations: Someone is parsing all enwiki pages using the action api at a rate of ~2M pages/hour - https://phabricator.wikimedia.org/T162129#3174307 (10Marostegui) Looks like he stopped two days ago: https://grafana.wikimedia.org/dashboard/db/api-summary?orgId=1&from=14...
[08:51:43] <ema>	 cp4005 has been running fine during the night with 4.9, I'm gonna go on with upgrading cache_upload
[13:04:49] <volans>	 bblack, ema FYI I'm about to test the varnish switchdc task with a noop change in VCL if you don't have anything against or in the middle of something:
[13:04:52] <volans>	 https://gerrit.wikimedia.org/r/#/c/347828/
[13:06:07] <volans>	 related executed code is: https://github.com/wikimedia/operations-switchdc/blob/master/switchdc/stages/t05_switch_traffic.py
[13:06:09] <bblack>	 I'm ok.  I don't know if ema's in the midst of some 4.9 kernel update reboots, in which case 1/N caches might fail the agent run for being down
[13:06:11] <ema>	 volans: go ahead, I've stopped the upload kernel upgrades meanwhile
[13:06:19] <volans>	 and dry-run output is: https://wikitech.wikimedia.org/wiki/Switch_Datacenter/MediaWiki#output_6 (second part, from switchdc.stages.t05_switch_traffic)
[13:06:20] <bblack>	 well there we go :)
[13:06:32] <volans>	 great, thanks
[13:06:41] <bblack>	 oh hmmm
[13:06:57] <bblack>	 it's not critical now, but you might want to alter the eqiad step to avoid cp1008
[13:07:04] <bblack>	 (e.g. maybe an additional filter on *.eqiad.wmnet)
[13:07:51] <bblack>	 it's a test-server that uses production's role, but it's special and could very well be in an odd state that causes you to see false failures when you execute this, etc
[13:08:05] <bblack>	 volans: ^
[13:08:06] <volans>	 sure, let's avoid it
[13:08:31] <ema>	 'and not cp1008.wikimedia.org' I guess?
[13:08:38] <volans>	 yes, but is it "fixed?
[13:08:45] <volans>	 can I get it dynamically from something else?
[13:08:58] <ema>	 I don't think so
[13:08:59] <volans>	 role::authdns::testns ?
[13:09:52] <volans>	 bblack: ^^ not sure if there is any "enforcement" that means it will always be that one
[13:10:17] <bblack>	 there's not any good role-based way to do it reliably
[13:10:25] <volans>	 class parameters?
[13:10:28] <bblack>	 but all real cache hosts are in private networks, and it's public in .wikimedia.org
[13:10:38] <bblack>	 hence *.eqiad.wmnet as an easy filter
[13:10:55] <volans>	 ok, or not *.wikimedia.org :D
[13:11:22] <volans>	 then ema, let me fix this, feel free to continue with some kernel upgrade in the meanwhile
[13:11:53] <ema>	 volans: OK please ping me when you're done
[13:11:55] <volans>	 sure
[13:12:24] <volans>	 also I'm forcing run puppet only in eqiad/codfw, so if you're doing esams/ulsfo there is no conflict
[13:21:25] <ema>	 volans: I'm doing all DCs, but in any case I'd say it's better to avoid multiple concurrent work of this sort :)
[13:21:40] <volans>	 yeah of course
[13:26:33] <ema>	 volans: +1 on the *.wikimedia.org filter, only minor nit is that you might want to add a comment to remember why the filter is there in the first place
[13:26:59] <volans>	 make sense, let me add that too, the commit message will be lost in history :)
[13:28:03] <volans>	 done
[13:37:32] <bblack>	 volans are you about to start?
[13:37:46] <bblack>	 oh I guess you're in-progress
[13:38:00] <volans>	 waiting for CI to submit, then I can proceed
[13:38:08] <volans>	 but if you have other stuff you want to do first
[13:38:10] <volans>	 I can wait
[13:38:12] <bblack>	 nope
[13:39:38] <volans>	 rebasing the VCL change to now have to wait on CI during the test
[13:39:56] <bblack>	 seems like 1 blank line can skip CI
[13:40:10] <bblack>	 (and really, all our pre-staged stuff, can CI ahead of time and assume it's the same on rebase I think)
[13:40:21] <volans>	 bblack: double check, is it ok to run puppet in parallel on all eqiad caches (8 servers)?
[13:40:27] <bblack>	 yes
[13:42:24] <volans>	 ok, ready
[13:42:32] <ema>	 volans: just a sec
[13:42:52] <volans>	 sure
[13:43:02] <ema>	 cp1072 is booting slowly because of T162612
[13:43:02] <stashbot>	 T162612: codfw hosts occasionally spend > 3 minutes starting networking.service with linux 4.9 - https://phabricator.wikimedia.org/T162612
[13:43:41] <volans>	 ema: sure let it also do a full puppet run
[13:43:46] <volans>	 and let me know when ready
[13:44:30] <bblack>	 ema: I thought you guys were blacklisting uncore to fix that?
[13:45:41] <ema>	 bblack: true, that was the plan
[13:46:03] <ema>	 let's do that after the switchdc test
[13:47:28] <ema>	 volans: done
[13:47:59] <volans>	 ema: great, also the puppet run I guess
[13:48:03] <ema>	 yup
[13:48:21] <volans>	 proceeding then
[13:50:06] <volans>	 running puppet now
[13:51:29] <volans>	 so the diff is shown 4 times for each host
[13:51:44] <volans>	 I guess we could go with run-puppet-agent -q, thoughts?
[13:51:55] <volans>	 done
[13:54:57] <bblack>	 yeah, so, re: the 4x per host
[13:55:34] <bblack>	 cache::app_directors affects wikimedia-common template
[13:56:02] <bblack>	 the wikimedia-common template is templated-out twice on disk for the two different frontend/backend instances (with per-instance variation)
[13:56:13] <wikibugs>	 10netops, 06Operations, 10ops-esams: cr2-esams FPC 0 is dead - https://phabricator.wikimedia.org/T162239#3175202 (10ayounsi) 05Open>03Resolved Juniper received the faulty part,  > Thank you for returning your defective product in relation to your recently created RMA. This notification confirms that Juni...
[13:56:28] <bblack>	 and then we also have duplicates of everything templated out slightly-differently again for varnishtest
[13:57:00] <bblack>	 run-puppet-agent -q is fine I think, so long as we're absolutely sure the change applied when we get success back
[13:57:37] <volans>	 we have the insurance that exit code was 0
[13:57:45] <volans>	 from run-puppet-agent
[13:57:58] <volans>	 https://github.com/wikimedia/puppet/blob/production/modules/base/files/puppet/run-puppet-agent
[13:58:03] <volans>	 last line
[13:58:04] <bblack>	 right
[13:58:12] <bblack>	 but that just means the agent succeeded, not that it applied the change we wanted :)
[13:58:38] <bblack>	 the race condition that scares me (and I saw it earlier this week in practice)
[13:58:48] <bblack>	 is that this sequence:
[13:59:32] <bblack>	 (on target nodes)# disable-puppet
[13:59:42] <bblack>	 (on gerrit)# C+2 -> Submit
[13:59:51] <bblack>	 (on puppetmaster)# puppet-merge # and wait for it to complete
[13:59:58] <bblack>	 (on target nodes)# enable and then run agent
[14:00:22] <bblack>	 does not gaurantee that the change you were working on actually gets applied to the target nodes
[14:00:50] <bblack>	 apparently even after puppet-merge completes, there can be a small window before the masters pick up all the on-disk changes and use them in new catalog compiles
[14:01:05] <bblack>	 so you can get a no-op success on the last step there.  then wait a short while and try again and your change happens.
[14:01:55] <ema>	 bblack: is there any way to know when the masters finished picking up the change?
[14:02:10] <bblack>	 I don't know what the boundaries are on that short while.  I've only witnessed it and noticed it when the timing gap from puppet-merge completion to agent startup was <10s or so
[14:02:33] <bblack>	 but what's the upper bound there? if we're automating lots of things we'll eventually hit every corner case
[14:02:45] <bblack>	 ema: no idea! :)
[14:02:48] <volans>	 ok, so then we could change run-puppet-agent
[14:02:51] <volans>	 to use -t
[14:02:58] <bblack>	 yeah but even -t sucks right?
[14:02:59] <volans>	 and check for exit code 2
[14:03:07] <volans>	 , an exit code of '2' means there were changes
[14:03:12] <volans>	 from https://docs.puppet.com/puppet/3.8/man/agent.html
[14:03:21] <ema>	 well technically it could be other changes too
[14:03:33] <volans>	 so we could pass a paremter to run-puppet-agent that says expect changes
[14:03:38] <volans>	 and it will fail if exit code is not 2
[14:03:39] <bblack>	 and trying to trawl through all the outputs to "see what you expected to see" sucks too
[14:03:42] <volans>	 ema: sure, that case too apply
[14:03:55] <bblack>	 the core of the problem is that puppet is so non-transactional about these updates
[14:04:02] <volans>	 yeah ew know that
[14:04:05] <volans>	 *we
[14:04:32] <bblack>	 it just gets to be even more of a problem when we automated around it tighter, vs a slower process where a human's at least trying to catch puppet's foibles
[14:04:49] <volans>	 the other option is run-puppet-agent (without -q) | grep 
[14:05:01] <volans>	 not very reliable either
[14:05:29] <volans>	 that points us even more to avoid puppet for those kind of changes that we want to automate :D
[14:05:31] <bblack>	 I wonder if there's a setting by which we know the time bound
[14:05:41] <bblack>	 e.g. some puppetmaster setting for "scan for disk changes every X seconds" or something
[14:06:22] <ema>	 volans: can I resume the kernel upgrades?
[14:06:24] <bblack>	 we've also seen even worse related cases, where a change is half-picked-up
[14:06:30] <volans>	 ema: sure!
[14:06:34] <ema>	 cool, thanks
[14:06:44] <bblack>	 (e.g. the manifest change and related template change from the same commit, there's a time gap between puppet picking the two up)
[14:06:56] <volans>	 ema: sorry I though was explicit the "done" before :)
[14:07:20] <ema>	 volans: yeah, I just wanted to double-check :)
[14:07:37] <ema>	 bblack: I'll try https://gerrit.wikimedia.org/r/#/c/347848/ by hand on the next couple of hosts
[14:07:38] <volans>	 bblack: it would be nice to have puppet log the commit applied
[14:07:54] <bblack>	 I don't think puppetmaster is even aware that it's reading from git
[14:08:08] <bblack>	 it just thinks it's reading from a filesystem, right?
[14:08:51] <volans>	 bblack: probably not (git aware), but we could have a post-hook that set's some variable to the git commit hash and have the clients log it
[14:09:36] <bblack>	 a post-what-hook?
[14:09:42] <bblack>	 (catalog compilation?)
[14:09:44] <volans>	 puppet-merge
[14:10:24] <bblack>	 a puppet-merge hook doesn't tell us much about the puppetmaster process's state though
[14:10:24] <volans>	 I need to check, but we do git-stuff there
[14:10:38] <volans>	 no but if it set's a variable in hiera for example
[14:11:06] <volans>	 or another local file that is included in the puppet compilation process
[14:11:19] <bblack>	 probably step 1 is even understanding what causes the delays (if it's a timer in puppetmaster to check changes, or maybe it uses inotify but we need to sync changes to disk better?)
[14:11:21] <volans>	 and like in base have info() of that value
[14:11:50] <bblack>	 someone surely has studied this problem and filed complaints/bugs before
[14:12:21] <ema>	 you think puppet devs ever received complaints? Impossible!
[14:13:16] <wikibugs__>	 10Traffic, 06Operations: Fix broken referer categorization for visits from Safari browsers - https://phabricator.wikimedia.org/T154702#2920977 (10Astinson) +1 to @Nuria 's comment: I think the main concern here from @DarTar  and me is that external websites need to have some sort of awareness where the dark tr...
[14:13:22] <volans>	 lol
[14:17:43] <bblack>	 puppet code deploy --wait production
[14:18:01] <bblack>	 ^ apparently this is from puppet enterprise stuff
[14:20:24] <bblack>	 is there even a place where pupptmaster would log such a thing
[14:21:48] <ema>	 bblack: OK to merge https://gerrit.wikimedia.org/r/#/c/347848/? cp2005 and cp1063 booted fine with the module blacklisted
[14:25:05] <volans>	 bblack: which thing?
[14:25:18] <bblack>	 volans: that it had noticed some changes on the FS or whatever
[14:25:43] <bblack>	 ema: maybe? I was looking at modules/base, I don't think it does any magic beyond pushing the modprobe.d file
[14:25:55] <bblack>	 doesn't it need to rebuild initramfs afterwards or something like that?
[14:26:01] <bblack>	 (or affect grub cmdline?)
[14:26:23] <bblack>	 I guess so long as this is loaded relatively-late, maybe not
[14:26:28] <ema>	 bblack: I think it affects udev, adding a line should be enough
[14:26:39] <volans>	 not sure, I can check it, but for now I'm ok to leave the puppet run verbose for the switchover
[14:26:40] <bblack>	 ok
[14:26:47] <bblack>	 x2
[14:27:18] <ema>	 the module hasn't been loaded on cp2005/cp1063, so it should have worked fine :)
[14:40:04] <bblack>	 so, re: the non-MW parts of https://wikitech.wikimedia.org/wiki/Switch_Datacenter
[14:40:25] <bblack>	 1) I fixed up the Traffic section for the pre/post work there, it's pretty simple, two commits and their reverts
[14:40:45] <bblack>	 (that's at: https://wikitech.wikimedia.org/wiki/Switch_Datacenter#Traffic )
[14:41:34] <volans>	 you used "run-puppet-agent -q" :-P
[14:41:40] <bblack>	 yup!
[14:42:09] <bblack>	 in this case, the changes aren't actually-critical
[14:42:17] <volans>	 yeah 30m later at most it's ok
[14:42:20] <bblack>	 if a node fails to apply them, they'll apply in 30 minutes or less from cron and all's fine
[14:43:32] <bblack>	 2) Swift - that section needs at the very least a much simpler rewrite, but I'm still not clear on whether it has any inter-dependencies outside of traffic or not, need to sync up with godog on what steps to execute there
[14:44:09] <bblack>	 3) "Services" - I think this section as we delve into it is suffering from a services-services-services problem like labs-labs-labs
[14:45:07] <bblack>	 the overview outlined there now is just about the services in discovery::services, but also notes some of them are varnish-level services too, and then (I think unecessarily) links the varnish and dns-disc parts together as one sequence of steps
[14:45:22] <bblack>	 there are also "services" in varnish that are not in dns-disc, too
[14:45:47] <volans>	 to simplify the picture
[14:45:51] <bblack>	 since dns-disc "services" are a/a anyways...
[14:46:17] <bblack>	 I think we can split that up into one whole stage of "move varnish-level services to codfw only as appropriate" and then a second stage of just the dns-disc changes for those services
[14:46:26] <volans>	 I know that joe wanted to do a quick bash with all the services commands
[14:46:47] <volans>	 I guess to change the confctls
[14:46:53] <volans>	 for the discovery ones
[14:47:04] <bblack>	 right, but the ones with varnish-level changes are commits, which are blended into it in the current doc
[14:47:09] <bblack>	 I think they don't have to be
[14:47:43] <bblack>	 we can do a "Traffic Services" that does all the varnish-level commit-based switching of services that varnish uses as a seperate stage
[14:47:56] <bblack>	 and then "Services" is just the confctl for dns-discovery for the internal discovery-based services
[14:48:17] <bblack>	 (sequentially - finish up all Traffic Services then run Services)
[14:49:12] <volans>	 yeah I guess too is better to decouple them
[14:49:26] <bblack>	 they're not for the same sets of services anyways
[14:49:35] <bblack>	 some are varnish-only, some are dnsdisc-only, some are both
[14:49:59] <godog>	 bblack: I'm about to jump in an interview, sync up maybe after the switchover meeting?
[14:50:15] <bblack>	 godog: ok
[15:04:35] <volans>	 bblack: re:run-puppet-agent -q, to be completely fair and make clear pro/cons, even without the -q if the race condition happens we're a bit screwed given that the script will continue running puppet on the other DC
[15:06:24] <volans>	 we'll be aware of it looking at the output/scrolling it, but still we'll had already proceeded into the next command
[15:08:54] <bblack>	 there's no way to insert a pause with confirmation?
[15:09:11] <volans>	 if needed sure, we have already one to ask to merge the commit
[15:09:31] <bblack>	 yeah I mean between the 2x DC puppet enable/run
[15:09:54] <volans>	 yeah I mean ofc we can add another one :)
[15:10:02] <bblack>	 the whole reason for that split, is if puppet apply from the second DC happens before it's definitely-applied everywhere in the first DC, it will cause user-facing 508s
[15:11:20] <volans>	 the other option could be what I said before, make a parameter to run-puppet-agent to ensure that changes were applied, even if it doesn't work in the general case, we know that this is the only commit of the switchover
[15:11:34] <volans>	 and we could add a force puppet run just before the RO period on the same hosts
[15:11:51] <bblack>	 yeah
[15:11:56] <bblack>	 that's reasonable too
[15:12:05] <volans>	 but then to be devil's advocate, the race condition could happen before
[15:12:17] <volans>	 and at the second run the changes applied are the previous ones and not the intended one
[15:12:32] <volans>	 "before" in the sense at this first run before the RO period
[15:12:57] <bblack>	 well hopefully we'll have a period of puppet quiescence before we step into this procedure's pre-steps anyways
[15:14:12] <volans>	 another option, anything I could grep in the varnish config to ensure the intended config was applied?
[15:15:21] <bblack>	 I think that gets complicated
[15:16:35] <volans>	 :)
[15:17:05] <volans>	 there is no: varnish-tell-me-your-routing-path command? :D
[15:17:36] <bblack>	 :P
[15:17:47] <bblack>	 another random topic on the wikitech page:
[15:17:55] <bblack>	 Traffic: Pre-switchback in two phases: Mon May 1 and Tues May 2 (to avoid cold-cache issues Weds)
[15:17:58] <bblack>	 MediaWiki: Wednesday, May 3rd 2017 14:00 UTC (user visible, requires read-only mode)
[15:18:01] <bblack>	 Services, Elasticsearch, Swift, Deployment server: Thursday, May 4th 2017 (after the above is done)
[15:18:08] <bblack>	 ^ that's the switchback ordered list of things
[15:18:45] <bblack>	 other than deployment server, the other 3 at the end there (services, es, swift) are happening after MW switches back?
[15:19:03] <volans>	 bblack: OT tell me when you've finished to edit the wikitech page, I have some changes to do too ;)
[15:19:03] <bblack>	 this seems to be the opposite of how we did things on switchover, in an "unwinding" sense
[15:19:13] <bblack>	 volans: done for now
[15:20:15] <bblack>	 oh no, I had that backwards by the time I finished thinking about it heh
[15:20:27] <bblack>	 it's just dpeloyment server that seems out of unwinding order
[15:20:40] <volans>	 also traffic
[15:20:44] <volans>	 should happen after MW
[15:20:55] <bblack>	 yeah but it's actually independent
[15:21:22] <bblack>	 I could make the traffic switch today and it makes no "functional" difference to users or what data flows into which apps in which DCs
[15:21:34] <volans>	 yeah
[15:21:47] <bblack>	 ditto for deployment tooling I guess
[15:21:59] <bblack>	 although I don't know if that's true in the details
[15:22:14] <bblack>	 (e.g. deployment scripts that send maintenance API requests to the servers or something)
[15:24:05] <bblack>	 if you want, we can move the traffic switchback days to N+1 and N+2 (after MW)
[15:24:21] <bblack>	 it doesn't really matter when they happen, just that they're spaced out to avoid the cold cache issue
[15:24:37] <bblack>	 but I figure this way around lets us declare sooner that we're back to normal state
[15:24:51] <volans>	 absolutely the same for me
[15:25:41] <bblack>	 eh leave it then, plenty of other things to discuss! :)
[15:26:02] <volans>	 yeah!
[15:31:29] <ema>	 upload ulsfo fully upgraded to 4.9, 19 hosts left in the other 3 DCs
[15:57:59] <bblack>	 volans: so in the MW/RO stuff the "merge the varnish patch This is not covered by the switchdc script
[15:58:02] <bblack>	 "
[15:58:09] <bblack>	 do you want me to prep that patch and insert a link there?
[15:58:23] <volans>	 bblack: no
[15:58:28] <volans>	 first, just updated the page
[15:58:41] <volans>	 second the switchdc pause and ask the operator to merge+puppet-merge
[15:58:53] <volans>	 the patch should be already done by jow
[15:59:04] <volans>	 I'll review the links to the patches after the meeting
[16:52:59] <godog>	 volans / bblack swift instructions LGTM, thanks!
[16:53:14] <godog>	 bblack: re: the actual time of the day for traffic (and swift?) do you have any preference?
[16:53:45] <volans>	 so no more salt in that page! \o/
[16:56:13] <godog>	 "we've successfully replaced salt with cumin" #diet
[16:56:22] <volans>	 lol
[16:58:08] <ema>	 OK, 27 upload hosts upgraded to 4.9, 12 left
[16:58:53] <godog>	 slowly but surely less and less 4.4 https://grafana-admin.wikimedia.org/dashboard/db/kernel-deployment?orgId=1&from=now-7d&to=now
[16:59:24] <ema>	 yay! see you all tomorrow
[17:00:38] <bblack>	 cya ema
[17:09:43] <wikibugs>	 10Traffic, 10DNS, 06Operations, 06Services: icinga alerts on nodejs services when a recdns server is depooled - https://phabricator.wikimedia.org/T162818#3175699 (10BBlack)
[17:25:00] <godog>	 bblack: for swift I was thinking 15 UTC, did you have a time in mind for traffic on the 18th?
[17:26:54] <bblack>	 I don't have any specific times in mind, no.  whatever works for you works for me
[17:27:45] <bblack>	 godog: are there some other steps that need to happen either before the (temporary) active/active state, or before/after the switch to codfw-only?
[17:28:01] <volans>	 godog, bblack: FYI I saw a lot of cronspam emails with: No such file or directory: '/tmp/vhtcpd.stats'
[17:28:15] <bblack>	 that's not good
[17:28:54] <bblack>	 oh, prometheus checks, probably on freshly-restarted cache machines
[17:28:56] <bblack>	 that kinda makes sense
[17:29:34] <bblack>	 the check with "systemctl is-active -q vhtcpd &&" isn't quite enough, basically
[17:29:39] <bblack>	 should check the file exists, too
[17:29:46] <godog>	 bblack: not this year no, the mw-config patch we merged last year in the middle is now the default
[17:30:07] <volans>	 from a quick check seems all 4.9
[17:30:09] <bblack>	 (as vhtcp only outputs the file every 15s, and that means it's not there for the first 15s the daemon is up)
[17:30:10] <godog>	 bblack: I've tentatively put swift at 15 and traffic at 16 (UTC) on the 18th
[17:31:21] <bblack>	 godog: if there's truly nothing else to coordinate but "do the traffic bit", we can also just consider swift to be another random cache backend service, which are all being handled together in: https://wikitech.wikimedia.org/wiki/Switch_Datacenter#Traffic_-_Services
[17:33:02] <bblack>	 (the steps are the same for all of them and swift, just right now we have swift broken out as a separate set of commits and separate steps, instead of merged up into that unified set of commits/steps for all the others)
[17:34:02] <godog>	 ok, I'm about to join SoS but I'll take a look later
[17:34:12] <bblack>	 ok
[17:35:15] <mutante>	 this is really not an important service, but i CAN switch planet1001 to planet2001 and it'd be active/active
[17:35:27] <mutante>	 i wonder if i should add that somewhere
[17:36:22] <bblack>	 mutante: on that whole topic...
[17:36:37] <bblack>	 basically, I think there are a number of different cache_misc services that are ready to do that
[17:37:11] <mutante>	 i wonder what will happen to phab and gerrit
[17:37:11] <bblack>	 I just don't have time to enumerate and validate them all (we can check mechanically if they seem to have a similar dns hostname or site.pp role in codfw, but even then that doesn't mean all issues are solved to allow it to sanely be active/active)
[17:37:54] <bblack>	 if there's some that we're *sure* are ready for active/active traffic right now today, we can set them a/a right now in cache_misc, and add them to the lists and commits for codfw-only for the switchver too
[17:38:16] <bblack>	 re: phab and gerrit, they'll just stay operating in eqiad like everything else we're not including in testing
[17:38:43] <mutante>	 ok, both phab and gerrit alreayd have 2001 counterparts but afaict they are just "almost ready"
[17:39:30] <bblack>	 right, if it's active/passive warm-standby stuff with some steps needed to handle the transition, basically we haven't planned for those to be in scope and it's too late to sort it out now, so they stay in eqiad this go-round
[17:39:36] <mutante>	 so when i listened to the talk about switchdc there was the part about the discovery zone in DNS
[17:39:44] <mutante>	 but that isnt there yet, is it?
[17:39:49] <bblack>	 it is
[17:39:59] <bblack>	 the discovery dns stuff is only for internal though, not public-facing
[17:40:26] <bblack>	 look at the bottom of the wmnet zonefile for the entries
[17:40:54] <mutante>	 oh! "disc-*" got it now
[17:41:04] <mutante>	 so i could add planet there i suppose
[17:41:15] <bblack>	 no, I don't think planet goes there
[17:41:26] <bblack>	 it's not an internal service that other internal services are consuming
[17:41:50] <bblack>	 the dns discovery stuff is all for internal service<->service traffic within .wmnet
[17:42:02] <mutante>	 aha, ok, so that is a change in misc_web varnish then?
[17:42:09] <bblack>	 right
[17:42:17] <bblack>	 in hieradata/role/common/cache/misc.yaml
[17:42:17] <mutante>	 will look at making that one
[17:42:20] <mutante>	 alright
[17:42:47] <mutante>	 ah, looks like i just add a second backend
[17:42:55] <bblack>	 you can see the example with e.g. "noc" or "pybal_config" in cache::app_directors
[17:42:58] <bblack>	 right
[17:43:02] <mutante>	 ok, cool
[17:43:29] <bblack>	 when you set that, it's active/active in a public-facing sense (as in EU users might hit the eqiad backend and Asia users the codfw backend and such)
[17:44:19] <mutante>	 ok, i just need to enable crons on both and that should be fine
[17:44:22] <bblack>	 (and some horribly-borked corner case users might flip-flop the two backends)
[17:47:43] <mutante>	 some directors have the "1001" in their name, but probably shouldnt if they have multiple backends
[17:48:04] <bblack>	 yeah :)
[17:48:39] <bblack>	 graphite is in an odd state too :)
[17:48:58] <bblack>	 but I'm basically just leaving it all alone for now, no time to sort out exactly what each backend is capable of in this sense and fix them all
[17:49:45] <mutante>	 yep, *nod* and you dont want many changes right before switch either
[17:50:30] <bblack>	 but sometime later in the quarter, after the switchback, we should probably audit all the cache_misc backends and set them up properly where it makes sense
[17:53:52] <mutante>	 yup
[17:57:35] <godog>	 bblack: yeah I think swift can be effectively be considered under 'traffic - services' but I'd argue for keeping separate reviews and puppet runs, on the basis that cache upload has immediate user visible effects if sth goes wrong
[18:00:51] <bblack>	 godog: ok that's fine
[18:01:18] <volans>	 that little special snowflake :-P
[18:02:15] <bblack>	 yeah the rest of them have user-visible impact too, but it's not much trouble to do things either way
[18:04:54] <godog>	 *nod*
[18:05:01] <godog>	 ok I'm off, see you tomorrow!
[18:05:05] <volans>	 cya
[18:40:06] <bblack>	 hmmm ores
[18:40:37] <bblack>	 I see we're doing active/active for it for inter-service stuff.  it also has a public endpoint through cache_misc at ores.wikimedia.org
[18:40:51] <bblack>	 do we want that a/a and/or switched for the switch-testing?
[19:39:11] <wikibugs__>	 10Traffic, 10MediaWiki-Cache, 06Operations, 06Performance-Team: Duplicate CdnCacheUpdate on subsequent edits - https://phabricator.wikimedia.org/T145643#3176241 (10aaron) 05Open>03declined The rebound purge is deliberate and hard to de-duplicate in any case (unless two purges came in at the same time a...
[20:11:36] <wikibugs>	 10netops, 06DC-Ops, 06Operations, 10ops-codfw: setup wifi in codfw - https://phabricator.wikimedia.org/T86541#3176312 (10RobH) p:05High>03Normal a:05faidon>03ayounsi I chatted with @ayounsi about this via IRC.  He is now aware of this pending task, though it isn't high priority.  Basically codfw ha...
[20:12:24] <wikibugs>	 10netops, 06DC-Ops, 06Operations, 10ops-codfw: setup wifi in codfw - https://phabricator.wikimedia.org/T86541#3176317 (10RobH)
[20:13:02] <wikibugs__>	 10netops, 06DC-Ops, 06Operations, 10ops-codfw: setup wifi in codfw - https://phabricator.wikimedia.org/T86541#970935 (10RobH)
[20:13:24] <wikibugs__>	 10netops, 06DC-Ops, 06Operations, 10ops-codfw: setup wifi in codfw - https://phabricator.wikimedia.org/T86541#970935 (10RobH)
[21:13:21] <wikibugs>	 10Traffic, 10DNS, 06Operations, 06Services (watching): icinga alerts on nodejs services when a recdns server is depooled - https://phabricator.wikimedia.org/T162818#3176556 (10mobrovac)
[21:44:01] <mutante>	 should get rid of those IP alerts on channel:  https://gerrit.wikimedia.org/r/#/c/347984/
[21:44:33] <mutante>	 eh, no, i need to add a parameter, but yea
[21:46:03] <mutante>	 no i don't. it's already there:) it's just plural instead of singular. should just work
[22:40:34] <wikibugs__>	 10Traffic, 10Fundraising-Backlog, 10MediaWiki-extensions-CentralNotice, 06Operations, and 3 others: Purge Varnish cache when a banner is saved - https://phabricator.wikimedia.org/T154954#3176850 (10DStrine)
[22:42:33] <wikibugs__>	 07HTTPS, 10Traffic, 06Operations, 10Wikimedia-General-or-Unknown, 07JavaScript: Use Upgrade Insecure Requests on Wikimedia wikis - https://phabricator.wikimedia.org/T101002#3176879 (10Krinkle) >>! In T101002#2500137, @BBlack wrote: >>>! In T101002#1326438, @Krinkle wrote: >> This header currently results...
[23:00:08] <wikibugs>	 10Traffic, 10DNS, 06Operations, 06Services (next): icinga alerts on nodejs services when a recdns server is depooled - https://phabricator.wikimedia.org/T162818#3176931 (10GWicke)
[23:25:03] <wikibugs__>	 10Traffic, 10DNS, 06Operations, 06Services (next): icinga alerts on nodejs services when a recdns server is depooled - https://phabricator.wikimedia.org/T162818#3177050 (10GWicke) There are big drops in *action* API backend requests and huge spikes in latency around both times:  {F7514011}  The same latenc...
[23:30:56] <wikibugs>	 10Traffic, 10DNS, 06Operations, 06Services (next): icinga alerts on nodejs services when a recdns server is depooled - https://phabricator.wikimedia.org/T162818#3177056 (10BBlack) FWIW - I did the same depooling (for reinstalls) in codfw this afternoon, and there was no impact in that case.  So this seems...