[01:43:51] 10Traffic, 10netops, 10Operations: eqiad row D switch upgrade - https://phabricator.wikimedia.org/T172459#3499584 (10ayounsi)
[03:04:53] 10netops, 10Operations: git-ssh.wikimedia.org and IPv6 are broken after switch to phab1001 - https://phabricator.wikimedia.org/T172478#3499689 (10mmodell)
[03:04:55] 10netops, 10Operations, 10Phabricator: git-ssh.wikimedia.org and IPv6 are broken after switch to phab1001 - https://phabricator.wikimedia.org/T172478#3499702 (10mmodell)
[03:05:10] 10netops, 10Operations, 10Phabricator: git-ssh.wikimedia.org and IPv6 are broken after switch to phab1001 - https://phabricator.wikimedia.org/T172478#3499689 (10mmodell)
[03:06:00] 10netops, 10Operations, 10Phabricator: git-ssh.wikimedia.org and IPv6 are broken after switch to phab1001 - https://phabricator.wikimedia.org/T172478#3499689 (10mmodell) p:05Triage>03High
[03:08:45] 10netops, 10Operations, 10Phabricator: git-ssh.wikimedia.org and IPv6 are broken after switch to phab1001 - https://phabricator.wikimedia.org/T172478#3499709 (10mmodell) Note that this is high priority but not UBN, simply because git-ssh is barely used currently. Phabricator supports git over https which is...
[03:21:29] 10netops, 10Operations, 10Phabricator: git-ssh.wikimedia.org and IPv6 are broken after switch to phab1001 - https://phabricator.wikimedia.org/T172478#3499712 (10ayounsi) a:03ayounsi
[03:32:17] 10netops, 10Operations, 10Phabricator: git-ssh.wikimedia.org and IPv6 are broken after switch to phab1001 - https://phabricator.wikimedia.org/T172478#3499734 (10bd808) Diffusion is the master repo for some sub-set of #toolforge projects. Its not a huge number of people impacted, but it is certainly non-zero....
[04:23:06] 10netops, 10Operations, 10Phabricator: git-ssh.wikimedia.org and IPv6 are broken after switch to phab1001 - https://phabricator.wikimedia.org/T172478#3499771 (10mmodell) ok so that takes care of the smtp/ipv6 issue, however, git-ssh still doesn't work. So I guess I was wrong about them being related.
[04:24:14] 10netops, 10Operations, 10Phabricator: git-ssh.wikimedia.org and IPv6 are broken after switch to phab1001 - https://phabricator.wikimedia.org/T172478#3499772 (10mmodell) 05Resolved>03Open
[04:27:31] 10netops, 10Operations, 10Phabricator, 10Patch-For-Review: git-ssh.wikimedia.org and IPv6 are broken after switch to phab1001 - https://phabricator.wikimedia.org/T172478#3499790 (10mmodell) We can revert the hard-coded smtp server IPs now: cd461e5cf761f053d453528cd26331c80ba66f17
[04:29:29] 10netops, 10Operations, 10Phabricator, 10Patch-For-Review: git-ssh.wikimedia.org and IPv6 are broken after switch to phab1001 - https://phabricator.wikimedia.org/T172478#3499792 (10Dzahn) That IP that was removed also existed on iridium before, on eth0, were i removed it: 00:23 mutante: iridium sudo ip a...
[04:49:20] 10Traffic, 10netops, 10Operations: eqiad row D switch upgrade - https://phabricator.wikimedia.org/T172459#3499804 (10Marostegui) Hi, We have some critical DB hosts on that row that would need to be either failed over or to communicate to users that a period of read-only is happening. To fail over those hos...
[04:56:26] 10netops, 10Operations, 10Phabricator, 10Patch-For-Review: git-ssh.wikimedia.org and IPv6 are broken after switch to phab1001 - https://phabricator.wikimedia.org/T172478#3499806 (10ayounsi) the git-ssh issue is due to LVS not knowing where to forward the packets. ``` ayounsi@lvs1002:~$ sudo ipvsadm -Ln TC...
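(Editorial aside on the "LVS not knowing where to forward the packets" comment above: the resolution that follows was to give the service an address inside a network the load balancer already serves, 10.64.16.0/22 in row B. A minimal sketch of that kind of address-in-subnet check, using Python's ipaddress module; only 10.64.16.0/22 is taken from the log, the other subnet and the sample addresses are placeholders for illustration.)

```python
import ipaddress

# Only 10.64.16.0/22 (row B) comes from the task comments here;
# the other subnet and the test addresses are made-up placeholders.
known_subnets = {
    "row B": ipaddress.ip_network("10.64.16.0/22"),
    "row X (placeholder)": ipaddress.ip_network("10.64.0.0/22"),
}

def rows_containing(addr):
    """Return the names of the known subnets that contain addr."""
    ip = ipaddress.ip_address(addr)
    return [row for row, net in known_subnets.items() if ip in net]

print(rows_containing("10.64.17.10"))  # inside 10.64.16.0/22 -> ['row B']
print(rows_containing("10.64.48.10"))  # in no known subnet   -> []
```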
[05:07:00] 10netops, 10Operations, 10Phabricator, 10Patch-For-Review: git-ssh.wikimedia.org and IPv6 are broken after switch to phab1001 - https://phabricator.wikimedia.org/T172478#3499811 (10Dzahn) picked a new IP in the 10.64.16.0/22 network (row B) and used that instead
[06:11:58] 10netops, 10Operations, 10Phabricator, 10Patch-For-Review: git-ssh.wikimedia.org and IPv6 are broken after switch to phab1001 - https://phabricator.wikimedia.org/T172478#3499869 (10Dzahn) Xionox restarted pybal after this .. and then: 23:04 <+icinga-wm> RECOVERY - PyBal backends health check on lvs1005 is...
[06:21:19] 10netops, 10Operations, 10Phabricator, 10Patch-For-Review: git-ssh.wikimedia.org and IPv6 are broken after switch to phab1001 - https://phabricator.wikimedia.org/T172478#3499870 (10Dzahn) We can now talk to the ssh. Tested from external, IPv4 and IPv6. There is apparently another issue with Phabricator its...
[07:23:19] 10netops, 10Operations, 10Phabricator, 10Patch-For-Review: git-ssh.wikimedia.org and IPv6 are broken after switch to phab1001 - https://phabricator.wikimedia.org/T172478#3499921 (10Dzahn) 05Open>03Resolved 00:21 < mutante> twentyafterfour: try again 00:22 < twentyafterfour> mutante: works!!!
[08:04:26] 10Traffic, 10netops, 10Operations: eqiad row D switch upgrade - https://phabricator.wikimedia.org/T172459#3500004 (10elukey) A couple of notes from my side after reading the host list: Analytics: 1) all the analytics* host in row D down shouldn't be an issue for a brief amount of time since the Hadoop clus...
[09:50:21] 10Traffic, 10Analytics-Cluster, 10Analytics-Kanban, 10Operations, 10User-Elukey: Encrypt Kafka traffic, and restrict access via ACLs - https://phabricator.wikimedia.org/T121561#3500371 (10elukey) >>! In T121561#3323871, @Ottomata wrote: > We should do some work to understand how ACLs work and what ACLs f...
[09:53:06] 10Traffic, 10Analytics-Cluster, 10Analytics-Kanban, 10Operations, 10User-Elukey: Encrypt Kafka traffic, and restrict access via ACLs - https://phabricator.wikimedia.org/T121561#3500389 (10elukey) > Note that this plan doesn't yet consider encryption of traffic between Kafka and Zookeeper. Should we? We'...
[10:03:02] 10Traffic, 10Analytics-Cluster, 10Analytics-Kanban, 10Operations, 10User-Elukey: Encrypt Kafka traffic, and restrict access via ACLs - https://phabricator.wikimedia.org/T121561#3500417 (10elukey) @Ottomata should we keep this task open given that we already have https://phabricator.wikimedia.org/T166167 ?
[13:26:23] 10Traffic, 10Analytics-Cluster, 10Analytics-Kanban, 10Operations, 10User-Elukey: Encrypt Kafka traffic, and restrict access via ACLs - https://phabricator.wikimedia.org/T121561#3500872 (10Ottomata) Let's keep it open and use this task to track actually enabling TLS / ACLs for different clients.
[14:04:26] heh so the non-zero plan has a flaw on the v6 side...
[14:05:35] they're documented (to zero) in two places:
[14:05:37] https://office.wikimedia.org/wiki/Wikipedia_Zero_Destination_IP_Addresses
[14:05:44] https://www.mediawiki.org/wiki/Wikipedia_Zero/IP_Addresses
[14:06:10] the old set on office had the text/multimedia still split, but the newer one there and the one on mediawiki.org show the full set (no multimedia diff)
[14:06:18] and they use the whole /64 for v6 :P
[14:06:57] we actually own the enclosing /46's, but it means we need to do this a bit differently to support ipv6 as well
[14:07:22] err, the enclosing /46 (singular)
[14:07:31] which is 2620:0:860::/46
[14:08:01] but zero has 4x /64 ranges at 2620:0:86[0123]:ed1a::/64
[14:10:02] I've always assumed "ed1a" was chosen to look like "edia" as in wiki[mp]edia
[14:11:01] we could stick with the strange naming and pick another for non-zero LVS that's similar, maybe ed14 since it's nearby, making fewer holes in our long-term addressing there
[14:11:17] or ed1b if we want to keep them really tight
[14:13:04] internal bikeshedding conclusion: ed14 :)
[14:13:47] ed1a was indeed to look like edia
[15:56:08] 10netops, 10Operations, 10fundraising-tech-ops: bonded/redundant network connections for fundraising hosts - https://phabricator.wikimedia.org/T171962#3501481 (10Jgreen) >>! In T171962#3492728, @mark wrote: > No objections from me. It does add complexity somewhat and will probably add some failure modes wher...
[15:59:58] 10Traffic, 10Operations, 10ops-ulsfo, 10Patch-For-Review: replace ulsfo aging servers - https://phabricator.wikimedia.org/T164327#3501493 (10RobH)
[16:15:56] :)
[16:58:31] bblack: to look at upgrading varnish to varnish5, should the vm be Debian 8 or 9?
[17:20:48] good question :)
[17:21:00] it would probably be simpler to target 8 for now to minimize the diffs while working on it
[17:21:13] we will want to port the package to 9 later, but that might involve other minor diffs that just get in the way at this point
[17:24:31] (but I could see the opposite argument if you want to go for it: port the varnish4 package to 9 first in your 9 VM and get that working, then work on the shift to Varnish5)
[17:25:21] in theory the 8/9 port is trivial, but nothing ever ends up trivial with Varnish. There will probably be something dumb like a change in the default owners/perms of some obscure directory in /var and it will break Varnish and take a week to track down the cause or whatever :P
[17:32:25] bblack: any idea what the following means: "Error 400 on SERVER: undefined method `[]' for nil:NilClass at /etc/puppet/modules/role/manifests/cache/kafka.pp:6 on node traffic-varnish5.traffic.eqiad.wmflabs "
[17:32:27] ?
[17:33:18] $kafka_config = kafka_config('analytics')
[17:33:22] is the line
[17:33:53] modules/role/lib/puppet/parser/functions/kafka_config.rb defines the function
[17:33:59] I've never looked at this bit tbh heh
[17:34:34] it ultimately is drawing from hieradata though, and then putting it in some special output form
[17:34:39] yeah, and the error message isn't very verbose :)
[17:34:54] I'm probably missing some hiera lines indeed
[17:35:26] something to do with the $::site being eqiad I think
[17:35:32] you probably have to fake that in various places
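(Editorial aside, circling back to the Zero v6 addressing discussed earlier in the log: a quick sanity check that the four existing ed1a /64s, and a candidate like ed14, all sit inside the enclosing 2620:0:860::/46. A minimal sketch with Python's ipaddress module; the ed14 prefixes here are only the naming idea floated above, not anything actually allocated.)

```python
import ipaddress

enclosing = ipaddress.ip_network("2620:0:860::/46")

def zero_style_ranges(suffix):
    # 2620:0:86[0123]:<suffix>::/64, matching the shape of the existing ed1a ranges
    return [ipaddress.ip_network(f"2620:0:86{i}:{suffix}::/64") for i in range(4)]

# ed1a = the existing Zero ranges; ed14 = the candidate picked for non-Zero LVS
for net in zero_style_ranges("ed1a") + zero_style_ranges("ed14"):
    print(net, "inside", enclosing, "->", net.subnet_of(enclosing))
```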
[17:35:49] bblack: is it possible to see _all_ the hiera variables assigned to a prod host?
[17:36:10] probably?
[17:36:24] maybe with utils/hiera_lookup
[17:36:45] hmm it requires a key though
[17:37:37] this is why I hate trying to work on complex magic like the cache cluster roles from a labs VM heh
[17:37:51] you might try looking at hieradata/labs.yaml, at all the related bits set there to make the beta-cluster text/upload caches work
[17:38:05] under "Cache layer stuff"
[17:38:30] but I bet what's missing is $::site = "eqiad", or some equivalent/derivative
[17:38:51] that seems to be what that kafka_config() stuff is keying on in this case
[17:52:59] bblack: I think it's half puppetized now, but varnish fails to start (and it's needed to finish the puppet run)
[17:57:18] bblack https://www.irccloud.com/pastebin/lj2kddFh/
[18:15:26] XioNoX: so the -sfile size=6G is too big thing
[18:15:31] (in your paste)
[18:15:47] modules/role/manifests/cache/base.pp
[18:15:57] $storage_size = $::hostname ? {
[18:15:57] /^cp1008$/ => 117, # Intel X-25M 160G (test host!)
[18:16:02] [...]
[18:16:05] default => 6, # 6 is the bare min, for e.g. virtuals
[18:16:49] so you could add a stanza there in the puppetization to cover your hostname with a smaller value. or get more storage on /srv/sdX if it's easy. or change the prod low-end default for virtuals to be even lower
[18:17:45] for ease-of-puppetization, the normal text cache role always creates two storage files, in places controlled by hieradata "storage_parts"
[18:17:56] on prod that's /srv/sda3 + /srv/sdb3 (2x SSDs)
[18:18:10] for beta-cluster it's:
[18:18:13] role::cache::base::storage_parts: - vdb - vdb
[18:18:14] bblack: where is that variable on the varnish side (just out of curiosity)
[18:18:22] (both on same filesystem)
[18:18:39] so the 6G min means 2x 6G files = 12G of space you'd need, if you want to solve it by adding more virtual disk space somewhere
[18:18:53] XioNoX: it's in the systemd unit file after puppet is done templating it
[18:19:01] ahh, I see, I was like "the disk has more than 6G of free space!"
[18:19:08] /lib/systemd/system/varnish.service
[18:19:32] there will be arguments like -sfile=/vdb/varnish.main1,size=6G or whatever
[18:19:36] a pair of them probably
[18:19:46] bblack: hum " varnishd[31629]: Error: (-sfile) size "2G": larger than file system"
[18:20:16] make it smaller, you don't have to use G as the unit either
[18:20:20] M works too
[18:20:40] you may need to manually delete files it has already created and failed on, too
[18:20:48] well in puppet I can't pick the units
[18:21:03] right, I meant if you're editing the unit file manually just to confirm what you want to puppetize
[18:22:19] I wonder if it takes floating point, e.g. 0.2G
[18:22:38] would be easier than verifying we can move the suffix to $storage_size and not mess up cache_upload's file-split calculations
[18:23:25] is it somewhere else than /lib/systemd/system/varnish.service ?
[18:23:41] I edited both values to 50M and I'm still getting the 2G error
[18:24:36] after editing a unit file, you have to do "systemctl daemon-reload" for it to take effect, before trying to start/restart/whatever
[18:25:03] (after manual editing that is. puppet takes care of it in the puppetized cases)
[18:25:28] ah! I think it's starting
[18:25:41] I have to run though, will look at it more later
[18:25:44] "FATAL Cannot get nodes from SRV records lookup _etcd._tcp.traffic.eqiad.wmflabs: no such host"
[18:25:51] that one doesn't look hard to fix
[18:25:55] hah
[18:26:01] well...
[18:26:02] thank you!
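(Editorial aside, to make the disk math from the exchange above concrete: with the default virtual storage_size of 6 (GB) and a beta-style storage_parts of two entries on the same disk, varnishd gets two 6G file-storage arguments, so the shared filesystem needs at least 12G free. A rough sketch of that calculation; the path and -s argument shape are paraphrased from the conversation, not copied from the real puppet template.)

```python
# Rough sketch of how storage_size + storage_parts turn into varnishd storage
# arguments and total disk needed. Values come from the conversation above; the
# exact -s argument format and paths are illustrative paraphrases only.
storage_size_gb = 6                 # "default => 6" for virtuals in base.pp
storage_parts = ["vdb", "vdb"]      # beta-cluster hiera: both files on the same disk

args = [
    f"-s file,/srv/{part}/varnish.main{i + 1},{storage_size_gb}G"
    for i, part in enumerate(storage_parts)
]
total_gb = storage_size_gb * len(storage_parts)

print(" ".join(args))
print(f"needs >= {total_gb}G free on the shared filesystem")
```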
[18:26:34] XioNoX: for later: probably easiest to just disable the etcd stuff, like we do for deployment-prep
[18:26:54] XioNoX: in hieradata/labs.yaml: varnish::dynamic_backend_caches: false
[18:26:55] noted
[18:27:03] oh great!
[19:10:16] bblack: is there doc on how to build varnish with our packages?
[19:10:24] er with our patches
[19:14:24] I guess that's a good start https://wikitech.wikimedia.org/wiki/Varnish#Package_a_new_Varnish_release
[19:16:08] XioNoX: yeah I think that's from ema's work during the 3-to-4 transition
[19:16:37] I'd avoid pushing anything back up to gerrit anywhere until you two sync up on it, some things are hard to undo and the branching for deb packages here is complicated
[19:17:00] yeah :)
[19:20:27] if the patches don't apply well to 5 (I doubt they do!), something to keep in mind is that the current patch 0002-exp-thread-realtime.patch is basically-unused and might be skippable
[19:20:40] it was an idea we were trying out to work around an issue, and it didn't ultimately help
[19:29:57] okay, added to my notes
[19:32:10] bblack: where are our patches btw? I can't find the repo