[08:08:55] I'll reimage bast1002 starting in ~ half an hour, please switch to a different bastion in the mean time [08:10:12] i assume thats prod? [08:14:13] yeah, that's one of the bastions for prod access, nothing changes for accessing Cloud VPS of Toolforge [08:14:39] one alternative is to switch to bast2002.wikimedia.org [08:16:58] thanks just wanted to confirm :), good luck [08:35:23] starting now [09:29:03] moritzm: I just opened T275599 [09:29:04] T275599: debmonitor: returns proxy error when user is in too many groups - https://phabricator.wikimedia.org/T275599 [09:29:31] ack, thanks. didn't get to it on Friday, but will merge a fix in the next days [09:30:09] great, thanks! [09:31:44] bast1002 is up again [09:37:05] \o/ [09:37:08] thanks! [09:42:19] took the opportunity to update the ssh config and use ProxyJump instead of ProxyCommand, looks cleaner now :) [10:23:01] woo hoo [12:02:11] dcaro: now that you mentioned it I did the same, thanks [12:40:53] not finished yet, but that's a good beginners introduction podcast to BGP, peering/transit, etc https://blog.ipspace.net/2020/06/bgp-navel-gazing.html [15:08:53] puppet style question: is it allowed to use lookup() in puppet functions? [15:09:55] kormat: no, see https://wikitech.wikimedia.org/wiki/Puppet_coding#Hiera [15:09:55] AIUI the only place lookup() is technically allowed is in the arguments given to profile classes [15:09:59] CI should also vote -1 [15:10:03] there are many exceptions in the codebase ofc [15:10:09] IIRC [15:10:41] (and said exceptions mostly have the wmfstyle linter disabled in a line comment) [15:10:46] mmph. so instead i'll need to put the lookup in every profile that calls the function, and pass in the hash as a param [15:11:04] or make an exception ;) [15:11:08] that partially defeats the purpose of making a function to not repeat this code multiple times [15:11:33] IMO don't let the style guide get in the way of doing the right thing when it makes sense [15:11:43] kormat: what's the use case? [15:11:46] cdanis: it's puppet, the only right thing involves napalm [15:12:17] kormat: we tried that with Arzhel, didn't work well with netbox how we wanted :-P (https://github.com/napalm-automation/napalm ) [15:12:34] volans: https://phabricator.wikimedia.org/T275497#6856476. i'm defining a hash in hiera that contains an entry per section, with 2 parameters in each entry [15:13:00] i'd like to have a few small functions to do lookups in the hiera hash and provide a simple answer to the caller [15:14:59] e.g. an `is_in_writeable_dc` function. [15:14:59] there's some precedent of that with the services hiera block [15:15:28] we do have some cases of lookupvar called in wmflib fwiw, I'd 302 to jb.ond (but he's out today) [15:16:24] volans: i don't know lookupvar. can it do "look up this variable, if the value is mw_primary, then return the value of mediawiki::state('primary_dc')"? [15:19:05] not by itself ofc, and to have them in scope you still need to pass them I think so maybe not useful [15:21:49] you're right, getting my hopes up like that was just foolish [15:22:51] Now, i know nothing about puppet, but just an idea, what if you looked up the mw_primary and mediawiki::state('primary_dc'), then combined that into some sort of string or something in a function, and then used the function for whatever you needed? 
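A minimal sketch of what that style-guide-compliant shape could look like: the profile does the lookup() and the function only ever sees the hash it is handed. The is_in_writeable_dc name comes from the discussion above; the namespaces, the hiera key, and the readonly_dc key inside the per-section hash are assumptions for illustration only.

```puppet
# Hypothetical function: answers "is this section writeable in this DC?"
# purely from the hash it is given, so no lookup() is needed inside it.
function wmflib::mariadb::is_in_writeable_dc(
    Hash   $section_config,
    String $section,
    String $site,
) >> Boolean {
    # 'readonly_dc' is an assumed key in the per-section hash
    $section_config[$section]['readonly_dc'] != $site
}

# The calling profile is the only place where lookup() happens:
class profile::mariadb::example (
    Hash $section_config = lookup('profile::mariadb::section_config'),
) {
    $writeable = wmflib::mariadb::is_in_writeable_dc($section_config, 's1', $::site)
}
```

This keeps the hiera access where CI and the style guide expect it, while the repeated decision logic still lives in exactly one place.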
[15:22:56] kormat: you said puppet and simple in the same sentence, that's just way too hopeful in itself [15:24:32] feel free to ignore me, if what I said is impossible or entirely not what you needed [15:24:59] RhinosF1: πŸ’― [15:25:51] kormat: puppet is an amazing tool when it works but getting there normally leaves me wondering if I'm speaking the right language by the end [15:29:00] that is true of every configuration management tool I've ever used [15:30:03] <_joe_> yeah no one invented one that doesn't suck [15:30:13] <_joe_> and I frankly think it's basically impossible to do [15:30:16] the problem space doesn't allow a good solution [15:31:12] <_joe_> also puppet became much easier to write (and less to debug) over the last few years [15:31:29] do you know if puppet is broken on alert1001? [15:31:45] I got an "Error 500 on SERVER" [15:32:12] modules/monitoring/functions/build_notes_url.pp, line: 22, column: 13 [15:32:26] <_joe_> kormat: I'll take a look at your problem in a few [15:32:30] https://puppetboard.wikimedia.org/node/alert1001.wikimedia.org [15:32:40] since 14:05:10 [15:32:42] <_joe_> I think I had that problem already, and somehow solved it basically [15:33:01] Error while evaluating a Function Call, The $dashboard_links and $notes_links URLs must not be URL-encoded (file: /etc/puppet/modules/monitoring/functions/build_notes_url.pp, line: 22, column: 13) (file: /etc/puppet/modules/profile/manifests/mediawiki/alerts.pp, line: 46 [15:34:01] <_joe_> effie: ^^ [15:34:33] not sure how easy it is to add a spec test for that, but, it should have been caught by pcc [15:34:35] sigh, when it wasnt url encoded, it didn' like it [15:34:41] when it is, it still doesnt like it [15:35:02] which is the patch, I may be blind, but don't see it [15:35:11] volans: I will push a patch to fix it [15:35:17] sorry I didn't see it [15:35:21] mybad [15:35:32] np [15:35:37] is it this? https://gerrit.wikimedia.org/r/c/operations/puppet/+/666614/3/modules/profile/manifests/mediawiki/alerts.pp [15:37:19] O, I see, I was looking at the function called, not the caller [15:37:40] s/function/resource/ [15:38:43] jynus: yes that is the patch tha makes this complaint [15:42:50] <_joe_> kormat: so you want to define the data structure in hiera, correct? [15:43:08] <_joe_> or well, in a specific place in puppet [15:43:16] <_joe_> and retrieve it from a function [15:43:24] <_joe_> we have precedent, IIRC, let me find it [15:43:48] _joe_: the service catalog functions :) [15:44:25] <_joe_> cdanis: it's a bit different in terms of usage, but yes [15:44:45] <_joe_> I was thinking of https://github1s.com/wikimedia/puppet/blob/HEAD/modules/role/lib/puppet/parser/functions/kafka_cluster_name.rb [15:45:29] <_joe_> (that will need to be moved to call lookup btw) [15:47:08] <_joe_> cdanis: the catalog functions use loadyaml() actually [15:47:16] hah [15:47:32] <_joe_> see https://github1s.com/wikimedia/puppet/blob/HEAD/modules/wmflib/functions/service/fetch.pp [15:47:33] Um, what is the equivalent of `racadm config -g cfgServerInfo -o cfgServerFirstBootDevice PXE` with modern idracs? That command doesn't work anymore. 
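Returning briefly to the lookup()-in-functions thread: the service-catalog precedent linked just above avoids the problem in a different way, by having the function read a data file directly with loadyaml() (a puppetlabs-stdlib function) instead of going through hiera at all. A rough sketch of that pattern follows; the function name, file path, and empty-hash default are invented for illustration.

```puppet
# Hypothetical accessor in the style of wmflib::service::fetch: load the
# per-section data straight from a YAML file on the puppetmaster.
function profile::mariadb::sections() >> Hash {
    # Assumed path; loadyaml() returns the second argument if the file is
    # missing or unparseable.
    loadyaml('/etc/puppet/hieradata/common/profile/mariadb/sections.yaml', {})
}
```

Whether that beats passing the hash in as a parameter is a matter of taste, but it does keep lookup() out of the function body.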
[15:47:52] <_joe_> klausman: I think we have updated instructions on wikitech [15:48:07] <_joe_> you just have to find the right revision of the platform specific docs [15:48:21] https://wikitech.wikimedia.org/wiki/Platform-specific_documentation/Dell_Documentation#Reboot_and_boot_from_network_then_console shows the old stuff still [15:48:43] <_joe_> oh that's been completely reorganized [15:48:56] Great! Where? [15:49:05] <_joe_> I don't know! [15:49:09] haha [15:49:24] <_joe_> I meant the platform-specific docs have been reorganized, we used to have multiple pages for dell hardware [15:51:09] godog: [15:51:13] And dell's docs that I can find are either a) vague or b) paywalled. [15:51:18] <_joe_> yeah [15:51:19] <_joe_> :/ [15:51:37] I wonder if my .edu can get past some of the paywalls [15:51:42] <_joe_> klausman: I'd go ask in #-dcops [15:53:21] done. [15:56:14] effie: "you called?" [15:57:03] godog: yes I am waiting on a pcc [15:57:07] give me 1s [15:58:51] I am wondering if this is a CI issue [15:58:56] I am testing running pcc for https://gerrit.wikimedia.org/r/c/operations/puppet/+/666663/ [15:59:33] and I am still getting https://puppet-compiler.wmflabs.org/compiler1002/28199/alert1001.wikimedia.org/prod.alert1001.wikimedia.org.err [15:59:43] which is the error I am trying to fix [16:02:54] interesting, I don't understand it either [16:06:24] I think it is a CI problem than an actual one [16:06:50] are we sure that will actually happen, and it is not marking that is the "old" error? [16:07:38] I'd say to merge as is to check, and then refine [16:08:01] or with the previous value, up to you [16:12:44] if godog says yes [16:12:46] I will do so [16:12:50] ofc [16:12:52] effie: that's the production error. your patch (666663) looks like it resolves the error. [16:13:14] shdubsh: since you are here, then I will merge [16:13:21] +1 [16:13:29] awesome, thanks ! [16:13:41] effie: as shdubsh says you were looking at the wrong file -- https://puppet-compiler.wmflabs.org/compiler1002/28199/alert1001.wikimedia.org/change.alert1001.wikimedia.org.err verifies it is fixed [16:14:19] sorry, I didn't notice from the link earlier [16:14:39] I used the link that pcc from my cmd gave [16:14:41] mmmm [16:14:45] yes [16:14:50] and then you clicked 'production errors/warnings' [16:14:55] not 'change errors/warnings' [16:14:57] :) [16:14:58] ah ! [16:15:15] maybe "before" and "after" are better names [16:15:15] I didn't noticed I clicked production [16:15:26] yeah yeah, too much ado for nothing [16:15:28] thank you [16:16:17] I am running puppet on alert1001 now [16:16:42] jynus: I am running too [16:16:48] oh [16:16:51] probably mine will wait for yours [16:16:56] FWIW, Icinga does its own url mangling which is why that gate exists. A `%20` in the url would be itself url-escaped. [16:17:30] lots of stagged changes applying now [16:17:49] Profile::Mediawiki::Alerts including [16:17:49] check_prometheus rules are also, like, three layers of indirection of quoting, and that is also somewhat unavoidable :/ [16:18:02] yay! [16:18:12] thank you, effie, it worked [16:18:20] and others that helped too [16:18:53] shdubsh: thanks [16:19:16] check icinga, there may be changes that may not be applied for some time [16:23:06] _joe_: hey, so I have a question about the partman recipe for kubernetes nodes (partman/custom/kubernetes-node.cfg). Is it *meant* to be semi-manual? 
[16:24:46] (that's my theory based on the fact that it doesn't create any filesystems/mountpoints) [16:36:28] <_joe_> klausman: 301 to jayme (in a meeting) [16:36:58] * jayme looks up [16:37:36] 408 [16:38:12] Do you *want* me to spam requests until one goes through? :) [16:38:34] eheh, nono. I'll have a look [16:39:29] thanks :) [16:42:09] is there anything specific you are missing? [16:42:52] It should create / ofc and all the docker volumes will be created by docker in lvm directly [16:44:49] please bear with me as I don't really speak partman [16:47:12] hmm..but compared to other files is indeed does not look as if it would create a root-fs. I wonder why it did the last time we set up nodes... [16:47:42] did you actually try klausman? And ended up without root-fs? [16:48:34] No, I just hit return and it seems to have made a good install [16:50:39] so you needed to hit return in the installer interface? [16:51:01] <_joe_> yeah I think we had that issue originally, "to be fixed" later [16:51:11] <_joe_> and apparently akosiaris never did [16:51:19] Alex has mentioned that this may be due to the machines getting Buster, not Stretch. [16:51:28] <_joe_> also that, yes [16:51:41] That in turn is a bit of a bigger topic, since for AMD GPUs, using an ancient kernel is not so greatβ„’ [16:52:07] <_joe_> oh I think we should move to buster too ftr [16:52:18] <_joe_> although buster has already an ancient kernel overall :) [16:52:21] ah, okay. Yeah...we're not using buster currently - unfortunately :/ [16:52:37] _joe_ oh it was fixed, if you reimage a kubernetes node now it's hands off [16:52:48] but apparently it doesn't work on buster [16:53:00] <_joe_> heh [16:53:07] <_joe_> good grief, partman [16:53:31] <_joe_> klausman: thankfully we have one of the greatest partman experts worldwide in our ranks [16:53:42] * _joe_ stares at kormat [16:53:44] you're being really cruel to her today _joe_ [16:53:50] Oh I've been talking to her behind the scenes already [16:53:53] <_joe_> cdanis: this is *true* [16:54:02] some truths are better left unsaid [16:54:05] <_joe_> cdanis: there are 3 people who understand partman in the world [16:54:10] <_joe_> 2 are the authors [16:54:51] <_joe_> cdanis: the mtail thing was a cheap joke, but this is actual admiration [16:55:48] * jayme returns to grafana clicking [16:57:40] jynus: FYI as I'm not sure if you're aware, database-backups-snapshots.service is in failed state on cumin1001 [16:59:23] kubernetes -node.cfg btw does create filesystems and mountpoints. See https://github.com/wikimedia/puppet/blob/production/modules/install_server/files/autoinstall/partman/custom/kubernetes-node.cfg#L29 [16:59:27] ERROR - Backup process completed, but some backups finished with error codes, it triggers the systemd check, not sure if it's the desired behaviour [17:00:31] I 'd happily never have to deal again with partman for the rest of my life though [17:00:42] * akosiaris sorry kormat :-( [17:02:24] πŸ₯€ [17:06:58] πŸ§‘β€πŸŒΎ [17:07:17] the amount of emojis that exist continues to amaze me [17:07:41] I still like my ascii emotes [17:07:49] also, no way all these are used only the way they were justified as [17:08:34] no they arent ;) [17:41:31] <_joe_> def not :D [17:42:54] what te heck does the "farmer emoji" have to do with anything? [17:51:38] farmers-only :P [17:58:25] <_joe_> 🀌 [18:16:14] apergos: farmer emoji == all tech is terrible; time to change career and be a farmer instead [18:16:23] hahahahahaha [18:16:32] because the farmers are doing so well career-wise... 
[18:19:13] when i had the "all tech is terrible" moment my alternative job was always "zookeeper", you know, feed the penguins sounded better than farming [18:23:56] nah it should always be a goat [18:24:40] you wouldn't be limited to one or the other [18:28:13] I gotta use sprintf() or something in puppet to add leading zeros to Integers, if I want to use them in systemd calendar events/timers: [18:28:20] '/usr/bin/systemd-analyze calendar *-*-1 0:0:00' returned 1: Failed to parse calendar specification ' *-*-1 0:0:00': [18:28:52] turns into a puppet error [18:28:53] Original form: *-*-1 0:0:00 [18:28:53] Normalized form: *-*-01 00:00:00 [18:30:01] mutante: are you sure it isn't the leading space before the first asterisk? [18:30:07] when running the systemd-analyze command locally, it just normalizes it and still understands "from now: 4 days left" [18:30:19] in puppet code.. it fails [18:30:21] '*-*-1 0:0:00' works for me but ' *-*-1 0:0:00' doesn't [18:31:01] rzl: ooh.. yes, looks like it [18:31:14] thanks, let me try fixing that [18:33:46] still requires something ugly to allow for either leaving $weekday completely undefined or defining but and get the spaces right [18:34:19] but was already helpful to say it out loud [18:34:43] join() with a space delimiter, no? [18:34:51] but, sweet, glad it worked πŸ‘ [18:39:42] hello hello, anybody knows what's happening to restbase? [18:40:34] it seems throttling busts of wikitext to html [18:42:52] I am very ignorant, does restbase call parsoid? I don't see pressure from the RED dashboard [18:45:04] if the page doesn't have a current rendered revision then I expect parsoid would be called one way or another [18:45:12] *revision stored in restbase that is [18:45:55] and I assume there would be an attempt somewhere to check the parser cache to see if something good is in there first before trying a re-render but I don't know what the flow of that would be [18:47:46] I am checking logstash for the throttling :) [18:48:25] as we rememmber from a few days ago restbase will retry a few times on failure [18:48:31] so there's that as well [18:50:01] seems all traffic for bcl.wikipedia.org, with VisualEditor as UA [18:52:36] an example is https://logstash.wikimedia.org/app/discover#/doc/logstash-*/logstash-restbase-2021.02.24?id=H0tT1XcBsCn0xdb8Djvf [18:55:27] now, how to get to external ips is not clear to me [18:55:40] (without waiting for analytics webrequest data) [18:55:51] I checked sampled-1000 on centrallog but cant find much [18:56:05] why is the root req url from another wiki [18:56:45] translation tools? [18:58:28] huh [18:58:49] yeah no idea the difference [18:59:09] also the UA is Visual Editor [19:00:13] does VE have some translation aid to load an article from one wiki for translation to another? I know that seems odd, but [19:00:32] no idea [19:00:56] need to go now sorry, will check later (the impact was brief, nothing ongoing) [19:04:14] yeah 20 minutes of quite still, so maybe we're ok [19:04:26] *quiet [19:04:44] I'm going to drift off too as it's definitely my late evening here [19:04:46] this is the second time in recent memory that I've incidentally noticed these restbase-involved ::ffff:10.64.0.100 client IPs [19:05:01] https://www.mediawiki.org/wiki/Content_translation ? [19:05:21] I still think they seem 'wrong', and I'm not sure why they're in that form. 
it's likely a mysterious misconfiguration, and it could have some real impact [19:05:43] I wonder how the rate gets limited anyways, on what basis [19:05:52] dang it I was going to drift off :-P [19:06:23] (because we probably have tooling that parses XFF and client IPs (maybe even for ratelimiting?) and doesn't recognize these IPs as legitimately-internal the way they would the corresponding true IPv4 or the appropriate standard wmf-mapped ipv6) [19:07:12] yeah tat would be not great [19:07:34] these look the result of something using ipv4 internal source address to reach a server's port, which is answering an ipv4 request using an ipv6 listening socket without IPV6_ONLY, and thus getting this auto-mapped fake IPv6 IP that we don't normally see in our infra and don't have revdns or hieradata about these IP ranges, etc [19:08:07] I think "::ffff:10.64.0.100" is what the client IP would appear to be in such a case [19:09:40] same curiosity as last time: [19:10:33] restbase1019 has 10.64.0.100 + 2620:0:861:101:10:64:0:100 on eno1 [19:10:43] dns only knows the ipv4 forward+rev, not the mapped-ipv6 [19:11:01] actually, rb1019's eno1 has 3 other ipv4s too [19:11:13] almost looks like an LVS service IP setup, but they're not on the loopback [19:11:56] there's 10.64.0.10[123] IPs for restbase1019-a, restbase1019-b, restbase1019-c [19:12:00] all with /32 masks... [19:12:22] I don't know how much that is really part of the novel mystery vs just some standard well-understood part of its setup [19:12:29] is that some artifact of however Cassandra is sharded for restbase? [19:13:09] it sounds like it, but why eno1 IPs with /32 masks? [19:14:25] really out of my area of knowledge... but. maybe hnowlan (probably also off though) would have a clue about the restbase piece of things [19:14:29] answering myself: maybe to make sure it doesn't make outbound connections with those IPs [19:14:50] but in that case, it might've been better to define them on loopback like the LVS case [19:15:00] it sounds like maybe we need a refresher SRE session on our restbase setup :) [19:15:21] there are't any presentations on these details are there? I think [19:15:39] I'm not sure, but I do know that I barely know the basics [19:15:43] same [19:16:11] well hugh, if you're around later and read this scrollback, wanna present at a meeting? :-) [19:16:24] well so far I can't even quickly grep up what mechanism in puppet is even creating those IPs [19:17:20] bblack: https://netbox.wikimedia.org/ipam/ip-addresses/?q=restbase [19:19:17] restbase1019.eqiad.wmnet huh [19:20:09] yeah I found it in puppet now [19:20:35] modules/cassandra/manifests/instance.pp -> $instance_rpc_address [19:20:41] which uses interface::alias [19:21:22] why it's setup like that is mostly historical reasons that don't hold anymore [19:21:25] which adds secondary IPs to the primary interface using a fixed /32 or /128 mask as appropriate [19:21:33] as inbound-only that aren't selected for outbound traffic [19:21:40] so that part all makes some kind of sense [19:21:45] and probably isn't part of the problem here [19:21:50] I forgot if it got documented somewhere when we did the Netbox import and tried to remove those odd ducks [19:22:32] instead of have one service per port, they do one per IP [19:24:34] ah ok, I think I found the ::ffff: part, and as I suspected last time around, it's envoy [19:24:42] oh? 
[19:24:51] in modules/envoyproxy/manifests/tls_terminator.pp : [19:25:04] # @param listen_ipv6 [19:25:05] # Listen on IPv6 adding ipv4_compat allow both IPv4 and IPv6 connections, [19:25:07] # with peer IPv4 addresses mapped into IPv6 space as ::FFFF: [19:25:26] bblack@haliax:~/repos/puppet$ git grep 'listen_ipv6: true' [19:25:26] hieradata/role/common/idp.yaml:profile::tlsproxy::envoy::listen_ipv6: true [19:25:29] hieradata/role/common/idp_test.yaml:profile::tlsproxy::envoy::listen_ipv6: true [19:25:32] hieradata/role/common/parsoid/testreduce.yaml:profile::tlsproxy::envoy::listen_ipv6: true [19:25:35] hieradata/role/common/restbase/dev_cluster.yaml:profile::services_proxy::envoy::listen_ipv6: true [19:26:15] the problem with this mode of listening for v4+v6 on one socket, as our envoy can apparently be configured to do [19:26:40] is it's going to produce these fake ipv6 client/source IPs that the rest of our infra doesn't expect in various network ACLs or XFF-parsing or whatever-else [19:26:43] I'm not yet caught up with backlog, bblack if you need some context on the cassandra IP setup I can help [19:26:57] volans: no I think I got that part, it's not the issue [19:27:05] as I had to make netbox work for it (as it's an exception) [19:27:18] and proposed also patches to fix the netmask on the host, but not deployed them yet [19:27:44] what's the issue? [19:27:48] volans: I think the /32 netmasks on the hosts are probably correct per the intent (whether the design intent is right is out of scope I guess) [19:28:08] because that prevents the host from using those alias IPs as source addreses for random outbound connections [19:28:56] is anything related to T253173 ? [19:28:57] T253173: Some clusters do not have DNS for IPv6 addresses (TRACKING TASK) - https://phabricator.wikimedia.org/T253173 [19:29:07] restbase are not v6 ready AFAIK [19:29:11] possibly, tangentially [19:29:30] restbase are in the class of hosts which have mapped-ipv6 defined on their $interface_primary, but do not have it delcared in DNS [19:29:33] *declared [19:29:47] but I don't think that particular quirk is causing this [19:30:00] it's intended [19:30:04] they are not v6 ready [19:30:08] the quirk we're looking at here is this: [19:30:44] restbase is connecting to some other service which intentionally only has an IPv4 service address, like our lvs'd ones that are all in 10.2.2.x or whatever [19:31:13] but that service is configured with an envoy listener, and that envoy listener only has an IPv6 listen socket, which is configured to accept traffic from both ipv4+ipv6 [19:31:30] ok [19:31:33] so when any client connects to this service, they'll use Ipv4 because there's only an IPv4 service address [19:32:04] but on the envoy side, it gets received on a universal-style ipv6 listen address, and thus the client IP gets recorded as ::ffff:a.b.c.d [19:32:25] which is not the client host's actual ipv4 nor its ipv6, nor will it match various ACLs embedded all over our infra, etc... [19:33:07] got it [19:33:29] brb, quick meeting, but I'm not even sure if this is causing any other problem at present [19:33:34] I've just run into it twice, and it smells fishy [19:36:34] :) [20:10:04] question about decommissioning a host: cescout1001 -- a physical host and not a VM because of the disk requirements for the database replica that we were syncing -- is no longer required because the database updates have been deprecated; [20:10:41] so I am thinking of moving this to a VM. 
question: since I have never done this before, the process includes opening a ticket and then running the decom cookbook. correct? [20:11:34] sukhe: yes, that is correct [20:11:39] there is a template for that kind of ticket [20:12:21] it starts with transition on https://wikitech.wikimedia.org/wiki/Server_Lifecycle#Server_transitions [20:12:36] mutante: thanks, yes I saw! on the puppet side, I noticed the dry-run output in the decom cookbook says, "DRY-RUN: Removed from Puppet master and PuppetDB [20:12:39] " [20:12:53] from there you get to https://wikitech.wikimedia.org/wiki/Server_Lifecycle#Remove_from_production [20:13:05] on the puppet repo side, what else needs to be done? should I remove the role from site.pp? [20:13:17] and there you get the actual phab link https://phabricator.wikimedia.org/project/profile/3364/ [20:13:23] no wait :) [20:13:36] https://phabricator.wikimedia.org/maniphest/task/edit/form/52/ [20:13:43] this is the link you should use [20:14:05] thanks! [20:14:42] sukhe: yes, remove it from site.pp but _after_ running the cookbook [20:14:49] got it [20:14:52] while the cookbook will also warn you that it is .. still in site.pp [20:14:59] then you say "yea, i know, right" [20:15:16] and it will be in DHCP [20:15:22] if only we could have automated gerrit edits :) [20:15:23] you can do that before [20:15:35] you don't have to worry about DNS anymore though [20:16:21] possibly an edit in netboot.cfg / partman recipe [20:16:46] possibly remove from cumin aliases [20:17:03] thanks, I will make read the wikitech link once again to make sure I have everything covered [20:17:06] cdanis: what do something similar to like how dbctl works? [20:17:29] or something that can upload a patchset that a human could then +2 and merge [20:17:34] sukhe: just "grep -r hostname *" in puppet/repo to check [20:17:35] that would be neat [20:18:13] mutante: yeah! [20:18:19] is the db script in the software repo cdanis ? [20:18:54] dbctl? yeah https://wikitech.wikimedia.org/wiki/dbctl links to its repo [20:18:56] i could look into hacking something up [20:19:47] ill get back to you on what i can figure out cdanis [20:20:34] I'm not sure that dbctl is the best example here, but sure :) [20:20:58] Well, i was going to use it to get a understanding on how it "commits" and kinda fork it from there so to speak [20:21:29] plus its written in python, a lang i understand [20:21:39] anyway imma stop rambling [20:41:06] cdanis: So im looking at how dbctl sends the paste to phab, i think if gerrit had a similar backend script to phab (the phaste) we could basically fork up a version of conftool/dbctl and make a way to convert whatever into a .patch or .diff file and use a backend script to upload to gerrit [20:41:26] I dont know if gerrit has such script [20:46:35] (i hope that made sense) [20:49:05] FWIW, I've recorded the ::ffff: issue in https://phabricator.wikimedia.org/T255568#6858439 for now [20:58:14] uhm.. I need to take that into account as well [20:59:13] and basically duplicate a lot of envoy config :_) [21:00:19] I will revert the "let envoy listen on Ipv6" for the testreduce machine. Because it did not work anyways. [21:00:20] <_joe_> vgutierrez: you know you have... templates right? 
[21:00:54] _joe_: I'm slightly aware, yes [21:01:04] <_joe_> :D [21:04:48] at least v4mapped is one of the less-crazy of these auto-ipv6 schemes to have to deal with :) [21:06:01] the 6 I know of (from having to do best-effort support for dns geoip lookups) are (copypasta from docs): [21:06:04] ::0000:NNNN:NNNN/96 # RFC 4291 - v4compat (deprecated) [21:06:06] ::ffff:NNNN:NNNN/96 # RFC 4291 - v4mapped [21:06:09] ::ffff:0000:NNNN:NNNN/96 # RFC 2765 - SIIT (obsoleted) [21:06:11] 64:ff9b::NNNN:NNNN/96 # RFC 6052 - Well-Known Prefix [21:06:14] 2001:0000:X:NNNN:NNNN/32 # RFC 4380 - Teredo (IPv4 bits are flipped) [21:06:17] 2002:NNNN:NNNN::/16 # RFC 3056 - 6to4 [21:06:28] they're all a mess! :) [22:18:28] IPv6 a mess? that can't be :P [22:18:45] * akosiaris couldn't resist [22:19:45] root_req.headers.x-client-ip [22:19:45] ::ffff:10.64.0.100 [22:19:47] wait what? [22:21:58] nodejs 127896 restbase 13u IPv6 109896146 0t0 TCP *:7233 (LISTEN) [22:21:58] nodejs 127896 restbase 14u IPv6 109896147 0t0 TCP *:7231 (LISTEN) [22:22:41] yeah, it looks like restbase is opening up just the IPv6 socket and relying on the ipv4 compat behavior, but this should be happening for years now [22:34:07] Yeah I can see entries in logstash going back to at least Dec 2020, chances are it's been around for ever (tm) [22:58:39] I guess I don't understand restbase internal loopiness [22:58:55] I still read that as restbase1019's IP as the client side of some connection [22:59:25] maybe it's an rb->rb request? [23:03:04] I do agree, it does look like nodejs is only listening on the v6-any, which implies nodejs is doing what I'm complaining about heh [23:03:53] still, you'd think we'd have noticed this a long time ago. maybe something else subtle has changed (on the rb nodes) more-recently [23:04:55] or maybe something has changed in how envoy works [23:05:43] well for the nodejs listener to be the reason, it has to be something connecting to RB's nodejs [23:05:48] and yet it's also an RB client IP [23:06:00] so it's probably rb->rb [23:06:15] unless there's rb->envoy->rb? [23:07:30] why would rb route to rb?
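To make the envoy half of the ::ffff: discussion concrete (the issue recorded in T255568): a bare-bones sketch of how the listen_ipv6 flag quoted from modules/envoyproxy/manifests/tls_terminator.pp plausibly reaches the terminator. Only the listen_ipv6 parameter, its documented ipv4_compat behaviour, and the profile::tlsproxy::envoy::listen_ipv6 hiera key appear in the log above; the class body, the resource title, and the default value are assumptions.

```puppet
# Sketch only: the real profile::tlsproxy::envoy and envoyproxy::tls_terminator
# take many more parameters than shown here.
class profile::tlsproxy::envoy (
    Boolean $listen_ipv6 = lookup('profile::tlsproxy::envoy::listen_ipv6', Boolean, 'first', false),
) {
    envoyproxy::tls_terminator { '443':  # assumed title/port
        # With listen_ipv6 => true, envoy binds a single IPv6 listener with
        # ipv4_compat, so IPv4 peers get recorded as ::ffff:a.b.c.d (the form
        # seen in the x-client-ip header above) rather than as the plain IPv4
        # address that ACLs and XFF parsing generally expect.
        listen_ipv6 => $listen_ipv6,
    }
}
```

Listening on separate IPv4 and IPv6 sockets, or normalising the v4-mapped form before it is logged or forwarded, would avoid the surprise, at the cost of the duplicated envoy config complained about at 20:59.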