[09:53:21] Any prometheus expert around? I've noticed that the swift check for the number of thumbs in the 2 DCs has been unknown for ~1 day, and if I try the query used on Grafana it actually stops ~1d ago.
[09:53:34] the query is a divideSeries and the two series have data up until now
[09:54:01] but the divideSeries one stops
[10:07:08] cc cdanis, godog ^^^
[10:08:30] Also I planned to reboot the Prometheus servers today, so please let me know if there is anything to watch out for
[10:10:55] The reboots can create monitoring artifacts that will probably trigger some IRC alert. There should be a new alert that reports on Prometheus uptime and reminds you that artifact-triggered alarms are possible
[10:11:24] so basically just check those that would fire (hopefully not many) and ensure they are indeed caused by the reboot and are not real issues
[10:12:37] ack cheers volans
[10:12:59] look for "beware" in https://icinga.wikimedia.org/cgi-bin/icinga/status.cgi?host=prometheus1003
[10:13:45] ack
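[A hedged debugging sketch for the divideSeries question above. divideSeries is a Graphite function rather than PromQL, so this assumes the panel is Graphite-backed; the metric paths are made up, not the actual dashboard query. Rendering each operand and then the combined series as JSON shows which one stops having datapoints:

    # each operand separately, over the last two days:
    curl -s 'https://graphite.wikimedia.org/render?target=swift.eqiad.thumbs.count&from=-2d&format=json'
    curl -s 'https://graphite.wikimedia.org/render?target=swift.codfw.thumbs.count&from=-2d&format=json'
    # then the derived series, which reportedly stops ~1d ago:
    curl -s 'https://graphite.wikimedia.org/render?target=divideSeries(swift.eqiad.thumbs.count,swift.codfw.thumbs.count)&from=-2d&format=json'

One thing worth knowing: divideSeries returns null wherever the divisor datapoint is null or zero, so the derived series can go "unknown" even while both inputs appear to keep reporting.]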
[11:38:31] hey, dumb question. Which kubernetes version is the prod k8s cluster based on? cc akosiaris
[11:39:32] 1.11.9-1 and some staging hosts have 1.12.7-1
[11:39:37] https://debmonitor.wikimedia.org/packages/kubernetes-node
[11:40:36] ok thanks
[11:55:24] arturo: I'm trying to better understand the https://phabricator.wikimedia.org/T223902 thing
[11:55:35] o/
[11:55:52] what connects to these APIs/endpoints/whatever?
[11:55:58] (cloud instances?)
[11:56:13] (we're not actually trying to offer this to the real public, right?)
[11:56:39] yes, we are, potentially. We have tons of scripts and other stuff that connect to those endpoints
[11:56:57] openstack endpoints are meant to be public, at least in a public cloud deployment like ours
[11:57:04] ok
[11:57:39] all of these I guess, the set of ... 6 different endpoints... if we offer public service we expose them all?
[11:57:55] (public service I guess meaning, someone can run scripts from the outside that can auth into us and control things with wmcs instances)
[11:58:36] are they all HTTP, or are some other protocols like LDAP?
[11:58:52] they are all HTTP right now
[11:59:02] (they are all RESTful API endpoints)
[11:59:35] well except the proxy service, that's something else that's internal-only?
[11:59:37] these are the endpoints for the record
[11:59:40] https://www.irccloud.com/pastebin/cEeTpO6i/
[11:59:43] dynamicproxy I mean
[11:59:55] the proxy case is not clear to me ATM
[11:59:58] ok
[12:00:51] also, I'm not sure either if we would go 'fully internet open' from the very beginning
[12:01:05] sure
[12:01:27] and these live in the regular prod vlans, or some kind of wmcs dmz vlan or something?
[12:01:40] I mean the existing ones, e.g. cloudcontrol1003
[12:01:42] regular prod vlan
[12:01:56] cloudcontrol servers are regular prod servers
[12:02:00] with public addressing
[12:02:03] right
[12:02:36] that's maybe a whole separate angle on this conversation, about whether those should be better isolated from the rest of prod before becoming fully open to the public.
[12:03:03] it's just a lot of surface area for potential compromises into the control hosts
[12:03:25] anyways, that aside (which can probably be addressed separately later)
[12:03:31] I agree, that's why I mention the internet open thing. Right now we have iptables on the hosts to limit connectivity to known servers
[12:03:44] this doesn't seem like something we'd want to push through prod cache termination either
[12:03:47] i.e. I can't connect right now from my home to the endpoints
[12:04:31] and going back to your last ticket update: that wmcloud.org lacks a policy is easily remedied (we can just add it to the list, or make a separate one)
[12:04:36] (an HTTPS policy)
[12:05:06] ok
[12:05:07] the issue about wikimedia.org and subdomains only applies if you were trying to use our existing wildcards via the prod caches
[12:05:34] we could still easily puppetize having LE issue certs for e.g. *.wmcs.wikimedia.org directly to cloudcontrol hosts to do their own direct termination.
[12:05:45] yes, what vgutierrez suggested
[12:06:20] personally, I'd go for fuller separation of concerns though and use wmcloud.org (or make up another)
[12:07:19] just to avoid all kinds of ??? about mixing so many things with different scopes of control and security into wikimedia.org, and how browsers and other software sometimes look at a 2LD like wikimedia.org as a sort of "zone of trust" in some senses.
[12:07:40] ok
[12:07:52] the thing with `wmcloud.org` is where to place the NS for that domain, since we plan to make it managed by designate. We should avoid chicken-egg problems
[12:08:05] would it be a problem hosting the higher-level domain on the prod DNS servers?
[12:08:11] and delegating a subdomain to designate?
[12:08:36] we can, but is it necessary?
[12:08:49] (is designate not meant to host a 2LD on its own?)
[12:08:50] what is the alternative?
[12:09:31] I guess if there's some issue where it can't self-host so to speak
[12:10:05] "since we plan to make it managed by designate" - in this hypothetical, which parts of wmcloud.org is designate managing?
[12:10:17] (the hostnames of these public services?)
[12:10:47] but that is more or less a replacement for the current wmflabs.org and friends
[12:11:08] we don't know yet <--- missing message before
[12:11:16] ok
[12:11:46] I could try describing the chicken-egg problem I can see with designate
[12:12:03] so what you're meaning is: have prod DNS host "wmcloud.org", with direct records for these "public"(ish, for now) API endpoints, and also delegate some subdomain(s) off to designate for instance stuff, like "inst.wmcloud.org" or whatever?
[12:12:20] yup
[12:12:34] yeah that could work
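[A rough sketch of the split just agreed on. ns0/ns1.wikimedia.org are the prod nameservers; the designate NS name and the "inst" label are made up for illustration. Prod DNS would serve wmcloud.org itself and delegate the instance subdomain:

    #   wmcloud.org.       IN NS  ns0.wikimedia.org.
    #   wmcloud.org.       IN NS  ns1.wikimedia.org.
    #   inst.wmcloud.org.  IN NS  ns0.designate.wmcloud.org.   ; hypothetical designate endpoint
    #
    # once set up, the delegation is checkable from anywhere:
    dig +short NS wmcloud.org
    dig +short NS inst.wmcloud.org

This sidesteps the chicken-and-egg problem: designate only ever has to answer for a zone whose parent already exists outside of designate.]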
[12:13:14] it might be nice to create at least some basic simple static landing page for https://www.wmcloud.org/ as well, if we go that route
[12:13:25] * arturo nods
[12:14:10] well, the simplest thing is a redirect to wikitech
[12:14:21] ok, dumping some of the output of this back to the ticket. I just wanted to avoid a long drawn-out ticket convo trying to figure out what this all means :)
[12:14:24] as we do right now with www.wmflabs.org
[12:14:36] right
[12:15:11] thanks bblack :-)
[12:15:35] you know what, I think it was you who put the idea in my head at the SRE summit in Prague last year
[12:16:03] btw
[12:16:09] I think we bought "wikimedia.cloud"
[12:16:30] oh! nice! xd
[12:16:39] we are in the future now
[12:16:49] I remember seeing something like that before
[12:16:50] so with two nice domains, you could use those separately and avoid the delegation too
[12:17:03] put the control stuff in wmcloud.org and have designate handle wikimedia.cloud directly, or whatever
[12:17:07] yup, it's delegated to us
[12:17:11] sounds good
[12:17:23] the real question is
[12:17:37] lol, via ARUBA
[12:17:43] well ok, all real questions, but
[12:17:56] one of the domains or subdomains will have to replace wmflabs.org eventually
[12:18:01] 7dig ns cloud
[12:18:03] and be a public suffix
[12:18:16] * jbond42 wrong window
[12:18:20] did we get it added to PSLs in the past?
[12:18:31] yes
[12:18:34] ok
[12:19:04] paravoid: so, what is the question then?
[12:19:16] // Wikimedia Labs : https://wikitech.wikimedia.org
[12:19:21] // Submitted by Yuvi Panda
[12:19:24] wmflabs.org
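[That quoted block is the existing PSL entry for wmflabs.org. The list is published as a plain text file, so checking an entry (and its comment lines) is a one-liner; the URL below is the canonical published copy:

    curl -s https://publicsuffix.org/list/public_suffix_list.dat | grep -B2 'wmflabs'
]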
[12:20:54] so you need (domain = not necessarily 2nd-level) a) a domain for floating IPs and instances b) a domain for public endpoints and potentially other public-facing material c) ??? for the data services in a limbo in between
[12:21:13] I think these are questions that should be brought in scope and answered now, rather than deciding (b) in isolation
[12:22:02] before we start burning domains in the PSL or with an HSTS policy etc.
[12:22:24] do the data services need a separate domain?
[12:22:30] I don't know
[12:22:50] I guess not? I haven't thought about it
[12:23:08] data services is a wide thing. If you mean databases (wikireplicas and friends) they should be covered by (a). If NFS/Ceph, then that should probably be tied to whatever networking layout we use for them in the future?
[12:23:09] is there some other ticket already about dumping wmflabs.org (due to branding away from "labs")?
[12:23:23] bblack: I don't think so
[12:24:01] well I kind of assumed there must be. Is there any reason not to just continue using the working wmflabs.org for instance stuff?
[12:24:14] (and .wmflabs-sans-org :)
[12:26:17] bblack: currently we are only evaluating the endpoint stuff because implementing HA in the cloudcontrol servers will help us upgrade openstack and move the infra forward
[12:26:41] wmflabs.org and wmflabs-sans-org are on the TODO as well, but no short-term plans for them
[12:27:15] I'm arguing that you shouldn't "burn" a domain like wmcloud.org if you haven't thought about them too
[12:27:29] so the implementation can wait, but I think you should look at all this a bit more holistically
[12:27:29] sure, I agree
[12:28:07] and make a plan of what gets named how -- even if the execution of some of that happens months later
[12:31:58] ack
[12:32:07] summarized to ticket
[12:32:10] thanks bblack :)
[12:32:14] thanks!
[12:32:32] sorry for enlarging the scope
[12:32:38] I don't think it should be something super heavyweight
[12:33:05] I expect no less than a 50-slide presentation to the board
[12:33:13] naming could be hard heh
[12:33:33] api.domain vs api.svc.domain vs ...
[12:33:45] bblack: :D
[12:34:33] https://twitter.com/secretGeek/status/7269997868
[12:34:47] xd
[12:34:50] arturo: other thing to keep in mind is that the world is moving away^W^W^W has moved away from using random in-house TLDs
[12:34:58] so .wmflabs is something that we should ditch, as is .wmnet
[12:35:16] what is the replacement?
[12:35:37] use a domain that exists in the hierarchy :)
[12:35:49] so for prod we may use .wikimedia.net, it's unclear how yet
[12:35:57] we need to make a plan too ;)
[12:36:08] mind the WMF may be renamed to WPF :-P
[12:36:11] yes :)
[12:36:48] the issue is that TLDs are now available, someone may go and actually buy .wmnet
[12:37:07] and host stuff on the internet under that
[12:37:19] or we could :)
[12:37:30] * mark puts it in the budget
[12:37:34] wmflabs is, granted, less of a risk
[12:37:43] but still, all kinds of issues will start cropping up
[12:37:44] e.g.
[12:38:04] DNSSEC :)
[12:38:42] at this point I feel like we should at least evacuate our server/infra hostnames out of wikimedia.org
[12:38:55] maybe those and the wmnet stuff can all go somewhere together
[12:38:58] mark: put it under the brand awareness budget ;)
[12:39:30] but then there are two twin concerns left in wikimedia.org that are difficult to ever move: @wikimedia.org email/gsuite, and the various wikimedia.org wiki hostnames.
[12:39:35] bblack: yeah, if it wasn't for that, the .wikimedia.net decision would be easy
[12:39:39] and then yeah, all of this is potentially affected by the branding stuff
[12:39:43] it would be s/wmnet/wikimedia.net/
[12:40:26] the current set of wmnet and wikimedia.org server/infra names don't necessarily need to remain separate, they could share a domain
[12:40:33] yeah
[12:40:50] wmnet. has always felt dirty to me
[12:41:10] I used to do split domain stuff at my past gigs, and even at home
[12:41:17] although that was painful too
[12:41:22] with bind views and such, hehe
[12:41:24] for that matter, we could always split-view whatever holds all of those (present a public view that's basically empty with some static landing page or whatever, and only serve the actual infra hostname stuff to our internal caches)
[12:41:29] heh
[12:41:33] I type too slow!
[12:41:34] why :)
[12:42:03] because inevitably some will be intentionally reachable and some not, and that does create some confusion.
[12:42:29] I actually like that we don't have different views; they've always been a source of confusion in past jobs
[12:42:33] and later I didn't bother and put private IPs in public DNS, and people told me that was dirty
[12:42:37] yes
[12:42:46] what is a private IP nowadays :)
[12:42:50] indeed
[12:42:51] when someone spins up fooserver.wikimedia.net on a public VLAN to offer a public service, they have to go make an explicit decision and take an extra step to create a real public hostname Elsewhere for it.
[12:43:10] instead of spilling out and letting something start relying on that server hostname as a public resource
[12:43:18] yeah
[12:43:21] a service IP.
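[A rough illustration of the split-view idea being debated, not a real config; the zone name, file names, and networks are made up. With BIND views, the same zone answers differently depending on who asks:

    #   view "internal" { match-clients { 10.0.0.0/8; }; zone "wikimedia.net" { type master; file "internal.db"; }; };
    #   view "external" { match-clients { any; };        zone "wikimedia.net" { type master; file "external.db"; }; };
    #
    # observable from the client side as:
    dig +short fooserver.wikimedia.net @ns-internal.example   # 10.64.0.42 (infra record)
    dig +short fooserver.wikimedia.net @ns-public.example     # nothing, or only the static landing-page record

The trade-off raised in the log is exactly this: the external view hides infra names, at the cost of two sources of truth for one zone.]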
[12:43:34] we should totally have a session on DNS naming at the SRE meeting in Dublin
[12:43:46] I'd attend that :)
[12:44:06] templates/wikimedia.org:icinga1001 1H IN A 208.80.154.84
[12:44:12] templates/wikimedia.org:icinga 5M IN CNAME icinga1001
[12:44:22] the mix of "service" and "host" names in one public domain annoys me even today
[12:44:30] there are even worse examples :)
[12:44:33] sure
[12:44:38] but yeah, agreed
[12:45:06] we actually have dns views, you do realize that, right
[12:45:12] geodns is pretty much exactly that ;)
[12:45:22] yeah but not in the same way :)
[12:46:11] for what it's worth (not much) my nameserver at home forwards wmnet to wikimedia's NS
[12:46:14] it's pretty convenient
[12:47:12] technically we have hostnames in wmnet that map to public IPs too
[12:47:19] (the interface hostnames for LVS on public vlans)
[12:47:47] and vice versa, LVS interface hostnames in wikimedia.org that point at private IPs
[12:47:56] there are so many messy corner cases all over the place
[12:49:03] relatedly, it'd be nice to be clear and explicit about the standards for what gets to live directly as a public service host, vs being behind at least LVS if not the caches.
[12:49:17] and to have those enumerated clearly with rationale, etc
[12:50:20] something like: everything gets a private IP, and public IPs only live on LVS?
[12:50:49] no, there are some things we know we don't want behind any unnecessary infra, because they're used to monitor/fix that infra
[12:51:25] * arturo nods
[13:02:43] re: WMF->WPF isn't that wordpress foundation? lol
[13:04:26] * jbond42 lunch
[14:55:00] hello! who can I ask questions about Reprepro/APT?
[15:02:02] XioNoX: feel free to shoot
[15:02:42] and I'll feel free to ping mor.itz/jb.ond once I realize I don't know the answer :)
[15:03:17] yeah I pinged moritz too, probably easier to discuss it here
[15:04:38] so I have a deb package that I built from the sources with "cargo deb", the packaging tool for Rust. So it's just the .deb and no .changes, etc. I was told it was fine to use it as is (as long as I documented how I built it) but I don't know how to add it to reprepro without the .changes and similar
[15:05:45] you can include it with "includedeb" instead of "include", see https://wikitech.wikimedia.org/wiki/Reprepro#Importing_packages
[15:05:58] great, thanks!
[15:06:57] careful that depending on licensing
[15:07:01] you may need to include the .dsc as well
[15:07:17] if e.g. GPL code is included anywhere
[15:07:28] so to be on the safe side, do that anyway :)
[15:10:55] not sure what the .dsc is or how to generate it, but I'll look. Fyi the license is BSD ( https://github.com/NLnetLabs/routinator/blob/master/LICENSE )
[15:11:59] the .dsc + .orig.tar.gz + (another file) is the source package
[15:12:08] "dpkg-buildpackage -S" will generate that
[15:12:24] the dsc is the descriptor, it links to the other two
[15:12:38] "dcmd cp ..." or "dcmd scp ..." would help you copy these all together
[15:12:48] and then "reprepro includedsc" will include it in apt
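[Pulling those steps together as one hedged sketch; the package version, the target host, and the distribution codename are assumptions, not the actual values used:

    # build the source package (.dsc + .orig.tar.gz + debian tarball), unsigned:
    dpkg-buildpackage -S -us -uc
    # dcmd expands to the .dsc plus every file it references:
    dcmd scp routinator_0.4.0-1.dsc apt1001.wikimedia.org:
    # then on the apt host, import the source package:
    reprepro includedsc buster-wikimedia routinator_0.4.0-1.dsc
    # and for a binary-only .deb with no .changes file (the cargo-deb case):
    reprepro includedeb buster-wikimedia routinator_0.4.0-1_amd64.deb
]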
[15:13:21] I have no idea what "cargo deb" does, though :)
[15:13:53] yeah, it doesn't generate that dsc file, nor does it use dpkg-buildpackage
[15:15:35] fwiw I spoke with Md
[15:15:42] https://salsa.debian.org/md/routinator/ is the draft package
[15:16:29] Md = Marco d'Itri, some people here will be familiar with the name ;)
[15:16:45] it was in #networker, he was talking with redLED as well, pushing some of the changes he made upstream
[15:17:33] he has a systemd file that could prove useful
[15:19:27] yeah, I reused most of his systemd file, as well as the file structure, so when the official package makes it upstream not much changes, but one line causes a segfault (see https://github.com/NLnetLabs/routinator/issues/82#issuecomment-495455364 )
[15:20:25] but I haven't used anything else as I don't know how ready his work is
[16:03:11] paravoid: also the salsa debian stuff requires dependencies that are in buster/sid only
[16:03:24] correct
[16:03:29] and I think rust 1.34, which is in experimental
[16:05:16] alright, so I have no idea how to compile it from there :)
[16:05:28] yeah I dunno either
[16:05:54] is the .dsc file mandatory? I already have a working (and tested) .deb file :)
[16:15:44] paravoid: ^
[16:16:09] I guess that's fine if it's temporary
[16:17:51] yeah, the goal is to use the official one as soon as we can
[16:32:29] * arturo thinks on GPL trolls
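[For reference, a hedged sketch of the binary-only route described above; the toolchain version is an assumption based on the "rust 1.34" mention. rustup supplies a newer Rust than the distro ships, and cargo-deb builds the .deb straight from the crate, with no .dsc or .changes involved:

    rustup toolchain install 1.34.2
    cargo install cargo-deb
    git clone https://github.com/NLnetLabs/routinator
    cd routinator
    cargo +1.34.2 deb    # output lands in target/debian/routinator_*.deb

This sidesteps the sid-only build dependencies of the salsa packaging, at the cost of producing an unofficial, undocumented-by-Debian binary package.]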
[17:28:28] Is codfw pronounced "dallas" or "cod-faw"? or "cod-fwuh"?
[17:33:03] in my head it's cod firewall but I suspect I'm alone on that
[17:37:07] https://upload.wikimedia.org/wikipedia/commons/e/ee/Gadus_morhua_Cod-2b-Atlanterhavsparken-Norway.JPG F W
[17:38:00] note that we also have a router in eqdfw, so "Dallas" can be ambiguous
[17:38:32] and I usually spell it e-q-d-f-w
[17:38:53] codef-w
[17:39:15] (where w==double-u ofc)
[17:39:31] or cod-f-w
[17:39:33] like the fish
[17:39:55] chaomodus: (see above)
[17:40:16] :)
[17:40:41] ahh
[17:40:43] hah
[17:49:42] cod-dallas and ek-dallas?
[17:49:50] co-dallas, I guess
[17:50:16] I just want to make sure the pronunciation convention is different for each location, to maximize our security in depth
[17:52:04] 90% of us don't ever have to pronounce the network-only sites, so it boils down to the other 5 so far
[17:52:26] As codfw is our only CyrusOne DC, it's also non-ambiguous
[17:52:33] (To say CyrusOne)
[17:53:34] but yeah, Dallas is what most people use for non-written exchanges :)
[17:57:00] I bet only because of easier pronunciation
[17:57:37] we should have just done "codal" since DAL is the other Dallas airport and it rolls off the tongue much better
[17:57:56] but it was determined DFW is a bit closer to the DC :)
[17:58:26] by how much? and more importantly, real distance or driving distance?
[17:58:35] :-P
[17:59:05] hehe, I don't recall. You have to pull out the trigonometry, like for the distance to the gym in Rome :)
[17:59:16] ahahah
[17:59:22] well played
[18:01:56] anyone want to +1 an easy DNS change? https://gerrit.wikimedia.org/r/c/operations/dns/+/512405
[18:04:17] thx Cas!
[18:20:59] Does anyone know how to use hieradata/common/monitoring.yaml?
[18:21:14] XioNoX: what's the issue?
[18:21:18] I can usually manage :)
[18:21:46] I'm creating 2 VMs, rpki1001 and rpki2001
[18:21:57] should I add rpki_eqiad and rpki_codfw?
[18:22:15] or is it not considered a cluster since it's only 1 machine per site?
[18:22:16] depends, it's used only if $cluster is set up with a different name
[18:22:27] different than?
[18:22:31] in case you don't set it and it goes to misc
[18:22:34] it's not needed
[18:22:42] so in this specific case
[18:22:49] I don't think it's needed
[18:22:57] ok :)
[18:23:01] that works for me
[18:25:24] I don't think you need to define a cluster only for those 2 hosts honestly, they seem misc to me
[18:25:38] unless you want to create a network_tools cluster :-P
[18:27:08] https://gerrit.wikimedia.org/r/c/operations/puppet/+/512411 :)
[18:27:50] volans: does it need to be merged after the role is created or is anytime fine?
[18:30:05] why on a different CR?
[18:30:11] add it to the same one where you're adding the role
[18:30:54] you probably also want to add that alias to the existing misc-ops alias
[18:32:13] volans: I didn't want to have too many moving parts in a single CR, otherwise it gets even more scary to review (eg. I have prometheus in https://gerrit.wikimedia.org/r/c/operations/puppet/+/508956 )
[18:32:45] then chain them so they are in order
[18:33:26] how do I do that?
[18:33:56] same local branch, multiple commits, git review (it will send them all, 1 CR per commit)
[18:34:08] if you already have multiple branches, cherry-pick the other commits into one
[18:34:23] or rebase interactively and add them
[18:35:28] hm
[18:36:10] if I add "merge after merging the role" in the description, does that work too? :)
[18:37:17] I'm worried about losing the CRs in the process :)
[18:38:38] how can you lose the CR?
[18:38:59] CRs are never removed based on local commits
[18:39:16] if the Change-Id doesn't change, Gerrit will update the related CR
[18:39:35] ok
[18:40:47] that's what the Change-Id is for, and it's saved my butt many times
[18:48:17] ok, I think I didn't break anything
[18:48:26] surprising for a Friday
[18:49:17] hum, now that all my changes are in the same branch, how do I amend a specific CR?
[18:50:55] (thx for the tip btw, I didn't know I could do that)
[18:51:09] git rebase --interactive $SHA1, where $SHA1 is the commit before the one you want to modify
[18:51:30] change 'pick' to 'e' (for 'edit') on those that you want to modify
[18:51:45] save and it will interactively rebase into those commits
[18:52:01] at each step change files, amend the commit, git rebase --continue
[18:53:23] ok
[18:54:58] useful aliases
[18:54:58] ri = rebase -i
[18:54:59] rc = rebase --continue
[18:55:02] amend = commit --amend --no-edit
[18:55:15] related to the topic ofc :D
[18:55:36] and clearly
[18:55:37] r = rebase
[18:57:20] * volans mostly off
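[The recipe above, condensed into a minimal sketch for amending one commit in the middle of a Gerrit chain; the SHA is a placeholder for the commit *before* the one being fixed:

    git rebase -i 0123abc         # change 'pick' to 'e' on the commit to amend
    # ...edit the files...
    git add -u
    git commit --amend --no-edit  # keeping the Change-Id footer makes Gerrit update the same CR
    git rebase --continue
    git review                    # re-uploads the chain, one CR per commit
]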
[20:14:06] Friday trivia question: What would you expect to happen if I run this command?
[20:14:07] puppet agent --onetime --verbose --no-daemonize --no-splay --show_diff --waitforcert=10 --certname=consoletest-01.testlabs.eqiad.wmflabs --server=thisisnotarealpuppetmaster
[20:20:45] an error about not being able to resolve thisisnotarealpuppetmaster?
[20:21:28] so NXDOMAIN I guess
[20:23:11] That's what I would've expected!
[20:23:15] That, or 10 seconds of retries.
[20:23:34] But… nope, instead it says
[20:23:36] "Error: Could not request certificate: Failed to open TCP connection to thisisnotarealpuppetmaster:8140 (getaddrinfo: Name or service not known)"
[20:23:42] And then waits a few seconds, and then says it again
[20:23:43] forever
[20:24:00] well the error message itself is quite reasonable
[20:24:14] waiting forever... maybe not so much
[20:25:14] I was investigating T223920, and assumed that if the puppetmaster is unreachable the firstboot script will eventually despair and exit. But… it does not.
[20:25:14] T223920: Ensure/confirm a way to shell into unpuppetized VMs - https://phabricator.wikimedia.org/T223920
[20:53:24] andrewbogott, what about --waitforcert=0 ?
[20:54:08] that seems to be the same as omitting --waitforcert entirely.
[20:55:09] Which is to say, it doesn't wait at all.
[20:55:37] Maybe that's fine with autosign turned on? I assumed that we needed --waitforcert in order to actually get a catalog at all
[20:55:58] * andrewbogott tries it
[20:56:57] I'm not too sure about that
[20:57:15] the default is waitforcert=120
[20:57:22] and we do not set it in puppet.conf
[20:57:42] I think waitforcert=0 might be okay with autosigning, so let's try
[20:57:49] it looks like it is
[20:59:37] the behavior I see is the same for =0 and for omitting the arg
[20:59:45] and the behavior is different if I set =1
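[A hedged reproduction of the finding above; the flags are straight from the log, and the exit behavior is per the observations there rather than any guarantee from the docs:

    # a nonzero --waitforcert retries the certificate request in a loop,
    # repeating "Error: Could not request certificate: ..." indefinitely;
    # with =0 (which behaved like omitting the flag in this test) the agent
    # gives up instead of looping:
    puppet agent --onetime --verbose --no-daemonize --no-splay --show_diff \
        --waitforcert=0 --server=thisisnotarealpuppetmaster

The practical upshot for T223920 is that a firstboot script relying on a nonzero --waitforcert can hang forever if the puppetmaster is unreachable, so a timeout has to come from somewhere else.]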