[05:08:21] There is one change from papaul pending to be merged on puppet: https://gerrit.wikimedia.org/r/c/operations/puppet/+/683762
[05:08:31] As it doesn't look dangerous, I am merging it
[08:25:44] good morning, could someone please assist me in deploying a Gerrit message change. That is to potentially fix up comments written from Gerrit back to Phabricator when abandoning a change: https://gerrit.wikimedia.org/r/c/operations/puppet/+/683810/1/modules/gerrit/templates/its/PatchSetAbandoned.soy.erb
[08:26:11] which doesn't even need a restart of the service afaik and has very limited impact (at worst it screws up messages written to Phabricator when one abandons a change)
[10:30:26] jbond42: hi, I'm trying to use cfssl to create ssl certificates for deployment-mediawiki11, however it's failing with "bad request" (https://phabricator.wikimedia.org/P15669), any hints where to look?
[10:31:38] uh, the json file in /etc/cfssl/csr has the hosts and names fields empty
[10:32:30] Majavah: one sec, just finishing something off, then will take a look
[10:34:23] jbond42: found the issue, turns out I had an extra space in the cfssl_label hiera field, works now
[10:34:50] (still some issues with the envoy config itself, but looks like not cfssl specific)
[10:35:24] cool, thanks. FYI you are the only other person to use this excluding me at the moment, so feedback welcome and don't hesitate to ping me :)
[11:08:13] why does profile::services_proxy::envoy require at least one service? deployment-prep does not have those, and it's controlled by a shared hiera key with the envoy tls terminator, which forces workarounds like https://phabricator.wikimedia.org/diffusion/CLIP/browse/master/deployment-prep/deployment-mediawiki11.deployment-prep.eqiad1.wikimedia.cloud.yaml
[11:12:41] Majavah: not familiar with that class, question better suited for serviceops, although I'm not sure if anyone is in today.
one thing I do note is the following comment
[11:12:44] [*ensure*] Whether the proxy should be present or not. We don't use it in deployment-prep.
[11:13:02] which suggests in deployment-prep you should set `profile::services_proxy::envoy`
[11:13:09] profile::services_proxy::envoy::ensure: absent
[11:14:25] would that help? the hiera lookup is for profile::envoy::ensure, which also controls profile::(tlsproxy::)envoy that I need enabled
[11:16:43] looking at the code, ensure is only used in one place, and it is specifically the place which triggers the fail() and what 'dummy-to-workaround-requirements' is also working around
[11:16:47] https://github.com/wikimedia/puppet/blob/production/modules/profile/manifests/services_proxy/envoy.pp#L35-L40
[11:18:07] oh wait, ignore me, I see we are overloading profile::envoy::ensure which is used by both this and tlsproxy
[11:19:25] should probably refactor profile::services_proxy::envoy and profile::tlsproxy::envoy to have their own unique ensure value which defaults to "%{alias('profile::envoy::ensure')}"
[11:31:52] I made https://gerrit.wikimedia.org/r/c/operations/puppet/+/683836, I'll test that on deployment-prep
[11:32:59] Majavah: :D you beat me https://gerrit.wikimedia.org/r/c/operations/puppet/+/683837
[11:33:43] yours is better, since I wasn't sure what to do with tlsproxy I didn't touch it :D
[11:34:58] feel free to test away, I don't want to merge either on a Friday without someone from serviceops taking a look
[11:45:56] that works; now envoy is complaining about "invalid path" for the cfssl certificate. that's a long path, but I don't see why it would be invalid, since it's correct and the file exists
[12:08:29] Majavah: I have a feeling it's because the file is owned by root (I haven't tested the cfssl tlsproxy integration at all), give me a sec and I should be able to send a quick fix
[12:14:48] still failing even with https://gerrit.wikimedia.org/r/c/operations/puppet/+/683849/ :(
[12:15:18] I wonder if it's related to the file name being so long?
[12:15:19] Majavah: which box?
[12:15:28] deployment-mediawiki11.deployment-prep.eqiad1.wikimedia.cloud
[12:15:55] * jbond42 looking
[12:17:55] you might need to run `sudo /usr/local/sbin/build-envoy-config -c '/etc/envoy'` if you modified something and want to retry envoy
[12:20:05] ack
[12:23:55] Majavah: I applied https://gerrit.wikimedia.org/r/c/operations/puppet/+/683854 and ran /git-sync-upstream on the puppet master; I hit an issue with a rebase
[12:24:09] jbond42: already looking at that
[12:24:13] ahh thanks
[12:25:37] fyi looks like it worked (although I need to check why it's renewing certs on every run)
[12:29:27] jbond42: looks like it's working, thanks!
[12:29:40] cool :)
[12:29:42] I manually rebased https://gerrit.wikimedia.org/r/c/operations/puppet/+/668701/ to get the git update working again
[12:29:47] ty!!
[12:30:06] ahh cool thanks
[12:45:05] what do I need to do to get an instance (deployment-cache-text06) to trust that certificate? profile::pki::client::ensure: present didn't do that
[13:13:59] jbond42 (if you're still here, Friday afternoon after all): what do I need to set to have other deployment-prep hosts trust those certs?
[13:14:49] Majavah: 5 mins
[13:15:26] take your time, not in a hurry at all
[13:25:10] Majavah: I broke sync again (I will just stop running that command, sorry)
[13:25:46] to answer your questions though, I think we need to apply two fixes
[13:26:30] 1) https://gerrit.wikimedia.org/r/c/operations/puppet/+/683891 this tells envoy to use the full chained certificate file instead of just the leaf certificate. I would like to apply this first and see where we get to, although I don't think this will fix things
[13:27:02] to fully fix things I think we will also need to apply https://gerrit.wikimedia.org/r/c/operations/puppet/+/683892
[13:29:55] ack, do those need to be cherry picked or can you merge them now?
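Regarding the chained certificate file mentioned above: envoy expects one PEM bundle with the leaf certificate first, followed by any intermediates. As a hedged aside (this is a sketch, not something that was actually run during this debugging), a quick sanity check on such a bundle can rule out an empty or truncated file before blaming envoy:

```python
import re

def pem_cert_blocks(pem_text: str) -> list[str]:
    """Return each complete CERTIFICATE block found in a PEM bundle."""
    pattern = re.compile(
        r"-----BEGIN CERTIFICATE-----.*?-----END CERTIFICATE-----",
        re.DOTALL,
    )
    return pattern.findall(pem_text)

def check_chain(pem_text: str) -> str:
    """Rough sanity check for a chained PEM file envoy refuses to load."""
    blocks = pem_cert_blocks(pem_text)
    if not blocks:
        return "no certificates found (envoy will reject this file)"
    # A stray BEGIN without a matching END means the bundle is truncated.
    if pem_text.count("BEGIN CERTIFICATE") != len(blocks):
        return "unbalanced BEGIN/END markers"
    return f"{len(blocks)} certificate block(s); leaf should come first"
```

Something like `check_chain(open(path).read())` against the `_server_chained.pem` path from the error message would have distinguished "file exists but is empty/garbled" from "file genuinely missing".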
[13:30:29] I can merge now but want to do one at a time
[13:30:38] sure
[13:31:24] ok the tlsproxy one is merged
[13:32:12] I'll run puppet on deployment-mw11 and see what happens
[13:32:29] ack
[13:33:38] not sure if that did anything? at least no changes on my puppet run
[13:33:47] (and yes, the patch is on puppetmaster04)
[13:34:12] hmm ok let me check
[13:38:53] ahh I see the issue, one sec
[13:41:01] ok merging fix https://gerrit.wikimedia.org/r/c/operations/puppet/+/683896
[13:41:16] (once CI has finished)
[13:42:27] Majavah: merged
[13:43:10] now failing with "error initializing configuration '/tmp/.envoyconfig/envoy.yaml': Invalid path: /etc/envoy/ssl/deployment-prep_eqiad1_wikimedia_cloud__deployment-mediawiki11_deployment-prep_eqiad1_wikimedia_cloud_server_chained.pem"
[13:43:35] looking
[13:43:44] /etc/envoy/ssl only has ca_chain.pem but not that
[13:44:17] hmm ok I know what that issue is
[13:48:22] ok merged fix
[13:50:10] error initializing configuration '/tmp/.envoyconfig/envoy.yaml': Failed to load certificate chain from /etc/envoy/ssl/deployment-prep_eqiad1_wikimedia_cloud__deployment-mediawiki11_deployment-prep_eqiad1_wikimedia_cloud_server_chained.pem
[13:50:29] that file exists at least
[13:52:05] looking
[14:01:20] Majavah: ok, going back to using the non-chained file and also merging the truststore change (https://gerrit.wikimedia.org/r/c/operations/puppet/+/683892)
[14:03:27] I get Verification: OK
[14:03:35] with openssl s_client now at least
[14:04:25] also curl https://$(hostname -f)
[14:04:28] works
[14:04:43] "verify error:num=20:unable to get local issuer certificate" on deployment-cache-text06 after running puppet on both
[14:07:43] Majavah: you need to have profile::pki::client running on deployment-cache (maybe that shouldn't be the case but it is currently)
[14:08:04] *on deployment-cache-text06
[14:09:05] can even be with ensure: present (which arguably is a bug)
[14:09:12] ensure: absent even
[14:09:24] ah
[14:10:45] jbond42: thank you for the Gerrit templating merge earlier this morning. I had an ISP outage and could not follow up unfortunately ;]
[14:11:14] hashar: no problem
[14:11:21] it works!
[14:11:34] woot woot :D
[14:11:49] Majavah: ok, so now when you say it works, what exactly is working :)
[14:11:49] I guess I could try changing https://gerrit.wikimedia.org/g/cloud/instance-puppet/+/c5a85ea2182ff7e2dfd093bf771bf1b2375faa58/deployment-prep/deployment-cache-text.yaml#114 to use http
[14:12:02] I had planned to look into very similar things next week
[14:12:06] curl on cache-text06 trusts that certificate!
[14:12:24] ahh ok cool, so the next thing to do is see if ATS trusts the cert
[14:12:33] yeah, I'll try that next
[14:14:07] Majavah: I think it will need a patch similar to this https://gerrit.wikimedia.org/r/c/operations/puppet/+/683604 (however see my -1 note)
[14:14:33] "Error: 502, connect failed"
[14:14:44] ah, yeah, looks broken, reverting
[14:18:02] how bad of an idea would it be to deploy profile::pki::client on all deployment-prep instances? instead of adding them individually when needed
[14:19:07] Majavah: it should be fine, it's pretty lightweight; by default it just installs cfssl and creates a bunch of files/dirs under /etc/cfssl
[14:19:38] it is installed on the whole of production already
[14:20:15] ah, in that case it should be pretty safe
[14:21:33] yes, exactly. in fact I will probably add it explicitly to profile::base next week. currently it's silently included by debmonitor::client -> profile::pki::get_cert -> include profile::pki::client
[14:23:23] jbond42: sorry to interrupt but I gotta revert the gerrit template change.
Proposed as https://gerrit.wikimedia.org/r/c/operations/puppet/+/683879 (it is broken in some way :-\)
[14:23:34] guess I need to set up a test platform, bah
[14:24:02] ack, merging now
[14:25:24] sorry for the mess :\
[14:25:40] sometimes testing in prod is just the fastest way, but if that fails I guess I have to revisit ;]
[14:25:42] no probs
[14:25:49] :)
[14:30:27] so looks like most things are working, thanks! using this on ATS is blocked by that bundle issue, but that can wait if needed
[14:31:26] Majavah: ack, I'll be working on the bundle thing next week. also thanks a lot for the troubleshooting today. like I said, I planned to go through a lot of things next week, so this has been a big help to me, and it's always useful to get an extra pair of eyes
[14:53:18] o/ Would we allow a non-critical-path service (for wikidata only) that handles failed requests etc well to run on labs and be called from an extension, roughly 0-2 requests per second? Or is there a hard rule that says no?
[15:59:07] addshore: I think from a sec/privacy perspective alone (not getting to latency/availability etc.) probably a hard no
[16:00:18] I could imagine latency could end up being a question, but the plan would be that there would be a cache in front of this in mediawiki too, so this would end up being less than a request per second to this external service.
[16:01:26] Just trying to figure out if there is a way for us to do this A/B test before having to jump through the hoops of getting the thing in prod, in case the answer is actually that we don't want it in prod :P
[16:02:17] I'm wondering what the correct / proper way to even request this is now? Not an RFC?
[16:20:00] I mean, we do have some external API calls, like for Flickr and machine translation
[16:20:05] so I guess it's not a hard no
[16:21:17] but it'd be important to really make sure no PII can get into it, and that any response we get we make no assumptions about in terms of it being correct or trustworthy (e.g.
no raw HTML strings or other client/server actions as a consequence that aren't filtered by some kind of allow list)
[16:21:26] that's just me thinking out loud, this isn't policy, to be clear :)
[16:22:17] Yup. Any idea of the process to get a clear decision from whoever ends up needing to decide?
[16:26:54] There is one piece of data (a short string) that comes from the user that would need to be passed to the service. I imagine if anyone wants to flag anything up, that would be the closest PII-able thing. But I'd like to think that there is still something to be done to minimize the "risk" there etc.
[16:27:17] This would probably all be easier to write up in a phab ticket or doc and send to whoever ultimately needs to say yes or no. :)
[16:27:53] No is a fine answer, but obviously it extends the timeline for performing the A/B test by multiple quarters etc and needs sec review, a service request, etc.
[16:44:46] it'd maybe be helpful to step out to the bigger context of what the effort is really trying to achieve, once we're talking about these kinds of constraints and processes.
[16:45:38] (what are we pulling from where and why, and to what end?)
[16:46:01] yup, I'll try to lay this out in a ticket or doc
[16:46:28] Krinkle: do you have pointers to where we already call out to flickr / machine translation so that I can see what is happening there?
[16:48:25] addshore: we do have A/B testing code in prod, and most services/extensions are able to deploy more or less within reason any such code directly to prod on a weekly/daily basis. So it's not obvious to me why it would "help" to do this in labs.
[16:48:49] I'm sure it's more complicated, I look forward to reading about it
[16:49:08] so the service being called out to is a recommendation service for properties on wikidata, written in Go using graphy tree-y things.
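The allow-list idea Krinkle raises above can be made concrete: treat whatever the external service returns as untrusted, and copy out only explicitly allowed, plainly-typed fields before anything downstream sees it. This is purely an illustrative sketch; the field names ("property", "score") are invented and not from any real Wikidata recommendation API:

```python
# Treat the external service's JSON as untrusted input: keep only fields
# on an explicit allow list, and only when they have the expected plain type.
# Field names here are hypothetical, for illustration only.
ALLOWED_FIELDS = {"property": str, "score": float}

def sanitize(recommendations: list[dict]) -> list[dict]:
    """Drop unknown fields (e.g. raw HTML) and malformed items entirely."""
    safe = []
    for item in recommendations:
        cleaned = {}
        for field, expected_type in ALLOWED_FIELDS.items():
            value = item.get(field)
            if isinstance(value, expected_type):
                cleaned[field] = value
        # Only keep items where every allowed field was present and well-typed.
        if set(cleaned) == set(ALLOWED_FIELDS):
            safe.append(cleaned)
    return safe
```

The point of the design is that anything not named in the allow list, an injected HTML string, an unexpected key, a wrongly-typed value, simply never crosses the boundary, which is easier to audit than trying to enumerate what to strip.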
[16:49:32] so in prod it would end up requiring a service request and its own place to live etc (rather than just an A/B test within mediawiki)
[16:50:01] the old service is entirely within a mediawiki extension
[16:51:03] Krinkle: I'll be sure to add some pretty diagrams too!
[17:15:40] The only part I can say from the WMCS side of this question is that there would be no special service guarantees from the WMCS team about the instance/project/proxy that was being accessed. It is unreasonable to escalate the "best effort" guarantee of the Cloud VPS environment to a more responsive SLA for a single experiment.
[17:16:36] * bd808 realizes that he is not the lead in WMCS anymore, but has practiced this response
[17:30:24] addshore: I'm not aware of any written standards/best practices around calling external service providers in the request flow, but if we did have them, I'd expect all the concerns we've talked about so far -- scrubbing PII, worrying about latency, still providing an acceptable user experience in the event of failures, etc
[17:31:24] addshore: for now I think the best place to start would be a serviceops ticket. I'm not sure if this is something that should also be asked of the new-ish Technical Forum, nor am I sure how that works for WMF-external folks :) https://www.mediawiki.org/wiki/Technical_Decision_Making_Process/Technical_Forum
[17:31:32] yeah, and the PII bit gets complicated, because it's not just explicit PII. It can also be timing attacks on low-volume terms that correlate the user-facing activity to the backend outbound requests.
[17:32:54] (as an aside, I think we also need to put work into making it easier to do this sort of experimentation within production, but that's a topic for another time)
[18:04:50] thanks for the comments all, I'll take that all into consideration when I write a thing, and I agree it might end up being a "Technical Decision Making Process" thing, so I should go and read up.
I know we (wmde) are part of the forum
[18:05:46] and re that topic for another time, I dream of that day ;) As right now the alternative is probably a multi-quarter effort to see if it's worth investing in the solution, but at that point we have already invested quite some!
[20:03:14] marostegui sobanski kormat: FYI, pt-heartbeat changes for next week (probably): https://gerrit.wikimedia.org/r/c/mediawiki/core/+/657471
[20:03:31] long overdue, but with your blessing this will ride the train next week
[20:03:38] also happy to delay if you prefer that for any reason
[20:16:09] andrewbogott: so I see you contributed to bootstrap-vz in the past, have you already figured out a replacement for it? I just ran into the fact that it doesn't know about bullseye https://phabricator.wikimedia.org/T281596#7049780
[20:18:12] legoktm: I'm trying to move to a new process that starts with an official debian daily. https://gerrit.wikimedia.org/r/c/operations/puppet/+/674184
[20:18:32] It works on buster. If bullseye is now officially v11 I can try that on bullseye today
[20:18:45] Are you using bootstrap-vz for other purposes or just cloud-vps base images?
[20:19:17] I think Anders cut it loose and it's in need of a new home if anyone wants it to stay alive; I'm hoping that I don't need it and don't need to personally adopt it :)
[20:19:41] we use it to build the base production docker images
[20:19:58] see https://gerrit.wikimedia.org/g/operations/puppet/+/f32a67145f271b39f6ffbf3501029c472eb2842e/modules/docker/templates/images/build-base-images.erb
[20:20:12] oh :(
[20:20:26] Maybe /you/ want to adopt it?
[20:20:33] haha no
[20:20:54] But probably there's some new tool that is the 'normal' way to do this...
I got the impression that for the last couple of years we were the only remaining users
[20:21:03] hmm
[20:21:24] I'm currently planning to try to patch /usr/share/bootstrap-vz/bootstrapvz/common/releases.py to see if that's good enough to build a bullseye image
[20:21:39] I wonder what the official Debian docker images are made with
[20:22:30] dammit, the debian dailies stopped building the day after I told zigo I was using them
[20:26:04] :|
[20:26:32] I'm going to file a task to find a replacement
[20:26:34] legoktm: if you get desperate I probably know vaguely how to get bullseye to work on bootstrap-vz. I'd start by figuring out what tool other builds use though.
[20:28:08] so I want to run this on buster to build a bullseye image
[20:28:13] does it matter what version I'm building?
[20:29:17] what version of what?
[20:30:32] sorry, does it matter that I'm building bullseye vs buster images? I'd assume the tool just downloads a different set of packages
[20:30:58] based on https://github.com/andsens/bootstrap-vz/blob/fcdc6993f59e521567fb101302b02312e741b88c/bootstrapvz/plugins/docker_daemon/__init__.py it looks like it shouldn't matter
[20:31:10] probably I should just give it a try and see how far it gets
[20:31:32] I'm not sure. I had to make some code changes to get it from stretch to buster...
[20:32:39] if you do a git log you should be able to find my patch for buster and see which things it touched
[20:33:00] (there's probably a way to do that on github too, but the github UI always annoys me)
[20:33:13] yeah, I'm just not seeing similar stuff for the docker provider
[20:33:34] I'm going to try building a bullseye image locally and will report back :)
[20:33:40] cool
[20:55:10] Hi SREs! Can someone +2 https://gerrit.wikimedia.org/r/c/operations/puppet/+/683989 please?
[20:56:21] here's the puppet compiler report for it: https://puppet-compiler.wmflabs.org/compiler1003/29338/deploy1002.eqiad.wmnet/index.html
[20:56:32] Krinkle: did you mean 1.37 for the notes or are you gonna backport?
[20:57:33] RhinosF1: backport, announced on wikitech a few months ago, linked from the comment
[20:57:47] Cool
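As a footnote to the bootstrap-vz thread earlier: legoktm's plan was to patch bootstrapvz/common/releases.py so the tool learns that "bullseye" is Debian 11. I have not checked the real structure of that file, so the following is only a conceptual model of what such a patch does (a codename-to-version table gaining one entry), not bootstrap-vz's actual API:

```python
# Conceptual sketch only: bootstrap-vz keeps a table of known Debian
# releases, and "doesn't know about bullseye" means a lookup by codename
# fails. The real releases.py may differ; this models the idea with a
# plain mapping.
DEBIAN_RELEASES = {
    "stretch": 9,
    "buster": 10,
}

def add_release(name: str, version: int) -> None:
    """Register a release so later lookups by codename succeed."""
    DEBIAN_RELEASES[name] = version

def version_of(name: str) -> int:
    try:
        return DEBIAN_RELEASES[name]
    except KeyError:
        # This is the failure mode hit when building an unknown release.
        raise ValueError(f"unknown release: {name}")

# The hypothetical patch: teach the table about Debian 11.
add_release("bullseye", 11)
```

Whether such a one-line registration is "good enough to build a bullseye image", as legoktm put it, depends on whether anything version-specific changed between buster and bullseye, which is exactly what the stretch-to-buster patch mentioned in the log had to handle.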