[07:48:01] so the gutter pool is now taking traffic for mc1020
[07:48:11] more specifically, mc-gp1002
[07:48:23] no evictions, the TTL for all keys is capped at 10 mins
[07:48:39] get hit ratio around 0.85 after ~30 mins
[07:48:56] some nice metrics in https://grafana.wikimedia.org/d/000000317/memcache-slabs?orgId=1&from=now-1h&to=now&var-datasource=eqiad%20prometheus%2Fops&var-cluster=memcached_gutter&var-instance=mc-gp1002&var-slab=All
[07:49:11] at the bottom there is the "1.5.x" panel, those are the new memcached metrics
[07:49:29] keys are split into 3 LRUs, hot / warm / cold
[07:52:28] more info in https://memcached.org/blog/modern-lru/
[07:56:42] (I have also to mention that these hosts have 256G of RAM, so no evictions is kind of expected, they should be able to absorb traffic from multiple failing shards)
[08:17:22] <_joe_> yeah, the size of the datastore is larger
[08:20:11] so far everything looks nice
[09:52:20] For anyone interested in networking, this looks like a gold mine: https://learn.nsrc.org/bgp
[10:04:33] XioNoX: nice ;)
[10:23:11] And more Juniper specific - https://learningportal.juniper.net/juniper/user_activity_info.aspx?id=10175
[10:57:16] XioNoX: <3
[11:02:38] is there an easier way of copying stuff between servers than... this:
[11:02:42] `ssh install1003.wikimedia.org sudo "tar -C /srv/tftpboot/ -cf - buster-installer" | ssh apt1001.wikimedia.org sudo "tar -C /srv/tftpboot -xvf -"`
[11:04:04] kormat: transfer.py :-D https://wikitech.wikimedia.org/wiki/Transfer.py
[11:04:56] rsync::quickdatacopy source_host dest_host file_path
[11:05:50] mutante: this is for a once-off, i'm guessing that's puppet?
[11:06:16] and that is why we created transfer.py, let me know what you want transferred, and I will do it in a single command
[11:06:19] kormat: if not too big you can use the cumin host with ssh -3
[11:06:45] s/ssh/scp/
[11:07:24] kormat: yea, but it's still easier than the alternatives, imho
[11:07:27] SSH_AUTH_SOCK=/run/keyholder/proxy.sock scp -3 -p root@source:/path/ root@destination:/path
[11:10:46] I find "transfer.py apt1001.wikimedia.org:/srv/tftpboot/buster-installer install1003.wikimedia.org:/srv" quite easy
[11:11:01] (from cumin1001)
[11:11:57] mutante: because i'm a puppet newbie - is it possible to run ad-hoc puppet commands (a la ansible)?
[11:12:14] jynus: that looks very handy, thanks :)
[11:21:02] kormat: no we should not do that (puppet ad-hoc)
[11:23:04] i will fix it in puppet so we don't need any ad-hoc things
[11:43:53] Notice: /Stage[main]/Aptrepo/File[/srv/tftpboot/buster-installer.10.3/ldlinux.c32]/ensure: created
[11:44:29] kormat: it should work now. /srv/tftpboot/buster-installer.10.3/ got populated from volatile, so should be current
[11:45:53] mutante: <3
[11:46:41] runs puppet so it also happens in codfw
[12:24:05] XioNoX: thanks for the links
[12:30:55] hey guys, just a reminder we are in the middle of migrating contint
[12:31:01] so now V+2 right now
[12:31:06] but hopefully won't take long
[12:31:10] s/now/no
[14:00:07] <_joe_> kormat: do what jynus was suggesting :P
[14:10:09] not sure if sarcastic- transfer.py comes with built-in checks to prevent data overwrite
[14:10:22] and it runs 40 times a day to run backups 0:-D
[14:11:34] it is far from great, but it precisely is the use case of "transient copy" :-P
[14:13:38] <_joe_> jynus: it was not sarcastic, it's by far the best option
[14:19:40] the issue is already fixed and none of that is needed
[14:19:51] and if it would have been done then it would have popped up again soon
[15:42:34] someone knows what this mean?
https://icinga.wikimedia.org/cgi-bin/icinga/extinfo.cgi?type=2&host=puppetmaster1001&service=DNS+Discovery+operations+diffs
[15:43:32] interesting
[15:43:39] i wish there were some docs about the alert on https://wikitech.wikimedia.org/wiki/DNS/Discovery
[15:45:18] ah right, Alex wrote this alert
[15:45:53] also say what is the desired state
[15:46:02] the desired state is hardcoded in the python file ;)
[15:46:26] https://gerrit.wikimedia.org/g/operations/puppet/+/production/modules/profile/files/configmaster/disc_desired_state.py#22
[15:46:53] that being said... they do appear pooled for codfw and eqiad in https://config-master.wikimedia.org/discovery/discovery-basic.yaml
[15:46:56] so now I'm more confused
[15:47:19] oh
[15:47:23] {"eqiad": {"ttl": 10, "pooled": true, "references": []}, "tags": "dnsdisc=eventgate-analytics-external"}
[15:47:25] {"codfw": {"ttl": 10, "pooled": true, "references": []}, "tags": "dnsdisc=eventgate-analytics-external"}
[15:47:27] ttl=10 instead of ttl=300
[15:47:45] https://sal.toolforge.org/log/G3Ks83EBj_Bg1xd3-2vr
[15:47:47] _joe_: ?
[15:48:12] <_joe_> cdanis: yeah it was me and alex, I thought he also reset the ttl
[15:48:26] seems like no
[15:48:32] shall I just reset the ttl?
[15:48:33] <_joe_> ok, we can change it
[15:48:42] <_joe_> yes
[15:48:46] <_joe_> but is that a problem?
[15:49:09] I didn't write the monitoring script ;)
[15:49:19] <_joe_> me neither
[15:49:26] hahaha
[15:49:49] <_joe_> and I think it's pretty useless
[15:50:03] <_joe_> it should check that for everything we have at least one dc pooled I guess
[15:50:06] it would be neat if conftool objects had a utility method for producing a short, textual diff that could be embedded in such things
[15:50:20] Current Status: OK (for 0d 0h 0m 43s)
[15:50:31] <_joe_> not sure what you meant
[15:50:37] <_joe_> but also, I need a break, ttyl
[15:53:22] o/
[15:54:57] moritzm: looks like you left puppet disabled on mw2165, can that be re-enabled?
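The ttl=10 versus ttl=300 mismatch in the DNS Discovery exchange is the kind of thing a short textual diff would surface directly. A minimal sketch in Python, assuming per-datacenter objects shaped like the JSON pasted in the channel. `DESIRED` and `short_diff` are hypothetical names for illustration only; they are not part of conftool or of the real disc_desired_state.py check, and the desired values are assumed from the "ttl=300" and "pooled" remarks in the log.

```python
# Hypothetical desired state; the real one is hardcoded in disc_desired_state.py
# and may look different. Values assumed from the conversation (pooled, ttl=300).
DESIRED = {"pooled": True, "ttl": 300}


def short_diff(dc, actual, desired=DESIRED):
    """Return one line per field where `actual` differs from `desired`."""
    lines = []
    for key in sorted(desired):
        if actual.get(key) != desired[key]:
            lines.append(f"{dc}: {key}={actual.get(key)!r} (want {desired[key]!r})")
    return lines


# Values as pasted in the channel for dnsdisc=eventgate-analytics-external.
actual_state = {
    "eqiad": {"ttl": 10, "pooled": True, "references": []},
    "codfw": {"ttl": 10, "pooled": True, "references": []},
}

# Collect the differences for every datacenter, in a stable order.
report = []
for dc in sorted(actual_state):
    report.extend(short_diff(dc, actual_state[dc]))
print("\n".join(report))
```

An alert body built this way would name the offending field and datacenter instead of leaving the reader to compare two objects by eye.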
[15:56:56] most definitely yes, let me quicky checl
[15:57:17] cdanis: needs a hand for that checj?
[15:58:09] volans: no we're good now
[15:58:29] unless you want to debate the larger design issues
[15:58:39] in which case, gee look at the time, gotta go have lunch
[15:59:54] cdanis: enabled puppet on mw2165 and ran it there
[15:59:57] ty!
[19:35:23] This virtual conference might be interesting to some folks hanging out here -- https://openobservability.io/ -- 2020-05-27T18:00Z start time. Agenda is still TBD, but looks like ELK, Jaeger, Zipkin, Prometheus sort of things will be covered.
[21:08:45] mutante: have a minute to help me understand an issue with https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/451821/ ?
[21:09:07] I'm wondering if one of the apache::conf or apache::site things automatically included ::apache?
[21:09:18] which, the httpd resources seem not to include ::httpd
[21:11:20] Error: Could not retrieve catalog from remote server: Error 500 on SERVER: Server Error: Could not find resource 'Class[Httpd]' in parameter 'before' (file: /etc/puppet/modules/puppetmaster/manifests/passenger.pp, line: 68) on node abogott-puppetmaster.testlabs.eqiad.wmflabs
[21:13:12] Apologies in advance for the slight text wall:
[21:13:22] As part of the Discovery->Search team's wdqs data transfer process, we rely on the following cookbook to write a `ferm` rule that will configure iptables to allow connections from the desired wdqs instance on port 9876: https://github.com/wikimedia/operations-cookbooks/blob/master/cookbooks/sre/wdqs/data-transfer.py#L66-L69
[21:13:22] Note the line I linked in master has syntax errors and thus `ferm` fails to restart. We are working on a patch to fix the syntax error here: https://gerrit.wikimedia.org/r/#/c/operations/cookbooks/+/595061/
[21:13:22] That patch is syntactically valid (i.e.
ferm restarts fine as opposed to blowing up on syntax error), but it doesn't seem to be having the intended effect on the actual firewall rules. We've been testing on the test instance `wdqs1009` and `sudo iptables -L | grep 9876` is not showing the rule
[21:13:22] Question: Does anyone here have experience with ferm and perhaps know where I could find verbose logs that would show ferm parsing its configuration file and generating the corresponding iptables rules?
[21:15:25] ryankemper: if you haven't done this already, you can start by looking in /etc/ferm/conf.d at the specific rule files
[21:15:34] that's usually enough to sort me out when ferm won't start
[21:15:59] (And in my case, the issue is usually either commas where there shouldn't be, or lists that aren't in (parentheses)
[21:16:00] )
[21:16:09] So, ferm will start but we're not seeing the intended effect on `iptables`
[21:16:29] The file in question is `/etc/ferm/conf.d/10_cookbooks.sre.wdqs.data-transfer`, and its contents are the following:
[21:17:21] ryankemper: see man ferm, in particular --noexec and --lines
[21:17:43] mutante: added you to a proposed (but not very well-considered) patch
[21:18:07] https://www.irccloud.com/pastebin/9RNRzhyM/
[21:18:12] volans: Thanks, I'll check that out
[21:20:05] ryankemper: I'd use '(tcp', instead of '([tcp]', might be equivalent, so could be a red herring, but it's the syntax we're using in the other rules
[21:20:15] https://www.irccloud.com/pastebin/3iSn4ol4/
[21:20:48] ryankemper: your paste refers to an eqiad host as being in codfw
[21:22:46] I'd expect ferm to error out when trying to resolve that but maybe it does something quieter
[21:23:14] andrewbogott: Good catch, just fixed that
[21:23:23] did it help?
[21:23:58] It looks like it
[21:24:03] Lemme test by doing one more run
[21:24:56] LGTM, thanks all https://www.irccloud.com/pastebin/rdmoTK3w/
[21:27:24] Very interesting that it would be silently failing instead of throwing out an error
[21:42:18] Related to the above, anyone free to review https://gerrit.wikimedia.org/r/c/operations/cookbooks/+/595061/? (Still learning the gerrit workflow so not 100% sure on the best way to tag people)
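Since the root cause was an eqiad host written with a codfw domain, and ferm did not complain, a cheap pre-flight check on the hostname could catch this before the rule is written. A minimal sketch in Python, assuming the naming convention that the first digit of the host number encodes the site (1xxx for eqiad, 2xxx for codfw). `SITE_BY_DIGIT` and `site_mismatch` are hypothetical names, the FQDNs in the usage lines are illustrative rather than taken from the paste, and none of this is part of the actual cookbook.

```python
import re

# Assumed convention: the first digit of the host number encodes the site.
# Only the two sites mentioned in the log are listed here.
SITE_BY_DIGIT = {"1": "eqiad", "2": "codfw"}


def site_mismatch(fqdn):
    """Return (expected, found) if the domain disagrees with the host number.

    Returns None when the name is consistent, when it does not match the
    expected pattern, or when the site digit is unknown.
    """
    match = re.match(r"^[a-z-]+(\d)\d+\.([a-z0-9]+)\.", fqdn)
    if not match:
        return None
    expected = SITE_BY_DIGIT.get(match.group(1))
    found = match.group(2)
    if expected is not None and expected != found:
        return (expected, found)
    return None


# Illustrative FQDNs (hypothetical, not copied from the pastebin).
print(site_mismatch("wdqs1009.codfw.wmnet"))  # inconsistent name
print(site_mismatch("wdqs1009.eqiad.wmnet"))  # consistent name
```

A check like this only guards against one class of mistake. For everything else, running `ferm --noexec --lines` as suggested in the channel shows the generated rules without applying them.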