[09:18:02] I have a question for the general audience. How would one transfer a million files / ~300GB of data from one host to another?
[09:18:46] we tried rsync::quickdatacopy but the rsync server runs in a chroot, which prevents it from mapping the uid/gid between the source and destination, resulting in a lot of madness to fix up the ownership
[09:19:02] so I felt like one could run rsync over ssh as root (which would preserve permissions)
[09:19:49] then I guess using the rsync::quickdatacopy rsync server instead of ssh has a reason. It just escapes me entirely
[09:21:52] hashar: Don't know if it has trouble with that many files, but maybe https://wikitech.wikimedia.org/wiki/Transfer.py ?
[09:22:16] hashar: E_NOTENOUGHINFO, can you do stuff on the source host? is it a one-time thing? can you make an archive on the source host?
[09:22:20] I guess ssh encryption might be too much overhead
[09:22:45] the context is to switch over Jenkins from one host to another, so there is plenty of build history to move around ;]
[09:23:35] don't we already have a setup that replicates jenkins to a passive host?
[09:24:13] yeah we have a spare jenkins, but the build history/artifacts are not synchronized. They are only on a single host
[09:25:31] jayme: I had a quick look at transfer.py, seems like it wants one big tarball of the files and that would not fit on disk unfortunately
[09:25:43] I think you need something slightly more structured
[09:26:00] for the one-off we could do a streaming tar.gz
[09:26:13] and then set up a puppetized rsync to keep it in sync
[09:26:27] unless jenkins has a better way to sync 2 servers
[09:28:07] Jenkins doesn't really have such a system afaik. I could change to have two Jenkins masters but there is a long tail of tasks to accomplish for that
[09:28:30] or I could move the data to a distributed file system such as Ceph / Swift or whatever and have the Jenkins primary and spare point to it
[09:28:42] but that is a lot of refactoring :]
[09:29:06] what would a streaming tar.gz / rsync look like? Is that about running tar on one side and netcat-ing it to the other side?
[09:30:05] pretty much, piping tar, compression and usually encryption on one side into netcat, and the opposite on the target host
[09:30:29] we used to do that for dbs before transfer.py
[09:30:43] and we still do something very similar in the wdqs data transfer cookbook
[09:31:07] assuming the destination is empty, I guess rsync over ssh as root is similar then?
[09:31:30] for a large number of files my bet is that it will be much slower
[09:32:07] and nc uses the bandwidth much better, even too much if you have to not interfere with current work
[09:32:13] but it can be rate limited
[09:34:04] (see for example the end of this section: https://wikitech.wikimedia.org/wiki/MariaDB/ImportTableSpace )
[09:35:49] cool thanks
[09:36:14] I guess I am going to read the transfer.py and ImportTableSpace pages, then emit proposals and reach out to the ops list for the general audience ;]
[09:36:18] thank you both!
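As a rough illustration of the streaming tar-over-netcat approach described at 09:30:05, here is a minimal Python sketch. The hostnames, paths and port are placeholders, netcat flags differ between the traditional and OpenBSD flavours, and a real run would add encryption and/or rate limiting as mentioned in the discussion:

```python
#!/usr/bin/env python3
"""Minimal sketch: stream a tar.gz from the source host straight into a
listening netcat on the destination, so no full archive is staged on disk.
Hostnames, paths and the port are placeholders."""
import subprocess
import time

SRC_DIR = "/var/lib/jenkins"        # tree to copy (placeholder)
DEST_DIR = "/var/lib/jenkins"       # target directory (placeholder)
DEST_HOST = "jenkins-new.example"   # destination host (placeholder)
PORT = 4444                         # any free TCP port

# Receiver: on the destination, netcat listens and un-tars the stream
# directly into the target directory (flags are traditional-netcat style).
receiver = subprocess.Popen(
    ["ssh", f"root@{DEST_HOST}", f"nc -l -p {PORT} | tar -xzf - -C {DEST_DIR}"]
)
time.sleep(2)  # crude: give the listener a moment to come up

# Sender: tar + gzip the tree and pipe it into netcat; a rate limiter
# (e.g. pv) could be inserted into this pipeline to spare the link.
subprocess.run(
    f"tar -czf - -C {SRC_DIR} . | nc -q 1 {DEST_HOST} {PORT}",
    shell=True, check=True,
)

receiver.wait()
```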
[09:48:49] hashar: FYI CI for spicerack has this weird thing where sometimes pylint erroneously finds a bug
[09:48:59] I can't repro locally and a 'recheck' fixes it
[09:49:01] any idea?
https://integration.wikimedia.org/ci/job/tox-docker/15800/console
[10:09:17] volans: it smells a bit weird that spicerack does `from cumin import NodeSet`, when spicerack also imports from ClusterShell directly
[10:09:30] (though i don't see a connection between this and the issue you're having)
[10:10:44] kormat: yeah we could unify that or use cumin.nodeset (that uses the RESOLVER_NOGROUP) everywhere
[10:14:03] volans: i'm wondering if RemoteHosts sometimes gets a list as the `hosts` parameter. that would cause the issue
[10:14:19] does pylint honour type hints?
[10:17:14] not sure how much they affect the checker internally
[10:18:13] because otherwise how is it inferring the type of `remaining`, and inconsistently, too
[10:18:35] volans: smells like a pylint bug to me
[10:18:38] from how the tests use it, potentially
[10:19:13] volans: alright - i'd check to see if any tests pass in a list
[10:20:01] if you want, but don't lose time on it, I've already got jo.hn running tox in a loop to see if he can repro locally on his linux setup, as I wasn't able to
[10:20:04] repro
[10:23:42] volans: yeah you mentioned it last week. I looked a bit into it and I can't find any clue in CI, nor do I know anything about the spicerack code
[10:23:53] what would be possible is that the linter gets confused while it is parsing the files
[10:24:08] possibly because files are processed in a different order from one build to another
[10:24:23] or it is hitting a random bug unrelated to ci / spicerack :-\
[10:24:39] maybe it can be reproduced by reusing the same PYTHONHASHSEED
[10:25:24] (all of the above is a work of fiction and should not reflect reality. No animals have been harmed under supervision of PETA) (c) 2020 Wikimedia Movies Pictures
[10:28:19] volans: or what I do is download the console logs and artifacts for a good and a bad build and diff them
[10:28:26] it is a bit tedious, and often doesn't lead to any result
[10:28:28] but sometimes it does
[10:28:45] or maybe prospector can be run with some debug output and that can be used for comparison
[10:29:58] it might also be a bug in the python3.7 version we use. I have had past experience with issues in the Debian-provided pythons because not all hotfixes are necessarily cherry-picked into the debian package.
[10:34:09] volans: my suggestion: add `assert isinstance(remaining, NodeSet)` to that function, and run it in CI
[10:34:23] (yeah bandit will complain due to assert being used but..)
[10:34:37] who cares
[10:34:41] we can test it yes
[10:37:38] the cumin output in cookbooks is REALLY nice
[10:37:51] wikilove to Riccardo :D
[10:38:18] might be too verbose in *some* cases, we can tweak them down
[10:42:00] kormat: let's see with https://gerrit.wikimedia.org/r/c/operations/software/spicerack/+/647657
[10:42:23] locally unit tests pass
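To make the two repro ideas above concrete (running tox in a loop as at 10:20:01, and pinning PYTHONHASHSEED as at 10:24:39), a hedged Python sketch might look like the following; the tox environment name is a guess, not the real spicerack one:

```python
#!/usr/bin/env python3
"""Sketch: re-run the linter repeatedly with a pinned hash seed to see
whether the spurious pylint report can be reproduced deterministically.
The tox environment name is hypothetical."""
import os
import subprocess

env = dict(os.environ, PYTHONHASHSEED="0")  # pin hash randomization

for attempt in range(1, 21):
    result = subprocess.run(
        ["tox", "-e", "py3-prospector"],    # hypothetical env name
        env=env, capture_output=True, text=True,
    )
    if result.returncode != 0:
        print(f"attempt {attempt}: linter failed, tail of output follows")
        print(result.stdout[-2000:])
        break
    print(f"attempt {attempt}: clean")
else:
    print("no failure reproduced in 20 runs")
```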
[13:26:35] hi all, going to disable puppet for about 30 mins while I reboot the puppetmasters and puppetdbs
[14:50:05] q: is there a comprehensive list of top-level domains (e.g. wikimedia.org) that we operate and might receive requests for?
[14:50:19] I've got a hardcoded list here
[14:50:19] https://gerrit.wikimedia.org/r/c/analytics/refinery/source/+/646808/3/refinery-core/src/main/java/org/wikimedia/analytics/refinery/core/Webrequest.java#84
[14:50:29] would rather have something more official
[14:51:14] ottomata: counting 'non-canonicals' or not? :)
[14:57:19] non-canonicals?
[14:57:36] anything that we might receive an HTTP request for
[14:57:45] at the frontend cache level
[14:59:16] ottomata: configured ones are all between the dns repo and the ncredir config
[14:59:29] but there are additional ones that delegate their dns to us and that we don't know about
[14:59:32] or similar
[14:59:43] so the full canonical list is actually more complicated than that
[14:59:52] in general traffic should give you the pointers to what you want
[15:00:20] ottomata: i think for cache ones it may be `yaml.load('hieradata/role/common/cache/text.yaml')['cache::alternate_domains'].keys()`
[15:01:19] and the upload domains
[15:01:39] the non-canonicals are no longer routed to the cache layer
[15:01:51] (or to analytics in any way, AFAIK)
[15:02:07] ok cool, so cache/text.yaml and cache/upload.yaml should suffice?
[15:02:10] i'll check that out, thank you!
[15:02:27] you're looking for only the 2LD domains I guess, right?
[15:02:43] as in, just "wikipedia.org", not the list of language subdomains and www and all that beneath it?
[15:03:40] ottomata: ^ ?
[15:04:35] I don't think we have it as a succinct list, anywhere
[15:05:54] https://gerrit.wikimedia.org/r/plugins/gitiles/operations/puppet/+/refs/heads/production/modules/varnish/templates/wikimedia-frontend.vcl.erb#313
[15:06:04] ^ this is the closest thing to such a list, but it's a regex heh
[15:06:33] yes
[15:06:35] any HTTP hostname that doesn't match that regex, we reject as invalid
[15:06:48] hehe, i built my list from a regex we had in refinery source
[15:07:11] from this one
[15:07:11] * ottomata https://gerrit.wikimedia.org/r/c/analytics/refinery/source/+/646808/3/refinery-core/src/main/java/org/wikimedia/analytics/refinery/core/PageviewDefinition.java#76
[15:07:30] the main surprise there might be wikiworkshop.org - it's not in our primary unified cert, but we're handling it as canonical now through a separate LE cert
[15:07:49] oo and wmfusercontent
[15:07:49] ?
[15:07:51] also wmfusercontent is missing yeah
[17:24:50] not sure what happened, haven't dug into the actual reports at all, but: https://grafana.wikimedia.org/d/43OLwO2Mk/cdanis-hacked-up-nel-stats?viewPanel=2&orgId=1&from=1607584665237&to=1607600907284
[17:25:22] from 10:00 through 10:40 or so today we had way more h2.ping_failed / tcp.closed error reports than usual
[17:25:48] ah!
[17:25:50] https://sal.toolforge.org/log/8jAVTHYBpU87LSFJ5dDQ
[17:25:52] amazing
[17:36:36] cooool
[17:37:16] I believe, but am not sure, that h2.ping_failed is tantamount to "connection timed out"
[17:37:26] (not on establishment; an idle background connection)
[22:59:49] should I be worried about a 1.6k requests-per-second spike that are all erroring with 415 responses?
[22:59:52] https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?orgId=1&from=now-6h&to=now&refresh=30s
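For reference, the hieradata approach suggested at 15:00:20 could be fleshed out roughly as below. The puppet checkout path is a placeholder, and this only covers the cache::alternate_domains entries from the text and upload cache roles, not the canonical wiki domains matched by the VCL regex:

```python
#!/usr/bin/env python3
"""Sketch: list the extra domains served by the text and upload caches by
reading cache::alternate_domains from a local operations/puppet checkout.
The checkout path is a placeholder."""
import os
import yaml

PUPPET_REPO = "/path/to/operations/puppet"  # placeholder
HIERA_FILES = [
    "hieradata/role/common/cache/text.yaml",
    "hieradata/role/common/cache/upload.yaml",
]

domains = set()
for relpath in HIERA_FILES:
    with open(os.path.join(PUPPET_REPO, relpath)) as fh:
        data = yaml.safe_load(fh)
    # cache::alternate_domains is keyed by hostname
    domains.update(data.get("cache::alternate_domains", {}).keys())

for domain in sorted(domains):
    print(domain)
```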