[09:18:02] I have a question for the general audience. How would one transfer a million files / ~300GB of data from one host to another?
[09:18:46] we tried rsync::quickdatacopy but the rsync server runs in a chroot, which prevents it from mapping the uid/gid between the source and destination, resulting in a lot of madness to fix up the ownership
[09:19:02] so I felt like one could run rsync over ssh as root (which would preserve permissions)
[09:19:49] then I guess using the rsync::quickdatacopy rsync server instead of ssh has a reason. It just escapes me entirely
[09:21:52] hashar: Don't know if it has trouble with that many files, but maybe https://wikitech.wikimedia.org/wiki/Transfer.py ?
[09:22:16] hashar: E_NOTENOUGHINFO, can you do stuff on the source host? is it a one-time thing? can you make an archive on the source host?
[09:22:20] I guess ssh encryption might be too much overhead
[09:22:45] the context is to switch over Jenkins from one host to another, so there is plenty of build history to move around ;]
[09:23:35] don't we already have a setup that replicates jenkins to a passive host?
[09:24:13] yeah we have a spare jenkins, but the build history/artifacts are not synchronized. They are only on a single host
[09:25:31] jayme: I had a quick look at transfer.py, seems like it wants one big tarball of the files and that would not fit on disk unfortunately
[09:25:43] I think you need something slightly more structured
[09:26:00] for the one-off we could do a streaming tar.gz
[09:26:13] and then set up a puppetized rsync to keep it in sync
[09:26:27] unless jenkins has a better way to sync 2 servers
[09:28:07] Jenkins doesn't really have such a system afaik. I could change to have two Jenkins masters but there is a long tail of tasks to accomplish for that
[09:28:30] or I could move the data to a distributed file system such as Ceph / Swift or whatever and have the Jenkins primary and spare point to it
[09:28:42] but that is a lot of refactoring :]
[09:29:06] what would a streaming tar.gz / rsync look like? Is that about running tar on one side and netcat-ing it to the other side?
[09:30:05] pretty much, piping tar, compression and usually encryption on one side into netcat, and the opposite on the target host
[09:30:29] we used to do that for dbs before transfer.py
[09:30:43] and we still do something very similar in the wdqs data transfer cookbook
[09:31:07] assuming the destination is empty, I guess rsync over ssh as root is similar then?
[09:31:30] for a large number of files my bet is that it will be much slower
[09:32:07] and nc uses the bandwidth much better, even too much if you have to not interfere with current work
[09:32:13] but it can be rate limited
[09:34:04] (see for example the end of this section: https://wikitech.wikimedia.org/wiki/MariaDB/ImportTableSpace )
[09:35:49] cool thanks
[09:36:14] I guess I am going to read the transfer.py and ImportTableSpace pages, then emit proposals and reach out to the ops list for the general audience ;]
[09:36:18] thank you both!
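As a rough illustration of the streaming tar-over-netcat approach described at 09:30:05, here is a minimal Python sketch. The hostnames, paths and port are placeholders, netcat flags differ between the traditional and OpenBSD flavours, and a real run would add encryption and/or rate limiting as mentioned in the discussion:

```python
#!/usr/bin/env python3
"""Minimal sketch: stream a tar.gz from the source host straight into a
listening netcat on the destination, so no full archive is staged on disk.
Hostnames, paths and the port are placeholders."""
import subprocess
import time

SRC_DIR = "/var/lib/jenkins"        # tree to copy (placeholder)
DEST_DIR = "/var/lib/jenkins"       # target directory (placeholder)
DEST_HOST = "jenkins-new.example"   # destination host (placeholder)
PORT = 4444                         # any free TCP port

# Receiver: on the destination, netcat listens and un-tars the stream
# directly into the target directory (flags are traditional-netcat style).
receiver = subprocess.Popen(
    ["ssh", f"root@{DEST_HOST}", f"nc -l -p {PORT} | tar -xzf - -C {DEST_DIR}"]
)
time.sleep(2)  # crude: give the listener a moment to come up

# Sender: tar + gzip the tree and pipe it into netcat; a rate limiter
# (e.g. pv) could be inserted into this pipeline to spare the link.
subprocess.run(
    f"tar -czf - -C {SRC_DIR} . | nc -q 1 {DEST_HOST} {PORT}",
    shell=True, check=True,
)

receiver.wait()
```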
[09:48:49] hashar: FYI CI for spicerack has this weird thing where sometimes pylint erroneously finds a bug
[09:48:59] I can't repro locally and a 'recheck' fixes it
[09:49:01] any idea?
https://integration.wikimedia.org/ci/job/tox-docker/15800/console
[10:09:17] volans: it smells a bit weird that spicerack does `from cumin import NodeSet`, when spicerack also imports from ClusterShell directly
[10:09:30] (though i don't see a connection between this and the issue you're having)
[10:10:44] kormat: yeah we could unify that or use cumin.nodeset (that uses the RESOLVER_NOGROUP) everywhere
[10:14:03] volans: i'm wondering if RemoteHosts sometimes gets a list as the `hosts` parameter. that would cause the issue
[10:14:19] does pylint honour type hints?
[10:17:14] not sure how much they affect the checker internally
[10:18:13] because otherwise how is it inferring the type of `remaining`, and inconsistently, too
[10:18:35] volans: smells like a pylint bug to me
[10:18:38] from how the tests use it, potentially
[10:19:13] volans: alright - i'd check to see if any tests pass in a list
[10:20:01] if you want, but don't lose time on it, I've already got jo.hn running tox in a loop to see if he can repro locally on his linux setup, as I wasn't able to
[10:20:04] repro
[10:23:42] volans: yeah you mentioned it last week. I looked a bit into it and I can't find any clue in CI, nor do I know anything about the spicerack code
[10:23:53] what would be possible is that the linter gets confused while it is parsing the files
[10:24:08] possibly because files are processed in a different order from one build to another
[10:24:23] or it is hitting a random bug unrelated to ci / spicerack :-\
[10:24:39] maybe it can be reproduced by reusing the same PYTHONHASHSEED
[10:25:24] (all of the above is a work of fiction and should not reflect reality. No animals have been harmed under supervision of PETA) (c) 2020 Wikimedia Movies Pictures
[10:28:19] volans: or what I do is download the console logs and artifacts for a good and a bad build and diff them
[10:28:26] it is a bit tedious, and often doesn't lead to any result
[10:28:28] but sometimes it does
[10:28:45] or maybe prospector can be run with some debug output and that can be used for comparison
[10:29:58] it might also be a bug in the python3.7 version we use. I have had past experience with issues in the Debian-provided pythons because not all hotfixes are necessarily cherry-picked into the debian package.
[10:34:09] volans: my suggestion: add `assert isinstance(remaining, NodeSet)` to that function, and run it in CI
[10:34:23] (yeah bandit will complain due to assert being used but..)
[10:34:37] who cares
[10:34:41] we can test it yes
[10:37:38] the cumin output in cookbooks is REALLY nice
[10:37:51] wikilove to Riccardo :D
[10:38:18] might be too verbose in *some* cases, we can tweak them down
[10:42:00] kormat: let's see with https://gerrit.wikimedia.org/r/c/operations/software/spicerack/+/647657
[10:42:23] locally unit tests pass
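To make the two repro ideas above concrete (running tox in a loop as at 10:20:01, and pinning PYTHONHASHSEED as at 10:24:39), a hedged Python sketch might look like the following; the tox environment name is a guess, not the real spicerack one:

```python
#!/usr/bin/env python3
"""Sketch: re-run the linter repeatedly with a pinned hash seed to see
whether the spurious pylint report can be reproduced deterministically.
The tox environment name is hypothetical."""
import os
import subprocess

env = dict(os.environ, PYTHONHASHSEED="0")  # pin hash randomization

for attempt in range(1, 21):
    result = subprocess.run(
        ["tox", "-e", "py3-prospector"],    # hypothetical env name
        env=env, capture_output=True, text=True,
    )
    if result.returncode != 0:
        print(f"attempt {attempt}: linter failed, tail of output follows")
        print(result.stdout[-2000:])
        break
    print(f"attempt {attempt}: clean")
else:
    print("no failure reproduced in 20 runs")
```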
[13:26:35] hi all, going to disable puppet for about 30 mins while I reboot the puppetmasters and puppetdbs
[14:50:05] q: is there a comprehensive list of top-level domains (e.g. wikimedia.org) that we operate and might receive requests for?
[14:50:19] I've got a hardcoded list here
[14:50:19] https://gerrit.wikimedia.org/r/c/analytics/refinery/source/+/646808/3/refinery-core/src/main/java/org/wikimedia/analytics/refinery/core/Webrequest.java#84
[14:50:29] would rather have something more official
[14:51:14] ottomata: counting 'non-canonicals' or not? :)
[14:57:19] non-canonicals?
[14:57:36] anything that we might receive an HTTP request for
[14:57:45] at the frontend cache level
[14:59:16] ottomata: configured ones are all between the dns repo and the ncredir config
[14:59:29] but there are additional ones that delegate their dns to us and that we don't know about
[14:59:32] or similar
[14:59:43] so the full canonical list is actually more complicated than that
[14:59:52] in general traffic should give you the pointers to what you want
[15:00:20] ottomata: i think for cache ones it may be `yaml.load('hieradata/role/common/cache/text.yaml')['cache::alternate_domains'].keys()`
[15:01:19] and the upload domains
[15:01:39] the non-canonicals are no longer routed to the cache layer
[15:01:51] (or to analytics in any way, AFAIK)
[15:02:07] ok cool, so cache/text.yaml and cache/upload.yaml should suffice?
[15:02:10] i'll check that out, thank you!
[15:02:27] you're looking for only the 2LD domains I guess, right?
[15:02:43] as in, just "wikipedia.org", not the list of language subdomains and www and all that beneath it?
[15:03:40] ottomata: ^ ?
[15:04:35] I don't think we have it as a succinct list, anywhere
[15:05:54] https://gerrit.wikimedia.org/r/plugins/gitiles/operations/puppet/+/refs/heads/production/modules/varnish/templates/wikimedia-frontend.vcl.erb#313
[15:06:04] ^ this is the closest thing to such a list, but it's a regex heh
[15:06:33] yes
[15:06:35] any HTTP hostname that doesn't match that regex, we reject as invalid
[15:06:48] hehe, i built my list from a regex we had in refinery source
[15:07:11] from this one
[15:07:11] * ottomata https://gerrit.wikimedia.org/r/c/analytics/refinery/source/+/646808/3/refinery-core/src/main/java/org/wikimedia/analytics/refinery/core/PageviewDefinition.java#76
[15:07:30] the main surprise there might be wikiworkshop.org - it's not in our primary unified cert, but we're handling it as canonical now through a separate LE cert
[15:07:49] oo and wmfusercontent
[15:07:49] ?
[15:07:51] also wmfusercontent is missing yeah
[17:24:50] not sure what happened, haven't dug into the actual reports at all, but: https://grafana.wikimedia.org/d/43OLwO2Mk/cdanis-hacked-up-nel-stats?viewPanel=2&orgId=1&from=1607584665237&to=1607600907284
[17:25:22] from 10:00 through 10:40 or so today we had way more h2.ping_failed / tcp.closed error reports than usual
[17:25:48] ah!
[17:25:50] https://sal.toolforge.org/log/8jAVTHYBpU87LSFJ5dDQ
[17:25:52] amazing
[17:36:36] cooool
[17:37:16] I believe, but am not sure, that h2.ping_failed is tantamount to "connection timed out"
[17:37:26] (not on establishment; an idle background connection)
[22:59:49] should I be worried about a 1.6k requests-per-second spike that are all erroring with 415 responses?
[22:59:52] https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?orgId=1&from=now-6h&to=now&refresh=30s
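For reference, the hieradata approach suggested at 15:00:20 could be fleshed out roughly as below. The puppet checkout path is a placeholder, and this only covers the cache::alternate_domains entries from the text and upload cache roles, not the canonical wiki domains matched by the VCL regex:

```python
#!/usr/bin/env python3
"""Sketch: list the extra domains served by the text and upload caches by
reading cache::alternate_domains from a local operations/puppet checkout.
The checkout path is a placeholder."""
import os
import yaml

PUPPET_REPO = "/path/to/operations/puppet"  # placeholder
HIERA_FILES = [
    "hieradata/role/common/cache/text.yaml",
    "hieradata/role/common/cache/upload.yaml",
]

domains = set()
for relpath in HIERA_FILES:
    with open(os.path.join(PUPPET_REPO, relpath)) as fh:
        data = yaml.safe_load(fh)
    # cache::alternate_domains is keyed by hostname
    domains.update(data.get("cache::alternate_domains", {}).keys())

for domain in sorted(domains):
    print(domain)
```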