[09:32:54] <_joe_> my 2 cents on the discussion on types above
[09:33:01] <_joe_> use integer as a type
[09:33:06] <_joe_> and then check in the code
[09:33:14] <_joe_> that it is 8 or 11, else fail
[09:33:24] <_joe_> that will make your code more readable
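A minimal sketch of the pattern _joe_ describes: accept a plain integer and validate the allowed values explicitly in code, failing fast otherwise. Python and the names here are illustrative only; the allowed values 8 and 11 are the ones mentioned above, everything else is hypothetical.

    ALLOWED_VERSIONS = {8, 11}  # the two values mentioned above; adjust as needed

    def set_version(version: int) -> int:
        """Accept a plain integer and validate it explicitly, failing fast otherwise."""
        if not isinstance(version, int):
            raise TypeError(f"version must be an int, got {type(version).__name__}")
        if version not in ALLOWED_VERSIONS:
            raise ValueError(f"version must be one of {sorted(ALLOWED_VERSIONS)}, got {version}")
        return version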
[10:32:37] <_joe_> akosiaris: so elukey has our same needs - he is building a debian package that involves using a docker image
[10:32:41] <_joe_> an external one
[10:32:47] ouch
[10:32:49] <_joe_> he's been doing so on his laptop
[10:33:04] elukey: I feel for you. which package?
[10:33:04] <_joe_> I was proposing to institutionalize the use of the "packaging" project in cloud
[10:34:17] akosiaris: so I am testing Apache BigTop as a replacement for Cloudera, currently in Hadoop test.
[10:34:18] I'd rather elukey tried it out first to see if it fits what he wants to do. the packaging project is highly tailored for building kubernetes packages currently
[10:34:35] they build packages via images like bigtop/slaves:1.4.0-debian-9
[10:34:40] <_joe_> he's doing so on his laptop
[10:35:06] <_joe_> he was asking about running it on boron, I'm not enthusiastic at the idea of doing that
[10:35:15] it wouldn't work anyway
[10:35:17] now normally they offer all packages via an apt repo, but of course I found some bugs that need a rebuild (all merged upstream of course, but they are focusing on the new release, so they prefer not to rebuild/release new packages)
[10:37:17] elukey: you might want to try to build it on builder01.packaging.eqiad.wmflabs
[10:37:45] it has docker around and some 110G available from what I see
[10:37:46] I'm planning to add a buster-based build host next week, BTW
[10:37:50] akosiaris: I will, how should I move debs to install100x afterwards?
[10:38:06] in parallel to boron initially and then we can switch off boron when all is fine
[10:38:30] elukey: I just do a ssh builder01.packaging.eqiad.wmflabs cat lala.deb | ssh install1002 sh -c 'cat > lala.deb'
[10:38:54] not great, but it's not like my packages are gigabytes in size
[10:39:08] we're hitting certain limits of running our build host on an outdated toolchain...
[10:39:27] akosiaris: ack, I have java packages so you can imagine how big they are :D
[10:39:45] lol, do I dare ask?
[10:40:05] well mine are usually ~100-150MB in size
[10:40:31] 188M output/hadoop
[10:40:32] 629M output/oozie
[10:40:35] not that horrible
[10:40:49] (these are the out dirs after building etc..)
[10:40:50] not the debs mind you, but the entire *orig.tar.gz, *deb, *dsc, *changes
[10:40:55] ouch
[10:41:42] we could run an nginx on that host to allow serving the packages
[10:42:04] I've been postponing doing that on purpose, to avoid institutionalizing building packages outside of our infrastructure
[10:42:20] so in theory this use case should be rare for me
[10:42:32] since I hope to rely on bigtop's official packages
[10:42:37] like we do for cloudera
[10:42:54] but it might happen that we'll need to rebuild some of them for bugs etc..
[10:43:14] so I wouldn't ask to change a ton of things only for my use case :)
[10:47:09] I don't have a good answer tbh.
[10:47:59] akosiaris: people using java are discriminated against in here, I know
[10:48:04] :D
[10:48:34] lol
[10:48:50] I don't think java has anything to do with it fwiw
[10:49:15] yeah, we also discriminate against many other languages :-)
[10:49:17] it's just that adding support for pulling random docker images from the internetz in our production infrastructure
[10:49:31] is a discussion that we need to have?
[10:50:10] I am all these years consciously waltzing around the issue to get our kubernetes packages built, you just found yourself in the same position
[10:53:21] let me ask a wider question. We are at some point going to have a new CI (hopefully), one that is architected differently in order to make it more difficult for an attacker to gain a foothold in production. Interestingly, there's a very high chance this CI is not going to be in our production infrastructure
[10:53:49] I would very much like to get our debian packages built by this CI/CD infra and use them as is
[10:54:20] instead of the folly that we now have where CI builds the packages in labs and then we rebuild them on boron
[10:55:16] so, regarding using random docker images from the internet for building said packages, is this a use case we want to support? I think so.
[10:55:44] it's not like we don't download random nodejs modules from the internet right now
[10:56:02] granted, debian packages run their installation scripts as root, not as a random user
[10:56:45] but maybe we can build some safeguards?
[11:02:04] <_joe_> tbh I don't want to support that
[11:02:11] <_joe_> but it's impossible not to
[11:02:18] <_joe_> most new things are moving to that model
[11:02:22] <_joe_> it's horrible and sad
[11:02:40] <_joe_> but going against that trend is just time draining and frankly a bad use of our scarce time
[11:04:22] +1
[11:19:09] <_joe_> I'm still resisting using externally-built docker containers in production for the security reasons we've all talked about extensively
[13:04:14] If nutcracker has two hosts configured, with one down, what behaviour should I see when connecting? Will it fail over to the working server or will a client get a timeout if nutcracker hits the bad host?
[13:05:22] Currently the client is getting timeouts so I'm assuming it's the latter.
[13:19:16] akosiaris: btw: look up scp -3
[13:21:00] cdanis: I know. old habits die hard I guess
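For reference, scp -3 copies a file between two remote hosts by routing the transfer through the local machine, replacing the ssh-cat pipeline akosiaris quoted at 10:38:30. A minimal sketch, wrapped in Python only so all the examples in this log share one language; the hostnames and filename are the ones from the conversation:

    import subprocess

    def copy_between_hosts(src: str, dst: str) -> None:
        """Copy src to dst, where both are remote host:path specs.

        scp -3 routes the transfer through the local machine, so the two
        remote hosts don't need to be able to reach each other directly.
        """
        subprocess.run(["scp", "-3", src, dst], check=True)

    # roughly equivalent to the ssh | ssh pipeline quoted at 10:38:30
    copy_between_hosts("builder01.packaging.eqiad.wmflabs:lala.deb",
                       "install1002:lala.deb")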
[13:35:35] hnowlan: iirc it is rehashing until the failed one comes back up
[13:35:40] but I am not 100% sure
[13:37:32] * DominicBM waves godog
[13:39:55] DominicBM: hi!
[13:40:15] DominicBM: thanks for reaching out re: T248151
[13:40:15] T248151: Big number of uploads from DPLA bot - https://phabricator.wikimedia.org/T248151
[13:41:55] So, to answer some of your questions, I am uploading media from DPLA, which is a national aggregator of library metadata in the US. The reason there are so many pages from books right now is mainly because some contributors (especially university archives) consist almost entirely of those.
[13:44:14] The way I am doing this is by going to each item's IIIF manifest and uploading everything in sequence. So if they have chosen to digitize materials with an individual JPG for every page of a 500-page book, those are what are getting uploaded.
[13:47:10] (Btw, I have a bot flag and am using pywikibot, which I know sleeps when there is server lag. I am trying to do everything above board. I guess I just assumed the server already was enforcing some kind of rate limit! :) )
[13:48:18] DominicBM: that explains it, thanks! I don't know much about the preferred way on the commons side re: one file per page, hence my question
[13:49:38] DominicBM: yeah re: rate limit, server lag would be related to database lag vs rate limit of uploads
[13:49:59] It usually depends on the source
[13:50:12] If it's an N-page tiff, upload it as such. If it's jpeg per page... do that
[13:50:59] sounds good
[13:51:06] There is also apparently some issue where Pywikibot uploads won't succeed for larger files, like 100+ MB? There are some multi-page PDFs in the set, but I've had to exclude most book-size PDFs from my uploads, because chunked uploads fail.
[13:51:14] MW does have rate limits for uploads
[13:51:20] Defaults:
[13:51:20] // File uploads
[13:51:21] 'upload' => [
[13:51:21]     'ip' => [ 8, 60 ],
[13:51:21]     'newbie' => [ 8, 60 ],
[13:51:21] ],
[13:51:27] I think it's https://phabricator.wikimedia.org/T129216 ?
[13:52:11] Reedy: what's the unit there?
[13:52:18] I can never remember :P
[13:52:23] I think it's uploads per 60s
[13:52:25] just checking
[13:52:43] Yeah. That's 8 uploads (files) per 60s
[13:53:01] But..
[13:53:08] That presumably means users/bots/whatever have no limits
[13:53:10] * Reedy checks prod conf
[13:53:41] that matches what I've seen yeah, certainly not 8 uploads a minute per bot
[13:53:47] ["user"]=>
[13:53:47] array(2) {
[13:53:47]   [0]=>
[13:53:47]   int(380)
[13:53:48]   [1]=>
[13:53:48]   int(4320)
[13:53:50] }
[13:54:10] 380 per... 72 minutes?
[13:54:25] And some of the groups on commons are even higher
[13:54:27] I think I am around 30 uploads/min at my peak.
[13:55:03] https://phabricator.wikimedia.org/P10742
[13:56:06] ack, thanks!
[13:56:25] DominicBM: do you know how big the dataset is in terms of files / bytes?
[13:57:31] Reedy: presumably bots and DPLA bot specifically are in the "unlimited" class?
[13:57:44] like 1k uploads a minute
[13:58:16] godog: It's turned out much bigger than I expected. It's only about 50,000 actual works. I am using the DPLA API to filter works by appropriate rights, and then uploading each. I knew some would have multiple files, but I never expected how many would have hundreds!
[13:58:51] godog: I would've hoped the user rights applied to bots
[13:59:22] I don't have a count right now, because I was just blindly uploading the files from the manifest. I guess I would have to write a script to calculate the true number of media files, as opposed to items.
[14:00:33] DominicBM: a rough idea is fine too btw, especially in terms of bytes, or another indication would be how much is left
[14:01:00] But yeah.. looking at https://commons.wikimedia.org/w/index.php?title=Special:Contributions/DPLA_bot&offset=&limit=500&target=DPLA+bot
[14:01:05] That doesn't seem to be the case
[14:01:22] indeed
[14:02:01] DominicBM: to be clear nothing is broken or on fire :) although it is an early warning system for unusual upload activity
[14:02:04] DominicBM: to clarify a bit -- we're happy to take the uploads, just want to be sure we buy enough hard drives :)
[14:03:18] I'm definitely done with more than half the current batch from what I can tell, but it's been almost 400,000 uploads already... :)
[14:03:53] out of curiosity, is the number of bytes available in their metadata API, before you actually fetch the item contents?
[14:05:10] I actually don't know that IIIF has a field for that.
[14:06:29] DominicBM: ack, fair to say the current batch should be done by early next week (?)
[14:06:35] I use the DPLA API for the list of item IDs, and then I can take each ID in order to query for an IIIF manifest of all the media assets for the item. It looks like https://lib.digitalnc.org/nanna/iiif/35395/manifest
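The counting script DominicBM mentions at 13:59:22 could be sketched against manifests like the one linked above. A hedged sketch, assuming IIIF Presentation API 2.x manifests (sequences containing canvases containing images) and the third-party requests library; field names differ in other IIIF versions:

    import requests

    def count_manifest_images(manifest_url: str) -> int:
        """Count the image resources referenced by one IIIF 2.x manifest."""
        manifest = requests.get(manifest_url, timeout=30).json()
        total = 0
        for sequence in manifest.get("sequences", []):
            for canvas in sequence.get("canvases", []):
                total += len(canvas.get("images", []))
        return total

    # the manifest linked at 14:06:35
    print(count_manifest_images("https://lib.digitalnc.org/nanna/iiif/35395/manifest"))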
[14:08:19] FYI, this project is a Sloan-funded grant in collaboration with WMF. I'm working in close contact with Ben and Sandra, and now Fiona, from Community Programs.
[14:09:08] godog: Hoping so. Or at least the rate will be lower.
[14:10:38] Or if you want, I can increase the rate with more concurrent sessions, since there are no limits. :D :D
[14:11:20] haha! the systems can certainly take that
[14:12:26] I am actually more surprised not to have gotten complaints from Commons editors yet. ;)
[14:13:36] godog: Duh. Found it
[14:13:41] DominicBM: thanks for your help so far btw, feel free to reach out on phab at the sre-swift-storage tag if there are other big batches, it'd be nice to know if e.g. we're talking hundreds of GBs of data that will be uploaded periodically, a one-time thing isn't a whole lot of a problem for capacity planning purposes
[14:13:42] Bots have "noratelimit"
[14:13:57] So of course, he can go as fast as MW/our servers allow it
[14:14:06] Reedy: ow, that explains it alright
[14:14:45] I'd certainly welcome tuning that down
[14:15:02] Would need some work to MW
[14:15:12] Cause noratelimit just overrides *all* rate limits
[14:15:22] Which probably isn't a problem on other wikis
[14:15:28] But commons, where this allows uploads... is more of an issue
[14:16:03] And I doubt just removing noratelimit from all bots on commons would necessarily be welcomed
[14:16:25] Ah, we have a workaround though
[14:16:53] We just need to set '&can-bypass' => false
[14:18:14] ah, so in theory doable
[14:18:25] godog: Technically, this is only a small part of total collections, since DPLA is aggregating all US cultural heritage (it's a project similar to Europeana, if you're familiar). So there are about 1.5 million items with compatible rights for Commons. And I've only done <30,000 of them (with about 10+ files per item).
[14:18:25] happy to file a task (which tags?)
[14:18:41] https://gerrit.wikimedia.org/r/582046 is the change
[14:18:50] We probably want to reference/link a task though
[14:19:07] As we're doing it for perf/site reliability reasons, we don't need "community consensus"
[14:19:37] godog: We were starting with uploading a "small" collection as a pilot/demo, so what I am doing right now is just trying to get all North Carolina items uploaded.
[14:20:00] But we *might* want to look at the figure we're using and set it to a bit higher than 380 per 72 minutes, as we know we can handle more than that
[14:20:56] There may be a lull after I finish North Carolina off, but, long term, we're expecting many more over the next year.
[14:21:44] DominicBM: I see, thanks for the context! yes, that definitely sounds like something where we should get a rough idea of how many bytes we're talking about in total
[14:23:35] One question: is enforcing upload rate limits on Commons going to necessitate an update to Pywikibot? I don't know if it already handles that.
[14:24:01] No idea
[14:24:50] same here
[14:27:22] I think Pywikibot is likely how all or most bot users who would be affected by the rate limit are performing the uploads. I would just want to make sure this change doesn't break all Commons upload bots, because that may not be well-received by the community
[14:29:04] No one needs to upload thousands of images per hour ;)
[14:29:07] Nothing is that urgent
[14:32:01] Oh, okay, I thought you were setting the limit lower, like at the speed I was operating at.
[14:33:28] Not quite. It's possible we might, but no major reason to do so just yet
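The [380, 4320] pair quoted above is in MediaWiki's [number of actions, window in seconds] format, i.e. 380 uploads per 72 minutes. A client that wants to stay under such a limit proactively can simply space its requests out; a minimal sketch of that idea, not pywikibot's actual implementation:

    import time

    class RateLimiter:
        """Space actions evenly so at most `count` happen per `seconds` window."""

        def __init__(self, count: int, seconds: int):
            self.min_interval = seconds / count  # e.g. 4320/380 ~= 11.4s between uploads
            self.last_action = 0.0

        def wait(self) -> None:
            now = time.monotonic()
            sleep_for = self.last_action + self.min_interval - now
            if sleep_for > 0:
                time.sleep(sleep_for)
            self.last_action = time.monotonic()

    limiter = RateLimiter(380, 4320)  # the 'user' limit quoted at 13:53:47
    # before each upload: limiter.wait()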
[14:35:29] I'm not saying people need to do uploads that fast at all. I wouldn't mind if Pywikibot throttled me.
[14:35:40] I am just pointing out that ideally it should handle that gracefully, with throttling to get you under the rate limit, but hopefully not just break if it's an API error it doesn't currently expect.
[14:37:12] indeed
[14:38:42] DominicBM: all good from my point of view now, I'll update the task with a summary of this conversation and next steps, thanks again!
[14:39:52] I'd presume pywikibot would handle MW throttling
[14:40:04] As it doesn't have to be run from bot/admin accounts with no rate limit rights
[14:40:47] Okay, great! I can get on IRC easily enough, but don't idle on it these days. But I am also always reachable on-wiki, Telegram, or dominic@dp.la
[16:07:11] ol
[16:14:24] weechat refresh is rough but slack is working again yay
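On the graceful-handling point raised at 14:35:40: the reactive complement to the proactive limiter sketched earlier is to retry with backoff when the server signals a rate limit. A hedged sketch; "ratelimited" is MediaWiki's API error code for an exceeded rate limit, while do_upload and the exception type are hypothetical stand-ins for whatever client library is in use:

    import time

    class ApiError(Exception):
        """Hypothetical stand-in for a MediaWiki client's API error."""
        def __init__(self, code: str):
            super().__init__(code)
            self.code = code

    def upload_with_backoff(do_upload, max_retries: int = 5) -> None:
        """Retry an upload with exponential backoff if the API says 'ratelimited'."""
        for attempt in range(max_retries):
            try:
                do_upload()
                return
            except ApiError as e:
                if e.code != "ratelimited":
                    raise  # unexpected errors should surface, not be swallowed
                time.sleep(2 ** attempt)  # 1s, 2s, 4s, ... before retrying
        raise RuntimeError("still rate-limited after retries")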