[06:17:41] <godog> hi folks, not sure if the tags are correct on this task, but could someone take a look at it? T314835
[06:17:42] <stashbot> T314835: wdqs space usage on thanos-swift - https://phabricator.wikimedia.org/T314835
[07:52:00] <gehel> tanny411: hello! any chance we could move our 1:1 slightly earlier? For example now?
[07:52:38] <tanny411> Sure, I'm here
[07:52:53] <gehel> thanks!
[08:12:22] <gehel> godog: how urgent is it (how long before we run out of space)?
[08:13:06] <gehel> David and Erik are both on vacation (I know, bad planning). They are the ones who really understand how all of this works.
[08:13:37] <gehel> I can dig into it, but I might need to give them a call.
[08:14:53] <godog> gehel: when are they back?
[08:15:20] <godog> I think a week might be ok, but I'd rather not push it
[08:17:25] <gehel> Next week
[08:20:08] <gehel> It sounds like Flink stopped some kind of cleanup process, but knowing how to fix that is going to require more knowledge of Flink than I have.
[08:20:26] <gehel> Let me check with Platform Engineering to see if they can help
[08:22:14] <godog> gehel: thank you!
[08:23:20] <godog> let me know how it goes; if we can't mitigate it I can purge some historical metrics, but I'd rather not have to
[08:25:48] <gehel> Purge metrics? To recover space from a different application?
[08:26:43] <gehel> Nah, worst case, we stop the updater. That's in line with our SLO (at least mostly).
[08:48:09] <godog> gehel: oh ok! thank you, that's good to know we can stop the updater instead
[08:49:16] <gehel> That's a bit of a worst-case scenario, it would (obviously) stop WDQS from being updated, which has a significant impact on Wikidata.
[08:49:54] <gehel> But we are under-resourced to manage WDQS well, so that's one of the expected risks.
[08:50:14] <gehel> cc: mpham (for when you wake up)
[09:03:26] <godog> *nod*
[09:41:39] <gehel> Lunch + errand
[12:46:32] <inflatador> greetings
[12:49:52] <inflatador> I'm working today in lieu, will start the reimage soon
[12:50:07] <inflatador> Doubt I can help much with the Flink situation, but do let me know if I can learn
[13:02:00] <gehel> inflatador: there is some conversation on Slack: #data-platform-value-stream
[13:02:31] <gehel> Not sure if you can help, or learn something in the process
[13:05:34] <gehel> inflatador: also, could you check and merge T314853? Marco is checking if the data import is running correctly. It might be interesting for you to pair with him and learn a bit about airflow in the process.
[13:06:22] <gehel> Just a suggestion, and only if the reimages are running as they should
[13:08:18] <dcausse> o/
[13:17:09] <dcausse> the problem seems to be codfw only and related to swift
[13:18:05] <dcausse> "Unable to wrap exception of type class org.apache.flink.fs.openstackhadoop.shaded.org.apache.hadoop.fs.swift.exceptions.SwiftAuthenticationFailedException: it has no (String) constructor"
[13:22:13] <godog> dcausse: hi! interesting, I can't follow up ATM though
[13:22:41] <godog> likely in ~1.5h though, I'll read what you find!
[13:23:56] <inflatador> gehel ACK, looking at puppet patch now
[13:25:51] <gehel> inflatador: thanks!
[13:30:17] <gehel> dcausse: thanks a lot! This isn't super urgent (we have a few days before we run out of space) and you should be on vacation. So don't drop everything for this!
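For reference, the container usage godog is worried about can be inspected directly with the swift CLI. A minimal sketch, reusing the auth endpoint and account that appear in dcausse's delete command later in this log (PASS is a placeholder for the real key):

    # Show object count and total bytes for the container
    # (reported as X-Container-Object-Count / X-Container-Bytes-Used).
    swift -A https://thanos-swift.svc.eqiad.wmnet/auth/v1.0 -U wdqs:flink -K PASS \
        stat rdf-streaming-updater-codfw

    # List what is accumulating under the Flink HA path.
    swift -A https://thanos-swift.svc.eqiad.wmnet/auth/v1.0 -U wdqs:flink -K PASS \
        list rdf-streaming-updater-codfw --prefix flink_ha_storage/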
[13:30:35] <gehel> But honestly, I have some doubts that we will be able to fix it without you
[13:31:08] <dcausse> gehel: I won't have access to a computer for 1 week starting from tomorrow
[13:31:31] <gehel> :/
[13:31:53] <inflatador> gehel OK, merged. Do I need to run puppet on all/any hosts listed here? https://wikitech.wikimedia.org/wiki/Analytics/Systems/Airflow
[13:33:32] <gehel> Ideally on 'an-airflow1001.eqiad.wmnet', but it runs automatically every 30' and I don't think that Marco is in that much of a hurry
[13:37:15] <elukey> hey folks
[13:37:19] <inflatador> NP, I ran it on an-airflow1001.eqiad.wmnet
[13:37:33] <inflatador> <o/
[13:37:35] <elukey> o/
[13:37:47] <elukey> are we discussing the flink/thanos issue in here?
[13:37:50] <elukey> or do we prefer slack?
[13:38:03] <dcausse> elukey: o/ here is fine
[13:38:17] <elukey> hey dcausse!
[13:38:42] <elukey> I just added in the task that there seem to be ~18T of checkpoints saved in thanos
[13:38:54] <elukey> for rdf-streaming-updater-codfw+segments
[13:38:58] <dcausse> elukey: yes, I think codfw is misbehaving
[13:39:18] <dcausse> was about to blame the swift client that we still use for flink ha
[13:39:55] <dcausse> it caused us problems in the past and we dropped it in favor of the S3 client
[13:40:06] <dcausse> but we still use it for flink_ha storage
[13:41:23] <elukey> can we drop some of the old checkpoints by any chance?
[13:41:26] <dcausse> will try to stop the job and make some cleanups at least
[13:41:28] <dcausse> yes
[13:41:35] <elukey> super, lemme know if you need any help
[13:41:40] <dcausse> sure, thanks!
[13:52:59] <dcausse> wdqs lag on codfw might start to scream and might cause bots to stop editing
[13:53:40] <dcausse> we should route traffic to eqiad only and update the wikidata max lag detection to only check eqiad
[13:53:55] <dcausse> inflatador: ^ is this something you could take care of?
[13:59:35] <inflatador> dcausse I think so, checking now
[13:59:52] <dcausse> thanks!
[14:09:15] <dcausse> inflatador: https://gerrit.wikimedia.org/r/821753 should stop polling wdqs for the max lag detection
[14:15:31] <dcausse> I mean wdqs@codfw
[14:17:34] <inflatador> dcausse cool, merged that patch. Still working on the depool command, I've reached out in sre, just wanna make sure I don't accidentally depool everything
[14:18:14] <dcausse> inflatador: thanks! lemme know once it's depooled and I'll stop the wdqs job too (stopped the wcqs job already)
[14:30:35] <inflatador> dcausse codfw is depooled from wdqs
[14:30:45] <dcausse> inflatador: thanks!
[14:30:58] <dcausse> stopping the wdqs job
[15:01:15] <gehel> dcausse: huge thanks for taking the time during your vacation!
[15:25:22] <dcausse> starting to clean up the rdf-streaming-updater-codfw swift bucket
[15:38:42] <inflatador> dcausse any idea how long we might need to keep codfw depooled?
[15:39:22] <dcausse> inflatador: I hope we can repool it in a couple of hours if the cleanup is going fast enough
[15:39:58] <inflatador> ACK, just wanted to make sure we don't leave it down too long. If there's a dashboard I can watch, LMK
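The exact depool command isn't quoted in the log; as a sketch of the usual conftool flow, assuming wdqs is exposed as a standard discovery (dnsdisc) record:

    # Check the current state of the wdqs discovery record first.
    sudo confctl --object-type discovery select 'dnsdisc=wdqs' get

    # Drop codfw from the wdqs discovery record so traffic is served by eqiad only.
    sudo confctl --object-type discovery select 'dnsdisc=wdqs,name=codfw' set/pooled=false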
[15:40:07] <dcausse> inflatador: I'm tempted to move forward with T304914 (at least partially). I mean by using the s3 client on codfw, at least
[15:40:08] <stashbot> T304914: Remove the presto client for swift from the flink image - https://phabricator.wikimedia.org/T304914
[15:40:33] <dcausse> but I might need help with this (delete some k8s resources and some configmaps)
[15:42:55] <inflatador> dcausse I have a mtg in about an hour, do you think we could finish before then? Or maybe after that? I know you're on vacation, just wanna be sensitive to that
[15:43:43] <dcausse> the cleanup will take some time, but we could perhaps prepare the k8s namespace in codfw if you have time?
[15:45:04] <dcausse> we'd need to fully undeploy the rdf-streaming-updater deployment in k8s@codfw, but I'm not sure I can do that
[15:48:20] <inflatador> dcausse OK, I'm rescheduling my mtg (lots of ppl cancelled anyway) and will create a Meet shortly
[15:49:38] <dcausse> thanks!
[15:50:43] <inflatador> OK, up at meet.google.com/ngq-nvrq-mir
[15:51:03] <dcausse> joining
[16:34:33] <dcausse> purging swift is quite slow and I'm getting errors like "Error Deleting: rdf-streaming-updater-codfw/flink_ha_storage/default/completedCheckpoint26ba64d3a1ec: ('Connection broken: IncompleteRead(6 bytes read)', IncompleteRead(6 bytes read))"
[16:36:33] <dcausse> I see some space being reclaimed, but the folder I delete with "swift -A https://thanos-swift.svc.eqiad.wmnet/auth/v1.0 -U wdqs:flink -K PASS delete rdf-streaming-updater-codfw --prefix commons/checkpoints/1475a2038f088807f9d695aea3e1c7e3c" still has entries
[16:36:55] <dcausse> godog: in case you know if there's a better way to purge some files ^
[16:47:13] <inflatador> Added some notes on the k8s purge/deploy we just did, feel free to add/change: https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Flink_On_Kubernetes#Kubernetes_operations
[17:02:59] <inflatador> workout, back in ~45
[17:12:14] <dcausse> hm... purging the container like that is not going to work
[17:14:11] <dcausse> I could possibly store the savepoints to hdfs and then completely drop the container, this might perhaps be a lot faster
[17:24:29] <dcausse> but I don't know how to drop a container, so I'm a bit stuck
[17:47:34] <inflatador> back
[17:50:24] <inflatador> dcausse if you still need help LMK. not sure about dropping containers though
[17:51:17] <dcausse> inflatador: thanks, but not sure what to do. I'm going to drop the flink_ha_storage folder first so that I can resume operation, but the mass cleanup will take a while
[17:51:30] <inflatador> ACK
[17:58:55] <dcausse> updated the ticket, will check back a bit later
[18:15:41] <mpham> is slack down for anybody else?
[18:16:04] <cbogen_> yeah I can't load threads
[18:16:30] <inflatador> mpham I tried sending you a slack msg and it was rejected
[18:16:55] <mpham> ok, good to know. yeah, it's not working for me either
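The k8s purge/deploy they paired on is written up at the wikitech link above; as a rough sketch of the usual deployment-charts/helmfile flow (the chart path and environment name are assumptions, not taken from this log):

    # On the deployment host; the path follows the usual helmfile.d layout.
    cd /srv/deployment-charts/helmfile.d/services/rdf-streaming-updater
    helmfile -e codfw destroy   # fully undeploy the release, configmaps included
    helmfile -e codfw apply     # redeploy once the config points at the new storage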
[19:13:43] <dcausse> well... not much progress on the "swift" cleanup
[19:15:04] <dcausse> one option would be to resume the jobs on a new container and hope that there's a command to drop a full container
[19:15:47] <inflatador> we used to have to delete swift containers for customers all the time at my old job
[19:16:04] <inflatador> swiftly was the preferred tool, but it's been a loooong time ( http://gholt.github.io/swiftly/2.06/ )
[19:16:08] <dcausse> everything I see requires deleting all the objects first
[19:16:21] <dcausse> but it's from the swift client
[19:16:43] <dcausse> there might be admin commands that allow bypassing this check
[19:17:20] <inflatador> dcausse yeah, that was/is the problem... swift won't let you delete the container until it's empty
[19:17:36] <inflatador> swiftly and some other tools will do that for you automatically, trying to remember what the best one is
[19:18:42] <dcausse> there are bazillions of files in flink_ha_storage/default/ and not sure how much time it'll take before it's empty...
[19:19:24] <dcausse> I should have added an alert on this...
[19:20:17] <dcausse> I think the best way forward is to put the data in a new container and resume the jobs so that it's "easy" to clean up the bad container later
[19:20:25] <inflatador> swiftly is oooold, probably only works with python2, but it does allow you to delete all, see https://docs.rackspace.com/support/how-to/install-the-swiftly-client-for-cloud-files/ . Also allows concurrent object deletion, but we should check with data persistence before we start hammering the API
[19:20:26] <dcausse> please let me know if you have a better option
[19:21:12] <inflatador> I don't have any better ideas
[19:21:39] <dcausse> ok, going to configure the system to use the "rdf-streaming-updater-codfw-T314835" container then
[19:21:40] <stashbot> T314835: wdqs space usage on thanos-swift - https://phabricator.wikimedia.org/T314835
[19:21:56] <inflatador> I'm guessing that deleting multiple TB of data from swift will probably take a few days unless data persistence knows any backend magic
[19:26:04] <inflatador> Lunch, back in ~30
[20:12:00] <dcausse> sigh... no luck, can't start the job on this new swift container: Service: Amazon S3; Status Code: 400; Error Code: 400 Bad Request; Request ID: txda600f7bcca7429ab42ab-0062f2bef2; S3 Extended Request ID: txda600f7bcca7429ab42ab-0062f2bef2; Proxy: null
[20:15:31] <inflatador> bah
[20:16:42] <dcausse> not sure what's wrong...
[20:17:04] <dcausse> perhaps the S3 compat layer is something that needs to be activated on a per-container basis?
[20:17:44] <inflatador> the only thing I can think of offhand is the path-style vs bucket-style issue we saw with the Elastic stuff: https://wikitech.wikimedia.org/wiki/Search/S3_Plugin_Enable#Path-style_and_bucket-style_access
[20:23:04] <dcausse> yes, it's on here
[20:25:45] <dcausse> back to square 0...
[20:32:16] <inflatador> maybe there's a way to test it outside of k8s?
[20:32:39] <dcausse> I was testing from yarn (the analytics cluster)
[20:33:31] <dcausse> btw testing swift from codfw (search-load2002) I randomly get Container GET failed: https://thanos-swift.discovery.wmnet/v1/AUTH_wdqs/rdf-streaming-updater-codfw?format=json&prefix=flink_ha_storage/default 401 Unauthorized [first 60 chars of response] b'<html><h1>Unauthorized</h1><p>This server could not verify t'
[20:33:33] <dcausse> Failed Transaction ID: tx019b7dab38e944ef80bdd-0062f2c481
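On the path-style question: Flink's S3 filesystems take their endpoint and addressing style from flink-conf.yaml. A sketch of what pointing the job at the swift S3 compat layer might involve (the endpoint is taken from this log; the keys are Flink's documented S3 options, and whether path-style is what thanos-swift expects here is an assumption):

    # Append S3 options to the Flink configuration (sketch).
    cat >> flink-conf.yaml <<'EOF'
    s3.endpoint: https://thanos-swift.discovery.wmnet
    s3.path.style.access: true
    EOF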
[20:35:05] <inflatador> Per conversation w/ gehel, I think we're stable enough if you want to get on with your vacation. Also, FWIW, I have seen problems with swift-proxy on the backend manifest as 401/403s on the frontend
[20:37:48] <gehel> dcausse: go enjoy your vacation! You've done a lot already! We'll do our best to survive until you get back for real!
[20:38:00] <dcausse> ok, I can leave it down for the rest of my vacation, but I'm not sure we have enough retention on the kafka topics
[20:38:28] <dcausse> I can also run the jobs from yarn using the same swift container
[20:39:15] <dcausse> your call
[20:39:55] <gehel> If there is something quick that you can do, please go ahead. But you should really be on vacation.
[20:40:33] <gehel> Worst case, a bit more work next week to reset everything from scratch, but it should not have user impact.
[20:40:43] <dcausse> ok, I'll start them from yarn, will update the ticket with paths that should not be cleaned up
[20:41:51] <gehel> And we need to talk about how you get that day back!
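For the record, restarting the updater from YARN with a savepoint restore would look roughly like this (a sketch: the jar name and savepoint path are placeholders, not taken from this log; -s is Flink's standard restore-from-savepoint flag):

    # Submit the streaming updater to YARN, restoring state from a savepoint
    # stored in the new container.
    flink run -m yarn-cluster \
        -s s3://rdf-streaming-updater-codfw-T314835/wikidata/savepoints/<savepoint-id> \
        streaming-updater.jar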