[06:17:41] <godog> hi folks, not sure if the tags are correct on this task, but could someone take a look at it? T314835
[06:17:42] <stashbot> T314835: wdqs space usage on thanos-swift - https://phabricator.wikimedia.org/T314835
[07:52:00] <gehel> tanny411: hello! any chance we could move our 1:1 slightly earlier? For example now?
[07:52:38] <tanny411> Sure, I'm here
[07:52:53] <gehel> thanks!
[08:12:22] <gehel> godog: how urgent is it (how long before we run out of space)?
[08:13:06] <gehel> David and Erik are both on vacation (I know, bad planning). They are the ones who really understand how all of this works.
[08:13:37] <gehel> I can dig into it, but I might need to give them a call.
[08:14:53] <godog> gehel: when are they back?
[08:15:20] <godog> I think a week might be ok, but I'd rather not push it
[08:17:25] <gehel> Next week
[08:20:08] <gehel> It sounds like Flink stopped some kind of cleanup process, but knowing how to fix that is going to require more knowledge of Flink than I have.
[08:20:26] <gehel> Let me check with Platform Engineering to see if they can help
[08:22:14] <godog> gehel: thank you!
[08:23:20] <godog> let me know how it goes; if we can't mitigate it I can purge some historical metrics, but I'd rather not have to
[08:25:48] <gehel> Purge metrics? To recover space from a different application?
[08:26:43] <gehel> Nah, worst case, we stop the updater. That's in line with our SLO (at least mostly).
[08:48:09] <godog> gehel: oh ok! thank you, that's good to know we can stop the updater instead
[08:49:16] <gehel> That's a bit of a worst-case scenario, it would (obviously) stop WDQS from being updated, which has a significant impact on Wikidata.
[08:49:54] <gehel> But we are under-resourced to manage WDQS well, so that's one of the expected risks.
[08:50:14] <gehel> cc: mpham (for when you wake up)
[09:03:26] <godog> *nod*
[09:41:39] <gehel> Lunch + errand
[12:46:32] <inflatador> greetings
[12:49:52] <inflatador> I'm working today in lieu, will start the reimage soon
[12:50:07] <inflatador> Doubt I can help much with the Flink situation, but do let me know if I can learn
[13:02:00] <gehel> inflatador: there is some conversation on Slack: #data-platform-value-stream
[13:02:31] <gehel> Not sure if you can help, or learn something in the process
[13:05:34] <gehel> inflatador: also, could you check and merge T314853? Marco is checking if the data import is running correctly. It might be interesting for you to pair with him and learn a bit about airflow in the process.
[13:06:22] <gehel> Just a suggestion, and only if the reimages are running as they should
[13:08:18] <dcausse> o/
[13:17:09] <dcausse> the problem seems to be codfw only and related to swift
[13:18:05] <dcausse> "Unable to wrap exception of type class org.apache.flink.fs.openstackhadoop.shaded.org.apache.hadoop.fs.swift.exceptions.SwiftAuthenticationFailedException: it has no (String) constructor"
[13:22:13] <godog> dcausse: hi! interesting, I can't follow up ATM though
[13:22:41] <godog> likely in ~1.5h though, I'll read what you find!
[13:23:56] <inflatador> gehel ACK, looking at puppet patch now
[13:25:51] <gehel> inflatador: thanks!
[13:30:17] <gehel> dcausse: thanks a lot! This isn't super urgent (we have a few days before we run out of space) and you should be on vacation. So don't drop everything for this!
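For reference, the container usage godog is worried about can be inspected directly with the swift CLI. A minimal sketch, reusing the auth endpoint and account that appear in dcausse's delete command later in this log (PASS is a placeholder for the real key):

    # Show object count and total bytes for the container
    # (reported as X-Container-Object-Count / X-Container-Bytes-Used).
    swift -A https://thanos-swift.svc.eqiad.wmnet/auth/v1.0 -U wdqs:flink -K PASS \
        stat rdf-streaming-updater-codfw

    # List what is accumulating under the Flink HA path.
    swift -A https://thanos-swift.svc.eqiad.wmnet/auth/v1.0 -U wdqs:flink -K PASS \
        list rdf-streaming-updater-codfw --prefix flink_ha_storage/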
[13:30:35] <gehel> But honestly, I have some doubts that we will be able to fix it without you
[13:31:08] <dcausse> gehel: I won't have access to a computer for 1 week starting from tomorrow
[13:31:31] <gehel> :/
[13:31:53] <inflatador> gehel OK, merged. Do I need to run puppet on all/any hosts listed here? https://wikitech.wikimedia.org/wiki/Analytics/Systems/Airflow
[13:33:32] <gehel> Ideally on 'an-airflow1001.eqiad.wmnet', but it runs automatically every 30' and I don't think that Marco is in that much of a hurry
[13:37:15] <elukey> hey folks
[13:37:19] <inflatador> NP, I ran it on an-airflow1001.eqiad.wmnet
[13:37:33] <inflatador> <o/
[13:37:35] <elukey> o/
[13:37:47] <elukey> are we discussing the flink/thanos issue in here?
[13:37:50] <elukey> or do we prefer slack?
[13:38:03] <dcausse> elukey: o/ here is fine
[13:38:17] <elukey> hey dcausse!
[13:38:42] <elukey> I just added in the task that there seem to be ~18T of checkpoints saved in thanos
[13:38:54] <elukey> for rdf-streaming-updater-codfw+segments
[13:38:58] <dcausse> elukey: yes, I think codfw is misbehaving
[13:39:18] <dcausse> was about to blame the swift client that we still use for flink ha
[13:39:55] <dcausse> it caused us problems in the past and we dropped it in favor of the S3 client
[13:40:06] <dcausse> but we still use it for flink_ha storage
[13:41:23] <elukey> can we drop some of the old checkpoints by any chance?
[13:41:26] <dcausse> will try to stop the job and make some cleanups at least
[13:41:28] <dcausse> yes
[13:41:35] <elukey> super, lemme know if you need any help
[13:41:40] <dcausse> sure, thanks!
[13:52:59] <dcausse> wdqs lag on codfw might start to scream and might cause bots to stop editing
[13:53:40] <dcausse> we should route traffic to eqiad only and update the wikidata max lag detection to only check eqiad
[13:53:55] <dcausse> inflatador: ^ is this something you could take care of?
[13:59:35] <inflatador> dcausse I think so, checking now
[13:59:52] <dcausse> thanks!
[14:09:15] <dcausse> inflatador: https://gerrit.wikimedia.org/r/821753 should stop polling wdqs for the max lag detection
[14:15:31] <dcausse> I mean wdqs@codfw
[14:17:34] <inflatador> dcausse cool, merged that patch. Still working on the depool command, I've reached out in sre, just wanna make sure I don't accidentally depool everything
[14:18:14] <dcausse> inflatador: thanks! lemme know once it's depooled and I'll stop the wdqs job too (stopped the wcqs job already)
[14:30:35] <inflatador> dcausse codfw is depooled from wdqs
[14:30:45] <dcausse> inflatador: thanks!
[14:30:58] <dcausse> stopping the wdqs job
[15:01:15] <gehel> dcausse: huge thanks for taking the time during your vacation!
[15:25:22] <dcausse> starting to clean up the rdf-streaming-updater-codfw swift bucket
[15:38:42] <inflatador> dcausse any idea how long we might need to keep codfw depooled?
[15:39:22] <dcausse> inflatador: I hope we can repool it in a couple of hours if the cleanup is going fast enough
[15:39:58] <inflatador> ACK, just wanted to make sure we don't leave it down too long. If there's a dashboard I can watch, LMK
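The exact depool command isn't quoted in the log; as a sketch of the usual conftool flow, assuming wdqs is exposed as a standard discovery (dnsdisc) record:

    # Check the current state of the wdqs discovery record first.
    sudo confctl --object-type discovery select 'dnsdisc=wdqs' get

    # Drop codfw from the wdqs discovery record so traffic is served by eqiad only.
    sudo confctl --object-type discovery select 'dnsdisc=wdqs,name=codfw' set/pooled=false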
[15:40:07] <dcausse> inflatador: I'm tempted to move forward with T304914 (at least partially). I mean by using the s3 client on codfw, at least
[15:40:08] <stashbot> T304914: Remove the presto client for swift from the flink image - https://phabricator.wikimedia.org/T304914
[15:40:33] <dcausse> but I might need help with this (delete some k8s resources and some configmaps)
[15:42:55] <inflatador> dcausse I have a mtg in about an hour, do you think we could finish before then? Or maybe after that? I know you're on vacation, just wanna be sensitive to that
[15:43:43] <dcausse> the cleanup will take some time, but we could perhaps prepare the k8s namespace in codfw if you have time?
[15:45:04] <dcausse> we'd need to fully undeploy the rdf-streaming-updater deployment in k8s@codfw, but I'm not sure I can do that
[15:48:20] <inflatador> dcausse OK, I'm rescheduling my mtg (lots of ppl cancelled anyway) and will create a Meet shortly
[15:49:38] <dcausse> thanks!
[15:50:43] <inflatador> OK, up at meet.google.com/ngq-nvrq-mir
[15:51:03] <dcausse> joining
[16:34:33] <dcausse> purging swift is quite slow and I'm getting errors like "Error Deleting: rdf-streaming-updater-codfw/flink_ha_storage/default/completedCheckpoint26ba64d3a1ec: ('Connection broken: IncompleteRead(6 bytes read)', IncompleteRead(6 bytes read))"
[16:36:33] <dcausse> I see some space being reclaimed, but the folder I delete with "swift -A https://thanos-swift.svc.eqiad.wmnet/auth/v1.0 -U wdqs:flink -K PASS delete rdf-streaming-updater-codfw --prefix commons/checkpoints/1475a2038f088807f9d695aea3e1c7e3c" still has entries
[16:36:55] <dcausse> godog: in case you know if there's a better way to purge some files ^
[16:47:13] <inflatador> Added some notes on the k8s purge/deploy we just did, feel free to add/change: https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Flink_On_Kubernetes#Kubernetes_operations
[17:02:59] <inflatador> workout, back in ~45
[17:12:14] <dcausse> hm... purging the container like that is not going to work
[17:14:11] <dcausse> I could possibly store the savepoints to hdfs and then completely drop the container, this might perhaps be a lot faster
[17:24:29] <dcausse> but I don't know how to drop a container, so I'm a bit stuck
[17:47:34] <inflatador> back
[17:50:24] <inflatador> dcausse if you still need help LMK. not sure about dropping containers though
[17:51:17] <dcausse> inflatador: thanks, but not sure what to do. I'm going to drop the flink_ha_storage folder first so that I can resume operation, but the mass cleanup will take a while
[17:51:30] <inflatador> ACK
[17:58:55] <dcausse> updated the ticket, will check back a bit later
[18:15:41] <mpham> is slack down for anybody else?
[18:16:04] <cbogen_> yeah I can't load threads
[18:16:30] <inflatador> mpham I tried sending you a slack msg and it was rejected
[18:16:55] <mpham> ok, good to know. yeah, it's not working for me either
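The k8s purge/deploy they paired on is written up at the wikitech link above; as a rough sketch of the usual deployment-charts/helmfile flow (the chart path and environment name are assumptions, not taken from this log):

    # On the deployment host; the path follows the usual helmfile.d layout.
    cd /srv/deployment-charts/helmfile.d/services/rdf-streaming-updater
    helmfile -e codfw destroy   # fully undeploy the release, configmaps included
    helmfile -e codfw apply     # redeploy once the config points at the new storage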
[19:13:43] <dcausse> well... not much progress on the "swift" cleanup
[19:15:04] <dcausse> one option would be to resume the jobs on a new container and hope that there's a command to drop a full container
[19:15:47] <inflatador> we used to have to delete swift containers for customers all the time at my old job
[19:16:04] <inflatador> swiftly was the preferred tool, but it's been a loooong time ( http://gholt.github.io/swiftly/2.06/ )
[19:16:08] <dcausse> everything I see requires deleting all the objects first
[19:16:21] <dcausse> but it's from the swift client
[19:16:43] <dcausse> there might be admin commands that allow bypassing this check
[19:17:20] <inflatador> dcausse yeah, that was/is the problem... swift won't let you delete the container until it's empty
[19:17:36] <inflatador> swiftly and some other tools will do that for you automatically, trying to remember what the best one is
[19:18:42] <dcausse> there are bazillions of files in flink_ha_storage/default/ and not sure how much time it'll take before it's empty...
[19:19:24] <dcausse> I should have added an alert on this...
[19:20:17] <dcausse> I think the best way forward is to put the data in a new container and resume the jobs so that it's "easy" to clean up the bad container later
[19:20:25] <inflatador> swiftly is oooold, probably only works with python2, but it does allow you to delete all, see https://docs.rackspace.com/support/how-to/install-the-swiftly-client-for-cloud-files/ . Also allows concurrent object deletion, but we should check with data persistence before we start hammering the API
[19:20:26] <dcausse> please let me know if you have a better option
[19:21:12] <inflatador> I don't have any better ideas
[19:21:39] <dcausse> ok, going to configure the system to use the "rdf-streaming-updater-codfw-T314835" container then
[19:21:40] <stashbot> T314835: wdqs space usage on thanos-swift - https://phabricator.wikimedia.org/T314835
[19:21:56] <inflatador> I'm guessing that deleting multiple TB of data from swift will probably take a few days unless data persistence knows any backend magic
[19:26:04] <inflatador> Lunch, back in ~30
[20:12:00] <dcausse> sigh... no luck, can't start the job on this new swift container: Service: Amazon S3; Status Code: 400; Error Code: 400 Bad Request; Request ID: txda600f7bcca7429ab42ab-0062f2bef2; S3 Extended Request ID: txda600f7bcca7429ab42ab-0062f2bef2; Proxy: null
[20:15:31] <inflatador> bah
[20:16:42] <dcausse> not sure what's wrong...
[20:17:04] <dcausse> perhaps the S3 compat layer is something that needs to be activated on a per-container basis?
[20:17:44] <inflatador> the only thing I can think of offhand is the path-style vs bucket-style issue we saw with the Elastic stuff: https://wikitech.wikimedia.org/wiki/Search/S3_Plugin_Enable#Path-style_and_bucket-style_access
[20:23:04] <dcausse> yes, it's on here
[20:25:45] <dcausse> back to square 0...
[20:32:16] <inflatador> maybe there's a way to test it outside of k8s?
[20:32:39] <dcausse> I was testing from yarn (the analytics cluster)
[20:33:31] <dcausse> btw testing swift from codfw (search-load2002) I randomly get Container GET failed: https://thanos-swift.discovery.wmnet/v1/AUTH_wdqs/rdf-streaming-updater-codfw?format=json&prefix=flink_ha_storage/default 401 Unauthorized [first 60 chars of response] b'<html><h1>Unauthorized</h1><p>This server could not verify t'
[20:33:33] <dcausse> Failed Transaction ID: tx019b7dab38e944ef80bdd-0062f2c481
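On the path-style question: Flink's S3 filesystems take their endpoint and addressing style from flink-conf.yaml. A sketch of what pointing the job at the swift S3 compat layer might involve (the endpoint is taken from this log; the keys are Flink's documented S3 options, and whether path-style is what thanos-swift expects here is an assumption):

    # Append S3 options to the Flink configuration (sketch).
    cat >> flink-conf.yaml <<'EOF'
    s3.endpoint: https://thanos-swift.discovery.wmnet
    s3.path.style.access: true
    EOF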
[20:35:05] <inflatador> Per conversation w/ gehel, I think we're stable enough if you want to get on with your vacation. Also, FWIW, I have seen problems with swift-proxy on the backend manifest as 401/403s on the frontend
[20:37:48] <gehel> dcausse: go enjoy your vacation! You've done a lot already! We'll do our best to survive until you get back for real!
[20:38:00] <dcausse> ok, I can leave it down for the rest of my vacation, but I'm not sure we have enough retention on the kafka topics
[20:38:28] <dcausse> I can also run the jobs from yarn using the same swift container
[20:39:15] <dcausse> your call
[20:39:55] <gehel> If there is something quick that you can do, please go ahead. But you should really be on vacation.
[20:40:33] <gehel> Worst case, a bit more work next week to reset everything from scratch, but it should not have user impact.
[20:40:43] <dcausse> ok, I'll start them from yarn, will update the ticket with paths that should not be cleaned up
[20:41:51] <gehel> And we need to talk about how you get that day back!
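For the record, restarting the updater from YARN with a savepoint restore would look roughly like this (a sketch: the jar name and savepoint path are placeholders, not taken from this log; -s is Flink's standard restore-from-savepoint flag):

    # Submit the streaming updater to YARN, restoring state from a savepoint
    # stored in the new container.
    flink run -m yarn-cluster \
        -s s3://rdf-streaming-updater-codfw-T314835/wikidata/savepoints/<savepoint-id> \
        streaming-updater.jar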