[07:18:13] volans: I am trying the prepare cookbook, and I am getting: https://phabricator.wikimedia.org/P74236
[07:20:16] And for what it's worth, I am able to connect just fine from cumin1002 CLI
[07:29:12] Same from cumin2002
[07:36:47] The credentials at ./etc/spicerack/cookbooks/sre.switchdc.databases.yaml look good
[08:23:11] <_joe_> marostegui: did you look at the cert?
[08:23:59] <_joe_> i suspect somehow our cert bundle doesn't get read by python? but it's strange
[08:24:02] _joe_: I was trying with test-s4 and that one looks like it works fine (there were some issues, but not related to the connection, and that was also expected)
[08:24:40] _joe_: We've not changed the cert since the last switchover, but I wasn't present so I don't know if this was something that also happened and was worked around
[08:27:01] <_joe_> marostegui: what's that server role?
[08:27:15] _joe_: The databases affected?
[08:27:21] <_joe_> yes
[08:27:36] mariadb::core
[08:27:44] They are all the same
[08:28:43] <_joe_> the cert has been generated last april
[08:28:59] so yes, no change since the last switch
[08:30:45] <_joe_> where is the config of the db server located?
[08:30:58] In /etc/my.cnf
[08:46:25] <_joe_> marostegui: sorry, do you know if it was connecting to port 3306 or 3307?
[08:46:56] <_joe_> nevermind, I can connect to both
[08:47:01] <_joe_> 🤔
[08:47:14] :-/
[08:47:39] I can try with a different section to see if it is something specific for that host, given that test-s4 worked
[08:49:35] <_joe_> marostegui: I found the problem
[08:49:42] what is it?
[08:49:46] <_joe_> it's in spicerack, newer pymysql version
[08:50:11] <_joe_> conn = pymysql.Connection(host='db2196.codfw.wmnet', db='test', ssl={"ca": "/etc/ssl/certs/Puppet_Internal_CA.pem"}, port=3307) gives the error you get
[08:50:33] <_joe_> conn = pymysql.Connection(host='db2196.codfw.wmnet', db='test', ssl={"ca": "/etc/ssl/certs/Puppet_Internal_CA.pem"}, port=3307) does not
[08:50:37] <_joe_> err sorry
[08:50:52] <_joe_> conn = pymysql.Connection(host='db2196.codfw.wmnet', db='test', ssl_ca= "/etc/ssl/certs/Puppet_Internal_CA.pem", port=3307) does not
[08:51:46] <_joe_> but this means that any cookbooks for dbs would fail...
[08:52:15] It took me a while to find the difference
[08:52:25] between both lines
[08:54:21] hello
[08:54:29] what did I do? :D
[08:54:29] ciao volans
[08:55:11] _joe_: I wonder how test-s4 worked then
[08:55:18] They do have a different role
[08:55:21] so maybe that's why
[08:55:45] we set "ssl": {"ca": PUPPET_CA_PATH},
[08:55:50] in pymysql's config
[08:55:56] <_joe_> volans: yes that doesn't work anymore
[08:56:28] cumin hosts have bullseye pymysql (1.0.2-2)
[08:56:44] that's from 2021...
[08:57:18] <_joe_> volans: yes and pydoc seems to indicate it's still supported
[08:57:37] I wonder if with the refactor of the classes I made a mistake, checking both code and git history, give me a minute to understand
[08:58:31] <_joe_> volans: conn = pymysql.Connection(host='db2196.codfw.wmnet', db='test', ssl={"ssl_ca": "/etc/ssl/certs/Puppet_Internal_CA.pem"}, port=3307) works
[08:59:14] ok so we can change ca with ssl_ca, but I need to understand if we can just change it or we need to do it conditionally only in some cases
[08:59:16] <_joe_> volans: the manual says:
[08:59:19] <_joe_> :param ssl:
[08:59:19] <_joe_> | A dict of arguments similar to mysql_ssl_set()'s parameters.
[09:01:06] <_joe_> which... shouldn't need the ssl_ prefix...
[09:01:53] marostegui: by any chance was the mysql/mariadb client updated on the cumin hosts?
[09:02:10] volans: I haven't done so
[09:02:19] <_joe_> ahhh no volans mine is a red herring I think
[09:02:27] I am not sure whether moritzm did or they simply got shipped via updates
[09:02:42] I see wmf-mariadb105-client
[09:03:09] Yeah, we never install our packaged version there, as the client is good enough as it is shipped
[09:03:41] that's on the cumin hosts
[09:03:45] yep
[09:03:56] I don't think it could be the client, as test-s4 worked fine
[09:04:01] k
[09:04:02] With the cookbook
[09:04:02] <_joe_> volans: sorry, mine was a red herring
[09:04:17] ok, ignoring your backlog then :)
[09:04:22] <_joe_> basically if you use the ssl_ca= syntax, you also need to declare ssl_verify_cert=True
[09:04:30] Note that test-s4 hosts have a different role: core_test
[09:04:31] <_joe_> 🤦
[09:05:00] <_joe_> marostegui: I have the problem verifying the cert even from pure python cli
[09:05:06] <_joe_> just doing pymysql.connect
[09:05:15] that's so weird then, cause there was no connection error there
[09:05:15] <_joe_> so the problem seems to be indeed with the server cert
[09:05:25] <_joe_> oh it's not a connection error
[09:05:35] <_joe_> the client can't verify the identity of the server cert
[09:05:42] <_joe_> somehow
[09:05:48] _joe_: Yeah, but why does it work with those hosts, which run same version and same OS
[09:06:03] <_joe_> which hosts?
[09:06:11] test-s4 hosts, so db1176 and db2230
[09:06:59] <_joe_> ok I think I found the problem
[09:07:24] <_joe_> looks like the server has a cert that is signed with a ca that is in /etc/ssl/certs/wmf-ca-certificates.crt but not the puppet CA
[09:07:29] no, nothing has been upgraded related to mariadb in the recent weeks
[09:10:24] <_joe_> ok yeah connecting to the server directly
[09:10:36] <_joe_> $ openssl s_client -starttls mysql -connect db2196.codfw.wmnet:3306
[09:10:49] <_joe_> i:CN = Puppet CA: palladium.eqiad.wmnet
[09:10:57] <_joe_> v:NotBefore: Feb 26 16:12:28 2024 GMT
[09:11:47] palladium? wasn't that decommissioned years ago?
[09:11:48] <_joe_> now
[09:11:52] <_joe_> $ openssl x509 -text -in /etc/mysql/ssl/cert.pem
[09:12:00] <_joe_> Not Before: Apr 22 10:33:31 2024 GMT
[09:12:20] <_joe_> CN=Wikimedia_Internal_Root_CA
[09:12:44] <_joe_> so the server is definitely running with a cert that's not the one on disk
[09:12:51] it never got reloaded?
[09:12:54] <_joe_> marostegui: that's the old CA
[09:13:11] marostegui@db2196:~$ uptime
[09:13:12] 09:13:06 up 383 days
[09:13:33] <_joe_> yeah
[09:13:43] <_joe_> it needs to be restarted to pick up the new PKI
[09:14:00] didn't mysql get the reload ssl stuff since a certain version?
[09:14:01] <_joe_> I see db2230 has the correct cert signed by the new puppet CA
[09:14:03] I can do a switchover now easily, or we can go ahead and try with also s6
[09:14:18] without the need of a restart
[09:14:20] <_joe_> marostegui: sorry but is test-s4 in production?
[09:14:26] <_joe_> volans: no idea
[09:14:27] _joe_: "production"
[09:14:38] <_joe_> marostegui: I mean do you need a switchover?
[09:14:53] <_joe_> in any case, I'll leave it to you, we need to reload the certs, basically
[09:14:56] <_joe_> then you should be fine
[09:16:07] volans: we have https://mariadb.com/kb/en/flush/#flush-ssl
[09:16:10] but I've never used it
[09:16:43] _joe_: I need a switchover if I have to stop mariadb on x1 master
[09:17:02] Let me try to run the cookbook on s6 then and see if it works as expected
[09:17:05] <_joe_> the alternative is
[09:17:27] we can also hotpatch spicerack
[09:17:34] if that helps
[09:17:38] <_joe_> we change the PUPPET_CA_PATH in spicerack to use the more inclusive path
[09:17:53] and then fix the thing properly
[09:18:06] The switchover shouldn't take too long though, so as you guys prefer
[09:18:09] <_joe_> which is /etc/ssl/certs/wmf-ca-certificates.crt
[09:18:11] For now, ok to run this on test-s6?
[09:18:14] Sorry, s6
[09:18:15] <_joe_> yes
[09:18:23] will it have the same issue?
[09:18:27] <_joe_> no
[09:18:29] volans: it shouldn't
[09:18:33] <_joe_> it's just one server afaict
[09:18:43] Oh, wait
[09:18:43] try and let's see
[09:18:48] root@db2214:~# w
[09:18:48] 09:18:38 up 345 days,
[09:18:49] maybe? XD
[09:18:55] because if you changed the CA everywhere
[09:19:05] it might not be the only one not reloaded
[09:19:28] | ssl_ca | /etc/ssl/certs/wmf-ca-certificates.crt |
[09:19:31] let me try to connect to all masters and check
[09:28:18] marostegui: I get ssl failures for db2165, db2179, db2192, db2196 and db2204 among all the masters of core sections across both DCs
[09:28:32] sorry db2192 is ok, ignore it, bad paste
[09:29:08] volans: Also in eqiad?
[09:29:16] I'd be surprised if you got a failure for db1173 for instance
[09:29:26] As that was rebooted like a couple of weeks ago
[09:29:26] no it worked
[09:29:30] ah, ok
[09:29:43] i can check also the replicas if you want
[09:29:47] eqiad ones should be a lot better as they've been rebooted lately
[09:29:47] to get a better picture
[09:29:50] volans: No, it is fine
[09:29:54] yeah eqiad is all good
[09:29:59] es6 and es7 should also be fine
[09:30:04] for codfw as well
[09:30:15] So with all those masters, maybe it is easier then to patch the code
[09:30:18] I check all CORE_SECTIONS
[09:30:23] so ('s6', 's5', 's2', 's7', 's3', 's8', 's4', 's1', 'x1', 'es6', 'es7')
[09:30:39] And we'll get them fixed as soon as codfw is depooled (they need to get rebooted for kernel and mariadb upgrade)
[09:31:09] volans: Right, so es hosts are fine, so I am tempted to run the cookbook for es6 and es7 to confirm it is all good
[09:31:38] as you want, in the meanwhile I'll check if a single value for the CA works on both sets of hosts
[09:31:46] or we need to hardcode some logic to pick the right one
[09:31:55] volans: Thanks, it should be just temporary to bypass this situation
[09:31:57] I will run for es6
[09:32:00] k
[09:32:14] I can disable writes on es6 to be fully sure that we don't break anything
[09:32:17] I will do that
[09:34:03] I will push https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/1128351/
[09:34:33] sure, extra safety
[09:35:31] deploying
[09:41:02] marostegui: confirmed with "/etc/ssl/certs/wmf-ca-certificates.crt" I can connect to all, so I'll just hotpatch spicerack for you to use that one
[09:41:11] volans: thank you!
[09:41:36] scap still running
[09:45:02] hotpatch applied, retested, I can connect to all masters
[09:45:11] nice!
[09:45:37] volans: Let's try with es6 as writes will be disabled, and if it all goes, I will go for x1 again then
[09:45:49] SGTM
[10:01:08] [11:00:54] <+logmsgbot> !log marostegui@deploy2002 Finished scap sync-world: Backport for [[gerrit:1128351|db-production.php: Disable writes on es6 (T388626)]] (duration: 23m 25s)
[10:01:09] T388626: Prepare databases circular replication for the DC switchover - https://phabricator.wikimedia.org/T388626
[10:01:10] going for es6
[10:01:58] wow it took its time
[10:01:59] (scap)
[10:02:45] worked like a charm the script
[10:02:48] https://orchestrator.wikimedia.org/web/cluster/alias/es6
[10:03:14] glad it did :)
[10:04:42] Going to revert es6 writes then
[10:04:47] And yeah, scap took quite long
[10:06:13] volans: es7 was done fine too, going for x1, the problematic one!
[10:06:21] fingers crossed
[10:07:11] volans: all good!
[10:07:17] _joe_ volans thanks a lot for the help!
[10:07:51] <3 sorry for the trouble, I'm sending a fix to spicerack, we can probably replace everywhere the puppet path with the bundle cert, but I need to test it first
[10:07:59] So what is/was the one line summary regarding the certs?
[10:08:35] <_joe_> jynus: cert and ca were changed, mysql not restarted/ssl not reloaded
[10:08:53] but in general, using /etc/ssl/certs/wmf-ca-certificates.crt is safer as it includes both the new PKI CA and the old puppet CA
[10:09:21] <_joe_> volans: indeed
[10:11:11] volans: I think also then the TLS error was not caught because only recently mysql proto was used for connecting to mysql (before it was ssh + local socket)
[10:11:43] in any case the checks did the right thing, which is abort the operation :-D
[10:19:46] didn't there use to be a test mode without actual action, or am I confusing it with something else?
[10:22:10] volans: We will probably have the same issue with sre.switchdc.databases.finalize ?
[10:22:59] with the hotpatch it should work anyway
[10:23:17] I'm preparing the fix but I'm not sure if we will deploy this week before the switchdc to not alter what they have already tested
[10:23:25] there are other changes in master pending release
[10:23:43] volans: We will run the finalize on thursday, so it is fine to leave it like this till thursday
[10:23:45] I plan to have the fix merged by today so that if we release you have the fix, if we don't release you have the hotpatch
[10:24:00] in both cases you should be covered
[10:24:08] es6 back to being writable 10:21:58 Finished scap sync-world: Backport for [[gerrit:1128356|Revert "db-production.php: Disable writes on es6"]] (duration: 14m 41s)
[10:37:28] could someone test a regular ssh connection to dbprov2005. Just checking it is not only me and icinga?
[10:37:40] (that I cannot connect)
[10:38:16] but I see the host up and responding from the DRAC
[10:39:03] ah, no need, it says so: "The Integrated NIC 1 Port 1 network link is down."
[10:39:22] both from cumin and from me locally it hangs
[10:39:38] yeah, network is down, host is "fine"
[10:40:17] I will try a graceful reboot to make sure it is not a one 1 thing then file a task for faulty cable, card or switch port/config
[10:40:24] *one time
[10:40:28] thanks, volans
[10:41:07] the fact that I saw a disk getting evicted and united around the same time does look suspicious
[10:41:34] ignore also that part, it is from 1 year ago
[10:58:11] I filed T389052
[10:58:11] T389052: hw troubleshooting: network link loss of dbprov2005 - https://phabricator.wikimedia.org/T389052
[11:22:37] Amir1: I see some SVG changes going past from you - are you aware of https://en.wikipedia.org/wiki/Wikipedia:SVG_help#Rendering_issue ?
[11:23:11] I've checked the relevant thumbnail containers and they're working so it's not a repeat of T383053 (which is where it was reported to me)
[11:23:12] T383053: Container dbs for wikipedia-commons-local-thumb.f8 AWOL in codfw due to corruption - https://phabricator.wikimedia.org/T383053
[11:24:27] Emperor: funnily enough, this is one day before deployment of my change so it can't be caused by it (I just deployed the svg change)
[11:24:52] until an hour ago, all svg files were bypassing thumbnail steps
[11:25:33] 'cos trying to visit the thumbs of e.g. https://en.wikipedia.org/wiki/File:%E0%A6%8F%E0%A6%B6%E0%A7%80%E0%A6%AF%E0%A6%BC_%E0%A6%AE%E0%A6%BE%E0%A6%B8_%E0%A6%B0%E0%A7%8C%E0%A6%AA%E0%A7%8D%E0%A6%AF%E0%A6%AA%E0%A6%A6%E0%A6%95.svg is returning me an internal server error
[11:26:21] swift list --prefix c/c6/এশীয়_মাস_রৌপ্যপদক.svg wikipedia-commons-local-thumb.c6 finds me small thumbs (25,38,50,60,106,150,159) but nothing larger
[11:27:01] that's definitely not related to my change then, I do stuff in mw only
[11:27:11] Hm, maybe thumbor is sad then
[11:27:26] I was about to say, maybe a thumbor issue or an encoding issue?
[11:27:31] it's mojibake for me
[11:27:46] https://usercontent.irccloud-cdn.com/file/7vMU2CqG/image.png
[11:28:21] I tested on these svg files and both seem to work fine (and follow steps now) https://test.wikipedia.org/wiki/Inactive_Bot_Statistics
[11:28:51] inkscape opens the .svg for me OK
[11:29:37] Amir1: you sure that's not a font issue? filename works OK for me in my browser (but my terminal lacks the necessary font)
[11:29:55] yeah it's probably a font issue
[11:29:58] but meh :D
[11:30:10] It wouldn't crack the top 200 list of issues I need to handle
[11:31:01] I'll make a ticket and tag thumbor. Is there some wikimarkup to make the little "reported as Phab: XXX" on village pump?
[11:31:16] yeah, I think Template:Tracked
[11:31:41] https://en.wikipedia.org/wiki/Template:Tracked
[11:34:30] TY
[11:35:44] hnowlan: you OK to eyeball T389060 please?
[11:35:45] T389060: Thumbnail failures on some SVGs - https://phabricator.wikimedia.org/T389060
[11:41:41] amusingly, whilst my terminal lacks the relevant font, it does C&P without loss, so that ticket makes it look like my terminal can do all the characters OK
[18:22:30] Emperor: sorry, I am OOO today - looks like something a bit deeper than just thumbor. requests aren't even hitting thumbor
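
Note on the certificate check discussed between 09:19 and 09:45 above: the exchange boils down to "try every master against one CA file and see which ones fail to verify". A minimal sketch of that idea follows. It is not the spicerack code; find_unverifiable and pymysql_connect are hypothetical names introduced here, credentials are deliberately left out (an assumption: they would come from the usual client config), and the only pymysql call used is pymysql.connect with the ssl={"ca": ...} form quoted in the log. The two CA paths are the ones mentioned in the conversation.

```python
from typing import Callable, Dict, Iterable

# Both paths are quoted in the conversation above.
PUPPET_CA = "/etc/ssl/certs/Puppet_Internal_CA.pem"
WMF_BUNDLE = "/etc/ssl/certs/wmf-ca-certificates.crt"


def pymysql_connect(host: str, port: int, ca_path: str) -> None:
    """Open and close one TLS connection; raises if connecting or verifying fails.

    Hypothetical helper. Credentials are omitted here on purpose.
    """
    import pymysql  # imported lazily so the helper below works without it

    conn = pymysql.connect(host=host, port=port, ssl={"ca": ca_path})
    conn.close()


def find_unverifiable(
    hosts: Iterable[str],
    ca_path: str,
    connect: Callable[[str, int, str], None] = pymysql_connect,
    port: int = 3306,
) -> Dict[str, str]:
    """Return {host: error message} for every host that fails with ca_path."""
    failures: Dict[str, str] = {}
    for host in hosts:
        try:
            connect(host, port, ca_path)
        except Exception as exc:  # broad on purpose: report, do not abort the sweep
            failures[host] = str(exc)
    return failures
```

Running find_unverifiable once with PUPPET_CA and once with WMF_BUNDLE over the same host list would show whether a single CA value covers every host, which is the question raised at 09:31. The connect argument is injectable so the sweep logic can be exercised without touching a real database.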