[06:11:56] DBA, DC-Ops, Operations, ops-eqiad: es1019: reseat IPMI - https://phabricator.wikimedia.org/T243963 (Marostegui) From what I can see this maintenance did not happen yesterday in the end - as the host is still off and its IPMI is unreachable. And as the IPMI is involved, I cannot power this host b...
[06:21:42] DBA: Compress new Wikibase tables - https://phabricator.wikimedia.org/T232446 (Marostegui)
[06:31:40] DBA, Expiring-Watchlist-Items, Community-Tech (Kanban-Q3-2019-20), Core Platform Team Workboards (Clinic Duty Team): Create required table for new Watchlist Expiry feature - https://phabricator.wikimedia.org/T240094 (Marostegui) From what I can see, `watchlist` is indeed private. So I am going to...
[06:38:14] DBA, Patch-For-Review: Remove partitions from revision table - https://phabricator.wikimedia.org/T239453 (Marostegui)
[08:43:08] DBA: Test MariaDB 10.4 in production - https://phabricator.wikimedia.org/T242702 (Marostegui) Replication has been running fine for weeks from 10.1 and also to its 8.0 slave. Replaying production traffic, around 6M selects (from API, special and main traffic), has shown no regressions or apparent crazy query...
[08:43:17] DBA: Test MariaDB 10.4 in production - https://phabricator.wikimedia.org/T242702 (Marostegui) p: Triage→Medium
[08:57:56] DBA, Operations, Puppet, User-jbond: DB: perform rolling restart of mariadb deamons to pick up CA changes - https://phabricator.wikimedia.org/T239791 (Marostegui)
[08:59:55] DBA, Operations, Puppet, User-jbond: DB: perform rolling restart of mariadb daemons to pick up CA changes - https://phabricator.wikimedia.org/T239791 (Marostegui)
[09:05:28] DBA, Operations, Phabricator, Release-Engineering-Team: Upgrade and restart m3 (phabricator) master (db1128) - https://phabricator.wikimedia.org/T244566 (Marostegui)
[09:05:39] DBA, Operations, Phabricator, Release-Engineering-Team: Upgrade and restart m3 (phabricator) master (db1128) - https://phabricator.wikimedia.org/T244566 (Marostegui) p: Triage→Medium
[13:12:58] DBA, Expiring-Watchlist-Items, Community-Tech (Kanban-Q3-2019-20), Core Platform Team Workboards (Clinic Duty Team), Patch-For-Review: Create required table for new Watchlist Expiry feature - https://phabricator.wikimedia.org/T240094 (Marostegui) Can I get a quick check on https://gerrit.wiki...
[13:37:50] Not sure if you agree with my comment on https://gerrit.wikimedia.org/r/c/operations/software/spicerack/+/570161
[13:39:04] I don't see what the relationship is between the aim of the patch and your message. From what I gather, that patch is basically what we'll be using for the DC switchover
[13:39:29] I also don't get https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/570792/ :)
[13:39:35] but shouldn't we be using dbctl to get dbs, not cumin?
[13:39:50] there's stuff not in dbctl, like parsercache
[13:40:09] sure, but there is stuff on dbctl not on cumin
[13:40:21] Like what?
[13:40:30] I think the #1 use case is to "check replication is not lagged"
[13:40:33] I think everything on dbctl is on cumin
[13:40:49] and that will fail if you check under-maintenance hosts
[13:41:03] (we don't care too much about lag on parsercache, for example)
[13:41:17] Sorry, I still don't understand your point
[13:41:23] You mean we should not use cumin?
[13:41:46] not as an absolute thing, but we shouldn't use it for some things
[13:41:58] like "checking if mediawiki dbs are lagged"
[13:42:08] which I believe is the case now
[13:42:21] But it is the easiest way to integrate it with the rest of the switchover automated steps I reckon
[13:42:36] let me give you an example, db1090:3317 is currently lagged
[13:42:44] but I am guessing it is depooled
[13:42:48] correct
[13:42:51] sorry
[13:42:55] db1091
[13:43:01] you get the idea
[13:43:02] sure, it doesn't matter
[13:43:16] if we check lag based on cumin, the check will fail due to db1091
[13:43:33] but it will succeed because we don't care about it at the moment
[13:43:41] if done with dbctl
[13:43:58] again, I am not saying what you should do
[13:44:01] Right, but what if a parsercache (which is not in dbctl) is down?
[13:44:06] I am just pointing to a flaw
[13:44:23] I think both approaches have flaws until we have a unique source of truth
[13:44:27] ok, again, with cumin you cannot do anything
[13:44:29] (and we consider it so)
[13:44:44] if you use, for example, zarcillo, you can edit it out live
[13:44:56] take it out of rotation
[13:45:22] like we do with prometheus inventory
[13:45:48] There are several things here I think
[13:45:57] so my only point is, I don't have a problem with cumin being used for "give me a list of hosts that have mysql"
[13:46:18] but it is a very bad idea for "give me a list of pooled s8 hosts"
[13:46:22] Ideally we should not have any host lagging behind (as we normally stop maintenance a few days before the DC failover), so if something is lagging, I think we should definitely know
[13:46:40] *"give me a list of pooled s8 instances"
[13:46:48] Also, I don't see how easy it would be to integrate zarcillo/dbctl with all the orchestration on spicerack which is already in place
[13:47:13] I don't think the idea is to: "give me a list of pooled s8 instances" but "give me a list of instances" instead
[13:47:27] * volans just back online, reading backlog
[13:47:31] yeah, and that is the core of the issue, cumin knows about hosts, not mysql instances
[13:47:34] volans!
[13:47:43] so, a few things to clarify
[13:48:32] 1) yes that patch is just the adaptation of what's already there with new/changed/refactored puppet roles/classes/defines to achieve the same capability of selecting hosts
[13:48:53] 2) it would be nice to have dbctl support in spicerack, we should ask its maintainer ;)
[13:49:06] 3) dbctl doesn't have FQDN IIRC
[13:49:24] if I remember correctly in the end we decided not to add them, but I might be wrong
[13:49:34] volans: I never use FQDN for dbctl indeed
[13:49:48] again, hosts vs instances
[13:50:04] jynus: I think zarcillo _will be_ the source of truth, but I am not sure if we are there yet (I have found myself forgetting to add new hosts, for instance, when provisioning them).
[13:50:28] I created this: https://phabricator.wikimedia.org/T242571
[13:50:31] but there is no alternative
[13:50:45] cumin/puppet only have hosts, they do not have a list of instances
[13:51:31] jynus: yeah, but how can we integrate it with spicerack at the moment? didn't we have this same issue in the past switchover?
[13:52:02] if I run "cumin A:db-core" I get "a list of hosts with a certain role"
[13:52:27] no difference if there is a mysql running, 0 running or 4 instances
[13:52:51] sure, agree with that
[13:52:52] I just want to make clear that, if that is good enough to work with, that is up to you
[13:53:11] and agree that we should have proper mysql instance support in spicerack
[13:53:22] it's just a matter of finding who can do that, and when
[13:53:25] jynus: I do agree, what I am saying is that I don't know if we are there yet with zarcillo as a source of truth
[13:53:26] I just made you aware of it
[13:53:27] based on all other priorities
[13:53:43] and that prometheus already uses zarcillo as a source of truth
[13:54:10] (which is used for monitoring, and which is a similar use case as what dc failover does with mysql)
[13:54:16] We did have multiinstance in the past switchover, no?
[13:54:23] no idea
[13:54:40] I can see multiinstance hosts being selected
[13:54:51] whether they are ignored on check or not, I have not checked
[13:56:06] also my other point is that if I remember correctly, we already considered the way dc failover did its checks was very hacky the last few times
[13:56:24] and "good enough for now"
[13:56:39] marostegui: and to reply to the 'C:mariadb::config' line
[13:57:02] the point is that the current multiinstance hosts don't set the replication_role parameter, which hence gets the default 'standalone'
[13:57:20] while it should be 'master' or 'slave' for our core prod
[13:57:22] you can keep the cumin inventory and use the mysql.py command line utility instead
[13:57:32] that is another option
[13:58:58] So how did we solve this problem in the past failover?
[13:59:24] well I think we just ignored it
[13:59:35] which is what I am just trying to reevaluate
[13:59:43] if it can help I can probably find the list of hosts matched last time
[13:59:49] I am ok with compromising "this works for now"
[13:59:51] and we can easily see if there are multiinstance involved
[13:59:56] as long as it is a conscious decision
[14:00:11] I hope you understand, even if you don't share, my worries
[14:00:48] but a higher complexity problem like this may require a bit of improvement of the model
[14:00:58] I share them, but I want https://phabricator.wikimedia.org/T242571 also to be fixed before we can trust zarcillo, because I have forgotten to add hosts myself when working on 20 things at the same time
[14:01:12] it is all a chain of things!
[14:01:29] sure, but we may have "forgotten" the multi-instance hosts last time!
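The lag-check concern above (a cumin host list would flag a depooled instance like db1091, while a check driven by pooled entries would not) comes down to checking replication per (fqdn, port) pair rather than per host. A minimal sketch of such an instance-level check, assuming direct connections from a maintenance host with credentials in a local defaults file; the instance list, hostnames and ports are purely illustrative and would in practice come from dbctl or zarcillo, not be hardcoded:

```python
import os
import pymysql.cursors

DEFAULTS_FILE = os.path.expanduser("~/.my.cnf")  # assumed credentials location

# Illustrative only: in practice this list would be built from dbctl/zarcillo.
INSTANCES = [("db1090.eqiad.wmnet", 3317), ("db1098.eqiad.wmnet", 3316)]

def replication_lag(host, port):
    """Return Seconds_Behind_Master for one instance, or None if replication is not running."""
    conn = pymysql.connect(host=host, port=port,
                           read_default_file=DEFAULTS_FILE,
                           cursorclass=pymysql.cursors.DictCursor)
    try:
        with conn.cursor() as cur:
            cur.execute("SHOW SLAVE STATUS")
            status = cur.fetchone()
            return status["Seconds_Behind_Master"] if status else None
    finally:
        conn.close()

for host, port in INSTANCES:
    print(host, port, replication_lag(host, port))
```

On a MariaDB multi-source replica SHOW ALL SLAVES STATUS would be needed instead, but the single-channel form is enough to show the instance-level granularity being discussed.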
[14:01:31] :-D
[14:01:36] hehe
[14:01:49] I am ok if you say "zarcillo" is not ok yet
[14:02:09] but I would focus on dbctl for things that are there already
[14:02:25] and maybe puppet for pc and others
[14:02:32] That's what I was saying, I think zarcillo is the way to go, but precisely, as I found myself forgetting to add stuff, I don't trust it much (because I don't trust myself)
[14:02:35] I don't have a clear good proposal
[14:02:52] So either "this works for now" or we find a way to integrate dbctl+spicerack+cumin
[14:02:54] or something
[14:03:31] I also don't know how much time volans has for this :)
[14:03:46] Cause I guess integrating dbctl there isn't something we can do in 2 days
[14:04:33] I don't think we need a full integration
[14:04:52] if I remember correctly, this is only used to query masters of the other datacenter
[14:05:09] a specific query like that can be done quickly, I can even do it
[14:05:13] marostegui: I'm not dbctl maintainer, cc cdanis :-P
[14:05:20] in fact, we only need it for those hosts that have more than 1 instance (which is present in dbctl)
[14:05:30] so the main uses here are 2
[14:05:32] forget about mysql.py
[14:05:39] 1) set masters RO/RW
[14:05:47] (and check RO/RW)
[14:05:49] just think about the dc steps
[14:06:00] 2) check that the other master is in sync
[14:06:07] exactly what volans is saying
[14:06:12] so s1 master in codfw is in sync with s1 master in eqiad before turning RW
[14:06:37] this is the switchdc use case
[14:06:48] It is very unlikely we'll have a multi-instance master in MW in the near future
[14:06:57] we just need to support that
[14:06:58] *but* the mysql module in spicerack in theory should allow filtering all core dbs by any combination of DC, section and replication_role
[14:07:21] and that's what I think is broken anyway with the current representation
[14:07:33] as we don't support multiinstances
[14:07:34] we ignore it, and we hardcode the "right query", until we have a proper source of truth
[14:07:34] but not strictly needed for this switchover
[14:07:40] not only for selection but also for connecting to them
[14:07:57] or we just use zarcillo because it will work in the general case for that
[14:08:10] and especially for querying masters
[14:08:18] there are several approaches
[14:09:18] I don't know
[14:09:22] so I am not too worried about the dc switchover
[14:09:33] give me 5 and I can audit all usage of the mysql module in spicerack
[14:09:39] I just don't think mysql.py should be used in general
[14:09:43] if it's only the switchdc we can be much more flexible
[14:09:55] jynus: mysql.py?
[14:10:05] and the spicerack library
[14:10:24] spicerack/mysql.py is wrong for the general case, and I was -1ing that
[14:10:39] I am ok for the limited dc failover use case
[14:10:58] Sorry, I was thinking about "our" mysql.py
[14:10:59] XD
[14:10:59] but we should consider it "unfit" and deprecated for general usage
[14:11:15] * volans tempted to parse the JSON from dbctl -s codfw section s1 get and read the master :D
[14:11:18] so I was -1ing the "general usage", I hope that was understood
[14:11:31] jynus: I understood you were -1ing for the dc failover
[14:11:37] (which I think the update was intended for)
[14:11:40] but I understand constraints as long as it is only for the usage of dc failover
[14:11:56] I only thought about it for the dc switchover
[14:12:15] from a quick check the mysql module is used only in switchdc cookbooks, so if we use a different approach there it could be ditched from spicerack for now
[14:12:30] volans: I hope you also understood my thought, and last time we talked you also conceded it was not a great general approach
[14:12:32] until we have a better replacement
[14:13:09] volans: that was my thought, that this was _only_ used for the dc switchover
[14:13:22] yes that mysql module was kinda tech debt from day one, I've always admitted it :)
[14:13:37] so my -1 was in that context
[14:13:55] if this is only used for the dc switchover, I think we are "fine"
[14:14:04] until we have a multiinstance master :)
[14:15:05] dbctl for mw layer and zarcillo should be what we should focus on for general inventory, not puppet
[14:15:12] correct
[14:15:29] I would love to have a way for a host to register itself on zarcillo on the first run or whatever
[14:15:33] and say: hey o/ I am here
[14:15:38] if we made zarcillo HA and made tests to cross reference and discover instances, we should be ok
[14:15:50] we just said the same thing with different words
[14:16:03] and report itself once every X time to say: I have X amount of instances running on X, Y, Z ports
[14:16:09] haha yeah
[14:16:17] I just prefer to work on that rather than maintaining the old crift
[14:16:23] *cruft
[14:16:48] because otherwise you end up with outdated methods like this
[14:17:44] and I believe we have a similar discussion on every dc failover, or at least I remember that
[14:17:52] yep, but also probably not worth investing time on this if it is only used for dc switchover and we won't have a multiinstance master any time soon
[14:18:17] then don't waste time patching!
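Tying together the two switchdc uses listed above (set and check masters RO/RW, and check that the codfw master is in sync with the eqiad master before turning it RW), a rough sketch of what the sync check amounts to at the SQL level, assuming direct connections by FQDN and port. This is not the current cookbook code; the credentials file, the placeholder hostnames and the choice of MASTER_POS_WAIT are illustrative only:

```python
import os
import pymysql.cursors

DEFAULTS_FILE = os.path.expanduser("~/.my.cnf")  # assumed credentials location

def connect(host, port=3306):
    return pymysql.connect(host=host, port=port, read_default_file=DEFAULTS_FILE,
                           cursorclass=pymysql.cursors.DictCursor)

def master_in_sync(old_master, new_master, timeout=30):
    """True if new_master has replicated everything the (already read-only) old master wrote."""
    old, new = connect(*old_master), connect(*new_master)
    try:
        with old.cursor() as cur:
            cur.execute("SHOW MASTER STATUS")
            pos = cur.fetchone()  # current binlog file/position on the old master
        with new.cursor() as cur:
            # MASTER_POS_WAIT returns >= 0 once the position is reached, -1 on timeout,
            # NULL if replication is not running on new_master.
            cur.execute("SELECT MASTER_POS_WAIT(%s, %s, %s) AS r",
                        (pos["File"], pos["Position"], timeout))
            r = cur.fetchone()["r"]
            return r is not None and r >= 0
    finally:
        old.close()
        new.close()

# e.g. master_in_sync((eqiad_s1_master_fqdn, 3306), (codfw_s1_master_fqdn, 3306))
# with the FQDNs taken from whatever inventory ends up being used (dbctl or zarcillo).
```

The RO/RW half of the use case is just SET GLOBAL read_only = 1/0 plus a SELECT @@global.read_only check against each master.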
[14:18:39] unless it is a patch to put a header saying "*** DEPRECATED DONT USE ***"
[14:18:54] the patch was needed because current code doesn't work anymore
[14:18:59] due to puppet refactoring
[14:19:01] :-(
[14:19:37] I can have a look with Reuven in the next days and see if we can find a quick replacement without using cumin/remote and without ssh-ing
[14:19:42] another reason not to use puppet inventory - too unreliable
[14:19:55] making mysql connections directly from the cumin hosts to the targeted instance
[14:20:11] anyway, I just raised and dumped my thoughts, that's all
[14:20:18] removed my -1
[14:20:20] but you have to tell me where to query the hosts
[14:20:30] and instances
[14:20:44] (FQDN and port basically)
[14:20:57] volans: I guess that can be gathered from dbctl
[14:23:26] volans: And from zarcillo on db1115: zarcillo.instances
[14:24:58] mysql.py -h db1115.eqiad.wmnet zarcillo -e "SELECT server, port FROM masters JOIN instances ON instance=name WHERE dc = 'eqiad'"
[14:25:18] ^will need some tuning to exclude pc in some cases, etc.
[14:25:42] I am not pushing on this, just answering the question
[14:26:31] and what would be the canonical way to connect from cuminNNNN to say db1098:3316?
[14:26:50] are you thinking external command?
[14:26:56] or library call?
[14:27:02] if we were not ssh-ing and executing local mysql
[14:27:24] ideally library, but if it's hard to integrate and you have already a tool I guess exec might work too?
[14:27:39] limited to the switchdc use case where the output to parse is supersimple
[14:27:43] in the general case library
[14:27:45] connecting to fqdn:port with root pass there, mysql.py is a mysql wrapper that does that
[14:28:13] as a cmd line, but it can be emulated very quickly in a library
[14:28:53] we have one, but we don't have it deployed on cumin
[14:29:33] you can cat $(which mysql.py) to see it
[14:30:33] but you don't need most of that if you have fqdn
[14:30:52] I can send you a proposal if you want
[14:31:04] a patch
[14:32:00] last thing I want is to make you work more
[14:33:20] if we use dbctl as source we'll have the IP not the FQDN, but that should work too
[14:33:40] we have a resolve function for that
[14:34:41] yeah, it's called DNS :-P
[14:34:48] xD
[14:35:01] no, dbctl strings are not hostnames
[14:35:07] jynus: can you review my comment on https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/570792/ ? I would like to get that out today
[14:35:16] they are aliases that happen to be the same as the hostnames
[14:35:32] dbctl instance db1098:3316 get
[14:35:40] "db1098:3316": {
[14:35:40] "host_ip": "10.64.16.83",
[14:35:40] "note": "",
[14:35:40] "port": 3316,
[14:35:48] all the info we need AFAICT
[14:36:31] marostegui: so how do private tables get filtered on wikireplicas, does it use realm.pp?
[14:36:55] the only thing with dbctl is that it treats es1/es2 as master-slave and it is not that, so not sure what the check would return in that case (I asked cdanis to see if that can be changed anytime soon T239900#5826891)
[14:36:56] T239900: Sync understanding of MediaWiki rdbms 'weight' behaviour with DBAs - https://phabricator.wikimedia.org/T239900
[14:37:04] volans: beware, 10.64.16.83 will fail tls authentication
[14:37:12] jynus: yep, realm.pp generates the replication filters
[14:37:31] ah, did I write that? because it seems quite ingenious!
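As a sketch of the "dbctl as source, then resolve" idea from the exchange above: the instance record already carries host_ip and port, and reverse DNS can turn the IP back into an FQDN so that TLS hostname verification does not fail. This assumes dbctl prints the JSON shown above on stdout and that reverse DNS returns the canonical name; the helper name is hypothetical:

```python
import json
import socket
import subprocess

def instance_address(instance):
    """Resolve a dbctl instance name like 'db1098:3316' to a connectable (fqdn, port) pair.

    Assumption: `dbctl instance <name> get` prints the JSON object shown above
    (keyed by instance name, with host_ip and port) on stdout.
    """
    out = subprocess.run(["dbctl", "instance", instance, "get"],
                         capture_output=True, text=True, check=True).stdout
    data = json.loads(out)[instance]           # {"host_ip": "10.64.16.83", "port": 3316, ...}
    fqdn = socket.gethostbyaddr(data["host_ip"])[0]   # connect by name, not IP, to keep TLS happy
    return fqdn, data["port"]

# e.g. instance_address("db1098:3316") would be expected to yield something
# like ("db1098.eqiad.wmnet", 3316), following the naming seen elsewhere in the log.
```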
[14:37:40] haha
[14:37:47] I don't know who did that :)
[14:38:18] let me double check the code, not that I don't trust you
[14:38:25] jynus: sure sure
[14:38:27] but I don't remember
[14:39:48] marostegui: I see, it doesn't use realm.pp directly, it uses /etc/mysql/private_tables.txt but that is generated by puppet with realm
[14:39:55] so all correct
[14:40:02] :)
[14:40:33] actually no
[14:41:55] ah, yes
[14:42:02] scope.lookupvar("::private_tables").each do |name|
[14:42:11] that is not great but it works
[15:20:46] I confused that with private wikis, which is duplicated there and on a dblist
[15:21:16] yeah, but we don't touch the dblist
[15:23:03] DBA, Expiring-Watchlist-Items, Community-Tech (Kanban-Q3-2019-20), Core Platform Team Workboards (Clinic Duty Team): Create required table for new Watchlist Expiry feature - https://phabricator.wikimedia.org/T240094 (Marostegui) Replication filter in place: ` [15:16:49] marostegui@cumin1001:~$ su...
[15:25:08] DBA: Compress new Wikibase tables - https://phabricator.wikimedia.org/T232446 (Marostegui)
[17:12:44] DBA, DC-Ops, Operations, ops-eqiad: es1019: reseat IPMI - https://phabricator.wikimedia.org/T243963 (Jclark-ctr) performed flea power drain. powered on host
[17:13:23] DBA, DC-Ops, Operations, ops-eqiad: es1019: reseat IPMI - https://phabricator.wikimedia.org/T243963 (Marostegui) Thanks - IPMI is back. I will take it from here. Thank you!
[17:31:41] one thing I am seeing as weird
[17:31:53] is replication control seems to not be using gtids at the moment
[17:32:06] Timed out waiting for replication to reach db1120-bin.735/597321243
[17:32:15] ^that is new
[17:33:02] mmmm, db1120 is a master
[17:33:05] x1 master
[17:34:51] well, yeah, but is gtid not enabled?
[17:35:40] note this is unrelated to issues, as there was no wait locking, but it seemed weird
[17:36:30] DBA, DC-Ops, Operations, ops-eqiad: es1019: reseat IPMI - https://phabricator.wikimedia.org/T243963 (Marostegui) John found this: https://www.dell.com/support/article/es/es/esbsdt1/sln316859/idrac7-idrac8-idrac-unresponsive-or-sluggish-performance?lang=en which is an update from May 2019, so mayb...
[17:37:49] jynus: yeah, weird
[17:37:55] not sure why it uses the old binlog
[17:38:04] like the other method
[17:38:49] I am checking, it only happened occasionally (other times it is GTID-based)
[17:39:09] db1137 pos db1120-bin.734/1011808375
[17:39:17] and db1120-bin.734/1011795835
[17:39:47] sorry, I meant db1127 pos db1120-bin.734/746203511
[17:40:00] only happening on x1?
[17:40:24] Anyways, I have to go
[17:40:27] See you on Monday!
[17:40:43] bye!
[18:36:20] DBA, Expiring-Watchlist-Items, Community-Tech (Kanban-Q3-2019-20), Core Platform Team Workboards (Clinic Duty Team): Create required table for new Watchlist Expiry feature - https://phabricator.wikimedia.org/T240094 (aezell) @Marostegui Thanks for your help with this!
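On the GTID observation above (replication control occasionally waiting on a binlog file/position such as db1120-bin.735/597321243 instead of a GTID position): in MariaDB the two wait primitives are MASTER_POS_WAIT(file, pos, timeout) and MASTER_GTID_WAIT(gtid_pos, timeout). A minimal sketch of the GTID-based variant, for contrast with the file/position wait sketched earlier; whether the tool being discussed uses these exact calls is not confirmed by the log, and the credentials file is an assumption:

```python
import os
import pymysql

DEFAULTS_FILE = os.path.expanduser("~/.my.cnf")  # assumed credentials location

def wait_for_gtid(master, replica, timeout=60):
    """Wait until `replica` has applied everything currently in `master`'s binlog, using GTIDs."""
    m = pymysql.connect(host=master[0], port=master[1], read_default_file=DEFAULTS_FILE)
    r = pymysql.connect(host=replica[0], port=replica[1], read_default_file=DEFAULTS_FILE)
    try:
        with m.cursor() as cur:
            cur.execute("SELECT @@gtid_binlog_pos")
            gtid_pos = cur.fetchone()[0]  # MariaDB GTID in domain-server_id-seq_no form
        with r.cursor() as cur:
            # MASTER_GTID_WAIT returns 0 once the position is reached, -1 on timeout.
            cur.execute("SELECT MASTER_GTID_WAIT(%s, %s)", (gtid_pos, timeout))
            return cur.fetchone()[0] == 0
    finally:
        m.close()
        r.close()
```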