[05:08:36] 10DBA: Productionize es20[26-34] and es10[26-34] - https://phabricator.wikimedia.org/T261717 (10Marostegui) [05:08:51] 10DBA: Productionize es20[26-34] and es10[26-34] - https://phabricator.wikimedia.org/T261717 (10Marostegui) p:05Triage→03Medium [05:22:02] 10DBA: Productionize es20[26-34] and es10[26-34] - https://phabricator.wikimedia.org/T261717 (10Marostegui) [05:22:04] 10DBA, 10DC-Ops, 10Operations, 10ops-codfw: (Need By: 2020-08-31) rack/setup/install es20[26-34].codfw.wmnet - https://phabricator.wikimedia.org/T260373 (10Marostegui) [05:22:09] 10DBA: Productionize es20[26-34] and es10[26-34] - https://phabricator.wikimedia.org/T261717 (10Marostegui) [05:22:11] 10DBA, 10DC-Ops, 10Operations, 10ops-eqiad: (Need By: 2020-08-31) rack/setup/install es10[26-34].eqiad.wmnet - https://phabricator.wikimedia.org/T260370 (10Marostegui) [05:30:30] 10DBA, 10DC-Ops, 10Operations, 10ops-codfw: (Need By: 2020-08-31) rack/setup/install es20[26-34].codfw.wmnet - https://phabricator.wikimedia.org/T260373 (10Marostegui) @Papaul any chances we can place es2034 into A4 or A8 instead of A6? [05:30:49] 10DBA: Productionize es20[26-34] and es10[26-34] - https://phabricator.wikimedia.org/T261717 (10Marostegui) [05:58:42] 10DBA: es1017 IPMI issues - https://phabricator.wikimedia.org/T259218 (10Marostegui) 05Open→03Declined Let's not act on this task as this host is about to get replaced with T257785 [06:16:56] 10DBA, 10User-Kormat: Enable DB replication codfw -> eqiad before the switchover and some other checks - https://phabricator.wikimedia.org/T243373 (10Marostegui) [06:26:14] 10DBA, 10User-Kormat: Enable DB replication codfw -> eqiad before the switchover and some other checks - https://phabricator.wikimedia.org/T243373 (10Marostegui) I have checked that the event scheduler is enabled everywhere within codfw. Same for query killers. Query killer wasn't present on db2137:3314 db2137... [06:26:25] 10DBA, 10User-Kormat: Enable DB replication codfw -> eqiad before the switchover and some other checks - https://phabricator.wikimedia.org/T243373 (10Marostegui) [06:27:02] I will take car of GSOC, but will wait till tomorrow [06:27:51] ok - thanks [06:47:07] 10DBA, 10Phabricator, 10Release-Engineering-Team-TODO, 10serviceops, and 3 others: Improve privilege separation for phabricator's config files and mysql credentials - https://phabricator.wikimedia.org/T146055 (10Marostegui) 05Open→03Resolved a:03jcrespo [07:05:28] Last dump for es4 at codfw (es2022.codfw.wmnet) taken on 2020-09-01 00:00:01 is 1 GB, but previous one was 611 GB, a change of 99.8% (?) [07:06:21] I guess it failed? 
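For illustration, here is a minimal sketch of the kind of size comparison behind the "1 GB vs 611 GB, a change of 99.8%" alert above. The directory layout under /srv/backups/dumps and the 30% shrink threshold are assumptions for the example, not the actual WMF backup-checking code.

```python
#!/usr/bin/env python3
# Sketch: compare the newest dump of a section against the previous one and
# warn when the size shrinks suspiciously. Paths and threshold are assumptions.
from pathlib import Path

BACKUP_ROOT = Path("/srv/backups/dumps/latest")  # hypothetical location

def dir_size(path: Path) -> int:
    """Total size in bytes of all files under `path`."""
    return sum(f.stat().st_size for f in path.rglob("*") if f.is_file())

def check_section(section: str, max_shrink: float = 0.30) -> None:
    # Expect directories named dump.<section>.<timestamp>, newest sorting last.
    dumps = sorted(BACKUP_ROOT.glob(f"dump.{section}.*"))
    if len(dumps) < 2:
        print(f"{section}: not enough dumps to compare")
        return
    prev, last = dir_size(dumps[-2]), dir_size(dumps[-1])
    change = (prev - last) / prev if prev else 0.0
    status = "WARNING" if change > max_shrink else "OK"
    print(f"{status} {section}: last {last / 2**30:.1f} GB, "
          f"previous {prev / 2**30:.1f} GB, change {change:.1%}")

if __name__ == "__main__":
    check_section("es4")
```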
[07:06:47] no, when it shows there it means it "succeeded" and there were no fatal errors [07:07:07] ah ok [07:07:11] don't know [07:07:32] from what I can see that host has no issues [07:08:03] 2020-09-01 00:02:28 [ERROR] - Error: DB: tkwiki Could not create output file /srv/backups/dumps/ongoing/dump.es4.2020-09-01--00-00-01/tkwiki.blobs_cluster26-schema.sql.gz (24) [07:08:04] 2020-09-01 00:02:28 [ERROR] - Error: DB: thwikisource Could not create output file /srv/backups/dumps/ongoing/dump.es4.2020-09-01--00-00-01/thwikisource.blobs_cluster26-schema.sql.gz (24) [07:08:20] so it did fail [07:08:25] not sure why the errors on the log were not detected [07:08:55] 2020-09-01 00:02:28 [ERROR] - Could not read data from betawikiversity.blobs_cluster26: Lost connection to MySQL server during query [07:09:02] 2020-09-01 00:02:28 [ERROR] - Error dumping table (bewikibooks.blobs_cluster26) data: MySQL server has gone away [07:09:26] network glitch? [07:09:31] the host has no mysql errors on the logs [07:09:54] +---------------+---------+ [07:09:54] | Uptime | 6472340 | [07:09:54] +---------------+---------+ [07:09:54] 1 row in set (0.002 sec) [07:10:10] es5 has the same errors [07:10:17] so unlikely both happened at the same time [07:10:25] maybe if network failed for backup2002 [07:11:52] nothing on logs on backup2002 or network errors either [07:12:00] as in drops or stuff like that [07:20:02] my backup process has a logical error [07:20:15] it checks for output errors and metadata file errors, but not log errors [08:01:47] jynus: re: https://github.com/wikimedia/puppet/blob/production/modules/profile/manifests/mariadb/backup/snapshot.pp#L9 - does this profile need the server (aka wmf-mariadb10X) package installed, or would the client (aka wmf-mariadb10X-client) package be sufficient? [08:03:36] it needs the server [08:03:43] ah. why so? [08:03:54] for several reasons- one because xtrabackup it is a mariadb server itself [08:04:14] another is beacuse in case of panic, we will want to have a local server available if needed [08:04:29] ahh, huh. i see. [08:04:39] xtrabackup is basically a server that "crashes" the data [08:04:44] in a clean way [08:05:13] we could I guess create a mariabackup specific package [08:05:39] but not worth it at the moment if I still need the server anyway in case there is a non-trivial emergency [08:05:47] yeah understood [08:05:48] e.g. last time we needed to recover a table [08:06:05] and it was faster to recover the whole thing and extract the table [08:06:13] (with a server up) [08:06:19] than trying to handle text files [08:06:53] in other words xtrabackup is a "server utility" not a client utility [08:07:04] ok cool. thanks for the explanation :) [08:09:16] kormat: my patch (which I didn't intended to merge) [08:09:26] fails on test_WMFMariaDB.py not on the new code [08:09:32] could you have it a quick look? [08:09:41] sure, let me have a look. [08:09:48] I don't need it merged, I hopefully will merge it on the new repo [08:10:03] but maybe some refactoring was missing on testing? [08:10:25] mm - what's the CR number? [08:10:29] (or I may not be seeing it, jenkings stdout is getting more and more complex) [08:10:35] https://integration.wikimedia.org/ci/job/wmfmariadbpy-tox-docker/31/console [08:10:42] let me find the gerrit # [08:10:56] https://gerrit.wikimedia.org/r/c/operations/software/wmfmariadbpy/+/623525/ [08:11:40] huh. i can confirm what you're saying. that's odd. i'll poke. 
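A rough sketch of the missing check jynus describes above: in addition to process exit status and the metadata file, scan the dumper log for "[ERROR]" lines and mark the backup as failed if any are present. The log filename and the boolean return convention are assumptions for the example.

```python
# Sketch only: fail a backup if its log contains [ERROR] lines, the case
# that went undetected above. Filenames are assumptions.
from pathlib import Path
from typing import List

def log_errors(dump_dir: Path, log_name: str = "dump_log.txt") -> List[str]:
    """Return all error lines found in the dump log, if the log exists."""
    log_file = dump_dir / log_name
    if not log_file.exists():
        return [f"missing log file: {log_file}"]
    return [line.rstrip() for line in log_file.open(errors="replace")
            if "[ERROR]" in line]

def backup_is_good(dump_dir: Path) -> bool:
    errors = log_errors(dump_dir)
    for err in errors:
        print(f"FAILED {dump_dir.name}: {err}")
    return not errors

# Example:
# backup_is_good(Path("/srv/backups/dumps/ongoing/dump.es4.2020-09-01--00-00-01"))
```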
[08:11:45] I just wanted to upload it as I found a big logic error on backups [08:12:06] and indeed, it fails the same way locally for me [08:12:18] i'll send a fix. [08:12:20] not a big deal for merging, ok, just a heads up [08:12:50] I may hot patch production [08:13:18] as I need to retry es backups asap [08:13:51] or maybe it was not rebased? [08:15:25] ah. a new version of black was released. [08:15:36] but I don't understand, if it was a mistake (even if it was one I made), how it was not cought before? [08:15:40] uff [08:16:01] i'll pin to a specific black version [08:16:30] well, if the fix is easy I can send a patch too [08:17:39] jynus: https://gerrit.wikimedia.org/r/c/operations/software/wmfmariadbpy/+/623526 [08:17:40] but the uff was for, how annoying that is going to be in the future? [08:18:12] jynus: i've pinned black and isort to specific versions. so this should not re-occur. [08:19:09] ping me when merged so I can rebase [08:20:08] jynus: it's merged now [08:20:40] BTW this exist https://gerrit.wikimedia.org/r/admin/repos/operations/software/wmfbackups [08:20:54] but I have the same problem than you did, I cannot push history there yet [08:21:27] ahh, cool [08:21:38] jynus: fortunately akosiaris is a gerrit admin ;) [08:21:45] I asked to own it to both of us [08:21:50] to make it* [08:22:02] I am? [08:22:05] oh yes, I am [08:22:09] akosiaris: :D [08:22:18] ldap never lies [08:22:23] (except when it does) [08:23:40] kormat: jynus: added both of you in https://gerrit.wikimedia.org/r/admin/groups/39566e9568c6908f909070828b6bd9ebb3c21375,members as owners [08:23:57] akosiaris: i knew we kept you around for something! <3 [08:24:05] ehm, I meant that group is marked as owners of the repo, I think that should do it [08:24:15] kormat: YES! I have a purpose now! [08:24:18] :D [08:24:23] * akosiaris relieved [08:24:29] as you should be [08:24:38] kormat: I've added a comment to the formatter task wrt the above [08:24:42] just FYI [08:25:06] volans: just seen, ack. [08:27:30] what's the permission I need to "create new commit objects"? [08:27:59] jynus: do you get a specific git error when you try to push there? [08:28:09] [remote rejected] master -> master (prohibited by Gerrit: update for creating new commit object not permitted) [08:28:21] remote: Contact an administrator to fix the permissions [08:28:38] remote: use a SHA1 visible to you, or get update permission on the ref [08:29:02] ah. might be just the 'push' permission [08:30:09] https://gerrit-review.googlesource.com/Documentation/error-prohibited-by-gerrit.html [08:30:24] I'm missing context, what are you trying to do? [08:30:36] volans: populate a new repo with existing code [08:30:43] kormat: push AND forge committer [08:30:50] then yes, you need both temporarily [08:30:54] and then you can remove them [08:31:06] we usually leave only push tags, all the rest should go through CRs [08:31:06] I will do exactly that [08:32:07] done [08:32:25] will delete permissions now, as I can edit them (both of us) we will not be a blocker [08:33:48] kormat: I may need some help to clean up wmfmariadbpy and the backups one, but I will leave you alone for today (not a priority) [08:35:31] If I retarget a patch with the same change-id but on a different repo, would gerrit be inteligent enough? 
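Referring back to the black/isort breakage above: the actual fix was pinning the formatter versions in the repo's CI configuration. As a complementary, purely illustrative sketch, a CI step could also assert that the installed formatters match the pins before running them, so an upstream release cannot silently change formatting again. The version numbers below are placeholders, not the real pins.

```python
# Hypothetical guard against formatter version drift. Placeholder versions.
import sys
from importlib.metadata import PackageNotFoundError, version  # Python 3.8+

PINNED = {
    "black": "19.10b0",   # placeholder pin
    "isort": "4.3.21",    # placeholder pin
}

def main() -> int:
    bad = []
    for pkg, want in PINNED.items():
        try:
            got = version(pkg)
        except PackageNotFoundError:
            bad.append(f"{pkg}: not installed")
            continue
        if got != want:
            bad.append(f"{pkg}: expected {want}, found {got}")
    if bad:
        print("formatter version drift:", *bad, sep="\n  ")
        return 1
    return 0

if __name__ == "__main__":
    sys.exit(main())
```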
[08:39:03] nah, it creates a new patch with the same patch id [08:57:14] I am retrying backups with the new codebase [08:57:39] will monitor for errors, although with the fix, the errors shoudl be monitored automatically [08:59:15] cool [08:59:39] kormat: what's a word when you code a nice check routine, but then forget to call it? [08:59:48] I was silly [09:00:28] being human :) [09:00:55] we want to have fresh backups by dc failover [09:01:09] I may start a batch of new snapshots now just in case [09:01:32] planing for a worse possible scenario [09:04:13] note that doesn't mean checks were not being done, only a specific set where the backup pocess completes succesfully but returned errors on log was failing to get detected [09:04:30] plus the size check on nagios detected it nicely [09:06:56] the other issue is s4 backups, which have been ongoing for 6 hours [09:08:29] 115G on the image table (after text compression) [09:09:52] 200GB on master [09:19:12] 10DBA, 10User-Kormat: Create testing environment for db automation - https://phabricator.wikimedia.org/T256602 (10Marostegui) @Kormat can this task be considered complete? The testing environment is already done, or there're bits pending? [09:21:39] 10DBA, 10User-Kormat: Create testing environment for db automation - https://phabricator.wikimedia.org/T256602 (10Kormat) I was thinking that it could do with a bit of documentation, and right now we can't test things like transferpy in it (because those puppet roles are disabled in our pontoon env). I'd like... [09:22:26] 10DBA, 10User-Kormat: Create testing environment for db automation - https://phabricator.wikimedia.org/T256602 (10Marostegui) Sounds good, thanks! [09:39:11] I'm running benchmarks on a new type of hw we got for swift (i.e. 24x disks instead of 12x), thoughts/tips on what/how else I could stress-test ? task is https://phabricator.wikimedia.org/T261633 [09:39:28] volans: wmfmariadbpy-common removed from both cumin hosts. [09:39:31] seems to be performing as expected from the benchmarks afaics so far [09:40:14] godog: which raid does it have? [09:40:52] marostegui: swraid1 for the OS drives (ssd) but otherwise no raid [09:41:04] welll, raid0/jbod from the controller's perspective [09:41:13] * marostegui cries with sw raid [09:41:14] but each disk by itself [09:41:55] lol, IME it isn't that bad for most cases [09:42:07] I have had terrible experiences with it [09:42:29] godog: I use sysbench as a general stuff [09:42:42] it has things for cpu, hd, mysql [09:43:23] jynus: thank you, is there a task or somewhere to get started with sysbench ? [09:43:39] let me find it [09:44:37] godog: I did lots of stress testing on labsdb1011 when it crashed, but I cannot find the task, it might give you ideas [09:44:49] trying to find it at the moment [09:45:04] thanks! appreciate it [09:45:32] godog: this is from google, but I think it is interesting enough https://www.howtoforge.com/how-to-benchmark-your-system-cpu-file-io-mysql-with-sysbench [09:48:33] sysbench should be no debian and you can do several things with it [09:48:40] *on not no [09:50:36] godog: this is it https://phabricator.wikimedia.org/T247787 you have the stress tests within the comments, hope it helps [09:50:58] godog: and if you ever need more stress, we've got plenty to share. ;) [09:53:01] haha kormat ! same here [09:53:17] thanks marostegui and jynus, it does indeed help! 
I'll let this fio run and then try those too [10:33:41] kormat: not sure right now, and not your change, but I am not sure maria-backup is needed on a client [10:33:58] jynus: ack [10:35:26] I will have to check why it was added, maybe it is a leftover when backups where done in a different way [10:35:54] in theory, xtrabackup is only needed on source backups and dbprovs [10:36:47] i think the code was copy&pasted, because the old comment implies that it's installing the server version, [10:36:57] but the actual code installs the client version [10:37:06] could be [10:37:17] `require_package('wmf-mariadb101-client') # xtrabackup only available on wmf-mariadb101 server package` [10:37:22] if you want to dig it, feel free, otherwise I can check other day [10:37:27] it is ok to merge as is [10:37:37] as it is not a new bug, if it is one [10:37:47] let's leave it to another day [10:37:50] we can fix it on a separate patch if it is real [10:38:29] in any case, mariadb-backup (upstream package) is never used [10:38:39] as that would be for 10.3 [10:38:51] we use the one on 10.4 from the wmf package [10:39:14] but maybe there is some generic utility or something that coudl be useful on cumin?0 [10:39:48] the package contains 2 binaries, `mariabackup` and `mbstream` [10:39:58] then almost sure it is a mistake [10:40:30] it does sound that way, yeah [10:41:13] maybe [10:41:23] I installed the client packages on dbprov first [10:41:29] and saw it missing [10:41:48] then I realized I actually needed the server package to get the same version (10.1/10.4) [10:42:03] but didn't revert the client packages modifications [10:42:05] that would fit [10:42:26] i'll add a TODO in my current CR, to remind us later [10:42:34] my fault [10:42:39] most likely [10:43:04] I remember it took me several weeks to find the best way to architecture the snapshotting [10:43:12] I didn't have it clear [10:43:20] i can imagine. it's not a simple thing [10:43:42] as the dumps are more like a pull thing, but snapshot are more like a push action [10:45:22] anyway, it is a minor issue "unused staff installed", not a big deal [10:48:22] jynus: backup to confirm, backup sources do not have production grants, right? [10:48:32] *just to confirm [10:49:40] as in, mw? [10:49:56] they do have it, they were copied from production ones [10:50:14] they only have on top the backup grants [10:50:40] but indeed that is not ideal because if recovered not into mw, those have to be removed [10:50:48] jynus: excellent, I just want to be sure we are relatively ready in case we need some extra power and we can place the backup sources (hopefully not needed) [10:50:59] yes, preciselly that was the idea [10:51:15] Yeah, I didn't remember whether we removed those grants or what we did in the end [10:51:19] however I can check that they are in a working state, as they are never used [10:52:03] yeah, just 10.192.% really [10:52:15] if you have time to double check, that'd be good [10:52:29] I will after lunch, which I do now early [10:52:34] question [10:52:43] ta [10:53:04] would you have fresh how to add a new host to mw, beacuse I would not have it? [10:53:17] yep [10:53:20] it would need both a mw and a dbctl patch? 
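For the grants double-check discussed above ("just 10.192.% really"), a minimal sketch of what that verification could look like: list the accounts on a backup source whose host pattern matches the codfw private range, to confirm the production grants copied from the mw-facing hosts are still present. The credentials file, default port, and the exact host pattern are assumptions for the example.

```python
# Sketch: list accounts on a backup source matching the 10.192.% range.
# Connection details are assumptions; multiinstance ports would differ.
import pymysql

def production_accounts(host: str, pattern: str = "10.192.%") -> list:
    conn = pymysql.connect(host=host, read_default_file="~/.my.cnf")
    try:
        with conn.cursor() as cur:
            cur.execute(
                "SELECT User, Host FROM mysql.user WHERE Host LIKE %s",
                (pattern,))
            return list(cur.fetchall())
    finally:
        conn.close()

for user, host in production_accounts("db2099.codfw.wmnet"):
    print(f"{user}@{host}")
```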
[10:53:26] no, just dbctl [10:53:28] ok [10:53:31] and a puppet patch [10:53:37] well, puppet sure [10:53:38] I can prepare that patch already if needed and not merge it [10:53:50] as long as you have it clear no issue [10:54:00] I haven't done one of those, ever [10:54:00] yep, puppet+dbctl is all we need [10:54:06] and query killers and such [10:54:33] talking about query killers [10:54:49] we may need to update the wmf-pt-kill package to support multiinstance [10:54:57] yes, it is already tracked [10:55:02] not sure how involving that would be- just a systemd patch [10:55:06] or more [10:55:14] https://phabricator.wikimedia.org/T260511 [10:55:19] I asked kormat to take a look in Q2 or so [10:55:35] maybe, and I am being too ambitous here [10:55:57] we could start researching if the mw query killers could be replaced with that [10:56:24] yeah, that is too ambitous for now, but it can be done in the future [10:56:36] note I said research :-D [10:56:39] Given how badly we saw pt-kill working with some type of statements... [10:56:45] Let's keep it simple for now [10:57:08] he could maybe had a first look at that and pt-heartbeat [10:57:10] Still waiting for percona to answer.... [10:57:31] I think he already mentioned he didn't like current status, so we could trick him into looking at it :-P [10:57:39] we'll see :) [10:58:02] we should also ask if sobanski is good with pythong :-) [10:58:07] *python [10:58:23] or perl in this case [10:59:15] es backups are finishing now [11:02:27] jynus: Not really great with either :( [11:03:41] But I do some simple Python stuff for daily tasks and haven't touched Perl in ages, so that's the current status. [11:04:38] "Not really great" is probably 100x better than me with my "I should be fired because of my python" skill level [11:12:04] You would be surprised ;) [11:28:34] 10DBA, 10Operations, 10Patch-For-Review, 10User-Kormat, 10User-jbond: Refactor mariadb puppet code - https://phabricator.wikimedia.org/T256972 (10Kormat) a:03Kormat [12:35:44] 10DBA, 10DC-Ops, 10Operations, 10ops-codfw: (Need By: 2020-08-31) rack/setup/install es20[26-34].codfw.wmnet - https://phabricator.wikimedia.org/T260373 (10Papaul) @Marostegui the only chance placing it in A4 is to use the 10G port since A4 is a 10G switch. In A8 I need to check to see if i have available... [12:36:35] 10DBA, 10DC-Ops, 10Operations, 10ops-codfw: (Need By: 2020-08-31) rack/setup/install es20[26-34].codfw.wmnet - https://phabricator.wikimedia.org/T260373 (10Marostegui) Thanks, let's see if there are available spaces on A8, if not, let's leave it on A6. [12:40:55] kormat: ack, thx [12:48:05] marostegui: is it known that db1145 and db2099 are not replicating? they're both mariadb::dbstore_multiinstance. is this backups related? [12:48:18] yep, I did an extra backup today [12:48:22] I just acked it [12:48:42] one other server (s3) may lag today [12:48:46] ah cool. thanks :) [12:49:30] that is the check that should be lag_critical => 7200s, but we need more refactoring [12:49:37] as usual 0:-) [12:49:41] hehe [12:50:30] anyway, if something happens, better having a 1h full backup ready [13:09:28] hey guys, let's organize a bit who will do what during the switchover, if needed [13:09:43] I can take care of changing LB weights [13:09:55] We need to monitor, icinga, tendril and logstash too in general [13:10:23] I can do LB weights + tendril monitoring for lag/qps [13:14:35] i can watch icinga, and maybe logstash [13:15:08] excellent [13:15:12] thanks [13:15:39] (i can definitely _watch_ logstash. 
and maybe provide useful data by doing so :) [13:17:39] I take grafana [13:17:53] thanks [13:18:06] will also be answering other teams questions so you both can be the first responders [13:18:14] sounds good [13:18:22] but will bail out on problems [13:18:37] not from you, from not being directly involved [13:24:18] doing the checklist: "Make absolutely sure that parsercache replication is working from the active to the passive DC. This is important" [13:24:48] What? [13:24:53] on the wiki [13:25:09] That was done a week ago [13:25:12] is parsercache eqiad -> codfw replication active? [13:25:21] just ticking the list :-) [13:25:27] and joe asked [13:26:18] as a pilot I hope you were accustomed to trivial questions :-D [13:26:39] (in a checklist) [13:27:03] and remember I broke it last time [13:31:05] jynus: there's a new script added to operations/software/dbtools called `check-master-heartbeat.sh`. it makes it easy to check: https://phabricator.wikimedia.org/P12428 [13:31:30] I see [13:31:36] thanks [13:31:53] we could integrate maybe that with replica-tree for more info [13:32:10] replication tree doesn't support cycles [13:32:44] * kormat nods [13:33:21] was semisync option checked on codfw, now that I think (aside from gtid) [13:34:11] yes [13:34:16] cool, thanks [13:52:15] after the DC is switched and RW enabled, let's be aware of RO messages on logstash, just in case [13:53:13] I have a one liner for manually check the masters in codfw, to make sure they are writtable [13:53:25] the cookbook does it too [13:53:34] use a different method :-P [13:53:34] volans: ah cool [13:53:39] :) [13:53:50] I am putting grafana mysql master s1 codfw on my tab [13:54:04] volans: I will query zarcillo.masters essentially :) [13:54:05] the overal traffic on other [13:54:21] marostegui: https://gerrit.wikimedia.org/r/plugins/gitiles/operations/software/spicerack/+/refs/heads/master/spicerack/mysql_legacy.py#190 [13:54:45] volans: <3 grazie [13:54:51] de nada [13:55:39] parser cache, too [14:18:16] MariaDB Replica Lag: m2 CRITICAL slave_sql_lag Replication lag: 32691172.34 seconds [14:18:19] scratches head [14:18:25] maybe puppetization issue? [14:18:54] m2 wasn't touched, so that's weird, to be checked later [14:19:20] something like maybe icinga things that the primary host is on codfw and pt-heartbeat doesn't work or something [14:22:13] checking db2080, as it had some prometheus errors [14:26:44] kormat: so the RO check is relatively new and this is the first time we've switched DCs since it was installed, so it might require some tweaking [14:26:55] ok [14:28:16] it doesn't have to run on icinga [14:28:20] it is configured on the hosts [14:28:51] command[check_mariadb_read_only_s3]=db-check-health --port=3306 --icinga --check_read_only=true --process [14:29:03] or maybe it does? [14:29:06] I don't know [14:29:09] but it changed now [14:29:20] -command[check_mariadb_read_only_s3]=db-check-health --port=3306 --icinga --check_read_only=false --process [14:29:21] \ No newline at end of file [14:29:23] +command[check_mariadb_read_only_s3]=db-check-health --port=3306 --icinga --check_read_only=true --process [14:29:24] \ No newline at end of file [14:29:33] the logic is determined by `$is_on_primary_dc = ($mw_primary == $::site)` [14:29:53] $mw_primary is `mediawiki::state('primary_dc')` [14:29:55] yep, that is ok, is the lag to run puppet what is an issue [14:30:05] so the call from icinga is always the same and jus teh file in /etc/nagios/nrpe.d/ changes? [14:30:12] volans: yes. 
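As a sketch of the manual cross-check mentioned above (alongside the spicerack cookbook and the zarcillo.masters query): ask each codfw core master whether it is writable, i.e. @@global.read_only is 0, after the switchover. Hosts are taken from the command line here rather than from zarcillo, and the credentials file is an assumption.

```python
# Sketch: report whether each given host is writable after the DC switch.
# Usage: check_writable.py db2xxx.codfw.wmnet db2yyy.codfw.wmnet ...
import sys
import pymysql

def is_read_only(host: str) -> bool:
    conn = pymysql.connect(host=host, read_default_file="~/.my.cnf")
    try:
        with conn.cursor() as cur:
            cur.execute("SELECT @@global.read_only")
            (value,) = cur.fetchone()
            return bool(value)
    finally:
        conn.close()

for host in sys.argv[1:]:
    state = "read-only (BAD)" if is_read_only(host) else "writable (ok)"
    print(f"{host}: {state}")
```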
[14:30:22] I don't really know, but it took a lot to change [14:31:16] we can check that later, I don't think that is broken, as in user impact [14:31:26] and it is changing correctly now [14:33:37] m2 and m3 may need review on its logic [14:33:54] as we have really no way to note what is the active master [14:34:12] not sure what is the logic- if it is mw master, that will be wrong, as those stayed on eqiad [14:34:51] also the core-test [14:35:00] ahh [14:35:10] i remember noting that last week.. and then forgetting about it [14:35:27] yeah, I said "it has alerts disabled, so we will see what we do" [14:35:36] when it reaches to that [14:35:45] normally it should go to read only too, i guess? [14:35:54] but right now core test is actually a misc host, staying on eqiad [14:36:34] aside from that there is no outstanding alert, right? [14:37:49] es1021 has a read-only alert [14:37:55] let me see [14:38:00] the other 4 hosts that were alerting have self-resolved, somehow [14:38:13] that somehow is a bit worrying tbh [14:38:19] puppet [14:38:29] volans: yeah it's not making sense to me [14:39:02] so db1103 got the change at the run at 14:33:41 [14:39:09] previous run was at 14:11:02 [14:39:23] i'm taking a short break [14:39:25] mybe etcd key was cached? [14:39:28] are you sure you run them in the right hosts? [14:39:45] mmmh weird [14:39:51] I just ran it on es1021 [14:39:54] RO ended earlier than that [14:39:59] and chacked it to the right value [14:40:03] *changed it [14:40:17] https://puppetboard.wikimedia.org/node/db1103.eqiad.wmnet [14:40:21] it got changed 3 times [14:40:25] uff [14:40:47] ok, let me run it again [14:41:23] nah, it is like that now on next run (good) [14:41:53] oh, no [14:41:55] it changed back [14:42:09] ommand[check_mariadb_read_only_es4]=db-check-health --port=3306 --icinga --check_read_only=false --process [14:42:12] so first change was --datacenter eqiad -> codfw [14:42:13] on eqiad [14:42:27] is the value of `mediawiki::state('primary_dc')` stable? [14:42:30] that function is returning arbitrary values [14:42:34] is my suspicion [14:42:40] second one was a revert [14:42:59] third one back again [14:43:01] -command[check_mariadb_read_only_es4]=db-check-health --port=3306 --icinga --check_read_only=true --process [14:43:02] \ No newline at end of file [14:43:04] +command[check_mariadb_read_only_es4]=db-check-health --port=3306 --icinga --check_read_only=false --process [14:43:05] \ No newline at end of file [14:43:26] so has done eqiad -> codfw; codfw -> eqiad; eqiad -> codfw [14:43:46] it is changing on every run now [14:43:47] I have fixed db2078:3322's lag [14:43:54] it was just heartbeat messing up [14:43:54] marostegui: what was it? [14:45:01] volans: https://puppetboard.wikimedia.org/node/es1021.eqiad.wmnet :-O [14:45:26] last 3 runs are unchanged :) [14:45:28] at leaast [14:45:54] emm not sure how random() is nice, even if last 2 are right :-D [14:46:16] eh [14:46:17] could there be some caching or something affecting this? [14:46:32] I know you are not involved on etcd as much [14:46:38] I am looking for explanations [14:52:00] so, with that being fixed and db2078:3322 fixed, there's nothing else pending for us to check, right? [14:52:13] That's what I have noted during the switch [14:52:33] load seems stable, and I have only needed to make an small adjustment on s8 [14:52:38] I am downtiming db1077 should be seen now [14:53:40] what's the issue with db1077? 
I missed it with all the stuff flying around [14:54:38] db1077 is a core test host [14:54:45] yep [14:54:51] it is anomolously in read only = 0 [14:54:56] ah ok [14:54:56] yeah [14:55:20] there is not really a proper procedure for it, but when I setup the alerts I said "it doesn't alert, we will find a proper precedure later" [14:55:37] so minor issue, what configures its read only? [14:55:44] probably manually on puppet? [14:55:49] I will downtime it for now [14:55:55] yeah, I think we put it on read_only off for alex to test otrs [14:56:01] yes [14:56:09] it can be just ON again, it is not a big deal, it is not used really [14:56:18] normally it would proably be on [14:56:27] but it needs off at the moment for otrs [14:56:34] will puppetize it better [14:56:36] it is still being taken by alex? [14:56:46] I think so, otrs didn't happen [14:56:52] ok, cool [14:56:55] and he was away on vac [14:57:00] at least AFAIK [14:57:38] marostegui: db2113 is still showing on icinga as having huge lag [14:58:01] same issue as the other you fixed? [14:58:10] kormat: Could be the same as db2078, checking [14:58:15] thanks for the heads up [14:58:28] np, sorry, i thought it was the host you'd already fixed [14:58:32] that was.. intense. [14:58:42] kormat: no, i fixed db2078:3322 [14:58:46] thanks for bringing up the inconsistency [14:58:51] we found teh bad confd [14:59:03] volans: great :) it was the only thing that made any sense [14:59:11] well, when kormat was fixing it I thought it was just puppet + icinga being puppet + icinga [14:59:20] but when I saw it reverting I got like: I don't understand [14:59:37] kormat: should be fixed now, I will force an icinga run [15:00:17] marostegui: is it showing the time since the last failover maybe? [15:00:25] re: misc [15:00:30] 10DBA, 10DC-Ops, 10Operations, 10ops-codfw: (Need By: 2020-08-31) rack/setup/install es20[26-34].codfw.wmnet - https://phabricator.wikimedia.org/T260373 (10Papaul) The only chance to have it in A8 is if heze is decom before I get the server onsite. [15:00:48] I wonder why [15:00:58] jynus: no, it was just expecting the heartbeat from an old codfw master that is no longer on m2 [15:01:08] interesting [15:01:10] but why now? [15:01:26] (not actual question to you) [15:02:02] did you just remove old rows from the table? [15:02:17] yep [15:02:28] marostegui: i've re-run puppet on all the db master nodes, and forced icinga refreshes for everything that previously alerted for read-only issues. i'm going to start removing the downtimes now, because anything else that goes wrong we should know about it. [15:02:37] +1 [15:02:39] kormat: thank you very much [15:03:10] volans: for the bad config, does it make sense (and is it viable) to alarm if the files are inconsistent across hosts? [15:03:11] things to have better next time: cumin aliases that i can depend on to only contain relevant hosts [15:08:55] :-0 https://icinga.wikimedia.org/cgi-bin/icinga/status.cgi?host=all&type=detail&servicestatustypes=1 [15:09:02] kormat: what was missing? you added them recently :D [15:09:12] 200 pending db checks? 
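The db2078:3322 / db2113 fix above is the classic stale-heartbeat failure mode: a leftover row in heartbeat.heartbeat written by a master that is no longer in the replication chain makes a lag check that picks the wrong row report an absurd figure (like the 32691172-second alert earlier). A sketch of how to spot such rows, using only the stock pt-heartbeat columns ts and server_id; connection details and the one-hour staleness threshold are assumptions.

```python
# Sketch: list per-server_id heartbeat timestamps so stale rows stand out.
from datetime import datetime, timezone
import pymysql

def heartbeat_rows(host: str, port: int = 3306):
    conn = pymysql.connect(host=host, port=port,
                           read_default_file="~/.my.cnf")
    try:
        with conn.cursor() as cur:
            cur.execute("SELECT server_id, ts FROM heartbeat.heartbeat")
            return cur.fetchall()
    finally:
        conn.close()

def report(host: str, port: int = 3306) -> None:
    now = datetime.now(timezone.utc)
    for server_id, ts in heartbeat_rows(host, port):
        parsed = datetime.fromisoformat(str(ts)).replace(tzinfo=timezone.utc)
        age = (now - parsed).total_seconds()
        flag = "  <-- stale, candidate for deletion" if age > 3600 else ""
        print(f"server_id={server_id} ts={ts} age={age:.0f}s{flag}")

# e.g. report("db2078.codfw.wmnet", 3322)
```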
[15:09:22] volans: i'm not sufficiently confident in them under fire [15:09:31] sobanski: so I think that confd should fail in those cases, if the unit fails we catch it with aa generic systemd check we have [15:09:43] but the unit was happily running and just logging the failure [15:10:16] so yes, we could add some additional check for confd specifically, and/or better understand it's failure mode [15:10:22] jynus: i suspect that these alerts were redefined by the dc-switchover, and icinga now has thousands of new checks to run [15:10:25] and is taking its sweet time [15:10:48] uff, but even crazy icinga normally doesn't work like that [15:10:53] lol [15:10:55] it reuses the ckeck [15:11:00] if it has the same name [15:11:03] * volans gets reminded of crazy Ivan [15:11:53] and no parameters should change for non-primary hosts [15:12:11] only thing changing should be the read only for primaries [15:12:22] this is weirder than usual [15:13:17] 10DBA, 10DC-Ops, 10Operations, 10ops-codfw: (Need By: 2020-08-31) rack/setup/install es20[26-34].codfw.wmnet - https://phabricator.wikimedia.org/T260373 (10Marostegui) Let's go for A6 then :) Thanks for checking! [15:13:33] jynus: the parameters changed from `--datacenter=eqiad` to `--datacenter=codfw` for https://puppetboard.wikimedia.org/report/db1075.eqiad.wmnet/35e5434f2ad526d351c937c4b5fda98388dafb3c at least [15:13:38] ah [15:13:39] true [15:13:55] ok, but still, it is a parameter for npre config, right, not icinga new check? [15:14:08] for the host config, if you know what I mean [15:14:16] "run the check now with this options" [15:14:34] the nrpe call should be identical (?) [15:16:17] i don't know the architecture of nagios. it could be it connects to nrped, and asks for a list of checks, and the list contains a checksum of the options or something [15:16:46] I knew it lacked the dynamicity I needed (to check and update in real time check) [15:17:12] but this is even more disapointing (delete and readd all alerts for a parameter change) [15:17:29] are we sure of this? [15:17:30] we could workaround some of this by making the check a bit more stateful [15:17:42] maybe the name depends on teh same primary? [15:18:02] wait, did icinga get swapped over too? [15:18:06] no [15:18:09] kormat: nop [15:18:13] ok :) [15:18:17] and they run both the checks [15:18:18] no, the name I know it doesn't because I explicitly do that in the most reasonable option [15:18:18] all the time [15:18:37] it has names like _mariadb_read_only etc [15:19:13] or maybe _mariadb_
_read_only etc [15:19:35] could the mw_primary on puppet caused it somehow? [15:21:06] db1115 [15:21:06] MariaDB sustained replica lag [15:21:08] ha....tendril [15:21:29] poor tendril [15:21:42] I got an explanation [15:21:54] #page changes due to dc primary [15:22:02] that causing to be considered new checks [15:22:20] it changes its "more internal properties" (alert group) [15:23:10] sorry for notification for those that have that as keyword [15:27:15] running puppet on all db nodes, in batches of 10 per DC. [15:27:32] thanks [15:40:20] not a single db connection error in the last hour? [15:40:55] I see lock errors from 15 minutes ago [15:41:03] yeah, query yes [15:41:10] but connection was what I feared most [15:41:25] we have that e.g. when a db host is saturated [15:41:43] hosts are performing very well [15:41:49] that is very nice [15:41:57] I only had to reduce weight on one wikidata slave [15:41:57] you and kormat did a wonderful job [15:42:09] earlier in the switchover, other than that it is all good [15:42:09] on preparing the dbs [15:42:36] tweaks will be needed, like usual on eqiad too [15:42:50] but it seems very nice at the moment [15:44:19] and this is at almost peak time for enwiki/commons [16:16:32] 10DBA, 10SRE-tools, 10conftool, 10serviceops, and 2 others: Alerting spam and wrong state of primary dc source info on databases while switching dc from eqiad -> codfw - https://phabricator.wikimedia.org/T261767 (10jcrespo) [16:17:07] ^I created this, but feel free to alter it, I didn't know much what to say on the description [16:45:52] The other ones I got from today were to create a DB CPU saturation dashboard for codfw, investigate the s8 response time spike and create cumin aliases ahead of the next switch [16:46:02] Do all / some / none of them make sense as tickets? [16:46:19] I think all make sens to me [16:46:40] although probably with different priority [16:47:30] I'll go ahead and create them, I'll need kormat to provide details on the last one as I don't fully understand it yet [16:48:17] maybe note them down and ask for feedback tomorrow [16:48:33] Sounds good [16:48:56] e.g. not aware of s8 issue, but if it is something like "ah, this is the dispatcher starting" [16:49:09] it could be longer to create the task than resolve it [16:50:33] personally I am a bit worried of creating too many tasks if they are not practicaly actionable (e.g. low priority "that will never be worked on"), so I would ask for input [16:51:38] there is 200 tasks open https://phabricator.wikimedia.org/project/view/1060/ [16:51:50] some going back several years [16:52:15] (and those don't even include backups) [16:52:23] Agreed. I generally err on the side of "create not to forget and resolve during triage" but the time for that is when I know what I'm talking about. [16:52:48] oh, no issue with that, but maybe let's ask the others tomorrow [16:54:43] mostly because I was aware of the one I created, but not the other issues [17:46:13] 10DBA, 10SRE-tools, 10conftool, 10serviceops, and 2 others: Alerting spam and wrong state of primary dc source info on databases while switching dc from eqiad -> codfw - https://phabricator.wikimedia.org/T261767 (10Volans) The context of the outdated info was confd stuck on one of the puppetmaster, so when... [21:45:07] 10DBA, 10Platform Engineering: Fix remaining page records with page_latest == 0 in database. 
- https://phabricator.wikimedia.org/T261797 (10holger.knust) [21:45:41] 10DBA, 10Platform Engineering: Fix remaining page records with page_latest == 0 in database. - https://phabricator.wikimedia.org/T261797 (10holger.knust)
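For the task above, a minimal sketch of how the affected rows could be enumerated per wiki before fixing them (the fix itself, repointing page_latest at the page's latest revision, would normally be done by a MediaWiki maintenance script). The host name and credentials are placeholders.

```python
# Sketch: list page rows with page_latest = 0 on one wiki. Host/creds assumed.
import pymysql

def broken_pages(host: str, wiki_db: str):
    conn = pymysql.connect(host=host, database=wiki_db,
                           read_default_file="~/.my.cnf")
    try:
        with conn.cursor() as cur:
            cur.execute(
                "SELECT page_id, page_namespace, page_title "
                "FROM page WHERE page_latest = 0")
            return cur.fetchall()
    finally:
        conn.close()

for page_id, ns, title in broken_pages("db2xxx.codfw.wmnet", "enwiki"):
    text = title.decode() if isinstance(title, bytes) else title
    print(page_id, ns, text)
```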