[07:11:36] 10DBA: dbproxy1005 reports database failover - https://phabricator.wikimedia.org/T207901 (10jcrespo) The host is not up and running, it says: `db1073,0,0,0,0,,0,0,0,,0,,0,0,0,0,DOWN` You are lucky no one uses dbproxy1005 at the moment - otherwise it would have gone read-only.
[07:20:55] 10DBA, 10cloud-services-team: dbproxy1005 reports database failover - https://phabricator.wikimedia.org/T207901 (10jcrespo) Network likely went down at 19:23 https://grafana.wikimedia.org/dashboard/db/mysql?orgId=1&var-dc=eqiad%20prometheus%2Fops&var-server=db1073&var-port=9104&from=1540407778981&to=15404104438...
[07:52:05] 10DBA, 10cloud-services-team: dbproxy1005 reports database failover - https://phabricator.wikimedia.org/T207901 (10Banyek) Yep, that means it solved the question @Bstorm mentioned. @jcrespo I mean db1073 //itself// is up and running, but the service is down, that's why I mentioned this seems to be a transient error
[07:57:19] 10DBA, 10cloud-services-team: dbproxy1005 reports database failover - https://phabricator.wikimedia.org/T207901 (10jcrespo) Please reload the proxy and work with @Bstorm or whoever may help to identify next steps.
[08:00:00] 10DBA, 10cloud-services-team: dbproxy1005 reports database failover - https://phabricator.wikimedia.org/T207901 (10Banyek) What do you think, can T207881 be related? I don't really think so, but the timestamps correlate. (Which may not mean anything, but better to mention than not)
[08:01:23] 10DBA, 10cloud-services-team: dbproxy1005 reports database failover - https://phabricator.wikimedia.org/T207901 (10Banyek) After reloading HAProxy it reports correctly: ```mariadb,db1073,0,0,0,0,,0,0,0,,0,,0,0,0,0,UP```
[08:02:26] 10DBA, 10cloud-services-team: dbproxy1005 reports database failover - https://phabricator.wikimedia.org/T207901 (10jcrespo) T207881 is mediawiki, db1072 is m5, nothing to do.
[08:08:13] 10DBA, 10cloud-services-team: dbproxy1005 reports database failover - https://phabricator.wikimedia.org/T207901 (10Banyek) Sorry, I was not clear, I like your idea that possibly there was a small network issue, and I was thinking whether there could be a network issue which could have affected the connection between mw ho...
[08:29:33] 10DBA, 10User-Banyek: dbstore2002 tables compression status check - https://phabricator.wikimedia.org/T204930 (10Banyek) The command used for compressing tables is ``` mysql -BN -S /run/mysqld/mysqld.s1.sock -e "SELECT table_schema, table_name FROM information_Schema.tables WHERE engine='INNODB' and row_...
[08:30:05] 10DBA, 10cloud-services-team: dbproxy1005 reports database failover - https://phabricator.wikimedia.org/T207901 (10Banyek) 05Open>03Resolved a:03Banyek
[08:30:50] 10DBA, 10MediaWiki-Database, 10MW-1.33-notes (1.33.0-wmf.2; 2018-10-30), 10Patch-For-Review, 10Wikimedia-production-error: excessive "lock wait timeout exceeded " error rate after deploying 1.33.0-wmf.1 to group1 - https://phabricator.wikimedia.org/T207881 (10Banyek) p:05Unbreak!>03Normal
[08:32:20] 10DBA, 10MediaWiki-Database, 10MW-1.33-notes (1.33.0-wmf.2; 2018-10-30), 10Patch-For-Review, 10Wikimedia-production-error: excessive "lock wait timeout exceeded " error rate after deploying 1.33.0-wmf.1 to group1 - https://phabricator.wikimedia.org/T207881 (10jcrespo) p:05Normal>03Unbreak!
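[Editor's note] The DOWN/UP lines quoted above are HAProxy's CSV stats for the db1073 backend. A minimal sketch of how such a state could be inspected and cleared, assuming a standard HAProxy admin socket at /run/haproxy/haproxy.sock and a systemd-managed haproxy unit (the exact socket path and unit name on dbproxy1005 are not shown in the log):

```
# Dump the stats CSV (same format as the "db1073,...,DOWN" line above)
# and keep only the rows that carry an UP/DOWN state.
echo "show stat" | socat stdio /run/haproxy/haproxy.sock | grep -E ',(UP|DOWN)'

# "Please reload the proxy": a reload re-reads the config and re-runs the
# health checks without dropping established connections.
sudo systemctl reload haproxy

# Afterwards the backend should report UP again, as in the 08:01 comment.
echo "show stat" | socat stdio /run/haproxy/haproxy.sock | grep db1073
```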
[09:54:43] 10DBA, 10User-Banyek: Reimage pc2006 with stretch - https://phabricator.wikimedia.org/T207934 (10Banyek)
[09:54:54] 10DBA, 10User-Banyek: Reimage pc2006 with stretch - https://phabricator.wikimedia.org/T207934 (10Banyek) p:05Triage>03Normal
[11:50:37] 10DBA, 10MediaWiki-Watchlist, 10Growth-Team (Current Sprint), 10MW-1.33-notes (1.33.0-wmf.1; 2018-10-23), and 2 others: Deleting large watchlist takes > 4 seconds causing rollback due to write time limit - https://phabricator.wikimedia.org/T171898 (10jcrespo)
[13:55:40] hey folks, thoughts on https://gerrit.wikimedia.org/r/#/c/operations/dns/+/467711/ ?
[13:55:43] should be quick and easy
[13:58:14] checking
[13:58:46] let me validate that
[14:03:24] sorry, I had the dns repo outdated
[14:04:13] indeed a mistake
[14:04:21] I double checked the ips on both affected servers
[14:05:01] was that thanks to the dns linter?
[14:05:20] if yes, that is great
[14:06:18] it is in merge conflict; I am not deploying it unless told you are not doing it
[14:08:19] yes it is
[14:08:31] thanks to volans' dns linter I mean
[14:08:32] I can merge
[14:08:35] please do!
[14:08:37] or I can wait
[14:08:39] ok, merging
[14:08:44] it will probably require a rebase with a changed parent
[14:08:49] yeah, saw that
[14:09:06] banyek: did you get told how to work with dns?
[14:09:13] the intention is to fix these errors one by one, and then make the linter a voting CI check
[14:09:22] that is cool
[14:09:38] (and potentially identify and fix any bugs in the process :)
[14:10:07] will wait a few minutes in case it can be helpful for banyek if he hasn't seen a dns merge before
[14:10:17] otherwise merge
[14:11:18] I've seen the file, and you can merge as I read it
[14:11:32] did manuel tell you the procedure?
[14:11:41] if you know it already, I will just merge
[14:11:54] if not, I will let you do it
[14:12:34] nope, this is new for me, so you can lead the way
[14:12:53] ok, I will merge, that is like all other gerrit patches
[14:13:10] check
[14:13:31] just to a separate repo
[14:14:05] documentation is at https://wikitech.wikimedia.org/wiki/DNS
[14:14:09] but you can read that later
[14:14:16] I will tell you what to do
[14:15:11] ok
[14:15:43] I just submitted
[14:16:02] now log in to one of the dns servers
[14:16:38] I use ns0.wikimedia.org so I don't have to look up which one it is
[14:16:55] eg. authdns1001
[14:17:04] ok
[14:17:09] there as root?
[14:17:27] root@authdns1001:~# who am i
[14:17:27] banyek pts/1 2018-10-25 14:17 (208.80.154.86)
[14:17:32] I'm in
[14:17:47] cool, so in 99.9% of the cases you just want to run
[14:18:27] authdns-update
[14:18:37] it will do the whole thing for you
[14:18:46] of course, check for errors
[14:18:55] I presume a) log first, b) run in a screen
[14:18:59] and the main issue with dns, I hope you know what it is: breaking it
[14:19:08] logging is ok
[14:19:20] screen/tmux is ok, although it is relatively fast
[14:19:38] it should just take 1-3 seconds
[14:19:50] ah, ok
[14:20:05] breaking it as in, pointing the wrong ip at the wrong dns name
[14:20:08] so I'll run authdns-update
[14:20:10] and vice versa
[14:20:13] +1
[14:20:33] so most of the work has to be done beforehand checking no errors, etc.
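[Editor's note] A condensed sketch of the deploy session described above. Only authdns-update and the ns0.wikimedia.org / authdns1001 names come from the conversation; the surrounding commands are ordinary ssh/tmux usage and are an assumption, not the documented procedure:

```
ssh ns0.wikimedia.org     # or authdns1001 directly; ns0 saves looking up which host it is
sudo -i                   # become root, as in the "who am i" check above
tmux new -s dns-deploy    # optional: the run takes only 1-3 seconds, but a screen/tmux
                          # session protects against a dropped connection
authdns-update            # pulls the merged change, shows the pending zone diff and
                          # asks "Merge these changes? (yes/no)?"
```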
[14:21:08] and then check that the dns entry works as intended (of course there is cache, etc)
[14:21:32] it is a relatively straightforward process, the actual deploy
[14:21:44] ```
diff --git templates/10.in-addr.arpa templates/10.in-addr.arpa
index a6e476e1..6c46b502 100644
--- templates/10.in-addr.arpa
+++ templates/10.in-addr.arpa
@@ -3587,7 +3587,7 @@ $ORIGIN 48.192.{{ zonename }}.
112 1H IN PTR mw2290.codfw.wmnet.
113 1H IN PTR ms-be2043.codfw.wmnet.
114 1H IN PTR db2039.codfw.wmnet.
-115 1H IN PTR db2041.codfw.wmnet.
+115 1H IN PTR db2042.codfw.wmnet.
116 1H IN PTR backup2001.codfw.wmnet.
$ORIGIN 49.192.{{ zonename }}.
Merge these changes? (yes/no)?
```
[14:21:48] I love this
[14:21:57] :-)
[14:22:09] it is very similar to puppet-merge
[14:22:17] it's done
[14:22:20] coolio
[14:22:28] so this is a trivial change
[14:22:43] but in other cases, check that it has been correctly applied afterwards
[14:23:21] and of course, for example, on ip changes dbs need a restart and/or a reimage
[14:24:19] * banyek nods
[14:24:35] you can read more about it on https://wikitech.wikimedia.org/wiki/DNS
[14:24:51] but for us outside of traffic, adding entries is the only thing we do normally
[14:25:51] faidon thanks for the ping!
[14:26:28] the log one? I learned! ;)
[14:27:40] a slightly different question
[14:27:56] I was thinking about your monitor_backup check
[14:28:03] yes?
[14:28:15] it is not "finished", no check is
[14:28:23] as I'd like to create the table check in a similar way
[14:29:11] what do you think about this?
[14:29:29] it is ok
[14:29:52] may I suggest thinking first about the overall architecture: where the check will run, etc.
[14:30:10] which roles would it affect?
[14:30:14] on puppet
[14:30:28] before going into the details, what do you think?
[14:30:59] my issue with the yaml is that you thought of that first, not the config itself
[14:31:07] sure thing
[14:31:18] you can of course do both at the same time
[14:31:46] but the problem with going into detail first is that later, when implemented, it may not make sense in context
[14:32:02] e.g. a suggestion would be to create a WIP puppet patch, with lots of TODOs
[14:32:15] so you can show "this is how I would do it"
[14:33:05] mainly thinking not to waste time because you create a script that later cannot be run depending on its exact location
[14:33:32] also thinking about security and other things first
[14:34:18] once the general idea (design/plan) is clear, and we all think it is a good idea, you can implement it on your own
[14:34:28] sounds like a plan?
[14:37:08] yes
[14:37:55] and what to do with the script I made in the past two days?
[14:38:35] what about it?
[14:39:11] after "planning" you won't be able to use it?
[14:39:20] I think we will
[14:39:29] it's pretty simple and clean
[14:39:32] then no issue?
[14:39:38] sure
[14:40:06] my suggestion is precisely to avoid working a lot and then not being able to use your work
[14:41:02] I hope tomorrow it will be in a state that I can run it on cumin in a screen. It will run every day and send us a mail if it finds any difference between hosts, so the first, rough 'band-aid' will be there
[14:41:24] I think we'll be able to use it with nagios - I have a good feeling about that
[14:41:30] "it will run every day"
[14:41:33] but for that, planning first
[14:41:35] how, from where?
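[Editor's note] On the "check that it has been correctly applied afterwards" step mentioned earlier in this exchange: for the PTR change shown in the diff above, a hedged verification could query the authoritative server directly to bypass resolver caches. This assumes {{ zonename }} expands to 10.in-addr.arpa, so entry 115 under $ORIGIN 48.192 corresponds to 10.192.48.115:

```
# Reverse lookup straight against an authoritative server (no caching resolver in between).
dig +short -x 10.192.48.115 @ns0.wikimedia.org
# expected after the deploy: db2042.codfw.wmnet.

# Sanity-check the forward records of the two hosts involved in the mixed-up PTR.
dig +short db2042.codfw.wmnet A @ns0.wikimedia.org
dig +short db2041.codfw.wmnet A @ns0.wikimedia.org
```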
[14:41:43] that is the most important thing
[14:41:58] note, this is just a *rough* *wip* state
[14:42:04] the code itself is not that important
[14:42:33] but- is it safe to run? from where? how is it productionized? etc.
[14:42:55] I am not doubting that, just saying you have to show a review for that
[14:43:37] as kick-off for this week my plan was: `while true; do ./run_my_checker; sleep 82800; done`
[14:43:51] not ok
[14:44:00] I know it is not ok
[14:44:21] I mean I know this is *far* from a productionised solution
[14:44:24] what if the server goes down, what if the masters go into maintenance, what if there is network maintenance
[14:44:34] if it is not in puppet, it doesn't exist
[14:45:02] remember pt-kill-wmf?
[14:45:23] I definitely don't think 'running in a screen' is a good solution
[14:45:26] yes, and we asked you to productionize it - you coded nothing there
[14:45:35] this is the same
[14:45:46] the main aspect is the productionization
[14:45:54] I know
[14:45:56] the script itself is almost secondary
[14:46:14] and the architecture was clear there
[14:46:48] here even I don't know what the right place is to run such a check in production, or where it should get its configuration - that is your main problem :-D
[14:47:07] we ask you to propose that ;-)
[14:47:22] and we will probably go through several iterations
[14:49:10] That's why I said I'd like to check the way monitor_backups was made, because I feel the way it works could be a good template for this
[14:49:31] (As in running a custom nagios check which returns verbose output)
[14:50:14] actually,
[14:50:22] the following iterations are in my head:
[14:50:33] a, run in a screen for this weekend
[14:50:49] b, turn it into a cron job, which sends us emails if there is a problem
[14:51:15] c, turn it into a nagios check (or create a nagios check from the output of the cron job - which is better I think)
[14:51:38] I'll think this through, and put together some neat documentation about it
[14:51:39] deal?
[14:54:13] thank you
[14:54:36] yay! :)
[14:56:35] jynus, banyek: thanks for the DNS merge
[14:56:54] I have to leave at 5 (kindergarten) but I'll check back in the evening (if there are new tickets, icinga errors, etc.) and will work a few more hours (at least 1)
[14:57:19] Ah, and I have to update my 'manual' with the DNS merge process ;)
[14:57:57] volans: I am happy to help ;)
[14:58:36] regarding the above, if I may interject, I might be missing the context so correct me if I'm wrong, but the basic concept is that no tool/script should be installed manually in production, but with puppet, and no code should run in production without having been code reviewed and merged in the appropriate repository
[15:00:51] especially if it runs unattended on mediawiki masters
[15:01:55] volans: you are right, the script we are talking about is harmless, it checks data integrity between tables and is part of the wmf-dba toolset (which was made by jynus); what I created is just a wrapper around it to make our life easier - I can start it manually every day until it is productionized
[15:02:22] It is not harmless, I wrote it - and it is not production-ready
[15:03:53] ok, I definitely won't leave it running unattended.
[15:04:29] but now I have to pick up my kids from the kindergarten, before somebody else brings them back home :)
[15:07:21] jynus: have a good weekend, I'll keep the fire burning; if something odd happens, I'll ask for help
[15:09:43] thanks, banyek!
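[Editor's note] A rough sketch of iterations b) and c) from the plan above: one wrapper that can serve both as a mail-on-difference cron job and as a nagios-style check via its exit code. ./run_my_checker is the name used in the conversation; its installed path, the mail recipient, and the assumption that "any output means a difference was found" are hypothetical, and per the conversation nothing like this should run unattended before it has been reviewed and puppetized:

```
#!/bin/bash
# Hypothetical wrapper around the table-difference checker discussed above.
set -o pipefail

CHECKER=/usr/local/bin/run_my_checker      # assumed install location
MAILTO=dba-team@example.org                # placeholder recipient

OUTPUT=$($CHECKER 2>&1)
STATUS=$?

if [ $STATUS -ne 0 ]; then
    echo "UNKNOWN: checker failed to run"
    echo "$OUTPUT"
    exit 3                                 # nagios UNKNOWN
elif [ -n "$OUTPUT" ]; then
    echo "CRITICAL: differences between hosts detected"
    echo "$OUTPUT" | mail -s "table check: differences found" "$MAILTO"
    echo "$OUTPUT"
    exit 2                                 # nagios CRITICAL
else
    echo "OK: no differences found"
    exit 0                                 # nagios OK
fi
```

Run from cron for iteration b); the same exit codes would let icinga/nagios consume it later for iteration c).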
[16:00:24] 10DBA, 10Lexicographical data, 10Wikidata, 10Datacenter-Switchover-2018, and 5 others: S8 replication issues leading to rows missing during eqiad -> codfw switch (Was: "A few lexemes disappeared") - https://phabricator.wikimedia.org/T206743 (10jcrespo) @Pigsonthewing I hope my comment at Wikidata Village P...
[19:12:48] 10DBA, 10cloud-services-team: dbproxy1005 reports database failover - https://phabricator.wikimedia.org/T207901 (10Bstorm) I was wondering what had happened....
[19:54:05] 10DBA, 10MediaWiki-Database, 10MW-1.33-notes (1.33.0-wmf.1; 2018-10-23), 10Patch-For-Review, 10Wikimedia-production-error: excessive "lock wait timeout exceeded " error rate after deploying 1.33.0-wmf.1 to group1 - https://phabricator.wikimedia.org/T207881 (10mmodell) I'm seeing a lot of occurrences of t...
[20:00:21] 10DBA, 10MediaWiki-Database, 10MW-1.33-notes (1.33.0-wmf.1; 2018-10-23), 10Patch-For-Review, 10Wikimedia-production-error: excessive "lock wait timeout exceeded " error rate after deploying 1.33.0-wmf.1 to group1 - https://phabricator.wikimedia.org/T207881 (10mmodell) This is happening even more now that...
[20:02:03] 10DBA, 10MediaWiki-Database, 10MW-1.33-notes (1.33.0-wmf.1; 2018-10-23), 10Patch-For-Review, 10Wikimedia-production-error: excessive "lock wait timeout exceeded " error rate after deploying 1.33.0-wmf.1 to group1 - https://phabricator.wikimedia.org/T207881 (10mmodell) This is for the past 1 hour, during...
[20:03:40] 10DBA, 10MediaWiki-Database, 10MW-1.33-notes (1.33.0-wmf.1; 2018-10-23), 10Patch-For-Review, 10Wikimedia-production-error: excessive "lock wait timeout exceeded " error rate after deploying 1.33.0-wmf.1 to group1 - https://phabricator.wikimedia.org/T207881 (10mmodell) I've also seen periodic alerts for a...
[20:34:44] 10DBA, 10MediaWiki-Database, 10MW-1.33-notes (1.33.0-wmf.1; 2018-10-23), 10Patch-For-Review, 10Wikimedia-production-error: excessive "lock wait timeout exceeded " error rate after deploying 1.33.0-wmf.1 to group1 - https://phabricator.wikimedia.org/T207881 (10mmodell) Promoted group1 to wmf.1 again and n...
[20:40:03] 10DBA, 10MediaWiki-Database, 10MW-1.33-notes (1.33.0-wmf.1; 2018-10-23), 10Patch-For-Review, 10Wikimedia-production-error: excessive "lock wait timeout exceeded " error rate after deploying 1.33.0-wmf.1 to group1 - https://phabricator.wikimedia.org/T207881 (10mmodell)
[20:46:31] 10DBA, 10MediaWiki-Database, 10MediaWiki-extensions-WikibaseRepository, 10Wikidata, and 3 others: excessive "lock wait timeout exceeded " error rate after deploying 1.33.0-wmf.1 to group1 - https://phabricator.wikimedia.org/T207881 (10mmodell)