[07:11:36] 10DBA: dbproxy1005 reports database failover - https://phabricator.wikimedia.org/T207901 (10jcrespo) The host is not up and running, it says: `db1073,0,0,0,0,,0,0,0,,0,,0,0,0,0,DOWN` You are lucky no one uses dbproxy1005 at the moment - otherwise it would have gone read-only.
[07:20:55] 10DBA, 10cloud-services-team: dbproxy1005 reports database failover - https://phabricator.wikimedia.org/T207901 (10jcrespo) Network likely went down at 19:23 https://grafana.wikimedia.org/dashboard/db/mysql?orgId=1&var-dc=eqiad%20prometheus%2Fops&var-server=db1073&var-port=9104&from=1540407778981&to=15404104438...
[07:52:05] 10DBA, 10cloud-services-team: dbproxy1005 reports database failover - https://phabricator.wikimedia.org/T207901 (10Banyek) Yep, that means it solved the question @Bstorm mentioned. @jcrespo I mean db1073 //itself// is up and running, but the service is down, that's why I mentioned this seems to be a transient error
[07:57:19] 10DBA, 10cloud-services-team: dbproxy1005 reports database failover - https://phabricator.wikimedia.org/T207901 (10jcrespo) Please reload the proxy and work with @Bstorm or whoever may help to identify next steps.
[08:00:00] 10DBA, 10cloud-services-team: dbproxy1005 reports database failover - https://phabricator.wikimedia.org/T207901 (10Banyek) What do you think, can T207881 be related? I don't really think so, but the timestamps correlate. (Which may not mean anything, but better to mention than not)
[08:01:23] 10DBA, 10cloud-services-team: dbproxy1005 reports database failover - https://phabricator.wikimedia.org/T207901 (10Banyek) After reloading HAProxy it reports correctly: ```mariadb,db1073,0,0,0,0,,0,0,0,,0,,0,0,0,0,UP```
[08:02:26] 10DBA, 10cloud-services-team: dbproxy1005 reports database failover - https://phabricator.wikimedia.org/T207901 (10jcrespo) T207881 is mediawiki, db1072 is m5, nothing to do.
[08:08:13] 10DBA, 10cloud-services-team: dbproxy1005 reports database failover - https://phabricator.wikimedia.org/T207901 (10Banyek) Sorry, I was not clear, I like your idea that possibly there was a small network issue, and I was thinking whether there could be a network issue which could have affected the connection between mw ho...
[08:29:33] 10DBA, 10User-Banyek: dbstore2002 tables compression status check - https://phabricator.wikimedia.org/T204930 (10Banyek) The command used for compressing tables is ``` mysql -BN -S /run/mysqld/mysqld.s1.sock -e "SELECT table_schema, table_name FROM information_Schema.tables WHERE engine='INNODB' and row_...
[08:30:05] 10DBA, 10cloud-services-team: dbproxy1005 reports database failover - https://phabricator.wikimedia.org/T207901 (10Banyek) 05Open>03Resolved a:03Banyek
[08:30:50] 10DBA, 10MediaWiki-Database, 10MW-1.33-notes (1.33.0-wmf.2; 2018-10-30), 10Patch-For-Review, 10Wikimedia-production-error: excessive "lock wait timeout exceeded " error rate after deploying 1.33.0-wmf.1 to group1 - https://phabricator.wikimedia.org/T207881 (10Banyek) p:05Unbreak!>03Normal
[08:32:20] 10DBA, 10MediaWiki-Database, 10MW-1.33-notes (1.33.0-wmf.2; 2018-10-30), 10Patch-For-Review, 10Wikimedia-production-error: excessive "lock wait timeout exceeded " error rate after deploying 1.33.0-wmf.1 to group1 - https://phabricator.wikimedia.org/T207881 (10jcrespo) p:05Normal>03Unbreak!
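[Editor's note] The DOWN/UP lines quoted above are HAProxy's CSV stats for the db1073 backend. A minimal sketch of how such a state could be inspected and cleared, assuming a standard HAProxy admin socket at /run/haproxy/haproxy.sock and a systemd-managed haproxy unit (the exact socket path and unit name on dbproxy1005 are not shown in the log):

```
# Dump the stats CSV (same format as the "db1073,...,DOWN" line above)
# and keep only the rows that carry an UP/DOWN state.
echo "show stat" | socat stdio /run/haproxy/haproxy.sock | grep -E ',(UP|DOWN)'

# "Please reload the proxy": a reload re-reads the config and re-runs the
# health checks without dropping established connections.
sudo systemctl reload haproxy

# Afterwards the backend should report UP again, as in the 08:01 comment.
echo "show stat" | socat stdio /run/haproxy/haproxy.sock | grep db1073
```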
[09:54:43] 10DBA, 10User-Banyek: Reimage pc2006 with stretch - https://phabricator.wikimedia.org/T207934 (10Banyek)
[09:54:54] 10DBA, 10User-Banyek: Reimage pc2006 with stretch - https://phabricator.wikimedia.org/T207934 (10Banyek) p:05Triage>03Normal
[11:50:37] 10DBA, 10MediaWiki-Watchlist, 10Growth-Team (Current Sprint), 10MW-1.33-notes (1.33.0-wmf.1; 2018-10-23), and 2 others: Deleting large watchlist takes > 4 seconds causing rollback due to write time limit - https://phabricator.wikimedia.org/T171898 (10jcrespo)
[13:55:40] hey folks, thoughts on https://gerrit.wikimedia.org/r/#/c/operations/dns/+/467711/ ?
[13:55:43] should be quick and easy
[13:58:14] checking
[13:58:46] let me validate that
[14:03:24] sorry, I had the dns repo outdated
[14:04:13] indeed a mistake
[14:04:21] I double checked the ips on both affected servers
[14:05:01] was that thanks to the dns linter?
[14:05:20] if yes, that is great
[14:06:18] it is in merge conflict; I am not deploying it unless told you are not doing it
[14:08:19] yes it is
[14:08:31] thanks to volans' dns linter I mean
[14:08:32] I can merge
[14:08:35] please do!
[14:08:37] or I can wait
[14:08:39] ok, merging
[14:08:44] it will probably require a rebase with a changed parent
[14:08:49] yeah, saw that
[14:09:06] banyek: did you get told how to work with dns?
[14:09:13] the intention is to fix these errors one by one, and then make the linter a voting CI check
[14:09:22] that is cool
[14:09:38] (and potentially identify and fix any bugs in the process :)
[14:10:07] will wait a few minutes in case it can be helpful for banyek if he hasn't seen a dns merge before
[14:10:17] otherwise merge
[14:11:18] I've seen the file, and you can merge as I read it
[14:11:32] did manuel tell you the procedure?
[14:11:41] if you know it already, I will just merge
[14:11:54] if not, I will let you do it
[14:12:34] nope, this is new for me, so you can lead the way
[14:12:53] ok, I will merge, that is like all other gerrit patches
[14:13:10] check
[14:13:31] just to a separate repo
[14:14:05] documentation is at https://wikitech.wikimedia.org/wiki/DNS
[14:14:09] but you can read that later
[14:14:16] I will tell you what to do
[14:15:11] ok
[14:15:43] I just submitted
[14:16:02] now log in to one of the dns servers
[14:16:38] I use ns0.wikimedia.org so I don't have to look up which one it is
[14:16:55] eg. authdns1001
[14:17:04] ok
[14:17:09] there as root?
[14:17:27] root@authdns1001:~# who am i
[14:17:27] banyek pts/1 2018-10-25 14:17 (208.80.154.86)
[14:17:32] I'm in
[14:17:47] cool, so in 99.9% of the cases you just want to run
[14:18:27] authdns-update
[14:18:37] it will do the whole thing for you
[14:18:46] of course, check for errors
[14:18:55] I presume a) log first, b) run in a screen
[14:18:59] and the main issue with dns, I hope you know what it is: breaking it
[14:19:08] logging is ok
[14:19:20] screen/tmux is ok, although it is relatively fast
[14:19:38] it should just take 1-3 seconds
[14:19:50] ah, ok
[14:20:05] breaking it as in, pointing the wrong ip at the wrong dns name
[14:20:08] so I'll run authdns-update
[14:20:10] and vice versa
[14:20:13] +1
[14:20:33] so most of the work has to be done beforehand checking no errors, etc.
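[Editor's note] A condensed sketch of the deploy session described above. Only authdns-update and the ns0.wikimedia.org / authdns1001 names come from the conversation; the surrounding commands are ordinary ssh/tmux usage and are an assumption, not the documented procedure:

```
ssh ns0.wikimedia.org     # or authdns1001 directly; ns0 saves looking up which host it is
sudo -i                   # become root, as in the "who am i" check above
tmux new -s dns-deploy    # optional: the run takes only 1-3 seconds, but a screen/tmux
                          # session protects against a dropped connection
authdns-update            # pulls the merged change, shows the pending zone diff and
                          # asks "Merge these changes? (yes/no)?"
```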
[14:21:08] and then check that the dns entry works as intended (of course there is cache, etc)
[14:21:32] it is a relatively straightforward process, the actual deploy
[14:21:44] ```
diff --git templates/10.in-addr.arpa templates/10.in-addr.arpa
index a6e476e1..6c46b502 100644
--- templates/10.in-addr.arpa
+++ templates/10.in-addr.arpa
@@ -3587,7 +3587,7 @@ $ORIGIN 48.192.{{ zonename }}.
112 1H IN PTR mw2290.codfw.wmnet.
113 1H IN PTR ms-be2043.codfw.wmnet.
114 1H IN PTR db2039.codfw.wmnet.
-115 1H IN PTR db2041.codfw.wmnet.
+115 1H IN PTR db2042.codfw.wmnet.
116 1H IN PTR backup2001.codfw.wmnet.
$ORIGIN 49.192.{{ zonename }}.
Merge these changes? (yes/no)?
```
[14:21:48] I love this
[14:21:57] :-)
[14:22:09] it is very similar to puppet-merge
[14:22:17] it's done
[14:22:20] coolio
[14:22:28] so this is a trivial change
[14:22:43] but in other cases, check that it has been correctly applied afterwards
[14:23:21] and of course, for example, on ip changes dbs need a restart and/or a reimage
[14:24:19] * banyek nods
[14:24:35] you can read more about it on https://wikitech.wikimedia.org/wiki/DNS
[14:24:51] but for us outside of traffic, adding entries is the only thing we do normally
[14:25:51] faidon thanks for the ping!
[14:26:28] the log one? I learned! ;)
[14:27:40] a slightly different question
[14:27:56] I was thinking about your monitor_backup check
[14:28:03] yes?
[14:28:15] it is not "finished", no check is
[14:28:23] as I'd like to create the table check in a similar way
[14:29:11] what do you think about this?
[14:29:29] it is ok
[14:29:52] may I suggest thinking first about the overall architecture: where the check will run, etc.
[14:30:10] which roles would it affect?
[14:30:14] on puppet
[14:30:28] before going into the details, what do you think?
[14:30:59] my issue with the yaml is that you thought of that first, not the config itself
[14:31:07] sure thing
[14:31:18] you can of course do both at the same time
[14:31:46] but the problem with going into detail first is that later, when implemented, it may not make sense in context
[14:32:02] e.g. a suggestion would be to create a WIP puppet patch, with lots of TODOs
[14:32:15] so you can show "this is how I would do it"
[14:33:05] mainly thinking not to waste time because you create a script that later cannot be run depending on its exact location
[14:33:32] also thinking about security and other things first
[14:34:18] once the general idea (design/plan) is clear, and we all think it is a good idea, you can implement it on your own
[14:34:28] sounds like a plan?
[14:37:08] yes
[14:37:55] and what to do with the script I made in the past two days?
[14:38:35] what about it?
[14:39:11] after "planning" you won't be able to use it?
[14:39:20] I think we will
[14:39:29] it's pretty simple and clean
[14:39:32] then no issue?
[14:39:38] sure
[14:40:06] my suggestion is precisely to avoid working a lot and then not being able to use your work
[14:41:02] I hope tomorrow it will be in a state that I can run it on cumin in a screen. It will run every day and send us a mail if it finds any difference between hosts, so the first, rough 'band-aid' will be there
[14:41:24] I think we'll be able to use it with nagios - I have a good feeling about that
[14:41:30] "it will run every day"
[14:41:33] but for that, planning first
[14:41:35] how, from where?
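[Editor's note] On the "check that it has been correctly applied afterwards" step mentioned earlier in this exchange: for the PTR change shown in the diff above, a hedged verification could query the authoritative server directly to bypass resolver caches. This assumes {{ zonename }} expands to 10.in-addr.arpa, so entry 115 under $ORIGIN 48.192 corresponds to 10.192.48.115:

```
# Reverse lookup straight against an authoritative server (no caching resolver in between).
dig +short -x 10.192.48.115 @ns0.wikimedia.org
# expected after the deploy: db2042.codfw.wmnet.

# Sanity-check the forward records of the two hosts involved in the mixed-up PTR.
dig +short db2042.codfw.wmnet A @ns0.wikimedia.org
dig +short db2041.codfw.wmnet A @ns0.wikimedia.org
```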
[14:41:43] that is the most important thing
[14:41:58] note, this is just a *rough* *wip* state
[14:42:04] the code itself is not that important
[14:42:33] but- is it safe to run? from where? how is it productionized? etc.
[14:42:55] I am not doubting that, just saying you have to show a review for that
[14:43:37] as kick-off for this week my plan was: `while true; do ./run_my_checker; sleep 82800; done`
[14:43:51] not ok
[14:44:00] I know it is not ok
[14:44:21] I mean I know this is *far* from a productionised solution
[14:44:24] what if the server goes down, what if the masters go into maintenance, what if there is network maintenance
[14:44:34] if it is not in puppet, it doesn't exist
[14:45:02] remember pt-kill-wmf?
[14:45:23] I definitely don't think 'running in a screen' is a good solution
[14:45:26] yes, and we asked you to productionize it - you coded nothing there
[14:45:35] this is the same
[14:45:46] the main aspect is the productionization
[14:45:54] I know
[14:45:56] the script itself is almost secondary
[14:46:14] and the architecture was clear there
[14:46:48] here even I don't know what the right place is to run such a check in production, or where it should get its configuration - that is your main problem :-D
[14:47:07] we ask you to propose that ;-)
[14:47:22] and we will probably go through several iterations
[14:49:10] That's why I said I'd like to check the way monitor_backups was made, because I feel the way it works could be a good template for this
[14:49:31] (As in running a custom nagios check which returns verbose output)
[14:50:14] actually,
[14:50:22] the following iterations are in my head:
[14:50:33] a, run in a screen for this weekend
[14:50:49] b, turn it into a cron job, which sends us emails if there is a problem
[14:51:15] c, turn it into a nagios check (or create a nagios check from the output of the cron job - which is better I think)
[14:51:38] I'll think this through, and put together some neat documentation about it
[14:51:39] deal?
[14:54:13] thank you
[14:54:36] yay! :)
[14:56:35] jynus, banyek: thanks for the DNS merge
[14:56:54] I have to leave at 5 (kindergarten) but I'll check back in the evening (if there are new tickets, icinga errors, etc.) and will work a few more hours (at least 1)
[14:57:19] Ah, and I have to update my 'manual' with the DNS merge process ;)
[14:57:57] volans: I am happy to help ;)
[14:58:36] regarding the above, if I may interject, I might be missing the context so correct me if I'm wrong, but the basic concept is that no tool/script should be installed manually in production, but with puppet, and no code should run in production without having been code reviewed and merged in the appropriate repository
[15:00:51] especially if it runs unattended on mediawiki masters
[15:01:55] volans: you are right, the script we are talking about is harmless, it checks data integrity between tables and is part of the wmf-dba toolset (which was made by jynus); what I created is just a wrapper around it to make our life easier - I can start it manually every day until it is productionized
[15:02:22] It is not harmless, I wrote it - and it is not production-ready
[15:03:53] ok, I definitely won't leave it running unattended.
[15:04:29] but now I have to pick up my kids from the kindergarten, before somebody else brings them back home :)
[15:07:21] jynus: have a good weekend, I'll keep the fire burning; if something odd happens, I'll ask for help
[15:09:43] thanks, banyek!
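[Editor's note] A rough sketch of iterations b) and c) from the plan above: one wrapper that can serve both as a mail-on-difference cron job and as a nagios-style check via its exit code. ./run_my_checker is the name used in the conversation; its installed path, the mail recipient, and the assumption that "any output means a difference was found" are hypothetical, and per the conversation nothing like this should run unattended before it has been reviewed and puppetized:

```
#!/bin/bash
# Hypothetical wrapper around the table-difference checker discussed above.
set -o pipefail

CHECKER=/usr/local/bin/run_my_checker      # assumed install location
MAILTO=dba-team@example.org                # placeholder recipient

OUTPUT=$($CHECKER 2>&1)
STATUS=$?

if [ $STATUS -ne 0 ]; then
    echo "UNKNOWN: checker failed to run"
    echo "$OUTPUT"
    exit 3                                 # nagios UNKNOWN
elif [ -n "$OUTPUT" ]; then
    echo "CRITICAL: differences between hosts detected"
    echo "$OUTPUT" | mail -s "table check: differences found" "$MAILTO"
    echo "$OUTPUT"
    exit 2                                 # nagios CRITICAL
else
    echo "OK: no differences found"
    exit 0                                 # nagios OK
fi
```

Run from cron for iteration b); the same exit codes would let icinga/nagios consume it later for iteration c).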
[16:00:24] 10DBA, 10Lexicographical data, 10Wikidata, 10Datacenter-Switchover-2018, and 5 others: S8 replication issues leading to rows missing during eqiad -> codfw switch (Was: "A few lexemes disappeared") - https://phabricator.wikimedia.org/T206743 (10jcrespo) @Pigsonthewing I hope my comment at Wikidata Village P...
[19:12:48] 10DBA, 10cloud-services-team: dbproxy1005 reports database failover - https://phabricator.wikimedia.org/T207901 (10Bstorm) I was wondering what had happened....
[19:54:05] 10DBA, 10MediaWiki-Database, 10MW-1.33-notes (1.33.0-wmf.1; 2018-10-23), 10Patch-For-Review, 10Wikimedia-production-error: excessive "lock wait timeout exceeded " error rate after deploying 1.33.0-wmf.1 to group1 - https://phabricator.wikimedia.org/T207881 (10mmodell) I'm seeing a lot of occurrences of t...
[20:00:21] 10DBA, 10MediaWiki-Database, 10MW-1.33-notes (1.33.0-wmf.1; 2018-10-23), 10Patch-For-Review, 10Wikimedia-production-error: excessive "lock wait timeout exceeded " error rate after deploying 1.33.0-wmf.1 to group1 - https://phabricator.wikimedia.org/T207881 (10mmodell) This is happening even more now that...
[20:02:03] 10DBA, 10MediaWiki-Database, 10MW-1.33-notes (1.33.0-wmf.1; 2018-10-23), 10Patch-For-Review, 10Wikimedia-production-error: excessive "lock wait timeout exceeded " error rate after deploying 1.33.0-wmf.1 to group1 - https://phabricator.wikimedia.org/T207881 (10mmodell) This is for the past 1 hour, during...
[20:03:40] 10DBA, 10MediaWiki-Database, 10MW-1.33-notes (1.33.0-wmf.1; 2018-10-23), 10Patch-For-Review, 10Wikimedia-production-error: excessive "lock wait timeout exceeded " error rate after deploying 1.33.0-wmf.1 to group1 - https://phabricator.wikimedia.org/T207881 (10mmodell) I've also seen periodic alerts for a...
[20:34:44] 10DBA, 10MediaWiki-Database, 10MW-1.33-notes (1.33.0-wmf.1; 2018-10-23), 10Patch-For-Review, 10Wikimedia-production-error: excessive "lock wait timeout exceeded " error rate after deploying 1.33.0-wmf.1 to group1 - https://phabricator.wikimedia.org/T207881 (10mmodell) Promoted group1 to wmf.1 again and n...
[20:40:03] 10DBA, 10MediaWiki-Database, 10MW-1.33-notes (1.33.0-wmf.1; 2018-10-23), 10Patch-For-Review, 10Wikimedia-production-error: excessive "lock wait timeout exceeded " error rate after deploying 1.33.0-wmf.1 to group1 - https://phabricator.wikimedia.org/T207881 (10mmodell)
[20:46:31] 10DBA, 10MediaWiki-Database, 10MediaWiki-extensions-WikibaseRepository, 10Wikidata, and 3 others: excessive "lock wait timeout exceeded " error rate after deploying 1.33.0-wmf.1 to group1 - https://phabricator.wikimedia.org/T207881 (10mmodell)