[00:10:20] 10DBA, 10Availability (MediaWiki-MultiDC): Tracking: Cleanup x1 database connection patterns - https://phabricator.wikimedia.org/T164504 (10aaron) Are there any tasks here that remain and are blockers to multi-DC?
[04:56:56] 10DBA, 10Operations, 10ops-codfw: db2061 disk with predictive failure - https://phabricator.wikimedia.org/T200059 (10Marostegui) All good! ``` root@db2061:~# hpssacli controller all show config Smart Array P420i in Slot 0 (Embedded) (sn: 0014380337F3720) Port Name: 1I Port Name: 2I Gen8 Serv...
[04:57:04] 10DBA, 10Operations, 10ops-codfw: db2061 disk with predictive failure - https://phabricator.wikimedia.org/T200059 (10Marostegui) 05Open>03Resolved
[05:15:17] 10DBA, 10Operations, 10ops-eqiad: es1019 mgmt interface DOWN - https://phabricator.wikimedia.org/T201132 (10Marostegui)
[05:15:24] 10DBA, 10Operations, 10ops-eqiad: es1019 mgmt interface DOWN - https://phabricator.wikimedia.org/T201132 (10Marostegui) p:05Triage>03Normal
[05:22:28] 10DBA, 10Operations, 10ops-eqiad: db1069 (x1 master) memory errors - https://phabricator.wikimedia.org/T201133 (10Marostegui) 05Open>03stalled p:05Triage>03Normal
[07:22:12] 10DBA, 10Availability (MediaWiki-MultiDC): Tracking: Cleanup x1 database connection patterns - https://phabricator.wikimedia.org/T164504 (10jcrespo) The only thing I can think of is T164407 and T164505
[07:34:33] 10DBA, 10Collaboration-Team-Triage, 10Growth-Team, 10MediaWiki-extensions-PageCuration, 10Schema-change: Drop ptrl_comment in production - https://phabricator.wikimedia.org/T157762 (10Marostegui) deleted from testwiki and test2wiki on all eqiad hosts (codfw was done directly on the master) [x] labsdb101...
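The db2061 paste above only shows the controller-level view, and the IRC bot truncates it. For a predictive-failure case like the one the task tracks, the usual follow-up is a per-drive status query; a minimal sketch below, assuming slot 0 as shown in the paste (the exact command run on db2061 is not in the log):

```
# Hypothetical sketch: list the status of every physical drive behind the
# Smart Array controller in slot 0 (the slot shown in the paste above).
# After the disk swap, the replaced drive should report "OK" rather than
# "Predictive Failure".
hpssacli controller slot=0 physicaldrive all show status
```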
[07:34:43] 10DBA, 10Collaboration-Team-Triage, 10Growth-Team, 10MediaWiki-extensions-PageCuration, 10Schema-change: Drop ptrl_comment in production - https://phabricator.wikimedia.org/T157762 (10Marostegui)
[07:34:51] 10DBA, 10Collaboration-Team-Triage, 10Growth-Team, 10MediaWiki-extensions-PageCuration, 10Schema-change: Drop ptrl_comment in production - https://phabricator.wikimedia.org/T157762 (10Marostegui) 05Open>03Resolved a:03Marostegui
[08:19:53] 10Blocked-on-schema-change, 10DBA, 10Patch-For-Review, 10Schema-change: Truncate SHA-1 indexes - https://phabricator.wikimedia.org/T51190 (10Marostegui) s3 eqiad progress [] labsdb1011 [] labsdb1010 [] labsdb1009 [x] dbstore1002 [] db1124 [] db1123 [] db1078 [] db1077 [] db1075
[08:19:57] 10DBA, 10Patch-For-Review, 10Schema-change: Convert UNIQUE INDEX to PK in Production - https://phabricator.wikimedia.org/T199368 (10Marostegui) s3 eqiad progress [] labsdb1011 [] labsdb1010 [] labsdb1009 [x] dbstore1002 [] db1124 [] db1123 [] db1078 [] db1077 [] db1075
[08:20:00] 10Blocked-on-schema-change, 10DBA, 10Wikidata, 10Patch-For-Review, 10Schema-change: Drop eu_touched in production - https://phabricator.wikimedia.org/T144010 (10Marostegui) s3 eqiad progress [] labsdb1011 [] labsdb1010 [] labsdb1009 [x] dbstore1002 [] db1124 [] db1123 [] db1078 [] db1077 [] db1075
[09:14:37] 10DBA, 10Data-Services, 10User-Urbanecm, 10cloud-services-team (Kanban): Prepare and check storage layer for satwiki - https://phabricator.wikimedia.org/T198401 (10Marostegui) I have granted access to that, try re-running the script: ``` for i in labsdb1009 labsdb1010 labsdb1011; do echo $i; mysql.py -h$i...
[09:14:58] 10DBA, 10Data-Services, 10Chinese-Sites, 10User-Urbanecm, 10cloud-services-team (Kanban): Prepare and check storage layer for zhwikiversity - https://phabricator.wikimedia.org/T199599 (10Marostegui) I assume you meant zhwikiversity, I have granted that on all hosts. Try re-running the script. ``` for i i...
[09:15:38] 10DBA, 10Data-Services, 10User-Urbanecm: Prepare and check storage layer for wikimaniawiki - https://phabricator.wikimedia.org/T201001 (10Marostegui) I have added the grants to labsdbuser so the script won't fail with access denied: ``` for i in labsdb1009 labsdb1010 labsdb1011; do echo $i; mysql.py -h$i -e...
[09:22:32] so I am thinking of deploying the read only check on codfw and testing there- even if I have to revert afterwards
[09:23:03] can you think of any multiinstance that would be read-write?
[09:26:21] yeah
[09:26:23] sanitarium
[09:26:35] we could do it for db2094 or db2095
[09:26:36] sanitarium is r-w?
[09:26:39] one of the instances
[09:26:45] No, it is not, but we could set it up
[09:26:53] I mean, nothing uses it
[09:26:57] ok, what I mean
[09:27:10] is: is there a reason to not suppose they all should be configured read-only by default
[09:27:17] aka not parametrized
[09:27:34] e.g. for core, I will parametrize it based on whether it is the current master
[09:27:45] tests aside
[09:27:54] (which we can also do)
[09:28:03] Yeah, for codfw it should all be read-only until we flip over to codfw as active
[09:28:13] not for eqiad?
[09:28:27] I cannot think of any situation where an instance in codfw should be rw
[09:28:28] aka could we have a multiinstance as master
[09:28:36] or some other weird reason?
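The loops in the Phabricator comments above are cut off by the IRC bot: the pattern is simply iterating over the three wiki-replica hosts and running a verification query through the mysql.py wrapper. A minimal sketch of that pattern, assuming SHOW GRANTS as the verification query (the actual query is not visible in the truncated paste):

```
# Hypothetical sketch: check on each wiki-replica host what grants the
# labsdbuser account currently has, to confirm the new wiki is covered.
# The exact query used in the task is truncated in the log; SHOW GRANTS
# here is only an illustration.
for i in labsdb1009 labsdb1010 labsdb1011; do
    echo $i
    mysql.py -h$i -e "SHOW GRANTS FOR labsdbuser"
done
```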
[09:28:46] we could, but I don't think we are planning for that in the near future
[09:28:51] So I would assume by default read-only
[09:28:57] I mean, it is not set in stone
[09:29:02] I know
[09:29:13] I can change it, I am mostly asking if that supposition seems ok right now
[09:29:28] Yeah, I think by default all multi-instance are and should be RO
[09:29:36] (and in an emergency, like the other day, we can always ignore monitoring)
[09:30:49] yeah
[09:45:22] I am testing the normal behaviour on dbstore2001
[09:48:33] great
[09:51:11] I think it didn't work, grant issues
[09:51:41] is it using the nagios user?
[09:51:59] it should, checking what it is actually trying to do
[10:38:20] so apparently it sends a critical if there is replication lag, which is not intended
[11:06:51] maybe it is worth cutting off all the output, so getting rid of: Version 10.1.33-MariaDB, Uptime 6815001s, 5 mysqld process(es), read_only: True, s1 lag: 0.01s, 14 client(s), 110.17 QPS, connection latency: 0.005431s, query latency
[11:06:55] and all that
[11:07:03] it is clearer without it, no?
[11:09:58] sure
[11:10:10] not yet there, however- see this issue
[11:10:45] https://gerrit.wikimedia.org/r/450211
[11:11:44] I don't get it
[11:11:57] So if it appears twice, what fails?
[11:12:10] the parsing of the configuration
[11:12:30] I can make it more robust, but I think it is also a config error
[11:12:51] Yeah, definitely
[11:12:58] I didn't get what was failing
[11:13:00] Interesting....
[11:13:26] so both things need fixing, but this is easier to do first
[11:13:32] Totally
[11:14:21] I have only deployed the check on 4 hosts so far, so I am not attending to minor formatting issues yet
[11:14:59] Sure
[11:15:05] the script was supposed to check many things, so there are lots of things pending
[11:15:37] for now I want to reliably check the read only
[11:15:41] sure
[11:15:49] will improve the rest a bit later
[11:15:49] It was a suggestion to make it clearer
[11:16:04] which is ok, it's just that I am not yet there :-)
[11:16:14] I know! :)
[11:17:51] note that when it fails, it puts the reason for the failure in front, so I don't disagree with removing it, just that it is a minor issue right now
[11:18:46] Yeah, I know. For you and me it is obvious to read it, but to some untrained eyes it might not be that easy with all the noise
[11:24:40] 10Blocked-on-schema-change, 10DBA, 10Patch-For-Review, 10Schema-change: Truncate SHA-1 indexes - https://phabricator.wikimedia.org/T51190 (10Marostegui)
[11:24:50] 10Blocked-on-schema-change, 10DBA, 10Wikidata, 10Patch-For-Review, 10Schema-change: Drop eu_touched in production - https://phabricator.wikimedia.org/T144010 (10Marostegui)
[11:25:03] 10DBA, 10Patch-For-Review, 10Schema-change: Convert UNIQUE INDEX to PK in Production - https://phabricator.wikimedia.org/T199368 (10Marostegui)
[11:26:33] dbstore1001 is bad now
[11:26:43] as in...?
[11:26:50] read write
[11:34:06] can you help me check/change all multiinstance to be using unix_socket authentication for nagios?
[11:34:22] for example, I think most sanitariums are still using a password
[11:34:46] Sure, I will take care of the sanitariums
[11:35:04] I changed some already
[11:35:18] but I am deploying the check everywhere except core_mi
[11:37:07] Sanitariums look good
[11:37:11] both eqiad and codfw
[11:37:17] really?
[11:37:23] I changed 1 (all instances)
[11:37:27] db2094
[11:37:39] they all have unix_socket for root
[11:37:45] no, not for root
[11:37:50] those are everywhere
[11:37:52] for nagios
[11:38:00] oh crap!
[11:38:01] sorry!
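For the unix_socket migration being discussed here, the per-instance change amounts to switching the local nagios account from password authentication to MariaDB's unix_socket plugin. A minimal sketch, assuming the plugin is already loaded, that the account is 'nagios'@'localhost', and an example socket path; the exact grants applied by puppet are not shown in the log:

```
# Hypothetical sketch: switch the local nagios account on one multi-instance
# socket to unix_socket authentication (socket path is an example).
# GRANT ... IDENTIFIED VIA is valid on MariaDB 10.1, which these hosts run.
mysql -S /run/mysqld/mysqld.s1.sock \
      -e "GRANT USAGE ON *.* TO 'nagios'@'localhost' IDENTIFIED VIA unix_socket"
```

After that, `sudo -u nagios mysql -S /run/mysqld/mysqld.s1.sock -e "SELECT 1"` should connect without a password, which is presumably what the check relies on.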
[11:38:04] :)
[11:38:28] it could have happened that only 1 host was affected, but I was very suspicious about that :-)
[11:39:01] it is about checking db1124 and db2094
[11:39:40] yeah I am doing it now
[11:39:54] I will check the cores
[11:40:22] db1125 is all wrong
[11:40:38] all == all instances?
[11:40:44] yeah
[11:40:48] note we never really did a proper migration
[11:40:56] this could be a good excuse
[11:41:01] db1124 also wrong
[11:41:04] codfw looks good
[11:41:19] (although I think I did it on many single instance core hosts)
[11:42:24] core seems ok, probably because it was done there on upgrade
[11:42:34] I am going to deploy on a single codfw core hosts
[11:42:38] *host
[11:43:16] ok db1124 and db1125 fixed
[11:43:34] nah, there are some core hosts with that issue, too
[11:44:50] please help me keep an eye on pending and ongoing criticals on icinga
[11:45:27] yeah
[11:45:29] I deployed on the first core host, db2084
[11:45:45] I am checking irc :)
[11:46:22] you can also check manually if needed
[11:46:23] with
[11:46:32] I am going thru all the sections
[11:46:55] sudo -u nagios check_mariadb.py --port=3315 --icinga --process --check_read_only=1
[11:47:18] remove --port if it is 3306, or set it accordingly
[11:47:33] No, what I meant is I am checking all the nagios users across the board
[11:48:11] oh, thanks
[11:49:09] I am fixing the ones without the socket enabled
[11:49:44] I am leaving the test hosts aside: db1095, db1102 and db1118
[11:55:14] I am deploying on core codfw
[11:55:20] which I am happy about
[11:55:20] Cool
[11:55:29] I am fixing s5 now, s1, s2, s3, s4 already done
[12:00:25] db2078 may need an actual read only fix
[12:00:45] ok, I have done all cores + misc
[12:00:54] it is fixed in the config, though
[12:01:19] m5 is in read write
[12:01:31] Fixing == fixing the nagios user XD
[12:02:38] fixed db2078:3325 read only
[12:05:04] so I am going to do a single eqiad core instance
[12:05:44] db1090
[12:07:29] ok!
[12:08:37] (note single-instance hosts are not part of this deployment)
[12:09:28] also, did I tell you we no longer have core hosts with jessie/10.0 (except masters)
[12:10:00] And they will be gone soon!!!
[12:10:04] Great job! :)
[12:15:13] db1090 looks fine, deploying everywhere (8 hosts left)
[12:15:31] goood
[12:17:35] that should be it, I will leave deployment on single instance hosts for later
[12:17:40] next week
[12:17:53] sounds good
[12:17:54] and setting critical at the same time as the other critical fixes
[12:17:55] this is great!
[12:17:56] :)
[12:23:38] looking good: https://icinga.wikimedia.org/cgi-bin/icinga/status.cgi?search_string=MariaDB+read+only
[12:24:14] :)
[14:06:03] jynus can I power off dbproxy1006 for 5 mins
[14:06:46] I am slightly confused about what's happening here and need to look at the idrac settings
[14:09:45] one sec, cmjohnson1
[14:09:54] I think it is not active, but let me check
[14:12:02] ok, it is not, let me disable the checks for 1 second
[14:19:38] thanks jynus, all fixed
[14:20:00] I still do not know how backup1001 was misconfigured...but it's all correct now
[14:21:36] can I ask what is the thing pending, as I said before- reloading nagios because it has a new ip, or something else? Aka what did you do?
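The "fixed db2078:3325 read only" step above is the manual counterpart of what the new check alerts on: the m5 instance mentioned just before it had been left read-write. A minimal sketch of verifying and flipping the flag on that instance, assuming the per-section socket naming used on multi-instance hosts (the exact path is not in the log):

```
# Hypothetical sketch: check and fix the read_only flag on the m5 instance
# of a multi-instance host (socket path is an assumed example; each instance
# listens on its own port/socket rather than the default 3306 one).
mysql -S /run/mysqld/mysqld.m5.sock -e "SELECT @@global.read_only"
mysql -S /run/mysqld/mysqld.m5.sock -e "SET GLOBAL read_only = 1"
```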
[14:22:29] I don't need to know except to make everything green on icinga :-)
[14:22:41] oh, it is now green
[14:22:47] so no changes needed
[14:22:50] sorry
[14:24:43] jynus, on backup1001 the idrac had dbproxy1006's mgmt IP address
[14:25:32] so it was set that way....I have to assume I did it but that seems very odd and not something that I think I would've done
[14:25:33] yes, I just thought you had changed the ip
[14:26:00] no, the dns files were correct, it was just wrong on the server
[14:26:07] cool, thank you
[14:26:27] yep, since I have you: es1019 needs to be powered off...looks like a hard reset of the idrac is needed
[14:26:46] ok, let me see
[14:27:45] cmjohnson1: I can do that, but given it requires a mediawiki deploy, it is late for me and I currently have no backup
[14:28:02] I would hope we could delay it till monday or tuesday
[14:28:47] sure...we can delay it
[14:28:52] no worries
[14:28:54] I can deploy it fast, and it probably won't take you long to reset it
[14:29:01] but repooling it is very slow
[14:29:05] no...really, I need about 2 mins
[14:29:07] so I do not want to do it so late
[14:29:15] repooling takes 2+ hours
[14:29:18] on my side
[14:29:22] ouch...no worries, let's do it next week
[14:29:32] so let me invite you, monday, tuesday?
[14:31:11] sent an invite to you and marostegui
[14:36:26] Tuesday works better
[14:36:29] I see the invite
[15:18:43] 10DBA, 10Operations, 10decommission, 10ops-eqiad: Decommission db1051 - https://phabricator.wikimedia.org/T195484 (10Cmjohnson)
[18:06:00] 10DBA: Tracking: Cleanup x1 database connection patterns - https://phabricator.wikimedia.org/T164504 (10aaron)
[18:07:31] 10DBA: Tracking: Cleanup x1 database connection patterns - https://phabricator.wikimedia.org/T164504 (10aaron)
[18:41:26] 10DBA, 10Data-Services, 10User-Urbanecm, 10cloud-services-team (Kanban): Prepare and check storage layer for wikimaniawiki - https://phabricator.wikimedia.org/T201001 (10bd808)
[18:52:58] 10DBA, 10Data-Services, 10Chinese-Sites, 10User-Urbanecm, 10cloud-services-team (Kanban): Prepare and check storage layer for zhwikiversity - https://phabricator.wikimedia.org/T199599 (10bd808) I think the failure I'm seeing is the same upstream wildcard grant bug that we have seen before. The maintain-v...
[19:21:38] 10DBA, 10Data-Services, 10Chinese-Sites, 10User-Urbanecm, 10cloud-services-team (Kanban): Prepare and check storage layer for zhwikiversity - https://phabricator.wikimedia.org/T199599 (10Bstorm) The bug is weird. The grant is already there if @Marostegui ran it, so the script doesn't need to do it. It...
[19:25:36] 10DBA, 10Data-Services, 10Chinese-Sites, 10User-Urbanecm, 10cloud-services-team (Kanban): Prepare and check storage layer for zhwikiversity - https://phabricator.wikimedia.org/T199599 (10Bstorm) Yup, that did it. I created the DB manually, and it ran fine.
[19:51:22] 10DBA, 10Data-Services, 10User-Urbanecm, 10cloud-services-team (Kanban): Prepare and check storage layer for wikimaniawiki - https://phabricator.wikimedia.org/T201001 (10Bstorm) It will still fail on that if we don't also create the DB manually, unfortunately. I can get that in a bit.
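The workaround Bstorm describes above is pre-creating the wiki's database on the replica hosts so the provisioning script does not trip over the wildcard-grant bug. A minimal sketch for the wikimaniawiki case, assuming the database in question is the public `wikimaniawiki_p` one and that the statement is run on each wiki-replica host (the exact commands used are not in the log):

```
# Hypothetical sketch: pre-create the public replica database for the new
# wiki on each wiki-replica host, so the provisioning script can proceed
# despite the wildcard-grant bug discussed in the task above.
for i in labsdb1009 labsdb1010 labsdb1011; do
    echo $i
    mysql.py -h$i -e "CREATE DATABASE IF NOT EXISTS wikimaniawiki_p"
done
```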
[19:52:57] 10DBA, 10Data-Services, 10User-Urbanecm, 10cloud-services-team (Kanban): Prepare and check storage layer for wikimaniawiki - https://phabricator.wikimedia.org/T201001 (10Bstorm) I should say it "might fail" since it sometimes might work, as we found :-p
[20:45:02] 10DBA, 10Data-Services, 10Chinese-Sites, 10User-Urbanecm, 10cloud-services-team (Kanban): Prepare and check storage layer for zhwikiversity - https://phabricator.wikimedia.org/T199599 (10Bstorm) 2018-08-03T19:51:57Z mwopenstackclients.DnsManager WARNING : Creating id_internalwikimedia.analytics.db.svc.eq...
[20:46:29] 10DBA, 10Data-Services, 10Chinese-Sites, 10User-Urbanecm, 10cloud-services-team (Kanban): Prepare and check storage layer for zhwikiversity - https://phabricator.wikimedia.org/T199599 (10Bstorm) 05Open>03Resolved a:03Bstorm meta_p thing done as well.
[21:06:00] 10DBA, 10Data-Services, 10User-Urbanecm, 10cloud-services-team (Kanban): Prepare and check storage layer for wikimaniawiki - https://phabricator.wikimedia.org/T201001 (10Bstorm) 05Open>03Resolved a:03Bstorm Scripts run and _p db created.
[21:11:22] 10DBA, 10Operations, 10monitoring: HAproxy on dbproxy hosts lack enough logging - https://phabricator.wikimedia.org/T201021 (10herron) p:05Triage>03Normal
[22:37:29] 10DBA, 10Data-Services, 10User-Urbanecm, 10cloud-services-team (Kanban): Prepare and check storage layer for satwiki - https://phabricator.wikimedia.org/T198401 (10Bstorm) Created the _p database for this. @bd808 if you are very bored, you can finish it now (DNS should already be set). I can get it mysel...
[22:47:31] 10DBA, 10Data-Services, 10User-Urbanecm, 10cloud-services-team (Kanban): Prepare and check storage layer for satwiki - https://phabricator.wikimedia.org/T198401 (10bd808) 05Open>03Resolved a:03bd808 ```lines=10 $ sql satwiki Reading table information for completion of table and column names You can t...
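The closing paste (`$ sql satwiki ...`) is truncated by the bot; the final verification essentially amounts to connecting to the new wiki through the replica hosts and confirming it is registered in meta_p. A minimal sketch, assuming the usual meta_p.wiki table layout on the replicas; the actual queries run for satwiki are not shown:

```
# Hypothetical sketch: confirm the new wiki is registered in meta_p on each
# wiki-replica host (table and column names are assumed from the usual
# wiki-replica layout, not taken from this log).
for i in labsdb1009 labsdb1010 labsdb1011; do
    echo $i
    mysql.py -h$i -e "SELECT dbname, url FROM meta_p.wiki WHERE dbname='satwiki'"
done
```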