[00:10:20] 10DBA, 10Availability (MediaWiki-MultiDC): Tracking: Cleanup x1 database connection patterns - https://phabricator.wikimedia.org/T164504 (10aaron) Are there any tasks here that remain and are blockers to multi-DC?
[04:56:56] 10DBA, 10Operations, 10ops-codfw: db2061 disk with predictive failure - https://phabricator.wikimedia.org/T200059 (10Marostegui) All good! ``` root@db2061:~# hpssacli controller all show config Smart Array P420i in Slot 0 (Embedded) (sn: 0014380337F3720) Port Name: 1I Port Name: 2I Gen8 Serv...
[04:57:04] 10DBA, 10Operations, 10ops-codfw: db2061 disk with predictive failure - https://phabricator.wikimedia.org/T200059 (10Marostegui) 05Open>03Resolved
[05:15:17] 10DBA, 10Operations, 10ops-eqiad: es1019 mgmt interface DOWN - https://phabricator.wikimedia.org/T201132 (10Marostegui)
[05:15:24] 10DBA, 10Operations, 10ops-eqiad: es1019 mgmt interface DOWN - https://phabricator.wikimedia.org/T201132 (10Marostegui) p:05Triage>03Normal
[05:22:28] 10DBA, 10Operations, 10ops-eqiad: db1069 (x1 master) memory errors - https://phabricator.wikimedia.org/T201133 (10Marostegui) 05Open>03stalled p:05Triage>03Normal
[07:22:12] 10DBA, 10Availability (MediaWiki-MultiDC): Tracking: Cleanup x1 database connection patterns - https://phabricator.wikimedia.org/T164504 (10jcrespo) The only thing I can think of is T164407 and T164505
[07:34:33] 10DBA, 10Collaboration-Team-Triage, 10Growth-Team, 10MediaWiki-extensions-PageCuration, 10Schema-change: Drop ptrl_comment in production - https://phabricator.wikimedia.org/T157762 (10Marostegui) deleted from testwiki and test2wiki on all eqiad hosts (codfw was done directly on the master) [x] labsdb101...
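The db2061 paste above only shows the controller-level view, and the IRC bot truncates it. For a predictive-failure case like the one the task tracks, the usual follow-up is a per-drive status query; a minimal sketch below, assuming slot 0 as shown in the paste (the exact command run on db2061 is not in the log):

```
# Hypothetical sketch: list the status of every physical drive behind the
# Smart Array controller in slot 0 (the slot shown in the paste above).
# After the disk swap, the replaced drive should report "OK" rather than
# "Predictive Failure".
hpssacli controller slot=0 physicaldrive all show status
```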
[07:34:43] 10DBA, 10Collaboration-Team-Triage, 10Growth-Team, 10MediaWiki-extensions-PageCuration, 10Schema-change: Drop ptrl_comment in production - https://phabricator.wikimedia.org/T157762 (10Marostegui)
[07:34:51] 10DBA, 10Collaboration-Team-Triage, 10Growth-Team, 10MediaWiki-extensions-PageCuration, 10Schema-change: Drop ptrl_comment in production - https://phabricator.wikimedia.org/T157762 (10Marostegui) 05Open>03Resolved a:03Marostegui
[08:19:53] 10Blocked-on-schema-change, 10DBA, 10Patch-For-Review, 10Schema-change: Truncate SHA-1 indexes - https://phabricator.wikimedia.org/T51190 (10Marostegui) s3 eqiad progress [] labsdb1011 [] labsdb1010 [] labsdb1009 [x] dbstore1002 [] db1124 [] db1123 [] db1078 [] db1077 [] db1075
[08:19:57] 10DBA, 10Patch-For-Review, 10Schema-change: Convert UNIQUE INDEX to PK in Production - https://phabricator.wikimedia.org/T199368 (10Marostegui) s3 eqiad progress [] labsdb1011 [] labsdb1010 [] labsdb1009 [x] dbstore1002 [] db1124 [] db1123 [] db1078 [] db1077 [] db1075
[08:20:00] 10Blocked-on-schema-change, 10DBA, 10Wikidata, 10Patch-For-Review, 10Schema-change: Drop eu_touched in production - https://phabricator.wikimedia.org/T144010 (10Marostegui) s3 eqiad progress [] labsdb1011 [] labsdb1010 [] labsdb1009 [x] dbstore1002 [] db1124 [] db1123 [] db1078 [] db1077 [] db1075
[09:14:37] 10DBA, 10Data-Services, 10User-Urbanecm, 10cloud-services-team (Kanban): Prepare and check storage layer for satwiki - https://phabricator.wikimedia.org/T198401 (10Marostegui) I have granted access to that, try re-running the script: ``` for i in labsdb1009 labsdb1010 labsdb1011; do echo $i; mysql.py -h$i...
[09:14:58] 10DBA, 10Data-Services, 10Chinese-Sites, 10User-Urbanecm, 10cloud-services-team (Kanban): Prepare and check storage layer for zhwikiversity - https://phabricator.wikimedia.org/T199599 (10Marostegui) I assume you meant zhwikiversity, I have granted that on all hosts. Try re-running the script. ``` for i i...
[09:15:38] 10DBA, 10Data-Services, 10User-Urbanecm: Prepare and check storage layer for wikimaniawiki - https://phabricator.wikimedia.org/T201001 (10Marostegui) I have added the grants to labsdbuser so the script won't fail with access denied: ``` for i in labsdb1009 labsdb1010 labsdb1011; do echo $i; mysql.py -h$i -e...
[09:22:32] so I am thinking of deploying the read only check on codfw and testing there- even if I have to revert afterwards
[09:23:03] can you think of any multiinstance that would be read-write?
[09:26:21] yeah
[09:26:23] sanitarium
[09:26:35] we could do it for db2094 or db2095
[09:26:36] sanitarium is r-w?
[09:26:39] one of the instances
[09:26:45] No, it is not, but we could set it up
[09:26:53] I mean, nothing uses it
[09:26:57] ok, what I mean
[09:27:10] is: is there a reason to not suppose they all should be configured read-only by default
[09:27:17] aka not parametrized
[09:27:34] e.g. for core, I will parametrize it based on whether it is the current master
[09:27:45] tests aside
[09:27:54] (which we can also do)
[09:28:03] Yeah, for codfw it should all be read-only until we flip over to codfw as active
[09:28:13] not for eqiad?
[09:28:27] I cannot think of any situation where an instance in codfw should be rw
[09:28:28] aka could we have a multiinstance as master
[09:28:36] or some other weird reason?
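The loops in the Phabricator comments above are cut off by the IRC bot: the pattern is simply iterating over the three wiki-replica hosts and running a verification query through the mysql.py wrapper. A minimal sketch of that pattern, assuming SHOW GRANTS as the verification query (the actual query is not visible in the truncated paste):

```
# Hypothetical sketch: check on each wiki-replica host what grants the
# labsdbuser account currently has, to confirm the new wiki is covered.
# The exact query used in the task is truncated in the log; SHOW GRANTS
# here is only an illustration.
for i in labsdb1009 labsdb1010 labsdb1011; do
    echo $i
    mysql.py -h$i -e "SHOW GRANTS FOR labsdbuser"
done
```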
[09:28:46] we could, but I don't think we are planning for that in the near future
[09:28:51] So I would assume by default read-only
[09:28:57] I mean, it is not set in stone
[09:29:02] I know
[09:29:13] I can change it, I am mostly asking if that supposition seems ok right now
[09:29:28] Yeah, I think by default all multi-instance are and should be RO
[09:29:36] (and in an emergency, like the other day, we can always ignore monitoring)
[09:30:49] yeah
[09:45:22] I am testing the normal behaviour on dbstore2001
[09:48:33] great
[09:51:11] I think it didn't work, grant issues
[09:51:41] is it using the nagios user?
[09:51:59] it should, checking what it is actually trying to do
[10:38:20] so apparently it sends a critical if there is replication lag, which is not intended
[11:06:51] maybe it is worth cutting off all the output, so getting rid of: Version 10.1.33-MariaDB, Uptime 6815001s, 5 mysqld process(es), read_only: True, s1 lag: 0.01s, 14 client(s), 110.17 QPS, connection latency: 0.005431s, query latency
[11:06:55] and all that
[11:07:03] it is clearer without it, no?
[11:09:58] sure
[11:10:10] not yet there, however- see this issue
[11:10:45] https://gerrit.wikimedia.org/r/450211
[11:11:44] I don't get it
[11:11:57] So if it appears twice, what fails?
[11:12:10] the parsing of the configuration
[11:12:30] I can make it more robust, but I think it is also a config error
[11:12:51] Yeah, definitely
[11:12:58] I didn't get what was failing
[11:13:00] Interesting....
[11:13:26] so both things need fixing, but this is easier to do first
[11:13:32] Totally
[11:14:21] I have only deployed the check on 4 hosts so far, so I am not attending to minor formatting issues yet
[11:14:59] Sure
[11:15:05] the script was supposed to check many things, so there are lots of things pending
[11:15:37] for now I want to reliably check the read only
[11:15:41] sure
[11:15:49] will improve the rest a bit later
[11:15:49] It was a suggestion to make it clearer
[11:16:04] which is ok, it's just that I am not yet there :-)
[11:16:14] I know! :)
[11:17:51] note that when it fails, it puts the reason for the failure in front, so I don't disagree with removing it, just that it is a minor issue right now
[11:18:46] Yeah, I know. For you and me it is obvious to read it, but to some untrained eyes it might not be that easy with all the noise
[11:24:40] 10Blocked-on-schema-change, 10DBA, 10Patch-For-Review, 10Schema-change: Truncate SHA-1 indexes - https://phabricator.wikimedia.org/T51190 (10Marostegui)
[11:24:50] 10Blocked-on-schema-change, 10DBA, 10Wikidata, 10Patch-For-Review, 10Schema-change: Drop eu_touched in production - https://phabricator.wikimedia.org/T144010 (10Marostegui)
[11:25:03] 10DBA, 10Patch-For-Review, 10Schema-change: Convert UNIQUE INDEX to PK in Production - https://phabricator.wikimedia.org/T199368 (10Marostegui)
[11:26:33] dbstore1001 is bad now
[11:26:43] as in...?
[11:26:50] read write
[11:34:06] can you help me check/change all multiinstance to be using unix_socket authentication for nagios?
[11:34:22] for example, I think most sanitariums are still using a password
[11:34:46] Sure, I will take care of the sanitariums
[11:35:04] I changed some already
[11:35:18] but I am deploying the check everywhere except core_mi
[11:37:07] Sanitariums look good
[11:37:11] both eqiad and codfw
[11:37:17] really?
[11:37:23] I changed 1 (all instances)
[11:37:27] db2094
[11:37:39] they all have unix_socket for root
[11:37:45] no, not for root
[11:37:50] those are everywhere
[11:37:52] for nagios
[11:38:00] oh crap!
[11:38:01] sorry!
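For the unix_socket migration being discussed here, the per-instance change amounts to switching the local nagios account from password authentication to MariaDB's unix_socket plugin. A minimal sketch, assuming the plugin is already loaded, that the account is 'nagios'@'localhost', and an example socket path; the exact grants applied by puppet are not shown in the log:

```
# Hypothetical sketch: switch the local nagios account on one multi-instance
# socket to unix_socket authentication (socket path is an example).
# GRANT ... IDENTIFIED VIA is valid on MariaDB 10.1, which these hosts run.
mysql -S /run/mysqld/mysqld.s1.sock \
      -e "GRANT USAGE ON *.* TO 'nagios'@'localhost' IDENTIFIED VIA unix_socket"
```

After that, `sudo -u nagios mysql -S /run/mysqld/mysqld.s1.sock -e "SELECT 1"` should connect without a password, which is presumably what the check relies on.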
[11:38:04] :)
[11:38:28] it could have happened that only 1 host was affected, but I was very suspicious about that :-)
[11:39:01] it is about checking db1124 and db2094
[11:39:40] yeah I am doing it now
[11:39:54] I will check the cores
[11:40:22] db1125 is all wrong
[11:40:38] all == all instances?
[11:40:44] yeah
[11:40:48] note we never really did a proper migration
[11:40:56] this could be a good excuse
[11:41:01] db1124 also wrong
[11:41:04] codfw looks good
[11:41:19] (although I think I did it on many single instance core hosts)
[11:42:24] core seems ok, probably because it was done there on upgrade
[11:42:34] I am going to deploy on a single codfw core hosts
[11:42:38] *host
[11:43:16] ok db1124 and db1125 fixed
[11:43:34] nah, there are some core hosts with that issue, too
[11:44:50] please help me keep an eye on pending and ongoing criticals on icinga
[11:45:27] yeah
[11:45:29] I deployed on the first core host, db2084
[11:45:45] I am checking irc :)
[11:46:22] you can also check manually if needed
[11:46:23] with
[11:46:32] I am going thru all the sections
[11:46:55] sudo -u nagios check_mariadb.py --port=3315 --icinga --process --check_read_only=1
[11:47:18] remove --port if it is 3306, or set it accordingly
[11:47:33] No, what I meant is I am checking all the nagios users across the board
[11:48:11] oh, thanks
[11:49:09] I am fixing the ones without the socket enabled
[11:49:44] I am leaving the test hosts aside: db1095, db1102 and db1118
[11:55:14] I am deploying on core codfw
[11:55:20] which I am happy about
[11:55:20] Cool
[11:55:29] I am fixing s5 now, s1, s2, s3, s4 already done
[12:00:25] db2078 may need an actual read only fix
[12:00:45] ok, I have done all cores + misc
[12:00:54] it is fixed in the config, though
[12:01:19] m5 is in read write
[12:01:31] Fixing == fixing the nagios user XD
[12:02:38] fixed db2078:3325 read only
[12:05:04] so I am going to do a single eqiad core instance
[12:05:44] db1090
[12:07:29] ok!
[12:08:37] (note single-instance hosts are not part of this deployment)
[12:09:28] also, did I tell you we no longer have core hosts with jessie/10.0 (except masters)
[12:10:00] And they will be gone soon!!!
[12:10:04] Great job! :)
[12:15:13] db1090 looks fine, deploying everywhere (8 hosts left)
[12:15:31] goood
[12:17:35] that should be it, I will leave deployment on single instance hosts for later
[12:17:40] next week
[12:17:53] sounds good
[12:17:54] and setting critical at the same time as the other critical fixes
[12:17:55] this is great!
[12:17:56] :)
[12:23:38] looking good: https://icinga.wikimedia.org/cgi-bin/icinga/status.cgi?search_string=MariaDB+read+only
[12:24:14] :)
[14:06:03] jynus can I power off dbproxy1006 for 5 mins
[14:06:46] I am slightly confused about what's happening here and need to look at the idrac settings
[14:09:45] one sec, cmjohnson1
[14:09:54] I think it is not active, but let me check
[14:12:02] ok, it is not, let me disable the checks for 1 second
[14:19:38] thanks jynus, all fixed
[14:20:00] I still do not know how backup1001 was misconfigured...but it's all correct now
[14:21:36] can I ask what is the thing pending, as I said before- reloading nagios because it has a new ip, or something else? Aka what did you do?
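The "fixed db2078:3325 read only" step above is the manual counterpart of what the new check alerts on: the m5 instance mentioned just before it had been left read-write. A minimal sketch of verifying and flipping the flag on that instance, assuming the per-section socket naming used on multi-instance hosts (the exact path is not in the log):

```
# Hypothetical sketch: check and fix the read_only flag on the m5 instance
# of a multi-instance host (socket path is an assumed example; each instance
# listens on its own port/socket rather than the default 3306 one).
mysql -S /run/mysqld/mysqld.m5.sock -e "SELECT @@global.read_only"
mysql -S /run/mysqld/mysqld.m5.sock -e "SET GLOBAL read_only = 1"
```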
[14:22:29] I don't need to know except to make everything green on icinga :-)
[14:22:41] oh, it is now green
[14:22:47] so no changes needed
[14:22:50] sorry
[14:24:43] jynus, on backup1001 the idrac had dbproxy1006's mgmt IP address
[14:25:32] so it was set that way....I have to assume I did it but that seems very odd and not something that I think I would've done
[14:25:33] yes, I just thought you had changed the ip
[14:26:00] no, the dns files were correct, it was just wrong on the server
[14:26:07] cool, thank you
[14:26:27] yep, since I have you: es1019 needs to be powered off...looks like a hard reset of the idrac is needed
[14:26:46] ok, let me see
[14:27:45] cmjohnson1: I can do that, but given it requires a mediawiki deploy, it is late for me and I currently have no backup
[14:28:02] I would hope we could delay it till monday or tuesday
[14:28:47] sure...we can delay it
[14:28:52] no worries
[14:28:54] I can deploy it fast, and it probably won't take you long to reset it
[14:29:01] but repooling it is very slow
[14:29:05] no...really, I need about 2 mins
[14:29:07] so I do not want to do it so late
[14:29:15] repooling takes 2+ hours
[14:29:18] on my side
[14:29:22] ouch...no worries, let's do it next week
[14:29:32] so let me invite you, monday, tuesday?
[14:31:11] sent an invite to you and marostegui
[14:36:26] Tuesday works better
[14:36:29] I see the invite
[15:18:43] 10DBA, 10Operations, 10decommission, 10ops-eqiad: Decommission db1051 - https://phabricator.wikimedia.org/T195484 (10Cmjohnson)
[18:06:00] 10DBA: Tracking: Cleanup x1 database connection patterns - https://phabricator.wikimedia.org/T164504 (10aaron)
[18:07:31] 10DBA: Tracking: Cleanup x1 database connection patterns - https://phabricator.wikimedia.org/T164504 (10aaron)
[18:41:26] 10DBA, 10Data-Services, 10User-Urbanecm, 10cloud-services-team (Kanban): Prepare and check storage layer for wikimaniawiki - https://phabricator.wikimedia.org/T201001 (10bd808)
[18:52:58] 10DBA, 10Data-Services, 10Chinese-Sites, 10User-Urbanecm, 10cloud-services-team (Kanban): Prepare and check storage layer for zhwikiversity - https://phabricator.wikimedia.org/T199599 (10bd808) I think the failure I'm seeing is the same upstream wildcard grant bug that we have seen before. The maintain-v...
[19:21:38] 10DBA, 10Data-Services, 10Chinese-Sites, 10User-Urbanecm, 10cloud-services-team (Kanban): Prepare and check storage layer for zhwikiversity - https://phabricator.wikimedia.org/T199599 (10Bstorm) The bug is weird. The grant is already there if @Marostegui ran it, so the script doesn't need to do it. It...
[19:25:36] 10DBA, 10Data-Services, 10Chinese-Sites, 10User-Urbanecm, 10cloud-services-team (Kanban): Prepare and check storage layer for zhwikiversity - https://phabricator.wikimedia.org/T199599 (10Bstorm) Yup, that did it. I created the DB manually, and it ran fine.
[19:51:22] 10DBA, 10Data-Services, 10User-Urbanecm, 10cloud-services-team (Kanban): Prepare and check storage layer for wikimaniawiki - https://phabricator.wikimedia.org/T201001 (10Bstorm) It will still fail on that if we don't also create the DB manually, unfortunately. I can get that in a bit.
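The workaround Bstorm describes above is pre-creating the wiki's database on the replica hosts so the provisioning script does not trip over the wildcard-grant bug. A minimal sketch for the wikimaniawiki case, assuming the database in question is the public `wikimaniawiki_p` one and that the statement is run on each wiki-replica host (the exact commands used are not in the log):

```
# Hypothetical sketch: pre-create the public replica database for the new
# wiki on each wiki-replica host, so the provisioning script can proceed
# despite the wildcard-grant bug discussed in the task above.
for i in labsdb1009 labsdb1010 labsdb1011; do
    echo $i
    mysql.py -h$i -e "CREATE DATABASE IF NOT EXISTS wikimaniawiki_p"
done
```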
[19:52:57] 10DBA, 10Data-Services, 10User-Urbanecm, 10cloud-services-team (Kanban): Prepare and check storage layer for wikimaniawiki - https://phabricator.wikimedia.org/T201001 (10Bstorm) I should say it "might fail" since it sometimes might work, as we found :-p
[20:45:02] 10DBA, 10Data-Services, 10Chinese-Sites, 10User-Urbanecm, 10cloud-services-team (Kanban): Prepare and check storage layer for zhwikiversity - https://phabricator.wikimedia.org/T199599 (10Bstorm) 2018-08-03T19:51:57Z mwopenstackclients.DnsManager WARNING : Creating id_internalwikimedia.analytics.db.svc.eq...
[20:46:29] 10DBA, 10Data-Services, 10Chinese-Sites, 10User-Urbanecm, 10cloud-services-team (Kanban): Prepare and check storage layer for zhwikiversity - https://phabricator.wikimedia.org/T199599 (10Bstorm) 05Open>03Resolved a:03Bstorm meta_p thing done as well.
[21:06:00] 10DBA, 10Data-Services, 10User-Urbanecm, 10cloud-services-team (Kanban): Prepare and check storage layer for wikimaniawiki - https://phabricator.wikimedia.org/T201001 (10Bstorm) 05Open>03Resolved a:03Bstorm Scripts run and _p db created.
[21:11:22] 10DBA, 10Operations, 10monitoring: HAproxy on dbproxy hosts lack enough logging - https://phabricator.wikimedia.org/T201021 (10herron) p:05Triage>03Normal
[22:37:29] 10DBA, 10Data-Services, 10User-Urbanecm, 10cloud-services-team (Kanban): Prepare and check storage layer for satwiki - https://phabricator.wikimedia.org/T198401 (10Bstorm) Created the _p database for this. @bd808 if you are very bored, you can finish it now (DNS should already be set). I can get it mysel...
[22:47:31] 10DBA, 10Data-Services, 10User-Urbanecm, 10cloud-services-team (Kanban): Prepare and check storage layer for satwiki - https://phabricator.wikimedia.org/T198401 (10bd808) 05Open>03Resolved a:03bd808 ```lines=10 $ sql satwiki Reading table information for completion of table and column names You can t...
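The closing paste (`$ sql satwiki ...`) is truncated by the bot; the final verification essentially amounts to connecting to the new wiki through the replica hosts and confirming it is registered in meta_p. A minimal sketch, assuming the usual meta_p.wiki table layout on the replicas; the actual queries run for satwiki are not shown:

```
# Hypothetical sketch: confirm the new wiki is registered in meta_p on each
# wiki-replica host (table and column names are assumed from the usual
# wiki-replica layout, not taken from this log).
for i in labsdb1009 labsdb1010 labsdb1011; do
    echo $i
    mysql.py -h$i -e "SELECT dbname, url FROM meta_p.wiki WHERE dbname='satwiki'"
done
```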