[00:16:02] DBA, Operations: mariadb puppet module doesn't start mysql service in labs (possibly anywhere) - https://phabricator.wikimedia.org/T91797#2836846 (fgiunchedi) I don't think this applies anymore but moving on to #DBA's radar for confirmation
[01:15:48] DBA, Wikidata, Performance: DispatchChanges: Avoid long-lasting connections to the master DB - https://phabricator.wikimedia.org/T151681#2836998 (hoo) >>! In T151681#2836131, @jcrespo wrote: > Another example of why long running connections are a problem: I am depooling es1017 for important maintenan...
[06:23:51] DBA, Datasets-General-or-Unknown, Labs, Labs-Infrastructure, Patch-For-Review: Provision db1095 with at least 1 shard, sanitize and test slave-side triggers - https://phabricator.wikimedia.org/T150802#2837353 (Marostegui) >>! In T150802#2836731, @jcrespo wrote: > I wanted to sanitize this for...
[06:26:33] DBA, Datasets-General-or-Unknown, Labs, Labs-Infrastructure, Patch-For-Review: Provision db1095 with at least 1 shard, sanitize and test slave-side triggers - https://phabricator.wikimedia.org/T150802#2837356 (Marostegui) I think we are ready to sanitize s3 now after dropping all the non priv...
[06:31:13] DBA, Operations: mariadb puppet module doesn't start mysql service in labs (possibly anywhere) - https://phabricator.wikimedia.org/T91797#1096483 (Marostegui) This is quite old indeed and we do not start MySQL everywhere (apart from labs) on purpose. We do not really want Puppet to handle the MySQL servi...
[06:42:07] DBA, Datasets-General-or-Unknown, Labs, Labs-Infrastructure, Patch-For-Review: Provision db1095 with at least 1 shard, sanitize and test slave-side triggers - https://phabricator.wikimedia.org/T150802#2837381 (Marostegui) >>! In T150802#2836731, @jcrespo wrote: > I wanted to sanitize this for...
[06:54:18] DBA, Wikidata, Performance: DispatchChanges: Avoid long-lasting connections to the master DB - https://phabricator.wikimedia.org/T151681#2837404 (jcrespo) > Hm, these are both job runners, jobs (probably) shouldn't run for so long. I wonder what's causing this. Separate issue then, but heads up for it.
[06:54:20] DBA: Wikidatawiki revision table needs unification - https://phabricator.wikimedia.org/T150644#2837405 (Marostegui) Running on dbstore2001
[07:03:14] DBA, Datasets-General-or-Unknown, Labs, Labs-Infrastructure, Patch-For-Review: Provision db1095 with at least 1 shard, sanitize and test slave-side triggers - https://phabricator.wikimedia.org/T150802#2837422 (jcrespo) > Shall I run the local redact_sanitarium.sh instead of the one we used to...
[07:03:50] ^ I will run the local copy then :)
[07:04:01] Sorry for the confusion about frwikivoyage by the way
[07:04:22] no, no confusion
[07:14:01] I intended not to upgrade most of eqiad servers
[07:14:14] so only restart?
[07:14:23] but some are so old, they do not have the right ssl compiled options
[07:14:35] oh
[07:14:40] so I can upgrade to .22 or .23
[07:14:48] or all the way to .28
[07:15:21] if we upgrade to .28 we will have a mismatch between eqiad and codfw right?
[07:15:43] no, I upgraded codfw to .28
[07:15:56] ah
[07:16:14] Then it is probably better to upgrade to .28, but you know I am sometimes too careful
[07:16:23] we have a 3rd copy
[07:16:30] on es200X
[07:16:39] so I think it should be safe
[07:16:45] once we have to upgrade
[07:16:59] I think we can stay in 22 for most of them
[07:17:15] to be honest, I would only upgrade to .28 those that really needed
[07:17:24] sorry if I am slowing you down here being too careful
[07:17:33] well, I already upgraded 1/3 of those
[07:17:47] thinking I could not upgrade the others
[07:18:11] but I have to do some upgrade on 3 of them or I will not be able to enable TLS
[07:18:27] then it is clear
[07:18:33] .16 does not have SSL support
[07:18:37] openSSL
[07:18:51] we still have .16 on some of them? :o
[07:18:57] yes
[07:18:57] that is old indeed haha
[07:19:16] so the question is if to upgrade to .22, .23 or .28 on those
[07:19:56] Probably .28 in order to avoid to work twice (and do the restart twice)
[07:20:01] don't you think?
[07:20:11] is there any drawback in going .16 -> .28
[07:20:13] ?
[07:20:19] I was asking for a second opinion
[07:20:39] I think it is ok for these 3 servers because we have backup not upgraded
[07:20:53] it should be fine, we have gone from .22 to .28 I believe
[07:20:59] in some servers
[07:21:16] the backup is on .22
[07:21:22] and is fully offline
[07:21:31] (no replication)
[07:21:32] then let's go for 28
[07:24:02] I am doing apt install wmf-mariadb10=10.0.28-1, with no other upgrade
[07:24:35] probably I should upgrade openssl, too
[07:25:06] so you are leaving kernel and all that stuff aside?
[07:25:31] yes, I was supposed to only restart
[07:25:35] mysql
[07:25:40] yeah, that sounds good to me :)
[07:29:34] DBA, Datasets-General-or-Unknown, Labs, Labs-Infrastructure, Patch-For-Review: Provision db1095 with at least 1 shard, sanitize and test slave-side triggers - https://phabricator.wikimedia.org/T150802#2837454 (Marostegui) I have started to sanitize s3 using the local script in a local screen...
[07:33:38] DBA, Patch-For-Review: Fix PK on S5 dewiki.revision - https://phabricator.wikimedia.org/T148967#2837456 (Marostegui) Alter running on db1070
[08:28:25] DBA, Patch-For-Review: Moving backup and otrs role into their own .pp - https://phabricator.wikimedia.org/T150851#2837570 (Marostegui) Open>Resolved This has been deployed. Running puppet agent in dbstore1001 and es2001 went fine so this can be closed.
[08:28:27] DBA, Epic: Decouple roles from mariadb.pp into their own file - https://phabricator.wikimedia.org/T150850#2837572 (Marostegui)
[08:28:32] I created https://phabricator.wikimedia.org/T152080
[08:29:56] I read: page and got scared about the duplicate entries we had a week ago in the page table, but then I kept reading and..pheew
[08:30:06] ha ha
[08:35:29] DBA, Epic: Moving eventloggin role into its own .pp - https://phabricator.wikimedia.org/T152081#2837585 (Marostegui)
[08:35:43] DBA: Moving eventlogging role into its own .pp - https://phabricator.wikimedia.org/T152081#2837585 (Marostegui)
[09:00:08] DBA, Patch-For-Review: Deploy gtid_domain_id flag in our mysql hosts - https://phabricator.wikimedia.org/T149418#2837638 (Marostegui) I am going to start rolling this out in m3. dbstore servers do not use GTID so it should be perfectly safe to deploy it there too. ``` root@neodymium:/home/marostegui/git/...
[09:08:26] DBA, Datasets-General-or-Unknown, Labs, Labs-Infrastructure, Patch-For-Review: Provision db1095 with at least 1 shard, sanitize and test slave-side triggers - https://phabricator.wikimedia.org/T150802#2837675 (Marostegui) The script has finished. It took around 1:35h to finish. I am going to...
[12:04:58] DBA, Patch-For-Review: db2034: investigate its crash and reimage - https://phabricator.wikimedia.org/T149553#2837930 (Marostegui) I have started MySQL and replication on db2048 so it can catch up from yesterday. @Papaul ping me before doing the DIMM changes so I can turn it off.
[13:52:47] DBA, Patch-For-Review: Fix PK on S5 dewiki.revision - https://phabricator.wikimedia.org/T148967#2838169 (Marostegui) db1070 is done ``` MariaDB PRODUCTION s5 localhost dewiki > show create table revision\G *************************** 1. row *************************** Table: revision Create Table...
[14:02:46] DBA, Datasets-General-or-Unknown, Labs, Labs-Infrastructure, Patch-For-Review: Provision db1095 with at least 1 shard, sanitize and test slave-side triggers - https://phabricator.wikimedia.org/T150802#2838183 (Marostegui) The data has been sanitized correctly and I have started replication in...
[15:50:15] DBA: Wikidatawiki revision table needs unification - https://phabricator.wikimedia.org/T150644#2838449 (Marostegui) dbstore2001 is done ``` MariaDB DBSTORE localhost wikidatawiki > show create table revision\G *************************** 1. row *************************** Table: revision Create Table...
[16:05:59] DBA, Patch-For-Review: db2034: investigate its crash and reimage - https://phabricator.wikimedia.org/T149553#2838501 (Marostegui) After the memory swap I have started the transfer between db2048 and db2034.
[16:08:37] DBA, Datasets-General-or-Unknown, Labs, Labs-Infrastructure, Patch-For-Review: Provision db1095 with at least 1 shard, sanitize and test slave-side triggers - https://phabricator.wikimedia.org/T150802#2838504 (Marostegui) The server caught up and the data is being sanitized as it comes in, so...
[16:20:48] DBA, Operations, ops-codfw: Degraded RAID on db2041 - https://phabricator.wikimedia.org/T152105#2838536 (Volans)
[16:22:15] DBA, Operations, ops-codfw: db2041: Disk RAID predictive failure - https://phabricator.wikimedia.org/T151203#2838538 (Papaul) a:Papaul>Marostegui Disk replacement complete.
[16:22:19] DBA, Operations, ops-codfw: db2041: Disk RAID predictive failure - https://phabricator.wikimedia.org/T151203#2838540 (Marostegui) The disk failed in the end: https://phabricator.wikimedia.org/T152105
[16:22:29] oh, interesting
[16:22:46] DBA, Operations, ops-codfw: Degraded RAID on db2041 - https://phabricator.wikimedia.org/T152105#2838512 (Volans)
[16:22:48] DBA, Operations, ops-codfw: db2041: Disk RAID predictive failure - https://phabricator.wikimedia.org/T151203#2838545 (Volans)
[16:23:23] DBA, Operations, ops-codfw: db2041: Disk RAID predictive failure - https://phabricator.wikimedia.org/T151203#2838546 (Marostegui) The disk is now rebuilding: ``` logicaldrive 1 (3.3 TB, RAID 1+0, Recovering, 3% complete) physicaldrive 1I:1:1 (port 1I:box 1:bay 1, SAS, 600 GB, OK)...
[16:24:10] DBA, Operations, ops-codfw: Degraded RAID on db2041 - https://phabricator.wikimedia.org/T152105#2838547 (Marostegui) The disk is now rebuilding: ``` logicaldrive 1 (3.3 TB, RAID 1+0, Recovering, 3% complete) physicaldrive 1I:1:1 (port 1I:box 1:bay 1, SAS, 600 GB, OK) physicaldrive...
[16:24:14] DBA, Operations, ops-codfw: db2041: Disk RAID predictive failure - https://phabricator.wikimedia.org/T151203#2838548 (Volans) duplicate>Open
[16:24:43] DBA, Operations, ops-codfw: db2041: Disk RAID predictive failure - https://phabricator.wikimedia.org/T151203#2810386 (Volans)
[16:24:45] DBA, Operations, ops-codfw: Degraded RAID on db2041 - https://phabricator.wikimedia.org/T152105#2838553 (Volans)
[16:27:01] compression finishes, should we close T150802 and open another for labs copy?
[16:27:01] T150802: Provision db1095 with at least 1 shard, sanitize and test slave-side triggers - https://phabricator.wikimedia.org/T150802
[16:27:13] jynus: yeah, agreed
[16:27:20] I am asking :-)
[16:27:20] I will close it once it is over
[16:27:25] yes, i think so
[16:27:34] otherwise it will get too messy
[16:27:46] with labs copy + accounts, provisioning done?
[16:27:59] we still have this: https://phabricator.wikimedia.org/T147052
[16:28:28] yes, we can use the master one, too
[16:28:59] T149418 not a blocker
[16:28:59] T149418: Deploy gtid_domain_id flag in our mysql hosts - https://phabricator.wikimedia.org/T149418
[16:29:14] T151607, the same
[16:29:15] T151607: Rebuild old timestamp format tables - https://phabricator.wikimedia.org/T151607
[16:29:29] yep
[16:29:38] I will work tomorrow on the haproxy
[16:29:53] ok I think tomorrow I might be able to start copying data to the new labs boxes
[16:29:58] if compression finishes
[16:31:52] db1095 is quite powerful
[16:32:04] unlike dbstore2X, and others
[16:32:27] enwiki compression 3-5 days to 1-2 hours
[16:32:48] for revision, I mean
[16:33:02] revision took around 10 hours I think
[16:33:07] really?
[16:33:11] yeah
[16:33:17] but in the dbstore it took more XD
[16:45:59] DBA, Labs: Prepare and check storage layer for new fi.wikivoyage.org - https://phabricator.wikimedia.org/T151756#2838643 (jcrespo) Open>Resolved a:jcrespo From the above patch, this is resolved.
[17:08:20] 2 repools and I can close T151995 at last
[17:08:20] T151995: Rolling restart of external storage servers for TLS certificate update - https://phabricator.wikimedia.org/T151995
[17:10:10] DBA, Operations, Patch-For-Review: Rolling restart of external storage servers for TLS certificate update - https://phabricator.wikimedia.org/T151995#2838717 (jcrespo) Waiting for es2019 and es2015 to warmup their buffer pools to repool them and I could close this.
[17:10:49] DBA, Operations, Patch-For-Review: Rolling restart of parsercache servers for TLS certificate update - https://phabricator.wikimedia.org/T152029#2838718 (jcrespo) a:jcrespo
[17:11:12] DBA, Operations, ops-codfw: db2042 disk predictive failure - https://phabricator.wikimedia.org/T150974#2803218 (Papaul) Dear Mr Papaul Tshibamba, Thank you for contacting Hewlett Packard Enterprise for your service request. This email confirms your request for service and the details are below. You...
[17:11:30] DBA, Operations, ops-codfw: db2042 disk predictive failure - https://phabricator.wikimedia.org/T150974#2838723 (Papaul) a:Papaul
[17:46:13] DBA, Operations, ops-codfw: Degraded RAID on db2068 - https://phabricator.wikimedia.org/T151763#2838814 (Papaul) a:Papaul
[17:55:37] DBA, Operations: mariadb puppet module doesn't start mysql service in labs (possibly anywhere) - https://phabricator.wikimedia.org/T91797#2838847 (fgiunchedi) Open>Invalid Thanks @Marostegui, tentatively resolving
[17:56:16] DBA, Operations, ops-codfw: Degraded RAID on db2068 - https://phabricator.wikimedia.org/T151763#2838851 (Papaul) Dear Mr Papaul Tshibamba, Thank you for contacting Hewlett Packard Enterprise for your service request. This email confirms your request for service and the details are below. Your reque...
[17:58:55] DBA, Operations: mariadb puppet module doesn't start mysql service in labs (possibly anywhere) - https://phabricator.wikimedia.org/T91797#2838853 (jcrespo) Invalid>Resolved In fact Mariadb can start automatically for non production hosts (right now beta, dns-labs, and analytics-labs), so this is...
[18:26:11] DBA, Operations: mariadb puppet module doesn't start mysql service in labs (possibly anywhere) - https://phabricator.wikimedia.org/T91797#1096483 (Krenair) >>! In T91797#2838853, @jcrespo wrote: > In fact Mariadb can start automatically for non production hosts (right now beta, dns-labs, and analytics-la...
[19:04:29] DBA, Operations, Patch-For-Review: Set up TLS for MariaDB replication - https://phabricator.wikimedia.org/T111654#2839211 (jcrespo)
[19:04:32] DBA, Operations, Patch-For-Review: Rolling restart of external storage servers for TLS certificate update - https://phabricator.wikimedia.org/T151995#2839210 (jcrespo) Open>Resolved
[20:03:40] jynus: thoughts on https://phabricator.wikimedia.org/T106386 ? still "high" ?
[20:06:00] DBA, Operations, Patch-For-Review: Set up TLS for MariaDB replication - https://phabricator.wikimedia.org/T111654#2839471 (jcrespo) Out of 157 active hosts responding to salt, 15 host with no TLS deployed, 42 with the old certificate, 100 with the puppet one: ``` $ sudo salt -C 'G@cluster:mysql' cmd...
[20:06:20] that is not really an operations task
[20:06:41] it is a mediawiki-database with support from operations
[20:07:06] but if I was to triage it, I would leave it as low or normal
[20:15:32] DBA, Operations, Patch-For-Review: Set up TLS for MariaDB replication - https://phabricator.wikimedia.org/T111654#2839512 (jcrespo) List of eqiad hosts with the old cert: ``` db1015.eqiad.wmnet db1021.eqiad.wmnet db1022.eqiad.wmnet db1036.eqiad.wmnet db1054.eqiad.wmnet db1060.eqiad.wmnet db1063.eqi...
[21:17:29] DBA, Collaboration-Team-Triage, Flow, Patch-For-Review, and 2 others: Drop flow_subscription table - https://phabricator.wikimedia.org/T149936#2839814 (Catrope)
[21:17:30] DBA, Epic, Tracking: Database tables to be dropped on Wikimedia wikis and other WMF databases (tracking) - https://phabricator.wikimedia.org/T54921#2839815 (Catrope)
[21:17:52] DBA, Collaboration-Team-Triage, Flow, Patch-For-Review, and 2 others: Drop flow_subscription table - https://phabricator.wikimedia.org/T149936#2769308 (Catrope) >>! In T149936#2775616, @jcrespo wrote: > Looks good, waiting on code deployment for production deploy The code is deployed now. > - t...
[21:20:17] Blocked-on-schema-change, Collaboration-Team-Triage, Notifications, Patch-For-Review, and 2 others: Add primary key to echo_notification table - https://phabricator.wikimedia.org/T136428#2839834 (Catrope) The code is now deployed, and this is ready to go.
[21:21:31] Blocked-on-schema-change, DBA, Collaboration-Team-Triage, Flow, and 3 others: Add primary keys to remaining Flow tables - https://phabricator.wikimedia.org/T149819#2839836 (Catrope) The code is now deployed, and this is ready to go.
[23:43:25] DBA, MediaWiki-General-or-Unknown, Operations, Patch-For-Review: img_metadata queries for PDF files saturates s4 slaves - https://phabricator.wikimedia.org/T147296#2687819 (fgiunchedi) I don't recall seeing this issue on s4 since https://gerrit.wikimedia.org/r/314229 landed, still an issue or we...