[00:16:02] <wikibugs_>	 DBA, Operations: mariadb puppet module doesn't start mysql service in labs (possibly anywhere) - https://phabricator.wikimedia.org/T91797#2836846 (fgiunchedi) I don't think this applies anymore but moving on to #DBA's radar for confirmation
[01:15:48] <wikibugs_>	 DBA, Wikidata, Performance: DispatchChanges: Avoid long-lasting connections to the master DB - https://phabricator.wikimedia.org/T151681#2836998 (hoo) >>! In T151681#2836131, @jcrespo wrote: > Another example of why long running connections are a problem: I am depooling es1017 for important maintenan...
[06:23:51] <wikibugs>	 DBA, Datasets-General-or-Unknown, Labs, Labs-Infrastructure, Patch-For-Review: Provision db1095 with at least 1 shard, sanitize and test slave-side triggers - https://phabricator.wikimedia.org/T150802#2837353 (Marostegui) >>! In T150802#2836731, @jcrespo wrote: > I wanted to sanitize this for...
[06:26:33] <wikibugs>	 DBA, Datasets-General-or-Unknown, Labs, Labs-Infrastructure, Patch-For-Review: Provision db1095 with at least 1 shard, sanitize and test slave-side triggers - https://phabricator.wikimedia.org/T150802#2837356 (Marostegui) I think we are ready to sanitize s3 now after dropping all the non priv...
[06:31:13] <wikibugs_>	 DBA, Operations: mariadb puppet module doesn't start mysql service in labs (possibly anywhere) - https://phabricator.wikimedia.org/T91797#1096483 (Marostegui) This is quite old indeed and we do not start MySQL everywhere (apart from labs) on purpose. We do not really want Puppet to handle the MySQL servi...
[06:42:07] <wikibugs_>	 DBA, Datasets-General-or-Unknown, Labs, Labs-Infrastructure, Patch-For-Review: Provision db1095 with at least 1 shard, sanitize and test slave-side triggers - https://phabricator.wikimedia.org/T150802#2837381 (Marostegui) >>! In T150802#2836731, @jcrespo wrote: > I wanted to sanitize this for...
[06:54:18] <wikibugs>	 DBA, Wikidata, Performance: DispatchChanges: Avoid long-lasting connections to the master DB - https://phabricator.wikimedia.org/T151681#2837404 (jcrespo) > Hm, these are both job runners, jobs (probably) shouldn't run for so long. I wonder what's causing this.  Separate issue then, but heads up for it.
[06:54:20] <wikibugs_>	 DBA: Wikidatawiki revision table needs unification - https://phabricator.wikimedia.org/T150644#2837405 (Marostegui) Running on dbstore2001
[07:03:14] <wikibugs_>	 DBA, Datasets-General-or-Unknown, Labs, Labs-Infrastructure, Patch-For-Review: Provision db1095 with at least 1 shard, sanitize and test slave-side triggers - https://phabricator.wikimedia.org/T150802#2837422 (jcrespo) > Shall I run the local redact_sanitarium.sh instead of the one we used to...
[07:03:50] <marostegui>	 ^ I will run the local copy then :)
[07:04:01] <marostegui>	 Sorry for the confusion about frwikivoyage by the way
[07:04:22] <jynus>	 no, no confusion
[07:14:01] <jynus>	 I intended not to upgrade most of eqiad servers
[07:14:14] <marostegui>	 so only restart?
[07:14:23] <jynus>	 but some are so old, they do not have the right ssl compiled options
[07:14:35] <marostegui>	 oh 
[07:14:40] <jynus>	 so I can upgrade to .22 or .23
[07:14:48] <jynus>	 or all the way to .28
[07:15:21] <marostegui>	 if we upgrade to .28 we will have a mismatch between eqiad and codfw right?
[07:15:43] <jynus>	 no, I upgraded codfw to .28
[07:15:56] <marostegui>	 ah
[07:16:14] <marostegui>	 Then it is probably better to upgrade to .28, but you know I am sometimes too careful
[07:16:23] <jynus>	 we have a 3rd copy
[07:16:30] <jynus>	 on es200X
[07:16:39] <jynus>	 so I think it should be safe
[07:16:45] <jynus>	 once we have to upgrade
[07:16:59] <jynus>	 I think we can stay in 22 for most of them
[07:17:15] <marostegui>	 to be honest, I would only upgrade to .28 those that really need it
[07:17:24] <marostegui>	 sorry if I am slowing you down here being too careful
[07:17:33] <jynus>	 well, I already upgraded 1/3 of those
[07:17:47] <jynus>	 thinking I could not upgrade the others
[07:18:11] <jynus>	 but I have to do some upgrade on 3 of them or I will not be able to enable TLS
[07:18:27] <marostegui>	 then it is clear
[07:18:33] <jynus>	 .16 does not have SSL support
[07:18:37] <jynus>	 openSSL
[07:18:51] <marostegui>	 we still have .16 on some of them? :o
[07:18:57] <jynus>	 yes
[07:18:57] <marostegui>	 that is old indeed haha
[07:19:16] <jynus>	 so the question is whether to upgrade to .22, .23 or .28 on those
[07:19:56] <marostegui>	 Probably .28, to avoid doing the work twice (and restarting twice)
[07:20:01] <marostegui>	 don't you think?
[07:20:11] <marostegui>	 is there any drawback in going .16 -> .28
[07:20:13] <marostegui>	 ?
[07:20:19] <jynus>	 I was asking for a second opinion
[07:20:39] <jynus>	 I think it is ok for these 3 servers because we have a backup that is not upgraded
[07:20:53] <marostegui>	 it should be fine, we have gone from .22 to .28 I believe
[07:20:59] <marostegui>	 in some servers
[07:21:16] <jynus>	 the backup is on .22
[07:21:22] <jynus>	 and is fully offline
[07:21:31] <jynus>	 (no replication)
[07:21:32] <marostegui>	 then let's go for 28
[07:24:02] <jynus>	 I am doing apt install wmf-mariadb10=10.0.28-1, with no other upgrade
[07:24:35] <jynus>	 probably I should upgrade openssl, too
[07:25:06] <marostegui>	 so you are leaving kernel and all that stuff aside?
[07:25:31] <jynus>	 yes, I was supposed to only restart
[07:25:35] <jynus>	 mysql
[07:25:40] <marostegui>	 yeah, that sounds good to me :)
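The decision reached above (upgrade only the hosts too old for TLS, and take them straight to .28 so they are restarted once) can be sketched roughly as follows. This is an illustration only: the host names are hypothetical, and treating 10.0.22 as the oldest acceptable version is an assumption drawn from the remark that most servers can stay on 22 while .16 lacks OpenSSL support.

```python
def parse_version(version):
    """Turn a dotted version string such as '10.0.16' into a comparable tuple."""
    return tuple(int(part) for part in version.split("."))


# Assumption: hosts on 10.0.22 or later already have usable TLS support.
MIN_TLS_VERSION = parse_version("10.0.22")
# Target chosen in the discussion, to avoid upgrading and restarting twice.
TARGET_VERSION = "10.0.28"


def plan_upgrades(hosts):
    """Return {host: target_version} for hosts too old to enable TLS."""
    return {
        host: TARGET_VERSION
        for host, version in hosts.items()
        if parse_version(version) < MIN_TLS_VERSION
    }


# Hypothetical inventory; the real host list is not given in the log.
hosts = {"es-host-a": "10.0.16", "es-host-b": "10.0.22", "es-host-c": "10.0.28"}
print(plan_upgrades(hosts))  # {'es-host-a': '10.0.28'}
```

Hosts already at or above the assumed minimum are left out of the plan, which matches the stated intent of only restarting them.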
[07:29:34] <wikibugs_>	 DBA, Datasets-General-or-Unknown, Labs, Labs-Infrastructure, Patch-For-Review: Provision db1095 with at least 1 shard, sanitize and test slave-side triggers - https://phabricator.wikimedia.org/T150802#2837454 (Marostegui) I have started to sanitize s3 using the local script in a local screen...
[07:33:38] <wikibugs_>	 DBA, Patch-For-Review: Fix PK on S5 dewiki.revision - https://phabricator.wikimedia.org/T148967#2837456 (Marostegui) Alter running on db1070
[08:28:25] <wikibugs>	 DBA, Patch-For-Review: Moving backup and otrs role into their own .pp - https://phabricator.wikimedia.org/T150851#2837570 (Marostegui) Open>Resolved This has been deployed. Running puppet agent in dbstore1001 and es2001 went fine so this can be closed.
[08:28:27] <wikibugs_>	 DBA, Epic: Decouple roles from mariadb.pp into their own file - https://phabricator.wikimedia.org/T150850#2837572 (Marostegui)
[08:28:32] <jynus>	 I created https://phabricator.wikimedia.org/T152080
[08:29:56] <marostegui>	 I read: page and got scared about the duplicate entries we had a week ago in the page table, but then I kept reading and..pheew
[08:30:06] <jynus>	 ha ha
[08:35:29] <wikibugs>	 DBA, Epic: Moving eventloggin role into its own .pp - https://phabricator.wikimedia.org/T152081#2837585 (Marostegui)
[08:35:43] <wikibugs_>	 DBA: Moving eventlogging role into its own .pp - https://phabricator.wikimedia.org/T152081#2837585 (Marostegui)
[09:00:08] <wikibugs_>	 DBA, Patch-For-Review: Deploy gtid_domain_id flag in our mysql hosts - https://phabricator.wikimedia.org/T149418#2837638 (Marostegui) I am going to start rolling this out in m3. dbstore servers do not use GTID so it should be perfectly safe to deploy it there too. ``` root@neodymium:/home/marostegui/git/...
[09:08:26] <wikibugs_>	 DBA, Datasets-General-or-Unknown, Labs, Labs-Infrastructure, Patch-For-Review: Provision db1095 with at least 1 shard, sanitize and test slave-side triggers - https://phabricator.wikimedia.org/T150802#2837675 (Marostegui) The script has finished. It took around 1:35h to finish. I am going to...
[12:04:58] <wikibugs_>	 DBA, Patch-For-Review: db2034: investigate its crash and reimage - https://phabricator.wikimedia.org/T149553#2837930 (Marostegui) I have started MySQL and replication on db2048 so it can catch up from yesterday. @Papaul ping me before doing the DIMM changes so I can turn it off.
[13:52:47] <wikibugs>	 DBA, Patch-For-Review: Fix PK on S5 dewiki.revision - https://phabricator.wikimedia.org/T148967#2838169 (Marostegui) db1070 is done  ``` MariaDB PRODUCTION s5 localhost dewiki > show create table revision\G *************************** 1. row ***************************        Table: revision Create Table...
[14:02:46] <wikibugs>	 DBA, Datasets-General-or-Unknown, Labs, Labs-Infrastructure, Patch-For-Review: Provision db1095 with at least 1 shard, sanitize and test slave-side triggers - https://phabricator.wikimedia.org/T150802#2838183 (Marostegui) The data has been sanitized correctly and I have started replication in...
[15:50:15] <wikibugs>	 DBA: Wikidatawiki revision table needs unification - https://phabricator.wikimedia.org/T150644#2838449 (Marostegui) dbstore2001 is done  ``` MariaDB DBSTORE localhost wikidatawiki > show create table revision\G *************************** 1. row ***************************        Table: revision Create Table...
[16:05:59] <wikibugs>	 DBA, Patch-For-Review: db2034: investigate its crash and reimage - https://phabricator.wikimedia.org/T149553#2838501 (Marostegui) After the memory swap I have started the transfer between db2048 and db2034.
[16:08:37] <wikibugs>	 DBA, Datasets-General-or-Unknown, Labs, Labs-Infrastructure, Patch-For-Review: Provision db1095 with at least 1 shard, sanitize and test slave-side triggers - https://phabricator.wikimedia.org/T150802#2838504 (Marostegui) The server caught up and the data is being sanitized as it comes in, so...
[16:20:48] <wikibugs>	 DBA, Operations, ops-codfw: Degraded RAID on db2041 - https://phabricator.wikimedia.org/T152105#2838536 (Volans)
[16:22:15] <wikibugs_>	 DBA, Operations, ops-codfw: db2041: Disk RAID predictive failure - https://phabricator.wikimedia.org/T151203#2838538 (Papaul) a:Papaul>Marostegui Disk replacement complete.
[16:22:19] <wikibugs>	 DBA, Operations, ops-codfw: db2041: Disk RAID predictive failure - https://phabricator.wikimedia.org/T151203#2838540 (Marostegui) The disk failed in the end: https://phabricator.wikimedia.org/T152105
[16:22:29] <marostegui>	 oh, interesting
[16:22:46] <wikibugs_>	 DBA, Operations, ops-codfw: Degraded RAID on db2041 - https://phabricator.wikimedia.org/T152105#2838512 (Volans)
[16:22:48] <wikibugs>	 DBA, Operations, ops-codfw: db2041: Disk RAID predictive failure - https://phabricator.wikimedia.org/T151203#2838545 (Volans)
[16:23:23] <wikibugs_>	 DBA, Operations, ops-codfw: db2041: Disk RAID predictive failure - https://phabricator.wikimedia.org/T151203#2838546 (Marostegui) The disk is now rebuilding:  ```       logicaldrive 1 (3.3 TB, RAID 1+0, Recovering, 3% complete)        physicaldrive 1I:1:1 (port 1I:box 1:bay 1, SAS, 600 GB, OK)...
[16:24:10] <wikibugs>	 DBA, Operations, ops-codfw: Degraded RAID on db2041 - https://phabricator.wikimedia.org/T152105#2838547 (Marostegui) The disk is now rebuilding:  ```       logicaldrive 1 (3.3 TB, RAID 1+0, Recovering, 3% complete)        physicaldrive 1I:1:1 (port 1I:box 1:bay 1, SAS, 600 GB, OK)       physicaldrive...
[16:24:14] <wikibugs_>	 DBA, Operations, ops-codfw: db2041: Disk RAID predictive failure - https://phabricator.wikimedia.org/T151203#2838548 (Volans) duplicate>Open
[16:24:43] <wikibugs>	 DBA, Operations, ops-codfw: db2041: Disk RAID predictive failure - https://phabricator.wikimedia.org/T151203#2810386 (Volans)
[16:24:45] <wikibugs_>	 DBA, Operations, ops-codfw: Degraded RAID on db2041 - https://phabricator.wikimedia.org/T152105#2838553 (Volans)
[16:27:01] <jynus>	 once compression finishes, should we close T150802 and open another for the labs copy?
[16:27:01] <stashbot>	 T150802: Provision db1095 with at least 1 shard, sanitize and test slave-side triggers - https://phabricator.wikimedia.org/T150802
[16:27:13] <marostegui>	 jynus: yeah, agreed
[16:27:20] <jynus>	 I am asking :-)
[16:27:20] <marostegui>	 I will close it once it is over
[16:27:25] <marostegui>	 yes, i think so
[16:27:34] <marostegui>	 otherwise it will get too messy
[16:27:46] <jynus>	 with labs copy + accounts, provisioning done?
[16:27:59] <marostegui>	 we still have this: https://phabricator.wikimedia.org/T147052
[16:28:28] <jynus>	 yes, we can use the master one, too
[16:28:59] <jynus>	 T149418 not a blocker
[16:28:59] <stashbot>	 T149418: Deploy gtid_domain_id flag in our mysql hosts - https://phabricator.wikimedia.org/T149418
[16:29:14] <jynus>	 T151607, the same
[16:29:15] <stashbot>	 T151607: Rebuild old timestamp format tables - https://phabricator.wikimedia.org/T151607
[16:29:29] <marostegui>	 yep
[16:29:38] <jynus>	 I will work tomorrow on the haproxy
[16:29:53] <marostegui>	 ok I think tomorrow I might be able to start copying data to the new labs boxes
[16:29:58] <marostegui>	 if compression finishes
[16:31:52] <jynus>	 db1095 is quite powerful
[16:32:04] <jynus>	 unlike dbstore2X, and others
[16:32:27] <jynus>	 enwiki compression goes from 3-5 days to 1-2 hours
[16:32:48] <jynus>	 for revision, I mean
[16:33:02] <marostegui>	 revision took around 10 hours I think
[16:33:07] <jynus>	 really?
[16:33:11] <marostegui>	 yeah
[16:33:17] <marostegui>	 but in the dbstore it took more XD
[16:45:59] <wikibugs_>	 DBA, Labs: Prepare and check storage layer for new fi.wikivoyage.org - https://phabricator.wikimedia.org/T151756#2838643 (jcrespo) Open>Resolved a:jcrespo From the above patch, this is resolved.
[17:08:20] <jynus>	 2 repools and I can close T151995 at last
[17:08:20] <stashbot>	 T151995: Rolling restart of external storage servers for TLS certificate update - https://phabricator.wikimedia.org/T151995
[17:10:10] <wikibugs>	 DBA, Operations, Patch-For-Review: Rolling restart of external storage servers for TLS certificate update - https://phabricator.wikimedia.org/T151995#2838717 (jcrespo) Waiting for es2019 and es2015 to warmup their buffer pools to repool them and I could close this.
[17:10:49] <wikibugs_>	 DBA, Operations, Patch-For-Review: Rolling restart of parsercache servers for TLS certificate update - https://phabricator.wikimedia.org/T152029#2838718 (jcrespo) a:jcrespo
[17:11:12] <wikibugs>	 DBA, Operations, ops-codfw: db2042 disk predictive failure - https://phabricator.wikimedia.org/T150974#2803218 (Papaul) Dear Mr Papaul Tshibamba,  Thank you for contacting Hewlett Packard Enterprise for your service request. This email confirms your request for service and the details are below.  You...
[17:11:30] <wikibugs_>	 DBA, Operations, ops-codfw: db2042 disk predictive failure - https://phabricator.wikimedia.org/T150974#2838723 (Papaul) a:Papaul
[17:46:13] <wikibugs>	 DBA, Operations, ops-codfw: Degraded RAID on db2068 - https://phabricator.wikimedia.org/T151763#2838814 (Papaul) a:Papaul
[17:55:37] <wikibugs>	 DBA, Operations: mariadb puppet module doesn't start mysql service in labs (possibly anywhere) - https://phabricator.wikimedia.org/T91797#2838847 (fgiunchedi) Open>Invalid Thanks @Marostegui, tentatively resolving
[17:56:16] <wikibugs_>	 DBA, Operations, ops-codfw: Degraded RAID on db2068 - https://phabricator.wikimedia.org/T151763#2838851 (Papaul) Dear Mr Papaul Tshibamba,  Thank you for contacting Hewlett Packard Enterprise for your service request. This email confirms your request for service and the details are below.  Your reque...
[17:58:55] <wikibugs_>	 DBA, Operations: mariadb puppet module doesn't start mysql service in labs (possibly anywhere) - https://phabricator.wikimedia.org/T91797#2838853 (jcrespo) Invalid>Resolved In fact Mariadb can start automatically for non production hosts (right now beta, dns-labs, and analytics-labs), so this is...
[18:26:11] <wikibugs_>	 DBA, Operations: mariadb puppet module doesn't start mysql service in labs (possibly anywhere) - https://phabricator.wikimedia.org/T91797#1096483 (Krenair) >>! In T91797#2838853, @jcrespo wrote: > In fact Mariadb can start automatically for non production hosts (right now beta, dns-labs, and analytics-la...
[19:04:29] <wikibugs_>	 DBA, Operations, Patch-For-Review: Set up TLS for MariaDB replication - https://phabricator.wikimedia.org/T111654#2839211 (jcrespo)
[19:04:32] <wikibugs>	 DBA, Operations, Patch-For-Review: Rolling restart of external storage servers for TLS certificate update - https://phabricator.wikimedia.org/T151995#2839210 (jcrespo) Open>Resolved
[20:03:40] <godog>	 jynus: thoughts on https://phabricator.wikimedia.org/T106386 ? still "high" ?
[20:06:00] <wikibugs>	 DBA, Operations, Patch-For-Review: Set up TLS for MariaDB replication - https://phabricator.wikimedia.org/T111654#2839471 (jcrespo) Out of 157 active hosts responding to salt, 15 host with no TLS deployed, 42 with the old certificate, 100 with the puppet one:  ``` $ sudo salt -C 'G@cluster:mysql' cmd...
[20:06:20] <jynus>	 that is not really an operations task
[20:06:41] <jynus>	 it is a mediawiki-database with support from operations
[20:07:06] <jynus>	 but if I was to triage it, I would leave it as low or normal
[20:15:32] <wikibugs>	 DBA, Operations, Patch-For-Review: Set up TLS for MariaDB replication - https://phabricator.wikimedia.org/T111654#2839512 (jcrespo) List of eqiad hosts with the old cert:  ```  db1015.eqiad.wmnet db1021.eqiad.wmnet db1022.eqiad.wmnet db1036.eqiad.wmnet db1054.eqiad.wmnet db1060.eqiad.wmnet db1063.eqi...
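The TLS rollout tally reported in T111654 (15 hosts with no TLS, 42 with the old certificate, 100 with the puppet one, 157 in total) could be reproduced from per-host results along these lines. This is only a sketch: the salt command and its output are truncated in the log, so the per-host state labels and host names below are hypothetical placeholders, and only the three totals come from the log.

```python
from collections import Counter


def summarize_tls(states):
    """Count hosts per TLS state and return (counts, total number of hosts)."""
    counts = Counter(states.values())
    return counts, sum(counts.values())


# Hypothetical host names and labels; only the three totals come from the log.
states = {}
states.update({f"no-tls-{i}": "no TLS" for i in range(15)})
states.update({f"old-cert-{i}": "old certificate" for i in range(42)})
states.update({f"puppet-cert-{i}": "puppet certificate" for i in range(100)})

counts, total = summarize_tls(states)
print(total)  # 157
```

The three groups add up to the 157 responding hosts, so every host that answered falls into exactly one of the reported states.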
[21:17:29] <wikibugs>	 DBA, Collaboration-Team-Triage, Flow, Patch-For-Review, and 2 others: Drop flow_subscription table - https://phabricator.wikimedia.org/T149936#2839814 (Catrope)
[21:17:30] <wikibugs_>	 DBA, Epic, Tracking: Database tables to be dropped on Wikimedia wikis and other WMF databases (tracking) - https://phabricator.wikimedia.org/T54921#2839815 (Catrope)
[21:17:52] <wikibugs_>	 DBA, Collaboration-Team-Triage, Flow, Patch-For-Review, and 2 others: Drop flow_subscription table - https://phabricator.wikimedia.org/T149936#2769308 (Catrope) >>! In T149936#2775616, @jcrespo wrote: > Looks good, waiting on code deployment for production deploy  The code is deployed now.  > - t...
[21:20:17] <wikibugs_>	 Blocked-on-schema-change, Collaboration-Team-Triage, Notifications, Patch-For-Review, and 2 others: Add primary key to echo_notification table - https://phabricator.wikimedia.org/T136428#2839834 (Catrope) The code is now deployed, and this is ready to go.
[21:21:31] <wikibugs>	 Blocked-on-schema-change, DBA, Collaboration-Team-Triage, Flow, and 3 others: Add primary keys to remaining Flow tables - https://phabricator.wikimedia.org/T149819#2839836 (Catrope) The code is now deployed, and this is ready to go.
[23:43:25] <wikibugs>	 DBA, MediaWiki-General-or-Unknown, Operations, Patch-For-Review: img_metadata queries for PDF files saturates s4 slaves - https://phabricator.wikimedia.org/T147296#2687819 (fgiunchedi) I don't recall seeing this issue on s4 since https://gerrit.wikimedia.org/r/314229 landed, still an issue or we...