[06:28:55] Blocked-on-schema-change, DBA, Patch-For-Review, Schema-change: Dropping site_stats.ss_total_views on wmf databases - https://phabricator.wikimedia.org/T86339 (Marostegui)
[09:03:42] DBA, Operations, decommission, ops-codfw, Patch-For-Review: Decommission parsercache hosts: pc2004 pc2005 pc2006 - https://phabricator.wikimedia.org/T209858 (Marostegui)
[09:14:46] DBA, Operations, decommission, ops-codfw, Patch-For-Review: Decommission parsercache hosts: pc2004 pc2005 pc2006 - https://phabricator.wikimedia.org/T209858 (Marostegui)
[09:15:39] DBA, Operations, decommission, ops-codfw, Patch-For-Review: Decommission parsercache hosts: pc2004 pc2005 pc2006 - https://phabricator.wikimedia.org/T209858 (Marostegui) a: Marostegui>RobH These hosts are now ready for DCOps to take over. MySQL has been stopped on them too.
[09:27:04] Blocked-on-schema-change, DBA, Patch-For-Review, Schema-change: Dropping site_stats.ss_total_views on wmf databases - https://phabricator.wikimedia.org/T86339 (Marostegui)
[09:27:32] lots of io errors on es2014
[09:27:41] it seems the network may not be too stable there
[09:27:50] in which row is it?
[09:28:08] I remember some net maintenance was going on (or was going to go on) in codfw
[09:28:14] so maybe poking arzhel about it?
[09:28:14] one error (disconnection and connection) every hour
[09:28:33] this has been happening up until now
[09:28:35] Blocked-on-schema-change, DBA, Patch-For-Review, Schema-change: Dropping site_stats.ss_total_views on wmf databases - https://phabricator.wikimedia.org/T86339 (Marostegui) s2 eqiad progress: [] labsdb1011 [] labsdb1010 [] labsdb1009 [] dbstore1002 [] db1125 [] db1122 [] db1105 [] db1103 [] db109...
[09:29:25] no, what I meant is that there were some changes on the codfw network
[09:29:41] When I left for holidays they were planning them, not sure if they happened
[09:29:55] I see
[09:30:32] from yesterday's meeting
[09:30:33] Updates:
[09:30:33] • codfw row C done Nov 8th
[09:30:33] • codfw rows A, B, D, waiting on approval of uplink modules - https://phabricator.wikimedia.org/T207960
[09:32:17] this is older
[09:32:25] it starts on Oct 5
[09:33:29] * marostegui checking sal
[09:34:02] nothing that looks related
[09:34:26] that is just when the logs start
[09:34:32] it may have always happened
[09:34:38] I am restarting it and will see
[09:35:01] Blocked-on-schema-change, DBA, Patch-For-Review, Schema-change: Dropping site_stats.ss_total_views on wmf databases - https://phabricator.wikimedia.org/T86339 (Marostegui)
[09:40:07] DBA, Data-Services, User-Banyek, User-Urbanecm: Prepare and check storage layer for punjabiwikimedia - https://phabricator.wikimedia.org/T207584 (Banyek) I also created the privileges for the cloud team to create the views, but this ticket is still waiting on the test user creation
[11:11:31] Blocked-on-schema-change, DBA, Patch-For-Review, Schema-change, User-Banyek: Dropping user.user_options on wmf databases - https://phabricator.wikimedia.org/T85757 (Banyek)
[11:12:54] DBA, MediaWiki-Database, TechCom-RFC: RFC: Proposal to add wl_addedtimestamp attribute to the watchlist table - https://phabricator.wikimedia.org/T209773 (D3r1ck01)
[11:28:17] brb 10-15min
[12:12:28] DBA, Operations: BBU Fail on dbstore2002 - https://phabricator.wikimedia.org/T208320 (Banyek) >>! In T208320#4738830, @Marostegui wrote: > I have eased replication consistency flags and it is now catching up.
> What do you mean with "it is not compressed"? that you are running the alter tables to compres...
[12:17:56] I am preparing to restart the s3 mysql instance on dbstore2002
[12:18:12] what is the plan?
[12:19:02] I'll ```SET SESSION sql_log_bin = 0; SET GLOBAL innodb_buffer_pool_dump_at_shutdown = 0```
[12:19:26] then
[12:20:09] I'll downtime the host
[12:20:30] stop slave; restart service, start slave,
[12:20:43] and turn on the innodb_buffer_pool_dump_at_shutdown again
[12:21:06] You expect that to fix the issue?
[12:21:54] I am genuinely asking as I don't really know what the issue is :-)
[12:22:55] whatever you do, make sure you do not interfere with the backups happening today
[12:22:59] not necessarily, but as we talked about yesterday it could be *something*. I mean nothing really uses that host now, the backup is not running yet, so restarting the instance is mostly harmless
[12:23:13] jynus: yes, indeed
[12:23:33] I was just about to check when the backups are expected to start
[12:23:44] check the used memory, and see if you need and can increase the buffer pool
[12:25:18] ```MariaDB [(none)]> pager grep "Buffer pool hit rate"
[12:25:18] PAGER set to 'grep "Buffer pool hit rate"'
[12:25:18] MariaDB [(none)]> show engine innodb status\G
[12:25:18] Buffer pool hit rate 988 / 1000, young-making rate 4 / 1000 not 20 / 1000
[12:25:18] Buffer pool hit rate 995 / 1000, young-making rate 2 / 1000 not 6 / 1000
[12:25:18] Buffer pool hit rate 969 / 1000, young-making rate 13 / 1000 not 45 / 1000
[12:25:18] Buffer pool hit rate 964 / 1000, young-making rate 14 / 1000 not 61 / 1000
[12:25:19] Buffer pool hit rate 991 / 1000, young-making rate 2 / 1000 not 10 / 1000
[12:25:19] Buffer pool hit rate 985 / 1000, young-making rate 5 / 1000 not 45 / 1000
[12:25:20] Buffer pool hit rate 994 / 1000, young-making rate 3 / 1000 not 6 / 1000
[12:25:20] Buffer pool hit rate 981 / 1000, young-making rate 6 / 1000 not 35 / 1000
[12:25:21] Buffer pool hit rate 966 / 1000, young-making rate 9 / 1000 not 57 / 1000```
[12:25:26] please use paste
[12:25:40] next time
[12:25:59] on grafana I see it is using 121 out of 125GB, so not a huge amount available
[12:27:21] https://grafana.wikimedia.org/dashboard/db/mysql?orgId=1&var-dc=codfw%20prometheus%2Fops&var-server=dbstore2002&var-port=13313&from=1542706032314&to=1542716832316
[12:30:10] well, the backups will start in 4.5 hours, so I'd say let's restart the instance and see what happens. I don't really think this will help, but I am pretty sure it won't cause us any more harm, and at least we will have checked.
[12:30:36] any objections jynus, marostegui?
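For reference, the restart plan described above can be sketched roughly as follows. The socket path and systemd unit name are assumptions (multi-instance hosts use per-section sockets and units whose exact names are not shown in this log), so treat this as a sketch rather than the exact procedure used:

```bash
#!/bin/bash
# Hedged sketch of the dbstore2002 s3 restart plan discussed above.
# SOCKET and SERVICE are assumed placeholder names, not taken from the log.
SOCKET=/run/mysqld/mysqld.s3.sock
SERVICE=mariadb@s3

# Keep this session out of the binlog and skip the buffer pool dump at
# shutdown, so the dump does not slow the restart down.
mysql -S "$SOCKET" -e "SET SESSION sql_log_bin = 0;
                       SET GLOBAL innodb_buffer_pool_dump_at_shutdown = 0;"

# (Downtime the host in monitoring here, as mentioned in the plan.)

# Stop replication cleanly, restart the instance, resume replication.
mysql -S "$SOCKET" -e "STOP SLAVE;"
systemctl restart "$SERVICE"
mysql -S "$SOCKET" -e "START SLAVE;"

# Re-enable the buffer pool dump for future shutdowns.
mysql -S "$SOCKET" -e "SET GLOBAL innodb_buffer_pool_dump_at_shutdown = 1;"
```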
[12:31:04] banyek: the only drawback is that the host will be cold for a while, so we won't see any results in the immediate term, but at some point you'll need to decide whether it keeps lagging because it is cold or whether it keeps lagging because the restart didn't work
[12:32:34] Another plan B could be to set trx to 2 in general on dbstore2001/2002, as we reclone the instance when mysql dies anyway
[12:33:06] what I am expecting: slower catch-up until the backup, a slow(er than usual) backup, and after that the instance will be warm enough to see how it catches up
[12:33:32] abiut trx:2
[12:33:38] about trx:2
[12:35:13] I like the idea - in the past I always let my instances (except masters) run with trx 2; and normally my approach was 'always reclone, even with the smallest issue'
[12:36:05] so if dbstore2001 and dbstore2002 are the same, I don't see any reason not to do this, as the chance of crashing them both at the same time is pretty low
[12:36:10] Normally recloning is a bit of a pain, so if the crash doesn't look too bad I tend to run compare.py over all the tables first
[12:36:18] If the crash looks like the one on db1078, I do reclone anyway
[12:37:09] banyek: to be clear, I would set trx=2 on s3 for now, as that is the only one that has issues so far
[12:39:58] - actually we have *everything* we need to make the recloning a single click/command process. (well, it needs some time to do it)
[12:41:14] - I see your point of setting trx=2 only on s3 now, but as I have learned so far, the hardest thing here is that we have a lot of exceptions and hacks around; if we spent time on getting everything as similar as possible we would find a million free hours
[12:43:35] as we are closing in on the start of the labsdb maintenance window, I think whatever we do on dbstore2002 will happen after it
[12:44:17] I don't want to make those two maintenances interfere, I'll comment on the phab. ticket
[12:45:52] Maybe set trx=2 manually for now to let it catch up a bit at least, I would say
[12:45:58] it is 19h delayed now
[12:50:20] ok, I agree
[12:50:34] also the labsdb1011 maintenance starts in 10 minutes
[12:52:20] then I'll issue ```set session sql_log_bin=0; set global innodb_flush_log_at_trx_commit=2;```
[12:52:55] ok
[12:54:08] on s3 I assume?
[12:54:17] only on s3
[12:54:44] log it for s3 only if you can, just to be precise
[12:54:56] what about sync_binlog?
[12:55:03] what about it?
[12:55:28] you commented "I just restored the original flags to sync_binlog=1 and trx_commit=1 as s3 caught up."
[12:55:43] I am asking if I shall sync_binlog to 0 too
[12:55:56] *if I shall set
[12:56:08] No, I set it to 0 last week because it was like 5 days delayed when I came back from holidays
[12:56:20] 👍
[12:56:25] so I wanted it to catch up as fast as possible, 5 days was _way_ too much
[12:56:41] ok, then I'll set innodb_flush_log_at_trx_commit to 2 *only*
[12:56:47] ok
[12:57:30] it started to catch up
[13:01:18] DBA, Operations: BBU Fail on dbstore2002 - https://phabricator.wikimedia.org/T208320 (Banyek) As the replication lag was 69663 seconds, we agreed to set `innodb_flush_log_at_trx_commit=2;` on the host. Now the replication is catching up.
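As a hedged companion to the change just logged, this is roughly what relaxing durability on the lagging s3 replica and watching it catch up looks like; the socket path is an assumed placeholder:

```bash
# Sketch only: relax redo-log flushing on this replica (flush roughly once per
# second instead of at every commit), mirroring the commands quoted above.
SOCKET=/run/mysqld/mysqld.s3.sock   # assumed path

mysql -S "$SOCKET" -e "SET SESSION sql_log_bin = 0;
                       SET GLOBAL innodb_flush_log_at_trx_commit = 2;"

# Watch the lag shrink while it catches up.
mysql -S "$SOCKET" -e "SHOW SLAVE STATUS\G" | grep Seconds_Behind_Master

# Once caught up, restore full durability, as was done earlier on this host
# with sync_binlog=1 and trx_commit=1.
mysql -S "$SOCKET" -e "SET GLOBAL innodb_flush_log_at_trx_commit = 1;"
```

The trade-off is that up to about a second of committed transactions can be lost if mysqld crashes, which is acceptable here given the stated approach of recloning the instance after a crash anyway.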
[13:05:33] DBA, Data-Services, Patch-For-Review, User-Banyek, cloud-services-team (Kanban): Upgrade/reboot labsdb* servers - https://phabricator.wikimedia.org/T209517 (Banyek)
[13:11:45] host depooled, now I am waiting until the connections are gone
[13:15:18] we have 2 long-running queries there (9558 and 9216 seconds), I'll kill them
[13:16:56] you should probably coordinate the maintenance in -operations
[13:17:06] ok
[13:50:30] after the labsdb1011 maintenance finishes I'll go out & eat for about an hour
[13:50:47] but we are still at 'o' so it takes a while
[13:50:58] :-)
[13:51:27] it used to be worse in the past, when they could not be depooled and it got stuck due to user queries
[13:54:05] so I have a proposal of what to do with dbstore2002 - drop dbstore2002:3312, which is already on dbstore1001, and use the resources made available to increase performance
[13:54:37] Ah, I didn't know s2 was duplicated
[13:54:59] i like the idea
[13:55:03] it was from the times when we still wanted to have everything duplicated
[13:56:29] as running mysql_upgrade on all the wiki databases prints this:
[13:57:16] some views pointing to non-existent tables?
[13:57:31] https://phabricator.wikimedia.org/P7831
[13:57:32] yes
[13:57:54] actually all the same
[13:57:58] smd_resource_links
[13:58:07] msg_resource_links
[13:58:09] Open a ticket to the cloud team to get them removed, probably left over from dropped tables
[13:58:14] ok
[14:22:13] aaargh it finished
[14:32:49] and here comes the fun part: https://phabricator.wikimedia.org/P7832
[14:33:30] wmf-pt-kill can't read the server key (as it is only readable by mysql)
[14:33:32] `-r-------- 1 mysql mysql 3.2K Jul 26 2016 /etc/mysql/ssl/server.key`
[14:34:08] it seems pt-kill doesn't have --skip-ssl
[14:34:11] DBA, Data-Services, User-Banyek, User-Urbanecm: Prepare and check storage layer for punjabiwikimedia - https://phabricator.wikimedia.org/T207584 (Urbanecm) Yeah, but there is no account in the wiki, so nobody can log in. I pinged @Reedy with a request for account creation, waiting :).
[14:35:09] pt-kill has defaults-file
[14:35:17] it is F=
[14:35:57] how did it work before?
[14:36:37] it worked, I don't know how
[14:37:16] according to the paste you did, it is not properly puppetized
[14:37:46] how so?
[14:38:15] well, it doesn't start after a reboot
[14:38:18] :-D
[14:38:30] it failed to start
[14:38:35] I was on the console
[14:39:01] ok s/puppet/package/ it is the same thing
[14:43:30] isn't it possible that the file privs changed with the package version?
[14:44:38] /etc ? nope, that is not touched by mysql
[14:45:02] it only installs to /opt and the systemd unit
[14:49:05] how come it worked before, if now it is doing what it is supposed to do - that is the main question
[14:49:28] as the server is not in production I'll change the file privs to see if that is enough to make wmf-pt-kill start. If so, I'll set the privileges back to their current state
[14:50:18] ?
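Stepping back to the mysql_upgrade output from earlier in this hour: a hedged sketch of how one could list which wiki databases still carry the leftover views before filing the cleanup ticket for the cloud team. The socket path is an assumed placeholder:

```bash
# Sketch: list databases that still contain the leftover views reported by
# mysql_upgrade (P7831). The socket path is an assumption.
mysql -S /run/mysqld/mysqld.sock -BN -e "
    SELECT table_schema, table_name
    FROM information_schema.tables
    WHERE table_type = 'VIEW'
      AND table_name IN ('smd_resource_links', 'msg_resource_links');"
```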
[14:50:52] `-r-------- 1 mysql mysql 3.2K Jul 26 2016 /etc/mysql/ssl/server.key`
[14:51:21] the client should not need access to the server key to connect
[14:51:46] `-r--r----- 1 mysql wmf-pt-kill 3.2K Jul 26 2016 /etc/mysql/ssl/server.key` <- I'd like to test this
[14:52:26] this is the error wmf-pt-kill has:
[14:52:29] ```Nov 20 13:41:26 labsdb1011 wmf-pt-kill[3559]: SSL error: Unable to get private key from '/etc/mysql/ssl/server.key'```
[14:53:02] sure, but pt-kill should not try to read the server key to start
[14:54:11] this is like giving your private certificate to every browser you want to connect to your web server
[14:55:34] try F=/dev/null
[14:57:58] I don't think perl is compiled with the right openssl version anyway
[15:10:38] yes!
[15:10:42] That nailed it
[15:10:54] That's nailed it
[15:11:30] please please make sure to leave the permissions as they were
[15:11:49] you can leave it running as it is and make a proper patch later
[15:11:52] just for the record: ```root@labsdb1011:/etc/mysql/ssl# ls -lah
[15:11:52] total 24K
[15:11:52] dr-xr-xr-x 2 mysql mysql 4.0K Mar 7 2017 .
[15:11:52] drwxr-xr-x 4 root root 4.0K Sep 4 13:01 ..
[15:11:52] -r--r----- 1 root mysql 1.6K Jul 26 2016 cacert.pem
[15:11:53] -r--r--r-- 1 mysql mysql 1.9K Jul 26 2016 cert.pem
[15:11:53] -r-------- 1 mysql mysql 3.2K Jul 26 2016 server.key
[15:11:54] -r--r----- 1 root mysql 1.7K Jul 26 2016 server-key.pem```
[15:11:58] they're intact
[15:13:25] so the good news is that that solves the problem, the bad news is I have to rebuild the wmf-pt-kill package, as the service definition is there
[15:23:55] Jaime, can I ask for a +1 on this?
[15:23:56] https://gerrit.wikimedia.org/r/#/c/operations/debs/wmf-pt-kill/+/474924/
[15:24:07] I tested the command line, it works
[15:32:12] banyek: labsdb1011 has had a critical for the last 2h, "check systemd state" - I haven't checked further as I am in meetings
[15:32:25] not sure what that is, I guess the pt-kill
[15:32:30] but just saying so it is not forgotten :)
[15:32:55] read back here
[15:33:11] I have been jumping between meetings
[15:33:11] I am just creating a new package for it
[15:33:16] I know <3
[15:33:38] so I haven't had time for that, I have quickly scanned the pt-kill thing, so I assume it is that, but I thought it was safer to mention it
[15:33:41] but ok, I will read back
[15:34:05] tl;dr: pt-kill can't connect to mysql on labsdb1011, but after adding the DSN F=/dev/null it works
[15:39:20] DBA, Data-Services, Patch-For-Review, User-Banyek, cloud-services-team (Kanban): Upgrade/reboot labsdb* servers - https://phabricator.wikimedia.org/T209517 (Banyek) wmf-pt-kill was not able to start on labsdb1011 after reboot, I needed to create a new package for it with DSN F=/dev/null
[15:39:28] !log uploaded wmf-pt-kill_2.2.20-1+wmf5 packages to stretch-wikimedia (T209517)
[15:39:28] banyek: Not expecting to hear !log here
[15:39:28] T209517: Upgrade/reboot labsdb* servers - https://phabricator.wikimedia.org/T209517
[15:50:31] db2071 seems depooled, the last time it was touched seems to have been for a regular upgrade
[15:50:37] any reason not to repool it?
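Back on the wmf-pt-kill fix above: a hedged sketch of what an invocation with the F=/dev/null DSN looks like, so the client does not pick up a defaults file that points it at the server-side SSL key. The thresholds, socket path, and MySQL account below are placeholders, not the packaged unit's settings:

```bash
# Sketch only: pt-kill with an explicit empty defaults file (F=/dev/null) in
# the DSN; every option value here is a placeholder, not the real config.
/usr/bin/pt-kill \
    --busy-time 300 \
    --interval 30 \
    --match-command Query \
    --print --kill \
    F=/dev/null,S=/run/mysqld/mysqld.sock,u=wmf-pt-kill
```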
[15:50:45] jynus: that was me, you can repool it
[15:50:48] sorry about forgetting it
[15:51:08] it is ok, but I needed to ask
[15:51:13] of course
[15:51:19] in case there was something else going on
[15:51:25] maybe it crashed or something
[15:51:28] nope, just an upgrade I made
[15:51:32] cool
[15:51:42] and then I probably got distracted with the 200 other things that are flowing around
[15:51:51] I can repool it tomorrow if you like
[15:53:37] I am doing it now as I wanted to do it on another host
[15:53:54] thanks
[15:54:26] I finished my meeting and I am off, I have had enough of today
[15:54:28] bye
[15:54:39] DBA, Data-Services, Patch-For-Review, User-Banyek, cloud-services-team (Kanban): Upgrade/reboot labsdb* servers - https://phabricator.wikimedia.org/T209517 (Banyek)
[15:55:55] bye
[15:56:45] I have 1:20 until the next maintenance, I'll be afk a bit now
[17:02:05] DBA, Analytics, Analytics-Kanban, Data-Services: Not able to scoop comment table in labs for mediawiki reconstruction process - https://phabricator.wikimedia.org/T209031 (Bawolff) >>! In T209031#4740258, @Anomie wrote: > Note the `actor` view will likely turn out to have similar issues. > > As s...
[17:13:56] DBA, MediaWiki-Database, TechCom-RFC: RFC: Proposal to add wl_addedtimestamp attribute to the watchlist table - https://phabricator.wikimedia.org/T209773 (kchapman) Seems like this should be in the Inbox for TechCom rather than the Backlog.
[17:20:42] DBA, MediaWiki-Database, TechCom-RFC: RFC: Proposal to add wl_addedtimestamp attribute to the watchlist table - https://phabricator.wikimedia.org/T209773 (D3r1ck01) @kchapman, almost yes :) I was still supposed to develop the use-case section so the ticket can come into the Inbox in a full package. B...
[17:23:28] DBA, MediaWiki-Database, TechCom-RFC: RFC: Proposal to add wl_addedtimestamp attribute to the watchlist table - https://phabricator.wikimedia.org/T209773 (kchapman) @D3r1ck01 oops, misunderstanding on my side. Please move to the Inbox when you are ready.
[17:23:54] DBA, Analytics, Analytics-Kanban, Data-Services: Not able to scoop comment table in labs for mediawiki reconstruction process - https://phabricator.wikimedia.org/T209031 (Anomie) @Bawolff: You quoted my comment, but I can't see how your reply is relevant. How would fetching the minimum and maximu...
[17:24:38] DBA, MediaWiki-Database, TechCom-RFC: RFC: Proposal to add wl_addedtimestamp attribute to the watchlist table - https://phabricator.wikimedia.org/T209773 (D3r1ck01) > @D3r1ck01 oops, misunderstanding on my side. Please move to the Inbox when you are ready. A billion thanks :) and yes, I'll move it...
[18:04:40] DBA, Data-Services, Patch-For-Review, User-Banyek, cloud-services-team (Kanban): Upgrade/reboot labsdb* servers - https://phabricator.wikimedia.org/T209517 (aborrero)
[18:13:02] DBA, MediaWiki-Database, TechCom-RFC: RFC: Proposal to add wl_addedtimestamp attribute to the watchlist table - https://phabricator.wikimedia.org/T209773 (Bawolff) I should emphasize, I really don't know if what I said makes any sense, and a DBA should probably weigh in before doing much based on my...
[18:30:28] DBA, Data-Services, Patch-For-Review, User-Banyek, cloud-services-team (Kanban): Upgrade/reboot labsdb* servers - https://phabricator.wikimedia.org/T209517 (Bstorm) Note: labsdb1004's remote serial terminal seems broken. labsdb1006 looked bad, but recovered after reboot. I also see permissio...
[20:55:39] DBA, Operations, Availability (MediaWiki-MultiDC), Performance-Team (Radar): Investigate solutions for MySQL connection pooling - https://phabricator.wikimedia.org/T196378 (aaron) >>! In T196378#4550382, @jcrespo wrote: > You can help on your side (mediawiki) in parallel by preparing a way (conf...
[22:09:56] DBA, Analytics, Analytics-Kanban, Data-Services: Not able to scoop comment table in labs for mediawiki reconstruction process - https://phabricator.wikimedia.org/T209031 (Milimetric) I think @Bawolff was referring to the automatic query that Sqoop generates against the table you point it at, usua...
[22:10:03] DBA, Analytics, Analytics-Kanban, Data-Services: Not able to scoop comment table in labs for mediawiki reconstruction process - https://phabricator.wikimedia.org/T209031 (Bawolff) I was assuming based on this comment >>! In T209031#4732101, @Krenair wrote: > By the way, the query discussed on IR...
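For context on the Sqoop discussion in T209031: by default Sqoop partitions an import by running a MIN()/MAX() query on the --split-by column, which is the "automatic query" referred to above and what can become expensive against the labs views. A hedged sketch of an import that makes that boundary query explicit; the connection string, credentials, table, column, and paths are placeholders, not the actual Analytics job:

```bash
# Sketch only: every value below is a placeholder, not the real Analytics job.
sqoop import \
    --connect jdbc:mysql://labsdb-analytics.example/enwiki_p \
    --username example_user \
    --password-file /user/example/.sqoop-password \
    --table comment \
    --split-by comment_id \
    --boundary-query 'SELECT MIN(comment_id), MAX(comment_id) FROM comment' \
    --num-mappers 4 \
    --target-dir /wmf/data/raw/example/comment
```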