[00:19:10] DBA, Operations, Patch-For-Review: Reimage db1065 and db1066 - https://phabricator.wikimedia.org/T156005#2961400 (ops-monitoring-bot) Script wmf_auto_reimage was launched by jynus on neodymium.eqiad.wmnet for hosts: ``` ['db1066.eqiad.wmnet'] ``` The log can be found in `/var/log/wmf-auto-reimage/201...
[00:31:56] DBA, Operations, netops, Patch-For-Review: Switchover s1 master db1057 -> db1052 - https://phabricator.wikimedia.org/T156008#2967405 (jcrespo) I have upgraded all packages except wmf-mariadb10 and restarted the server for kernel update.
[00:41:08] DBA, Operations, Patch-For-Review: Reimage db1065 and db1066 - https://phabricator.wikimedia.org/T156005#2967438 (ops-monitoring-bot) Completed auto-reimage of hosts: ``` ['db1066.eqiad.wmnet'] ``` and were **ALL** successful.
[01:05:07] DBA, MediaWiki-Change-tagging, Operations: db1072 change_tag schema and dataset is not consistent - https://phabricator.wikimedia.org/T156166#2967493 (TTO) It looks like the author of `ChangeTags::updateTags` intended for the DB unique key to act as protection against duplicate entries. This pattern...
[01:08:25] DBA, Labs, wikitech.wikimedia.org: SemanticMediaWiki tries to create temporary tables, but can't as wikiuser is restricted - https://phabricator.wikimedia.org/T110981#1591340 (jcrespo) If this was me, I would close it as won't fix- this fails because the user doesn't have permissions to do `CREATE TE...
[01:12:44] DBA, MediaWiki-Change-tagging, Operations: db1072 change_tag schema and dataset is not consistent - https://phabricator.wikimedia.org/T156166#2967519 (jcrespo) > is widespread in MediaWiki core And we should kill them with fire! :-P I got the architecture committee to agree with me and document that...
[01:54:08] DBA, Labs: Labs database replica drift - https://phabricator.wikimedia.org/T138967#2415416 (Revent) I was asked to mention this here, although I'm unsure if it's actually a 'replication' issue. https://quarry.wmflabs.org/query/14916 reports 35 entries in the transcode table, all for files that were uplo...
[02:02:55] Blocked-on-schema-change, Collaboration-Team-Triage, Notifications, Patch-For-Review, Schema-change: Add primary key to echo_notification table - https://phabricator.wikimedia.org/T136428#2967652 (Cenarium)
[02:19:53] DBA, Operations, Patch-For-Review: Reimage db1065 and db1066 - https://phabricator.wikimedia.org/T156005#2967668 (jcrespo) This is all done, only pending db1066 to catch up and repool.
[02:30:04] DBA, Operations, Patch-For-Review: Set up TLS for MariaDB replication - https://phabricator.wikimedia.org/T111654#2967685 (jcrespo) MySQLs with no SSL ``` $ sudo salt -C 'G@cluster:mysql' cmd.run 'mysql --skip-ssl -e "SELECT @@ssl_ca"' | grep -c 'NULL' 13 ``` MySQL with expired TLS cert: ``` $ sudo s...
[03:14:56] DBA, Labs: Make watchlist table available as curated foo_p.watchlist_count on labsdb - https://phabricator.wikimedia.org/T59617#2967820 (MZMcBride) Another request for this data for `frwiki_p`: .
[06:56:00] DBA, Labs, Operations, netops, Patch-For-Review: DBA plan to mitigate asw-c2-eqiad reboots - https://phabricator.wikimedia.org/T155999#2968016 (Marostegui)
[06:56:03] DBA, Operations, Patch-For-Review: Set up TLS for MariaDB replication - https://phabricator.wikimedia.org/T111654#2968017 (Marostegui)
[06:56:06] DBA, Operations, Patch-For-Review: Reimage db1065 and db1066 - https://phabricator.wikimedia.org/T156005#2968014 (Marostegui) Open>Resolved I have repooled the host.
Awesome job @jcrespo
[07:04:43] DBA, Labs, Operations, netops, Patch-For-Review: DBA plan to mitigate asw-c2-eqiad reboots - https://phabricator.wikimedia.org/T155999#2968020 (Marostegui) For the record and tracking purposes: after lots of hours and hassle we were able to switch db1095's (new sanitarium) master from db1052...
[07:07:40] DBA, Operations, netops, ops-eqiad: Move db1054 to C3 - https://phabricator.wikimedia.org/T156225#2968022 (Marostegui)
[07:08:04] DBA, Labs, Operations, netops, Patch-For-Review: DBA plan to mitigate asw-c2-eqiad reboots - https://phabricator.wikimedia.org/T155999#2961118 (Marostegui)
[07:09:35] DBA, MediaWiki-Change-tagging, Operations: Reimage and clone db1072 - https://phabricator.wikimedia.org/T156226#2968039 (Marostegui)
[07:38:52] DBA, Operations, ops-codfw: db2060 not accessible - https://phabricator.wikimedia.org/T156161#2968060 (Marostegui) a:Papaul I had rebooted the server as it wasn't responding. ILO logs aren't showing anything, neither system logs. However yesterday, as it can be seen on the original ticket message,...
[07:49:18] DBA, MediaWiki-Change-tagging, Operations: Reimage and clone db1072 - https://phabricator.wikimedia.org/T156226#2968079 (Marostegui) We should take the opportunity, now that it is depooled, to move it to another rack as all the API hosts for s1 are on D1, including this host. We can move it to B2 for...
[07:49:24] DBA, Operations: Reimage and clone db1072 - https://phabricator.wikimedia.org/T156226#2968080 (Marostegui)
[07:50:00] hi folks
[07:50:16] hey
[07:56:25] C2 rebooted overnight apparently
[07:56:34] so I think next step now is to go ahead with the replacement
[07:56:34] yeah
[07:56:41] at 1:20 I am checking the slaves logs
[07:56:47] 1:20:01 [Note] Slave I/O thread: Failed reading log event,
[07:56:52] to do that we'll have to incur a longer downtime
[07:56:59] We are aiming for the switchover tomorrow 7:00utc
[07:57:00] how's the progress on emptying it out of critical databases?
[07:57:15] I have the email ready, just waiting for jaime to wake up to email ops about it :)
[07:57:38] we have done all the critical tasks already yesterday, it was a loooooong day
[07:57:58] so we just need to prepare a few things and we will be good to go for tomorrow
[08:03:13] ok
[08:03:18] i hope it'll stay up until then :/
[08:03:48] the future new master is out of the rack already
[08:03:53] so that is good
[08:03:59] (just in case it dies for good)
[10:22:33] DBA, MediaWiki-Database, Patch-For-Review, PostgreSQL, Schema-change: Some tables lack unique or primary keys, may allow confusing duplicate data - https://phabricator.wikimedia.org/T17441#2968331 (Liuxinyu970226)
[10:26:00] DBA, Labs, Tool-Labs: Reset password for database user p50380g50491 - https://phabricator.wikimedia.org/T155902#2968376 (Marostegui) >>! In T155902#2965899, @Tb wrote: > Great thanks. although I missed one in the list above; can you grant all to s51111 on p50380g50491_inconsistent_redirects on s1.la...
[11:28:40] marostegui: ferm is probably down to a startup bug on db1051, shall I start it or is it stopped explicitly for maintenance?
[11:28:53] DBA, Labs, Tool-Labs: Reset password for database user p50380g50491 - https://phabricator.wikimedia.org/T155902#2968484 (Tb) Open>Resolved
[11:29:04] moritzm: nope, we haven't done anything on that host, so feel free to start it :)
[11:29:40] ok, done
[11:29:48] thanks :)
[11:55:34] DBA, Operations, netops, Patch-For-Review: Switchover s1 master db1057 -> db1052 - https://phabricator.wikimedia.org/T156008#2968514 (Marostegui) This will be happening Thursday 25th at 07:00 UTC
[12:00:33] DBA, Labs: Make watchlist table available as curated foo_p.watchlist_count on labsdb - https://phabricator.wikimedia.org/T59617#2968516 (jcrespo) This is actually done on all wikis- but for some reason, there is a bug and the views have not been regenerated. ``` $ mysql --skip-ssl -h labsdb1001.eqiad....
[14:45:07] DBA, Labs: Make watchlist table available as curated foo_p.watchlist_count on labsdb - https://phabricator.wikimedia.org/T59617#2968832 (chasemp) @jcrespo, for now the runs are not automated. I think this is good to go now and took: > maintain-views --all-databases --table watchlist_count --replace-all
[14:46:46] DBA, Labs: Make watchlist table available as curated foo_p.watchlist_count on labsdb - https://phabricator.wikimedia.org/T59617#2968834 (jcrespo) @chasemp Thank you very much! I think I can run that on my own, at least for frwiki.
[14:51:57] DBA, Labs: Make watchlist table available as curated foo_p.watchlist_count on labsdb - https://phabricator.wikimedia.org/T59617#2968839 (jcrespo) Oh, sorry, I misunderstood it. You run it already. @MZMcBride can you test it, even if T59617#2893932 is still pending? ``` $ mysql --skip-ssl frwiki_p -e "...
[16:10:45] DBA, Operations, netops, ops-eqiad, Patch-For-Review: Move db1054 to A2 - https://phabricator.wikimedia.org/T156225#2969189 (Marostegui)
[16:25:03] I am not sure this is right: https://phabricator.wikimedia.org/diffusion/MW/browse/master/tests/phpunit/includes/db/DatabaseMysqlBaseTest.php;b843994408cd0b4d9f2676ae87225258e0497913$261
[16:26:14] DBA, Operations, netops, ops-eqiad, Patch-For-Review: Move db1054 to A3 - https://phabricator.wikimedia.org/T156225#2969260 (Marostegui)
[16:26:31] but maybe the comparison is ok, but there is a special check, I cannot remember how that ended up
[16:26:41] DBA, Operations, netops, ops-eqiad, Patch-For-Review: Move db1054 to A3 - https://phabricator.wikimedia.org/T156225#2968022 (Marostegui) in the end it will go to A3 as Chris found some issues on the racks we previously selected.
[16:30:14] actually, it works
[16:30:27] I should have more faith
[16:31:59] https://phabricator.wikimedia.org/diffusion/MW/browse/master/includes/libs/rdbms/database/DatabaseMysqlBase.php;b843994408cd0b4d9f2676ae87225258e0497913$826
[16:33:31] or does it?
[16:33:34] I'm guessing helium jamming its NIC is moved databases being backed up? cc akosiaris
[16:33:37] it returns null
[16:33:45] helium?
[16:33:50] the bacula machine
[16:34:21] no, in theory that should not be affected at all
[16:35:05] oh ok, thanks jynus
[16:35:19] let me know if there is a problem, it could be real
[16:35:40] what did you see?
[16:36:03] godog: no, it's syncing to heze for the precise => jessie upgrade
[16:36:17] so it's me and there's no reason to worry
[16:36:30] ah, I misunderstood the jamming
[16:36:43] it should last a while btw
[16:36:43] I thought you meant connectivity problems
[16:36:44] DBA, Labs: Labs database replica drift - https://phabricator.wikimedia.org/T138967#2415416 (Krenair) @Revent: As the production database servers return the same result, this has nothing to do with labs replication.
[16:36:48] like.... 2-3 days :P
[16:36:53] :-)
[16:37:12] akosiaris: ah! ok thanks
[16:37:16] sorry for the noise
[16:37:35] I was going through the inbox and noticed the librenms emails
[16:37:46] akosiaris, will the host movement affect encryption keys?
[16:38:30] godog: actually you are right.. I should try to rate-limit a bit
[16:38:45] jynus: no, encryption keys are based on the client, not the storage host
[16:39:09] DBA, Labs: Labs database replica drift - https://phabricator.wikimedia.org/T138967#2969276 (jcrespo) @Revent then I misunderstood you, we should file a proper, standalone bug.
[16:39:29] nice
[16:40:16] akosiaris: or switch to 10G !
[16:40:20] see what I did there?
[16:40:52] shouldn't backups be on 10G by default?
[16:42:24] we will complain the day when we have to do an emergency recovery
[16:42:59] hehehe
[16:47:02] ok capped at 600Mbps
[16:48:11] sorry
[16:48:18] I didn't actually suggest that
[16:48:23] I did
[16:48:23] please revert
[16:48:36] I meant that you should have more bandwidth available
[16:48:44] not the other way around
[16:49:09] don't worry, it's a one off copying around for the helium migration to jessie
[16:49:09] we can cap it when there is an issue
[16:49:19] no need to do it now
[16:49:26] niah it's ok. I don't mind
[16:49:38] it's nothing urgent anyway
[16:49:39] yes, but I do think is a bad idea
[16:49:46] capping the rsync ?
[16:49:50] you should use all resources
[16:49:53] in fact it's a good idea
[16:50:05] cause it leaves resources available for the actual backups
[16:50:07] no need to add extra days without a reason
[16:50:26] ok, your call
[16:50:53] and I really support it being on a 10GB network
[16:51:02] that we can do
[16:51:49] main storage is offsite, though?
[16:51:57] right?
[16:52:21] not sure how is the bandwidth between dcs
[16:55:27] DBA, Operations, netops, ops-eqiad, Patch-For-Review: Move db1054 to A3 - https://phabricator.wikimedia.org/T156225#2969337 (Marostegui) Open>Resolved a:Cmjohnson db1054 has been moved. DNS updated db-eqiad,codfw files updated mysql and replication started finely. tendril updated...
[16:55:34] DBA, Labs, Operations, netops, Patch-For-Review: DBA plan to mitigate asw-c2-eqiad reboots - https://phabricator.wikimedia.org/T155999#2969340 (Marostegui)
[16:57:08] DBA, Operations: Move db1073 to B3 - https://phabricator.wikimedia.org/T156126#2969346 (Marostegui) This is probably not required anymore as we will do this: T156226 Will leave it open for now as it is not a bad idea to have the 3 api host in 3 different racks anyways. We will evaluate tomorrow.
[17:37:54] !log restarting and upgrading db2060
[17:37:54] Not expecting to hear !log here
[17:40:04] DBA, Labs, Tool-Labs: enwiki_p replica on s1 is corrupted - https://phabricator.wikimedia.org/T134203#2969538 (Superyetkin) I am getting timeout errors on `labsdb-web.eqiad.wmnet` .
[17:44:16] DBA, Labs, Tool-Labs: enwiki_p replica on s1 is corrupted - https://phabricator.wikimedia.org/T134203#2257889 (chasemp) >>! In T134203#2969538, @Superyetkin wrote: > I am getting timeout errors on `labsdb-web.eqiad.wmnet` . Can you give us some more details? From where, etc?
[17:49:43] DBA, Labs, Tool-Labs: enwiki_p replica on s1 is corrupted - https://phabricator.wikimedia.org/T134203#2969585 (Superyetkin) [[http://tools.wmflabs.org/superyetkin/test.php | Here]] is a page where the connection method (mysql_connect) is being called.
[17:50:42] DBA, Labs, Tool-Labs: enwiki_p replica on s1 is corrupted - https://phabricator.wikimedia.org/T134203#2969587 (jcrespo) also, does this mean labsdb-analytics.eqiad.wmnet works for you?
[17:55:10] DBA, Labs, Tool-Labs: Labs users reporting timouts when connecting to labsdb-web.eqiad.wmnet - https://phabricator.wikimedia.org/T156285#2969602 (jcrespo)
[17:55:35] DBA, Labs, Tool-Labs: enwiki_p replica on s1 is corrupted - https://phabricator.wikimedia.org/T134203#2257889 (jcrespo) Let's move to a different ticket: T156285
[18:23:05] DBA, Operations, ops-codfw: db2060 not accessible - https://phabricator.wikimedia.org/T156161#2969747 (jcrespo) I have restarted db2060 because there is no reason to have mysql down there- if it was corrupted, which I believe it shouldn't, we would find out, and if it isn't we lose nothing. We can al...
[18:23:27] DBA, Operations, ops-codfw: db2060 not accessible - https://phabricator.wikimedia.org/T156161#2969748 (jcrespo) p:Triage>Normal
[18:24:55] DBA, Labs, Tool-Labs: Labs users reporting timouts when connecting to labsdb-web.eqiad.wmnet - https://phabricator.wikimedia.org/T156285#2969750 (chasemp) I'll look into this today if I can or very shortly thereafter
[18:40:39] DBA, Labs: Make watchlist table available as curated foo_p.watchlist_count on labsdb - https://phabricator.wikimedia.org/T59617#2969823 (MZMcBride) >>! In T59617#2968839, @jcrespo wrote: > @MZMcBride can you test it, even if T59617#2893932 is still pending? Yay, it works! Output from `frwiki_p`:
DBA, Labs, Tool-Labs: Labs users reporting timouts when connecting to labsdb-web.eqiad.wmnet - https://phabricator.wikimedia.org/T156285#2969602 (Krenair) I'm guessing this will be #netops filtering between production and labs, though I'm surprised this only came up after actual users were given the...
[20:57:31] DBA, Labs, Tool-Labs: Labs users reporting timouts when connecting to labsdb-web.eqiad.wmnet - https://phabricator.wikimedia.org/T156285#2970337 (chasemp) Open>Resolved a:chasemp @jcrespo should be sorted now
[21:03:59] Blocked-on-schema-change, DBA, Community-Tech, MediaWiki-extensions-PageAssessments: Update PageAssessment schemas for subprojects in production - https://phabricator.wikimedia.org/T156305#2970391 (kaldari)
[22:31:46] Blocked-on-schema-change, DBA, Community-Tech, MediaWiki-extensions-PageAssessments: Update page_assessments_projects schema for subprojects in production - https://phabricator.wikimedia.org/T156305#2970679 (kaldari) a:jcrespo
[22:36:30] Blocked-on-schema-change, DBA, Community-Tech, MediaWiki-extensions-PageAssessments: Update page_assessments_projects schema for subprojects in production - https://phabricator.wikimedia.org/T156305#2970690 (kaldari) FYI, this should be very quick and easy. The extension is only installed on 3 wi...