[04:15:25] DBA, Data-Services: Consider granting `CREATE TEMPORARY TABLES` to labsdbuser - https://phabricator.wikimedia.org/T179628#3761640 (Dispenser) >>! In T179628#3755589, @jcrespo wrote: > @Dispenser Do you have a link to a code example? [[http://dispenser.info.tm/~dispenser/sources/missing_entries.sql|missin...
[04:22:41] DBA, Data-Services, Goal, cloud-services-team (FY2017-18): Added namespace IDs and names to meta_p - https://phabricator.wikimedia.org/T180558#3761644 (Dispenser)
[06:09:14] DBA, Data-Services, cloud-services-team (Kanban): Identify tools hosting databases on labsdb100[13] and notify maintainers - https://phabricator.wikimedia.org/T175096#3761719 (Quiddity) As above, I've created the list for 1003 at {P6313}
[06:32:26] DBA, Data-Services: labsdb1004 replication broken: table s51290__dpl_p.i_psub; The table 'i_psub' is full - https://phabricator.wikimedia.org/T180560#3761729 (Marostegui)
[06:40:01] DBA, Data-Services: labsdb1004 replication broken: table s51290__dpl_p.i_psub; The table 'i_psub' is full - https://phabricator.wikimedia.org/T180560#3761748 (Marostegui) First attempts to convert the table to InnoDB/Aria are not successful: ``` mysql:root@localhost [s51290__dpl_p]> alter table i_psub e...
[07:09:01] DBA, Data-Services: labsdb1004 replication broken: table s51290__dpl_p.i_psub; The table 'i_psub' is full - https://phabricator.wikimedia.org/T180560#3761774 (Marostegui) So the only way to solve this issue is: truncating the table or creating it with InnoDB. Given this is a MEMORY table, what I have don...
[07:09:51] DBA, Data-Services: labsdb1004 replication broken: table s51290__dpl_p.i_psub; The table 'i_psub' is full - https://phabricator.wikimedia.org/T180560#3761775 (Marostegui) And replication got broken with a different in memory table from the same user: ``` mysql:root@localhost [s51290__dpl_p]> show create...
[07:29:59] DBA, Data-Services: labsdb1004 replication broken: table s51290__dpl_p.i_psub; The table 'i_psub' is full - https://phabricator.wikimedia.org/T180560#3761795 (Marostegui) With this new table, the procedure of recreating it as innodb doesn't work, as the code is dropping it and recreating it all the time,...
[07:48:11] DBA, MediaWiki-extensions-WikibaseRepository, Wikidata: Drop the "wb_terms.wb_terms_language" index - https://phabricator.wikimedia.org/T179106#3761833 (Marostegui)
[08:01:05] DBA, Data-Services, Patch-For-Review: labsdb1004 replication broken: in memory tables from s51290__dpl_p - https://phabricator.wikimedia.org/T180560#3761845 (Marostegui)
[08:01:29] DBA, Data-Services, Patch-For-Review: labsdb1004 replication broken: in memory tables from s51290__dpl_p - https://phabricator.wikimedia.org/T180560#3761729 (Marostegui) And as expected, a drop and a recreation of `i_psub` also happened, recreating the table with MEMORY engine, so it got full again a...
[08:27:41] DBA, Data-Services, Patch-For-Review: labsdb1004 replication broken: in memory tables from s51290__dpl_p - https://phabricator.wikimedia.org/T180560#3761856 (Marostegui) `u_pagelinks` also broke ``` Last_Error: Could not execute Write_rows_v1 event on table s51290__dpl_p.u_pagelink...
[08:43:54] DBA, Data-Services, Patch-For-Review: labsdb1004 replication broken: in memory tables from s51290__dpl_p - https://phabricator.wikimedia.org/T180560#3761887 (Marostegui) Unfortunately, new memory tables keep coming, so I am going to filter all the tables called `i_` and `u_` because I wouldn't want t...
[08:50:13] marostegui: hola! Need to reboot thorium for kernel updates and then I'll be ready for db1107 (if you are)
[08:50:26] sure!
[08:52:09] just ran apt full-upgrade on db1107
[08:52:13] to have it with the latest updates
[09:05:42] all right thoriumg is ok
[09:05:48] *thorium :)
[09:06:05] elukey cool
[09:06:13] so, for db1046
[09:06:22] is it downtimed? eventlogging related processes stopped?
[09:06:47] nope, I was about to say that I am going to stop
[09:06:51] 1) replication from all the slaves
[09:07:02] 2) eventlogging pushing data (from eventlog1001)
[09:07:54] puppet needs to stop on replicas- it starts eventlogging automatically
[09:08:14] 0) stop puppet on replicas
[09:08:28] n) repoint proxy to the right master
[09:08:37] elukey: db1107 has no role associated?
[09:08:43] jynus: I know :)
[09:08:53] marostegui: not yet
[09:08:57] ok :)
[09:09:03] not a blocker for the transfer anyways
[09:12:05] do you know how much labsdb is compressed? 1.8 TB
[09:13:23] marostegui: all downtimed for 7200s, el_sync stopped + puppet disabled, eventlogging stopped on eventlog100
[09:13:26] *1001
[09:13:35] cool
[09:13:43] can i stop mysql on db1046 then?
[09:14:25] sure
[09:14:41] which task should I !log at
[09:15:01] I'd say https://phabricator.wikimedia.org/T177405
[09:15:13] thanks!
[09:16:16] in the meantime I can prepare the puppet config for db1107
[09:16:34] sounds good
[09:16:40] transfer started
[09:16:46] does db1107 have running the latest kernel?
[09:17:00] probably not
[09:17:01] i did an apt full-upgrade
[09:17:07] it would be nice to reboot it if not, can happen after the transfer is complete
[09:17:08] and no new kernel was installed
[09:17:16] ok, I can check the kernel running
[09:17:18] elukey@db1107:~$ uname -a
[09:17:18] Linux db1107 4.9.0-4-amd64 #1 SMP Debian 4.9.51-1 (2017-09-28) x86_64 GNU/Linux
[09:17:19] to be 100% sure
[09:17:26] yes, that is the last one
[09:17:30] ah nice :)
[09:17:43] sorry to nitpick, later is more difficult to do
[09:17:47] agreed!
[09:17:50] nono thanks for the reminders!
[09:17:58] better now than later
[09:17:58] the master is now pointing to db1047
[09:18:01] m4-master
[09:18:16] so hopefully no insertions are happening now
[09:18:35] eventlogging is stopped
[09:18:44] so should be ok
[09:19:26] what is the proxy config, did you repoint it to 1108 already?
[09:20:24] if 'you' is me nope, no idea about the proxy config :(
[09:23:47] I do not know why my connection is so bad
[09:25:49] marostegui: created https://gerrit.wikimedia.org/r/#/c/391519/2 without removing the role from db1046 (just to keep it as it is in case of fire)
[09:26:56] sounds good, I assume we can have two masters without any issues, right?
[09:28:49] I am going to check eventlogging's config but I am pretty sure that it does not use db1046 anywhere.. I might need to modify the eventlogging_sync config though
[09:29:05] when it will be up again, it needs to grab data from db1107
[09:29:17] sure, let's double check to make sure we can have two hosts with role master
[09:30:03] I'm now starting with the openssl updates I mentioned yesterday
[09:30:09] good!
[09:31:03] the mysql daemon on eventlog1001 is using m4-master.eqiad.wmnet/log in its config
[09:34:33] profile::mariadb::misc::eventlogging::replication::master_host: m4-master.eqiad.wmnet
[09:34:40] * elukey is happy
[09:34:45] haha
[09:39:48] ok so eventlogging's config does not mention db1046 anywhere
[09:39:53] only m4-master
[09:40:03] same thing for the replication scripts
[09:40:28] it makes sense, the idea is to be behind a proxy all the time
[09:40:47] the proxy and the dns will have to be updated
[09:40:54] this would be an ideal time to upgrade the proxy
[09:42:58] jynus, marostegui: not sure if you use Cumin aliases, any objections to https://gerrit.wikimedia.org/r/391521 ?
[09:43:14] (they're used by debdeploy)
[09:43:35] (and if you're not using them, you should :-P )
[09:43:41] +1 ed
[09:43:58] * marostegui confirms volans has the cumin highlight
[09:44:18] thx, merging
[09:44:29] you know I keep an eye on this channel ;)
[09:45:32] I think the dns functionality on systemd crashed for me on network error
[09:47:37] that is wrong
[09:48:03] O:mariadb::misc or O:mariadb::misc::phabricator are not core
[09:48:41] neither is O:mariadb::misc::eventlogging
[09:49:35] those aliases are wrong
[09:50:40] and if you have been updating hosts based on them you missed a lot
[09:53:36] I'm almost exclusively using db-all-codfw/db-all-eqiad since from the view of packages installed/updating things they're all pretty much alike, feel free to reshuffle the individual sub-aliases per DBA needs as needed
[09:53:51] db-all-codfw are wrong too
[09:54:12] what in particular?
[09:54:46] there are 20 or so roles related to databases, most are missing
[09:57:47] will dbstore1001 explode? https://grafana.wikimedia.org/dashboard/db/mysql?panelId=11&fullscreen&orgId=1&var-dc=eqiad%20prometheus%2Fops&var-server=dbstore1001&var-port=9104&from=now-90d&to=now
[09:58:46] marostegui: I am going to upgrade dbproxy1009 to stretch, the repoint m4-master to it while the transfer is ongoing, ok?
[09:58:57] sounds good
[09:59:00] poor dbstore1001
[09:59:08] it is begging to be converted to multi-instance
[09:59:12] that we we upgrade the m4 proxies to stretch/kernel/etc
[09:59:23] *that way
[10:06:48] multi-instance vs multisource on replication performance: https://tendril.wikimedia.org/host/view/labsdb1010.eqiad.wmnet/3306
[10:06:58] (Replication -- 24h / 5m)
[10:07:17] guess which one is the blue graph?
[10:08:35] I went through all the db roles used in site.pp, they were all covered except the new eventlogging db names, https://gerrit.wikimedia.org/r/391530 should cover that and also reduces db-core to the actual code servers
[10:08:42] :-(
[10:09:52] I will send you an amend
[10:10:18] k, thanks. need to look into something else now, but will revisit this later the day
[10:13:08] marostegui: should we merge https://gerrit.wikimedia.org/r/#/c/391519/2 only when the data transfer is completed or now is fine ?
[10:13:29] should be fine now
[10:13:49] DBA, Analytics, Operations, Patch-For-Review, User-Elukey: Prep to decommission old dbstore hosts (db1046, db1047) - https://phabricator.wikimedia.org/T156844#2987633 (ops-monitoring-bot) Script wmf-auto-reimage was launched by jynus on neodymium.eqiad.wmnet for hosts: ``` ['dbproxy1009.eqiad...
[10:15:15] will it affect transfer if firewall is deployed?
[10:15:24] good point
[10:15:27] genuine question
[10:15:33] I don't know
[10:15:36] yeah
[10:15:38] could be
[10:15:45] I think new connections will be affected but not super sure
[10:15:50] let's postpone the merge then :)
[10:15:54] ah
[10:16:05] existing should continue
[10:16:21] we have copied 430G out of 716G already :)
[10:16:25] wooowww
[10:16:34] but also sometimes when deployed for the first time network is restarted
[10:16:46] dbproxy1009 is reinstalling
[10:17:08] hopefully I can get it done for when it is finished
[10:21:19] this would be our first stretch haproxy, can we trust it?
[10:27:21] I think we may be skipping some tables uncompressed on labs due to not having the time plus later alter tables
[10:28:37] jynus: is haproxy's version too different from what we run now on jessie?
[10:28:49] haproxy is rock solid
[10:28:50] (trying to figure out the "can we trust it?")
[10:28:55] but I can check
[10:29:04] specially the versions on debian
[10:29:45] 1.7.5-2 on stretch
[10:30:15] 1.5.8-3+deb8u2 on jessie
[10:30:53] puppet code was ported to systemd long time ago
[10:31:47] wmf-auto-reimage has improved a lot apparently
[10:31:53] since last time I used it
[10:32:08] BTW, I uploaded 10.0.33, not sure if you saw it
[10:37:59] nope, didn't see it
[10:38:30] would be good to test it somewhere
[10:38:36] just leaving a slave updated and replicating
[10:38:49] I already did that
[10:38:51] :-)
[10:38:56] ah \o/
[10:38:58] in db2034?
[10:39:04] yes
[10:40:16] https://grafana.wikimedia.org/dashboard/db/mysql?orgId=1&from=now-1h&to=now&var-dc=codfw%20prometheus%2Fops&var-server=db2034&var-port=9104
[10:41:35] yeah, saw it in red this morning in tendril
[10:41:41] and i saw it was a normal shutdown
[10:41:51] by you, but didn't notice the version change :)
[10:41:54] yes, I took the time to upgrade it fully
[10:42:22] I logged it yesterday
[10:42:44] missed it :)
[10:42:57] I saw it in red this morning while having breakfast and didn't check SAL
[10:43:49] so that is the explanation
[10:43:57] thanks
[10:44:28] FYI, openssl updated across db[12]* with service restarts for all lower level services using openssl, will upgrade dbstore* and es* later the day
[10:46:47] DBA, Analytics, Operations, Patch-For-Review, User-Elukey: Prep to decommission old dbstore hosts (db1046, db1047) - https://phabricator.wikimedia.org/T156844#3762299 (ops-monitoring-bot) Completed auto-reimage of hosts: ``` ['dbproxy1009.eqiad.wmnet'] ``` and were **ALL** successful.
[10:48:08] configuration changes on the new haproxy: https://phabricator.wikimedia.org/P6315
[10:49:02] I can fix it, those seem trivial syntax changes
[10:49:18] i was going to say that they don't look like big things
[10:49:24] so everything should work as it used to
[10:58:23] now I only have to make something rather than copy the file : https://gerrit.wikimedia.org/r/391536
[11:01:42] elukey: transfer is over, you want to merge?
[11:01:47] so we can try to start mysql and see how it goes?
[11:02:08] marostegui: ack!
[11:03:22] done!
[11:03:33] running puppet
[11:09:05] this would need fixing: https://phabricator.wikimedia.org/P6316
[11:09:13] as the socket is not on /tmp anymore
[11:09:28] but on: /run/mysqld/mysqld.sock
[11:09:38] Mysql started fine, now running mysql_upgrade
[11:10:33] (we can obviously start heartbeat manually with the correct socket, but it requires a proper fix for that host)
[11:10:46] did 1108 start with socket still on /tmp?
[11:11:03] no
[11:11:07] or both servers are ok, just other service
[11:11:10] s
[11:11:21] db1108 doesn't have pt-heartbeat it is not master
[11:11:38] there is a config on mariadb
[11:11:46] that says where is the port
[11:11:53] elukey: I can see data around in db1107 finely
[11:12:22] wait until you start eventlogging data and replication
[11:12:32] we need to change the proxy config
[11:12:58] yeah, i just meant that the data is there and queryable
[11:13:41] marostegui: about https://phabricator.wikimedia.org/P6316 - I thought we had fixed that puppet code no?
[11:14:41] if os_version('debian >= stretch') {
[11:14:41] $mariadb_basedir = '/opt/wmf-mariadb101'
[11:14:42] $mariadb_socket = '/run/mysqld/mysqld.sock'
[11:16:21] looks like it is not fixed then :-)
[11:17:24] it is profile::mariadb::misc::eventlogging::database that should be included in db1107's role
[11:17:27] mmmm
[11:18:48] ah wait that thing is the heartbeat, not the mysql config right?
[11:19:06] ahhh okok, now it makes sense
[11:19:32] the class { 'mariadb::heartbeat':
[11:19:33] yeah yeah
[11:19:36] okok fixing
[11:19:39] it is not mysql
[11:20:10] don't modify mariadb::heartbeat class unless you want to bring down production
[11:20:16] it should have its parameter
[11:20:56] in fact, it defaults to $socket = '/run/mysqld/mysqld.sock',
[11:21:04] already, so something changed it in between
[11:21:23] probably because 1047/46 were so old
[11:21:26] I wouldn't dare modifying mariadb::heartbeat :D
[11:21:40] https://gerrit.wikimedia.org/r/#/c/391537/1/modules/profile/manifests/mariadb/misc/eventlogging/database.pp
[11:21:41] I have almost the proxy fixed
[11:22:19] correct
[11:22:34] or
[11:22:43] we can almost delete all those ifs
[11:22:54] well, as soon as we remove db1047 and 46
[11:22:57] yeah
[11:23:05] I'll do some clean up after them
[11:23:53] marostegui: merged!
[11:23:55] most of those parameters will not be needed once on stretch
[11:24:23] elukey: looks good now, so everything looks fine
[11:24:34] marostegui: what about grants?
[11:24:52] elukey: we copied everything from db1046
[11:24:55] grants should be the same
[11:25:06] super, going to ask my team to double check
[11:25:12] cool
[11:26:55] I have almost fixed haproxy
[11:26:59] with the old parameters
[11:27:39] not sure it will validate puppet style :-/
[11:29:49] it works
[11:30:58] if you guys are ok I'd re-enable notifications for db1107 and disable them for db1046 https://gerrit.wikimedia.org/r/#/c/391540/1/hieradata/regex.yaml
[11:31:09] (I'd like to keep mariadb stopped on db1046)
[11:31:18] elukey: I started it, shall I stop it?
[11:31:30] it is up
[11:31:47] marostegui: It might be good so we'll know straight away if something still points to it, but I'll let you guys decide
[11:31:49] sure
[11:32:00] if you stop it, we can test dbproxy1009 works as intended :-)
[11:32:03] fine by me
[11:32:05] haha
[11:32:07] let me stop it
[11:32:25] log it as it will warn on the channel
[11:33:31] elukey: stopped, feel free to disable notifications there and enable them on db1107
[11:33:45] super!
[11:38:28] see comment on -operations
[11:57:07] marostegui: all seems good on db1107
[11:57:16] great!!
[11:57:38] my team seems not be able to ssh to that db (and it was probably the same for db1046), so we might discuss about granting analytics access to it?
[11:57:44] I sent two gerrit commits
[11:57:51] ssh to databases are only for roots
[11:58:11] all right :)
[11:58:19] I mean, not forever
[11:58:33] until issues to be commented privately are solved
[11:58:40] ack!
[11:59:00] DBA, MediaWiki-extensions-WikibaseRepository, Wikidata: Drop the "wb_terms.wb_terms_language" index - https://phabricator.wikimedia.org/T179106#3762507 (Marostegui)
[11:59:09] why would someone that is non-root would want to access a database anyway?
[11:59:27] to check data for example
[11:59:41] check data?
[11:59:55] elaborate
[12:00:09] access to local filesystem, I mean
[12:00:23] which is different from access to the database service
[12:00:34] ahhh no sorry, I meant mysql
[12:00:44] elukey: if your team wants root on the db fs, let's get them into the alerts too :-)
[12:00:51] well, duh, of course that should be granted
[12:00:56] it should be already
[12:01:02] even with more or less permissions
[12:01:20] the database host is what I mean is usually root restricted
[12:01:40] the master to be fair, I would say it is for protection
[12:01:57] I agree 100% with that, I didn't explain myself correctly
[12:02:04] but that can be negotiated
[12:02:21] e.g. people can have admin access to the database
[12:02:30] but then we do not give support to it
[12:02:42] fair enough
[12:02:52] so that is open for discussion no problem
[12:03:07] what it is normally no is filesystem access
[12:03:54] all right
[12:04:03] not the master limitations are not for analytics
[12:04:19] no one should access to a master,ever, we normally do not connect to it
[12:04:32] just puppet or automatic orchestration
[12:05:05] * elukey nods
[12:05:17] check the 2 gerrit commits I proposed
[12:05:47] DBA, MediaWiki-Change-tagging, MediaWiki-Database, MediaWiki-Logging, Wikimedia-log-errors: Error: 2013 Lost connection to MySQL server during query on IndexPager::buildQueryInfo (LogPager) - https://phabricator.wikimedia.org/T121306#3762562 (Marostegui) Open>Resolved >>! In T121306#1...
[12:07:03] ^I would have merged it
[12:07:28] I think after apiquery contributions, that is the second type of query to research, but it rarely happens
[12:08:44] merge it with what?
[12:09:06] https://phabricator.wikimedia.org/T121306#1875248
[12:09:11] not a big deal
[12:09:17] don't waste time
[12:09:21] haha
[12:09:21] ok
[12:10:23] so ok with the plan proposed on the commits or do you think it is too risky?
[12:10:59] 1009 is already CRITICAL check_failover servers up 1 down 1
[12:11:11] it didn't complain because it was still downtime'd
[12:11:26] sorry, which commits
[12:11:27] ?
[12:11:35] he he, too many things happening
[12:11:49] https://gerrit.wikimedia.org/r/#/c/391543/
[12:11:57] https://gerrit.wikimedia.org/r/#/c/391545/
[12:12:19] ah
[12:12:53] yes, as long as elukey is fine with db1107 as a master already (he already +1ed), so here comes my +1
[12:12:56] I can leave the old server not reimaged for a while
[12:13:22] yeah, let's leave it not reimaged for a while
[12:13:31] well, he +1 it, but didn't comment on when to deploy
[12:14:06] if you are good I am ok to proceed
[12:14:07] as long as eventlogging is not running, we can delay the decision
[12:14:24] if he wants to check data or anything there is no rush
[12:15:32] to be fair, data was copied in binary form, there is very little room for errors unless on the wire corruption
[12:15:44] or something that doesn't work on mariadb 10.1, maybe
[12:15:48] ?
[12:16:08] db1108 has been working fine for 2 weeks already
[12:16:26] yeah, but it only replicates, it is not a "master"
[12:16:32] true
[12:16:43] I would find compatibility issues from applications very weird
[12:16:58] and operational issues we have been fighting for months already with 10.1
[12:17:14] I mean, labsdb are on 10.1 and are our "hardest" to admin servers
[12:17:17] let's go ahead I would say
[12:17:35] we keep 1046
[12:17:44] is not like we are going to delete them now
[12:18:27] I agree, 1046 could be kept as it is in stand by
[12:18:33] I am going to merge both patches and reload haproxy (reloads are manual to avoid errors)
[12:18:39] sounds good
[12:19:02] I can even delay the reload on dbproxy1004
[12:21:34] mariadb,db1107,0,0,0,0,,0,0,0,,0,,0,0,0,0,UP
[12:21:43] mariadb,db1108,0,0,0,0,,0,0,0,,0,,0,0,0,0,UP
[12:22:07] mysql --skip-ssl -h dbproxy1009.eqiad.wmnet -e "SELECT @@hostname"
[12:22:12] db1107
[12:22:24] going to deploy the dns change, ok?
[12:22:32] go!
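Editor's note on the two `mariadb,db110x,...,UP` rows pasted at 12:21: they look like rows from HAProxy's CSV statistics output, where the status column is the 18th field (index 17). That reading is an assumption, since the log does not say which command produced them. Under that assumption, a minimal Python sketch for checking that every backend reports UP could look like this (the helper name `backend_status` is hypothetical, not part of any tool mentioned in the channel):

```python
import csv
import io

# Assumption: rows follow HAProxy's CSV stats layout, in which the
# "status" column is field index 17 (pxname=0, svname=1, ...).
STATUS_FIELD = 17


def backend_status(csv_text):
    """Return a dict mapping server name -> status for each CSV row."""
    statuses = {}
    for row in csv.reader(io.StringIO(csv_text)):
        # Skip blank lines, a "# pxname,..." header, and truncated rows.
        if not row or row[0].startswith("#") or len(row) <= STATUS_FIELD:
            continue
        statuses[row[1]] = row[STATUS_FIELD]
    return statuses


# The two rows exactly as pasted in the channel.
sample = """\
mariadb,db1107,0,0,0,0,,0,0,0,,0,,0,0,0,0,UP
mariadb,db1108,0,0,0,0,,0,0,0,,0,,0,0,0,0,UP
"""

statuses = backend_status(sample)
# Anything other than a plain "UP" is reported, including transitional states.
down = [name for name, status in statuses.items() if status != "UP"]
print(statuses)
print("all up" if not down else "down: " + ", ".join(down))
```

This only inspects the proxy's view of its backends; the `SELECT @@hostname` query through the proxy, as run in the channel, remains the end-to-end check of which server actually answers.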
[12:23:23] marostegui: now that I realize, I may have interrupt you doing all this with elukey
[12:23:33] sorry, I was trying to help and got excited
[12:23:56] jynus: hahahaha not at all!
[12:24:04] ahahahah
[12:25:47] we may want to revisit the haproxy config, we may be dragging old formats and so
[12:25:59] dig m4-master.eqiad.wmnet -> dbproxy1009.eqiad.wmnet.
[12:26:05] mysql --skip-ssl -h m4-master.eqiad.wmnet -e "SELECT @@hostname"
[12:26:12] db1107
[12:26:23] are both 1107 and 1108 in RW mode?
[12:26:53] yep
[12:27:08] then "our part" should be done
[12:27:15] meaning the db migration itself
[12:27:24] yes, from our side we are good
[12:27:48] with 10.1 we could reevaluate the custom replication
[12:28:01] maybe the parallel replication would work again?
[12:28:04] but not something to do no
[12:28:10] *now
[12:28:43] It is also nice to get rid of ubuntu servers, finally
[12:29:13] I have not yet reloaded dbproxy1004
[12:29:21] so make sure dns is up to date on clients
[12:29:28] yeah, can't wait to decom db1046 and 47
[12:29:32] db1047 is down, actually, so no problem
[12:30:27] deleting the 4 downtimes of dbproxy1004
[12:30:30] *9
[12:33:17] I am leaving for lunch, ping me or call me if there is any issue
[12:33:25] enjoy!
[12:34:30] marostegui: do you want to get lunch and proceed or prefer to continue
[12:34:33] ??
[12:34:33] BTW, remember I asked the eventlogging_script to read from config- hopefully the change of socket will work now transparently
[12:34:54] it does :)
[12:34:58] elukey: i would prefer to continue, but…what do I have to do? :)
[12:35:20] marostegui: first of all moral support to me :D
[12:35:26] you always have that!
[12:35:29] hahahaha
[12:35:49] so we are basically ready to re-enable the eventlogging crew right?
[12:35:55] m4-master updated etc..
[12:35:59] from our side, yep
[12:36:33] all right, lemme try to re-enable eventlogging then
[12:38:01] good
[12:40:40] it seems working :)
[12:41:04] nice!
[12:41:31] going keep watching, thanks a lot for the work done!
[12:41:53] * marostegui waves good bye to db1047 and db1046
[12:47:14] eventlogging sync works too
[12:47:19] (on db1108)
[12:47:20] \o/
[12:48:37] awesome!
[13:18:42] <_joe_> I'm taking a lunch break too, will be back around 15:00Z
[13:43:33] DBA, Data-Services: Consider granting `CREATE TEMPORARY TABLES` to labsdbuser - https://phabricator.wikimedia.org/T179628#3762918 (Marostegui) @Dispenser but on that missing_entries.sql there are tons of things that wouldn't work with the current and new replicas if I am not missing something. You are ba...
[13:47:28] DBA, Data-Services: Consider granting `CREATE TEMPORARY TABLES` to labsdbuser - https://phabricator.wikimedia.org/T179628#3762958 (jcrespo) @Marostegui The idea is to create those tables (or some more generic) in advance in sanitarium- like with T59617.
[13:51:01] I am going to shutdown labsdb1009 now that labsdb1010 has caught up
[13:51:50] ok!
[14:23:54] DBA, Cloud-Services: Pagelinks table contains a row having pl_from = 0 - https://phabricator.wikimedia.org/T98110#3763072 (Marostegui) Open>Resolved a:Marostegui This is not present anywhere on dewiki core hosts ``` root@neodymium:/home/marostegui/git/software/dbtools# cat s5.hosts | grep -v...
[14:45:05] DBA: Tendril DB graphs say 24hr but at 12hr - https://phabricator.wikimedia.org/T137654#3763106 (jcrespo) Actually, the graphs should return 24 hours, not sure why they are returning only 12 :-/ ``` $hours = 24; ... sprintf('gsl.server_id = %d and gsl.stamp > now() - interval 24 hour', $server_id)) ```
[14:56:36] DBA, Data-Services, Patch-For-Review: labsdb1004 replication broken: in memory tables from s51290__dpl_p - https://phabricator.wikimedia.org/T180560#3763116 (jcrespo) Ignore the full database, notify the breakage to the user (and that he will not have backups/redundancy, as done with the other tools...
[15:15:42] I am going to do a YOLO with tendril: https://gerrit.wikimedia.org/r/#/c/391558/1
[15:15:58] heads up in case I break it
[15:48:08] DBA, Cloud-Services: Queries to wikidatawiki_p.wb_items_per_site on *.web.db.svc.eqiad.wmflabs are timing out - https://phabricator.wikimedia.org/T180564#3763287 (MusikAnimal)
[15:50:48] DBA, Cloud-Services: Queries to wikidatawiki_p.wb_items_per_site on *.web.db.svc.eqiad.wmflabs are timing out - https://phabricator.wikimedia.org/T180564#3761807 (jcrespo) Can you reconnect? Servers now go up and down all the time (T179244#3763271)- if a connection gets stuck is your (client's) responsib...
[15:54:27] DBA, Cloud-Services: Queries to wikidatawiki_p.wb_items_per_site on *.web.db.svc.eqiad.wmflabs are timing out - https://phabricator.wikimedia.org/T180564#3763311 (MusikAnimal) >>! In T180564#3763293, @jcrespo wrote: > Can you reconnect? Servers now go up and down all the time (T179244#3763271)- if a conn...
[15:55:54] DBA, Cloud-Services: Queries to wikidatawiki_p.wb_items_per_site on *.web.db.svc.eqiad.wmflabs are timing out - https://phabricator.wikimedia.org/T180564#3763315 (MusikAnimal) And again, it seems it's only the `wikidatawiki_p.wb_items_per_site` table that's affected. Very odd!
[16:16:22] if manuel happens to be around enwiki-20160801.monthly_scores.trimmed.tsv created by you maybe?
[16:16:30] on labsdb1009 /srv
[16:32:29] ok, I am recovering labsdb1009, it should take around 7 hours
[16:35:11] jynus: mmmm i never create .tsv files
[16:35:49] :-)
[16:36:01] I deleted it
[16:36:06] it was old
[16:36:08] sure :)
[17:19:45] marostegui: I was looking at T146591, and it should be "done" other than doing the index fix across the cluster for consistency. Since we already truncated that table (cf: T150306) it should be *super* quick to get done. Do we want a separate #DBA task, or does the current one suffice?
[17:19:46] T150306: truncate l10n_cache table on WMF wikis - https://phabricator.wikimedia.org/T150306
[17:19:46] T146591: Add a primary key to l10n_cache - https://phabricator.wikimedia.org/T146591
[17:22:03] no_justification: if the patch is merged and all is ready for the PK to be added, just add the blocked-on-schema-change tag
[17:22:13] I have to go offline now, sorry :-)
[17:22:47] Okie dokie I'll update the task
[17:23:59] Blocked-on-schema-change, MediaWiki-Database, Patch-For-Review: Add a primary key to l10n_cache - https://phabricator.wikimedia.org/T146591#3763567 (demon) Per IRC, just need to run the following (probably just the latter, since we already truncated the tables) across all wikis: ``` lang=sql DELETE...
[17:56:09]