[07:14:56] DBA, Operations: Investigate db1082 crash - https://phabricator.wikimedia.org/T145533#2635833 (MoritzMuehlenhoff) The stack trace sounds like a broken disk controller (or possibly broken RAM). I'd say let Chris do a hardware check.
[07:20:24] DBA, Operations: Hardware check - https://phabricator.wikimedia.org/T145607#2635842 (Marostegui)
[07:20:47] Thanks moritzm - as you can see I have created the subtask :)
[07:25:00] I think you could also just reassign the bug to ops-eqiad
[07:30:14] moritzm: Is that Chris' team?
[07:30:23] I don't really know :)
[07:31:17] ops-eqiad and ops-codfw are used for the local data centre engineers
[07:31:36] Chris is at the eqiad data centre and Papaul at codfw
[07:32:03] so if there's an issue which requires physical hardware intervention, just add the projects to the task
[07:32:32] Ah right, thanks for the explanation :)
[07:35:00] DBA, Operations, ops-eqiad: Hardware check - https://phabricator.wikimedia.org/T145607#2635861 (Marostegui) a:Cmjohnson>None
[07:35:22] And added ops-eqiad as a tag :)
[07:35:52] DBA, Operations, ops-eqiad: db1082 hardware check - https://phabricator.wikimedia.org/T145607#2635865 (jcrespo)
[07:38:38] DBA, Operations: Investigate db1082 crash - https://phabricator.wikimedia.org/T145533#2635866 (jcrespo) Thank you @MoritzMuehlenhoff for your incredibly quick evaluation, I didn't even check the full stacktrace, you were really helpful. I will unsubscribe you so you do not suffer spam from the rest of t...
[07:40:10] DBA, Operations: Investigate db1082 crash - https://phabricator.wikimedia.org/T145533#2635885 (jcrespo) @Marostegui While we wait, probably we can check the lifecycle hardware logs.
[07:40:28] ping me when you are around, marostegui
[07:41:10] jynus: I am :)
[08:50:21] I am checking db1081 hw log now
[09:01:44] db1082?
[09:03:23] yes, 82
[09:03:39] 81 is the one doing https://tendril.wikimedia.org/report/slow_queries_checksum?checksum=2d156b90c546938661e802f474cd97a9&host=%5Edb1081&user=wikiuser&schema=wik&hours=1
[09:03:45] 82 is the crashed one
[09:03:52] I mixed them up
[09:52:00] DBA, Operations: Drop PovWatch extension-related database tables from Wikimedia wikis - https://phabricator.wikimedia.org/T54924#2636302 (Marostegui) s1, enwiki, db1073 ``` MariaDB db1073 enwiki > rename table povwatch_log to TO_DROP_povwatch_log; Query OK, 0 rows affected (0.10 sec) MariaDB db1073 e...
[10:18:34] DBA, CatWatch, MediaWiki-General-or-Unknown, TCB-Team, Wikimedia-log-errors: SELECT /* CategoryMembershipChangeJob::run 127.0.0.1 */ GET_LOCK('CategoryMembershipUpdates:XXXX', 10) AS lockstatus - https://phabricator.wikimedia.org/T133801#2636372 (jcrespo) @hashar yesterday we had a crashed sl...
[10:36:11] DBA, MediaWiki-API, MediaWiki-Page-deletion: Cannot delete two pages with large histories even having the appropriate permissions to do so - https://phabricator.wikimedia.org/T145630#2636396 (MarcoAurelio)
[10:39:09] AaronSchulz, I couldn't agree more with many of the patches you do, but I am much more pessimistic about our code and infrastructure behaving well :-)
[10:40:44] DBA, CatWatch, MediaWiki-General-or-Unknown, TCB-Team, and 2 others: SELECT /* CategoryMembershipChangeJob::run 127.0.0.1 */ GET_LOCK('CategoryMembershipUpdates:XXXX', 10) AS lockstatus - https://phabricator.wikimedia.org/T133801#2636421 (hashar) https://gerrit.wikimedia.org/r/310514 changes the...
[10:41:51] DBA, Operations: Investigate db1082 crash - https://phabricator.wikimedia.org/T145533#2636424 (Marostegui) Unfortunately the ILO isn't showing anything relevant hardware-wise between the crash and when we power cycled the server. This is the first record from yesterday which is basically when we connected...
[10:46:29] ^you got the same conclusions I did, sorry I did not update the ticket
[10:46:38] busy with many other things at the same time
[10:48:22] no worries jynus
[10:49:18] I will assign that ticket to myself so if there is anything people need, they can ping me directly so you are not disturbed (even more)
[10:50:15] DBA, MediaWiki-API, MediaWiki-Page-deletion, Operations: Cannot delete two pages with large histories even having the appropriate permissions to do so - https://phabricator.wikimedia.org/T145630#2636434 (MarcoAurelio)
[10:53:05] DBA, Operations: Investigate db1082 crash - https://phabricator.wikimedia.org/T145533#2636435 (Marostegui) a:Marostegui
[10:56:31] DBA, Operations: Investigate db1082 crash - https://phabricator.wikimedia.org/T145533#2636440 (jcrespo) p:Triage>High
[10:57:29] DBA, Operations, ops-eqiad: db1082 hardware check - https://phabricator.wikimedia.org/T145607#2636444 (jcrespo) p:Triage>High I am going to put this high, because this blocks putting the server back into production, and that means it will lag for longer, so this is time sensitive.
[11:01:33] I can also move mysqld_safe to the mariadb modules, is that handled via some git submodule?
[11:13:13] moritzm, it is on operations/mariadb, or just load the submodule modules/mariadb
[11:13:57] we can add it as a mysqld_safe.pp, and then load it optionally, but later enforce it on all
[11:14:24] k, will add it there
[11:16:25] or we can merge it as it is
[11:16:29] just for testing
[11:16:37] and then move it later
[11:16:45] whatever is easier
[11:18:31] nah, let's better get it done properly from the start, I'll update my patch in about 30 mins
[11:21:45] DBA, MediaWiki-API, MediaWiki-Page-deletion, Operations, Performance: Cannot delete two pages with large histories even having the appropriate permissions to do so - https://phabricator.wikimedia.org/T145630#2636471 (jcrespo) @Anomie If you can give a look at this (I am myself a bit lost) and...
[11:58:53] DBA, ChangeProp, MediaWiki-API, MediaWiki-Database, and 4 others: Investigate slow transcludedin query - https://phabricator.wikimedia.org/T145079#2636525 (jcrespo) I tried it, the only thing I took from here is that MariaDB -at least the version we use- is dumb, and I am 99% convinced this issue...
[12:00:59] DBA, ChangeProp, MediaWiki-API, MediaWiki-Database, and 4 others: Investigate slow transcludedin query - https://phabricator.wikimedia.org/T145079#2636529 (jcrespo) a:jcrespo>None This is a one line patch, I can do it if you have faith in me; I personally don't. I would like to focus on t...
[13:25:23] DBA, ChangeProp, MediaWiki-API, MediaWiki-Database, and 4 others: Investigate slow transcludedin query - https://phabricator.wikimedia.org/T145079#2636683 (mobrovac) Since ApiQueryBacklinks is using a straight join, it might be safe enough to do it for templatelinks as well?
[13:30:36] DBA, ChangeProp, MediaWiki-API, MediaWiki-Database, and 4 others: Investigate slow transcludedin query - https://phabricator.wikimedia.org/T145079#2636689 (jcrespo) >>!
In T145079#2636683, @mobrovac wrote: > Since ApiQueryBacklinks is using a straight join, it might be safe enough to do it for te...
[13:52:15] DBA, MediaWiki-Page-deletion, Operations, Performance: Cannot delete two pages with large histories even having the appropriate permissions to do so - https://phabricator.wikimedia.org/T145630#2636715 (Anomie) Removing #MediaWiki-API, since this has nothing to do with the API itself. >>! In T145...
[13:54:27] DBA: hitcounter and _counter tables are on the cluster but were deleted/unused? - https://phabricator.wikimedia.org/T132837#2636717 (Marostegui) I have renamed the hitcounter table to TO_DROP_hitcounter in all codfw servers: ``` dbstore2001.codfw.wmnet dbstore2002.codfw.wmnet db2036.codfw.wmnet db2043.codf...
[14:05:43] DBA, ChangeProp, MediaWiki-API, MediaWiki-Database, and 4 others: Investigate slow transcludedin query - https://phabricator.wikimedia.org/T145079#2636772 (Anomie) a:Anomie Thanks for looking into alternative solutions, @jcrespo. I'll write the patch for straight_join.
[14:08:38] DBA, ChangeProp, MediaWiki-API, MediaWiki-Database, and 4 others: Investigate slow transcludedin query - https://phabricator.wikimedia.org/T145079#2636786 (jcrespo) I am sorry, Anomie, for this and all other issues. We are already looking at next versions of the server that will fix this and othe...
[14:09:28] DBA, MediaWiki-Page-deletion, Operations, Performance: Cannot delete two pages with large histories even having the appropriate permissions to do so - https://phabricator.wikimedia.org/T145630#2636787 (MarcoAurelio) Is there any way via API sandbox (maxlag / maxage / smaxage) which can be used to...
[14:21:07] DBA, Operations, ops-eqiad: db1082 hardware check - https://phabricator.wikimedia.org/T145607#2636820 (Marostegui) Kernel has been upgraded to 4.4.0-2 and a full-upgrade has been performed as well.
[14:42:46] DBA, MediaWiki-Page-deletion, Operations, Performance: Cannot delete two pages with large histories even having the appropriate permissions to do so - https://phabricator.wikimedia.org/T145630#2636899 (jcrespo) @MarcoAurelio Translated, this means that the change now it is safe to be done normaly...
[15:11:23] DBA, Operations: Investigate db1082 crash - https://phabricator.wikimedia.org/T145533#2633433 (Cmjohnson) I do not see anything with the server that we could pinpoint to a h/w issue.
[15:51:16] I am going to start mysql on db1082, to check it is working properly
[15:52:10] and make sure it doesn't get behind too much. We will probably have to stop it if maintenance is done later
[15:52:40] jynus: sounds good
[15:53:11] oh, I thought you were away
[15:53:26] you can do it yourself, if you prefer it
[15:54:02] Sure!
[15:54:42] I will do it now with --skip-slave-start
[15:54:54] I am tailing the error log
[15:55:18] me too :)
[15:55:34] you will see 2 executions of mysqld_safe before for testing the config options
[15:56:09] (it was safer to do it on down mysql)
[15:57:04] complains about mysql.user mysql.event and heartbeat.heartbeat
[15:57:08] yeah
[15:58:05] Apart from that it looks clean
[16:00:17] I am going to drop a couple of users from that database
[16:02:11] you are probably now gone
[16:02:15] I will start replication
[16:02:50] yeah, sorry I am being dragged into this meeting.
Thanks :)
[16:03:47] Oh, they are having technical issues with the hangouts
[16:03:52] So I am around for a bit longer :)
[16:04:18] there is nothing to do really
[16:04:22] about this
[16:14:23] so some months ago, this slave should be reimaged
[16:14:47] I now trust the transactional replication control quite a lot
[16:15:23] what I do not trust is the node - we will see if we reboot it for maintenance again, depending on what chris can do
[16:15:44] all i can do is run a memtest
[16:15:59] cmjohnson1, how useful do you think that will be?
[16:16:08] historically not very useful
[16:17:08] so I trust your advice, as I said, whatever it will be
[16:18:22] this is the first of the new main dbs to crash
[16:18:45] in what, around 6 months?
[16:18:53] it is...and strange..i can run the test
[16:19:01] just working on be1022 issue right now
[16:19:14] so, I am not in a hurry
[16:19:28] especially not once we started it again
[16:19:46] let's talk again on thursday?
[16:20:01] and we will put it out of production for now, ok?
[16:30:57] jynus: is server down?
[16:31:14] admin down? i am going to start the test shortly...just waiting for the iso to burn
[16:31:47] cmjohnson1, no, it is up
[16:31:53] I will stop it now
[16:32:47] okay
[16:35:12] it should be down now, cmjohnson1
[16:35:18] great..thx
[16:35:25] going to run tests now
[17:20:24] DBA, MediaWiki-Page-deletion, Operations, Performance: Cannot delete two pages with large histories even having the appropriate permissions to do so - https://phabricator.wikimedia.org/T145630#2637627 (aaron) There is a DeleteBatch maintenance script that could take a page via stdin or a list of...
[17:56:49] DBA, MediaWiki-Page-deletion, Operations, Performance: Cannot delete two pages with large histories even having the appropriate permissions to do so - https://phabricator.wikimedia.org/T145630#2637707 (aaron) Open>Resolved a:aaron I deleted both now.