[00:03:47] DBA, Operations: db1092 crash - https://phabricator.wikimedia.org/T151272#2812819 (Reedy)
[07:01:46] DBA, Operations: db1092 crash - https://phabricator.wikimedia.org/T151272#2812799 (Marostegui) Thanks to Robh for taking care of this. I am going to have a look to see if we can find why it crashed.
[07:02:24] DBA, Operations: db1092 crash - https://phabricator.wikimedia.org/T151272#2813334 (Marostegui) a:Marostegui
[07:30:57] DBA, Operations: db1092 crash - https://phabricator.wikimedia.org/T151272#2813348 (Marostegui) Error from yesterday ``` /system1/log1/record12 Targets Properties number=12 severity=Caution date=11/21/2016 time=23:52 description=Option ROM POST Error: 1719-Slot 1 Drive Array - A c...
[08:00:39] DBA, Operations, ops-codfw: db2035: RAID disk about to fail - https://phabricator.wikimedia.org/T150511#2813371 (Marostegui) Open>Resolved All good now - thank you! ``` logicaldrive 1 (3.3 TB, RAID 1+0, OK) physicaldrive 1I:1:1 (port 1I:box 1:bay 1, SAS, 600 GB, OK) physic...
[08:17:39] DBA, Patch-For-Review: db2034: investigate its crash and reimage - https://phabricator.wikimedia.org/T149553#2813401 (Marostegui) >>! In T149553#2811528, @Papaul wrote: > The HP Tech didn't show up. :-| The server can remain off, no worries. I have downtimed it for 7 days just in case.
[08:23:08] DBA, Labs, Labs-Infrastructure: Initial data tests for db1095 (temporary db1069 - sanitarium replacement) - https://phabricator.wikimedia.org/T150960#2813424 (Marostegui) db1095 got replication broken last night (reminder: it was imported from dbstore2001): ``` 2016-11-21 16:14:01 139805028624128 [E...
[08:30:00] DBA, Patch-For-Review: Unify commonswiki.revision - https://phabricator.wikimedia.org/T147305#2813433 (Marostegui) db1081 is done ``` root@neodymium:~# mysql -hdb1081 -A commonswiki -e "show create table revision\G" *************************** 1. row *************************** Table: revision Cr...
[08:49:36] DBA, Labs, Labs-Infrastructure: Initial data tests for db1095 (temporary db1069 - sanitarium replacement) - https://phabricator.wikimedia.org/T150960#2813443 (Marostegui) I have run the first iteration of the script. By the way, I have created a new script that bypasses the safety measures of `redact...
[09:01:00] what is the deal with db1092?
[09:01:10] jynus: It died with a controller error
[09:01:16] I have upgraded its firmware
[09:01:24] (that is what the error message suggested)
[09:01:28] to upgrade it
[09:01:40] ticket?
[09:01:56] https://phabricator.wikimedia.org/T151272
[09:04:11] https://phabricator.wikimedia.org/T141756 :-/
[09:04:36] what?
[09:05:06] Keep in mind that db1082 was the one that also died in a weird situation and we upgraded its firmware and nothing ever happened again (which can be a coincidence)
[09:14:56] DBA, Patch-For-Review: Unify commonswiki.revision - https://phabricator.wikimedia.org/T147305#2813486 (Marostegui) db1084 is done ``` root@neodymium:~# mysql -hdb1084 -A commonswiki -e "show create table revision\G" *************************** 1. row *************************** Table: revision Cr...
[09:19:51] jynus: How long does your alter on db1059 take? Because I have finished with db1084 and it is ready to be pooled back (it is in s4), so maybe you can add it back once you are done?
[09:19:55] (or I can)
[09:20:41] it takes 30 minutes
[09:20:52] but let me check if I have to do it on db1084
[09:21:00] sure :)
[09:21:03] maybe that one didn't fail
[09:22:04] db1084 failed, but due to activity, no key errors
[09:22:15] so leave it depooled and I will do it next
[09:22:33] ok, thank you!
[09:22:33] please create the revert so I do not get confused
[09:22:40] but I will merge it
[09:22:47] ok, I will do it now
[09:22:57] are you done with it?
[09:23:03] yes
[09:23:07] it is ready to be pooled back anytime
[09:24:01] https://gerrit.wikimedia.org/r/#/c/322843/ -> you want me to rebase now?
[09:24:22] "I have the plane" for db1084
[09:24:34] XD
[09:24:35] leave it as is, I will take care of rebases
[09:24:39] thank you
[09:59:14] DBA, Labs, Labs-Infrastructure: Initial data tests for db1095 (temporary db1069 - sanitarium replacement) - https://phabricator.wikimedia.org/T150960#2813561 (Marostegui) I have checked the following list of table.column making sure that the column has no records or they have the ID set to what the l...
[10:01:03] DBA, Labs, Labs-Infrastructure: Initial data tests for db1095 (temporary db1069 - sanitarium replacement) - https://phabricator.wikimedia.org/T150960#2813566 (jcrespo) ``` MariaDB [(none)]> SHOW GLOBAL VARIABLES like 'slave_run_triggers_for_rbr'; +----------------------------+-------+ | Variable_name...
[10:01:54] jynus ^ thanks
[10:02:34] DBA, Labs, Labs-Infrastructure: Initial data tests for db1095 (temporary db1069 - sanitarium replacement) - https://phabricator.wikimedia.org/T150960#2813568 (jcrespo) Also, please upload a redact_standard_output.sh which does not bypass the checks, only add db1095 as a good sanitarium host.
[10:06:46] DBA, Labs, Labs-Infrastructure: Initial data tests for db1095 (temporary db1069 - sanitarium replacement) - https://phabricator.wikimedia.org/T150960#2813571 (Marostegui) >>! In T150960#2813566, @jcrespo wrote: > ``` > MariaDB [(none)]> SHOW GLOBAL VARIABLES like 'slave_run_triggers_for_rbr'; > +----...
[10:09:25] DBA, Labs, Labs-Infrastructure: Initial data tests for db1095 (temporary db1069 - sanitarium replacement) - https://phabricator.wikimedia.org/T150960#2813574 (jcrespo) > and now > Changed to only allow db1095. Please add me as reviewer of both changes.
[10:40:09] jynus: if you depool db1091, can you please ping me so I can run an alter table there too? it is quick, 10 minutes or so
[10:41:30] yes
[10:41:43] well, I am about to do it now marostegui
[10:41:44] thanks
[10:41:53] ah cool
[10:42:15] let me know when you've finished your alter, so I can run mine
[10:42:55] I think they can run in parallel
[10:43:01] yours is in revision
[10:43:09] mine is on page, and blocking
[10:43:16] sure
[10:43:27] https://gerrit.wikimedia.org/r/322858
[10:43:59] \o/
[10:44:24] I think reedy deployed it without commenting the issue but in the changeset
[10:44:28] thanks for commenting out db1092
[10:44:33] yes
[10:44:36] looks so
[10:44:45] which is normal for code
[10:45:03] but for is way more clear to keep it on the source code too
[10:45:06] *us
[10:45:16] totally agree
[10:45:48] the blame feature I am afraid would not work well for single-character comments
[10:46:22] but it is nice to see https://noc.wikimedia.org/conf/highlight.php?file=db-eqiad.php
[10:46:33] and control what is bad
[10:47:56] BTW, did we hear back from wikidata team why wikidata is reading 10x more rows than all other wikis combined?
[10:48:20] I haven't seen anything, after you talked to that lady on the #operations channel
[10:49:17] "lady"? I suppose you mean lydia -wikidata product manager
[10:49:44] yes, misspelled it :)
[10:50:22] I will mention it to performance, too
[10:51:25] December is dangerously near. I think I will spend next week doing a rolling restart of external storage
[10:55:16] rolling restart?
[10:59:32] do you guys know anything about a create-dbusers.service that runs on labstore1004?
[11:00:13] volans, chase told us recently he was working on that
[11:00:51] because icinga is alarming that the service is failed, not sure if I should take care of it and restarting/checking why failed
[11:01:07] well, I would ask him or yuvi first
[11:01:31] probably it is in the middle of a rewrite
[11:07:09] found the issue, opened a task, thanks :)
[11:08:27] related? https://phabricator.wikimedia.org/T151296
[11:08:37] in any case, offtopic here
[11:09:01] no, unrelated
[11:13:51] marostegui, db1091 is depooled
[11:15:25] jynus: great, will run my alter now
[11:19:10] db1048 problems gone? https://grafana.wikimedia.org/dashboard/db/mysql?var-dc=eqiad%20prometheus%2Fops&var-server=db1048&from=1479078536607&to=now&panelId=32&fullscreen
[11:20:40] media errors still present. I am confused
[11:25:05] :|
[11:25:07] I don't get it
[11:25:24] did they release something?
[11:30:15] I am not sure it is the disk anymore
[11:30:43] it certainly could be an import
[11:34:04] An import? Do we do imports on phabricator?
[11:34:13] My alter table on db1091 is finished
[11:34:56] DBA, Patch-For-Review: Unify commonswiki.revision - https://phabricator.wikimedia.org/T147305#2813909 (Marostegui) db1091 is done ``` root@neodymium:/home/marostegui/git/software# mysql -hdb1091 -A commonswiki -e "show create table revision\G" *************************** 1. row *************************...
[11:56:43] lunch
[11:57:01] DBA, Phabricator, Upstream: Editing a recurring event overrides all past instances - https://phabricator.wikimedia.org/T151228#2811476 (Marostegui) Which table(s) are we looking at here?
[12:37:41] marostegui: FYI Icinga downtime for db2034 will expire in 7 days
[12:41:25] DBA, Phabricator, Upstream: Editing a recurring event overrides all past instances - https://phabricator.wikimedia.org/T151228#2814054 (daniel) Here's a query that returns the past events that were affected. I do not know which ones of these actually had edited titles/descriptions. Probably most. ht...
[12:41:58] DBA, Phabricator, Upstream: Editing a recurring event overrides all past instances - https://phabricator.wikimedia.org/T151228#2814057 (daniel) This seems like a pretty nasty upstream bug. If you can confirm the problem, should we file it somewhere?
[13:02:31] volans: that is fine - I downtimed it for 7 days yesterday
[13:03:02] volans: The HP guy didn't show up to change its main board, so as it is totally broken, I have left it down so it can be changed today hopefully and papaul doesn't need me to power it off :)
[13:03:26] ah ok, sorry thought it was a more long-term down host :)
[13:04:21] it has been down for a long time indeed now :(
[13:32:46] DBA, Phabricator, Upstream: Editing a recurring event overrides all past instances - https://phabricator.wikimedia.org/T151228#2814138 (Marostegui) I am not really sure I am looking into the right table here (learning phabricator schema as we go!) but I can see that the following query: ``` select F...
[14:02:20] DBA, Patch-For-Review: duplicate key problems on s4 - https://phabricator.wikimedia.org/T151029#2814184 (jcrespo) Open>Resolved I did the rest of the changes in a blocking way, and I got no error. For the master, I could not do it blocking without a failover or read_only, but of course I tried th...
[14:02:22] Blocked-on-schema-change, DBA, Wikimedia-Site-requests, Wikisource, and 2 others: Schema change for page content language - https://phabricator.wikimedia.org/T69223#2814186 (jcrespo)
[14:02:37] DBA, Schema-change, Tracking: Schema changes for Wikimedia wikis (tracking) - https://phabricator.wikimedia.org/T51188#2814189 (jcrespo)
[14:02:44] Blocked-on-schema-change, DBA, Wikimedia-Site-requests, Wikisource, and 2 others: Schema change for page content language - https://phabricator.wikimedia.org/T69223#2294914 (jcrespo) stalled>Open
[14:04:34] DBA, Labs, Labs-Infrastructure, Patch-For-Review: Initial data tests for db1095 (temporary db1069 - sanitarium replacement) - https://phabricator.wikimedia.org/T150960#2814202 (Marostegui) >>! In T150960#2810612, @Marostegui wrote: > Thanks - as per the replication filters, these tables would ne...
[14:07:24] DBA, Labs, Labs-Infrastructure, Patch-For-Review: Initial data tests for db1095 (temporary db1069 - sanitarium replacement) - https://phabricator.wikimedia.org/T150960#2814221 (Marostegui) And as soon as I sent that: ``` 2016-11-22 14:04:55 139805000005376 [ERROR] Slave SQL: Error executing row...
[14:11:04] DBA, Labs, Labs-Infrastructure, Patch-For-Review: Initial data tests for db1095 (temporary db1069 - sanitarium replacement) - https://phabricator.wikimedia.org/T150960#2814227 (jcrespo) > these tables would need to be dropped I have some comments of some tables that I know about: * some of tho...
[14:17:09] DBA, Patch-For-Review: Unify commonswiki.revision - https://phabricator.wikimedia.org/T147305#2814251 (Marostegui) >>! In T147305#2814237, @Stashbot wrote: > {nav icon=file, name=Mentioned in SAL (#wikimedia-operations), href=https://tools.wmflabs.org/sal/log/AViMY4BElCyyDMEPuxY5} [2016-11-22T14:14:40Z]...
[14:18:31] DBA, Labs, Labs-Infrastructure, Patch-For-Review: Initial data tests for db1095 (temporary db1069 - sanitarium replacement) - https://phabricator.wikimedia.org/T150960#2814256 (jcrespo) > And as soon as I sent that: We have to do some study: http://dev.mysql.com/doc/refman/5.6/en/replication-ru...
[14:23:19] DBA, Labs, Labs-Infrastructure, Patch-For-Review: Initial data tests for db1095 (temporary db1069 - sanitarium replacement) - https://phabricator.wikimedia.org/T150960#2814265 (Marostegui) >>! In T150960#2814256, @jcrespo wrote: >> And as soon as I sent that: > > We have to do some study: > > h...
[14:25:06] DBA, Labs, Labs-Infrastructure, Patch-For-Review: Initial data tests for db1095 (temporary db1069 - sanitarium replacement) - https://phabricator.wikimedia.org/T150960#2814271 (jcrespo) The problem is that, as do has preference over ignore, and ROW based and STATEMENT based behave differently, it...
[14:36:54] DBA, Labs, Labs-Infrastructure, Patch-For-Review: Initial data tests for db1095 (temporary db1069 - sanitarium replacement) - https://phabricator.wikimedia.org/T150960#2814298 (Marostegui) Should we stick to STATEMENT as of today (and as that is what we have today) instead of exploring other terr...
[14:39:12] DBA, Labs, Labs-Infrastructure, Patch-For-Review: Initial data tests for db1095 (temporary db1069 - sanitarium replacement) - https://phabricator.wikimedia.org/T150960#2814299 (jcrespo) > That is why I was suggesting that maybe we can include a full shard (maybe S1 or S5?) Sure, as long as it is...
[15:51:25] DBA, Operations, ops-eqiad: labsdb1009 boot issues (power supply and controller?) - https://phabricator.wikimedia.org/T150211#2814621 (Cmjohnson) @jcrespo I re-seated all the components to the raid controller and powered on, all disks are now showing as 1 LD and booted to the OS You may want to do s...
[15:52:03] DBA, Operations, ops-eqiad: labsdb1009 boot issues (power supply and controller?) - https://phabricator.wikimedia.org/T150211#2814625 (Cmjohnson) Open>Resolved a:Cmjohnson @jcrespo please re-open if problem persists.
[16:36:21] DBA, Operations: Review Icinga alarms with disabled notifications - https://phabricator.wikimedia.org/T149643#2814733 (Volans) I've done a bit of cleanup, re-enabling some of them that were ok and leftover of other maintenance. `maps-test*` is being worked by @Gehel for a proper fix. All the others at th...
[16:39:51] DBA, Labs, Labs-Infrastructure, Operations, and 3 others: Move dbproxy1010 and dbproxy1011 to labs-support network, rename them to labsdbproxy1001 and labsdbproxy1002 - https://phabricator.wikimedia.org/T149170#2814739 (Cmjohnson) @jcrespo, these 2 servers have been moved to rack C5, connected to...
[16:41:56] DBA, Labs, Labs-Infrastructure, Operations, and 3 others: Move dbproxy1010 and dbproxy1011 to labs-support network, rename them to labsdbproxy1001 and labsdbproxy1002 - https://phabricator.wikimedia.org/T149170#2814740 (jcrespo) a:jcrespo @Cmjohnson Thank you a lot! I will take it from here
[16:44:00] DBA, Phabricator, Upstream: Editing a recurring event overrides all past instances - https://phabricator.wikimedia.org/T151228#2814745 (daniel) >>! In T151228#2814138, @Marostegui wrote: > Despite of the time the event was created, it always has the same description for all the 18 rows it finds. > Ex...
[17:50:19] DBA, Operations: Review Icinga alarms with disabled notifications - https://phabricator.wikimedia.org/T149643#2815029 (Dzahn) re-enabled notifications on some install1001/2001 services
[17:53:55] researchers devastated to discover that the world ignores ER diagrams http://cacm.acm.org/blogs/blog-cacm/208958-database-decay-and-what-to-do-about-it/fulltext
[17:56:30] aka "researchers discover in the real world things working is actually a priority"
[17:56:50] yeah but database decay!
[17:57:07] "look users, I know you cannot access wikipedia, but look at this nice rational rose design"
[17:58:27] heh
[17:59:00] DBA, MediaWiki-extensions-WikibaseRepository, Wikidata: Wikibase\Repo\Store\Sql\SqlEntityIdPager::fetchIds query slow - https://phabricator.wikimedia.org/T151356#2815088 (hoo)
[18:00:33] jynus: ^ A quick look at the queries there would be appreciated
[18:00:43] we don't run them in webrequests, delayed jobs only
[18:02:12] we do not use HAVING
[18:02:17] unless you use group by
[18:02:32] jynus: Feared something like that
[18:02:40] could possibly also rewrite using sub-queries
[18:02:45] I think you are falling into a trap
[18:02:53] of executing twice the same query
[18:03:00] thinking the second is more optimized
[18:03:19] use SHOW GLOBAL STATUS like 'Hand%'
[18:03:23] sorry
[18:03:28] jynus: Well, the initial one is still slow
[18:03:32] SHOW SESSION STATUS like 'Hand%'
[18:03:35] even when run several times
[18:03:47] to check the execution differences
[18:03:53] do not rely on time
[18:04:13] ah, I see the difference
[18:04:21] the extra field selected
[18:04:35] maybe there is a problem with the extended index
[18:04:43] hm
[18:04:44] use EXPLAIN and the above command
[18:04:45] jynus: (purely anecdotal) at my last gig we had a 64 bit field that was an aggregate of normalized stored values that /never change/ in reality. Normalized and stored A/S/L etc is super burdensome when queried, especially in a sharded inf (not like we mean shards but by table across a cluster of boxes) and whenever someone new who was a middling dev saw this they thought they should rearchitect only all rearchitecture led to worse solutions.
[18:05:14] show me the difference, hoo
[18:05:19] but do not make me work :-)
[18:06:09] if HAVING has a different performance than WHERE that is a server bug, not a feature
[18:06:26] ah
[18:06:33] spot other difference
[18:06:38] the '0' and the 0
[18:06:50] check those 2 changes but without the having
[18:07:01] Already tried that
[18:07:03] doesn't matter
[18:07:33] in any case, please copy the EXPLAIN and the handler status to demonstrate it is faster
[18:07:47] will amend the bug
[18:07:57] it is ok, you can comment
[18:08:45] most probably it is a lack of a proper index and we can fix that
[18:09:06] OR run a union
[18:12:14] DBA, MediaWiki-extensions-WikibaseRepository, Wikidata: Wikibase\Repo\Store\Sql\SqlEntityIdPager::fetchIds query slow - https://phabricator.wikimedia.org/T151356#2815150 (hoo) EXPLAINs: ``` mysql:wikiadmin@db1082 [wikidatawiki]> EXPLAIN SELECT /* Wikibase\Repo\Store\Sql\SqlEntityIdPager::fetchIds da...
[18:12:24] Commented
[18:12:27] I got to go now
[18:12:32] thanks for your help
[18:12:43] I will give you a comment
[18:12:47] thanks
[18:13:55] the query is funny, because the explain shows a complete opposite plan than the one shown
[18:20:46] DBA, MediaWiki-extensions-WikibaseRepository, Wikidata: Wikibase\Repo\Store\Sql\SqlEntityIdPager::fetchIds query slow - https://phabricator.wikimedia.org/T151356#2815190 (jcrespo) I have some ideas, this is a very interesting edge case, but I will give you a better answer tomorrow- I have some pendin...
[18:23:42] DBA, MediaWiki-extensions-WikibaseRepository, Wikidata: Wikibase\Repo\Store\Sql\SqlEntityIdPager::fetchIds query slow - https://phabricator.wikimedia.org/T151356#2815197 (jcrespo) This is the goal, but we will try to achieve this without the force index, depending on how much I can change the origina...
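A minimal sketch of the advice given in the 18:03 to 18:07 exchange: compare handler counters between two query variants instead of relying on wall-clock time. It assumes each variant was run in its own session (or after `FLUSH STATUS`, if the account is allowed to run it) and that the output of `SHOW SESSION STATUS LIKE 'Hand%'` was copied into a dict. `handler_diff` is a hypothetical helper, not a tool mentioned in this log, and the counter values are made up purely to show the shape of the comparison.

```python
# Hypothetical helper: compare Handler_% counters from two query runs.
# In the mysql client, each run would look roughly like:
#   FLUSH STATUS;                        -- reset session counters
#   SELECT ...;                          -- the query variant under test
#   SHOW SESSION STATUS LIKE 'Hand%';    -- counters for that session


def handler_diff(run_a, run_b):
    """Return {counter: (value_a, value_b)} for counters that differ."""
    diff = {}
    for name in sorted(set(run_a) | set(run_b)):
        a = run_a.get(name, 0)  # a counter missing from one run counts as 0
        b = run_b.get(name, 0)
        if a != b:
            diff[name] = (a, b)
    return diff


# Made-up numbers for illustration only; they are not from T151356.
query_a = {"Handler_read_key": 1, "Handler_read_next": 250000, "Handler_read_rnd_next": 0}
query_b = {"Handler_read_key": 1, "Handler_read_next": 120, "Handler_read_rnd_next": 0}

for counter, (a, b) in handler_diff(query_a, query_b).items():
    print(f"{counter}: {a} -> {b}")
```

Counters that differ between the two runs point at where the plans diverge, which can then be checked against the `EXPLAIN` output for each variant.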
[18:38:04] DBA, Patch-For-Review: db2034: investigate its crash and reimage - https://phabricator.wikimedia.org/T149553#2815230 (Papaul) @Marostegui main board replacement complete. We can crash the server again.
[19:54:05] DBA: phabricator_conduit.conduit_methodcalllog failed replicating on dbstore1002, probably m3 needs a reload on that server - https://phabricator.wikimedia.org/T151384#2815666 (jcrespo)
[20:08:16] DBA, Phabricator, Upstream: Editing a recurring event overrides all past instances - https://phabricator.wikimedia.org/T151228#2815737 (mmodell) upstreamed: https://secure.phabricator.com/T11909
[20:08:51] DBA, Phabricator, Upstream: Editing a recurring event overrides all past instances - https://phabricator.wikimedia.org/T151228#2815743 (mmodell)