[03:03:08] DBA, MW-1.31-release-notes (WMF-deploy-2018-03-27 (1.31.0-wmf.27)), Patch-For-Review, User-notice, and 2 others: 1.31.0-wmf.27 rolled back due to increase in fatals: "Replication wait failed: lost connection to MySQL server during query" - https://phabricator.wikimedia.org/T190960#4119149 (mmodell)
[05:18:47] DBA, MediaWiki-Platform-Team (MWPT-Q4-Apr-Jun-2018), Patch-For-Review, Schema-change: Fix WMF schemas to not break when comment store goes WRITE_NEW - https://phabricator.wikimedia.org/T187089#4119255 (Marostegui)
[05:20:01] DBA, Operations, Epic: DB meta task for next DC failover issues - https://phabricator.wikimedia.org/T189107#4119261 (Marostegui)
[05:20:07] DBA, MediaWiki-Platform-Team (MWPT-Q4-Apr-Jun-2018), Patch-For-Review, Schema-change: Fix WMF schemas to not break when comment store goes WRITE_NEW - https://phabricator.wikimedia.org/T187089#4027643 (Marostegui) Open>stalled So this task is now stalled. Only the primary masters in eqiad...
[05:24:41] Blocked-on-schema-change, DBA, Multi-Content-Revisions, User-Addshore: Change DEFAULT 0 for rev_text_id on production DBs - https://phabricator.wikimedia.org/T190148#4119265 (Marostegui) a:Marostegui I will start deploying this on s8 tomorrow probably as I need to depool those servers for som...
[05:34:29] Blocked-on-schema-change, DBA, Data-Services, MediaWiki-Platform-Team (MWPT-Q4-Apr-Jun-2018): Schema change for refactored actor storage - https://phabricator.wikimedia.org/T188299#4119272 (Marostegui) p:Triage>Normal a:Marostegui As I need to depool s8 servers, I will start deploying...
[05:51:06] Amir1: I am going to merge: https://gerrit.wikimedia.org/r/#/c/425098/
[05:54:16] DBA, Operations, Epic: DB meta task for next DC failover issues - https://phabricator.wikimedia.org/T189107#4119309 (Marostegui)
[05:54:18] DBA: Decommission db1051-db1060 (DBA tracking) - https://phabricator.wikimedia.org/T186320#4119310 (Marostegui)
[05:54:20] DBA, Patch-For-Review: Run pt-table-checksum on s1 (enwiki) - https://phabricator.wikimedia.org/T162807#4119311 (Marostegui)
[05:54:22] DBA: Rebuild user_newtalk on db1052 - https://phabricator.wikimedia.org/T186503#4119307 (Marostegui) Open>declined This host will be decommissioned (or moved to misc) so no need to rebuild this table really.
[06:31:01] DBA, Operations, Epic: DB meta task for next DC failover issues - https://phabricator.wikimedia.org/T189107#4119330 (Marostegui)
[06:31:09] DBA, MediaWiki-Platform-Team (MWPT-Q4-Apr-Jun-2018), Patch-For-Review, Schema-change: Fix WMF schemas to not break when comment store goes WRITE_NEW - https://phabricator.wikimedia.org/T187089#4119327 (Marostegui) stalled>Open I didn't notice that s8 was still not done
[06:31:30] DBA, MediaWiki-Platform-Team (MWPT-Q4-Apr-Jun-2018), Patch-For-Review, Schema-change: Fix WMF schemas to not break when comment store goes WRITE_NEW - https://phabricator.wikimedia.org/T187089#4119332 (Marostegui)
[06:44:30] Blocked-on-schema-change, DBA: Schema changes to site_stats - https://phabricator.wikimedia.org/T190780#4083536 (Marostegui) Just for the record, this table has less than 5 rows in all the wikis.
[07:22:10] Blocked-on-schema-change, DBA: Schema changes to site_stats - https://phabricator.wikimedia.org/T190780#4119392 (jcrespo) Yes, we could do this on the masters even with a table reconstruction- but we should check if a table reconstruction is needed only for a definition change.
[07:23:33] I will merge this later, once labsdb1010 has caught up: https://gerrit.wikimedia.org/r/#/c/425095/
[07:29:21] sure
[07:29:50] note it takes 14400 minutes for connections to fail over
[07:30:05] yeah
[07:30:17] there is also
[07:30:29] https://gerrit.wikimedia.org/r/#/c/423494/
[07:31:22] but in reality, someone doing a query every 5 minutes will not be timed out
[07:31:24] yeah, I would push that to be honest and see what we see
[07:43:37] DBA, Operations, ops-eqiad: Rack and setup 8 new eqiad DBs - https://phabricator.wikimedia.org/T191792#4119436 (jcrespo)
[07:43:53] hahah I was doing the same!
[07:46:56] we could depool a db106* or db107* to move it to x1
[07:47:06] rather than the new one
[07:47:32] yeah, that is what I meant, pool one of the new ones to replace one of the db106* or db107* and move that to x1
[07:47:39] so we don't have to work twice
[07:48:28] is x1 stretch already?
[07:48:52] probably yes
[07:49:04] yes
[07:57:28] if you are going to check the labsdbs later, take a look at the weights
[07:57:54] not sure if 1:3 is the right one, it seems 1011 got more delay, so QPS may be misleading?
[07:58:19] Yeah, could be
[07:58:26] because the others recovered pretty quickly
[07:58:52] just a high level check of how things are going load wise
[08:24:32] Blocked-on-schema-change, DBA: Schema changes to site_stats - https://phabricator.wikimedia.org/T190780#4119485 (EddieGP) >>! In T190780#4119348, @Marostegui wrote: > Just for the record, this table has less than 5 rows in all the wikis. For each and every wiki, this table is expected to contain exactly...
[09:01:52] <_joe_> dbas: would it be ok for me to run a second updateCollation script on s2?
[09:02:06] <_joe_> or do you think it would be too much to have two running in parallel?
[09:02:13] <_joe_> on different databases ofc
[09:02:43] there is no maintenance on s2 from my side btw
[09:05:11] what was too much was running several of them for multi-instance/multi-source wikis e.g.
labsdb lagged
[09:05:28] and so did dbstores (backups)
[09:12:08] is there something ongoing on enwiki master right now?
[09:12:18] I see some reports of writes failing there
[09:12:55] there is an alter table finishing
[09:13:04] table?
[09:13:05] on the archive table
[09:13:13] oh, that is actually the one failing
[09:13:18] do you have an ETA?
[09:13:29] probably 5 minutes
[09:13:46] what is the error?
[09:13:52] Read timeout is reached
[09:14:25] I didn't see those on the other masters that were altered
[09:14:52] the user report is confirmed by errors in the log
[09:15:40] Yeah, I am seeing that ticket
[09:15:57] But at this point it is better to let it finish (it is almost done) than to kill it and rollback
[09:16:47] of course
[09:17:08] keep me updated when it finishes, I will check why it is happening
[09:17:17] yep
[09:17:52] <_joe_> jynus, marostegui it might be a good idea to comment here https://phabricator.wikimedia.org/T191875
[09:18:37] there is a metadata lock
[09:18:39] ongoing
[09:19:07] this is an ongoing outage
[09:20:20] it is finished
[09:20:27] all inserts were "Waiting for table metadata lock"
[09:20:50] I guess it was the last part of the alter
[09:20:53] when it renames the table
[09:20:55] but it is all done now
[09:21:13] errors should be gone now
[09:21:42] _joe_: I was commenting yea
[09:21:43] I killed one process
[09:21:47] that was blocked
[09:24:13] can you check when the alter started?
[09:24:22] I will check the logs
[09:24:45] there were several alters, I will try to check when the archive one started
[09:24:53] errors since 8:40
[09:25:22] "only" 72, but these are high profile edits
[09:26:05] yeah, archive started at 8:20
[09:26:30] I guess the other masters didn't have sooooo much activity on archive compared to enwiki
[09:26:34] not even commons
[09:28:19] what command did you use for alter? osc, something else?
[09:28:59] I am pasting the command
[09:39:52] I would have expected it to create issues at start or at the end with metadata locking but not during the alter itself
[09:40:30] that was mostly what I wrote
[09:41:39] Yeah just saw that
[09:41:49] Maybe MariaDB version?
[09:42:04] there are some strange patterns on db1052
[09:43:26] Keep in mind it also ran some other alter tables
[09:43:39] So it could be messing with the graphs
[09:44:01] The version is 10.0.28 so not that old
[09:54:05] I can only think of IO causing this really
[09:54:31] Like big deletes failing because if IO
[09:54:45] *of
[09:54:56] I see a rename ongoing at the same time
[09:55:20] maybe some bad combination of things - rename + alter + deletes created a race condition
[09:55:46] Mmmmm
[09:55:54] a more complex alter system could detect metadata locks or other issues and abort early
[09:55:55] Maybe yeah
[09:56:13] Yeah but we didn't have that metadata locking
[09:56:20] At the end only, I mean
[09:56:56] are you sure the alter was fully online?
[09:57:06] on that version, with that table?
[09:57:21] Yeah that alter is fully online
[09:57:27] It has not created lag anywhere
[09:57:32] oh, I know it should be
[09:57:35] On any server
[09:57:44] but it is MariaDB we are talking about :-)
[09:57:49] Where do you see the rename?
[09:57:59] I am adding it to the ticket, one sec
[09:58:05] Yeah, I know... That's why I left Enwiki master till the end hehe
[09:59:07] I even altered masters with older versions
[10:07:37] interesting- alter table was ongoing until then (other alter?) and then the archive alter starts in state "tmp table", when the waits happen
[10:08:11] also lots of aborted clients during that
[10:08:39] https://grafana.wikimedia.org/dashboard/db/mysql?from=1523349381351&to=1523352294139&orgId=1&var-dc=eqiad%20prometheus%2Fops&var-server=db1052&var-port=9104
[10:24:52] marostegui: Thanks.
I just got to the office
[10:31:48] DBA, MediaWiki-Page-deletion, Operations, Wikimedia-Incident: Deletion not working on English Wikipedia - https://phabricator.wikimedia.org/T191875#4119788 (jcrespo)
[10:45:48] DBA, MediaWiki-Page-deletion, Operations, Wikimedia-Incident: Deletion not working on English Wikipedia - https://phabricator.wikimedia.org/T191875#4119820 (Marostegui) These are the versions of the previous altered masters s1: 10.0.28 (the one that caused this) s2: 10.0.29 s3: 10.0.23 s4: 10.0....
[10:47:02] Blocked-on-schema-change, DBA, MediaWiki-Database, Multi-Content-Revisions, and 3 others: Schema change to prepare for dropping archive.ar_text and archive.ar_flags - https://phabricator.wikimedia.org/T185128#4119823 (Marostegui) Altering enwiki master caused issues. We are investigating why, as...
[10:47:25] Blocked-on-schema-change, DBA, MediaWiki-Database, Multi-Content-Revisions, and 3 others: Schema change to prepare for dropping archive.ar_text and archive.ar_flags - https://phabricator.wikimedia.org/T185128#4119826 (Marostegui)
[10:48:02] Blocked-on-schema-change, DBA, Patch-For-Review, Schema-change: Perform schema change to add externallinks.el_index_60 to all wikis - https://phabricator.wikimedia.org/T153182#4119829 (Marostegui)
[10:48:19] DBA, MediaWiki-Page-deletion, Operations, Wikimedia-Incident: Deletion not working on English Wikipedia - https://phabricator.wikimedia.org/T191875#4119830 (jcrespo) CC @Anomie this is not directly related- maintenance was the direct cause, but I believe the new comment model may be creating wors...
[10:55:36] marostegui: can you remember when you did, if you did already, the commonswiki master equivalent?
[10:57:45] nah, forget it, not related
[10:59:23] I checked that already and nothing was found: https://logstash.wikimedia.org/goto/529482c085a2064ab252e4765458cfad
[12:48:45] DBA, MediaWiki-Page-deletion, Operations, Wikimedia-Incident: Deletion not working on English Wikipedia - https://phabricator.wikimedia.org/T191875#4120050 (Anomie) >>! In T191875#4119830, @jcrespo wrote: > We can create a specific task for that. Please do. > Could the SELECT ... FOR UPDATE be...
[13:26:14] DBA, MediaWiki-Page-deletion, Operations: Reduce locking contention on deletion of pages - https://phabricator.wikimedia.org/T191892#4120123 (jcrespo) p:Triage>Normal
[13:28:32] DBA, MediaWiki-Page-deletion, Operations, Wikimedia-Incident: Deletion not working on English Wikipedia - https://phabricator.wikimedia.org/T191875#4120153 (jcrespo) I agree with everything you said, my comment was a quick sketch of what I wanted, and what you proposed was what I really wanted, c...
[13:31:01] DBA, MediaWiki-Page-deletion, Operations: Reduce locking contention on deletion of pages - https://phabricator.wikimedia.org/T191892#4120159 (jcrespo) I believe this has been happening for some time now, but this incident only made it more real (happening not only for large deletes, but for small on...
[14:18:19] DBA, Dumps-Generation: Some dump hosts are accessing main traffic servers - https://phabricator.wikimedia.org/T143870#2581645 (hoo) This might partly or fully overlap with {T138208}.
[14:20:02] <_joe_> status update: the updateCollation script is just running on s1 and s2, it's done everywhere else
[14:23:04] Last comments in https://gerrit.wikimedia.org/r/#/c/419798/ are for the DBAs :)
[14:45:20] DBA, Operations, ops-codfw, Patch-For-Review: Move masters away from codfw C6 - https://phabricator.wikimedia.org/T191193#4120406 (RobH) [edit interfaces interface-range vlan-private1-a-codfw] member xe-2/0/0 { ...
} + member ge-3/0/27; [edit interfaces ge-3/0/27] + description db2040;...
[14:56:24] DBA, Operations, ops-codfw: Degraded RAID on db2069 - https://phabricator.wikimedia.org/T191720#4120431 (Papaul) a:Papaul>Marostegui Disk replacement complete
[14:57:01] DBA, Operations, ops-codfw: Degraded RAID on db2069 - https://phabricator.wikimedia.org/T191720#4120437 (Marostegui) Let's hope this time it finishes correctly! ``` logicaldrive 1 (3.3 TB, RAID 1+0, Recovering, 1% complete) physicaldrive 1I:1:1 (port 1I:box 1:bay 1, SAS, 600 GB, Rebuildi...
[15:23:20] DBA, MediaWiki-Page-deletion, Operations, Wikimedia-Incident: Deletion not working on English Wikipedia - https://phabricator.wikimedia.org/T191875#4120530 (Marostegui) We have started an Incident Report for this: https://wikitech.wikimedia.org/wiki/Incident_documentation/20180410-Deleting_a_page...
[15:30:57] DBA, Operations, ops-codfw, Patch-For-Review: Move masters away from codfw C6 - https://phabricator.wikimedia.org/T191193#4120557 (Papaul) Move db2040 from C6 to A3 in racktables Please advise what is the next server
[15:31:16] DBA, Operations, ops-codfw, Patch-For-Review: Move masters away from codfw C6 - https://phabricator.wikimedia.org/T191193#4120559 (Papaul)
[15:38:12] DBA, MediaWiki-Page-deletion, Operations, Wikimedia-Incident: Deletion not working on English Wikipedia - https://phabricator.wikimedia.org/T191875#4120575 (jcrespo) Open>Resolved a:Marostegui I am going to close this ticket as the initial report, "Deletion not working", was resolved...
[16:08:08] DBA, Operations, ops-codfw, Patch-For-Review: Move masters away from codfw C6 - https://phabricator.wikimedia.org/T191193#4120667 (Papaul) switch port information when ready to move db2045. db2045 was on asw-c6-codfw ge-6/0/14 and now will be on asw-b3-codfw ge-3/0/ 20 new ip address will be :...
[17:16:03] DBA, Operations, ops-codfw, Patch-For-Review: Move masters away from codfw C6 - https://phabricator.wikimedia.org/T191193#4120861 (Marostegui)
[17:26:48] DBA, Operations, ops-codfw, Patch-For-Review: Move masters away from codfw C6 - https://phabricator.wikimedia.org/T191193#4120912 (Papaul) moved db2045 from C6 to B3 in racktables Please update task with next server we need to move next week. thanks
[17:49:51] DBA, Operations, ops-codfw: Degraded RAID on db2069 - https://phabricator.wikimedia.org/T191720#4120980 (Marostegui) Open>Resolved This is all good now! Thanks ``` logicaldrive 1 (3.3 TB, RAID 1+0, OK) physicaldrive 1I:1:1 (port 1I:box 1:bay 1, SAS, 600 GB, OK) physicaldri...
[18:33:31] DBA, Operations, ops-eqiad: Rack and setup 8 new eqiad DBs - https://phabricator.wikimedia.org/T191792#4121164 (Marostegui)
[20:07:14] DBA, MediaWiki-Page-deletion, Operations, Patch-For-Review, Wikimedia-Incident: Reduce locking contention on deletion of pages - https://phabricator.wikimedia.org/T191892#4121476 (Peachey88)
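
Note on the 09:55:54 remark that a more complex alter system could detect metadata locks and abort early: the idea can be sketched roughly as below. This is only an illustration, not the tooling used in this incident. It assumes the caller has already fetched rows shaped like `SHOW FULL PROCESSLIST` output (column names `Id`, `Command`, `State`, `Info`) as dictionaries. The function names, the `max_waiters` threshold, and the sample rows are all hypothetical.

```python
from typing import Iterable, Mapping

# Thread state quoted in the log at 09:20:27.
MDL_WAIT_STATE = "Waiting for table metadata lock"


def count_mdl_waiters(processlist: Iterable[Mapping[str, object]]) -> int:
    """Count threads currently blocked on a table metadata lock."""
    return sum(1 for row in processlist if row.get("State") == MDL_WAIT_STATE)


def should_abort_alter(
    processlist: Iterable[Mapping[str, object]], max_waiters: int = 5
) -> bool:
    """Return True when more than max_waiters threads are stuck behind the lock.

    max_waiters is an arbitrary illustrative threshold, not a recommended value.
    """
    return count_mdl_waiters(processlist) > max_waiters


# Made-up sample rows, only to show the expected input shape.
sample = [
    {"Id": 1, "Command": "Query", "State": "copy to tmp table", "Info": "ALTER TABLE ..."},
    {"Id": 2, "Command": "Query", "State": MDL_WAIT_STATE, "Info": "INSERT ..."},
    {"Id": 3, "Command": "Query", "State": MDL_WAIT_STATE, "Info": "INSERT ..."},
    {"Id": 4, "Command": "Sleep", "State": "", "Info": None},
]

if __name__ == "__main__":
    print(count_mdl_waiters(sample))
    print(should_abort_alter(sample, max_waiters=1))
```

A wrapper around the schema change could poll the process list on an interval and stop or pause the alter when `should_abort_alter` returns True. Whether aborting is cheaper than letting the alter finish depends on how far along it is, as discussed at 09:15:57.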