[00:02:38] DBA, Performance-Team (Radar): Backend Save Timing raised by +80ms at lower percentiles since 23 Mar 2021 - https://phabricator.wikimedia.org/T278274 (Legoktm)
[00:59:17] PROBLEM - MariaDB sustained replica lag on db2132 is CRITICAL: 2 ge 2 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db2132&var-port=9104
[01:03:49] RECOVERY - MariaDB sustained replica lag on db2132 is OK: (C)2 ge (W)1 ge 0 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db2132&var-port=9104
[03:06:41] DBA, Performance-Team (Radar): Backend Save Timing raised by +80ms at lower percentiles since 23 Mar 2021 - https://phabricator.wikimedia.org/T278274 (Krinkle) p:Triage→Medium
[03:22:43] PROBLEM - MariaDB sustained replica lag on pc2008 is CRITICAL: 2.2 ge 2 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=pc2008&var-port=9104
[03:27:23] RECOVERY - MariaDB sustained replica lag on pc2008 is OK: (C)2 ge (W)1 ge 0 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=pc2008&var-port=9104
[03:30:54] DBA, serviceops, Performance-Team (Radar): Backend Save Timing raised by +80ms at lower percentiles since 23 Mar 2021 - https://phabricator.wikimedia.org/T278274 (Krinkle) The regression can be seen on the MediaWiki RED dashboard as well, as a 40 percentage point drop in PHP-FPM responses that respon...
[05:56:30] DBA, SRE, ops-eqiad: Degraded RAID on db1086 - https://phabricator.wikimedia.org/T278226 (Marostegui) Thank you! The RAID is now back in optimal state ` logicaldrive 1 (3.6 TB, RAID 1+0, OK) physicaldrive 1I:1:1 (port 1I:box 1:bay 1, Solid State SATA, 800 GB, OK) physicaldrive 1I:...
[06:07:06] DBA, serviceops, Performance-Team (Radar): Backend Save Timing raised by +80ms at lower percentiles since 23 Mar 2021 - https://phabricator.wikimedia.org/T278274 (Marostegui) We pool and repool hosts pretty much all the time during core hours so it is sort of normal that anything can match any of tho...
[06:24:41] DBA, decommission-hardware, Patch-For-Review: decommission db1084.eqiad.wmnet - https://phabricator.wikimedia.org/T276302 (ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by root@cumin1001 for hosts: `db1084.eqiad.wmnet` - db1084.eqiad.wmnet (**PASS**) - Downtimed host on Icinga - F...
[06:25:28] DBA, decommission-hardware, Patch-For-Review: decommission db1084.eqiad.wmnet - https://phabricator.wikimedia.org/T276302 (Marostegui) This is ready for DC-Ops
[06:25:31] DBA, decommission-hardware, Patch-For-Review: decommission db1084.eqiad.wmnet - https://phabricator.wikimedia.org/T276302 (Marostegui)
[06:26:10] DBA, SRE, Patch-For-Review: Productionize db1155-db1175 and refresh and decommission db1074-db1095 (22 servers) - https://phabricator.wikimedia.org/T258361 (Marostegui)
[06:32:10] DBA, Patch-For-Review: Add *_direct_link to imagelinks and templatelinks - https://phabricator.wikimedia.org/T278236 (Marostegui) What are you trying to achieve with this change? I am especially concerned with adding more things to the already massive `templatelinks` table, which is now one of the top3 t...
[07:16:31] DBA, Epic, Patch-For-Review: Upgrade WMF database-and-backup-related hosts to buster - https://phabricator.wikimedia.org/T250666 (jcrespo)
[07:18:59] DBA, serviceops, Performance-Team (Radar): Backend Save Timing raised by +80ms at lower percentiles since 23 Mar 2021 - https://phabricator.wikimedia.org/T278274 (Joe) Open→Invalid We've had all supportive services serving from codfw for the well announced rebuild of the eqiad kubernetes clus...
[07:38:35] DBA, serviceops, Performance-Team (Radar): Backend Save Timing raised by +80ms at lower percentiles since 23 Mar 2021 - https://phabricator.wikimedia.org/T278274 (Marostegui) Thanks @Joe - I have started to repool db1141
[07:39:20] Blocked-on-schema-change, DBA: Schema change to make rc_id unsigned and rc_timestamp BINARY - https://phabricator.wikimedia.org/T276150 (Marostegui)
[07:47:03] kormat jynus sobanski tendril has been running now on a dbmonitor that is running buster. The change was made 1 hour ago and so far so good. If you see something strange, you can revert https://gerrit.wikimedia.org/r/c/operations/dns/+/674303 - I will be keeping an eye today, but I am off from tomorrow till the 5th
[07:50:12] Thanks for the heads up
[07:55:51] Enjoy your break marostegui, you lot deserve it
[07:57:00] RhinosF1: :***
[07:57:45] I assume that's an emoji and my tired brain can't read
[07:58:30] RhinosF1: It is an old-school emoji of course https://en.wikipedia.org/wiki/List_of_emoticons
[07:59:27] Ah
[08:50:31] marostegui: I think T276150 would free a bit of space probably not that much
[08:50:31] T276150: Schema change to make rc_id unsigned and rc_timestamp BINARY - https://phabricator.wikimedia.org/T276150
[08:50:53] Amir1: yeah, not counting much on that
[09:26:30] DBA, decommission-hardware: decommission db1076.eqiad.wmnet - https://phabricator.wikimedia.org/T274752 (Marostegui) Let's wait for another week before accepting that db1162 is ok, and then we can proceed
[09:57:11] DBA, SRE, Epic, Performance-Team (Radar), Sustainability (Incident Followup): Decide how to improve parsercache replication, sharding and HA - https://phabricator.wikimedia.org/T133523 (Kormat) a:Kormat→Marostegui Assigning to @Marostegui (but he might not get to it until he's back f...
[10:39:57] DBA, Patch-For-Review: Productionize db21[45-52] and db11[76-84] - https://phabricator.wikimedia.org/T275633 (Marostegui) db1181 is ready, but I am not pooling a new host for the first time before going on vacation. So I will leave it depooled and with notifications disabled.
[10:49:56] https://i.imgflip.com/52xq6d.jpg
[10:50:11] XDDDDD
[10:50:27] labsdb is even worse, s3+all the rest + the views
[10:51:14] https://i.imgflip.com/52xqce.jpg
[10:51:23] hahah
[10:51:25] hahahah that's definitely me
[11:22:51] DBA, Patch-For-Review: Add *_direct_link to imagelinks and templatelinks - https://phabricator.wikimedia.org/T278236 (Ladsgroup) Please don't do this. - *links tables are massive and messy and bound to break wikis (specially commons and enwiki), pagelinks table is already bigger than revision table in e...
[14:39:15] DBA, Analytics, Event-Platform, WMF-Architecture-Team, Services (later): Consistent MediaWiki state change events | MediaWiki events as source of truth - https://phabricator.wikimedia.org/T120242 (Ottomata)
[14:40:03] DBA, Analytics, Event-Platform, WMF-Architecture-Team, Services (later): Consistent MediaWiki state change events | MediaWiki events as source of truth - https://phabricator.wikimedia.org/T120242 (Ottomata) > I'll try to get together a concrete list of services that currently do rely on Media...
[16:55:59] hey, any ideas what's causing the database lock on T278350 or what to do to make things work again?
[16:56:00] T278350: Bug restoring [[ https://en.wikipedia.org/wiki/Help_talk:Getting_started | Help_talk:Getting_started ]] on en.wiki - https://phabricator.wikimedia.org/T278350
[17:13:15] DBA, serviceops, Performance-Team (Radar): Backend Save Timing raised by +80ms at lower percentiles since 23 Mar 2021 - https://phabricator.wikimedia.org/T278274 (Krinkle) @Joe Thanks, I did see the "kubernetes rebuild in eqiad" in the SAL but didn't connect the dots. Thanks.
[17:16:16] Majavah, no idea- do you know if the page had lots of edits?
[17:18:02] jynus: not really, about 20 total in a year and last edit before these moves was in January
[17:18:23] well, it is not the number of recent edits, but the total one
[17:19:02] it is possible that older edits are being worked on in the background, or that there is a thread restorign it in the background, all causing errors
[17:19:43] moves and deletes all happen instently, but they can take some time to complete in the background(because ongoing jobs)
[17:19:58] I would suggest to not touch it for a day and retry later
[17:20:36] *instantly
[17:21:01] ack, thanks, can you comment that on the task too or should I?
[17:21:53] my comment is non-canonical- I am not a developer, just trying to guess a way
[17:22:10] try adding #production-error tag for a more informed advice
[17:22:39] (I am more sure about that last advice) :-)
[17:23:44] it looked like a database-level lock, not a mediawiki-level one, guessing from "Error 1205: Lock wait timeout exceeded; try restarting transaction (10.64.16.101)"
[17:25:03] no, that I know for sure it is not a database error, but an application-produced one- it is trying to write many things at the same time, so the database rejected (there is nothing DBAs can do about it)
[17:26:50] Majavah, actually, I my guess was right
[17:27:17] The db is doing lot of work at "REPLACE /* WatchedItemStore::duplicateEntry */ INTO `watchlist` "
[17:27:29] meaning it is still updating the watchlist of users on Getting_started
[17:27:41] and that blocks more moves
[17:27:59] Number of page watchers 18,777
[17:28:07] ahh, thanks, that explains it :/
[17:28:12] yeah, I guess first edits, but in this case is watchers
[17:28:14] :-(
[17:28:20] *guessed
[17:28:22] sorry about that
[17:28:44] thanks for looking, at least the cause is now known
[17:28:48] I am confident enough now to advice to wait, and retry later
[17:29:07] Majavah, I heard also there was some job delays because maintenance
[17:29:08] will you leave comment there or should I?
[17:29:24] that could also impact why it is slower than usual
[17:29:36] I can do with the data I got
[17:29:42] thank you!
[17:38:15] Majavah, would you have a contact with admin rights to retry the rename I can ping when I think the problems is solved ?
[17:40:35] jynus: it's only semiprotected so any autoconfirmed user (10 edits and 4 days old account) can move that
[17:40:50] sorry, I mean the restore
[17:40:57] or was only a move pending?
[17:41:29] afaik currently a move pending, since https://en.wikipedia.org/wiki/Help_talk:Getting_started does not exist
[17:41:40] ah, ok, then no issue
[17:41:42] thanks
[17:42:18] if you happen to need someone with +sysop, you'll usually find one in #wikipedia-en
[17:42:29] thanks, Majavah
[18:57:57] DBA, Patch-For-Review: Add *_direct_link to imagelinks and templatelinks - https://phabricator.wikimedia.org/T278236 (BrandonXLF) That mostly works, but there are still issues if it's done that way. Let's say template A redirects to template B. If page 1 transcludes template A and template B, but page 2...
[21:56:18] DBA, SRE, Wikimedia-Mailing-lists: Create test databases for mailman3 - https://phabricator.wikimedia.org/T256538 (Legoktm) >>! In T256538#6933316, @Marostegui wrote: > @Legoktm @Dzahn you might need to open firewall rules to be able to reach db1128 (m5 master) from lists1002. > ` > # telnet db1128.e...
[22:00:09] any DBAs still around? I could use a sanity check on ^ which I've tried to implement as https://gerrit.wikimedia.org/r/c/operations/puppet/+/674724
[22:29:00] DBA, SRE, Wikimedia-Mailing-lists, Patch-For-Review: Create test databases for mailman3 - https://phabricator.wikimedia.org/T256538 (Legoktm) Can connect now: ` legoktm@lists1002:~$ telnet db1128.eqiad.wmnet 3306 Trying 10.64.0.98... Connected to db1128.eqiad.wmnet. Escape character is '^]'. ] 5....
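
Editor's note: the telnet check quoted in the last messages (opening a TCP connection to db1128.eqiad.wmnet on port 3306 to confirm the firewall rule took effect) can also be scripted. The following is only a minimal sketch under stated assumptions, not anything used in the log: the helper name `can_connect` and the 3-second default timeout are hypothetical, and it tests reachability of the port only, not MariaDB authentication or grants.

```python
import socket


def can_connect(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port can be opened.

    Hypothetical helper: it only checks that the port is reachable
    (for example, that a firewall rule allows the connection). It does
    not speak the MariaDB protocol or verify credentials.
    """
    try:
        # create_connection resolves the host name and attempts the TCP
        # handshake; the context manager closes the socket afterwards.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and name-resolution failures.
        return False
```

Run from the client host, `can_connect("db1128.eqiad.wmnet", 3306)` would mirror the telnet test in the log: `True` corresponds to "Connected to", while `False` points at a firewall, routing or name-resolution problem.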