[01:53:41] 10DBA, 10Wikimedia-Site-requests: Global rename of Tarawa1943 → Ontzak: supervision needed - https://phabricator.wikimedia.org/T206730 (101997kB) p:05Triage>03Lowest [01:58:20] 10DBA, 10Wikimedia-Site-requests: Global rename of Tarawa1943 → Ontzak: supervision needed - https://phabricator.wikimedia.org/T206730 (101997kB) [02:00:45] 10DBA, 10Wikimedia-Site-requests: Global rename of Tarawa1943 → Ontzak: supervision needed - https://phabricator.wikimedia.org/T206730 (101997kB) [02:02:15] 10DBA, 10Wikimedia-Site-requests: Global rename of Tarawa1943 → Ontzak: supervision needed - https://phabricator.wikimedia.org/T206730 (101997kB) fyi: This needs to be done after the 1 month standard waiting period for usurpation; I just created the task in advance. [04:09:16] 10Blocked-on-schema-change, 10DBA, 10Anti-Harassment, 10Patch-For-Review: Execute the schema change for Partial Blocks - https://phabricator.wikimedia.org/T204006 (10dbarratt) @Marostegui was the failover to eqiad completed? [04:58:46] 10Blocked-on-schema-change, 10DBA, 10Anti-Harassment, 10Patch-For-Review: Execute the schema change for Partial Blocks - https://phabricator.wikimedia.org/T204006 (10Marostegui) Yes, it was done correctly. We will see if we can finish this schema change by the end of next week [04:59:49] 10DBA, 10MediaWiki-Special-pages, 10Datacenter-Switchover-2018: Significant (17x) increase in time spent by updateSpecialPages.php script since datacenter switch over updating commons special pages - https://phabricator.wikimedia.org/T206592 (10Marostegui) So we are back in eqiad. Can you try the script and... [05:07:21] 10DBA, 10Operations, 10ops-eqiad: Degraded RAID on db1067 - https://phabricator.wikimedia.org/T206500 (10Marostegui) 05Open>03Resolved So after replacing the disk 3 times yesterday evening...we finally got this fixed! Thanks a lot Chris! ``` Number of Virtual Disks: 1 Virtual Drive: 0 (Target Id: 0) Na... [05:21:47] 10DBA, 10Operations, 10Epic, 10Patch-For-Review: DB meta task for next DC failover issues - https://phabricator.wikimedia.org/T189107 (10Marostegui) [05:21:49] 10DBA, 10Operations, 10cloud-services-team, 10wikitech.wikimedia.org, and 2 others: Move some wikis to s5 - https://phabricator.wikimedia.org/T184805 (10Marostegui) [05:23:03] 10DBA, 10Operations, 10Epic, 10Patch-For-Review: DB meta task for next DC failover issues - https://phabricator.wikimedia.org/T189107 (10Marostegui) 05Open>03Resolved a:03Marostegui All the tasks we scheduled to do whilst eqiad was passive were done! We also included T184805 as a last minute task,... [05:28:09] 10DBA, 10MediaWiki-General-or-Unknown, 10Operations, 10Wikidata, and 4 others: Investigate decrease in wikidata dispatch times due to eqiad -> codfw DC switch - https://phabricator.wikimedia.org/T205865 (10Marostegui) I have run an `analyze table` on both db1109 and db2083 and things are similar now: ```...
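For context, the `analyze table` step mentioned in that truncated comment amounts to something like the sketch below. This is illustrative only: the table name is a placeholder because the comment is cut off, and the syntax assumes MariaDB with InnoDB persistent statistics.

```
-- Recompute the statistics the optimizer uses for a table whose index
-- cardinality estimates have drifted (table name is a placeholder):
ANALYZE TABLE wikidatawiki.some_table;

-- The resulting estimates can then be compared across hosts (e.g. db1109
-- in eqiad vs db2083 in codfw) by running this on each of them:
SELECT index_name, seq_in_index, column_name, cardinality
FROM information_schema.statistics
WHERE table_schema = 'wikidatawiki'
  AND table_name = 'some_table'
ORDER BY index_name, seq_in_index;
```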
[05:31:09] I have deployed: https://gerrit.wikimedia.org/r/#/c/operations/dns/+/458790/ [05:42:56] 10DBA, 10MediaWiki-Database: parsercache used disk space increase - https://phabricator.wikimedia.org/T206740 (10Marostegui) [05:43:12] 10DBA, 10MediaWiki-Database: parsercache used disk space increase - https://phabricator.wikimedia.org/T206740 (10Marostegui) p:05Triage>03Normal [05:45:57] 10DBA, 10MediaWiki-API, 10MediaWiki-Database: prop=revisions API timing out for a specific user and pages they edited - https://phabricator.wikimedia.org/T197486 (10Marostegui) As discussed via email, this bug was fixed on 10.1.37: https://jira.mariadb.org/browse/MDEV-17155 which could end up "fixing" this i... [05:48:13] 10DBA, 10MediaWiki-Database: parsercache used disk space increase - https://phabricator.wikimedia.org/T206740 (10Marostegui) [05:48:43] 10DBA, 10MediaWiki-Database: parsercache used disk space increase - https://phabricator.wikimedia.org/T206740 (10Marostegui) Edited the task description to remove some graphs that could be misleading (they were server reboots due to upgrades) [06:10:52] 10DBA, 10Cloud-Services, 10User-Banyek: Prepare and check storage layer for yuewiktionary - https://phabricator.wikimedia.org/T205714 (10Marostegui) 05stalled>03Open a:05Marostegui>03None This can now proceed whenver you guys want. [06:11:26] 10DBA, 10Cloud-Services, 10User-Banyek: Prepare and check storage layer for liwikinews - https://phabricator.wikimedia.org/T205713 (10Marostegui) 05stalled>03Open a:05Marostegui>03None This can now proceed whenver you guys want. [06:35:39] 10DBA, 10MediaWiki-extensions-WikibaseRepository, 10Wikidata, 10MW-1.32-notes (WMF-deploy-2018-05-22 (1.32.0-wmf.5)), and 2 others: Clean up indexes of wb_terms table - https://phabricator.wikimedia.org/T194273 (10Marostegui) Is there any point on having this task if at some point wb_terms is going to be k... [07:02:51] morning marostegui ! [07:02:57] o/ [07:03:02] re the wikidata dispatching ticket did re repool db1109 yet? [07:03:08] was db1109 a slave? [07:03:33] yes [07:03:38] it is a slave [07:03:47] I depooled it to run the analyze table only [07:03:56] why? [07:04:54] aaaah so the Cardinality was essentialy being affected while it was in use ? but deplooled it returns to the same as codfw? [07:05:09] addshore: No, the analyze table did it [07:05:15] oh [07:05:18] As it recalculates all the table stats [07:05:31] The depooling was done because it would generate lag [07:05:38] gotcha [07:18:25] This is worrying: https://phabricator.wikimedia.org/T206740 [07:18:38] banyek ^ you might need to be careful tomorrow [07:20:49] that is the high load enwiki experienced [07:21:19] So it should stabilize soon? [07:22:33] thanks for the notice! [07:22:52] I think I have to find a feew infos in wikitech [07:23:05] marostegui: not necessarilly [07:31:20] marostegui: re the dispatching ticket, now that we know it returned to the same old slowness of eqiad im going to add some more tracking for the suspect parts and see what is taking all the time, then we will see if it is db related or redis related, or something else [07:33:04] do we want to keep this open? https://gerrit.wikimedia.org/r/#/c/operations/mediawiki-config/+/465634/ [07:37:08] we are quickly down from 20% to 14% [07:37:14] on pc1 [07:37:27] since :28 [07:48:41] yeah, they are eating 10GB every 30 minutes or so [07:50:09] 10DBA, 10MediaWiki-Database: parsercache used disk space increase - https://phabricator.wikimedia.org/T206740 (10Marostegui) p:05Normal>03Unbreak! 
This is increasing quite rapidly [07:50:36] 10DBA, 10MediaWiki-Database, 10Operations: parsercache used disk space increase - https://phabricator.wikimedia.org/T206740 (10Marostegui) [07:51:03] all the PC replicas are up-to-date [07:55:04] I'll purge the binary logs on all the PC hosts (pc1004, 1005, 1006, 2004, 2005, 2006) with `PURGE BINARY LOGS BEFORE (NOW() - INTERVAL 1 HOUR);` [07:55:26] at 10am CET [07:55:33] (please mention that on the other channel, so we can group all the conversations on a single place) [07:56:31] 10DBA, 10MediaWiki-General-or-Unknown, 10Operations, 10Wikidata, and 4 others: Investigate decrease in wikidata dispatch times due to eqiad -> codfw DC switch - https://phabricator.wikimedia.org/T205865 (10Addshore) Looking at a couple of minutes of extra timing data it looks like this is down to the selec... [07:56:36] marostegui: I have narrowed it down to 1 query, or so it would seem https://phabricator.wikimedia.org/T205865#4657657 [08:03:54] !log running /usr/local/bin/mwscript purgeParserCache.php --wiki=aawiki --age=1900800 --msleep 0 [08:03:55] jynus: Not expecting to hear !log here [08:05:44] addshore: we are dealing with a fire, sorry [08:06:33] marostegui: ack. I'll watch from afar and come back later :) [08:12:28] 10DBA, 10Wikimedia-Site-requests: Global rename of Tarawa1943 → Ontzak: supervision needed - https://phabricator.wikimedia.org/T206730 (10MarcoAurelio) 05Open>03stalled Stalling given that this cannot be actioned until the monthly period has passed. Thanks. [08:16:34] https://www.irccloud.com/pastebin/W1GfdPEC/ [08:16:37] jesus [08:17:35] cool, now rotate them [08:18:06] maybe 10MB it too low, you can up that, you are in charge [08:18:14] but get us space! [08:20:19] !log setting up replication from pc2004 -> pc1004 [08:20:19] jynus: Not expecting to hear !log here [08:29:28] addshore: sure, but, the last update doesn't give much idea on what you want next, as the conditions are kinda the key part of the query :-) [08:29:57] :D I can make the conditions clearer :) [08:30:02] im a meeting right now though :( [08:37:49] marostegui: I added some more details to https://phabricator.wikimedia.org/T205865#4657657 [08:38:41] will check later, still putting fires out [08:38:45] ack! [08:50:48] wow, my laptop just arrived [08:57:03] 10DBA, 10Lexicographical data, 10Wikidata, 10Datacenter-Switchover-2018, 10User-Addshore: A few lexemes disappeared - https://phabricator.wikimedia.org/T206743 (10Addshore) I picked a slave for s8 in each DC. ``` addshore@deploy1001:~$ sql wikidatawiki --host db1104 MariaDB [wikidatawiki]> select * from... [08:57:25] marostegui jynus ^^ i hate to introduce another bit of fun to your day [08:57:35] marostegui jynus ^^ i hate to introduce another bit of fun to your day/ [08:57:48] But it appears that we have some data missing for wikidatawiki in eqiad? [08:58:38] where is that stored? [08:58:48] in a "content"? [08:58:48] s8, wikidatawiki, page table [08:59:07] no, page don't really have content [08:59:16] is there a full page missing? 
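Stepping back to the parsercache binlog purge planned above at 07:55: a minimal sketch of that procedure, assuming MariaDB and that the replicas have already been confirmed to be up to date, would be:

```
-- On each parsercache master (pc1004-1006 in eqiad, pc2004-2006 in codfw).
-- List the attached replicas; lag is then checked on each replica with
-- SHOW SLAVE STATUS (Seconds_Behind_Master should be 0 before purging):
SHOW SLAVE HOSTS;

-- Discard binary logs older than one hour, as proposed in the log above:
PURGE BINARY LOGS BEFORE (NOW() - INTERVAL 1 HOUR);

-- And check how much space the remaining binlogs still take:
SHOW BINARY LOGS;
```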
[08:59:23] jynus: see https://phabricator.wikimedia.org/T206743#4657812 [08:59:28] or a page-like object [08:59:30] the row in the page table is missing, [08:59:52] possibly rows in the revision table too *checks* [09:00:01] that row is missing on the whole eqiad dc indeed [09:00:10] then you have a replication problem, wikidata may be doing unsafe statements [09:00:24] or mw, one of the 2 [09:00:38] at a guess this is due to the mcr related refactorings [09:01:19] yup, codfw also has the revision rows for those entities, but not eqiad [09:01:24] but wikidata didn't enable that? [09:01:31] when was the page created? [09:01:55] 20180913091819 [09:02:06] that is after the dc failover [09:02:07] according to the creation revision row in codfw [09:02:13] ? [09:03:22] and it wasn't deleted? [09:03:40] because the only issues I kneow about were about the archive table [09:03:45] *checks* but no i don't think so [09:04:01] check all logs you can find and update the task if you can [09:04:06] I will do archeology [09:04:07] 10DBA, 10Lexicographical data, 10Wikidata, 10Wikidata-Campsite, and 3 others: A few lexemes disappeared - https://phabricator.wikimedia.org/T206743 (10Addshore) [09:04:41] missing full transactions is very weird [09:04:50] yeah [09:05:16] addshore: is just that one? [09:05:16] I mean, its not even 1 trnsation [09:05:18] I and I could see it during a switch [09:05:20] that page has multiple revisions [09:05:27] but not one day later? [09:05:38] 0 revisions [09:05:39] *20 [09:06:02] so that should have been multiple transactions [09:06:02] first timestamp was 20180913091819 last one was 20180913092845 [09:06:16] the rest are there? [09:06:20] or it is just for that page? [09:06:27] on multiple times [09:06:34] so it looks like 3 pages are missing all entries in the page tables and revision tables [09:06:45] and that made replication not break [09:06:54] even on row based statement [09:06:58] that makes no sense [09:07:04] * addshore is very confused [09:07:05] I think that was deleted [09:07:10] yeah, I was right now thingking about the row replication on labs [09:07:10] it the only explanation [09:07:14] that would have broken [09:07:17] unless we filter page [09:07:18] do we? [09:07:19] right, let me check for deletion logs on both eqiad and codfw [09:07:24] no no [09:07:28] I mean physically deleted [09:07:32] oh...... [09:07:34] :| [09:07:49] someone with access removing, possibly accidentaly the row on one dc only [09:07:57] which makes no sense also [09:08:11] because if it is present on one full dc [09:08:13] let me get a list of all rows and ids that appear to be missing [09:08:17] but not on other [09:08:24] addshore: please do [09:08:27] we have backups [09:08:28] addshore: I am checking SAL for 13th Sept [09:08:30] Just in case [09:08:32] so no data loss should happen [09:08:41] we can compare backups at the time [09:09:10] the thing is, unless you have super [09:09:23] you cannot delete it from one dc only [09:09:35] e.g. 
imagine a maintenance script, for any reason, goes bad [09:09:49] it could delete all rows in all servers [09:09:53] but not on a single dc [09:10:25] do you see why that is very weird [09:10:37] So the only thing that happened around that time is a schema change deployed on eqiad master that day but has nothing to do with the table: https://phabricator.wikimedia.org/T89737 [09:10:39] I can see that happening on eqiad while it was passive [09:10:52] has nothing to do with page table or its content [09:10:57] and replication unconnected [09:11:04] but a very specific row? [09:12:06] which tables were altered? [09:12:13] MCR ones maybe? [09:12:16] no [09:12:18] it is a pretty old ticket [09:12:29] maybe a reboot [09:12:33] bot_passwords, change_tag, page_restrictions, tag_summary, user_newtalk, user_properties [09:12:44] that miraculously removed something [09:12:48] or setting up gtid? [09:13:01] let me see when we set up gtid [09:13:24] we should still have binary logs of the event on the masters [09:13:29] as it was replicated [09:13:45] we set up gtid AFTER that [09:13:55] at around 9:44 is the SAL entry [09:14:24] 2018-09-13 09:18:19 is a very narrow window [09:15:21] addshore: that is very strange, but we should have the tools to get to this [09:15:49] and to recover any loss [09:16:09] although the fact that they are a very specific type of pages [09:16:16] L* right? [09:16:51] 10DBA, 10Lexicographical data, 10Wikidata, 10Wikidata-Campsite, and 3 others: A few lexemes disappeared - https://phabricator.wikimedia.org/T206743 (10Addshore) From the reported pages in the ticket **page table** ``` CODFW: MariaDB [wikidatawiki]> select page_id from page where ( page_title = 'L20540' O... [09:16:59] okay, all the ids i found are at https://phabricator.wikimedia.org/T206743#4657831 [09:17:02] please tell users or other devels not to recreate or recover the rows [09:17:15] as that will make things more complicated [09:17:27] I am doing a compare.py between codfw vslow and eqiad vslow [09:17:45] narrow time range of 20180913091819 and 20180913095832 so far [09:17:49] marostegui: thanks [09:18:11] I'm going to look at the other things that happened around the time and see if it stretches to other pages [09:18:13] addshore: thanks [09:18:20] I can get you all activity [09:18:23] through the binlog [09:18:27] in fact, there is a revision for 20180913095832 for one of those pages in eqiad! [09:18:28] don't worry [09:18:32] jynus: lvoely [09:18:33] Not very accurate, I would need to stop replication [09:18:34] *lovely [09:18:49] marostegui: it does the checks on a single transaction [09:18:58] 10DBA, 10Lexicographical data, 10Wikidata, 10Wikidata-Campsite, and 3 others: A few lexemes disappeared - https://phabricator.wikimedia.org/T206743 (10Addshore) p:05High>03Unbreak! [09:18:59] so it should be mostly fine [09:19:10] then we are full of differences [09:19:16] 10DBA, 10MediaWiki-Database, 10Operations: parsercache used disk space increase - https://phabricator.wikimedia.org/T206740 (10Banyek) The binary logs were purged on pc1004, pc1005, pc1006. Also the binlog_max_size was set to 10M and the hosts now have this running in a screen: ``` while true; do echo "p... [09:19:32] I am going to investigate some of the random ones detected by compare [09:20:39] we need to save the binlogs [09:20:49] as they are deleted after 30 days [09:23:27] interesting, the binlog starts at exactly that time [09:23:42] over 12,000 revision rows missing it looks like? [09:23:43] ah, no [09:23:49] what? [09:23:50] 12k??
[09:23:53] SELECT rev_id, page_title from revision, page where rev_id > 745455836 and rev_id < 745468727 and page_id = rev_page; [09:23:57] try that in both DCs [09:24:15] also I spot this [09:24:15] | 745468464 | WikiProject_every_politician/France/data/Assembly/15th_(bio) | [09:24:25] so it is not wikidata / wikibase specific it would seem? [09:24:43] what wikis are you checking? [09:24:47] just wikidatawiki [09:25:12] how is this even possible? [09:26:37] as long as my query is right? [09:26:37] 10DBA, 10MediaWiki-Database, 10Operations: parsercache used disk space increase - https://phabricator.wikimedia.org/T206740 (10Banyek) [09:27:04] 10DBA, 10MediaWiki-Database, 10Operations: parsercache used disk space increase - https://phabricator.wikimedia.org/T206740 (10Banyek) [09:27:10] it could be a bug [09:27:16] on the query [09:27:19] So I am trying just a simple query that reported differences [09:27:20] which is [09:27:26] select page_touched from page where page_id=2408366 [09:27:27] 10DBA, 10MediaWiki-Database, 10Operations: parsercache used disk space increase - https://phabricator.wikimedia.org/T206740 (10Banyek) [09:27:28] In eqiad we have [09:27:40] page_touched: 20180805140619 and in codfw we have: page_touched: 20180913092709 [09:27:47] :/ [09:28:09] 10DBA, 10Lexicographical data, 10Wikidata, 10Wikidata-Campsite, and 3 others: A few lexemes disappeared - https://phabricator.wikimedia.org/T206743 (10Addshore) It looks like it is more than just a few rows: CODFW: ``` MariaDB [wikidatawiki]> SELECT count(*) from revision, page where rev_id > 745455836 a... [09:28:43] I'm going to see if I can find anything in the same time range on other dbs to see if it is just the wikidatawiki db or not [09:29:20] could this be unsafe statements? [09:29:53] this makes no sense [09:30:02] 10DBA, 10Lexicographical data, 10Multi-Content-Revisions, 10Wikidata, and 4 others: A few lexemes disappeared - https://phabricator.wikimedia.org/T206743 (10Addshore) Tagging #multi-content-revisions as a suspect [09:30:47] addshore: can you try to find something the other way around, that got written into eqiad yesterday after the failover and never arrived to codfw? that would narrow things down to unsafe statements I think [09:31:03] marostegui: I can do [09:31:06] / try [09:31:31] thanks [09:32:51] marostegui: should I be looking close to the time after the switchover? [09:33:24] yeah [09:33:33] after 15:00 UTC I would say [09:33:52] okay! [09:34:14] so this is bad [09:34:40] :/ [09:34:40] 10DBA, 10MediaWiki-Database, 10Operations: parsercache used disk space increase - https://phabricator.wikimedia.org/T206740 (10akosiaris) [09:34:51] at 2018-09-13T09:18 the only thing being replicated is heartbeats [09:34:58] which means replication wasn't working [09:35:00] ???? [09:35:08] ouch [09:35:21] just for s8? or? [09:35:23] i mean, it could have a delay [09:35:30] replication stopped [09:35:32] that is normal [09:35:43] we stopped replication many times [09:35:59] now the issue is whre it started replicating from [09:37:05] so it looks like it was a replication issue rather than some transaction issue? 
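For reference, the cross-DC check being run here boils down to executing the same bounded query on a replica in each datacenter and comparing the results; a sketch using the rev_id bounds quoted above:

```
-- Run once on an eqiad replica and once on a codfw replica, then diff the counts:
SELECT COUNT(*) AS revisions_in_range
FROM revision
JOIN page ON page_id = rev_page
WHERE rev_id > 745455836
  AND rev_id < 745468727;
```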
[09:37:20] I see starting to get events at 9:58 [09:37:38] but need to know for what real timestamp [09:37:55] from 20180913095825 [09:37:58] that fits with the revision from my query earlier that is in eqiad, 20180913095832 [09:38:00] so that is ok [09:38:08] just replication stopped for a while [09:38:12] which we did many times [09:39:46] we have no logs after 17th sept [09:39:47] damn [09:39:48] yep, I am seeing recentchanges from 20180913095825 being applied at 180913 9:58:26 [09:40:31] so just a delay [09:40:42] which could be an upgrade [09:40:42] let me get the first and last timestamp since then [09:40:47] yes, that is normal [09:41:25] but if a 2018-09-13 09:18:19 update is missing [09:42:01] wait [09:42:02] but it is missing or just delayed? [09:42:18] the server was rebooted the 17th [09:42:26] But I am sure it was rebooted maybe before too [09:42:31] can you check revisions earlier than 20180913095825 ? [09:42:46] on the live dbs [09:42:51] and after [09:42:55] addshore: ^ [09:42:58] I am expecting something like [09:43:06] have all revisions after [09:43:12] and missing before [09:43:53] 10DBA, 10MediaWiki-General-or-Unknown, 10Operations, 10Wikidata, and 4 others: Investigate decrease in wikidata dispatch times due to eqiad -> codfw DC switch - https://phabricator.wikimedia.org/T205865 (10hoo) Thanks @Marostegui but this sadly didn't help. Do you have any other ideas what could cause thes... [09:44:12] *&looks* [09:44:13] GTID enablement was done at 09:44 that same day , on all eqiad masters [09:44:26] rev_id=745468910 is the first one being replicated [09:44:32] which some may be missing [09:44:35] before [09:44:39] can you confirm that? [09:45:57] eqiad goes from 745452473 to 745468910 [09:45:58] 10DBA, 10Lexicographical data, 10Multi-Content-Revisions, 10Wikidata, and 4 others: A few lexemes disappeared - https://phabricator.wikimedia.org/T206743 (10Marostegui) So for the record: we are live debugging with @addshore on IRC [09:46:26] addshore: can those account for all missing rows? [09:46:35] which is a timestamp of 20180913090816 to 20180913095825 [09:46:35] *checks* [09:46:37] or are there more [09:46:44] at the moment [09:47:06] jynus: yes, everything I have spotted with what was reported in the ticket [09:47:12] if it is only that, that is a localized replication failure [09:47:22] and a) it is recoverable and it is easy [09:47:29] but I still don't know the reasons [09:47:37] I still don't get how is that possible [09:47:43] for some reason, replication jumped ahead [09:47:50] on a master, which we treat like [09:47:55] in a golden plate [09:48:19] 10DBA, 10Lexicographical data, 10Wikidata, 10Wikidata-Campsite, and 3 others: A few lexemes disappeared - https://phabricator.wikimedia.org/T206743 (10Addshore) and removing MCR as it looks like this was some issue with replication [09:48:22] also, the good news is that anything that has a uniwue id [09:48:32] can be reinserted [09:48:43] updates are more complicated [09:48:46] 10DBA, 10MediaWiki-General-or-Unknown, 10Operations, 10Wikidata, and 4 others: Investigate decrease in wikidata dispatch times due to eqiad -> codfw DC switch - https://phabricator.wikimedia.org/T205865 (10hoo) ```wikiadmin@db1109(wikidatawiki)> SELECT * FROM information_schema.tables WHERE table_name = 'w... 
[09:49:21] let me translate that into binlogs [09:49:29] so, there was lag at that time (probably because of the schema change): https://grafana.wikimedia.org/dashboard/db/mysql?orgId=1&var-dc=eqiad%20prometheus%2Fops&var-server=db1071&var-port=9104&from=1536821324794&to=1536832784000 [09:49:33] lag on the master [09:49:43] for more exact identification (transactin size) [09:50:08] marostegui: can you check logs at that time [09:50:14] operation ones [09:50:24] yeah, as I said, there was a schema change on going [09:50:28] but not touching any of those tables [09:50:40] yeah, I just want to know if someting was rebooted [09:50:46] or an alert went off [09:50:49] or soemthing [09:50:51] No, no reboots [09:50:55] just gtid enabled at 09:44 [09:51:00] ok [09:51:02] on eqiad masters [09:51:07] that could be it [09:51:14] but that is AFTER this issue [09:51:18] maybe gtid got autopositioned [09:51:24] wrongly [09:51:32] but we enabled it after this happened [09:51:33] or some weird thing [09:51:45] so at 09:13 gtid wasn't enabled [09:51:56] this is already weird, we need to think possiblities [09:52:13] that are improbable, like bugs [09:53:31] for now I am saving the binlogs [09:53:36] to be able to recover [09:53:49] let's check also the systemd logs [09:53:53] I did [09:53:57] it starts the 17th [09:53:59] and, no start? [09:54:00] when we rebooted it :( [09:54:01] ah [09:54:02] sorry [09:54:22] I have not been able to find anything older than the 17th [09:54:36] we have the binlogs :-) [09:54:59] yeah, but wanted to check mysql replication on the logs [09:55:18] so a petition- can you and addshore check for other pontential losses? [09:55:34] on wikidata gaps on revision are quite noticeable [09:55:50] can you try to find others, to see if it is a one time thing or a reapeated thing [09:55:57] that is my biggest worry right now [09:56:21] aka a query to find the largest gap on the revision timestamp or the revision id? [09:56:48] while I keep checking the binlogs? [09:57:25] hmmmmmmmm * thinks about how to write such a query* :P [09:57:28] jynus: I will jump into the meeting to give m*rk an update on this and on the purchases, feel free to skip it, up to you [09:57:36] ok, thanks [09:57:43] I can join, too, and hear [09:57:52] up yo sou [09:57:54] whatever you prefer [09:58:05] jynus: I also have a meeting in 3 mins, but will be back shortly after ~10 / 15 mins [09:59:36] so per the graphs, this server was only rebooted the 17th [10:01:25] banyek: are you joining the meeting? [10:01:32] oh, yes [10:20:58] back [10:24:21] addshore: can you check if this happened enwiki? [10:24:40] marostegui: can I check on dewiki (its a little bit smaller) ;) [10:24:48] or would enwiki be better? [10:25:04] addshore: sure, check dewiki [10:31:30] So [10:31:33] MariaDB [dewiki]> select * from revision where rev_id = 180868940; [10:31:37] returns a row in doth DCs [10:31:45] timestamp is 20180913093000 [10:32:00] which is within the range we are looking at of 20180913090816 to 20180913095825 [10:32:06] thanks [10:32:21] so just s8? [10:32:40] if you could check frwiki and jawiki? [10:32:41] (that s6) [10:32:46] just to have another check [10:33:09] can do [10:36:57] for fr wiki [10:37:02] thanks [10:37:09] select * from revision where rev_id = 152158259; returns in both DCs [10:37:23] timestamp of 20180913093002 [10:38:09] what query can i run to find out the hostname of the db im on? 
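One way to write the gap-finding query jynus asks for above, sketched for illustration only: the timestamp bounds are an assumption to keep it cheap, and on a table the size of wikidatawiki.revision the correlated subquery would otherwise be far too slow.

```
-- For every revision in the window, find the next existing rev_id;
-- large jumps are candidate gaps worth investigating:
SELECT gap_starts_after,
       next_rev_id,
       next_rev_id - gap_starts_after - 1 AS missing_ids
FROM (
    SELECT r.rev_id AS gap_starts_after,
           (SELECT MIN(r2.rev_id)
              FROM revision r2
             WHERE r2.rev_id > r.rev_id) AS next_rev_id
      FROM revision r
     WHERE r.rev_timestamp BETWEEN '20180913000000' AND '20180914000000'
) AS gaps
WHERE next_rev_id - gap_starts_after > 1
ORDER BY missing_ids DESC
LIMIT 10;
```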
[10:38:19] select @@hotsname; [10:38:46] yup, good, just checking i was actually in the right dcs [10:38:52] i'll check jawiki now [10:39:25] <3 [10:41:07] jawiki also looks fine with select * from revision where rev_id = 69914221; and a timestamp of 20180913093006 [10:41:19] thanks [10:41:41] so just s8? :) [10:42:09] and it is not limited to revision and page tables right? it would be all tables as it was a replication issue? [10:42:27] yep [10:42:33] we are discussing things on the meeting [10:42:42] we will end up rebuilding all eqiad hosts from codfw [10:42:46] ack, ping me if you need me [10:42:53] thank you [10:43:00] shall i assign the ticket to one of you? [10:43:14] sure, assign it to me [10:43:20] I will update with what we have decided [10:44:12] 10DBA, 10Lexicographical data, 10Wikidata, 10Wikidata-Campsite, and 3 others: S8 replication issues leading to rows missing during eqiad -> codfw switch (Was: "A few lexemes disappeared") - https://phabricator.wikimedia.org/T206743 (10Addshore) a:05Addshore>03Marostegui [10:44:29] :) [11:01:54] I set expire_logs_days to 60 on db2045 - codfw master [11:02:43] thanks [11:04:30] ```set global expire_logs_days=60;``` [11:06:34] there are 147 binlog files on the codfw host, and they take ~150Gb of space [11:06:43] shall I simply copy them with scp/ [11:06:44] ? [11:07:38] banyek: copy them to dbstore1001 using transfer.py [11:07:50] I made a local copy only to /srv/tmp [11:07:56] of the oldest ones [11:08:14] I copy that directory then [11:09:14] it is the older ones, but they contain all the info we need [11:09:21] I mean we just need those (the oldest) ones to be copied, right? [11:09:27] he he yes [11:09:28] ah, I just wanted to ask this :) [11:09:28] ok [11:09:35] gr8 [11:13:05] db1071-bin.007238:795791989 2018-09-13 9:08:17 is the last replicated event [11:14:36] jynus: marostegui: I'm about to run `/home/jynus/wmfmariadbpy/wmfmariadbpy/transfer.py db2045.codfw.wmnet:/srv/tmp dbstore1001.eqiad.wmnet:/srv/s8.tmp` on neodymium [11:14:39] and db1071-bin.007238:796727644 2018-09-13 09:58:26 the restart of events [11:14:54] banyek: ok [11:15:50] s8.tmp -> tmp.s8 [11:17:16] happening now [11:18:21] 10DBA, 10Lexicographical data, 10Wikidata, 10Wikidata-Campsite, and 3 others: S8 replication issues leading to rows missing during eqiad -> codfw switch (Was: "A few lexemes disappeared") - https://phabricator.wikimedia.org/T206743 (10Marostegui) What we know for now is that there was a gap in replication... [11:18:30] I have started the transfer from db2085 to db1099 and also updated the ticket ^ [11:18:37] thanks [11:23:32] I am going to go and grab food [11:23:37] I will not touch db2085 [11:23:41] I will do, too [11:23:49] yeah, db2085 and db1099 are down [11:23:53] but I may start filling in the others [11:24:01] be careful with the sanitarium master [11:24:11] no, only one host at a time [11:24:13] we'll need to see how we rebuild that one [11:24:17] and not touching sanitarium [11:24:19] but that is for later to discuss [11:24:20] cool [11:24:29] lunch time then! :) [11:24:34] we can copy it from codfw [11:24:37] bye [11:26:49] I copied the binlogs [11:31:30] so I have the sql with the missing transactions, it is only 225MB [11:41:30] sounds small [11:54:24] I know you are at middle of this mess, just letting you know I deployed the change to read from the ct_tag_id column instead of ct_tag (change_tag normalization). This changes queries of basically everything (RC/Watchlist/History/API modules/....) 
for mediawiki.org/testwikis/some small wikis for now. If anything pops up performance-wise, let me know: https://phabricator.wikimedia.org/T194164 [11:55:29] I will deploy it on some bigger wikis next week (probably section by section). If any section is better to start or should not have the change, also let me know [12:25:14] 10DBA, 10Lexicographical data, 10Wikidata, 10Wikidata-Campsite, and 3 others: S8 replication issues leading to rows missing during eqiad -> codfw switch (Was: "A few lexemes disappeared") - https://phabricator.wikimedia.org/T206743 (10jcrespo) We know the exact timestamps of missing rows, from ``` db1071-b... [12:42:38] Amir1: or addshore do you know how advanced is MCR in wikidata? [12:42:49] *checks* [12:42:55] I know it's not on READ_NEW [12:43:13] it should be only commons and it'll be on October 15 [12:43:24] I was wondering if I could speed up by only recovering revision, page, recentchanges, user and text [12:43:44] and do comment, slot content later [12:43:49] i expect youll also need actor and comment tables [12:43:53] mmm [12:43:56] ok [12:44:03] but let me confirm [12:44:37] looks like you should also grab all the other related db tables, as the default on all wikis is currently write both [12:45:03] I am not worried about write [12:45:16] we will fully resover all the hosts [12:45:21] default for read is read old [12:45:25] I just want to make for now work all missing pages [12:51:03] sorry, I'm super late with the topic, just wondering if switching back to codfw was considered as an option [12:51:34] it is being considered [12:51:46] but the more I think, 1 hour of data 1 month ago that is recoverable [12:52:42] ack [13:01:12] there is 0 rows on actor [13:01:19] like everywhere [13:07:29] jynus: banyek any objection if I reclone db1116:3318 backup source for s8 eqiad, so we have a non partitioned copy in eqiad? [13:08:22] why not keeping that around just in case? [13:08:45] it may fit if you want to clone it from codfw? [13:08:55] That is what I want to do [13:09:04] i think if the backup source is the first you clone that's a really good idea [13:09:05] Sorry, I was not clear, let me rephrase [13:09:13] ok [13:09:35] I want to reclone db1116:3318 from codfw, so we have a non partiioned copy of the data (as the host that is being recloned now is a RC slavee) [13:09:59] and I say ok, but to keep the old instance [13:10:14] keep the old instance? [13:10:43] ther is 2.3T GB available [13:10:51] jynus: Ah cool, I will move it then [13:10:52] rename it and do not drop it? [13:10:59] yep, now I get what you mean :) [13:19:32] 10DBA, 10Lexicographical data, 10Wikidata, 10Wikidata-Campsite, and 4 others: S8 replication issues leading to rows missing during eqiad -> codfw switch (Was: "A few lexemes disappeared") - https://phabricator.wikimedia.org/T206743 (10Marostegui) s8 host cloning process [] labsdb1011 [] labsdb1010 [] labs... [13:22:43] 10DBA, 10Lexicographical data, 10Wikidata, 10Wikidata-Campsite, and 4 others: S8 replication issues leading to rows missing during eqiad -> codfw switch (Was: "A few lexemes disappeared") - https://phabricator.wikimedia.org/T206743 (10Marostegui) [13:22:44] I have updated the task to put the relevant information links on top, so it is easier to track [13:23:15] I am ready to do the fill in [13:23:19] great [13:24:45] which host to test first? [13:25:18] db1092 maybe? 
[13:25:19] jynus: either db1092, db1104 or db1109 [13:25:34] jynus: or evne dbstore1002, but that one might have even more issues [13:25:43] ah, good one [13:39:44] addshore: can you check dbstore1002? [13:40:14] it is not perfect, but that should give the page backs until we do a proper fix [13:41:14] it takes around 5 minutes to fix one host [13:46:16] *reads up* [13:47:09] hmmmh, dbstore1002 isn't in the list of each servers to access for me :P [13:47:11] jynus: which tables did you fix? [13:47:19] it is on the etherpad [13:47:45] revision, page, user, recentchanges, text, comment, content, logging, revision_comment_Temp and slots [13:48:23] there will be still some issues with change tags wb changes and others [13:48:35] but I think those should be minor [13:50:21] some things like links* and page date refress will be outadated [13:50:39] yeah, change_tags and tag_summary [13:50:41] or lovely friends [13:51:45] so what do you say, I apply the changes on a production host, wait a bit and the apply them to all? [13:52:15] maybe except the labs master [13:52:20] that we can depool [13:52:55] yeah, db1099:3318 will fail, it is down [13:53:10] the sanitarium master we will need to rebuild with mydumper I think [13:53:21] or copy it from codfw [13:53:39] but we still have to propagate the changes to the labs hosts [13:53:55] so should I go on with the main hosts? [13:53:59] the others? [13:54:31] yeah, go for the main hosts and leave the sanitarium master aside for now I would say [13:54:56] I will do db1092, and wait a bit [13:54:59] feel free to do db1101, the other rc slaves [13:55:02] jynus: sounds good [13:55:03] why with mydumper [13:55:05] ? [13:55:26] banyek: yeah, because the sanitarium master will replicate data to sanitarium and to labs [13:55:33] we cannot do scp [14:00:28] we could apply the changes with replication on the master [14:00:40] (the labs master, not the real one) [14:00:56] addshore: check now on db1092 [14:01:01] *checks* [14:01:03] if it is ok, I will apply to all replicas [14:01:23] it is not fixed, ok, just hopefully made it enough to work [14:01:59] hey guys [14:02:01] how's it going? [14:02:04] [14:02:13] jynus: the revision table there loooks great [14:02:20] I have the logical fix [14:02:29] but we are double checking before apply it everywhere [14:02:39] hey mark, banyek and me were on the call [14:02:44] and the page table also looks good for the titles i was checking before [14:02:47] do you want to do the call or should we do it via irc? [14:03:04] addshore: ok, so I will apply to the other hosts [14:03:05] either way, what do you prefer? [14:03:32] whatever is easier for you [14:03:52] jynus: great! [14:04:05] irc is easiest I think but i don't mind a call :) [14:04:40] let's do it on irc then [14:05:33] so, the cloning process: the first rc slave is about to finish. 
Once it finishes and it catches up, i will move it under eqiad master and start the second rc slave, which hopefully will be ready before I go to bed [14:05:57] At the same time, I am cloning the backup source now from codfw as well, which will allow us to have a non partitioned copy ready to be distributed in eqiad [14:06:08] reminder: rc slaves are special as they have partitions [14:06:14] right [14:06:37] so to sum up: I expect to finish both recentchanges slaves later today and have a fresh non partitioned copy ready for tomorrow so we can clone a couple of hosts tomorrow [14:07:12] the fill-in will finis in around 20 minutes [14:07:13] [14:07:19] thanks manuel [14:07:38] we will see how that goes then- if it is good enough for short term [14:07:51] it takes around 5 minute to be applied per server [14:07:51] I can start tomorrow with cloning the non-rc slaves. I just need a few rule of thumbs like, which machines I don't supposed to do the 'basic' way (eg. clone from the source) [14:08:14] but if jynus can help me with that tomorrow if he's there, I can do them one by one [14:08:25] I will work tomorrow from around 7am to 12 [14:08:30] if my fix is good enough [14:08:36] we can take it easy [14:08:39] and do it next week [14:08:57] but how will we know your fix is good enough to last through the weekend? [14:09:10] actually there's a limit of parallelism we can do with cloning [14:09:12] as the data loss will be technically fixed, only we will have discrepancies on chached, non canonica data [14:09:23] mark: that is true for the current state [14:09:29] yes [14:09:34] but we might just be lucky it hasn't broken yet [14:09:42] if replication breaks is the answer :-) [14:09:50] that's what i'm worried about ;) [14:09:55] I mean we need machines to serve, so I guess there's a limit of 'we can't clone more than 2 hosts at a time' [14:10:14] banyek: yes, we cannot clone more than once at the time [14:10:19] the lexemes are back in read only https://www.wikidata.org/wiki/Lexeme:L20543 [14:10:26] https://www.wikidata.org/wiki/Lexeme:L20540 [14:10:32] ^addshore [14:10:36] how come they are readonly? [14:10:40] they are not [14:10:44] great! [14:10:45] but they don't exist on the master [14:10:50] haha [14:10:54] right [14:10:56] once I add the fix on the master [14:11:02] there is no going back [14:11:10] (for eqiad) [14:11:15] no going back to what? [14:11:25] jynus: I assume you are doing sql_log_bin=0 when inserting, no? [14:11:29] to what eqiad looked like before the fillin? [14:11:30] marostegui: yes [14:11:33] great [14:11:43] marostegui: yes [14:11:48] mark: yes [14:11:58] well, once it starts being editted [14:12:22] but on the other side, not adding them is also a rick of breaking codfw [14:12:28] so I think I am going to add it [14:12:29] yeah both ae [14:12:32] can we... [14:12:38] we have backups [14:12:40] have a snapshot of a host which has not yet been filled in? [14:12:42] so we can go back? [14:12:47] yes [14:12:49] ok [14:12:56] I literally asked to leave dbstore2001 [14:12:59] not that [14:13:03] db1116 [14:13:04] alone [14:13:09] to manuel and he did [14:13:15] set an MOTD [14:13:18] I am going to run in on the master [14:13:31] where, -ops? 
[14:13:33] yes, there is a copy at: /srv/sqldata.s8_BACKUP_T206743 [14:13:48] no on the server :P [14:13:54] ok, I can do that [14:13:56] making sure noone does it by accident [14:14:14] it is moved, cannot run [14:14:26] a new instance is being setup there [14:14:28] yeah, that is why I moved [14:14:30] while keeping the old archived [14:14:34] if that is clearer [14:14:42] so it is stopped and shut down [14:14:44] I will leave a MOTD anyways [14:14:47] sure [14:14:55] ok applying on the master and praying [14:15:46] MOTD in place [14:16:39] I have not applied it to db1087 [14:16:41] jynus: once you are done for the fill in day, can you comment on the task with the hosts done so we get to know which ones were done and which ones were not done? [14:16:42] we should depool it [14:16:47] yes [14:16:49] great [14:16:56] ok, master applied [14:16:59] banyek: can you depool db1087? [14:17:02] I am going to depool db1087 [14:17:04] or he can [14:17:13] yes [14:17:14] and then we will edit the page [14:17:24] banyek: give more weight to db1092 (give it 200) [14:17:33] it is doing fine without the BBU [14:17:39] ok [14:19:26] 10DBA, 10Lexicographical data, 10Wikidata, 10Wikidata-Campsite, and 4 others: S8 replication issues leading to rows missing during eqiad -> codfw switch (Was: "A few lexemes disappeared") - https://phabricator.wikimedia.org/T206743 (10jcrespo) applied a fix to: ``` db1109 db1071 db1104 db1101:3318 db1092... [14:19:30] T206743#4658476 [14:19:30] T206743: S8 replication issues leading to rows missing during eqiad -> codfw switch (Was: "A few lexemes disappeared") - https://phabricator.wikimedia.org/T206743 [14:20:17] is the depool ongoing, or did I missed it? [14:20:26] 1087 is the only host in vslow, and dump doesnt it a problem? [14:20:45] no worries [14:20:49] just put no host [14:22:47] so the fix is on all slaves and the master now? [14:23:42] on most slaves [14:23:45] if you +1 it I can deploy [14:23:51] we are depooling the one that isn't [14:24:02] 10DBA, 10MediaWiki-Database, 10Operations: parsercache used disk space increase - https://phabricator.wikimedia.org/T206740 (10Marostegui) p:05Unbreak!>03High This is no longer "unbreak now" ``` pc1004 Filesystem Type Size Used Avail Use% Mounted on /dev/mapper/pc1004--vg-srv xfs 2.2T... [14:24:47] banyek: remove the red and we can deploy [14:24:50] jynus: lovely [14:24:52] I'll let you know if I spot anything odd happening, but I think everything should be okay in this state until the full fix is done [14:24:59] addshore: wait we are depooling [14:25:05] and then writing [14:25:11] to check it will not break immediately [14:25:18] aaah depooling then doing the master [14:25:19] gotcha [14:25:20] then we can reduce the UBN [14:25:28] yup [14:25:33] and deploy slowly the Good fix [14:25:41] This has been an action packed day for you all :P [14:25:44] which is reimage all servers [14:25:59] I am starting db1099:·3318, the first recloned rc! [14:26:08] addshore: all dba days are action packed! [14:26:17] do you know we are hiring! [14:26:23] inded, but not normally with this many UBNs :P [14:26:25] :-) just kidding [14:26:26] *indeed [14:26:35] unless you are intersted, in whcih I am not [14:26:36] hahaha, wikidata are hiring too, come join us ;) [14:26:43] nope, too much data [14:26:56] its fine, we have you to manage it for us ;) [14:27:19] I fixed it can I have a +1 ? :) [14:27:37] banyek: what about a +2? 
[14:27:57] even better [14:28:01] I deploy this [14:28:10] mark: so id the above works- labsdb or codfw replication may break [14:28:17] (replicatino) [14:28:23] but eqiad hosts shouldn't [14:28:31] and that I would be pretty sure [14:28:31] we need to discuss how we want to reclone the sanitarium master [14:28:38] no need to do it now [14:28:51] yes, I am saying that we may do all the followups at a latter time [14:28:52] jynus: what do you mean? [14:28:57] ^that [14:29:05] assuming this works [14:29:08] if what works? editing a page that used to be missing? [14:29:11] yep [14:29:25] why would codfw repl break? [14:29:35] because I have imported the basic data [14:29:42] but not the ancillary one [14:29:46] right [14:29:49] he cached like links here [14:29:56] that doesn't affect the reali integrity [14:30:01] but it make break repl [14:30:14] (as it should had happened 1 month ago) [14:30:51] but it cannto break replication if it is not 100% as before, but all the eqiad ones are identical [14:30:56] db1087 is depooled [14:31:01] thanks, banyek [14:31:04] ok [14:31:05] np [14:31:20] I use should and cannot and must in the rfc meaning [14:31:50] I indeed see it at https://noc.wikimedia.org/conf/highlight.php?file=db-eqiad.php [14:32:47] addshore: can you test edit https://www.wikidata.org/wiki/Lexeme:L20543 for us [14:32:52] we are now in switchover window [14:32:54] sure! :) [14:32:55] can we hold off for a bit? [14:33:00] ok [14:33:03] this is a bad time for stuff to break [14:33:05] addshore: wait [14:33:10] ack [14:34:05] so to be clear, wikidata cannot break- we are only testing the codfw replication, but of course I can wait [14:34:37] yeah but even noise in -operations is not great right now [14:34:38] even if unrelated [14:34:54] cool then, ping us when it is ok [14:35:13] meanwhile, should we delete some binlogs on pc2XXX to go thru the night without issues on codfw? [14:35:21] https://phabricator.wikimedia.org/T206740#4658480 [14:35:26] I think banyek can take care of parser caches [14:35:30] banyek: ^ [14:35:33] he is actively deleting those [14:35:40] only on eqiad as far as I know [14:35:53] oh, are you worried about codfw? [14:35:59] yeah, it is 85% [14:36:00] did replication catch up? [14:36:03] yep [14:36:07] check my comment [14:36:16] sorry, I missed that [14:36:17] yes, the cleaners are running on the parsercache hosts [14:36:29] banyek: but not on codfw, right? [14:36:33] but only in eqiad, yes [14:36:40] you can now add them on codfw [14:36:48] banyek: double check replication is up to date, and clean some on codfw I would suggest [14:37:01] db1099:3318 started now? [14:37:11] yep, it is catching up [14:37:11] or recently? 
[14:37:20] ok, that why I know not to touch it [14:37:24] nope [14:37:27] it has codfw data [14:37:48] I will reclone db1101:3318 in an hour or so [14:37:53] we should have quickly at least 2 servers with codfw data [14:38:11] jynus: today we will have both rc slaves and the backup source [14:38:29] for me quickly == next week [14:38:33] all the pc hosts in eqiad has Seconds_behind_master=0 [14:38:35] hehe [14:38:51] jynus thinks in 5 year scales [14:38:51] jynus: I will leave those 3 today, and tomorrow using the backup source I will try to clone 1 or 2 time allowing [14:38:53] so I clean up the binlogs from the pc200* hosts [14:38:54] no revisions or content are lost right now [14:39:02] so I am pretty satisfied [14:39:04] I set the binlog size to 10M [14:39:07] flush flogs [14:39:18] flush logs and clean up [14:39:22] mark: ok to update the status? [14:39:25] on the ticket [14:40:39] sure [14:42:25] marostegui: what do you think about applying the fixes to the labsdb master with binlong on? [14:42:58] it wont break, and it will sanitize [14:43:22] so we could repool db1087 [14:43:23] ? [14:43:44] yes, and also prevent a potential replication breakage [14:43:45] I was actually thinking that even with myloader, the triggers and filter should work just fine too [14:43:51] jynus: yeah, let's go for it [14:43:57] oh, I mean as a temporary issue [14:43:59] after all you are just doing inserts [14:44:01] I know I know [14:44:04] we can later do any reimport or whateber [14:44:07] I just went a step ahead [14:44:28] you are thinking more and more like me [14:44:30] :-D [14:44:38] will deploy on db1087, then [14:44:41] I have moved db1099:3318 from under db2045 to under eqiad master, so it is almost ready to be repooled back [14:44:44] jynus: +1 [14:44:50] jynus: uh oh :P [14:44:58] ? [14:45:01] not now [14:45:05] * mark makes a snapshot of manuel [14:45:08] XDDD [14:45:14] I purged the binlogs on the pc2000x hosts [14:45:15] I mean whenever the manteinance finishes [14:45:23] I have leave now sorry [14:45:33] ah, sorry, I tough you were telling me no to do it [14:45:40] I'll check the disk space on this hosts at the night [14:45:42] no was just joking [14:45:50] banyek: is the purge running on codfw? [14:46:02] no, that would be my queston [14:46:07] shall I set up it there too? [14:46:14] yes [14:46:15] (I think yes, but double check_ [14:46:17] OK [14:47:06] I'll do it in the evening/night whenever I can do (I'll keep my eye on the disk space) [14:47:10] but now I leave [14:47:16] and sorry about it [14:47:18] ok, please make sure it is done [14:47:22] later is fine [14:47:30] 👍 [14:47:56] So the recloning for db1099:3318 is good, we now have rows: https://phabricator.wikimedia.org/P7665 [14:47:57] jynus: marostegui banyek: feel free to resume operations. Thanks for the patience! [14:47:59] addshore: ^ [14:48:04] akosiaris: thanks [14:48:11] 10DBA, 10Lexicographical data, 10Wikidata, 10Wikidata-Campsite, and 4 others: S8 replication issues leading to rows missing during eqiad -> codfw switch (Was: "A few lexemes disappeared") - https://phabricator.wikimedia.org/T206743 (10jcrespo) We believe the most important issues (missing revisions, pages... [14:48:11] shall I edit? :) [14:48:17] yes [14:48:20] addshore: wait [14:48:26] ? 
[14:48:27] deployin fix to db1087, then edit [14:48:32] ack [14:48:42] so only codfw replication will be at risk [14:48:43] ah the lost lexemes, ok sorry [14:48:53] akosiaris: not just lexemes ;) [14:48:54] akosiaris: it is more than lexemes [14:49:01] 1 hour of wikidata data was missing [14:49:07] or around 300MB [14:49:15] i thought it was 40 mins :P [14:49:26] 55 I thought! [14:49:42] 2018-09-13 09:08:17 [14:49:44] wikidata adds 300 MB/hour? [14:49:47] 2018-09-13 09:58:26 [14:49:48] we need to send WMDE a bill ;p [14:49:53] 50mins [14:49:55] do the math [14:49:58] mark: hah [14:50:09] I can give you the # transactions [14:50:15] if you want too :-) [14:50:22] it's on the ticket right [14:51:16] jynus: we need to disconnect codfw -> eqiad replication btw (not now, not tomorrow), but on monday [14:51:32] swift window starts in 10 [14:52:10] jynus: ping me when you want an edit [14:52:47] it is about to finish [14:53:38] 10Blocked-on-schema-change, 10DBA, 10Anti-Harassment, 10Patch-For-Review: Execute the schema change for Partial Blocks - https://phabricator.wikimedia.org/T204006 (10Marostegui) We have had unexpected fires that might last long, so this might be delayed more than next week. We will do whatever we can, but... [14:54:23] addshore: ok now [14:54:29] {{doing}} [14:55:01] done [14:55:08] revid 762124226 [14:55:10] oh, I didn't fix the archive table [14:55:22] that may be an issue if undeletions happen [14:55:58] well it hasn't broken in 1 month, what are the changes of breaking after they have been mostly fixed? [14:56:07] dammit! [14:56:12] don't jinx it! [14:56:16] lol [14:56:39] btw, any ideas what on earth happened ? [14:56:44] no idea [14:56:51] we know the exact what [14:56:54] but not the why [14:57:33] what amazes me is the fact that so many transactions were gone and yet, replication didn't break [14:57:40] that I cannot really understand [14:57:48] and happening 30 days ago, most debugging options are gone [14:57:55] what would cause replication to break in that scenario normally? [14:58:10] addshore: the thing we fear now at the time? [14:58:17] just codfw breaks [14:58:26] which is inactive [14:58:30] addshore: an update on a missing row for example [14:58:36] marostegui: ack [14:58:43] addshore: replication didn't break [14:58:47] that is the problem [14:58:52] even with row based replication [14:59:01] I guess there are only a few causes where row updates would be attempted on rows that don't exist though [14:59:08] maybe no one uses wikidata really? :-) [14:59:24] addshore: deletes, I guess, which maybe are minimal [14:59:25] well, wikidata is used allot, but 99% of requests are just doing inserts [14:59:32] which is good [14:59:42] and maybe mediawiki entertily should move to that model [14:59:49] it makes backups and recoveries trivial [15:00:09] yeh, deletes would also be minimal, at least minimal for a 1 hour period, deletes happen, but the chances of that hiting the 1 hours worth of items would be unlikely [15:00:18] addshore: hopefully you saw my comment [15:00:34] there may be issues like people that changed preferences at that time [15:00:39] being reverted [15:00:47] or what links here being not perfect [15:00:48] yup [15:00:54] but those we can fix later [15:01:03] I'm going to move this ticket to the announce column on our board [15:01:12] I will remove the UBN [15:01:17] with your ok, addshore [15:01:25] yup, it sounds good [15:01:47] and we can resume next week :) [15:01:55] mark: should we announce something, or just the regular #user-notice is ok? 
[15:04:48] what do you mean? [15:04:49] announce what? [15:05:16] I was going to add a user-noticed [15:05:19] * addshore was just going to ask if something from ops / dba side would be announced for us to link to or something [15:05:27] that tag tells people to read the wikitech news [15:05:37] well, to add this to that [15:06:07] e.g. for people that noticed some pages where gone and now they are back [15:06:15] is that ok? [15:06:24] a phabricator tag? [15:06:27] yes [15:06:28] i'm not aware of that [15:06:34] yes, you can certainly add that [15:06:34] ok, I will use it [15:06:43] and then I can write an incident report [15:06:45] next week [15:06:48] indeed [15:06:58] do you think you still need me? [15:07:12] not now I think [15:07:15] I think worse case scenario, codfw breaks for the first time in 30 days [15:07:24] of course please stay available ;) [15:07:29] and that doesnt affect you [15:07:33] *users [15:07:38] not you :-) [15:07:46] I am going to finish cloning the other rc slave, and repool it once it finishes [15:07:52] I will leave that done before going to bed [15:08:03] Also will leave the fresh eqiad non partitioned copy on db1116:3318 [15:08:17] actually, it is safer now to leave all host equal? [15:08:21] addshore: I think we will mention this issue in a mail following the switchback completion (soon) [15:08:45] jynus: recentchanges will be the only ones with codfw content [15:08:50] the others will remain with the fix [15:08:52] "fix" [15:09:06] So, what do we want to do about tomorrow then? [15:09:21] so jaime will be available tomorrow [15:09:26] that means you don't need to be necessarily [15:10:02] available as in working or available as in: call me if needed? [15:10:09] he will be working [15:10:15] and he'll get that compensated soon [15:10:29] if you think that's useful you can work in the morning too, but not sure it's needed? [15:10:53] If he is going to be working, I will be working from 7 to 12 or so too [15:11:05] balazs should be there too of course [15:11:08] sorry, I undertood as "available" just in case? [15:11:16] no, working, right? [15:11:18] recloning servers? [15:11:33] sorry if I misunderstood [15:11:35] yes, that is what I said before, but I am saying now [15:11:45] we are no longer in a hurry [15:12:24] 10DBA, 10MediaWiki-Special-pages, 10Datacenter-Switchover-2018: Significant (17x) increase in time spent by updateSpecialPages.php script since datacenter switch over updating commons special pages - https://phabricator.wikimedia.org/T206592 (10Bawolff) Im away this next 1.5 weeks for a conference. I will fo... [15:12:30] fires are gone, we need a followup, but that will not be done in a day either [15:12:46] no but the earlier we get it done the less risk of course? [15:12:54] not at this point [15:13:10] to reclone the master we need to do a failover [15:18:41] alright [15:20:29] 10DBA, 10Lexicographical data, 10Wikidata, 10Wikidata-Campsite, and 5 others: S8 replication issues leading to rows missing during eqiad -> codfw switch (Was: "A few lexemes disappeared") - https://phabricator.wikimedia.org/T206743 (10jcrespo) For #user-notice, if it seems reasonable, see T206743#4658525 f... [15:20:56] 10DBA, 10Lexicographical data, 10Wikidata, 10Wikidata-Campsite, and 5 others: S8 replication issues leading to rows missing during eqiad -> codfw switch (Was: "A few lexemes disappeared") - https://phabricator.wikimedia.org/T206743 (10jcrespo) p:05Unbreak!>03High High because data came back. 
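To illustrate the failure mode discussed above ("an update on a missing row"), here is a purely hypothetical sketch, not the production schema, of why row-based replication normally breaks in this situation, and why it stays quiet as long as nothing touches the missing rows:

```
-- Hypothetical table, not part of the MediaWiki schema:
CREATE TABLE demo_table (id INT PRIMARY KEY, val VARCHAR(64));

-- On the master, with binlog_format = ROW. Suppose the replica silently
-- missed this INSERT (a gap like the one in T206743):
INSERT INTO demo_table (id, val) VALUES (42, 'written during the gap');

-- Later writes that never reference id = 42 replicate fine, so the
-- replica keeps running and nothing alerts:
INSERT INTO demo_table (id, val) VALUES (43, 'written after the gap');

-- Only when an event references the missing row does the replica's SQL
-- thread stop, typically with error 1032 (HA_ERR_KEY_NOT_FOUND):
UPDATE demo_table SET val = 'touched' WHERE id = 42;
```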
[15:20:56] marostegui: should we repool db1087 as a last action? [15:21:02] jynus: yeah [15:21:11] I will later repool db1101:3318 which I am cloning now [15:23:59] db1116 finished, maybe I can reclone db1104 today even (the candidate master) [15:24:05] jynus: ^ what do you think? [15:24:16] I think we really shouls stop [15:24:23] you will be able to do 1 [15:24:26] but that may break [15:24:33] for the same resons than codfw [15:24:36] yeah [15:24:38] we should do a lot at the same time [15:24:47] but not now in a rush [15:25:00] ok, I will leave db1099 and db1101 (rc) [15:25:03] we will keep the status quo until we can clone several at the same time [15:25:06] and on monday we can continue [15:25:15] help me with the revert [15:25:21] which hosts should be pooled now? [15:25:29] db1087 [15:25:31] db1101 or db1099? [15:25:37] no, among those 2 [15:25:41] db1099 [15:25:43] db1101 out [15:25:45] db1099 in [15:26:56] double check the review, please [15:27:24] https://gerrit.wikimedia.org/r/466653 [15:27:38] checking [15:28:11] leave db1092 with weight 200 [15:28:16] ok [15:28:25] the rest is ok [15:28:53] while I deploy, can you send an email to not touch anything [15:29:00] just keep an eye on the parsercaches? [15:29:08] to bany*k [15:29:10] yep [15:29:12] will CC you [15:29:15] thanks [15:29:45] mark: you want to be CC'ed on that? [15:29:49] or no need? [15:30:09] I *really* think it is safer to have all hosts with the same content [15:30:39] yeah [15:30:52] always CC in doubt [15:30:59] on monday let's do a bunch of the quickly, as we have now a copy in eqiad [15:31:01] < meeting [15:31:06] we will be able to do 2-3 quickly enough [15:31:10] mark: great, thanks [15:31:12] yep [15:31:15] marostegui: we can even copy [15:31:25] and later depool, change ,quickly [15:31:30] to avoid potential drifts [15:31:42] yep, if we have enough space [15:31:44] and then we have to think about the master [15:31:51] we have to failover it [15:32:00] as soon as all the hosts are "good" [15:32:01] maybe, we will see [15:32:13] not something that can be solved in a day [15:32:13] I think we must do [15:32:15] no [15:32:25] but as soon as we clone the other hosts, we probably should [15:32:27] buuuuut [15:32:37] we need to grab another host to replace that old host, and I thought about db1118 [15:32:45] And we need to make sure it goes to row B [15:32:51] so we kill another master from row D [15:32:57] as always, many things to do before [15:32:57] yep, could be [15:33:05] I need the time to think [15:33:36] it is ok to rush the fire fix but we shouln't rush on the long term fix [15:34:10] jynus: another case of: https://media1.tenor.com/images/34584055d25cd4fa2db8521257bcfc05/tenor.gif [15:34:42] well, all of the things that failed, except he unknown root cause [15:34:49] we have a fix for but no time [15:35:07] automtic checking, running out of space on parsercaches, more alerts [15:38:24] so db1116 is now replicating under eqiad master [15:39:06] 10DBA, 10MediaWiki-Database, 10Operations, 10Datacenter-Switchover-2018: parsercache used disk space increase - https://phabricator.wikimedia.org/T206740 (10jcrespo) Forgetting the codfw -> eqiad replication was the most likely cause of overload on the application servers (and on External storage hosts). [15:43:45] jynus: are we still purging on eqiad? 
[15:43:51] pc keys [15:44:01] (I am summing up things on the email) [15:44:03] yes, I pined alex to keep an eye [15:44:07] cool [15:44:10] but you can sumamrize on mail [15:44:20] still running on mwmaint1002 [15:44:25] great [15:44:32] 22% [15:44:40] probably deleting the whole old ones [15:44:58] may create some lag on codfw parsercache, but we don't care about taht [15:45:13] yep [15:47:50] the deploy worked well, no errors [15:48:07] I am going to go offline [15:48:12] see you on monday [15:49:12] thanks for all the great work jynus [15:49:23] I am going to go offline as soon as I send the email and then back online to finish up with db1101 [15:50:45] Thanks for all of your work :D [15:53:57] addshore: thanks for all your help and for discovering this [15:54:23] I guess you didn't get a chance to look at the dispatch ticket at all today? ;) [15:55:01] hehe nope :) [15:55:08] but it is fast now, no? [15:55:10] as we are in eqiad? [16:05:26] marostegui: no it was fast in codfw [16:05:28] slow in eqiad [16:05:42] and it seems to be down to that 1 query [18:15:41] 10DBA, 10MediaWiki-Database, 10Operations, 10Datacenter-Switchover-2018: parsercache used disk space increase - https://phabricator.wikimedia.org/T206740 (10akosiaris) >>! In T206740#4658660, @jcrespo wrote: > Forgetting the codfw -> eqiad replication was the most likely cause of overload on the applicatio... [18:17:11] 10DBA, 10MediaWiki-Database, 10Operations, 10Datacenter-Switchover-2018: parsercache used disk space increase - https://phabricator.wikimedia.org/T206740 (10Marostegui) >>! In T206740#4659202, @akosiaris wrote: >>>! In T206740#4658660, @jcrespo wrote: >> Forgetting the codfw -> eqiad replication was the mo... [23:17:50] 10DBA, 10JADE, 10Operations, 10MW-1.32-notes (WMF-deploy-2018-09-25 (1.32.0-wmf.23)), and 3 others: Write our anticipated "phase two" schemas and submit for review - https://phabricator.wikimedia.org/T202596 (10awight) @Marostegui These are the proposed indexes, if you want to discuss something concrete: h...