[01:53:41] 10DBA, 10Wikimedia-Site-requests: Global rename of Tarawa1943 → Ontzak: supervision needed - https://phabricator.wikimedia.org/T206730 (101997kB) p:05Triage>03Lowest [01:58:20] 10DBA, 10Wikimedia-Site-requests: Global rename of Tarawa1943 → Ontzak: supervision needed - https://phabricator.wikimedia.org/T206730 (101997kB) [02:00:45] 10DBA, 10Wikimedia-Site-requests: Global rename of Tarawa1943 → Ontzak: supervision needed - https://phabricator.wikimedia.org/T206730 (101997kB) [02:02:15] 10DBA, 10Wikimedia-Site-requests: Global rename of Tarawa1943 → Ontzak: supervision needed - https://phabricator.wikimedia.org/T206730 (101997kB) fyi: This needs to be done after the 1 month standard waiting period for usurpation; I just created the task in advance. [04:09:16] 10Blocked-on-schema-change, 10DBA, 10Anti-Harassment, 10Patch-For-Review: Execute the schema change for Partial Blocks - https://phabricator.wikimedia.org/T204006 (10dbarratt) @Marostegui was the failover to eqiad completed? [04:58:46] 10Blocked-on-schema-change, 10DBA, 10Anti-Harassment, 10Patch-For-Review: Execute the schema change for Partial Blocks - https://phabricator.wikimedia.org/T204006 (10Marostegui) Yes, it was done correctly. We will see if we can finish this schema change by the end of next week [04:59:49] 10DBA, 10MediaWiki-Special-pages, 10Datacenter-Switchover-2018: Significant (17x) increase in time spent by updateSpecialPages.php script since datacenter switch over updating commons special pages - https://phabricator.wikimedia.org/T206592 (10Marostegui) So we are back in eqiad. Can you try the script and... [05:07:21] 10DBA, 10Operations, 10ops-eqiad: Degraded RAID on db1067 - https://phabricator.wikimedia.org/T206500 (10Marostegui) 05Open>03Resolved So after replacing the disk 3 times yesterday evening...we finally got this fixed! Thanks a lot Chris! ``` Number of Virtual Disks: 1 Virtual Drive: 0 (Target Id: 0) Na... [05:21:47] 10DBA, 10Operations, 10Epic, 10Patch-For-Review: DB meta task for next DC failover issues - https://phabricator.wikimedia.org/T189107 (10Marostegui) [05:21:49] 10DBA, 10Operations, 10cloud-services-team, 10wikitech.wikimedia.org, and 2 others: Move some wikis to s5 - https://phabricator.wikimedia.org/T184805 (10Marostegui) [05:23:03] 10DBA, 10Operations, 10Epic, 10Patch-For-Review: DB meta task for next DC failover issues - https://phabricator.wikimedia.org/T189107 (10Marostegui) 05Open>03Resolved a:03Marostegui All the tasks we scheduled to do whilst eqiad was passive were done! We also included T184805 as a last minute task,... [05:28:09] 10DBA, 10MediaWiki-General-or-Unknown, 10Operations, 10Wikidata, and 4 others: Investigate decrease in wikidata dispatch times due to eqiad -> codfw DC switch - https://phabricator.wikimedia.org/T205865 (10Marostegui) I have run an `analyze table` on both db1109 and db2083 and things are similar now: ```...
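For context, the `analyze table` step mentioned in that truncated comment amounts to something like the sketch below. This is illustrative only: the table name is a placeholder because the comment is cut off, and the syntax assumes MariaDB with InnoDB persistent statistics.

```
-- Recompute the statistics the optimizer uses for a table whose index
-- cardinality estimates have drifted (table name is a placeholder):
ANALYZE TABLE wikidatawiki.some_table;

-- The resulting estimates can then be compared across hosts (e.g. db1109
-- in eqiad vs db2083 in codfw) by running this on each of them:
SELECT index_name, seq_in_index, column_name, cardinality
FROM information_schema.statistics
WHERE table_schema = 'wikidatawiki'
  AND table_name = 'some_table'
ORDER BY index_name, seq_in_index;
```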
[05:31:09] I have deployed: https://gerrit.wikimedia.org/r/#/c/operations/dns/+/458790/ [05:42:56] 10DBA, 10MediaWiki-Database: parsercache used disk space increase - https://phabricator.wikimedia.org/T206740 (10Marostegui) [05:43:12] 10DBA, 10MediaWiki-Database: parsercache used disk space increase - https://phabricator.wikimedia.org/T206740 (10Marostegui) p:05Triage>03Normal [05:45:57] 10DBA, 10MediaWiki-API, 10MediaWiki-Database: prop=revisions API timing out for a specific user and pages they edited - https://phabricator.wikimedia.org/T197486 (10Marostegui) As discussed via email, this bug was fixed on 10.1.37: https://jira.mariadb.org/browse/MDEV-17155 which could end up "fixing" this i... [05:48:13] 10DBA, 10MediaWiki-Database: parsercache used disk space increase - https://phabricator.wikimedia.org/T206740 (10Marostegui) [05:48:43] 10DBA, 10MediaWiki-Database: parsercache used disk space increase - https://phabricator.wikimedia.org/T206740 (10Marostegui) Edited the task description to remove some graphs that could be misleading (they were server reboots due to upgrades) [06:10:52] 10DBA, 10Cloud-Services, 10User-Banyek: Prepare and check storage layer for yuewiktionary - https://phabricator.wikimedia.org/T205714 (10Marostegui) 05stalled>03Open a:05Marostegui>03None This can now proceed whenver you guys want. [06:11:26] 10DBA, 10Cloud-Services, 10User-Banyek: Prepare and check storage layer for liwikinews - https://phabricator.wikimedia.org/T205713 (10Marostegui) 05stalled>03Open a:05Marostegui>03None This can now proceed whenver you guys want. [06:35:39] 10DBA, 10MediaWiki-extensions-WikibaseRepository, 10Wikidata, 10MW-1.32-notes (WMF-deploy-2018-05-22 (1.32.0-wmf.5)), and 2 others: Clean up indexes of wb_terms table - https://phabricator.wikimedia.org/T194273 (10Marostegui) Is there any point on having this task if at some point wb_terms is going to be k... [07:02:51] morning marostegui ! [07:02:57] o/ [07:03:02] re the wikidata dispatching ticket did re repool db1109 yet? [07:03:08] was db1109 a slave? [07:03:33] yes [07:03:38] it is a slave [07:03:47] I depooled it to run the analyze table only [07:03:56] why? [07:04:54] aaaah so the Cardinality was essentialy being affected while it was in use ? but deplooled it returns to the same as codfw? [07:05:09] addshore: No, the analyze table did it [07:05:15] oh [07:05:18] As it recalculates all the table stats [07:05:31] The depooling was done because it would generate lag [07:05:38] gotcha [07:18:25] This is worrying: https://phabricator.wikimedia.org/T206740 [07:18:38] banyek ^ you might need to be careful tomorrow [07:20:49] that is the high load enwiki experienced [07:21:19] So it should stabilize soon? [07:22:33] thanks for the notice! [07:22:52] I think I have to find a feew infos in wikitech [07:23:05] marostegui: not necessarilly [07:31:20] marostegui: re the dispatching ticket, now that we know it returned to the same old slowness of eqiad im going to add some more tracking for the suspect parts and see what is taking all the time, then we will see if it is db related or redis related, or something else [07:33:04] do we want to keep this open? https://gerrit.wikimedia.org/r/#/c/operations/mediawiki-config/+/465634/ [07:37:08] we are quickly down from 20% to 14% [07:37:14] on pc1 [07:37:27] since :28 [07:48:41] yeah, they are eating 10GB every 30 minutes or so [07:50:09] 10DBA, 10MediaWiki-Database: parsercache used disk space increase - https://phabricator.wikimedia.org/T206740 (10Marostegui) p:05Normal>03Unbreak! 
This is increasing quite rapidly [07:50:36] 10DBA, 10MediaWiki-Database, 10Operations: parsercache used disk space increase - https://phabricator.wikimedia.org/T206740 (10Marostegui) [07:51:03] all the PC replicas are up-to-date [07:55:04] I'll purge the binary logs on all the PC hosts (pc1004, 1005, 1006, 2004, 2005, 2006) with `PURGE BINARY LOGS BEFORE (NOW() - INTERVAL 1 HOUR);` [07:55:26] at 10am CET [07:55:33] (please mention that on the other channel, so we can group all the conversations on a single place) [07:56:31] 10DBA, 10MediaWiki-General-or-Unknown, 10Operations, 10Wikidata, and 4 others: Investigate decrease in wikidata dispatch times due to eqiad -> codfw DC switch - https://phabricator.wikimedia.org/T205865 (10Addshore) Looking at a couple of minutes of extra timing data it looks like this is down to the selec... [07:56:36] marostegui: I have narrowed it down to 1 query, or so it would seem https://phabricator.wikimedia.org/T205865#4657657 [08:03:54] !log running /usr/local/bin/mwscript purgeParserCache.php --wiki=aawiki --age=1900800 --msleep 0 [08:03:55] jynus: Not expecting to hear !log here [08:05:44] addshore: we are dealing with a fire, sorry [08:06:33] marostegui: ack. I'll watch from afar and come back later :) [08:12:28] 10DBA, 10Wikimedia-Site-requests: Global rename of Tarawa1943 → Ontzak: supervision needed - https://phabricator.wikimedia.org/T206730 (10MarcoAurelio) 05Open>03stalled Stalling given that this cannot be actioned until the monthly period has passed. Thanks. [08:16:34] https://www.irccloud.com/pastebin/W1GfdPEC/ [08:16:37] jesus [08:17:35] cool, now rotate them [08:18:06] maybe 10MB it too low, you can up that, you are in charge [08:18:14] but get us space! [08:20:19] !log setting up replication from pc2004 -> pc1004 [08:20:19] jynus: Not expecting to hear !log here [08:29:28] addshore: sure, but, the last update doesn't give much idea on what you want next, as the conditions are kinda the key part of the query :-) [08:29:57] :D I can make the conditions clearer :) [08:30:02] im a meeting right now though :( [08:37:49] marostegui: I added some more details to https://phabricator.wikimedia.org/T205865#4657657 [08:38:41] will check later, still putting fires out [08:38:45] ack! [08:50:48] wow, my laptop just arrived [08:57:03] 10DBA, 10Lexicographical data, 10Wikidata, 10Datacenter-Switchover-2018, 10User-Addshore: A few lexemes disappeared - https://phabricator.wikimedia.org/T206743 (10Addshore) I picked a slave for s8 in each DC. ``` addshore@deploy1001:~$ sql wikidatawiki --host db1104 MariaDB [wikidatawiki]> select * from... [08:57:25] marostegui jynus ^^ i hate to introduce another bit of fun to your day [08:57:35] marostegui jynus ^^ i hate to introduce another bit of fun to your day/ [08:57:48] But it appears that we have some data missing for wikidatawiki in eqiad? [08:58:38] where is that stored? [08:58:48] in a "content"? [08:58:48] s8, wikidatawiki, page table [08:59:07] no, page don't really have content [08:59:16] is there a full page missing? 
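Stepping back to the parsercache binlog purge planned above at 07:55: a minimal sketch of that procedure, assuming MariaDB and that the replicas have already been confirmed to be up to date, would be:

```
-- On each parsercache master (pc1004-1006 in eqiad, pc2004-2006 in codfw).
-- List the attached replicas; lag is then checked on each replica with
-- SHOW SLAVE STATUS (Seconds_Behind_Master should be 0 before purging):
SHOW SLAVE HOSTS;

-- Discard binary logs older than one hour, as proposed in the log above:
PURGE BINARY LOGS BEFORE (NOW() - INTERVAL 1 HOUR);

-- And check how much space the remaining binlogs still take:
SHOW BINARY LOGS;
```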
[08:59:23] jynus: see https://phabricator.wikimedia.org/T206743#4657812 [08:59:28] or a page-like object [08:59:30] the row in the page table is missing, [08:59:52] possibly rows in the revision table too *checks* [09:00:01] that row is missing on the whole eqiad dc indeed [09:00:10] then you have a replication problem, wikidata may be doing unsafe statements [09:00:24] or mw, one of the 2 [09:00:38] at a guess this is due to the mcr related refactorings [09:01:19] yup, codfw also has the revision rows for those entities, but not eqiad [09:01:24] but wikidata didn't enable that? [09:01:31] when was the page created? [09:01:55] 20180913091819 [09:02:06] that is after the dc failover [09:02:07] according to the creation revision row in codfw [09:02:13] ? [09:03:22] and it wasn't deleted? [09:03:40] because the only issues I kneow about were about the archive table [09:03:45] *checks* but no i don't think so [09:04:01] check all logs you can find and update the task if you can [09:04:06] I will do archeology [09:04:07] 10DBA, 10Lexicographical data, 10Wikidata, 10Wikidata-Campsite, and 3 others: A few lexemes disappeared - https://phabricator.wikimedia.org/T206743 (10Addshore) [09:04:41] missing full transactions is very weird [09:04:50] yeah [09:05:16] addshore: is just that one? [09:05:16] I mean, its not even 1 trnsation [09:05:18] I and I could see it during a switch [09:05:20] that page has multiple revisions [09:05:27] but not one day later? [09:05:38] 0 revisions [09:05:39] *20 [09:06:02] so that should have been multiple transactions [09:06:02] first timestamp was 20180913091819 last one was 20180913092845 [09:06:16] the rest are there? [09:06:20] or it is just for that page? [09:06:27] on multiple times [09:06:34] so it looks like 3 pages are missing all entries in the page tables and revision tables [09:06:45] and that made replication not break [09:06:54] even on row based statement [09:06:58] that makes no sense [09:07:04] * addshore is very confused [09:07:05] I think that was deleted [09:07:10] yeah, I was right now thingking about the row replication on labs [09:07:10] it the only explanation [09:07:14] that would have broken [09:07:17] unless we filter page [09:07:18] do we? [09:07:19] right, let me check for deletion logs on both eqiad and codfw [09:07:24] no no [09:07:28] I mean physically deleted [09:07:32] oh...... [09:07:34] :| [09:07:49] someone with access removing, possibly accidentaly the row on one dc only [09:07:57] which makes no sense also [09:08:11] because if it is present on one full dc [09:08:13] let me get a list of all rows and ids that appear to be missing [09:08:17] but not on other [09:08:24] addshore: please do [09:08:27] we have backups [09:08:28] addshore: I am checking SAL for 13th Sept [09:08:30] Just in case [09:08:32] so no data loss should happen [09:08:41] we can compare backups at the time [09:09:10] the thing is, unless you have super [09:09:23] you cannot delete it from one dc only [09:09:35] e.g. 
imagine a maintenance script, for any reason, goes bad [09:09:49] it could delete all rows in all servers [09:09:53] but not on a single dc [09:10:25] do you see why that is very weird [09:10:37] So the only thing that happened around that time is a schema change deployed on eqiad master that day but has nothing to do with the table: https://phabricator.wikimedia.org/T89737 [09:10:39] I can see that happening on eqiad while it was passive [09:10:52] has nothing to do with page table or its content [09:10:57] and replication unconnected [09:11:04] but a very specific row? [09:12:06] which tables were altered? [09:12:13] MCR ones maybe? [09:12:16] no [09:12:18] it is a pretty old ticket [09:12:29] maybe a reboot [09:12:33] bot_passwords, change_tag, page_restrictions, tag_summary, user_newtalk, user_properties [09:12:44] that miraculously removed something [09:12:48] or setting up gtid? [09:13:01] let me see when we set up gtid [09:13:24] we should still have binary logs of the event on the masters [09:13:29] as it was replicated [09:13:45] we set up gtid AFTER that [09:13:55] at around 9:44 is the SAL entry [09:14:24] 2018-09-13 09:18:19 is a very narrow window [09:15:21] addshore: that is very strange, but we should have the tools to get to this [09:15:49] and to recover any loss [09:16:09] although the fact that they are a very specific type of pages [09:16:16] L* right? [09:16:51] 10DBA, 10Lexicographical data, 10Wikidata, 10Wikidata-Campsite, and 3 others: A few lexemes disappeared - https://phabricator.wikimedia.org/T206743 (10Addshore) From the reported pages in the ticket **page table** ``` CODFW: MariaDB [wikidatawiki]> select page_id from page where ( page_title = 'L20540' O... [09:16:59] okay, all the ids i found are at https://phabricator.wikimedia.org/T206743#4657831 [09:17:02] please tell users or other devels not to recreate or recover the rows [09:17:15] as that will make things more complicated [09:17:27] I am doing a compare.py between codfw vslow and eqiad vslow [09:17:45] narrow time range of 20180913091819 and 20180913095832 so far [09:17:49] marostegui: thanks [09:18:11] I'm going to look at the other things that happened around the time and see if it stretches to other pages [09:18:13] addshore: thanks [09:18:20] I can get you all activity [09:18:23] through the binlog [09:18:27] in fact, there is a revision for 20180913095832 for one of those pages in eqiad! [09:18:28] don't worry [09:18:32] jynus: lvoely [09:18:33] Not very accurate, I would need to stop replication [09:18:34] *lovely [09:18:49] marostegui: it does the checks on a single transaction [09:18:58] 10DBA, 10Lexicographical data, 10Wikidata, 10Wikidata-Campsite, and 3 others: A few lexemes disappeared - https://phabricator.wikimedia.org/T206743 (10Addshore) p:05High>03Unbreak! [09:18:59] so it should be mostly fine [09:19:10] then we are full of differences [09:19:16] 10DBA, 10MediaWiki-Database, 10Operations: parsercache used disk space increase - https://phabricator.wikimedia.org/T206740 (10Banyek) The binary logs were purged on pc1004, pc1005, pc1006. Also the binlog_max_size was set to 10M and the hosts now have this running in a screen: ``` while true; do echo "p... [09:19:32] I am going to investigate some of the random ones detected by compare [09:20:39] we need to save the binlogs [09:20:49] as they are deleted after 30 days [09:23:27] interesting, the binlog starts at exactly that time [09:23:42] over 12,000 revision rows missing it looks like? [09:23:43] ah, no [09:23:49] what? [09:23:50] 12k??
[09:23:53] SELECT rev_id, page_title from revision, page where rev_id > 745455836 and rev_id < 745468727 and page_id = rev_page; [09:23:57] try that in both DCs [09:24:15] also I spot this [09:24:15] | 745468464 | WikiProject_every_politician/France/data/Assembly/15th_(bio) | [09:24:25] so it is not wikidata / wikibase specific it would seem? [09:24:43] what wikis are you checking? [09:24:47] just wikidatawiki [09:25:12] how is this even possible? [09:26:37] as long as my query is right? [09:26:37] 10DBA, 10MediaWiki-Database, 10Operations: parsercache used disk space increase - https://phabricator.wikimedia.org/T206740 (10Banyek) [09:27:04] 10DBA, 10MediaWiki-Database, 10Operations: parsercache used disk space increase - https://phabricator.wikimedia.org/T206740 (10Banyek) [09:27:10] it could be a bug [09:27:16] on the query [09:27:19] So I am trying just a simple query that reported differences [09:27:20] which is [09:27:26] select page_touched from page where page_id=2408366 [09:27:27] 10DBA, 10MediaWiki-Database, 10Operations: parsercache used disk space increase - https://phabricator.wikimedia.org/T206740 (10Banyek) [09:27:28] In eqiad we have [09:27:40] page_touched: 20180805140619 and in codfw we have: page_touched: 20180913092709 [09:27:47] :/ [09:28:09] 10DBA, 10Lexicographical data, 10Wikidata, 10Wikidata-Campsite, and 3 others: A few lexemes disappeared - https://phabricator.wikimedia.org/T206743 (10Addshore) It looks like it is more than just a few rows: CODFW: ``` MariaDB [wikidatawiki]> SELECT count(*) from revision, page where rev_id > 745455836 a... [09:28:43] I'm going to see if I can find anything in the same time range on other dbs to see if it is just the wikidatawiki db or not [09:29:20] could this be unsafe statements? [09:29:53] this makes no sense [09:30:02] 10DBA, 10Lexicographical data, 10Multi-Content-Revisions, 10Wikidata, and 4 others: A few lexemes disappeared - https://phabricator.wikimedia.org/T206743 (10Addshore) Tagging #multi-content-revisions as a suspect [09:30:47] addshore: can you try to find something the other way around, that got written into eqiad yesterday after the failover and never arrived to codfw? that would narrow things down to unsafe statements I think [09:31:03] marostegui: I can do [09:31:06] / try [09:31:31] thanks [09:32:51] marostegui: should I be looking close to the time after the switchover? [09:33:24] yeah [09:33:33] after 15:00 UTC I would say [09:33:52] okay! [09:34:14] so this is bad [09:34:40] :/ [09:34:40] 10DBA, 10MediaWiki-Database, 10Operations: parsercache used disk space increase - https://phabricator.wikimedia.org/T206740 (10akosiaris) [09:34:51] at 2018-09-13T09:18 the only thing being replicated is heartbeats [09:34:58] which means replication wasn't working [09:35:00] ???? [09:35:08] ouch [09:35:21] just for s8? or? [09:35:23] i mean, it could have a delay [09:35:30] replication stopped [09:35:32] that is normal [09:35:43] we stopped replication many times [09:35:59] now the issue is whre it started replicating from [09:37:05] so it looks like it was a replication issue rather than some transaction issue? 
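For reference, the cross-DC check being run here boils down to executing the same bounded query on a replica in each datacenter and comparing the results; a sketch using the rev_id bounds quoted above:

```
-- Run once on an eqiad replica and once on a codfw replica, then diff the counts:
SELECT COUNT(*) AS revisions_in_range
FROM revision
JOIN page ON page_id = rev_page
WHERE rev_id > 745455836
  AND rev_id < 745468727;
```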
[09:37:20] I see starting to get events at 9:58 [09:37:38] but need to know for what real timestamp [09:37:55] from 20180913095825 [09:37:58] that fits with the revision from my query earlier that is in eqiad, 20180913095832 [09:38:00] so that is ok [09:38:08] just replication stopped for a while [09:38:12] which we did many times [09:39:46] we have no logs after 17th sept [09:39:47] damn [09:39:48] yep, I am seeing recentchanges from 20180913095825 being applied at 180913 9:58:26 [09:40:31] so just a delay [09:40:42] which could be an upgrade [09:40:42] let me get the first and last timestamp since then [09:40:47] yes, that is normal [09:41:25] but if a 2018-09-13 09:18:19 update is missing [09:42:01] wait [09:42:02] but it is missing or just delayed? [09:42:18] the server was rebooted the 17th [09:42:26] But I am sure it was rebooted maybe before too [09:42:31] can you check revisions earlier than 20180913095825 ? [09:42:46] on the live dbs [09:42:51] and after [09:42:55] addshore: ^ [09:42:58] I am expecting something like [09:43:06] have all revisions after [09:43:12] and missing before [09:43:53] 10DBA, 10MediaWiki-General-or-Unknown, 10Operations, 10Wikidata, and 4 others: Investigate decrease in wikidata dispatch times due to eqiad -> codfw DC switch - https://phabricator.wikimedia.org/T205865 (10hoo) Thanks @Marostegui but this sadly didn't help. Do you have any other ideas what could cause thes... [09:44:12] *&looks* [09:44:13] GTID enablement was done at 09:44 that same day , on all eqiad masters [09:44:26] rev_id=745468910 is the first one being replicated [09:44:32] which some may be missing [09:44:35] before [09:44:39] can you confirm that? [09:45:57] eqiad goes from 745452473 to 745468910 [09:45:58] 10DBA, 10Lexicographical data, 10Multi-Content-Revisions, 10Wikidata, and 4 others: A few lexemes disappeared - https://phabricator.wikimedia.org/T206743 (10Marostegui) So for the record: we are live debugging with @addshore on IRC [09:46:26] addshore: can those account for all missing rows? [09:46:35] which is a timestamp of 20180913090816 to 20180913095825 [09:46:35] *checks* [09:46:37] or are there more [09:46:44] at the moment [09:47:06] jynus: yes, everything I have spotted with what was reported in the ticket [09:47:12] if it is only that, that is a localized replication failure [09:47:22] and a) it is recoverable and it is easy [09:47:29] but I still don't know the reasons [09:47:37] I still don't get how is that possible [09:47:43] for some reason, replication jumped ahead [09:47:50] on a master, which we treat like [09:47:55] in a golden plate [09:48:19] 10DBA, 10Lexicographical data, 10Wikidata, 10Wikidata-Campsite, and 3 others: A few lexemes disappeared - https://phabricator.wikimedia.org/T206743 (10Addshore) and removing MCR as it looks like this was some issue with replication [09:48:22] also, the good news is that anything that has a uniwue id [09:48:32] can be reinserted [09:48:43] updates are more complicated [09:48:46] 10DBA, 10MediaWiki-General-or-Unknown, 10Operations, 10Wikidata, and 4 others: Investigate decrease in wikidata dispatch times due to eqiad -> codfw DC switch - https://phabricator.wikimedia.org/T205865 (10hoo) ```wikiadmin@db1109(wikidatawiki)> SELECT * FROM information_schema.tables WHERE table_name = 'w... 
[09:49:21] let me translate that into binlogs [09:49:29] so, there was lag at that time (probably because of the schema change): https://grafana.wikimedia.org/dashboard/db/mysql?orgId=1&var-dc=eqiad%20prometheus%2Fops&var-server=db1071&var-port=9104&from=1536821324794&to=1536832784000 [09:49:33] lag on the master [09:49:43] for more exact identification (transactin size) [09:50:08] marostegui: can you check logs at that time [09:50:14] operation ones [09:50:24] yeah, as I said, there was a schema change on going [09:50:28] but not touching any of those tables [09:50:40] yeah, I just want to know if someting was rebooted [09:50:46] or an alert went off [09:50:49] or soemthing [09:50:51] No, no reboots [09:50:55] just gtid enabled at 09:44 [09:51:00] ok [09:51:02] on eqiad masters [09:51:07] that could be it [09:51:14] but that is AFTER this issue [09:51:18] maybe gtid got autopositioned [09:51:24] wrongly [09:51:32] but we enabled it after this happened [09:51:33] or some weird thing [09:51:45] so at 09:13 gtid wasn't enabled [09:51:56] this is already weird, we need to think possiblities [09:52:13] that are improbable, like bugs [09:53:31] for now I am saving the binlogs [09:53:36] to be able to recover [09:53:49] let's check also the systemd logs [09:53:53] I did [09:53:57] it starts the 17th [09:53:59] and, no start? [09:54:00] when we rebooted it :( [09:54:01] ah [09:54:02] sorry [09:54:22] I have not been able to find anything older than the 17th [09:54:36] we have the binlogs :-) [09:54:59] yeah, but wanted to check mysql replication on the logs [09:55:18] so a petition- can you and addshore check for other pontential losses? [09:55:34] on wikidata gaps on revision are quite noticeable [09:55:50] can you try to find others, to see if it is a one time thing or a reapeated thing [09:55:57] that is my biggest worry right now [09:56:21] aka a query to find the largest gap on the revision timestamp or the revision id? [09:56:48] while I keep checking the binlogs? [09:57:25] hmmmmmmmm * thinks about how to write such a query* :P [09:57:28] jynus: I will jump into the meeting to give m*rk an update on this and on the purchases, feel free to skip it, up to you [09:57:36] ok, thanks [09:57:43] I can join, too, and hear [09:57:52] up yo sou [09:57:54] whatever you prefer [09:58:05] jynus: I also have a meeting in 3 mins, but will be back shortly after ~10 / 15 mins [09:59:36] so per the graphs, this server was only rebooted the 17th [10:01:25] banyek: are you joining the meeting? [10:01:32] oh, yes [10:20:58] back [10:24:21] addshore: can you check if this happened enwiki? [10:24:40] marostegui: can I check on dewiki (its a little bit smaller) ;) [10:24:48] or would enwiki be better? [10:25:04] addshore: sure, check dewiki [10:31:30] So [10:31:33] MariaDB [dewiki]> select * from revision where rev_id = 180868940; [10:31:37] returns a row in doth DCs [10:31:45] timestamp is 20180913093000 [10:32:00] which is within the range we are looking at of 20180913090816 to 20180913095825 [10:32:06] thanks [10:32:21] so just s8? [10:32:40] if you could check frwiki and jawiki? [10:32:41] (that s6) [10:32:46] just to have another check [10:33:09] can do [10:36:57] for fr wiki [10:37:02] thanks [10:37:09] select * from revision where rev_id = 152158259; returns in both DCs [10:37:23] timestamp of 20180913093002 [10:38:09] what query can i run to find out the hostname of the db im on? 
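One way to write the gap-finding query jynus asks for above, sketched for illustration only: the timestamp bounds are an assumption to keep it cheap, and on a table the size of wikidatawiki.revision the correlated subquery would otherwise be far too slow.

```
-- For every revision in the window, find the next existing rev_id;
-- large jumps are candidate gaps worth investigating:
SELECT gap_starts_after,
       next_rev_id,
       next_rev_id - gap_starts_after - 1 AS missing_ids
FROM (
    SELECT r.rev_id AS gap_starts_after,
           (SELECT MIN(r2.rev_id)
              FROM revision r2
             WHERE r2.rev_id > r.rev_id) AS next_rev_id
      FROM revision r
     WHERE r.rev_timestamp BETWEEN '20180913000000' AND '20180914000000'
) AS gaps
WHERE next_rev_id - gap_starts_after > 1
ORDER BY missing_ids DESC
LIMIT 10;
```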
[10:38:19] select @@hotsname; [10:38:46] yup, good, just checking i was actually in the right dcs [10:38:52] i'll check jawiki now [10:39:25] <3 [10:41:07] jawiki also looks fine with select * from revision where rev_id = 69914221; and a timestamp of 20180913093006 [10:41:19] thanks [10:41:41] so just s8? :) [10:42:09] and it is not limited to revision and page tables right? it would be all tables as it was a replication issue? [10:42:27] yep [10:42:33] we are discussing things on the meeting [10:42:42] we will end up rebuilding all eqiad hosts from codfw [10:42:46] ack, ping me if you need me [10:42:53] thank you [10:43:00] shall i assign the ticket to one of you? [10:43:14] sure, assign it to me [10:43:20] I will update with what we have decided [10:44:12] 10DBA, 10Lexicographical data, 10Wikidata, 10Wikidata-Campsite, and 3 others: S8 replication issues leading to rows missing during eqiad -> codfw switch (Was: "A few lexemes disappeared") - https://phabricator.wikimedia.org/T206743 (10Addshore) a:05Addshore>03Marostegui [10:44:29] :) [11:01:54] I set expire_logs_days to 60 on db2045 - codfw master [11:02:43] thanks [11:04:30] ```set global expire_logs_days=60;``` [11:06:34] there are 147 binlog files on the codfw host, and they take ~150Gb of space [11:06:43] shall I simply copy them with scp/ [11:06:44] ? [11:07:38] banyek: copy them to dbstore1001 using transfer.py [11:07:50] I made a local copy only to /srv/tmp [11:07:56] of the oldest ones [11:08:14] I copy that directory then [11:09:14] it is the older ones, but they contain all the info we need [11:09:21] I mean we just need those (the oldest) ones to be copied, right? [11:09:27] he he yes [11:09:28] ah, I just wanted to ask this :) [11:09:28] ok [11:09:35] gr8 [11:13:05] db1071-bin.007238:795791989 2018-09-13 9:08:17 is the last replicated event [11:14:36] jynus: marostegui: I'm about to run `/home/jynus/wmfmariadbpy/wmfmariadbpy/transfer.py db2045.codfw.wmnet:/srv/tmp dbstore1001.eqiad.wmnet:/srv/s8.tmp` on neodymium [11:14:39] and db1071-bin.007238:796727644 2018-09-13 09:58:26 the restart of events [11:14:54] banyek: ok [11:15:50] s8.tmp -> tmp.s8 [11:17:16] happening now [11:18:21] 10DBA, 10Lexicographical data, 10Wikidata, 10Wikidata-Campsite, and 3 others: S8 replication issues leading to rows missing during eqiad -> codfw switch (Was: "A few lexemes disappeared") - https://phabricator.wikimedia.org/T206743 (10Marostegui) What we know for now is that there was a gap in replication... [11:18:30] I have started the transfer from db2085 to db1099 and also updated the ticket ^ [11:18:37] thanks [11:23:32] I am going to go and grab food [11:23:37] I will not touch db2085 [11:23:41] I will do, too [11:23:49] yeah, db2085 and db1099 are down [11:23:53] but I may start filling in the others [11:24:01] be careful with the sanitarium master [11:24:11] no, only one host at a time [11:24:13] we'll need to see how we rebuild that one [11:24:17] and not touching sanitarium [11:24:19] but that is for later to discuss [11:24:20] cool [11:24:29] lunch time then! :) [11:24:34] we can copy it from codfw [11:24:37] bye [11:26:49] I copied the binlogs [11:31:30] so I have the sql with the missing transactions, it is only 225MB [11:41:30] sounds small [11:54:24] I know you are at middle of this mess, just letting you know I deployed the change to read from the ct_tag_id column instead of ct_tag (change_tag normalization). This changes queries of basically everything (RC/Watchlist/History/API modules/....) 
for mediawiki.org/testwikis/some small wikis for now. If anything pops up performance-wise, let me know: https://phabricator.wikimedia.org/T194164 [11:55:29] I will deploy it on some bigger wikis next week (probably section by section). If any section is better to start or should not have the change, also let me know [12:25:14] 10DBA, 10Lexicographical data, 10Wikidata, 10Wikidata-Campsite, and 3 others: S8 replication issues leading to rows missing during eqiad -> codfw switch (Was: "A few lexemes disappeared") - https://phabricator.wikimedia.org/T206743 (10jcrespo) We know the exact timestamps of missing rows, from ``` db1071-b... [12:42:38] Amir1: or addshore do you know how advanced is MCR in wikidata? [12:42:49] *checks* [12:42:55] I know it's not on READ_NEW [12:43:13] it should be only commons and it'll be on October 15 [12:43:24] I was wondering if I could speed up by only recovering revision, page, recentchanges, user and text [12:43:44] and do comment, slot content later [12:43:49] i expect youll also need actor and comment tables [12:43:53] mmm [12:43:56] ok [12:44:03] but let me confirm [12:44:37] looks like you should also grab all the other related db tables, as the default on all wikis is currently write both [12:45:03] I am not worried about write [12:45:16] we will fully resover all the hosts [12:45:21] default for read is read old [12:45:25] I just want to make for now work all missing pages [12:51:03] sorry, I'm super late with the topic, just wondering if switching back to codfw was considered as an option [12:51:34] it is being considered [12:51:46] but the more I think, 1 hour of data 1 month ago that is recoverable [12:52:42] ack [13:01:12] there is 0 rows on actor [13:01:19] like everywhere [13:07:29] jynus: banyek any objection if I reclone db1116:3318 backup source for s8 eqiad, so we have a non partitioned copy in eqiad? [13:08:22] why not keeping that around just in case? [13:08:45] it may fit if you want to clone it from codfw? [13:08:55] That is what I want to do [13:09:04] i think if the backup source is the first you clone that's a really good idea [13:09:05] Sorry, I was not clear, let me rephrase [13:09:13] ok [13:09:35] I want to reclone db1116:3318 from codfw, so we have a non partiioned copy of the data (as the host that is being recloned now is a RC slavee) [13:09:59] and I say ok, but to keep the old instance [13:10:14] keep the old instance? [13:10:43] ther is 2.3T GB available [13:10:51] jynus: Ah cool, I will move it then [13:10:52] rename it and do not drop it? [13:10:59] yep, now I get what you mean :) [13:19:32] 10DBA, 10Lexicographical data, 10Wikidata, 10Wikidata-Campsite, and 4 others: S8 replication issues leading to rows missing during eqiad -> codfw switch (Was: "A few lexemes disappeared") - https://phabricator.wikimedia.org/T206743 (10Marostegui) s8 host cloning process [] labsdb1011 [] labsdb1010 [] labs... [13:22:43] 10DBA, 10Lexicographical data, 10Wikidata, 10Wikidata-Campsite, and 4 others: S8 replication issues leading to rows missing during eqiad -> codfw switch (Was: "A few lexemes disappeared") - https://phabricator.wikimedia.org/T206743 (10Marostegui) [13:22:44] I have updated the task to put the relevant information links on top, so it is easier to track [13:23:15] I am ready to do the fill in [13:23:19] great [13:24:45] which host to test first? [13:25:18] db1092 maybe? 
[13:25:19] jynus: either db1092, db1104 or db1109 [13:25:34] jynus: or evne dbstore1002, but that one might have even more issues [13:25:43] ah, good one [13:39:44] addshore: can you check dbstore1002? [13:40:14] it is not perfect, but that should give the page backs until we do a proper fix [13:41:14] it takes around 5 minutes to fix one host [13:46:16] *reads up* [13:47:09] hmmmh, dbstore1002 isn't in the list of each servers to access for me :P [13:47:11] jynus: which tables did you fix? [13:47:19] it is on the etherpad [13:47:45] revision, page, user, recentchanges, text, comment, content, logging, revision_comment_Temp and slots [13:48:23] there will be still some issues with change tags wb changes and others [13:48:35] but I think those should be minor [13:50:21] some things like links* and page date refress will be outadated [13:50:39] yeah, change_tags and tag_summary [13:50:41] or lovely friends [13:51:45] so what do you say, I apply the changes on a production host, wait a bit and the apply them to all? [13:52:15] maybe except the labs master [13:52:20] that we can depool [13:52:55] yeah, db1099:3318 will fail, it is down [13:53:10] the sanitarium master we will need to rebuild with mydumper I think [13:53:21] or copy it from codfw [13:53:39] but we still have to propagate the changes to the labs hosts [13:53:55] so should I go on with the main hosts? [13:53:59] the others? [13:54:31] yeah, go for the main hosts and leave the sanitarium master aside for now I would say [13:54:56] I will do db1092, and wait a bit [13:54:59] feel free to do db1101, the other rc slaves [13:55:02] jynus: sounds good [13:55:03] why with mydumper [13:55:05] ? [13:55:26] banyek: yeah, because the sanitarium master will replicate data to sanitarium and to labs [13:55:33] we cannot do scp [14:00:28] we could apply the changes with replication on the master [14:00:40] (the labs master, not the real one) [14:00:56] addshore: check now on db1092 [14:01:01] *checks* [14:01:03] if it is ok, I will apply to all replicas [14:01:23] it is not fixed, ok, just hopefully made it enough to work [14:01:59] hey guys [14:02:01] how's it going? [14:02:04] [14:02:13] jynus: the revision table there loooks great [14:02:20] I have the logical fix [14:02:29] but we are double checking before apply it everywhere [14:02:39] hey mark, banyek and me were on the call [14:02:44] and the page table also looks good for the titles i was checking before [14:02:47] do you want to do the call or should we do it via irc? [14:03:04] addshore: ok, so I will apply to the other hosts [14:03:05] either way, what do you prefer? [14:03:32] whatever is easier for you [14:03:52] jynus: great! [14:04:05] irc is easiest I think but i don't mind a call :) [14:04:40] let's do it on irc then [14:05:33] so, the cloning process: the first rc slave is about to finish. 
Once it finishes and it catches up, i will move it under eqiad master and start the second rc slave, which hopefully will be ready before I go to bed [14:05:57] At the same time, I am cloning the backup source now from codfw as well, which will allow us to have a non partitioned copy ready to be distributed in eqiad [14:06:08] reminder: rc slaves are special as they have partitions [14:06:14] right [14:06:37] so to sum up: I expect to finish both recentchanges slaves later today and have a fresh non partitioned copy ready for tomorrow so we can clone a couple of hosts tomorrow [14:07:12] the fill-in will finis in around 20 minutes [14:07:13] [14:07:19] thanks manuel [14:07:38] we will see how that goes then- if it is good enough for short term [14:07:51] it takes around 5 minute to be applied per server [14:07:51] I can start tomorrow with cloning the non-rc slaves. I just need a few rule of thumbs like, which machines I don't supposed to do the 'basic' way (eg. clone from the source) [14:08:14] but if jynus can help me with that tomorrow if he's there, I can do them one by one [14:08:25] I will work tomorrow from around 7am to 12 [14:08:30] if my fix is good enough [14:08:36] we can take it easy [14:08:39] and do it next week [14:08:57] but how will we know your fix is good enough to last through the weekend? [14:09:10] actually there's a limit of parallelism we can do with cloning [14:09:12] as the data loss will be technically fixed, only we will have discrepancies on chached, non canonica data [14:09:23] mark: that is true for the current state [14:09:29] yes [14:09:34] but we might just be lucky it hasn't broken yet [14:09:42] if replication breaks is the answer :-) [14:09:50] that's what i'm worried about ;) [14:09:55] I mean we need machines to serve, so I guess there's a limit of 'we can't clone more than 2 hosts at a time' [14:10:14] banyek: yes, we cannot clone more than once at the time [14:10:19] the lexemes are back in read only https://www.wikidata.org/wiki/Lexeme:L20543 [14:10:26] https://www.wikidata.org/wiki/Lexeme:L20540 [14:10:32] ^addshore [14:10:36] how come they are readonly? [14:10:40] they are not [14:10:44] great! [14:10:45] but they don't exist on the master [14:10:50] haha [14:10:54] right [14:10:56] once I add the fix on the master [14:11:02] there is no going back [14:11:10] (for eqiad) [14:11:15] no going back to what? [14:11:25] jynus: I assume you are doing sql_log_bin=0 when inserting, no? [14:11:29] to what eqiad looked like before the fillin? [14:11:30] marostegui: yes [14:11:33] great [14:11:43] marostegui: yes [14:11:48] mark: yes [14:11:58] well, once it starts being editted [14:12:22] but on the other side, not adding them is also a rick of breaking codfw [14:12:28] so I think I am going to add it [14:12:29] yeah both ae [14:12:32] can we... [14:12:38] we have backups [14:12:40] have a snapshot of a host which has not yet been filled in? [14:12:42] so we can go back? [14:12:47] yes [14:12:49] ok [14:12:56] I literally asked to leave dbstore2001 [14:12:59] not that [14:13:03] db1116 [14:13:04] alone [14:13:09] to manuel and he did [14:13:15] set an MOTD [14:13:18] I am going to run in on the master [14:13:31] where, -ops? 
[14:13:33] yes, there is a copy at: /srv/sqldata.s8_BACKUP_T206743 [14:13:48] no on the server :P [14:13:54] ok, I can do that [14:13:56] making sure noone does it by accident [14:14:14] it is moved, cannot run [14:14:26] a new instance is being setup there [14:14:28] yeah, that is why I moved [14:14:30] while keeping the old archived [14:14:34] if that is clearer [14:14:42] so it is stopped and shut down [14:14:44] I will leave a MOTD anyways [14:14:47] sure [14:14:55] ok applying on the master and praying [14:15:46] MOTD in place [14:16:39] I have not applied it to db1087 [14:16:41] jynus: once you are done for the fill in day, can you comment on the task with the hosts done so we get to know which ones were done and which ones were not done? [14:16:42] we should depool it [14:16:47] yes [14:16:49] great [14:16:56] ok, master applied [14:16:59] banyek: can you depool db1087? [14:17:02] I am going to depool db1087 [14:17:04] or he can [14:17:13] yes [14:17:14] and then we will edit the page [14:17:24] banyek: give more weight to db1092 (give it 200) [14:17:33] it is doing fine without the BBU [14:17:39] ok [14:19:26] 10DBA, 10Lexicographical data, 10Wikidata, 10Wikidata-Campsite, and 4 others: S8 replication issues leading to rows missing during eqiad -> codfw switch (Was: "A few lexemes disappeared") - https://phabricator.wikimedia.org/T206743 (10jcrespo) applied a fix to: ``` db1109 db1071 db1104 db1101:3318 db1092... [14:19:30] T206743#4658476 [14:19:30] T206743: S8 replication issues leading to rows missing during eqiad -> codfw switch (Was: "A few lexemes disappeared") - https://phabricator.wikimedia.org/T206743 [14:20:17] is the depool ongoing, or did I missed it? [14:20:26] 1087 is the only host in vslow, and dump doesnt it a problem? [14:20:45] no worries [14:20:49] just put no host [14:22:47] so the fix is on all slaves and the master now? [14:23:42] on most slaves [14:23:45] if you +1 it I can deploy [14:23:51] we are depooling the one that isn't [14:24:02] 10DBA, 10MediaWiki-Database, 10Operations: parsercache used disk space increase - https://phabricator.wikimedia.org/T206740 (10Marostegui) p:05Unbreak!>03High This is no longer "unbreak now" ``` pc1004 Filesystem Type Size Used Avail Use% Mounted on /dev/mapper/pc1004--vg-srv xfs 2.2T... [14:24:47] banyek: remove the red and we can deploy [14:24:50] jynus: lovely [14:24:52] I'll let you know if I spot anything odd happening, but I think everything should be okay in this state until the full fix is done [14:24:59] addshore: wait we are depooling [14:25:05] and then writing [14:25:11] to check it will not break immediately [14:25:18] aaah depooling then doing the master [14:25:19] gotcha [14:25:20] then we can reduce the UBN [14:25:28] yup [14:25:33] and deploy slowly the Good fix [14:25:41] This has been an action packed day for you all :P [14:25:44] which is reimage all servers [14:25:59] I am starting db1099:·3318, the first recloned rc! [14:26:08] addshore: all dba days are action packed! [14:26:17] do you know we are hiring! [14:26:23] inded, but not normally with this many UBNs :P [14:26:25] :-) just kidding [14:26:26] *indeed [14:26:35] unless you are intersted, in whcih I am not [14:26:36] hahaha, wikidata are hiring too, come join us ;) [14:26:43] nope, too much data [14:26:56] its fine, we have you to manage it for us ;) [14:27:19] I fixed it can I have a +1 ? :) [14:27:37] banyek: what about a +2? 
[14:27:57] even better [14:28:01] I deploy this [14:28:10] mark: so id the above works- labsdb or codfw replication may break [14:28:17] (replicatino) [14:28:23] but eqiad hosts shouldn't [14:28:31] and that I would be pretty sure [14:28:31] we need to discuss how we want to reclone the sanitarium master [14:28:38] no need to do it now [14:28:51] yes, I am saying that we may do all the followups at a latter time [14:28:52] jynus: what do you mean? [14:28:57] ^that [14:29:05] assuming this works [14:29:08] if what works? editing a page that used to be missing? [14:29:11] yep [14:29:25] why would codfw repl break? [14:29:35] because I have imported the basic data [14:29:42] but not the ancillary one [14:29:46] right [14:29:49] he cached like links here [14:29:56] that doesn't affect the reali integrity [14:30:01] but it make break repl [14:30:14] (as it should had happened 1 month ago) [14:30:51] but it cannto break replication if it is not 100% as before, but all the eqiad ones are identical [14:30:56] db1087 is depooled [14:31:01] thanks, banyek [14:31:04] ok [14:31:05] np [14:31:20] I use should and cannot and must in the rfc meaning [14:31:50] I indeed see it at https://noc.wikimedia.org/conf/highlight.php?file=db-eqiad.php [14:32:47] addshore: can you test edit https://www.wikidata.org/wiki/Lexeme:L20543 for us [14:32:52] we are now in switchover window [14:32:54] sure! :) [14:32:55] can we hold off for a bit? [14:33:00] ok [14:33:03] this is a bad time for stuff to break [14:33:05] addshore: wait [14:33:10] ack [14:34:05] so to be clear, wikidata cannot break- we are only testing the codfw replication, but of course I can wait [14:34:37] yeah but even noise in -operations is not great right now [14:34:38] even if unrelated [14:34:54] cool then, ping us when it is ok [14:35:13] meanwhile, should we delete some binlogs on pc2XXX to go thru the night without issues on codfw? [14:35:21] https://phabricator.wikimedia.org/T206740#4658480 [14:35:26] I think banyek can take care of parser caches [14:35:30] banyek: ^ [14:35:33] he is actively deleting those [14:35:40] only on eqiad as far as I know [14:35:53] oh, are you worried about codfw? [14:35:59] yeah, it is 85% [14:36:00] did replication catch up? [14:36:03] yep [14:36:07] check my comment [14:36:16] sorry, I missed that [14:36:17] yes, the cleaners are running on the parsercache hosts [14:36:29] banyek: but not on codfw, right? [14:36:33] but only in eqiad, yes [14:36:40] you can now add them on codfw [14:36:48] banyek: double check replication is up to date, and clean some on codfw I would suggest [14:37:01] db1099:3318 started now? [14:37:11] yep, it is catching up [14:37:11] or recently? 
[14:37:20] ok, that why I know not to touch it [14:37:24] nope [14:37:27] it has codfw data [14:37:48] I will reclone db1101:3318 in an hour or so [14:37:53] we should have quickly at least 2 servers with codfw data [14:38:11] jynus: today we will have both rc slaves and the backup source [14:38:29] for me quickly == next week [14:38:33] all the pc hosts in eqiad has Seconds_behind_master=0 [14:38:35] hehe [14:38:51] jynus thinks in 5 year scales [14:38:51] jynus: I will leave those 3 today, and tomorrow using the backup source I will try to clone 1 or 2 time allowing [14:38:53] so I clean up the binlogs from the pc200* hosts [14:38:54] no revisions or content are lost right now [14:39:02] so I am pretty satisfied [14:39:04] I set the binlog size to 10M [14:39:07] flush flogs [14:39:18] flush logs and clean up [14:39:22] mark: ok to update the status? [14:39:25] on the ticket [14:40:39] sure [14:42:25] marostegui: what do you think about applying the fixes to the labsdb master with binlong on? [14:42:58] it wont break, and it will sanitize [14:43:22] so we could repool db1087 [14:43:23] ? [14:43:44] yes, and also prevent a potential replication breakage [14:43:45] I was actually thinking that even with myloader, the triggers and filter should work just fine too [14:43:51] jynus: yeah, let's go for it [14:43:57] oh, I mean as a temporary issue [14:43:59] after all you are just doing inserts [14:44:01] I know I know [14:44:04] we can later do any reimport or whateber [14:44:07] I just went a step ahead [14:44:28] you are thinking more and more like me [14:44:30] :-D [14:44:38] will deploy on db1087, then [14:44:41] I have moved db1099:3318 from under db2045 to under eqiad master, so it is almost ready to be repooled back [14:44:44] jynus: +1 [14:44:50] jynus: uh oh :P [14:44:58] ? [14:45:01] not now [14:45:05] * mark makes a snapshot of manuel [14:45:08] XDDD [14:45:14] I purged the binlogs on the pc2000x hosts [14:45:15] I mean whenever the manteinance finishes [14:45:23] I have leave now sorry [14:45:33] ah, sorry, I tough you were telling me no to do it [14:45:40] I'll check the disk space on this hosts at the night [14:45:42] no was just joking [14:45:50] banyek: is the purge running on codfw? [14:46:02] no, that would be my queston [14:46:07] shall I set up it there too? [14:46:14] yes [14:46:15] (I think yes, but double check_ [14:46:17] OK [14:47:06] I'll do it in the evening/night whenever I can do (I'll keep my eye on the disk space) [14:47:10] but now I leave [14:47:16] and sorry about it [14:47:18] ok, please make sure it is done [14:47:22] later is fine [14:47:30] 👍 [14:47:56] So the recloning for db1099:3318 is good, we now have rows: https://phabricator.wikimedia.org/P7665 [14:47:57] jynus: marostegui banyek: feel free to resume operations. Thanks for the patience! [14:47:59] addshore: ^ [14:48:04] akosiaris: thanks [14:48:11] 10DBA, 10Lexicographical data, 10Wikidata, 10Wikidata-Campsite, and 4 others: S8 replication issues leading to rows missing during eqiad -> codfw switch (Was: "A few lexemes disappeared") - https://phabricator.wikimedia.org/T206743 (10jcrespo) We believe the most important issues (missing revisions, pages... [14:48:11] shall I edit? :) [14:48:17] yes [14:48:20] addshore: wait [14:48:26] ? 
[14:48:27] deployin fix to db1087, then edit [14:48:32] ack [14:48:42] so only codfw replication will be at risk [14:48:43] ah the lost lexemes, ok sorry [14:48:53] akosiaris: not just lexemes ;) [14:48:54] akosiaris: it is more than lexemes [14:49:01] 1 hour of wikidata data was missing [14:49:07] or around 300MB [14:49:15] i thought it was 40 mins :P [14:49:26] 55 I thought! [14:49:42] 2018-09-13 09:08:17 [14:49:44] wikidata adds 300 MB/hour? [14:49:47] 2018-09-13 09:58:26 [14:49:48] we need to send WMDE a bill ;p [14:49:53] 50mins [14:49:55] do the math [14:49:58] mark: hah [14:50:09] I can give you the # transactions [14:50:15] if you want too :-) [14:50:22] it's on the ticket right [14:51:16] jynus: we need to disconnect codfw -> eqiad replication btw (not now, not tomorrow), but on monday [14:51:32] swift window starts in 10 [14:52:10] jynus: ping me when you want an edit [14:52:47] it is about to finish [14:53:38] 10Blocked-on-schema-change, 10DBA, 10Anti-Harassment, 10Patch-For-Review: Execute the schema change for Partial Blocks - https://phabricator.wikimedia.org/T204006 (10Marostegui) We have had unexpected fires that might last long, so this might be delayed more than next week. We will do whatever we can, but... [14:54:23] addshore: ok now [14:54:29] {{doing}} [14:55:01] done [14:55:08] revid 762124226 [14:55:10] oh, I didn't fix the archive table [14:55:22] that may be an issue if undeletions happen [14:55:58] well it hasn't broken in 1 month, what are the changes of breaking after they have been mostly fixed? [14:56:07] dammit! [14:56:12] don't jinx it! [14:56:16] lol [14:56:39] btw, any ideas what on earth happened ? [14:56:44] no idea [14:56:51] we know the exact what [14:56:54] but not the why [14:57:33] what amazes me is the fact that so many transactions were gone and yet, replication didn't break [14:57:40] that I cannot really understand [14:57:48] and happening 30 days ago, most debugging options are gone [14:57:55] what would cause replication to break in that scenario normally? [14:58:10] addshore: the thing we fear now at the time? [14:58:17] just codfw breaks [14:58:26] which is inactive [14:58:30] addshore: an update on a missing row for example [14:58:36] marostegui: ack [14:58:43] addshore: replication didn't break [14:58:47] that is the problem [14:58:52] even with row based replication [14:59:01] I guess there are only a few causes where row updates would be attempted on rows that don't exist though [14:59:08] maybe no one uses wikidata really? :-) [14:59:24] addshore: deletes, I guess, which maybe are minimal [14:59:25] well, wikidata is used allot, but 99% of requests are just doing inserts [14:59:32] which is good [14:59:42] and maybe mediawiki entertily should move to that model [14:59:49] it makes backups and recoveries trivial [15:00:09] yeh, deletes would also be minimal, at least minimal for a 1 hour period, deletes happen, but the chances of that hiting the 1 hours worth of items would be unlikely [15:00:18] addshore: hopefully you saw my comment [15:00:34] there may be issues like people that changed preferences at that time [15:00:39] being reverted [15:00:47] or what links here being not perfect [15:00:48] yup [15:00:54] but those we can fix later [15:01:03] I'm going to move this ticket to the announce column on our board [15:01:12] I will remove the UBN [15:01:17] with your ok, addshore [15:01:25] yup, it sounds good [15:01:47] and we can resume next week :) [15:01:55] mark: should we announce something, or just the regular #user-notice is ok? 
[15:04:48] what do you mean? [15:04:49] announce what? [15:05:16] I was going to add a user-noticed [15:05:19] * addshore was just going to ask if something from ops / dba side would be announced for us to link to or something [15:05:27] that tag tells people to read the wikitech news [15:05:37] well, to add this to that [15:06:07] e.g. for people that noticed some pages where gone and now they are back [15:06:15] is that ok? [15:06:24] a phabricator tag? [15:06:27] yes [15:06:28] i'm not aware of that [15:06:34] yes, you can certainly add that [15:06:34] ok, I will use it [15:06:43] and then I can write an incident report [15:06:45] next week [15:06:48] indeed [15:06:58] do you think you still need me? [15:07:12] not now I think [15:07:15] I think worse case scenario, codfw breaks for the first time in 30 days [15:07:24] of course please stay available ;) [15:07:29] and that doesnt affect you [15:07:33] *users [15:07:38] not you :-) [15:07:46] I am going to finish cloning the other rc slave, and repool it once it finishes [15:07:52] I will leave that done before going to bed [15:08:03] Also will leave the fresh eqiad non partitioned copy on db1116:3318 [15:08:17] actually, it is safer now to leave all host equal? [15:08:21] addshore: I think we will mention this issue in a mail following the switchback completion (soon) [15:08:45] jynus: recentchanges will be the only ones with codfw content [15:08:50] the others will remain with the fix [15:08:52] "fix" [15:09:06] So, what do we want to do about tomorrow then? [15:09:21] so jaime will be available tomorrow [15:09:26] that means you don't need to be necessarily [15:10:02] available as in working or available as in: call me if needed? [15:10:09] he will be working [15:10:15] and he'll get that compensated soon [15:10:29] if you think that's useful you can work in the morning too, but not sure it's needed? [15:10:53] If he is going to be working, I will be working from 7 to 12 or so too [15:11:05] balazs should be there too of course [15:11:08] sorry, I undertood as "available" just in case? [15:11:16] no, working, right? [15:11:18] recloning servers? [15:11:33] sorry if I misunderstood [15:11:35] yes, that is what I said before, but I am saying now [15:11:45] we are no longer in a hurry [15:12:24] 10DBA, 10MediaWiki-Special-pages, 10Datacenter-Switchover-2018: Significant (17x) increase in time spent by updateSpecialPages.php script since datacenter switch over updating commons special pages - https://phabricator.wikimedia.org/T206592 (10Bawolff) Im away this next 1.5 weeks for a conference. I will fo... [15:12:30] fires are gone, we need a followup, but that will not be done in a day either [15:12:46] no but the earlier we get it done the less risk of course? [15:12:54] not at this point [15:13:10] to reclone the master we need to do a failover [15:18:41] alright [15:20:29] 10DBA, 10Lexicographical data, 10Wikidata, 10Wikidata-Campsite, and 5 others: S8 replication issues leading to rows missing during eqiad -> codfw switch (Was: "A few lexemes disappeared") - https://phabricator.wikimedia.org/T206743 (10jcrespo) For #user-notice, if it seems reasonable, see T206743#4658525 f... [15:20:56] 10DBA, 10Lexicographical data, 10Wikidata, 10Wikidata-Campsite, and 5 others: S8 replication issues leading to rows missing during eqiad -> codfw switch (Was: "A few lexemes disappeared") - https://phabricator.wikimedia.org/T206743 (10jcrespo) p:05Unbreak!>03High High because data came back. 
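To illustrate the failure mode discussed above ("an update on a missing row"), here is a purely hypothetical sketch, not the production schema, of why row-based replication normally breaks in this situation, and why it stays quiet as long as nothing touches the missing rows:

```
-- Hypothetical table, not part of the MediaWiki schema:
CREATE TABLE demo_table (id INT PRIMARY KEY, val VARCHAR(64));

-- On the master, with binlog_format = ROW. Suppose the replica silently
-- missed this INSERT (a gap like the one in T206743):
INSERT INTO demo_table (id, val) VALUES (42, 'written during the gap');

-- Later writes that never reference id = 42 replicate fine, so the
-- replica keeps running and nothing alerts:
INSERT INTO demo_table (id, val) VALUES (43, 'written after the gap');

-- Only when an event references the missing row does the replica's SQL
-- thread stop, typically with error 1032 (HA_ERR_KEY_NOT_FOUND):
UPDATE demo_table SET val = 'touched' WHERE id = 42;
```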
[15:20:56] marostegui: should we repool db1087 as a last action? [15:21:02] jynus: yeah [15:21:11] I will later repool db1101:3318 which I am cloning now [15:23:59] db1116 finished, maybe I can reclone db1104 today even (the candidate master) [15:24:05] jynus: ^ what do you think? [15:24:16] I think we really shouls stop [15:24:23] you will be able to do 1 [15:24:26] but that may break [15:24:33] for the same resons than codfw [15:24:36] yeah [15:24:38] we should do a lot at the same time [15:24:47] but not now in a rush [15:25:00] ok, I will leave db1099 and db1101 (rc) [15:25:03] we will keep the status quo until we can clone several at the same time [15:25:06] and on monday we can continue [15:25:15] help me with the revert [15:25:21] which hosts should be pooled now? [15:25:29] db1087 [15:25:31] db1101 or db1099? [15:25:37] no, among those 2 [15:25:41] db1099 [15:25:43] db1101 out [15:25:45] db1099 in [15:26:56] double check the review, please [15:27:24] https://gerrit.wikimedia.org/r/466653 [15:27:38] checking [15:28:11] leave db1092 with weight 200 [15:28:16] ok [15:28:25] the rest is ok [15:28:53] while I deploy, can you send an email to not touch anything [15:29:00] just keep an eye on the parsercaches? [15:29:08] to bany*k [15:29:10] yep [15:29:12] will CC you [15:29:15] thanks [15:29:45] mark: you want to be CC'ed on that? [15:29:49] or no need? [15:30:09] I *really* think it is safer to have all hosts with the same content [15:30:39] yeah [15:30:52] always CC in doubt [15:30:59] on monday let's do a bunch of the quickly, as we have now a copy in eqiad [15:31:01] < meeting [15:31:06] we will be able to do 2-3 quickly enough [15:31:10] mark: great, thanks [15:31:12] yep [15:31:15] marostegui: we can even copy [15:31:25] and later depool, change ,quickly [15:31:30] to avoid potential drifts [15:31:42] yep, if we have enough space [15:31:44] and then we have to think about the master [15:31:51] we have to failover it [15:32:00] as soon as all the hosts are "good" [15:32:01] maybe, we will see [15:32:13] not something that can be solved in a day [15:32:13] I think we must do [15:32:15] no [15:32:25] but as soon as we clone the other hosts, we probably should [15:32:27] buuuuut [15:32:37] we need to grab another host to replace that old host, and I thought about db1118 [15:32:45] And we need to make sure it goes to row B [15:32:51] so we kill another master from row D [15:32:57] as always, many things to do before [15:32:57] yep, could be [15:33:05] I need the time to think [15:33:36] it is ok to rush the fire fix but we shouln't rush on the long term fix [15:34:10] jynus: another case of: https://media1.tenor.com/images/34584055d25cd4fa2db8521257bcfc05/tenor.gif [15:34:42] well, all of the things that failed, except he unknown root cause [15:34:49] we have a fix for but no time [15:35:07] automtic checking, running out of space on parsercaches, more alerts [15:38:24] so db1116 is now replicating under eqiad master [15:39:06] 10DBA, 10MediaWiki-Database, 10Operations, 10Datacenter-Switchover-2018: parsercache used disk space increase - https://phabricator.wikimedia.org/T206740 (10jcrespo) Forgetting the codfw -> eqiad replication was the most likely cause of overload on the application servers (and on External storage hosts). [15:43:45] jynus: are we still purging on eqiad? 
[15:43:51] pc keys [15:44:01] (I am summing up things on the email) [15:44:03] yes, I pined alex to keep an eye [15:44:07] cool [15:44:10] but you can sumamrize on mail [15:44:20] still running on mwmaint1002 [15:44:25] great [15:44:32] 22% [15:44:40] probably deleting the whole old ones [15:44:58] may create some lag on codfw parsercache, but we don't care about taht [15:45:13] yep [15:47:50] the deploy worked well, no errors [15:48:07] I am going to go offline [15:48:12] see you on monday [15:49:12] thanks for all the great work jynus [15:49:23] I am going to go offline as soon as I send the email and then back online to finish up with db1101 [15:50:45] Thanks for all of your work :D [15:53:57] addshore: thanks for all your help and for discovering this [15:54:23] I guess you didn't get a chance to look at the dispatch ticket at all today? ;) [15:55:01] hehe nope :) [15:55:08] but it is fast now, no? [15:55:10] as we are in eqiad? [16:05:26] marostegui: no it was fast in codfw [16:05:28] slow in eqiad [16:05:42] and it seems to be down to that 1 query [18:15:41] 10DBA, 10MediaWiki-Database, 10Operations, 10Datacenter-Switchover-2018: parsercache used disk space increase - https://phabricator.wikimedia.org/T206740 (10akosiaris) >>! In T206740#4658660, @jcrespo wrote: > Forgetting the codfw -> eqiad replication was the most likely cause of overload on the applicatio... [18:17:11] 10DBA, 10MediaWiki-Database, 10Operations, 10Datacenter-Switchover-2018: parsercache used disk space increase - https://phabricator.wikimedia.org/T206740 (10Marostegui) >>! In T206740#4659202, @akosiaris wrote: >>>! In T206740#4658660, @jcrespo wrote: >> Forgetting the codfw -> eqiad replication was the mo... [23:17:50] 10DBA, 10JADE, 10Operations, 10MW-1.32-notes (WMF-deploy-2018-09-25 (1.32.0-wmf.23)), and 3 others: Write our anticipated "phase two" schemas and submit for review - https://phabricator.wikimedia.org/T202596 (10awight) @Marostegui These are the proposed indexes, if you want to discuss something concrete: h...