[05:42:09] DBA, User-Banyek, Wikimedia-Incident: Create replication icinga check for the Parsercache hosts - https://phabricator.wikimedia.org/T206992 (Marostegui) p:Triage>Normal
[06:38:46] DBA, Operations, Datacenter-Switchover-2018, User-Joe: Evaluate the consequences of the parsercache being empty post-switchover - https://phabricator.wikimedia.org/T206841 (Marostegui) Incident report (please feel free to add or modify whatever you feel needs changing!): https://wikitech.w...
[06:39:16] DBA, MediaWiki-Database, Operations, Datacenter-Switchover-2018, and 3 others: parsercache used disk space increase - https://phabricator.wikimedia.org/T206740 (Marostegui) Incident report: https://wikitech.wikimedia.org/wiki/Incident_documentation/20181016-eqiad_parsercache_empty_post-switchover
[07:25:21] banyek|away, marostegui: I'll look into the pt-kill patch later today, it's on my list
[07:25:36] moritzm: no worries, it is low priority
[07:27:21] DBA, Patch-For-Review, User-Banyek: Productionize db2096 on x1 - https://phabricator.wikimedia.org/T206593 (Marostegui) what's pending here?
[08:19:55] good morning
[08:20:00] moritzm: thanks!
[08:22:11] DBA, Patch-For-Review, User-Banyek: Productionize db2096 on x1 - https://phabricator.wikimedia.org/T206593 (Banyek) Nothing, we just agreed yesterday not to put it into production until today, iirc
[08:22:48] DBA, Patch-For-Review, User-Banyek: Productionize db2096 on x1 - https://phabricator.wikimedia.org/T206593 (Marostegui) Make sure you enable notifications first and update zarcillo DB before repooling.
[08:26:09] DBA, Cloud-Services: Prepare and check storage layer for vnwikimedia - https://phabricator.wikimedia.org/T207095 (Banyek) p:Triage>Normal Will this wiki be public, or not?
[08:26:44] DBA, Cloud-Services, User-Banyek: Prepare and check storage layer for vnwikimedia - https://phabricator.wikimedia.org/T207095 (Banyek)
[08:28:24] DBA, Cloud-Services, User-Banyek: Prepare and check storage layer for vnwikimedia - https://phabricator.wikimedia.org/T207095 (Marostegui) I think the wiki will be public: publicly readable, editing restricted. But I will leave @Urbanecm to confirm whether this can be replicated to labs. This one is...
[08:29:43] DBA, Patch-For-Review, User-Banyek: Productionize db2096 on x1 - https://phabricator.wikimedia.org/T206593 (Banyek) Is there a tool for updating zarcillo about the host, or should I 'INSERT INTO ...'?
[08:31:56] marostegui: can I proceed with pc2004?
[08:32:21] banyek: if you have downtimed all the stuff, stopped replication and all that, up to you
[08:32:31] 👍
[08:32:43] tx
[08:32:54] I'll keep updating the ticket with the progress
[08:32:58] DBA, Cloud-Services, User-Banyek: Prepare and check storage layer for vnwikimedia - https://phabricator.wikimedia.org/T207095 (Urbanecm) Open>stalled Fishbowl means the public can read, those with an account can edit. The current standard is to replicate fishbowl wikis to labs. Please note some obj...
[08:34:11] DBA, Patch-For-Review, User-Banyek: Productionize db2096 on x1 - https://phabricator.wikimedia.org/T206593 (Marostegui) >>! In T206593#4669570, @Banyek wrote: > Is there a tool for updating zarcillo about the host, or should I 'INSERT INTO ...'? I would ask @jcrespo to be sure. You also need to upd...
[08:34:45] DBA, Cloud-Services, User-Banyek: Prepare and check storage layer for vnwikimedia - https://phabricator.wikimedia.org/T207095 (Banyek) We can only proceed after the wiki is created, so when that happens, ping me (or anybody in the persistence team), and we can do the sanitization
[08:36:47] DBA, Patch-For-Review, User-Banyek: Productionize db2096 on x1 - https://phabricator.wikimedia.org/T206593 (Banyek) ok, waiting on @jcrespo's answer, then I'll update zarcillo and tendril at once
[08:52:16] marostegui: I don't really see anything to downtime on pc2004 - I checked the icinga checks on the host, and I think none of them will be triggered
[08:52:49] banyek: sure, I haven't looked at it, just mentioned it just in case :)
[08:53:32] well, better to be checked than not!
[08:53:42] :)
[09:13:46] I checked the binlogs on pc2004 - it seems they only contain data from pc1004, so I stopped replication
[09:14:10] I'll `reset slave all` on pc1004
[09:15:22] Sure, grab the position just in case
[09:15:29] and execute the alters without binlog on pc2004
[09:17:13] yep
[09:23:40] DBA, MediaWiki-Database, Operations, Datacenter-Switchover-2018, and 3 others: parsercache used disk space increase - https://phabricator.wikimedia.org/T206740 (Banyek) On pc1004 the slave status before 'reset slave all': ``` Slave_IO_State: Master_Host: pc200...
[09:39:46] DBA, MediaWiki-Database, Operations, Datacenter-Switchover-2018, and 3 others: parsercache used disk space increase - https://phabricator.wikimedia.org/T206740 (Banyek) Regarding the information_schema, there will not be much data to reclaim: ``` MariaDB [(none)]> select SUM(DATA_LENGTH)/1024/...
[09:41:26] DBA, MediaWiki-Database, Operations, Datacenter-Switchover-2018, and 3 others: parsercache used disk space increase - https://phabricator.wikimedia.org/T206740 (Marostegui) Pick some of the biggest tables and try to alter them to see how much you get and then we can probably extrapolate. If it i...
[09:42:53] DBA, MediaWiki-Database, Operations, Datacenter-Switchover-2018, and 3 others: parsercache used disk space increase - https://phabricator.wikimedia.org/T206740 (Banyek) luckily all the tables are around 6.5 GB there, so yeah, I guess picking one is good enough. I'll do it now, then provide the data here
[09:43:50] DBA, MediaWiki-Database, Operations, Datacenter-Switchover-2018, and 3 others: parsercache used disk space increase - https://phabricator.wikimedia.org/T206740 (Marostegui) >>! In T206740#4669725, @Banyek wrote: > luckily all the tables are around 6.5 GB there, so yeah, I guess picking one is good e...
[09:46:08] DBA, Data-Services: Discrepancies with logging table on different wikis - https://phabricator.wikimedia.org/T71127 (Marostegui)
[09:46:15] DBA, Datasets-General-or-Unknown, Patch-For-Review, Release-Engineering-Team (Watching / External), and 2 others: Automate the check and fix of object, schema and data drifts between mediawiki HEAD, production masters and slaves - https://phabricator.wikimedia.org/T104459 (Marostegui)
[09:53:28] DBA, MediaWiki-Database, Operations, Datacenter-Switchover-2018, and 3 others: parsercache used disk space increase - https://phabricator.wikimedia.org/T206740 (Banyek) the result is 5.9 GB, which is more than expected, but still not too much - less than 10% - with the whole operation we can free...
[09:54:33] Marostegui: damn me, I forgot to add log_bin=0 before running the table rebuild - if we connect pc1004 as a slave it will be run on it too
[09:54:50] what?
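Note on the [09:54:33] remark: in MySQL/MariaDB the session-level switch that keeps a statement out of the binary log is `sql_log_bin` (`log_bin` is the server option that enables binary logging as a whole). A minimal sketch of prefixing a statement with it follows. The helper name `no_binlog_sql` is hypothetical and this is not the command that was actually run on pc2004; it only prints SQL and executes nothing.

```shell
# Hypothetical helper (an assumption, not taken from the log):
# wrap one SQL statement so that, when run in a single session,
# it is not written to the binary log and so cannot replicate.
no_binlog_sql() {
    # Drop one trailing semicolon, if present, so the output
    # always ends in exactly one.
    stmt="${1%;}"
    printf 'SET SESSION sql_log_bin = 0; %s;\n' "$stmt"
}

# Dry run: print the SQL for the rebuild mentioned in the chat.
no_binlog_sql 'ALTER TABLE pc255 ENGINE=InnoDB'
```

Piping that output into one `mysql` invocation would run both statements in the same session, which is what makes the setting apply to the ALTER. Changing `sql_log_bin` needs an administrative privilege such as SUPER.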
[09:54:53] it took 3 minutes, so that's not *that* bad, but we need to know
[09:55:18] the `alter table pc255 engine=innodb` query on pc2005
[09:55:22] *2004
[09:55:28] (damn mac keyboard)
[09:55:40] we don't have to connect replication on eqiad for now, in fact it is normally disabled
[09:56:11] I know, but I wanted to say it out loud so it gets noticed
[09:56:26] Yeah, it could have been a lot worse if that had happened somewhere else
[09:56:33] So be careful when running that kind of thing
[09:57:20] DBA, MediaWiki-Database, Operations, Datacenter-Switchover-2018, and 3 others: parsercache used disk space increase - https://phabricator.wikimedia.org/T206740 (Marostegui) Considering we are at around 70% at eqiad, and should remain like that, I don't think it is worth the hassle and the risk of...
[10:09:15] DBA, MediaWiki-Database, Operations, Datacenter-Switchover-2018, and 3 others: parsercache used disk space increase - https://phabricator.wikimedia.org/T206740 (Banyek) ok. I'll truncate the tables then in pc2004 today, 2005 tomorrow and 2006 the day after tomorrow. I also stop the binlog purgers...
[10:10:05] DBA, MediaWiki-Database, Operations, Datacenter-Switchover-2018, and 3 others: parsercache used disk space increase - https://phabricator.wikimedia.org/T206740 (Marostegui) You can probably disconnect pc1005 and pc1006 today already, but keep executing the truncates without binlog, just in case.
[10:12:44] DBA, MediaWiki-Database, Operations, Datacenter-Switchover-2018, and 3 others: parsercache used disk space increase - https://phabricator.wikimedia.org/T206740 (Marostegui)
[10:15:53] DBA, MediaWiki-Database, Operations, Datacenter-Switchover-2018, and 3 others: parsercache used disk space increase - https://phabricator.wikimedia.org/T206740 (Banyek) ok, good idea
[10:26:19] marostegui: can you give me a +1 on https://gerrit.wikimedia.org/r/#/c/operations/debs/wmf-pt-kill/+/466886/ again? I lowered the log keep period, and then I can merge this before lunch
[10:26:36] (I'll finish the pc2004 beforehand anyways)
[10:27:20] banyek: I can, but logrotate is really low priority as we stated before, there are a lot of more important things we need to address first that are part of incidents
[10:28:02] I know that, but if I have it in a
[10:28:21] have it in a?
[10:28:23] 'mergeable' state, it means there will be no blockers when we have time for it
[10:28:28] ok
[10:28:49] (I HATE THIS KEYBOARD) ((IT'S BUILT BY THE DEVIL HIMSELF))
[10:29:35] not saying don't work on it, but wait until we have things on the way, especially the parsercache alert, the schema change plan, the db2042 BBU plan...
[10:30:00] (((The previous mbp/mba keyboard was the best keyboard ever, but this is simply too sensitive, and the touchbar is just ... bad )))
[10:30:35] marostegui: I know, but later this week we'll be in the state (hopefully) when all of those are solved (I hope friday will be that day)
[10:31:00] we'll see
[10:31:08] you never know what can happen tomorrow!
[10:33:00] DBA, MediaWiki-Database, Operations, Datacenter-Switchover-2018, and 3 others: parsercache used disk space increase - https://phabricator.wikimedia.org/T206740 (Banyek) pc1005 replication before reset slave all: ``` Master_Host: pc2005.codfw.wmnet Master_Log_Fi...
[10:34:16] indeed: http://www.ironworksforum.com/forum/showthread.php?t=89044
[10:41:32] DBA, MediaWiki-Database, Operations, Datacenter-Switchover-2018, and 3 others: parsercache used disk space increase - https://phabricator.wikimedia.org/T206740 (Banyek) pc1006 replication before reset slave all: ``` Master_Host: pc2006.codfw.wmnet Master_Log_File: pc2006-bin.154452 Exec_Master_Lo...
[10:51:55] DBA, MediaWiki-Database, Operations, Datacenter-Switchover-2018, and 3 others: parsercache used disk space increase - https://phabricator.wikimedia.org/T206740 (Banyek) I'll start truncating the tables on pc2004 with: ```for TABLE in $(mysql --skip-ssl -BN -e "show tables" parsercache); do my...
[10:52:03] marostegui: ^
[10:52:40] please double check nothing replicates from pc2004
[10:54:32] DBA, MediaWiki-Database, Operations, Datacenter-Switchover-2018, and 3 others: parsercache used disk space increase - https://phabricator.wikimedia.org/T206740 (Banyek)
[10:54:53] DBA, MediaWiki-Database, Operations, Datacenter-Switchover-2018, and 3 others: parsercache used disk space increase - https://phabricator.wikimedia.org/T206740 (Banyek)
[10:55:32] ```banyek@pc2004:~ $ sudo mysql --skip-ssl -e "show slave hosts"```
[10:55:45] empty result set
[10:56:32] ```banyek@pc1004:~ $ sudo mysql --skip-ssl -e "show slave status\G"
[10:56:32] banyek@pc1004:~ $```
[10:56:39] empty result set
[10:57:00] running truncates
[10:59:34] good
[10:59:47] !log it please
[10:59:47] marostegui: Not expecting to hear !log here
[11:01:28] ^ (^_^)
[11:33:34] DBA, MediaWiki-Database, Operations, Datacenter-Switchover-2018, and 3 others: parsercache used disk space increase - https://phabricator.wikimedia.org/T206740 (Banyek)
[12:36:40] DBA, Cloud-Services, Cloud-VPS, Epic, Tracking: Labs databases rearchitecture (tracking) - https://phabricator.wikimedia.org/T140788 (Marostegui)
[12:36:43] DBA, Cloud-Services, Cloud-VPS: Explore 'Analyze' statement as substitute for Explain - https://phabricator.wikimedia.org/T141095 (Marostegui) Open>declined I am going to decline as there are no plans to use analyze as a substitute of EXPLAIN. Instead `show explain` maybe in combination with...
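Note on the truncation run from [10:51:55] to [10:57:00]: the loop posted to the ticket is cut off in this log, so the following is only a sketch of the same idea and not the original command. `gen_truncates` is a hypothetical helper that turns a table list on stdin into TRUNCATE statements, disabling `sql_log_bin` first so that nothing reaches the binlog. The table names in the dry run are illustrative.

```shell
# Hypothetical helper (an assumption, not the command from the ticket):
# read table names from stdin, one per line, and print a SQL script
# that truncates each of them without writing to the binary log.
gen_truncates() {
    echo 'SET SESSION sql_log_bin = 0;'
    while IFS= read -r table; do
        # Skip blank lines so no malformed statement is emitted.
        if [ -n "$table" ]; then
            printf 'TRUNCATE TABLE %s;\n' "$table"
        fi
    done
}

# Dry run with two illustrative table names; prints SQL only.
printf 'pc000\npc001\n' | gen_truncates
```

On a real host the table list could come from `mysql --skip-ssl -BN -e "show tables" parsercache`, as in the ticket, with the generated script piped into a single `mysql` session so the `sql_log_bin` setting covers every TRUNCATE. The `show slave hosts` and `show slave status` checks from the chat would still come first.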
[12:50:15] DBA, MediaWiki-General-or-Unknown, Operations, Wikidata, and 6 others: Investigate decrease in wikidata dispatch times due to eqiad -> codfw DC switch - https://phabricator.wikimedia.org/T205865 (Addshore) Open>Resolved a:Addshore All looks good to me! :) I'm going to mark this as re...
[14:00:49] I am finishing fixing db1087
[14:00:59] jynus: <3!!!!
[14:01:02] I will start replication and fix the master later
[14:01:05] tomorrow
[14:01:32] wikidata replication for labs may break
[14:01:42] cannot do much about that
[14:02:00] sure, we can always use myloader of course
[14:02:30] jynus: how were the diffs? too bad?
[14:02:50] time consuming
[14:03:08] you are the best
[14:05:27] that is great news :)
[14:58:23] I am adding these things to the timeline of the incident report so we don't have to go thru logs later
[15:01:42] hooray for Jaime!
[16:24:59] DBA, User-Banyek, Wikimedia-Incident: Create replication icinga check for the Parsercache hosts - https://phabricator.wikimedia.org/T206992 (Volans) I think we could also consider adding an alert based on the hit ratio of the parsercache caches (we already have the data in grafana)
[22:01:50] DBA, MediaWiki-Cache, Operations, Datacenter-Switchover-2018, and 3 others: parsercache used disk space increase - https://phabricator.wikimedia.org/T206740 (Krinkle)