[05:10:24] 10DBA, 10Operations, 10Epic, 10Patch-For-Review: DB meta task for next DC failover issues - https://phabricator.wikimedia.org/T189107 (10Marostegui)
[05:10:28] 10DBA, 10Operations, 10ops-eqiad: db1069 (x1 master) memory errors - https://phabricator.wikimedia.org/T201133 (10Marostegui) 05Open>03Resolved a:03Marostegui As it happened before, this recovered itself - closing for now: ``` 04:26 < icinga-wm> RECOVERY - Memory correctable errors -EDAC- on db1069 is...
[05:53:21] 10DBA, 10Patch-For-Review, 10User-Banyek: Productionize db2096 on x1 - https://phabricator.wikimedia.org/T206593 (10Marostegui) I have added it to tendril.
[06:17:53] 10DBA, 10Lexicographical data, 10Wikidata, 10Wikidata-Campsite, and 5 others: S8 replication issues leading to rows missing during eqiad -> codfw switch (Was: "A few lexemes disappeared") - https://phabricator.wikimedia.org/T206743 (10Marostegui) Update from yesterday at around 14:00UTC @jcrespo has done a...
[06:22:18] Hello jynus, db1124 replication failed on one of the tables. I have been taking a look, but I didn't want to touch anything as I didn't know what your plan was for that table or anything
[06:22:50] The id that is missing is: 580862119
[06:22:57] I will have a look
[06:23:08] this was expected
[06:23:15] I know
[06:23:25] I just advanced a bit so you don't have to do all the digging
[06:23:54] INSERT INTO `wb_items_per_site` VALUES (580862119,20068036,'enwiki','Ihor Lapin');
[06:24:20] however, there is a unique key there, so you can see there is a row if you do: select * from wb_items_per_site where ips_site_id='enwiki' and ips_site_page='Ihor Lapin';
[06:24:26] because db1087 is fixed but the replicas are not
[06:24:40] did you fix db1087 with replication?
[06:24:51] no, without it
[06:24:57] ah then this is "good"!
[06:24:57] :)
[06:38:24] 10DBA, 10Wikimedia-Incident: Compare a few tables per section between hosts and DC - https://phabricator.wikimedia.org/T207253 (10Marostegui)
[06:38:48] 10DBA, 10Wikimedia-Incident: Compare a few tables per section between hosts and DC - https://phabricator.wikimedia.org/T207253 (10Marostegui) p:05Triage>03Normal
[06:39:11] I've restarted db1124, but it will break again, it needs the same treatment as db1087
[06:39:17] I will focus on the master now
[06:39:36] 10DBA, 10Lexicographical data, 10Wikidata, 10Wikidata-Campsite, and 6 others: S8 replication issues leading to rows missing during eqiad -> codfw switch (Was: "A few lexemes disappeared") - https://phabricator.wikimedia.org/T206743 (10Marostegui)
[06:40:18] jynus: sounds good, thank you
[06:52:18] be prepared for s8 replication breaking everywhere if I make a mistake
[06:52:34] * marostegui ready!
[06:54:56] you're doing it without replication, no?
[06:55:02] no
[06:55:18] Ah right!
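The exchange above amounts to repairing a host locally so the fix cannot propagate through replication. A minimal sketch of that approach: the row values come from the log, while the use of a session-scoped `sql_log_bin` is an assumption about how the local-only write was done.
```
-- Confirm whether the row is already there via the unique (site, page) key:
SELECT * FROM wb_items_per_site
WHERE ips_site_id = 'enwiki' AND ips_site_page = 'Ihor Lapin';

-- Re-insert the missing row on this host only, keeping it out of the binlog
-- so nothing is pushed to downstream replicas (session-scoped setting):
SET SESSION sql_log_bin = 0;
INSERT INTO wb_items_per_site VALUES (580862119, 20068036, 'enwiki', 'Ihor Lapin');
SET SESSION sql_log_bin = 1;
```
Setting `sql_log_bin` requires SUPER, and it is only reasonable here because the other hosts are deliberately left untouched, which is exactly the recovery argument made later in the log.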
[06:55:20] * marostegui ready
[07:05:21] there are some cases where I will have to use replication, like updating counters
[07:05:30] so that they are consistent everywhere
[07:05:57] yeah
[07:06:12] and the master is the one that has the right data
[07:13:48] I'm checking the SMART error on db2051
[07:14:17] banyek: check the automatic ticket created
[07:18:08] I am not sure smart errors create tickets, there is some overlap there and a lack of coordination
[07:18:19] but worth checking if it is created anyway
[07:18:28] No, there is a degraded raid ticket created
[07:18:33] ah, that
[07:18:34] which auto-acks the alert
[07:18:43] but the SMART alert should be ack'ed manually
[07:19:02] 2/28 tables fixed
[07:19:07] on master
[07:19:34] <3
[07:19:53] the problem is there will be some inconsistencies that may exist in the data
[07:20:02] like counted items on a category
[07:20:09] from before you mean?
[07:20:14] in general
[07:20:16] yeah
[07:20:21] That for sure
[07:20:22] that may need an app refresh
[07:20:36] maybe created by replication
[07:20:46] but where both are right in their local state
[07:21:09] but I guess as long as the primary data is fixed, and the databases agree with each other
[07:21:13] those are minor details
[07:21:39] yeah
[07:21:49] Those probably come from even months ago
[07:22:01] 10DBA, 10Operations, 10ops-codfw: Degraded RAID on db2051 - https://phabricator.wikimedia.org/T207212 (10Banyek)
[07:44:15] 10DBA, 10Operations, 10ops-eqiad: rack/setup/install pc1007-pc1010 - https://phabricator.wikimedia.org/T207258 (10Marostegui)
[07:45:26] 10DBA, 10Operations, 10ops-codfw: rack/setup/install pc2007-pc2010 - https://phabricator.wikimedia.org/T207259 (10Marostegui)
[07:45:45] 10DBA, 10Operations, 10ops-codfw: rack/setup/install pc2007-pc2010 - https://phabricator.wikimedia.org/T207259 (10Marostegui) p:05Triage>03Normal
[07:46:01] 10DBA, 10Operations, 10ops-eqiad: rack/setup/install pc1007-pc1010 - https://phabricator.wikimedia.org/T207258 (10Marostegui) p:05Triage>03Normal
[07:46:31] 10DBA, 10Operations, 10ops-codfw: rack/setup/install pc2007-pc2010 - https://phabricator.wikimedia.org/T207259 (10Marostegui)
[07:46:47] 10DBA, 10Operations, 10ops-eqiad: rack/setup/install pc1007-pc1010 - https://phabricator.wikimedia.org/T207258 (10Marostegui)
[07:47:14] 10DBA, 10MediaWiki-Cache, 10Operations, 10Datacenter-Switchover-2018, and 3 others: parsercache used disk space increase - https://phabricator.wikimedia.org/T206740 (10Banyek) after cleaning up parsercache data yesterday on pc2004, I checked the disk usage of the 'parsercache' database and the binlogs. Th...
[07:47:40] I'm cleaning up the pc2005 host (and stopping the binlog purger there too)
[07:51:31] 10DBA, 10MediaWiki-Cache, 10Operations, 10Datacenter-Switchover-2018, and 3 others: parsercache used disk space increase - https://phabricator.wikimedia.org/T206740 (10Banyek)
[07:51:51] 10DBA, 10MediaWiki-Cache, 10Operations, 10Datacenter-Switchover-2018, and 3 others: parsercache used disk space increase - https://phabricator.wikimedia.org/T206740 (10Marostegui) Keep in mind that the disk space is always constant on the pc hosts: https://grafana.wikimedia.org/dashboard/file/server-board....
[07:57:04] 10DBA, 10MediaWiki-Cache, 10Operations, 10Datacenter-Switchover-2018, and 3 others: parsercache used disk space increase - https://phabricator.wikimedia.org/T206740 (10Banyek) the 'parsercache' database takes around 1.5T of data on pc1004; the binlogs are 7GB on pc1004, but the binlogs are cleaned up in every...
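The disk-usage figures quoted in the ticket comments above (schema size vs. binlog size on a parsercache host) can be reproduced with a couple of standard queries. A minimal sketch, assuming only that the schema is named `parsercache` as in the log:
```
-- Approximate on-disk footprint of the parsercache schema:
SELECT table_schema,
       ROUND(SUM(data_length + index_length) / 1024 / 1024 / 1024, 1) AS approx_gb
FROM information_schema.tables
WHERE table_schema = 'parsercache'
GROUP BY table_schema;

-- Binlog files currently kept on the host, with their sizes:
SHOW BINARY LOGS;
```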
[07:58:20] 10DBA, 10Operations, 10Datacenter-Switchover-2018, 10User-Joe: Evaluate the consequences of the parsercache being empty post-switchover - https://phabricator.wikimedia.org/T206841 (10Joe) 05Open>03Resolved p:05Triage>03High
[07:59:14] 10DBA, 10MediaWiki-Cache, 10Operations, 10Datacenter-Switchover-2018, and 3 others: parsercache used disk space increase - https://phabricator.wikimedia.org/T206740 (10Marostegui) >>! In T206740#4673427, @Banyek wrote: > the 'parsercache' database takes around 1.5T of data on pc1004 > the binlogs are 7GB on p...
[07:59:26] I double checked, pc1005 is NOT replicating from pc2005 (`show slave hosts` is empty on pc2005, and `show slave status` is empty on pc1005). I will proceed with truncating tables on pc2005
[07:59:35] ok
[08:00:08] 10DBA, 10MediaWiki-Cache, 10Operations, 10Datacenter-Switchover-2018, and 3 others: parsercache used disk space increase - https://phabricator.wikimedia.org/T206740 (10Banyek) ok, I stop the purgers there too.
[08:01:53] 10DBA, 10MediaWiki-Cache, 10Operations, 10Datacenter-Switchover-2018, and 3 others: parsercache used disk space increase - https://phabricator.wikimedia.org/T206740 (10Banyek) I double checked, pc1005 is NOT replicating from pc2005 (`show slave hosts` is empty on pc2005, and `show slave status` is empty on...
[08:12:09] 10DBA, 10MediaWiki-Cache, 10Operations, 10Datacenter-Switchover-2018, and 3 others: parsercache used disk space increase - https://phabricator.wikimedia.org/T206740 (10Banyek)
[08:23:43] I am going slower and being more careful on the master, but I am progressing faster
[08:24:07] great jynus!
[08:24:11] thanks!
[08:24:29] I am in a meeting with banyek, if you need anything just let me know and I will leave it
[08:24:48] I have fixed almost 50% of the tables, but that doesn't include the 2 largest ones
[08:27:16] jynus: this is great!
[08:27:24] jynus: and replication working like a charm in sanitarium too!
[08:28:23] that is broken still
[08:28:32] but it is less important
[08:28:35] at the moment
[08:28:41] is it broken?
[08:28:45] I don't see it on icinga
[08:28:50] the internal consistency
[08:28:54] it is not good
[08:28:57] ah
[08:29:02] I thought you meant replication
[08:29:04] ok ok :)
[08:36:25] the reason why I chose to do it without replication is that if I make a mistake I can recover
[08:36:36] as I am not touching the other hosts
[08:36:48] indeed!
[08:36:51] good idea :)
[08:37:01] even if it is more dangerous for replication
[08:37:26] it would "only" break sanitarium as the rest is statement
[08:37:27] no?
[08:39:01] anything can happen
[08:39:24] the fact that replication kept going was a special state because only data was deleted
[08:39:31] now I am adding it back
[08:39:45] if something I add is wrong and it gets modified just then
[08:39:48] things can break
[08:47:25] yeah
[08:47:43] I still find it unbelievable that in a month, nothing got modified that could have broken sanitarium
[08:48:05] it makes sense
[08:48:34] because the row based replication only protects against changes done directly by the row creator
[08:48:45] what do you mean?
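The verification quoted above boils down to proving that no replication runs in either direction before destroying data. A minimal sketch of that sequence; the `pc001` table name is only an illustration of the many sharded parsercache tables:
```
-- On the codfw host (e.g. pc2005): confirm nothing is replicating from it.
SHOW SLAVE HOSTS;   -- expected: empty result set

-- On the matching eqiad host (e.g. pc1005): confirm it is not a replica.
SHOW SLAVE STATUS;  -- expected: empty result set

-- Only then, back on the codfw host, reclaim the space table by table:
TRUNCATE TABLE parsercache.pc001;
```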
[08:48:51] however, by the time it reached db1087
[08:49:07] it had been "corrected" by db1071 in statement
[08:49:25] Yeah, but what I am saying is that nothing that got inserted in that 50-minute gap ever got modified again
[08:49:32] most of the traffic is insert
[08:49:34] If so, that would have broken replication on sanitarium
[08:49:36] and the only updates are on page
[08:49:43] which will never complain
[08:50:03] most modifications were just new pages/revisions/sections
[08:50:18] so there was only missing stuff
[08:50:18] I mean, it is true that a 50-minute window is not that big
[08:50:27] with very few updates
[08:50:39] enwiki would have broken for sure
[08:50:53] it happened early in the day
[08:51:17] what?
[08:51:21] however, once the row based replication was "fixed" it failed within a day
[08:51:26] early in the day?
[08:51:29] yes
[08:51:35] early in the day most edits are bots
[08:51:43] aaaah
[08:51:48] ok ok, I didn't understand what you meant
[08:51:50] yeah
[08:52:01] vs 9UTC when most humans start editing
[08:52:12] not that there are no human edits before
[08:52:17] it was a "small" window, on a wiki which is mostly bots and at a time of day that is even more only bots
[08:52:20] but there is a higher proportion at that time
[09:26:43] user_newtalk requires a primary key
[09:26:55] can any of you check if there is a pending ticket about that
[09:27:01] yes there is
[09:27:03] let me look for it
[09:27:07] it is affecting the speed of the fix
[09:27:12] not having it
[09:27:35] https://phabricator.wikimedia.org/T146585
[09:27:36] if you can, ping around to see if someone is working on it or something
[09:27:53] I pinged them https://phabricator.wikimedia.org/T146585#4014628 in March, ha!
[09:27:56] I will do that again
[09:29:58] I'll truncate the parsercache tables on pc2006 as well, and then we can close that ticket
[09:30:16] T206740
[09:30:19] T206740: parsercache used disk space increase - https://phabricator.wikimedia.org/T206740
[09:32:23] I double checked, pc1006 is NOT replicating from pc2006 (`show slave hosts` is empty on pc2006, and `show slave status` is empty on pc1006). I will proceed with truncating tables on pc2006
[09:32:41] great
[09:35:23] jynus: https://phabricator.wikimedia.org/T146585#4673612
[09:35:58] 10DBA, 10MediaWiki-Cache, 10Operations, 10Datacenter-Switchover-2018, and 3 others: parsercache used disk space increase - https://phabricator.wikimedia.org/T206740 (10Banyek)
[09:37:31] thanks, marostegui
[09:37:36] no, thank you
[10:56:57] 10DBA, 10MediaWiki-Cache, 10Operations, 10Datacenter-Switchover-2018, and 3 others: parsercache used disk space increase - https://phabricator.wikimedia.org/T206740 (10Banyek) All done, the parsercache tables are truncated on all the codfw hosts, and the disk usage of binlogs seems normalized in eqiad. Th...
[11:01:10] * banyek away for 20 minutes
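The user_newtalk point above is about a table that still lacks a primary key, with T146585 as the pending ticket. A sketch of how such tables can be listed from information_schema; the `wikidatawiki` schema name is only an example, and the actual key to add is whatever the task decides:
```
-- List base tables in a schema that have no PRIMARY KEY constraint:
SELECT t.table_schema, t.table_name
FROM information_schema.tables AS t
LEFT JOIN information_schema.table_constraints AS tc
  ON  tc.table_schema    = t.table_schema
  AND tc.table_name      = t.table_name
  AND tc.constraint_type = 'PRIMARY KEY'
WHERE t.table_type = 'BASE TABLE'
  AND t.table_schema = 'wikidatawiki'   -- example schema
  AND tc.constraint_name IS NULL;
```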
[11:11:40] hey, can I run the deleteLocalPassword again? is it okay if it causes lag on codfw?
[11:13:21] marostegui: jynus ^
[11:13:44] not on s8
[11:14:50] noted
[11:23:56] back
[11:24:34] while triaging the tickets I am not sure what to do with this: https://phabricator.wikimedia.org/T164382
[11:26:50] 10DBA, 10MediaWiki-Cache, 10Operations, 10Datacenter-Switchover-2018, and 3 others: parsercache used disk space increase - https://phabricator.wikimedia.org/T206740 (10Banyek) p:05High>03Lowest As the ticket is finished, and just kept open for a few days, there is no need to keep it at high priority
[11:43:49] I am on the last 3 tables, but they are the largest, and I have to check them in ranges
[11:47:22] FYI, as of now the change_tag table will stop growing as fast as it used to (around one percent of its original growth). I'm working to drop that column, which would make it really small, then it's probably possible to load it into memory (is that automated or do we need to do anything?) Tell me if queries get slow or there is unreasonable load
[11:48:02] 10DBA, 10Patch-For-Review, 10User-Banyek: Productionize db2096 on x1 - https://phabricator.wikimedia.org/T206593 (10Banyek) After checking the zarcillo database I believe the following queries will add the host there as well, however there are some blurry points for me: ``` INSERT INTO instances (name, serv...
[11:49:25] 10DBA, 10Patch-For-Review, 10User-Banyek: Productionize db2096 on x1 - https://phabricator.wikimedia.org/T206593 (10jcrespo) > @jcrespo it's ok to run these 3 inserts? Yes
[11:50:07] Amir1: MySQL keeps all recent data in the buffer pool, so if a small table is used often, it will be cached, but if it is only accessed rarely then the rows will be removed
[11:51:19] hmm, so it means the performance will get improved gradually
[11:51:55] If I understood your question: yes
[11:57:00] I'll add db2096 to zarcillo then, and close the ticket T206593
[11:57:01] T206593: Productionize db2096 on x1 - https://phabricator.wikimedia.org/T206593
[12:02:12] That's good. I hope we can measure it somehow
[12:06:23] I am not sure if this applies to mariadb, but maybe it's worth looking at: https://dev.mysql.com/doc/refman/5.7/en/innodb-information-schema-buffer-pool-tables.html
[12:06:48] he means the performance increase due to table normalization
[12:07:28] it helps; sadly, there are so many variables that it is difficult to measure (other than the number of hits on the buffer pool for the table)
[12:10:07] All tables on s8 master fixed except pagelinks and wb_terms
[12:10:11] will take a break now
[12:12:09] have a good rest
[12:12:34] Kudos to Jaime!
[12:24:07] as we were talking with M@nuel about the schema changes, I made a draft of the step-by-step execution. Did I miss something? https://docs.google.com/document/d/1UbMLLRrKx2hgFV53JCd94HBaZdESkO-Gu3h9QcHr6r0/edit
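Following the buffer-pool exchange and the INNODB_BUFFER_PAGE documentation linked above, this is one way the residency of a single table could be measured, assuming that information_schema table is available on the MariaDB version in use; the change_tag filter and the 16 KB page size are assumptions for illustration:
```
-- How much of a given table currently sits in the InnoDB buffer pool.
-- Note: scanning INNODB_BUFFER_PAGE is expensive, so avoid running this
-- frequently on busy production hosts.
SELECT table_name,
       COUNT(*)                       AS pages_in_buffer_pool,
       ROUND(COUNT(*) * 16 / 1024, 1) AS approx_mb
FROM information_schema.INNODB_BUFFER_PAGE
WHERE table_name LIKE '%change_tag%'
GROUP BY table_name;
```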
[12:27:21] 10DBA, 10Patch-For-Review, 10User-Banyek: Productionize db2096 on x1 - https://phabricator.wikimedia.org/T206593 (10Banyek) Merging this patch is also needed for enabling notifications https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/467722/
[12:27:38] I'll go now and check what needs to be done to have replication icinga checks on PC* hosts
[12:32:55] 10DBA, 10monitoring: Alert based on the hit ratio of the parsercache - https://phabricator.wikimedia.org/T207273 (10Banyek)
[13:05:22] 10DBA, 10Lexicographical data, 10Wikidata, 10Wikidata-Campsite, and 6 others: S8 replication issues leading to rows missing during eqiad -> codfw switch (Was: "A few lexemes disappeared") - https://phabricator.wikimedia.org/T206743 (10Marostegui) Update from Jaime at 12:10 UTC All tables on s8 master fixed...
[13:05:51] 10DBA, 10Lexicographical data, 10Wikidata, 10Wikidata-Campsite, and 6 others: S8 replication issues leading to rows missing during eqiad -> codfw switch (Was: "A few lexemes disappeared") - https://phabricator.wikimedia.org/T206743 (10Marostegui)
[13:18:49] 10DBA, 10MediaWiki-Cache, 10Operations, 10Datacenter-Switchover-2018, and 3 others: parsercache used disk space increase - https://phabricator.wikimedia.org/T206740 (10Marostegui) 05Open>03Resolved p:05Lowest>03Normal Let's close it and if something breaks we can reopen. Good job!
[13:19:58] 10DBA, 10monitoring: Alert based on the hit ratio of the parsercache - https://phabricator.wikimedia.org/T207273 (10Marostegui) p:05Triage>03Normal This needs some thought in order to make it effective: The hosts in the passive DC will always have a 0 hit ratio. What will happen if we depool a host? Do we...
[13:20:17] banyek: what is the experience of your first outage?
[13:20:50] well, maybe not outage, but incident
[13:21:12] first of all, you guys are professionals.
[13:21:38] I liked the way the quickfix/proper fix followed each other
[13:21:52] this is about s8
[13:22:00] I was more asking about pc
[13:22:01] the parsercache one was clearer to me
[13:22:06] as you handled that mostly on your own
[13:22:59] actually I enjoyed it - if I can say this about an incident 😂
[13:23:20] I learned a lot about that part of the environment
[13:24:41] I am not sure how you will evaluate what I've done, but now I am pretty sure that if there were another incident/outage on PC I'd be more confident finding the cause and the fix
[13:25:05] so, pc is a quite ugly part of the infra
[13:25:17] with lack of resources, no proper redundancy
[13:25:28] banyek: next time you can also do the IR
[13:25:33] the purchase that is being done as we speak was supposed to address that
[13:25:42] but we can only fix one issue at a time
[13:26:10] in fact, it is surprising, given there is no redundancy and a raid0, that we had no more incidents before
[13:26:18] jynus: don't jinx it!!!!
[13:26:38] the new hosts will be raid5 and we will have an extra host to have room for failure
[13:27:14] and disk performance is not a huge priority there
[13:28:15] sadly, because load balancing is at the moment purely code-based, there is not much room for rearchitecting (vs e.g. a proper load balancer service)
[13:30:16] etcd will solve a part of this I suppose
[13:32:56] not really
[13:33:09] I have to leave for a while (kindergarten stuff).
[13:33:09] After I am back we'll finish the (de)pooling of labsdb hosts with Brooke, and I am planning to check Jaime's gerrit
[13:33:09] patch of the socket directory around roles and datacenters.
[13:33:09] (Also I'll check if I can do anything with the PC replication lag monitor - there's a patch prepared, but I don't want to put it into production until I am sure it won't generate false alarms)
[13:33:19] etcd just allows agreeing on a configuration in a distributed way
[13:33:41] but if the sharding function and keys are stored in the code, we cannot really do much
[13:33:47] banyek: yeah, run the compiler for the pc replication alert
[13:34:41] jynus: I hear you
[13:39:02] 10DBA, 10MediaWiki-extensions-WikibaseRepository, 10Wikidata, 10wikidata-tech-focus: wikibase: synchronize schema on production with what is created on install - https://phabricator.wikimedia.org/T85414 (10Addshore)
[13:44:28] tomorrow we should talk with b*nyek about descriptions on tickets T207273
[13:44:29] T207273: Alert based on the hit ratio of the parsercache - https://phabricator.wikimedia.org/T207273
[13:46:06] yep
[13:53:42] checking row 200 million of 2380 million https://upload.wikimedia.org/wikipedia/en/3/3d/WaitCursor-300p.gif
[13:54:07] hahah
[13:54:21] I would have expected a youtube video
[13:54:25] But I guess you are too tired
[14:46:13] <_joe_> jynus: where are you now? 400 million?
[14:46:24] <_joe_> sounds like one of those super boring idle games :P
[14:48:35] 500000000 on one table, 4000000 on the other
[14:48:49] but the 40M one has many rows per id
[14:50:15] back
[15:18:30] 10DBA, 10Analytics, 10Analytics-Kanban, 10Growth-Team, and 2 others: Purge all Schema:Echo data after 90 days - https://phabricator.wikimedia.org/T128623 (10elukey) Tables dropped with Marcel on db110[7,8] (eventlogging master/slave). Marcel checked and nothing is there on HDFS. The above code change has...
[15:18:55] 10DBA, 10Analytics, 10Analytics-Kanban, 10Growth-Team, and 2 others: Purge all Schema:Echo data after 90 days - https://phabricator.wikimedia.org/T128623 (10elukey)
[15:25:23] I checked the nagios check for pc* hosts and it will work (I checked the catalog compiler and the nrpe check itself); if anyone +1s it I can merge
[15:35:51] 10DBA, 10Patch-For-Review, 10User-Banyek: Productionize db2096 on x1 - https://phabricator.wikimedia.org/T206593 (10Banyek) 05Open>03Resolved
[15:36:08] db2096 is now in production, I closed the task T206593
[15:36:08] T206593: Productionize db2096 on x1 - https://phabricator.wikimedia.org/T206593
[15:43:02] did you check prometheus monitoring works?
[16:14:10] 10DBA, 10Patch-For-Review, 10User-Banyek, 10Wikimedia-Incident: Create replication icinga check for the Parsercache hosts - https://phabricator.wikimedia.org/T206992 (10Banyek) The check is ready to be deployed
[16:31:29] 10DBA, 10Patch-For-Review, 10User-Banyek, 10Wikimedia-Incident: Create replication icinga check for the Parsercache hosts - https://phabricator.wikimedia.org/T206992 (10Marostegui) Where is the compiler run link?
[16:48:25] 10DBA, 10Patch-For-Review, 10User-Banyek, 10Wikimedia-Incident: Create replication icinga check for the Parsercache hosts - https://phabricator.wikimedia.org/T206992 (10Banyek) I made them earlier: https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/13001/console https://integratio...
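The "checking row 200 million of 2380 million" progress above refers to comparing the largest tables in id ranges rather than in one pass. A minimal sketch of such a per-range comparison: run the same query on both hosts for a given range and diff the output. The column names are assumed from the wb_items_per_site schema and the bounds are arbitrary examples:
```
-- Cheap per-range fingerprint of a table chunk; compare the two numbers
-- between hosts for the same id range.
SELECT COUNT(*) AS row_count,
       SUM(CRC32(CONCAT_WS('#', ips_row_id, ips_item_id,
                                ips_site_id, ips_site_page))) AS range_checksum
FROM wb_items_per_site
WHERE ips_row_id BETWEEN 200000000 AND 200999999;
```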
[16:53:09] 10DBA, 10Patch-For-Review, 10User-Banyek, 10Wikimedia-Incident: Create replication icinga check for the Parsercache hosts - https://phabricator.wikimedia.org/T206992 (10Marostegui) >>! In T206992#4674780, @Banyek wrote: > I made them earlier: > > https://integration.wikimedia.org/ci/job/operations-puppet-...
[17:26:21] 10DBA, 10SDC Engineering, 10Wikidata, 10Core Platform Team (MCR), and 5 others: Deploy MCR storage layer - https://phabricator.wikimedia.org/T174044 (10CCicalese_WMF)
[17:36:41] 10DBA, 10Patch-For-Review, 10User-Banyek, 10Wikimedia-Incident: Create replication icinga check for the Parsercache hosts - https://phabricator.wikimedia.org/T206992 (10Banyek) updated the gerrit patch then
[19:03:31] all tables except wb_terms, which is half done, should be equal on the s8 master
[19:03:46] (although more checks may be needed)
[19:04:26] I am going to stop now because it is affecting performance: https://grafana.wikimedia.org/dashboard/db/mysql?orgId=1&var-dc=eqiad%20prometheus%2Fops&var-server=db1071&var-port=9104&from=1539716652708&to=1539803052708
[19:12:36] it works: https://grafana.wikimedia.org/dashboard/db/mysql?orgId=1&var-dc=codfw%20prometheus%2Fops&var-server=db2096&var-port=9104
[21:22:10] jynus: marostegui: sorry for bothering you, but I don't know if we need to do something about the broken replication on db1124 (s8). I know Jaime worked on that, but I can't find any message about whether it might break
[21:23:25] (there was no paging about it)
[21:27:04] 10DBA, 10Lexicographical data, 10Wikidata, 10Wikidata-Campsite, and 6 others: S8 replication issues leading to rows missing during eqiad -> codfw switch (Was: "A few lexemes disappeared") - https://phabricator.wikimedia.org/T206743 (10Banyek) on db1124 with instance s8 we have a replication error as ``` La...
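The db1124 breakage reported above is the expected one from the morning ("it needs the same treatment as db1087"). A minimal sketch of the first diagnostic step once connected to the s8 instance on that multi-instance host; the field names are standard SHOW SLAVE STATUS output:
```
-- Inspect the broken s8 replication thread on db1124:
SHOW SLAVE STATUS\G
-- The fields of interest are Slave_IO_Running, Slave_SQL_Running and
-- Last_SQL_Error, which should identify the row the failing event needs.
-- Per the earlier discussion, the fix is the same treatment given to db1087:
-- re-inserting the missing row(s) locally.
```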