[04:52:49] DBA, Operations, ops-codfw: pc2010 possibly broken memory - https://phabricator.wikimedia.org/T227552 (Marostegui) Open→Resolved No OS or idrac errors since the memory was replaced, so I am closing this as resolved. If it happens again, I will re-open. Thanks @Papaul!
[05:29:33] DBA, Goal, Patch-For-Review: Productionize db21[21-30] - https://phabricator.wikimedia.org/T228969 (Marostegui)
[05:58:42] DBA: Update rack information on zarcillo.servers - https://phabricator.wikimedia.org/T229683 (Marostegui) Open→Resolved This is now fixed: ` root@db1115.eqiad.wmnet[zarcillo]> select fqdn,rack from servers where rack is NULL and fqdn like 'db%'; Empty set (0.00 sec) `
[08:22:03] marostegui: hey, I deployed a change on June 27th that reduced a crazy number of "INSERT IGNORE"s that are not needed (those rows were there already). Don't they show up in the rows-written metric? Because I couldn't find any big difference in the graphs while it dropped tens of thousands of writes like that per second :(
[08:22:46] Amir1: if the rows were there already, I don't think they will show up in any metric
[08:23:01] your only hope could be the INSERT metric on the master-specific dashboard
[08:23:20] like: https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-dc=eqiad%20prometheus%2Fops&var-server=db1067&var-port=9104&from=now-24h&to=now&refresh=5s&panelId=2&fullscreen
[08:23:22] that is s1 master
[08:23:23] it should reduce the deadlocks
[08:25:07] marostegui: yeah, it's actually visible. The deployment was around 20:00 https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-dc=eqiad%20prometheus%2Fops&var-server=db1067&var-port=9104&from=1561623823497&to=1561710238504&panelId=2&fullscreen
[08:31:47] Amir1: yeah, only noticeable if you know about it, it would not have been noticeable to me
[08:32:21] true :(
[08:32:36] there are too many ups and downs for me to notice :)
[08:34:08] Yeah, I'll just stick to the reduction in deadlocks for now :P
[08:34:50] that's already awesome :)
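An aside on the pattern behind the change discussed above: instead of unconditionally issuing INSERT IGNORE for rows that almost always already exist, the application reads first and only writes the rows that are actually missing, trading a cheap indexed read for a write and its row locks. A minimal sketch of that pattern, using a hypothetical table and columns rather than the real MediaWiki/Wikibase schema:

```sql
-- Sketch only: the "check before INSERT IGNORE" pattern discussed above.
-- Table and column names are hypothetical, not the actual schema.

-- Before: every request writes unconditionally, even when the row exists.
INSERT IGNORE INTO term_cache (term_id, term_text)
VALUES (12345, 'example');

-- After: read first, and only insert when the row is genuinely missing.
-- The SELECT is a cheap primary-key lookup and takes no row locks.
SELECT term_id FROM term_cache WHERE term_id = 12345;
-- (the application issues the INSERT IGNORE only when the SELECT
--  returned no row)
```

The cost is an extra SELECT per request; since the rows are nearly always present, the net effect is the drop in writes and deadlocks described in the conversation above.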
[08:39:12] DBA, Operations, decommission: Decommission db2035 - https://phabricator.wikimedia.org/T229784 (Marostegui)
[08:39:47] DBA, Operations, decommission: Decommission db2035 - https://phabricator.wikimedia.org/T229784 (Marostegui) p: Triage→Normal
[08:40:09] DBA, DC-Ops, Operations, decommission, and 2 others: Decommission db2035 - https://phabricator.wikimedia.org/T229784 (Marostegui)
[08:44:08] DBA, DC-Ops, Operations, decommission, and 2 others: Decommission db2035 - https://phabricator.wikimedia.org/T229784 (Marostegui)
[08:45:50] DBA, Patch-For-Review: Decommission old coredb machines (<=db2042) - https://phabricator.wikimedia.org/T221533 (Marostegui)
[09:13:05] DBA, Operations, ops-eqiad: Upgrade db1100 firmware and BIOS - https://phabricator.wikimedia.org/T228732 (Marostegui) @Cmjohnson are we still good for tomorrow at 14:00 UTC? I will have the host depooled and off for you before 14:00 UTC
[10:22:30] DBA, Operations: Remove sarin and neodymium GRANTs from all the databases - https://phabricator.wikimedia.org/T229796 (Marostegui)
[10:22:48] DBA, Operations: Remove sarin and neodymium GRANTs from all the databases - https://phabricator.wikimedia.org/T229796 (Marostegui) p: Triage→Normal
[10:24:03] DBA, Operations: Remove sarin and neodymium GRANTs from all the databases - https://phabricator.wikimedia.org/T229796 (Marostegui)
[10:24:21] DBA, Operations: Remove sarin and neodymium GRANTs from all the databases - https://phabricator.wikimedia.org/T229796 (Marostegui)
[10:24:36] DBA, Operations: Remove sarin and neodymium GRANTs from all the databases - https://phabricator.wikimedia.org/T229796 (Marostegui)
[10:42:02] DBA, Operations, ops-eqiad: Upgrade db1100 firmware and BIOS - https://phabricator.wikimedia.org/T228732 (Cmjohnson) @marostegui yes, still good for tomorrow at 14:00 UTC
[10:42:50] DBA, Operations, ops-eqiad: Upgrade db1100 firmware and BIOS - https://phabricator.wikimedia.org/T228732 (Marostegui) Excellent - thank you!
[11:21:08] DBA, Math: Remove table `math` from the database - https://phabricator.wikimedia.org/T196055 (Marostegui) @Physikerwelt there have been no errors or writes to the tables after T196055#5352527 and T196055#5384177 - I am thinking about starting to drop it everywhere this week.
[11:45:05] marostegui: Actually, bytes received had a dip when it got deployed: https://grafana.wikimedia.org/d/000000278/mysql-aggregated?panelId=2&fullscreen&orgId=1&from=1561624886431&to=1561722086432
[11:45:14] (Around 20:00)
[11:45:24] Amir1: and why did it recover again?
[11:45:51] It didn't recover
[11:46:26] sorry, my eyes went to the 3:00 dip
[11:46:27] haha
[11:46:31] I guess a natural reaction :)
[11:46:32] it was around 90 MB/s and after the deployment it went to 80 MB/s :D
[11:46:39] haha, it's fine
[11:46:58] indeed, it never went back to previous values
[11:46:59] nice
[11:56:31] DBA, Math: Remove table `math` from the database - https://phabricator.wikimedia.org/T196055 (Physikerwelt) @Marostegui that's good news. Go ahead. BTW. I checked the code again and found some leftovers https://gerrit.wikimedia.org/r/#/c/mediawiki/extensions/Math/+/524592/ if you have some spare time may...
[12:00:41] I measured and it seems it decreased traffic to the DB by 4%, or 3 MB/s
[12:02:19] marostegui: I was planning to investigate the spikes but they are gone now: https://grafana.wikimedia.org/d/000000548/wikibase-wb_terms?refresh=30s&orgId=1&from=now-7d&to=now
[12:02:41] Amir1: do you feel confident enabling the change from last week again?
[12:03:07] yeah, especially since it's Monday
[12:03:23] sure, the patterns are quite clear and, as we saw, they don't take long to show up
[12:03:39] These external requests should have been addressed by blocking or throttling them
[12:04:35] marostegui: so let's do it then?
[12:04:42] Amir1: sure
[12:04:58] Amir1: if we see them again, shall we revert, or do you want to give it some hours to try to investigate?
[12:05:27] A couple of hours would be great
[12:05:35] sure
[12:06:28] Revert "Revert "Revert "Revert "Switch property terms migration to WRITE_NEW on production wikidata"""" :D
[12:06:37] hahahaha
[12:31:46] DBA, Math: Remove table `math` from the database - https://phabricator.wikimedia.org/T196055 (Marostegui) Thanks for double checking! Regarding that patch, I can take a look but I am not familiar with the code, so I am afraid I won't be too helpful there :(
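On the `math` table removal (T196055): the task amounts to confirming the table is no longer read or written, then dropping it on every wiki. The sketch below shows one conservative per-wiki approach; the `enwiki` database name is a placeholder and the rename-before-drop step is a generic safety pattern, not necessarily the exact procedure used here:

```sql
-- Sketch only: a conservative way to retire an unused table on one wiki.
-- 'enwiki' is a placeholder database; the rename step is a generic safety
-- pattern, not necessarily what was done for T196055.

-- 1. Rough sanity check that nothing has touched the table recently.
--    (update_time can be NULL or stale for InnoDB, so this complements the
--    error/write monitoring mentioned in the task, it does not replace it.)
SELECT table_rows, update_time
  FROM information_schema.tables
 WHERE table_schema = 'enwiki' AND table_name = 'math';

-- 2. Rename it out of the way first, so any leftover code path fails
--    loudly instead of silently reading stale data.
RENAME TABLE enwiki.math TO enwiki.math_T196055_to_drop;

-- 3. After a quiet period with no new errors, drop it for good.
DROP TABLE enwiki.math_T196055_to_drop;
```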
[14:04:40] marostegui: so the graphs look okay
[14:04:57] After I enable reading new for clients, it should get a little bit better
[14:08:05] Amir1: yeah, no spikes on db1104 :)
[14:08:38] we got the first spikes and so far it looks fine
[14:08:42] let's give it some more hours
[14:09:00] it increased, but I think that's sorta okay because 1) we are reading more rows 2) we are in the middle of the migration; when we stop writing to and reading from the old system, things should get better
[14:09:23] yeah, I was referring to the spikes we saw last week
[14:09:43] it did cause locks, but it recovered
[14:09:46] I will enable it for clients tomorrow and, if there are no issues, set it to read_new a week after
[14:10:31] Amir1: and that will stop reading from the old system?
[14:10:47] marostegui: for properties only
[14:11:05] right
[14:11:15] very small amount of data, huge amount of reads
[14:11:25] so we should see a decrease?
[14:12:10] probably
[14:12:16] nice
[14:14:09] The most important part is migrating items. I started the process in the beta cluster already. It seems okay, as the code is virtually the same
[14:15:24] How long do you think that could take?
[14:16:37] I hope we will start in two weeks, but I guess running the migration script would take months
[14:16:45] at least a month
[14:17:25] yeah, that's what I had in mind, that it will take quiiiite long
[14:17:44] but the good thing is that we can basically switch things off gradually, e.g. after 10% is done, we can stop writing to wb_terms for those 10% of items
[14:17:59] that would stop the table from growing
[16:23:37] DBA, Operations, wikitech.wikimedia.org, cloud-services-team (Kanban): Switchover m5 primary master: db1073 to db1133 - https://phabricator.wikimedia.org/T229657 (Marostegui) As per the sync on the SRE meeting, @JHedden will be online from WMCS. I will handle the announcement for wikitech, could...
[16:26:16] DBA, Goal: Address Database infrastructure blockers on datacenter switchover & multi-dc deployment - https://phabricator.wikimedia.org/T220170 (Papaul)
[16:26:54] DBA, Operations, ops-codfw: (2019-08-31) rack/setup/install db2131.codfw.wmnet - https://phabricator.wikimedia.org/T229251 (Papaul)
[16:28:49] DBA, Operations, wikitech.wikimedia.org, cloud-services-team (Kanban): Switchover m5 primary master: db1073 to db1133 - https://phabricator.wikimedia.org/T229657 (JHedden) >>! In T229657#5393428, @Marostegui wrote: > As per the sync on the SRE meeting, @JHedden will be online from WMCS. > I will... 
[16:29:12] DBA, Operations, wikitech.wikimedia.org, cloud-services-team (Kanban): Switchover m5 primary master: db1073 to db1133 - https://phabricator.wikimedia.org/T229657 (Marostegui) Thanks!
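For the m5 switchover (T229657), the database side boils down to promoting db1133 and repointing replication away from db1073; the production process is scripted and also covers proxy, DNS and configuration changes. A rough sketch of the underlying generic MariaDB steps only, with binlog coordinates as placeholders:

```sql
-- Sketch only: generic MariaDB steps behind a primary switchover such as
-- db1073 -> db1133. The real process is scripted and also handles
-- proxy/DNS/puppet changes; this is not the exact production runbook.

-- On the old primary (db1073): stop accepting writes and record position.
SET GLOBAL read_only = 1;
SHOW MASTER STATUS;

-- On the new primary (db1133): wait until replication has caught up
-- (Seconds_Behind_Master = 0), then detach it and open it for writes.
SHOW SLAVE STATUS\G
STOP SLAVE;
RESET SLAVE ALL;
SET GLOBAL read_only = 0;

-- On each remaining replica: repoint it at the new primary, using the
-- coordinates captured above (placeholders here).
STOP SLAVE;
CHANGE MASTER TO
  MASTER_HOST = 'db1133.eqiad.wmnet',
  MASTER_LOG_FILE = 'db1133-bin.000001',   -- placeholder
  MASTER_LOG_POS  = 4;                     -- placeholder
START SLAVE;
```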