[05:20:28] DBA, Operations, netops, ops-eqiad, Patch-For-Review: db1114 connection issues - https://phabricator.wikimedia.org/T191996#4135199 (Marostegui) There is definitely an impact on which kernel we are running. After running: ``` root@db1114:~# uname -a Linux db1114 4.9.0-4-amd64 #1 SMP Debian 4.9...
[07:47:07] DBA, Operations, netops, ops-eqiad, Patch-For-Review: db1114 connection issues - https://phabricator.wikimedia.org/T191996#4135346 (Marostegui)
[07:55:57] so I think the extra purging fixed that, but slowed down replication
[09:32:56] hello, two things: I'm still running the maintenance script to delete things from logging table everywhere. tell me if I should not do it in any shard/wiki since you might be doing ALTERs (I already added it to the calendar) 2- I made some changes on the isolated db you gave me, the alter table to drop that column took 1d and 6h. Also made some changes that might make the table a little bit (but not very much) smaller, do you want to take a look and let me know?
[09:34:20] Amir1: Can you avoid s8? I should be finished with it tomorrow
[09:39:39] T191996 is an interesting issue. So the tl;dr is: high dropped packets rate w/ linux 4.9.82, better w/ linux 4.9.65?
[09:39:39] T191996: db1114 connection issues - https://phabricator.wikimedia.org/T191996
[09:40:47] ema: not anymore
[09:41:11] ema: after the last reboot with the original kernel 4.9.82 I am seeing no errors now for the last few hours XD
[09:41:17] I am starting to think it is the cable or something
[09:41:26] mmh
[09:41:28] I am going to ask chris to replace it today
[09:41:37] https://phabricator.wikimedia.org/P7000 <- here's the diff
[09:41:51] I am trying to read the code for this: mbuf_lwm_thresh_hit
[09:41:57] that rx_mcast_packets seems something to look into
[09:42:11] it has been up only for 4 hours
[09:42:25] let's give it some time, but so far it is behaving better than yesterday (with that same kernel)
[09:42:42] ema: what's that eth0 from?
[09:42:53] db1080
[09:43:12] okay
[09:43:16] let's give it some more hours
[09:43:24] but so far it is behaving very well
[09:43:34] sounds good
[09:43:42] 1 error since it was rebooted and fully repooled
[09:43:47] which is really strange
[09:43:54] (with the original new kernel)
[09:44:10] which would point to it being the reboot and not the kernel?
[09:44:16] it's curious how db1080 has been running for a week (rx_mcast_packets: 3080583), vs db1114 up for 4 hours (rx_mcast_packets: 44203532)
[09:44:27] jynus: I am thinking about the cable (or NIC) at this point
[09:44:30] loose cable or something
[09:44:59] jynus: cause yesterday after the reboot with the other kernel (the older one) it was better, but still erroring (a lot less): https://phabricator.wikimedia.org/T191996#4135199
[09:45:25] but so far, it is behaving really well: https://grafana.wikimedia.org/dashboard/db/mysql?orgId=1&from=now-6h&to=now&var-dc=eqiad%20prometheus%2Fops&var-server=db1114&var-port=9104&panelId=10&fullscreen
[09:45:37] fully repooled at 6:03
[10:12:29] marostegui: the s8 one is on puppet, do you want me to stop it?
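Note on the rx_mcast_packets comparison at 09:44:16: the two counters cover very different uptimes, so they are easier to compare as rates. The following is a minimal sketch, assuming the counters reset at boot and taking "a week" as 168 hours and "4 hours" literally; both uptimes are approximations from the chat, and the function name is made up for illustration.

```python
def packets_per_hour(counter: int, uptime_hours: float) -> float:
    """Average packets per hour for a counter that started at zero at boot."""
    if uptime_hours <= 0:
        raise ValueError("uptime_hours must be positive")
    return counter / uptime_hours


# Figures quoted in the chat; uptimes are rough ("a week", "4 hours").
db1080_rate = packets_per_hour(3080583, 7 * 24)
db1114_rate = packets_per_hour(44203532, 4)

# How many times more multicast packets per hour db1114 receives.
ratio = db1114_rate / db1080_rate

print(f"db1080: {db1080_rate:.0f} rx_mcast_packets/hour")
print(f"db1114: {db1114_rate:.0f} rx_mcast_packets/hour")
print(f"db1114 receives roughly {ratio:.0f}x the multicast rate of db1080")
```

Under those assumptions db1114 sees multicast traffic at several hundred times the hourly rate of db1080, which is why the counter stood out in the diff.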
[10:47:58] DBA, Wikidata: Optimize recentchanges and wbc_entity_usage table across wikis - https://phabricator.wikimedia.org/T187521#4135653 (Ladsgroup) hmm, I can see why. Mostly jobqueue stuff. Can you file a Phabricator ticket so we can pick it up and do something about it?
[10:54:16] DBA, Wikidata: Optimize recentchanges and wbc_entity_usage table across wikis - https://phabricator.wikimedia.org/T187521#3977606 (jcrespo) Done as T192349
[10:55:18] DBA, Wikidata: Optimize recentchanges and wbc_entity_usage table across wikis - https://phabricator.wikimedia.org/T187521#4135671 (Ladsgroup) Thanks!
[11:30:51] have a look at my client wrapper on neodymium when you can, I may install it on all servers if it is helpful
[11:31:20] I cannot reimage es1017 yet, dumps are still running there after 1 day depooled
[12:29:05] DBA: Framework to transfer files over the LAN - https://phabricator.wikimedia.org/T156462#4135858 (jcrespo)
[12:29:07] DBA, Wikimedia-Incident: Improve regular production database backups handling - https://phabricator.wikimedia.org/T138562#4135857 (jcrespo)
[12:32:22] DBA: Setup database logical backups on eqiad - https://phabricator.wikimedia.org/T192358#4135874 (jcrespo) p:Triage>Normal
[12:32:54] DBA, Patch-For-Review: Setup database logical backups on eqiad - https://phabricator.wikimedia.org/T192358#4135891 (jcrespo) I am going to setup s1 on dbstore1001.
[12:56:33] DBA: Framework to transfer files over the LAN - https://phabricator.wikimedia.org/T156462#4135933 (jcrespo) @Rduran Do you think you can take care of this? There is a prototype at https://gerrit.wikimedia.org/r/280947 but all the other Remote Calling methods should be dropped and use instead cumin ( https://...
[12:57:30] DBA: Framework to transfer files over the LAN - https://phabricator.wikimedia.org/T156462#4135937 (jcrespo)
[12:58:23] jynus: +1 if you want to install it everywhere
[12:59:33] I may have to add methods of WMFMariaDB into it so it is self-contained
[13:00:03] then add it with the proper dependencies to wmfmariadbpy repository
[13:00:32] also I may change WMFMariaDB.py to accept the hostname:port method everywhere
[14:41:44] DBA, Operations, netops, ops-eqiad, Patch-For-Review: db1114 connection issues - https://phabricator.wikimedia.org/T191996#4136190 (Marostegui) Cable has been replaced by @Cmjohnson just now.
[14:42:42] DBA, Operations, ops-codfw, Patch-For-Review: Move masters away from codfw C6 - https://phabricator.wikimedia.org/T191193#4136191 (Papaul) switch port information when ready to move db2042. db2042 was on asw-c6-codfw ge-6/0/9 and now will be on asw-d3-codfw ge-3/0/10 new ip address will be :...
[14:43:31] DBA, Operations, ops-codfw, Patch-For-Review: Move masters away from codfw C6 - https://phabricator.wikimedia.org/T191193#4136194 (Marostegui) @ayounsi can you configure asw-d3-codfw ge-3/0/10 for us? We want to move db2042 to that port Thanks!
[14:52:46] DBA, Operations, ops-codfw, Patch-For-Review: Move masters away from codfw C6 - https://phabricator.wikimedia.org/T191193#4136223 (ayounsi) asw-d-codfw-ge-3/0/10 now in private1-d-codfw. Let me know when to disable asw-c6-codfw:ge-6/0/9
[14:55:37] Dear beloved DBAs :) , is there an official procedure documented somewhere to ask for a new database schema for a new misc service?
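Note on the hostname:port idea at 13:00:32: a rough sketch of what accepting a single `hostname:port` string could look like. The helper name `parse_host_port` is hypothetical and is not taken from WMFMariaDB.py; falling back to 3306 (the standard MySQL/MariaDB port) when no port is given is also an assumption, and IPv6 literals are not handled.

```python
from typing import Tuple


def parse_host_port(target: str, default_port: int = 3306) -> Tuple[str, int]:
    """Split 'host' or 'host:port' into (host, port).

    Hypothetical helper for illustration only; not part of WMFMariaDB.py.
    """
    host, sep, port = target.rpartition(":")
    if not sep:
        # No colon at all: the whole string is the hostname.
        return target, default_port
    if not host or not port.isdigit():
        raise ValueError(f"invalid host:port target: {target!r}")
    return host, int(port)


print(parse_host_port("db1114"))
print(parse_host_port("dbstore1001:3311"))
```

Keeping the parsing in one place would let every entry point accept the same string form, which seems to be the point of the proposal. The port 3311 above is only an example value.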
[14:56:34] DBA, Operations, hardware-requests, ops-codfw: Decommission db2012 - https://phabricator.wikimedia.org/T187543#4136240 (Papaul)
[14:58:55] DBA, Operations, netops, ops-eqiad, Patch-For-Review: db1114 connection issues - https://phabricator.wikimedia.org/T191996#4136265 (Marostegui)
[15:01:31] volans: create a ticket, just say what you need and estimate the needs (especially QPS and disk)
[15:01:59] ack, thanks
[15:09:00] DBA, Operations, netops, ops-eqiad, Patch-For-Review: db1114 connection issues - https://phabricator.wikimedia.org/T191996#4136295 (Marostegui)
[15:14:44] DBA, Operations, hardware-requests, ops-codfw: Decommission db2012 - https://phabricator.wikimedia.org/T187543#4136339 (Papaul)
[15:19:10] DBA, Operations, ops-codfw, Patch-For-Review: Move masters away from codfw C6 - https://phabricator.wikimedia.org/T191193#4136423 (Papaul) switch port information when ready to move db2048. db2048 was on asw-c6-codfw ge-6/0/17 and now will be on asw-a1-codfw ge-1/0/0 new ip address will be : 10...
[15:19:42] DBA, Operations, ops-codfw, Patch-For-Review: Move masters away from codfw C6 - https://phabricator.wikimedia.org/T191193#4136425 (Papaul)
[15:21:19] DBA, Operations, ops-codfw, Patch-For-Review: Move masters away from codfw C6 - https://phabricator.wikimedia.org/T191193#4096579 (Papaul) Moved db2042 from c6 to d3 in racktables
[15:31:43] DBA, Operations, hardware-requests, ops-codfw, Patch-For-Review: Decommission db2030 - https://phabricator.wikimedia.org/T187768#4136495 (Papaul)
[15:31:56] DBA, Operations, ops-codfw, Patch-For-Review: Move masters away from codfw C6 - https://phabricator.wikimedia.org/T191193#4136497 (ayounsi) >>! In T191193#4136423, @Papaul wrote: > switch port information when ready to move db2048. > db2048 was on asw-c6-codfw ge-6/0/17 and now will be on asw-a1-...
[15:33:41] if you have some schema changes pending on db1067, tell me and I can do them before the import on dbstore1001
[15:33:55] jynus: nope, all good there :)
[15:33:56] thanks
[15:34:03] done already?
[15:34:05] cool
[15:46:23] DBA, Operations, ops-codfw, Patch-For-Review: Move masters away from codfw C6 - https://phabricator.wikimedia.org/T191193#4136548 (Papaul)
[15:47:35] DBA, Operations, Epic: DB meta task for next DC failover issues - https://phabricator.wikimedia.org/T189107#4136553 (Marostegui)
[15:47:42] DBA, MediaWiki-Platform-Team (MWPT-Q4-Apr-Jun-2018), Patch-For-Review, Schema-change: Fix WMF schemas to not break when comment store goes WRITE_NEW - https://phabricator.wikimedia.org/T187089#4136550 (Marostegui) Open>stalled So this task is now stalled. Only the primary masters in eqiad...
[15:49:53] DBA, Operations, ops-codfw, Patch-For-Review: Move masters away from codfw C6 - https://phabricator.wikimedia.org/T191193#4136557 (Papaul) a:Papaul>Marostegui Moved db2048 from C6 to A1 in racktables @Marostegui assigning the tasks back to you if you think everything looks good you can...
[15:50:41] DBA, Operations, hardware-requests, ops-codfw, Patch-For-Review: Decommission db2030 - https://phabricator.wikimedia.org/T187768#4136561 (Papaul) a:Papaul>RobH @Robh everything done on my side only switch port left. Thanks
[15:51:33] DBA, Operations, hardware-requests, ops-codfw, Patch-For-Review: Decommission db2030 - https://phabricator.wikimedia.org/T187768#4136566 (Marostegui) Adding @ayounsi as @RobH is away.
[15:53:17] DBA, Operations, hardware-requests, ops-codfw: Decommission db2012 - https://phabricator.wikimedia.org/T187543#4136568 (Papaul) switch port information db2012 asw-a6-codfw ge-6/0/11
[15:54:22] DBA, Operations, ops-codfw, Patch-For-Review: Move masters away from codfw C6 - https://phabricator.wikimedia.org/T191193#4136571 (Marostegui) a:Marostegui>ayounsi Thanks @Papaul!! I have talked to @ayounsi and he will clean up the ports and close the task when ready
[15:56:35] DBA, Operations, hardware-requests, ops-codfw: Decommission db2012 - https://phabricator.wikimedia.org/T187543#4136590 (Marostegui) Adding @ayounsi as @RobH is away
[15:57:10] so I am importing dbstore1001 with an s1 instance so we do work ahead of time for setting up eqiad backups
[15:57:30] will you import some other section or just s1?
[15:57:42] right now s1, see how it goes
[15:58:03] the first one is the complicated one :-)
[15:58:22] I intend to clone from clean hosts, mostly the candidates
[15:58:36] pre-compress, etc.
[15:58:43] cool, that sounds very good
[15:58:55] so when we have the hardware we have everything prepared, only have to move things
[15:58:58] and we can start monitoring dbstore1001 and see how the storage is
[15:59:04] not sure if I will fill io
[15:59:15] *as much as dbstore2001/2
[15:59:28] because it is also the target of backups
[15:59:44] so I don't think we can put 5-6 instances
[16:00:04] but we have 11 TB, we should at least start to do something
[16:37:02] DBA, Operations, hardware-requests, ops-codfw, Patch-For-Review: Decommission db2030 - https://phabricator.wikimedia.org/T187768#4136714 (ayounsi)
[16:37:53] DBA, Operations, ops-codfw, Patch-For-Review: Replace db2030 from m5 with another host (WAS: Degraded RAID on db2030) - https://phabricator.wikimedia.org/T187722#4136726 (ayounsi)
[16:37:56] DBA, Goal, Patch-For-Review: Decommission database hosts <= db2031 (tracking) - https://phabricator.wikimedia.org/T176243#4136727 (ayounsi)
[16:38:00] DBA, Operations, hardware-requests, ops-codfw, Patch-For-Review: Decommission db2030 - https://phabricator.wikimedia.org/T187768#3984964 (ayounsi) Open>Resolved Switch port cleaned up.
[16:40:42] DBA, Operations, hardware-requests, ops-codfw: Decommission db2012 - https://phabricator.wikimedia.org/T187543#4136735 (ayounsi) Switch port cleaned up.
[16:40:56] DBA, Operations, hardware-requests, ops-codfw: Decommission db2012 - https://phabricator.wikimedia.org/T187543#4136746 (ayounsi)
[16:49:20] DBA, Operations, hardware-requests, ops-codfw: Decommission db2012 - https://phabricator.wikimedia.org/T187543#4136752 (Marostegui) Thanks @ayounsi, so only pending the mgmt dns entries!
[16:53:27] DBA, Operations, ops-codfw, Patch-For-Review: Move masters away from codfw C6 - https://phabricator.wikimedia.org/T191193#4136764 (ayounsi) Open>Resolved asw-a1-codfw ge-1/0/0 cleaned up asw-c6-codfw ge-6/0/9 cleaned up
[18:31:28] marostegui: any idea what's up with s1 (https://logstash.wikimedia.org/app/kibana#/dashboard/default?_g=h@e3739c2&_a=h@af1efc2) ?
[18:34:18] AaronSchulz: That URL doesn't work for me
[18:36:36] I see db2048 is not responding
[18:36:44] That host was moved to a different rack a few hours ago
[18:36:49] Let me see if it went down for some reason
[18:37:02] db2048 is codfw master
[18:37:03] for s1
[18:38:38] it was basically << +channel:DBReplication +shard:s1 >>
[18:47:18] AaronSchulz: should be fixed now. The wrong switch port was cleaned up here: https://phabricator.wikimedia.org/T191193#4136497 so the host lost its network
[18:47:24] Arzhel rolled back and it is now catching up
[18:56:00] I still see lots of log spam for codfw ("Wikimedia\Rdbms\LoadBalancer::pickReaderIndex: all replica DBs lagged. Switch to read-only mode")
[18:58:45] db2048 is still catching up
[18:59:01] (and its replicas too)
[21:39:19] DBA, Operations, ops-codfw, Patch-For-Review: Move masters away from codfw C6 - https://phabricator.wikimedia.org/T191193#4137759 (jcrespo) Resolved>Open