[00:02:22] 10DBA, 10Operations, 10ops-codfw, 10Patch-For-Review: rack/setup/install db2093 (WAS: rack/setup/install tendril2001) - https://phabricator.wikimedia.org/T186123#3987795 (10Dzahn) host 10.192.48.91 91.48.192.10.in-addr.arpa domain name pointer db2093.codfw.wmnet. host db2093.codfw.wmnet db2093.codfw.wmnet...
[00:10:08] 10DBA, 10Operations, 10ops-codfw, 10Patch-For-Review: rack/setup/install db2093 (WAS: rack/setup/install tendril2001) - https://phabricator.wikimedia.org/T186123#3987813 (10Dzahn) @Papaul prod IP added, renamed in DHCP, partman doesn't have to be changed. you can now go ahead with the OS install
[00:51:31] 10DBA, 10MediaWiki-General-or-Unknown, 10Operations, 10MW-1.31-release-notes (WMF-deploy-2018-02-20 (1.31.0-wmf.22)), and 2 others: Regularly purge expired temporary userrights from DB tables - https://phabricator.wikimedia.org/T176754#3987931 (10greg) Hi, sorry, my bugmail backlog is woefully long right n...
[01:31:31] 10DBA, 10Operations, 10ops-codfw, 10Patch-For-Review: rack/setup/install db2093 (WAS: rack/setup/install tendril2001) - https://phabricator.wikimedia.org/T186123#3987999 (10Papaul)
[01:34:16] 10DBA, 10MediaWiki-General-or-Unknown, 10Operations, 10MW-1.31-release-notes (WMF-deploy-2018-02-20 (1.31.0-wmf.22)), and 2 others: Regularly purge expired temporary userrights from DB tables - https://phabricator.wikimedia.org/T176754#3988002 (10EddieGP) >>! In T176754#3987931, @greg wrote: > Hi, sorry, m...
[04:04:34] 10DBA, 10Operations, 10ops-codfw, 10Patch-For-Review: rack/setup/install db2093 (WAS: rack/setup/install tendril2001) - https://phabricator.wikimedia.org/T186123#3988175 (10Papaul)
[04:05:38] 10DBA, 10Operations, 10ops-codfw, 10Patch-For-Review: rack/setup/install db2093 (WAS: rack/setup/install tendril2001) - https://phabricator.wikimedia.org/T186123#3934359 (10Papaul) a:05Papaul>03Marostegui @Marostegui it is all yours. Installation complete.
[06:23:51] 10DBA, 10Operations, 10ops-codfw, 10Patch-For-Review: Replace db2030 from m5 with another host (WAS: Degraded RAID on db2030) - https://phabricator.wikimedia.org/T187722#3988250 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by marostegui on neodymium.eqiad.wmnet for hosts: ``` db2037.codfw.wm...
[06:48:09] 10DBA, 10Operations, 10ops-codfw, 10Patch-For-Review: Replace db2030 from m5 with another host (WAS: Degraded RAID on db2030) - https://phabricator.wikimedia.org/T187722#3988266 (10ops-monitoring-bot) Completed auto-reimage of hosts: ``` ['db2037.codfw.wmnet'] ``` and were **ALL** successful.
[06:59:11] 10DBA, 10Operations, 10ops-codfw, 10Patch-For-Review: Replace db2030 from m5 with another host (WAS: Degraded RAID on db2030) - https://phabricator.wikimedia.org/T187722#3988268 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by marostegui on neodymium.eqiad.wmnet for hosts: ``` db2037.codfw.wm...
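Editor's note: the [00:02:22] comment above quotes the forward and reverse DNS records being verified for db2093 before the OS install. Below is a minimal, hedged sketch of that kind of check; the hostname and IP are taken from the log, while the comparison logic and the idea of running it from an arbitrary host with `host` installed are assumptions, not the exact procedure used.

```bash
# Verify that forward and reverse DNS agree for the new host before installing.
# Hostname/IP come from the log above; the "match" check is illustrative only.
fqdn="db2093.codfw.wmnet"
ip="10.192.48.91"

host "$fqdn"   # expect: db2093.codfw.wmnet has address 10.192.48.91
host "$ip"     # expect: ...in-addr.arpa domain name pointer db2093.codfw.wmnet.

# Simple consistency check: the PTR record should point back at the FQDN.
ptr=$(host "$ip" | awk '{print $NF}' | sed 's/\.$//')
[ "$ptr" = "$fqdn" ] && echo "forward/reverse DNS match" || echo "DNS mismatch!"
```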
[07:01:04] 10DBA, 10Patch-For-Review: Decommission db2030 - https://phabricator.wikimedia.org/T187768#3988269 (10Marostegui)
[07:01:32] 10DBA, 10Patch-For-Review: Decommission db2030 - https://phabricator.wikimedia.org/T187768#3984964 (10Marostegui) This host is now set to spare, but as puppet cannot run (FS is corrupted), the new role will never get applied :-)
[07:36:49] 10DBA, 10Operations, 10ops-codfw, 10Patch-For-Review: Replace db2030 from m5 with another host (WAS: Degraded RAID on db2030) - https://phabricator.wikimedia.org/T187722#3988286 (10Marostegui) I am trying to check what's wrong with db2037, as it is showing: ``` [ 52.315934] blk_update_request: critical...
[07:37:16] 10DBA, 10Operations, 10ops-codfw, 10Patch-For-Review: Replace db2030 from m5 with another host (WAS: Degraded RAID on db2030) - https://phabricator.wikimedia.org/T187722#3988287 (10ops-monitoring-bot) Completed auto-reimage of hosts: ``` ['db2037.codfw.wmnet'] ``` Of which those **FAILED**: ``` ['db2037.c...
[07:37:44] 10DBA, 10Operations, 10ops-codfw, 10Patch-For-Review: Replace db2030 from m5 with another host (WAS: Degraded RAID on db2030) - https://phabricator.wikimedia.org/T187722#3988288 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by marostegui on neodymium.eqiad.wmnet for hosts: ``` db2037.codfw.wm...
[07:52:11] 10DBA, 10Operations, 10ops-codfw, 10Patch-For-Review: Replace db2030 from m5 with another host (WAS: Degraded RAID on db2030) - https://phabricator.wikimedia.org/T187722#3988292 (10Marostegui) a:05Marostegui>03Papaul And ILO isn't working any more, so the PXE cannot be set. ``` root@neodymium:~# ipmito...
[07:53:12] 10DBA, 10Operations, 10ops-codfw, 10Patch-For-Review: Replace db2030 from m5 with another host (WAS: Degraded RAID on db2030) - https://phabricator.wikimedia.org/T187722#3988294 (10Marostegui) HW logs show nothing by the way
[08:11:07] 10DBA, 10Operations, 10ops-codfw, 10Patch-For-Review: Replace db2030 from m5 with another host (WAS: Degraded RAID on db2030) - https://phabricator.wikimedia.org/T187722#3988323 (10Marostegui) I have managed to get the system up after fixing a few i-nodes: ``` root@db2037:~# touch test root@db2037:~# ```...
[08:18:41] 10DBA, 10Operations, 10ops-codfw, 10Patch-For-Review: Replace db2030 from m5 with another host (WAS: Degraded RAID on db2030) - https://phabricator.wikimedia.org/T187722#3988328 (10ops-monitoring-bot) Completed auto-reimage of hosts: ``` ['db2037.codfw.wmnet'] ``` Of which those **FAILED**: ``` ['db2037.c...
[08:19:38] 10DBA, 10Operations, 10ops-codfw, 10Patch-For-Review: Replace db2030 from m5 with another host (WAS: Degraded RAID on db2030) - https://phabricator.wikimedia.org/T187722#3988329 (10Marostegui) I am unable to reimage the server due to the PXE thing I described at: T187722#3988292 The system looks fine, so f...
[09:39:15] should we wait or do you want me to reassign db2044 to m5?
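Editor's note: at [07:52:11] the reimage of db2037 is blocked because iLO/IPMI is unresponsive, so one-time PXE boot cannot be set. As a rough illustration of what that step normally looks like, here is a hedged ipmitool sketch; the management hostname, user and password file are placeholders, and the real workflow at WMF is driven by wmf-auto-reimage rather than run by hand.

```bash
# Hypothetical example of the PXE step that was failing; mgmt host and
# credentials are placeholders, not the actual tooling used.
MGMT="db2037.mgmt.codfw.wmnet"    # assumption: management interface name
IPMI_USER="root"                  # placeholder
IPMI_PASS_FILE="/root/.ipmi_pass" # placeholder

# Check that the BMC answers at all (this is the part that was broken here).
ipmitool -I lanplus -H "$MGMT" -U "$IPMI_USER" -f "$IPMI_PASS_FILE" chassis power status

# If it does, request PXE for the next boot only, then power-cycle the box.
ipmitool -I lanplus -H "$MGMT" -U "$IPMI_USER" -f "$IPMI_PASS_FILE" chassis bootdev pxe
ipmitool -I lanplus -H "$MGMT" -U "$IPMI_USER" -f "$IPMI_PASS_FILE" chassis power cycle
```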
[09:39:45] no, continue with m2 for it
[09:40:04] I am importing data now into db2037, we will see what papaul says later today
[09:40:48] it is also a good test to see if writing to the disk makes it crash or something
[09:40:56] so let's continue with the original plan for now
[09:41:35] 10DBA, 10Operations, 10Patch-For-Review: Setup newer machines and replace all old misc (m*) and x1 codfw machines - https://phabricator.wikimedia.org/T183470#3988386 (10jcrespo)
[09:41:37] 10DBA, 10Patch-For-Review: Productionize 22 new codfw database servers - https://phabricator.wikimedia.org/T170662#3988385 (10jcrespo)
[09:42:25] let me propose something- I do not want to have m5 without redundancy, let recover the backup and setup redundancy on db2078
[09:42:33] unpuppetized
[09:42:46] and later we can just copy to the final place
[09:42:55] *let me
[09:43:16] also that way I can create a new backup
[09:43:22] the copy on db2037 is now half way I think
[09:43:29] but yeah, feel free to do it on db2078 too if you like
[09:43:35] oh, it worked?
[09:43:42] I thought it was broken
[09:43:55] that is why I said it
[09:44:12] ah sorry I thought you read: https://phabricator.wikimedia.org/T187722#3988323
[09:44:21] sorry I wasn't aware you were lacking some context :)
[09:44:23] read from there
[09:44:35] I just read from T187722#3988329 on
[09:44:36] T187722: Replace db2030 from m5 with another host (WAS: Degraded RAID on db2030) - https://phabricator.wikimedia.org/T187722
[09:44:46] so it seemed reimage didn't even start
[09:45:27] I want papaul to fix the idrac (I guess a power drain will work) and double check the disks' led
[09:45:38] but meantime, as the system looks fixed, I am importing the data
[09:45:48] ok, I didn't get that part
[09:46:16] "I am unable to reimage the server" seemed like you were stuck
[09:46:51] ah yeah, that comes from the fact that I wanted to reimage another time to see if the system would come back fine without any inode issue
[09:46:57] It was kinda like another test
[09:47:13] but doing some intensive data import will also work as a test for the disks
[09:47:22] ok, I will then do db2044
[09:47:38] yep, continue with the plan unless db2037 dies forever I would say
[09:47:44] remember to put a filter on test* dbs
[09:47:58] I am not sure if they are actively being written
[09:48:13] but they are explicitly removed from the backups
[09:48:13] the big big one, we are not even dumping it
[09:48:16] yeah
[09:48:27] big one?
[09:48:35] well, the biggest, 250G
[09:48:43] which one?
[09:48:52] testreduce_0715
[09:49:09] that is not being written since 2017
[09:49:21] I don't know, they told me not to back it up
[09:49:41] some tables are being written but not all
[09:50:23] I will exclude testreduce_vd and testreduce_0715 from replication
[09:50:31] is that what you meant?
[09:50:35] yes
[09:50:45] we are on the same page then :)
[09:51:03] I will then do db2044 (m2)
[09:51:20] cool
[09:57:28] what do you think of https://gerrit.wikimedia.org/r/#/c/412994/ ?
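Editor's note: the exchange between [09:47:44] and [09:50:45] is about excluding the testreduce databases from replication on the new m5 host. Below is a minimal sketch of one way to express such a filter; the database names come from the log, but the config-file path and the choice of `replicate_wild_ignore_table` are assumptions, not the change that was actually deployed.

```bash
# Sketch only: add replication filters so the testreduce databases are not
# replicated. Database names are from the discussion; the file path is a guess.
cat >> /etc/mysql/conf.d/replication-filters.cnf <<'EOF'
[mysqld]
# Exclude the test databases that are also excluded from backups.
replicate_wild_ignore_table = testreduce_vd.%
replicate_wild_ignore_table = testreduce_0715.%
EOF
# The filter takes effect on the replica after restarting mysqld; recent
# MariaDB can also set these variables dynamically with the SQL thread stopped.
```

`replicate_wild_ignore_table = db.%` is usually preferred over `replicate_ignore_db` because it matches on the table actually being written rather than on the client's current default database.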
[09:58:08] specifically the core-test role
[09:59:09] (that all will be simplified when core is moved to a profile)
[09:59:20] but at least it now has a separate name
[10:00:29] I will check in a sec
[10:04:38] I think you have a problem on your commit
[10:04:57] you added db2037 to m5, but didn't remove it from s4
[10:05:14] https://gerrit.wikimedia.org/r/#/c/413103/5/manifests/site.pp
[10:06:10] ops
[10:06:13] I will do right now
[10:06:34] thanks for checking
[10:07:00] I didn't check
[10:07:16] I was confused why it didn't have a conflict
[10:07:29] then I checked the "before" on my own diff
[10:15:42] so my change is now https://gerrit.wikimedia.org/r/412994
[10:16:06] I just commented on it :)
[10:19:16] I think there is a mismatch between prometheus configuration and site.pp
[10:19:50] ah no, it is ok
[10:20:00] 5 hosts for s4 on codfw
[10:20:16] and we should add a large one next
[10:20:52] marostegui: I don't know if your comment is a -1, and if it is, what to do different, or just a FYI
[10:21:24] 10DBA, 10Operations, 10ops-codfw, 10Patch-For-Review: Replace db2030 from m5 with another host (WAS: Degraded RAID on db2030) - https://phabricator.wikimedia.org/T187722#3988479 (10Marostegui) m5 is now replicating on db2037. I will leave notifications disable till we do the tests with @Papaul when he has...
[10:21:32] core and core-test should work at first exactly the same, so I don't anticipate a problem for now
[10:21:33] A -1? I did +1 :)
[10:21:40] no, I mean the comment
[10:21:58] ok, I see it now
[10:22:19] so, I will do extra fixes later
[10:22:49] I fear that sharing the role, while the right thing to do when setup, it would be confusing later
[10:23:02] yeah, agreed
[10:23:24] when roles become a skeleton, it will make more sense- it will only be 2 lines, and one with different monitoring
[10:24:01] jenkins complains about style, but those will become fixed later when things are migrated
[10:24:23] an when old hosts are decomm
[10:24:30] yep, just +2 it manually I would say
[10:25:02] when the replica has catched up, ping me and I will run a backup
[10:25:48] it is already up to date, I added the comment when the replica caught up :)
[10:27:39] oh
[10:27:43] so fast
[10:28:12] I will then change the dns for the slave right away
[10:28:30] I am patching the proxy
[10:34:28] 10DBA, 10Patch-For-Review: Decommission db2030 - https://phabricator.wikimedia.org/T187768#3988491 (10Marostegui)
[10:35:55] s5 backup now happening, that should also help validating by reading all data
[10:36:40] awesome
[10:36:59] writes worked fine, we sill see if the reads crash it or something :)
[10:37:18] enable gtid, in case it is put down inproperly
[10:37:23] I did already
[10:37:24] maybe you already did
[10:37:27] cool
[10:38:01] at some point in the future, we can reimport the test* db, but very low priority
[10:38:10] 10DBA, 10Operations, 10hardware-requests, 10ops-codfw, 10Patch-For-Review: Decommission db2030 - https://phabricator.wikimedia.org/T187768#3988496 (10Marostegui) a:05Marostegui>03RobH Assigning it directly to @robh so he can finish up with this (please let me know if you prefer another way of letting...
[10:38:21] no backups means no guarantee of being there
[10:39:42] check also nagios and grafana, sometimes grants needs to be reloadade on very old hosts
[10:39:59] it happened to me a few times
[10:40:11] I checked earlier and it was working fine
[10:40:30] coold
[10:40:45] are you cold?
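Editor's note: at [10:37:18] the advice is to enable GTID on the new replica in case it is shut down uncleanly (and the reply is that it was already done). A minimal sketch of that step on a MariaDB replica is below, assuming it is run locally on the replica with credentials omitted; this is an illustration, not the exact commands that were used.

```bash
# Hedged sketch of switching a MariaDB replica to GTID-based replication.
mysql -e "STOP SLAVE;"
mysql -e "CHANGE MASTER TO MASTER_USE_GTID = slave_pos;"
mysql -e "START SLAVE;"
# Confirm the replica is now using GTID and note its position.
mysql -e "SHOW SLAVE STATUS\G" | grep -E 'Using_Gtid|Gtid_IO_Pos'
```

With `slave_pos` the replica resumes from its recorded GTID position after a restart, which is exactly the "put down improperly" case being guarded against here.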
[10:40:49] :(
[10:41:06] just trying to be useful with things I broke in the past
[10:41:16] you are :)
[10:41:28] i faced that grants issue too in the past :(
[10:41:58] I think I put on the roadmap some grant stuff
[10:42:13] also I would like to convert nagios and prometheus to passwordless
[10:42:28] but needs help from alex and filippo
[10:43:04] 10DBA, 10Operations, 10Patch-For-Review: Setup newer machines and replace all old misc (m*) and x1 codfw machines - https://phabricator.wikimedia.org/T183470#3988511 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by jynus on sarin.codfw.wmnet for hosts: ``` ['db2044.codfw.wmnet'] ``` The log can...
[10:47:45] so with 30, 11 and 12 replaced that would finish the goal for codfw
[10:48:23] leaving I think just 4 on eqiad
[11:10:45] 10DBA, 10Operations, 10Patch-For-Review: Setup newer machines and replace all old misc (m*) and x1 codfw machines - https://phabricator.wikimedia.org/T183470#3988605 (10ops-monitoring-bot) Completed auto-reimage of hosts: ``` ['db2044.codfw.wmnet'] ``` and were **ALL** successful.
[12:11:34] 10DBA, 10Patch-For-Review: Run pt-table-checksum on s1 (enwiki) - https://phabricator.wikimedia.org/T162807#3988795 (10Marostegui) watchlist table is done. Next: ores_classification
[12:20:55] 10DBA, 10Data-Services: Re-institute query killer for the analytics WikiReplica - https://phabricator.wikimedia.org/T183983#3988871 (10Marostegui) labsdb1009 has killed a couple of legit queries, so I have applied the same patch to labsdb1010 and labsdb1011 and leave it running to get a higher sample of querie...
[12:34:37] 10DBA, 10Patch-For-Review: Prepare and indicate proper master db failover candidates for all database sections (s1-s8, x1) - https://phabricator.wikimedia.org/T186321#3988898 (10Marostegui)
[12:44:21] 10DBA, 10Patch-For-Review: Prepare and indicate proper master db failover candidates for all database sections (s1-s8, x1) - https://phabricator.wikimedia.org/T186321#3988925 (10Marostegui) For s8 I propose db1104 (there are only large servers there). So I propose that one because it is in a different row.
[13:38:07] m2 codfw with stretch should be working now
[14:45:18] 10DBA, 10Operations: Decommission db2011 - https://phabricator.wikimedia.org/T187886#3989185 (10jcrespo) p:05Triage>03Normal
[15:11:35] 10DBA, 10Operations, 10ops-eqiad, 10Patch-For-Review: Rack and setup db1115 (tendril replacement database) - https://phabricator.wikimedia.org/T185788#3989328 (10Cmjohnson)
[15:12:05] 10DBA, 10Operations, 10ops-eqiad, 10Patch-For-Review: Rack and setup db1115 (tendril replacement database) - https://phabricator.wikimedia.org/T185788#3925325 (10Cmjohnson) @jcrespo and @Marostegui This is all yours. Please resolve once verified. Thanks!
[15:27:45] 10DBA, 10Operations, 10netops, 10ops-codfw: switch port configuration for tendril2001 - https://phabricator.wikimedia.org/T186172#3989413 (10Marostegui) Can this be resolved then?
[15:28:18] 10DBA, 10Operations, 10Patch-For-Review: Setup newer machines and replace all old misc (m*) and x1 codfw machines - https://phabricator.wikimedia.org/T183470#3989416 (10jcrespo) I would say this is resolved- only pending the actual decommission (tracked on separate tickets), and the setup of extra servers fo...
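Editor's note: the [12:11:34] update refers to the ongoing pt-table-checksum run on s1 (watchlist done, ores_classification next). A hedged sketch of what checking a single table can look like is below; only the database and table names come from the task update, while the host, user and limit options are placeholders rather than the options actually used for the s1 run.

```bash
# Illustrative invocation only -- host/user/limits are placeholders.
pt-table-checksum \
  --databases=enwiki --tables=ores_classification \
  --replicate=percona.checksums \
  --chunk-time=0.5 --max-lag=5 \
  --no-check-binlog-format \
  h=s1-master.example,u=checksum_user,p=********
# Differences per replica can later be summarised with --replicate-check-only
# against the same --replicate table.
```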
[15:29:07] 10DBA, 10Operations, 10ops-codfw, 10Patch-For-Review: rack/setup/install db2093 (WAS: rack/setup/install tendril2001) - https://phabricator.wikimedia.org/T186123#3989417 (10Marostegui)
[15:29:27] 10DBA, 10Operations, 10ops-codfw, 10Patch-For-Review: rack/setup/install db2093 (WAS: rack/setup/install tendril2001) - https://phabricator.wikimedia.org/T186123#3934359 (10Marostegui) 05Open>03Resolved Thanks @papaul! This looks good. We can continue the setup at T184704
[15:29:36] I would like to resolve https://phabricator.wikimedia.org/T183470#3989416
[15:30:09] Are db2011 and db2012 done already?
[15:30:15] yes
[15:30:23] Then I think we are done!
[15:30:38] 10DBA, 10Goal, 10Patch-For-Review: Decommission database hosts <= db2031 (tracking) - https://phabricator.wikimedia.org/T176243#3989426 (10jcrespo)
[15:30:42] 10DBA, 10Patch-For-Review: Productionize 22 new codfw database servers - https://phabricator.wikimedia.org/T170662#3989427 (10jcrespo)
[15:30:46] 10DBA, 10Operations, 10Patch-For-Review: Setup newer machines and replace all old misc (m*) and x1 codfw machines - https://phabricator.wikimedia.org/T183470#3989424 (10jcrespo) 05Open>03Resolved a:03jcrespo
[15:30:55] https://phabricator.wikimedia.org/T187543
[15:31:06] https://phabricator.wikimedia.org/T187886
[15:31:20] 10DBA, 10Operations, 10ops-eqiad, 10Patch-For-Review: Rack and setup db1115 (tendril replacement database) - https://phabricator.wikimedia.org/T185788#3989429 (10Marostegui) 05Open>03Resolved Thanks @Cmjohnson the host looks good. We can continue the service setup at T184704
[15:31:33] 10DBA, 10Operations, 10ops-eqiad, 10Patch-For-Review: Rack and setup db1115 (tendril replacement database) - https://phabricator.wikimedia.org/T185788#3989434 (10Marostegui)
[15:32:29] 10DBA: Setup tendril database monitoring on 2 new hosts, one on eqiad and one on codfw - https://phabricator.wikimedia.org/T184704#3989438 (10jcrespo)
[15:32:32] 10DBA, 10Operations, 10Patch-For-Review: Setup newer machines and replace all old misc (m*) and x1 eqiad machines - https://phabricator.wikimedia.org/T183469#3989439 (10jcrespo)
[15:33:57] 10DBA, 10Operations, 10Patch-For-Review: Setup newer machines and replace all old misc (m*) and x1 eqiad machines - https://phabricator.wikimedia.org/T183469#3854189 (10jcrespo)
[15:34:29] 10DBA, 10Operations, 10Patch-For-Review: Setup newer machines and replace all old misc (m*) and x1 eqiad machines - https://phabricator.wikimedia.org/T183469#3854189 (10jcrespo)
[15:35:22] 10DBA, 10Operations, 10netops, 10ops-codfw: switch port configuration for tendril2001 - https://phabricator.wikimedia.org/T186172#3989446 (10Papaul) Yes we can resolve this
[15:35:53] 10DBA, 10Operations, 10netops, 10ops-codfw: switch port configuration for db2093 - https://phabricator.wikimedia.org/T186172#3989448 (10Papaul)
[15:37:35] 10DBA, 10Operations, 10ops-codfw, 10Patch-For-Review: rack/setup/install db2093 (WAS: rack/setup/install tendril2001) - https://phabricator.wikimedia.org/T186123#3989465 (10Marostegui)
[15:37:40] 10DBA, 10Operations, 10netops, 10ops-codfw: switch port configuration for db2093 - https://phabricator.wikimedia.org/T186172#3989463 (10Marostegui) 05Open>03Resolved Thanks!
[15:51:01] not sure if you've seen this yet, Icinga warns about predictive HP RAID failure on db2048: "WARNING: Slot 0: OK: 1I:1:2, 1I:1:3, 1I:1:4, 1I:1:5, 1I:1:6, 1I:1:7, 1I:1:8, 1I:1:9, 1I:1:10, 1I:1:11, 1I:1:12 - Predictive Failure: 1I:1:1 - Controller: OK - Battery/Capacitor: OK"
[15:51:11] (only started 55 mins ago)
[15:51:41] Yeah, I saw it it will most likely fail and we will get the task
[15:51:47] I didn't create a preventive task this time
[15:51:51] Thanks moritzm :-)
[15:51:56] didn't db2048 failed recently?
[15:52:08] a disk, I mean
[15:52:17] yes
[15:52:21] https://phabricator.wikimedia.org/T187419
[15:52:24] T159666
[15:52:25] T159666: Degraded RAID on db2048 - https://phabricator.wikimedia.org/T159666
[15:52:30] T159666
[15:52:39] T159849
[15:52:39] And the one I pasted as well, is more recent :)
[15:52:39] T159849: Degraded RAID on db2048 - https://phabricator.wikimedia.org/T159849
[15:52:53] T187419
[15:52:53] T187419: Degraded RAID on db2048 - https://phabricator.wikimedia.org/T187419
[15:53:03] T187328
[15:53:04] T187328: db2048: RAID with predictive failure - https://phabricator.wikimedia.org/T187328
[15:53:23] It is always the first slot apparently
[15:53:27] motherboard issue?
[15:53:44] Ah not always, but the more recent ones
[15:56:05] maybe the last one was not a new disk
[15:56:24] yeah, could be
[15:58:27] are you planning to work on tendril?
[15:58:41] if you do, I will look at labsdb1010
[15:58:57] yeah, I was planning to take it
[15:59:00] (tendril)
[15:59:09] I would like to give a change to mariabackup
[15:59:14] *chance
[15:59:18] oh that'd be nice!
[15:59:30] tendril would be a good candidate
[15:59:41] should the wort thing happens, it is not as bad
[16:00:01] it is not on path, because it wasn't stable
[16:00:28] but you will find it on /opt/wmf-mariadb101/bin , I think
[16:01:04] although now that I think it, it won't work for tokudb
[16:01:56] will you merge https://gerrit.wikimedia.org/r/#/c/412678/?
[16:02:16] I will not start working on tendril today though
[16:02:32] I will logoff soon, so I wasn't planning on it today
[16:03:07] that doesn't mean you cannot start if you want to, of course. Don't know if you will be around for a bit more :)
[16:03:08] just saying
[16:14:54] is the os installed already?
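Editor's note: around [15:59:09] the idea of trying mariabackup on the tendril host comes up, with the caveats that the binary lives in /opt/wmf-mariadb101/bin and that it cannot handle TokuDB. A minimal, hedged sketch of a basic backup/prepare cycle is below; the binary path comes from the conversation, while the target directory and user are placeholders.

```bash
# Hedged sketch of a basic mariabackup cycle; target dir and user are placeholders.
BACKUP_DIR=/srv/backups/tendril-$(date +%F)

/opt/wmf-mariadb101/bin/mariabackup --backup \
    --user=root --target-dir="$BACKUP_DIR"

# Apply the redo log so the copy is consistent and ready to be restored.
/opt/wmf-mariadb101/bin/mariabackup --prepare --target-dir="$BACKUP_DIR"

# Per the discussion above: this covers InnoDB data, but the TokuDB tables
# used by tendril are not handled by mariabackup.
```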
[16:15:01] yep
[16:15:05] on both
[16:15:53] let me see how much stuff there is on the db
[16:16:05] I am worried about the events and replication
[16:16:56] it is 800GB
[16:16:59] 900GB
[16:18:10] I forgot it was toku…:|
[16:18:11] I wonder if we should start from 0 instead
[16:18:21] or
[16:18:35] even drop the graph data
[16:18:47] we don't really use it at all (at least i don't)
[16:19:04] that is what it takes most of the space
[16:19:10] +1 to drop it
[16:19:44] we could drop + mydumper to db1115 and then play with mariadbacbkup for db2093 if we want later
[16:19:44] alternatively, we stop tendril and copy it, think later
[16:19:55] haha, yeah, that too
[16:20:14] it shouldn't take more than 1-2 hours, maybe less
[16:20:29] Those disks aren't ssd, remember that
[16:20:45] but yeah, maybe 2-3 hours or so, not a lot more
[16:21:33] if we do it now, we can have it back up by the mediawiki deployment
[16:22:12] I don't want to slack or throw it at your backlog, but I wasn't planning to be around for a lot longer, as I started quite early today :)
[16:22:20] yes, it is ok
[16:22:25] I can do it tomorrow morning (early morning as I start early)
[16:22:26] in fact it is a good thing
[16:22:39] because it means it won't affect you
[16:22:53] hehe that is true :)
[16:22:56] but
[16:23:03] maybe it doesn't go back up
[16:23:11] yeah, that is my fear
[16:23:14] Buuuut
[16:23:20] stopping + start mysql shouldn't be a problem
[16:23:27] because after all, db1011 is crashing quite often
[16:23:30] don't be so confident
[16:23:33] the mysqld process
[16:23:40] things that should work hadn't in the past
[16:23:47] I would let you do it
[16:23:54] it takes 2-3 hours
[16:24:08] if it doesn't start back up, we do crisis meeting toghether
[16:24:18] as by that time I will be around
[16:24:27] and no monitoring in the morning will not be as bad
[16:25:05] coolio
[16:25:09] I can do it tomorrow then
[16:25:15] I can spend time trying to "fix" tendril
[16:25:25] will you merge the change today?
[16:25:30] probably not
[16:25:35] don't know
[16:25:42] if I do it now, I can break things
[16:25:52] if I don't do it, it will be a waste of a copy
[16:25:59] I can merge it tomorrow morning
[16:26:04] Stopping puppet on db1011
[16:26:04] maybe deploy after the copy
[16:26:10] that too, yeah
[16:26:23] that looks safer indeed
[16:26:30] I know it is a waste
[16:26:39] no, it is not, because the data will be copied already
[16:26:41] but my biggest fear
[16:26:47] is the host or disk
[16:26:50] not the software
[16:27:00] once it is fully copied and seems working/replicating
[16:27:08] we can restart the whole server
[16:27:21] sounds like a plan?
[16:27:40] I will try to do all preparations
[16:28:04] check how events work, crons, etc.
[16:28:16] ˜/jynus 17:27> once it is fully copied and seems working/replicating -> you mean the new ones?
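Editor's note: the plan sketched around [16:19:44] is to drop the graph data and take a logical mydumper copy of the rest of tendril before the physical move. Here is a rough sketch of what such a dump could look like; the schema name is from the conversation, but the source host, output path, and the table-name pattern used to skip graph data are assumptions (the log does not name the graph tables).

```bash
# Sketch only: logical dump of tendril that skips tables whose names contain
# "graph" (assumed naming). Host and output dir are placeholders.
TENDRIL_HOST="tendril-host.example"   # placeholder source host

mydumper \
  --host="$TENDRIL_HOST" \
  --database=tendril \
  --regex='^tendril\.(?!.*graph)' \
  --threads=4 --compress \
  --outputdir=/srv/tmp/tendril-logical-$(date +%F)
```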
[16:28:20] yes
[16:28:31] they don't even have mariadb package installed
[16:28:39] basically the proposal is not touch db1011 at all
[16:28:46] until we have a full copy
[16:29:13] then let's do a mydumper now from db1115
[16:29:17] and leave it there
[16:29:26] if it is just in case
[16:29:27] I have one
[16:29:34] I can do another on dbstore2001
[16:29:44] I can take care of that
[16:29:51] I am lost with the plan :)
[16:30:07] so, I do a logical copy of the non-graph parts
[16:30:34] tomorrow, you put the mysql down, and copy everything to the new eqiad host
[16:30:53] ah ok ok I thought you wanted to do one or the other
[16:30:55] gotcha
[16:31:00] once the copy is done, we reboot/upgrade/etc
[16:31:08] and I can merge changes, if I have something
[16:31:13] yeah
[16:31:18] my other fear
[16:31:20] But
[16:31:22] is that the copy itslef
[16:31:28] could also bring it down
[16:31:30] db1115 has no mariadb package installed or anything as it doesn't have any role
[16:31:43] yeah, we can do at at any time
[16:31:55] I just do not want to have puppet disabled the whole day
[16:31:58] yeah
[16:32:04] you can do that before the copy start
[16:32:09] I will install the package tomorrow
[16:32:10] disable puppet and merge?
[16:32:10] yeah, that
[16:32:17] et.c
[16:32:28] I will do today all "preparations"
[16:32:36] Yeah, that is what I suggested, I can merge that change after stopping puppet on db1011
[16:32:43] the thing with the logical copy
[16:32:54] is that even it is ok to have it
[16:33:04] it will take too much time to be recovered
[16:33:18] there is nothing we can do about that really
[16:33:32] yeah, but if we do both
[16:33:38] it should take less time
[16:33:44] both what?
[16:33:47] I do the logical one on dbstore now
[16:33:51] yeah
[16:33:56] and tomorrow you do a proper cloning physically
[16:34:04] Yes, that was the plan
[16:34:22] ok, so we agree, right? now go awat!
[16:34:25] haha
[16:34:26] yeah
[16:34:27] so
[16:34:47] tomorrow: stop puppet on db1011, merge the change, and do the copy to db1115
[16:34:52] yep
[16:35:00] by the time it finishes, I should be around
[16:35:17] and also this: https://upload.wikimedia.org/wikipedia/commons/a/ac/Hands-Fingers-Crossed.jpg
[16:35:49] 🤞
[16:36:16] interesting, my irc client shows a blank line
[16:38:16] hey, later you tell me it is me that I make you lose time with silly things!
[16:38:29] hahaha
[16:38:33] ok ok ok ok ok
[16:38:35] I am off!
[16:38:36] o/
[20:58:35] Reedy: can you see if there's any incidence with my account on mediawiki.org for T187834 ?
[20:58:36] T187834: Username not recognized by the extension - https://phabricator.wikimedia.org/T187834
[21:32:24] backing up tendril in a logical way is a nightmare- it actually uses a combination of tokudb and innodb, which means it is impossible to get a consistent view; plus querying the data dictionary is really hard because all federated/connect tables
[21:34:03] so we cannot reuse the code, we cannot reuse the tree, we cannot reuse the query stats (it uses processlist instad of P_S) and we cannot reuse the graphs
[21:34:20] I do not think there is much left to keep, honestly
[21:34:35] except the schema structure
[21:42:58] jynus: got a minute for a Logstash query?
[21:44:51] logstash?
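Editor's note: the plan agreed at [16:34:47] is to stop puppet on db1011, merge the pending change, and then copy everything to db1115. Below is a rough sketch of what such a cold physical copy can look like; the hosts come from the plan, but the transfer method (tar streamed over netcat), the port, the datadir path and the service unit name are all assumptions rather than the actual Wikimedia tooling.

```bash
# Hypothetical cold-copy of the datadir from db1011 to db1115; paths, port and
# transfer method are illustrative only (netcat flags vary by implementation).

# On db1011 (source): freeze config management and stop the database cleanly.
puppet agent --disable "tendril migration to db1115"
systemctl stop mariadb            # assumption: the unit is called mariadb

# On db1115 (destination), run first: listen and unpack into an empty datadir.
#   nc -l -p 4444 | tar -C /srv/sqldata -xf -

# Back on db1011: stream the whole datadir across.
tar -C /srv/sqldata -cf - . | nc db1115.eqiad.wmnet 4444
```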
[21:45:08] I am not master on that
[21:45:18] search is normally more on top of that
[21:45:23] or even releng as users
[21:45:33] but aking is free :-)
[21:46:57] 10DBA, 10Operations, 10Release-Engineering-Team, 10cloud-services-team, 10wikitech.wikimedia.org: Move some wikis to s5 - https://phabricator.wikimedia.org/T184805#3896851 (10demon) >>! In T184805#3896861, @jcrespo wrote: > Only adding #releng and #wmcs in case they can think of a reason not to move them...
[21:47:35] 10DBA, 10Data-Services, 10cloud-services-team, 10wikitech.wikimedia.org: Move wikitech and labstestwiki to s5 - https://phabricator.wikimedia.org/T167973#3991062 (10Andrew)
[21:49:03] jynus: okay...
[21:49:17] finding people to do something is everyday harder
[21:49:37] people to do something?
[21:49:54] if you need something to get done, phabricator is the place
[21:50:17] here is ok for pinging or conversation, but otherwise things will get forgotten
[21:50:50] true
[21:51:45] also plase be understunding, you may not know how many issues are happening at the same time that you cannot see
[21:52:04] just because people are working hard to get them fixed because they affect you
[21:52:33] yes, yes
[21:52:34] people normally hangout on a few communities, but we have to serve all of them
[21:53:06] just found someone able to query Logstash
[21:53:10] for a MWException
[21:58:30] 10DBA, 10Operations, 10cloud-services-team, 10wikitech.wikimedia.org, 10Release-Engineering-Team (Watching / External): Move some wikis to s5 - https://phabricator.wikimedia.org/T184805#3991103 (10greg)
[22:40:05] 10DBA, 10Data-Services, 10cloud-services-team, 10wikitech.wikimedia.org: Move wikitech and labstestwiki to s5 - https://phabricator.wikimedia.org/T167973#3991217 (10Andrew) Proposed checklist: [] exclude labswiki and labtestwiki from dumps and tools replicas (this may happen by default but I'm not sure)...
[23:25:08] 10DBA, 10Availability (Multiple-active-datacenters): Tracking: Cleanup x1 database connection patterns - https://phabricator.wikimedia.org/T164504#3991348 (10Krinkle)
[23:55:58] 10DBA, 10Data-Services, 10cloud-services-team, 10wikitech.wikimedia.org: Move wikitech and labstestwiki to s5 - https://phabricator.wikimedia.org/T167973#3991465 (10jcrespo) core dbs being accessed from non-core applications servers is something that needs checking- we do not have any of those right now, a...
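Editor's note: the [21:32:24] message in the previous block explains why a consistent logical backup of tendril is impossible: it mixes TokuDB and InnoDB and relies on FEDERATED/CONNECT tables. A quick, hedged way to see that engine mix for oneself, assuming only the standard information_schema and the `tendril` schema name from the conversation (sizes reported for TokuDB there are approximate at best):

```bash
# Sketch: count tendril tables and approximate size per storage engine.
mysql -e "
  SELECT engine,
         COUNT(*) AS tables,
         ROUND(SUM(data_length + index_length)/1024/1024/1024, 1) AS approx_gb
  FROM information_schema.tables
  WHERE table_schema = 'tendril'
  GROUP BY engine;"
```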