[00:22:55] DBA, Epic, Patch-For-Review: Decouple roles from mariadb.pp into their own file - https://phabricator.wikimedia.org/T150850#2879055 (Dzahn) @jcrespo :) cool, thank you
[03:32:12] DBA, AbuseFilter, Security-Team, Stewards-and-global-tools, Patch-For-Review: Log accessing private information by those with `abusefilter-private` permission - https://phabricator.wikimedia.org/T152934#2879298 (Huji)
[07:10:28] DBA, AbuseFilter, Security-Team, Stewards-and-global-tools, Patch-For-Review: Log accessing private information by those with `abusefilter-private` permission - https://phabricator.wikimedia.org/T152934#2863957 (Marostegui) Hi, I have some comments about the table structure you are proposin...
[08:26:27] DBA, Patch-For-Review: db2034: investigate its crash and reimage - https://phabricator.wikimedia.org/T149553#2879937 (Marostegui) Transfer started!
[08:47:14] DBA, Labs, Labs-Infrastructure: fix m5 shard mysql issues with default encoding binary - https://phabricator.wikimedia.org/T112103#2880019 (jcrespo) Open>Resolved a:jcrespo
[08:48:28] DBA, Patch-For-Review: Wikidatawiki revision table needs unification - https://phabricator.wikimedia.org/T150644#2880024 (Marostegui) db1045 is done ``` root@neodymium:~# mysql -hdb1045 -A wikidatawiki -e "show create table revision\G" --skip-ssl *************************** 1. row **********************...
[09:10:27] DBA, Patch-For-Review: Wikidatawiki revision table needs unification - https://phabricator.wikimedia.org/T150644#2880074 (Marostegui) The only pending hosts are: dbstore1001 dbstore1002 - alter running now db1049 (master)
[09:23:08] DBA, Patch-For-Review: db2034: investigate its crash and reimage - https://phabricator.wikimedia.org/T149553#2880144 (Marostegui) Changing the motherboard made no difference, the server died after 50 minutes into the transfer: ``` Server Power: Off ``` Same error log as always (even though the date look...
[09:36:44] DBA, Patch-For-Review: db2034: investigate its crash and reimage - https://phabricator.wikimedia.org/T149553#2880172 (Marostegui) As a new test I am throttling the netcat transfer to a max of 10mb from the source to db2034 to see if that gives less load to the disks/eth on db2034 and makes any difference.
[10:07:40] DBA, Patch-For-Review: db2034: investigate its crash and reimage - https://phabricator.wikimedia.org/T149553#2880252 (Marostegui) I have set up a quick netconsole server logging over netcat on db2048:9999 to see if there is something sent which is not logged to disk (logs) when db2034 dies.
[10:12:52] DBA, Patch-For-Review: Wikidatawiki revision table needs unification - https://phabricator.wikimedia.org/T150644#2880261 (Marostegui) dbstore1002 is done ``` root@neodymium:/home/marostegui/git/software/dbtools# mysql -hdbstore1002 -A wikidatawiki -e "show create table revision\G" --skip-ssl ***********...
[10:34:44] some of the old replica.my.cnf accounts start with a 'u' instead of 's'
[10:34:46] for... some reason.
[10:34:52] I'm going to clean those out just now
[10:37:09] DBA, Patch-For-Review: db2034: investigate its crash and reimage - https://phabricator.wikimedia.org/T149553#2880327 (Marostegui) Server died again 15 minutes into the transfer with the throttle to 100MB ``` Server Power: Off /system1/log1/record2 Targets Properties number=2 severity=Critic...
[10:44:12] DBA, Patch-For-Review: Wikidatawiki revision table needs unification - https://phabricator.wikimedia.org/T150644#2880362 (Marostegui) alter running - dbstore1001
[10:53:43] yuvipanda, the non-service ones
[10:54:07] Krenair: nope, still only working with tools (hence service accts)
[10:54:24] huh
[10:54:59] yeah
[10:55:26] DBA, Patch-For-Review: db2034: investigate its crash and reimage - https://phabricator.wikimedia.org/T149553#2880406 (jcrespo) Can we try a local `dd` with or without compression to discard network and/or CPU?
[10:57:32] DBA, Patch-For-Review: db2034: investigate its crash and reimage - https://phabricator.wikimedia.org/T149553#2880408 (Marostegui) >>! In T149553#2880406, @jcrespo wrote: > Can we try a local `dd` with or without compression to discard network and/or CPU? I tried that (not sure if that's exactly what yo...
[10:58:36] labdsdb1004 crashed while I was trying to do an import tablespace there
[10:58:51] right now?
[11:01:50] DBA, Patch-For-Review: db2034: investigate its crash and reimage - https://phabricator.wikimedia.org/T149553#2880424 (jcrespo) > I tried that And it crashed or not crashed? What if we do the same (large dd), without touching disks, on a ramdisk?
[11:02:06] a few minutes ago
[11:02:24] without .cfg
[11:02:39] and with .frms generated from mysqlfrm
[11:02:48] it was a nice recovery exercise
[11:02:57] given that the original server was still a 5.5
[11:03:09] but the 18GB table said no
[11:03:13] DBA, Patch-For-Review: db2034: investigate its crash and reimage - https://phabricator.wikimedia.org/T149553#2880426 (Marostegui) >>! In T149553#2880424, @jcrespo wrote: >> I tried that > > And it crashed or not crashed? Nope > What if we do the same (large dd), without touching disks, on a ramdisk?...
[11:15:07] I am pooling db1073, with 10.0.28 and no query rewrite
[11:15:15] as a new enwiki api role
[11:16:26] happy to see db1073 back
[11:16:29] after all its pain :)
[11:17:38] I will monitor "10.64.48.20" OR "10.64.48.21" OR "10.64.48.28" on kibana
[11:18:02] if 1073 is better or equal to the other 2, we will reimage the other apis
[11:18:12] do you think not having query rewrite can be a big issue?
[11:18:24] I am not familiar with what we use query rewrite for in the api service
[11:18:25] I think query rewrite is an issue
[11:18:29] we never talked about it :)
[11:18:53] maybe on monday you can give me some history?
[11:18:58] because when we switched over to codfw, problems disappeared
[11:19:23] marostegui, https://logstash.wikimedia.org/goto/7b33d55748a97bedffd08c8e4ad665c9
[11:19:47] marostegui, short story long, that predates me, changing api is scary
[11:20:01] agreed
[11:20:09] do you want to leave db1073 pooled in for the weekend?
[11:20:16] yes
[11:20:28] for now, it is literally infinitely better than the other 2
[11:20:43] 12 errors existing servers, 0 errors new one
[11:20:49] same pool weight
[11:20:55] \o/
[11:21:42] and the current ones had regular outages/slowdowns
[11:22:00] the log is pretty cool indeed, no errors from .28
[11:22:16] is it totally pooled now? as in: the servers re-read the config already and start sending traffic to it?
[11:22:27] exact same traffic than the others
[11:22:55] unless I am wrong: https://noc.wikimedia.org/conf/highlight.php?file=db-eqiad.php
[11:23:10] no, you are right: https://grafana-admin.wikimedia.org/dashboard/db/mysql?var-dc=eqiad%20prometheus%2Fops&var-server=db1073
[11:23:57] I believe it is not .28
[11:24:04] db1073.eqiad.wmnet has address 10.64.48.28
[11:24:29] but the slightly more powerful server + no query rewrite + old stuff
[11:24:58] aah haha, I was talking about the IP and you about 10.0.28
[11:25:12] did I wrote the wrong ip?
[11:25:17] no no
[11:25:20] *write
[11:25:41] ˜/marostegui 12:21> the log is pretty cool indeed, no errors from .28 -> there I meant 10.64.48.28=db1073
[11:25:51] one error now
[11:25:59] maybe I am being optimistic
[11:26:22] although I think we haven't applied the revision fix
[11:26:27] yet on enwiki
[11:57:03] DBA, Operations: Create a full backup of all external storage records that would be easy to restore/setup a temporary delayed slave - https://phabricator.wikimedia.org/T153440#2880544 (jcrespo)
[11:57:43] DBA, Collaboration-Team-Triage, Flow, Operations, WorkType-Maintenance: Setup separate logical External Store for Flow in production - https://phabricator.wikimedia.org/T107610#1499219 (jcrespo) a:jcrespo>None Focusing on the blocker subtask.
[12:03:24] I was like "lag on dbstore1002 s5, toku db again?", then I saw the import
[12:03:36] 1001
[12:03:40] ah, hehe
[12:03:42] no, the alter :)
[12:03:48] well, that
[12:40:27] DBA, Patch-For-Review: Import S2,S6,S7,m3 and x1 to dbstore2001 and dbstore2002 - https://phabricator.wikimedia.org/T151552#2880669 (Marostegui) Thinking about it, all the databases that have the tables `echo_xxx` and their size is 1M they do contain a row: ``` root@dbstore2001:/srv/sqldata# mysql --ski...
[12:47:22] jynus: ema pointed me to a disk storage alert on labsdb1001, just pinging to see if there's anything I can do to help
[12:57:30] DBA, AbuseFilter, Security-Team, Stewards-and-global-tools, Patch-For-Review: Log accessing private information by those with `abusefilter-private` permission - https://phabricator.wikimedia.org/T152934#2880705 (Huji) On the changeset is probably best. Just pick one of the SQL files and comme...
[13:25:17] DBA, Patch-For-Review: Import S2,S6,S7,m3 and x1 to dbstore2001 and dbstore2002 - https://phabricator.wikimedia.org/T151552#2880849 (jcrespo) We should ask (for example, on the thread of tables to drop) if that row should be there- if not we can drop them from production, too (not right away, but with t...
[13:47:57] DBA, Patch-For-Review: Import S2,S6,S7,m3 and x1 to dbstore2001 and dbstore2002 - https://phabricator.wikimedia.org/T151552#2880913 (Marostegui) >>! In T151552#2880849, @jcrespo wrote: > We should ask (for example, on the thread of tables to drop) if that row should be there- if not we can drop them fro...
[14:20:32] DBA, Epic, Tracking: Database tables to be dropped on Wikimedia wikis and other WMF databases (tracking) - https://phabricator.wikimedia.org/T54921#2036080 (Marostegui) While working on importing x1 data into dbstore2001 (T151552) I noticed that the following tables are existing in most of the wikis...
[14:41:34] DBA, Patch-For-Review: db2034: investigate its crash and reimage - https://phabricator.wikimedia.org/T149553#2881046 (Marostegui) Update: 4 hours into the transfer (145G copied) and the server is still up. Looks again either disk or ethernet related. Recap of what the server survived for: - CPU burning...
[15:22:51] DBA, Patch-For-Review: db2034: investigate its crash and reimage - https://phabricator.wikimedia.org/T149553#2881174 (Marostegui) I have mounted a 512G ramdisk and I am writing with 12 `dd` instances to it.
[16:34:52] DBA, Patch-For-Review: db2034: investigate its crash and reimage - https://phabricator.wikimedia.org/T149553#2881478 (Marostegui) The server has not crashed after 5 hours into the netcat transfer and I believe it won't crash. So that leaves us with the fact that with rate limiting the connection (to 10mbs...
[16:56:14] DBA, Operations, ops-eqiad, Patch-For-Review: Multiple hardware issues on db1073 - https://phabricator.wikimedia.org/T149728#2881551 (jcrespo) Open>Resolved a:jcrespo>Cmjohnson The server is now back fully into production after being cloned from db1052 as an extra api node (which...
[19:34:23] DBA, Patch-For-Review: db2034: investigate its crash and reimage - https://phabricator.wikimedia.org/T149553#2882144 (Marostegui) I have run the loop twice, writing 1.1T each time and the server didn't crash. The average writing time was 515MB/s Going to run the loop again. Maybe we need to try the appr...
[21:02:36] DBA, Patch-For-Review: Wikidatawiki revision table needs unification - https://phabricator.wikimedia.org/T150644#2882420 (Marostegui) dbstore1001 is done ``` root@neodymium:~# mysql -hdbstore1001 -A wikidatawiki -e "show create table revision\G" --skip-ssl *************************** 1. row ************...
[21:17:42] DBA, Patch-For-Review: db2034: investigate its crash and reimage - https://phabricator.wikimedia.org/T149553#2882453 (Marostegui) @Papaul we have not replaced the CPUs, right? I am thinking that this might be related to CPU and the IRQs. Right now I am seeing that the NIC interrupts are just being handle...