[01:26:36] DBA, SDC Engineering, SDC General, Wikidata, and 2 others: Create a production test wiki in group0 to parallel Wikimedia Commons - https://phabricator.wikimedia.org/T197616 (MusikAnimal) Is this intentionally missing from the replicas? ` MariaDB [testwiki_p]> USE testcommonswiki_p; ERROR 1044 (4...
[02:39:50] DBA, SDC Engineering, SDC General, Wikidata, and 2 others: Create a production test wiki in group0 to parallel Wikimedia Commons - https://phabricator.wikimedia.org/T197616 (Reedy) >>! In T197616#4876205, @MusikAnimal wrote: > Is this intentionally missing from the replicas? > > ` > MariaDB [tes...
[02:49:27] DBA, SDC Engineering, SDC General, Wikidata, and 2 others: Create a production test wiki in group0 to parallel Wikimedia Commons - https://phabricator.wikimedia.org/T197616 (Krenair) >>! In T197616#4876222, @Reedy wrote: >>>! In T197616#4876205, @MusikAnimal wrote: >> Is this intentionally missin...
[03:00:22] DBA, SDC Engineering, SDC General, Wikidata, and 2 others: Create a production test wiki in group0 to parallel Wikimedia Commons - https://phabricator.wikimedia.org/T197616 (MusikAnimal) > Don’t think anyone has filed a task requesting the views on labs to be created. It doesn’t happen automatica...
[07:29:32] DBA, Operations, ops-eqiad, Patch-For-Review: rack/setup/install pc1007-pc1010 - https://phabricator.wikimedia.org/T207258 (ops-monitoring-bot) Script wmf-auto-reimage was launched by marostegui on cumin1001.eqiad.wmnet for hosts: ` ['pc1007.eqiad.wmnet'] ` The log can be found in `/var/log/wmf-a...
[08:29:55] DBA, Operations, ops-eqiad, Patch-For-Review: rack/setup/install pc1007-pc1010 - https://phabricator.wikimedia.org/T207258 (ops-monitoring-bot) Completed auto-reimage of hosts: ` ['pc1007.eqiad.wmnet'] ` and were **ALL** successful.
[08:33:29] DBA, Operations, ops-eqiad, Patch-For-Review: rack/setup/install pc1007-pc1010 - https://phabricator.wikimedia.org/T207258 (Marostegui) pc1007 got installed and looks good: ` root@pc1007:~# megacli -LDPDInfo -aAll Adapter #0 Number of Virtual Disks: 1 Virtual Drive: 0 (Target Id: 0) Name...
[08:33:49] DBA, Operations, ops-eqiad, Patch-For-Review: rack/setup/install pc1007-pc1010 - https://phabricator.wikimedia.org/T207258 (Marostegui)
[08:33:57] DBA, Operations, ops-eqiad, Patch-For-Review: rack/setup/install pc1007-pc1010 - https://phabricator.wikimedia.org/T207258 (Marostegui) Open→Resolved
[09:06:00] DBA, Operations, ops-eqiad: db1115 (tendril DB) had OOM for some processes and some hw (memory) issues - https://phabricator.wikimedia.org/T196726 (Marostegui) @Cmjohnson can we request a new DIMM to Dell?
[10:18:54] DBA, Operations: correctable memory errors db1068 (commons primary master database) - https://phabricator.wikimedia.org/T213664 (jcrespo) I created to track it, it has gone up to 21 since yesterday. We have to consider the possibility of it crashing due to uncorrectable errors and be prepared for a failo...
[10:27:44] do you need help with dbstore1002 ?
[10:28:00] No, so far I am finishing the aria_chk
[10:28:12] We'll see what we get
[10:28:24] if that doesn't work, I will try to start without aria, and then convert the existing aria tables to innodb
[10:28:28] I want to avoid restarting the host
[10:28:33] (although I think it is a FS issue)
[10:28:43] ok, standing by because based on your helpful debugging I would know what to do- I suffered a very similar issue some time ago
[10:28:57] (Aria crashing)
[10:28:57] oh
[10:29:04] advise!
:)
[10:29:31] well, the nuclear option- removing aria, but you were also going to do that
[10:29:35] yeah
[10:29:36] XD
[10:29:39] I wanted to leave it to the end
[10:29:40] removing the logs, let it crash
[10:29:49] I couldn't do anything else
[10:30:08] of course, you can copy away the files, the structure is very similar to that of myisam
[10:30:13] I will finish the aria check, then remove the aria logs, (as suggested by the error log itself) and see what we get
[10:30:19] so they can recover on an individual basis
[10:30:23] *be recovered
[10:30:43] we don't have many myisam tables on the wikis, but the staging db does
[10:31:00] So I want to avoid starting without aria if possible as that means altering all those tables
[10:33:06] (people using dbstore1002 should already have an email about the issue + task)
[10:33:28] marostegui: oh, I was suggesting to do it by moving out the tables
[10:33:43] so the server is up and we can then recover those on an individual basis
[10:33:51] aaah
[10:33:52] for example, the linter tables can be reimported
[10:33:54] I didn't get that :)
[10:33:56] yeah
[10:34:05] but not the staging ones :(
[10:34:32] I will let you handle it, just saying another dbstore crashed before so if you need help, I may be able to
[10:34:39] Thanks! :)
[10:34:41] Appreciate it!
[10:34:51] btw, will you clean up the test db from sanitarium or should I?
[10:47:27] DBA, SDC Engineering, SDC General, Wikidata, and 2 others: Create a production test wiki in group0 to parallel Wikimedia Commons - https://phabricator.wikimedia.org/T197616 (jcrespo) First of all, I am only commenting because I have more information, but access handling is owned by the cloud team...
[11:02:25] jynus: meeting time?
[14:10:40] DBA: Purchase and setup remaining hosts for database backups - https://phabricator.wikimedia.org/T213406 (jcrespo)
[14:39:23] DBA, Operations, Patch-For-Review: correctable memory errors db1068 (commons primary master database) - https://phabricator.wikimedia.org/T213664 (CDanis) p:Triage→High
[14:52:50] DBA, Operations, Patch-For-Review: Implement parsercache service on pc[12]0(07|08|09|10) and replace leased pc[12]00[456] - https://phabricator.wikimedia.org/T208383 (Marostegui) pc1007 is now up and replicating. It is catching up. Tomorrow I will replace pc1010 with pc1007 for consistency with codfw...
[15:01:23] DBA, Patch-For-Review: Drop tag_summary table - https://phabricator.wikimedia.org/T212255 (Marostegui)
[15:36:05] what is the plan with pc hosts, 2 pc1 replicating one from other?
[15:36:44] yeah the spare host is replicating from pc1, just for the sake of replicate from somewhere
[15:37:03] yeah, no issue, just I didn't know what was the plan
[15:37:07] yeah
[15:37:11] I am going to promote pc1007 to master
[15:37:14] to match codfw config
[15:37:17] I guess any host would work
[15:37:25] 1007 - pc1
[15:37:27] 1008 pc2
[15:37:30] 1009 pc3
[15:37:32] any shard I mean
[15:37:32] 1010 pc1
[15:37:41] yeah, cool to me
[15:37:54] right now on eqiad pc1, pc1010 is the master as that one was installed before
[15:37:58] and pc2007 the same and replicating from 1007 ?
[15:38:03] so I will promote pc1007 tomorrow so both are the same
[15:38:27] another question to know what is pending
[15:38:30] pc2007 replicates from pc1010 for now, and tomorrow will be done
[15:38:47] did you change mw config/cron to increase usage or no change yet?
[15:38:55] the retention period you mean?
[15:38:58] yes
[15:38:59] like the TTL
[15:39:12] no, I was waiting for the holidays season to be done and now to get pc1007 in place
[15:39:19] which was more reasonable
[15:39:20] so I will push the TTL this week most likely
[15:39:38] I just want to track what is missing so we don't forget
[15:39:44] it can wait
[15:39:49] yeah, it is all tracked on the tasks
[15:40:07] https://phabricator.wikimedia.org/T208383#4877301
[15:40:15] but I would like to find a good balance between resource usage and room for accidents
[15:40:28] https://phabricator.wikimedia.org/T210992#4873982
[15:40:38] ah, cool
[15:40:43] so there was already a ticket
[15:40:47] yep
[15:40:49] and after that
[15:40:52] we can start with
[15:41:03] https://phabricator.wikimedia.org/T210725
[15:41:07] please involve me
[15:41:14] if you want to do it yourself
[15:41:22] I will involve you for sure
[15:41:25] it is a delicate operation
[15:41:29] I want to give a deep look at the past and current state
[15:41:52] I will actively ask for a +1 for those changes, TTL and key changes anyways
[15:42:01] I am not sure if to take over T210725 now
[15:42:01] T210725: Replace parsercache keys to something more meaningful on db-XXXX.php - https://phabricator.wikimedia.org/T210725
[15:42:14] on one way, it is better to do it now
[15:42:24] before increasing ttl?
[15:42:30] on the other, we may be mixing 2 changes
[15:42:41] yeah, I think we should isolate those
[15:42:46] whichever we do first, I don't mind
[15:42:56] I think TTL should be first though, as we "have been there"
[15:42:57] let's send to SOS some request for help
[15:43:17] to performance and core
[15:43:23] for which change?
[15:43:38] both in general "upcoming parser cache changes"
[15:43:44] they are involved on both
[15:43:50] I guess we need to decide which one to go first
[15:43:51] also service ops will want to get involved
[15:44:08] I think we should do the TTL first
[15:44:12] so not sure if SOS, but let's highlight it
[15:44:12] As it is a situation we know
[15:44:17] something is happening soon
[15:44:39] "be aware of this change(s)"
[15:45:19] hey, do you guys want to do a meeting about the DBA JD tomorrow?
[15:45:32] added to the SRE meeting
[15:45:37] mark: fine by me
[15:45:37] ok to me, marostegui
[15:45:39] *mark
[15:45:43] tomorrow afternoon?
[15:45:46] ok
[15:45:50] sounds good
[15:45:56] from what time are you available after lunch?
[15:46:14] for me normally 3 is ok
[15:46:17] I am free after 3pm all the time
[15:46:18] *16 CET
[15:46:21] *15
[15:46:25] let's do 3 then
[15:46:26] so 14 UTC
[15:46:32] mark: ok for me
[15:47:50] we discussed already some thing today manuel and me
[15:47:55] *things
[15:59:01] Blocked-on-schema-change, DBA, Patch-For-Review, Schema-change, User-Banyek: Dropping user.user_options on wmf databases - https://phabricator.wikimedia.org/T85757 (Marostegui)
[16:00:55] marostegui: do you have time for backups now or do I go for the s4 restart?
[16:01:36] I have time
[16:01:42] we can also do it tomorrow morning
[16:01:45] whatever you prefer
[16:03:11] it is just 5 minutes
[16:03:20] famous last words!
[16:03:25] let's do it then
[16:03:47] it is 5 minutes to explain, the discussion it is out of my hand
[16:03:51] haha
[17:00:13] DBA, Operations, ops-eqiad: db1115 (tendril DB) had OOM for some processes and some hw (memory) issues - https://phabricator.wikimedia.org/T196726 (Cmjohnson) @Marostegui I have to do move the DIMM to another slot and see if the error corrects itself moves with the DIMM or remains the same. Can you...
[17:00:55] DBA, Operations, ops-eqiad: db1115 (tendril DB) had OOM for some processes and some hw (memory) issues - https://phabricator.wikimedia.org/T196726 (Marostegui) Yep, we can do that! Just ping us when you are ready for it Thanks!
[17:02:03] DBA, User-Banyek: BBU problems dbstore2002 - https://phabricator.wikimedia.org/T205257 (Marostegui) a:Papaul→None Nothing for Papaul to do here for now.
[17:46:51] looks like we need to prepare an s3 failover!
[19:13:45] DBA, Analytics, Operations, ops-eqiad: swap a2-eqiad PDU with on-site spare - https://phabricator.wikimedia.org/T213748 (RobH)
[19:15:09] DBA, Analytics, Operations, ops-eqiad: swap a2-eqiad PDU with on-site spare - https://phabricator.wikimedia.org/T213748 (RobH)
[19:18:37] DBA, Analytics, Operations, ops-eqiad: swap a2-eqiad PDU with on-site spare - https://phabricator.wikimedia.org/T213748 (RobH)
[19:35:23] DBA, Analytics, Operations, ops-eqiad: swap a2-eqiad PDU with on-site spare - https://phabricator.wikimedia.org/T213748 (RobH)
[19:35:55] DBA, Analytics, Operations, ops-eqiad: swap a2-eqiad PDU with on-site spare - https://phabricator.wikimedia.org/T213748 (RobH)
[19:51:41] DBA, Analytics, Operations, ops-eqiad: swap a2-eqiad PDU with on-site spare - https://phabricator.wikimedia.org/T213748 (RobH) Please note the work has now been scheduled for Thursday, 2019-01-17 @ 07:00 EST (12:00 GMT). As both the #dba team and the #analytics team have expressed interest in st...
[19:58:03] DBA, Analytics, Operations, ops-eqiad: swap a2-eqiad PDU with on-site spare - https://phabricator.wikimedia.org/T213748 (RobH)
[19:58:45] DBA, Analytics, Operations, ops-eqiad: swap a2-eqiad PDU with on-site spare - https://phabricator.wikimedia.org/T213748 (Marostegui) @robh what's your plan with db1075 (the db master)?
[19:59:58] DBA, Analytics, Operations, ops-eqiad: swap a2-eqiad PDU with on-site spare - https://phabricator.wikimedia.org/T213748 (RobH) >>! In T213748#4878612, @Marostegui wrote: > @robh what's your plan with db1075 (the db master)? @cmjohnson will take 1 of the 2 power supplies and cross-cable it into...
[20:04:19] DBA, Analytics, Operations, ops-eqiad: swap a2-eqiad PDU with on-site spare - https://phabricator.wikimedia.org/T213748 (Marostegui) Awesome! Thanks for clarifying!
[22:57:33] DBA, SDC Engineering, SDC General, Wikidata, and 2 others: Create a production test wiki in group0 to parallel Wikimedia Commons - https://phabricator.wikimedia.org/T197616 (Jdforrester-WMF) >>! In T197616#4876226, @MusikAnimal wrote: >> Don’t think anyone has filed a task requesting the views on...