[06:40:43] DBA, Cloud-Services: db1009 overloaded - https://phabricator.wikimedia.org/T188589#4012890 (Marostegui)
[06:44:18] DBA, Cloud-Services: db1009 overloaded - https://phabricator.wikimedia.org/T188589#4012901 (Marostegui) I am killing sleeping connections to nova database in a screen on db1009 for now.
[06:45:01] DBA, Cloud-Services, Operations: db1009 overloaded - https://phabricator.wikimedia.org/T188589#4012902 (Marostegui) p:Triage>High
[06:45:12] DBA, Cloud-Services, Operations: db1009 overloaded - https://phabricator.wikimedia.org/T188589#4012890 (Marostegui)
[06:48:31] DBA, Cloud-Services, Operations: db1009 overloaded - https://phabricator.wikimedia.org/T188589#4012915 (Marostegui)
[07:03:16] DBA, Cloud-Services, Operations: db1009 overloaded - https://phabricator.wikimedia.org/T188589#4012947 (chasemp) ```root@labcontrol1001:~# OS_TENANT_NAME=admin-monitoring openstack server list +--------------------------------------+-----------------------+--------+---------------------+ | ID...
[07:03:20] DBA, Cloud-Services, Operations: db1009 overloaded - https://phabricator.wikimedia.org/T188589#4012890 (madhuvishy) Things seem a lot better now since 06:57 madhuvishy: Restart nova-conductor on labcontrol1001 06:59 chasemp: restart nova-api on labnet1001
[07:03:48] DBA, Cloud-Services, Operations: db1009 overloaded - https://phabricator.wikimedia.org/T188589#4012950 (Marostegui) >>! In T188589#4012901, @Marostegui wrote: > I am killing sleeping connections to nova database in a screen on db1009 for now. This was stopped at 06:56AM as @madhuvishy started to res...
[07:04:05] DBA, Cloud-Services, Operations: db1009 overloaded - https://phabricator.wikimedia.org/T188589#4012951 (chasemp) @madhuvishy restarted nova-conductor and I restarted nova-api shortly thereafter. nova-conductor restart seems to have calmed things down. I restarted nova-api as it has a tendency to "g...
[07:16:31] DBA, Cloud-Services, Operations: db1009 overloaded - https://phabricator.wikimedia.org/T188589#4012970 (chasemp) It took some patience but post restarts I cleaned up nova-fullstack's mess ``` 2007 OS_TENANT_NAME=admin-monitoring openstack server list 2008 OS_TENANT_NAME=admin-monitoring openstack s...
[07:18:11] DBA, Cloud-Services, Operations: db1009 overloaded - https://phabricator.wikimedia.org/T188589#4012971 (Marostegui) p:High>Normal Decreasing the task back to Normal priority as things look stable and leaving it open as per: ``` ˜/chasemp 8:16> marostegui: no worries and let's leave it open ti...
[07:21:59] DBA, Operations, ops-codfw: Degraded RAID on db2048 - https://phabricator.wikimedia.org/T188286#4012976 (Marostegui) Open>Resolved Let's see if this lasts for long this time ``` root@db2048:~# hpssacli controller all show config Smart Array P420i in Slot 0 (Embedded) (sn: 0014380337E3350)...
[07:25:23] DBA, Cloud-Services, Operations: db1009 overloaded - https://phabricator.wikimedia.org/T188589#4012978 (chasemp) ```root@labcontrol1001:~# OS_TENANT_NAME=contintcloud openstack server list +--------------------------------------+----------------------------+--------+---------------------+ | ID...
[07:26:33] DBA, Cloud-Services, Operations: db1009 overloaded - https://phabricator.wikimedia.org/T188589#4012981 (madhuvishy) Some logs from nova-conductor corresponding to the time of incident, doesn't seem like the root cause but correlates with the db spike.
https://phabricator.wikimedia.org/P6770
[10:11:30] DBA, Cloud-Services, Operations: db1009 overloaded - https://phabricator.wikimedia.org/T188589#4013230 (jcrespo) This was warned in advance at T188210
[10:20:57] DBA, Cloud-Services, Operations: db1009 overloaded - https://phabricator.wikimedia.org/T188589#4013246 (jcrespo) The immediate solution is T183469
[10:24:17] DBA, Wikimedia-Incident: Investigate why query killer didn't kill 1-hour long queries - https://phabricator.wikimedia.org/T188505#4013250 (jcrespo)
[11:50:32] DBA, Data-Services, Patch-For-Review: labsdb1010 crashed - https://phabricator.wikimedia.org/T186579#4013426 (jcrespo) labsdb1011 is back up, but needs to be upgraded and catch up with replication. The next steps after that are: * pool labsdb1011 back * depool labsdb1010 * Recover it from generated...
[12:16:33] DBA, Wikidata, Patch-For-Review, Technical-Debt: Migrate wb_terms to using prefixed entity IDs instead of numeric IDs - https://phabricator.wikimedia.org/T114903#4013579 (thiemowmde) p:Triage>Normal
[12:17:39] DBA, Wikidata, Technical-Debt: Migrate wb_terms to using prefixed entity IDs instead of numeric IDs - https://phabricator.wikimedia.org/T114903#1709272 (thiemowmde)
[13:03:41] DBA, Operations, Patch-For-Review: Setup newer machines and replace all old misc (m*) and x1 eqiad machines - https://phabricator.wikimedia.org/T183469#4013727 (jcrespo)
[13:23:29] jynus: marostegui: Hey, Wikidata team decided to pick this up: https://phabricator.wikimedia.org/T184485. What it means DB-wise is that logging table in Wikidata (and most wikis like commons) will be cut to either: 1- half 2- or one percent. That will free up lots of storage. ETA of happening it is the next month. Any considerations?
will definitely let you know so you optimize the tables
[13:23:29] The table itself is 600M rows with average of 180 bytes per row (not to count indexes)
[13:23:47] I sent this yesterday too but you weren't around I guess
[13:34:03] DBA, Cloud-Services, Operations: db1009 overloaded - https://phabricator.wikimedia.org/T188589#4012890 (Andrew) Sorry I slept through this last night! I'm catching up. A few facts: nova-api seems to connect directly to the database. Other than nova-api, nova-conductor is the service that marshals...
[13:43:47] DBA, Cloud-Services, Operations: db1009 overloaded - https://phabricator.wikimedia.org/T188589#4013834 (jcrespo) The problem I see is that each openstack (I think it is openstack) application, has its own pool of connections- which has some issues for our infrastructure- first because it "reserves" r...
[13:49:33] we read it
[13:49:53] it is just we don't have any actionables there-
[13:50:04] no need to ping us, it is ok to write it here so we are in the loop
[13:50:32] but with so many things in our worklog, just create a ticket for defragmenting when there is an actionable
[13:51:34] in fact, if you want us in the loop- but are not asking to own something, you can add the DBA tag with the "not dba team/blocked external"
[13:51:39] column
[13:52:00] that will tell us "hey, I am doing X, this is just a notification"
[13:52:23] (this is only a suggestion, because IRC discussions get lost)
[13:57:11] Sure, will follow up on phabricator
[14:02:14] Amir1: ORES filters on eswiki <3 <3
[14:23:59] DBA, Patch-For-Review: Finish the database backups generation script to create consistent logical backups in CODFW - https://phabricator.wikimedia.org/T184696#4013907 (jcrespo)
[14:35:27] DBA, Cloud-Services, Operations: db1009 overloaded - https://phabricator.wikimedia.org/T188589#4013954 (chasemp) This was in my email twice from last night.
I am suspicious of this cron, but unsure if (part of) cause or effect really. `12:40 AM (7 hours ago)` (my local time) > Cron
DBA, Cloud-Services, Operations: db1009 overloaded - https://phabricator.wikimedia.org/T188589#4013958 (jcrespo)
[14:52:34] DBA, Cloud-Services, Operations: db1009 overloaded - https://phabricator.wikimedia.org/T188589#4014007 (chasemp) >>! In T188589#4013954, @chasemp wrote: > This was in my email twice from last night. I am suspicious of this cron, but unsure if (part of) cause or effect really. > > `12:40 AM (7 hours...
[15:01:45] DBA, Cloud-Services, Operations: db1009 overloaded - https://phabricator.wikimedia.org/T188589#4014023 (chasemp)
[15:15:53] DBA, Cloud-Services, Operations: db1009 overloaded - https://phabricator.wikimedia.org/T188589#4014054 (Marostegui) I agree with Jaime here - it is key to find what is causing this overload. Even though we have to replace this old host, we really need to find out what is causing this overload, otherw...
[15:23:53] DBA, MediaWiki-Platform-Team (MWPT-Q3-Jan-Mar-2018), Patch-For-Review, Schema-change: Fix WMF schemas to not break when comment store goes WRITE_NEW - https://phabricator.wikimedia.org/T187089#4014082 (Marostegui)
[15:29:00] DBA, Data-Services, Patch-For-Review: labsdb1010 crashed - https://phabricator.wikimedia.org/T186579#4014114 (Marostegui) Is it worth to fix the current s7 replication breakage on labsdb1010 or you are planning to put it down soon?
[15:32:03] DBA, Data-Services, Patch-For-Review: labsdb1010 crashed - https://phabricator.wikimedia.org/T186579#4014122 (jcrespo) The replication catch-up and copy will take at the very least >24 hours, always assuming there was no data loss forcing us to start from the beginning.
[15:34:53] DBA, Data-Services, Patch-For-Review: labsdb1010 crashed - https://phabricator.wikimedia.org/T186579#4014127 (Marostegui) >>!
In T186579#4014122, @jcrespo wrote: > The replication catch-up and copy will take at the very least >24 hours, always assuming there was no data loss forcing us to start from...
[15:57:16] DBA, Data-Services, Patch-For-Review: labsdb1010 crashed - https://phabricator.wikimedia.org/T186579#4014182 (jcrespo) > if it is worth to fix it or not I don't know, that is why I was giving information on how much time is left for labsdb1010 to be restored (>24 hours).
[16:00:43] DBA, Data-Services, Patch-For-Review: labsdb1010 crashed - https://phabricator.wikimedia.org/T186579#4014189 (Marostegui) >>! In T186579#4014182, @jcrespo wrote: >> if it is worth to fix it or not > > I don't know, that is why I was giving information on how much time is left for labsdb1010 to be re...
[16:02:40] DBA, Data-Services, Patch-For-Review: labsdb1010 crashed - https://phabricator.wikimedia.org/T186579#4014192 (jcrespo) I plan to transfer the compressed package first- I am not 100% sure it is in a good state- will try to start decompressing it without deleting anything for as long as there is space...
[16:12:57] DBA, Data-Services, Patch-For-Review: labsdb1010 crashed - https://phabricator.wikimedia.org/T186579#4014225 (Marostegui) replication is now flowing on s7
[16:13:18] DBA, Data-Services, Patch-For-Review: labsdb1010 crashed - https://phabricator.wikimedia.org/T186579#4014226 (jcrespo) Thanks
[16:24:30] Blocked-on-schema-change, DBA, Patch-For-Review, Schema-change: Perform schema change to add externallinks.el_index_60 to all wikis - https://phabricator.wikimedia.org/T153182#4014286 (Marostegui) s2: pending the master, which I will alter on Monday
[16:24:35] Blocked-on-schema-change, DBA, MediaWiki-Database, Multi-Content-Revisions, and 3 others: Schema change to prepare for dropping archive.ar_text and archive.ar_flags - https://phabricator.wikimedia.org/T185128#4014288 (Marostegui) s2: pending the master, which I will alter on Monday
[16:25:25] DBA, MediaWiki-Platform-Team (MWPT-Q3-Jan-Mar-2018), Patch-For-Review, Schema-change: Fix WMF schemas to not break when comment store goes WRITE_NEW - https://phabricator.wikimedia.org/T187089#4014291 (Marostegui) s2 is finished. Only pending the master, which will be done during the DC switch.
[16:28:27] DBA, Patch-For-Review: Prepare and indicate proper master db failover candidates for all eqiad database sections (s1-s8, x1) - https://phabricator.wikimedia.org/T186321#4014298 (Marostegui)
[16:40:29] DBA, Cloud-Services, Operations: db1009 overloaded by idle connections to the nova database - https://phabricator.wikimedia.org/T188589#4014352 (bd808)
[16:46:52] DBA, Operations, Patch-For-Review: Setup newer machines and replace all old misc (m*) and x1 eqiad machines - https://phabricator.wikimedia.org/T183469#4014404 (Marostegui) I would like to do the following to be able to replace db1009. Get db1114 (512G) to replace db1073 (160G API in s1) as API in s...
[16:47:07] DBA, Cloud-Services, Operations: db1009 overloaded by idle connections to the nova database - https://phabricator.wikimedia.org/T188589#4014408 (jcrespo) "by idle connections to the nova database" I don't think that is accurate- that is making things worse, but probably not the root cause. max_conne...
[16:49:02] DBA, Patch-For-Review: Prepare and indicate proper master db failover candidates for all eqiad database sections (s1-s8, x1) - https://phabricator.wikimedia.org/T186321#4014426 (Marostegui)
[16:49:52] DBA, Operations, Patch-For-Review: Setup newer machines and replace all old misc (m*) and x1 eqiad machines - https://phabricator.wikimedia.org/T183469#4014428 (jcrespo) I was thinking of making db1114 a multi-misc. But at this point, anything goes as long as we get rid of db1009- the problem is the...
[16:52:02] DBA, Operations, Patch-For-Review: Setup newer machines and replace all old misc (m*) and x1 eqiad machines - https://phabricator.wikimedia.org/T183469#4014447 (jcrespo) We need to update the original plan: https://gerrit.wikimedia.org/r/#/c/399792/3/wmf-config/db-eqiad.php
[16:52:13] DBA, Operations, Patch-For-Review: Setup newer machines and replace all old misc (m*) and x1 eqiad machines - https://phabricator.wikimedia.org/T183469#4014448 (Marostegui) >>! In T183469#4014428, @jcrespo wrote: > I was thinking of making db1114 a multi-misc. But at this point, anything goes as long...
[16:53:43] DBA, Operations, Patch-For-Review: Setup newer machines and replace all old misc (m*) and x1 eqiad machines - https://phabricator.wikimedia.org/T183469#4014452 (jcrespo) +1. If you have time to update a db-eqiad.php, that would be great to know pending steps. If not, I will do it when I have the time...
[16:54:23] DBA, Operations, Patch-For-Review: Setup newer machines and replace all old misc (m*) and x1 eqiad machines - https://phabricator.wikimedia.org/T183469#4014453 (Marostegui) >>!
In T183469#4014452, @jcrespo wrote: > +1. If you have time to update a db-eqiad.php, that would be great to know pending ste...
[16:55:36] DBA, Cloud-Services, Operations: db1009 overloaded by idle connections to the nova database - https://phabricator.wikimedia.org/T188589#4014473 (Andrew) > > Is this something that would be more safely done with `keystone-manage token_flush` or is that unrelated? These are two different things. To...
[16:59:06] DBA, Operations, Patch-For-Review: Setup newer machines and replace all old misc (m*) and x1 eqiad machines - https://phabricator.wikimedia.org/T183469#4014490 (Marostegui) Let's go for mariadb 10.1+ stretch for db1114 so we can have a 10.1 as API in s1 - (we already have the two rc slaves as 10.1)
[17:18:51] DBA, Operations, ops-eqiad: Degraded RAID on db1068 - https://phabricator.wikimedia.org/T188187#4014579 (Marostegui) Thanks Chris! ``` root@db1068:~# megacli -PDRbld -ShowProg -PhysDrv [32:2] -aALL Rebuild Progress on Device at Enclosure 32, Slot 2 Completed 6% in 8 Minutes. ```
[17:23:09] I am doing a full backup run on es2001, so we can see if something fails ahead of next week's programmed one
[17:23:41] cool!
[17:24:12] things are looking good- but there will be typical errors like wrong hosts, permissions, etc
[17:24:47] yeah, the usual ones in that regard :)
[17:25:02] there is 3.5T free, so that should be enough to leave it over the night
[17:26:58] DBA, Operations, Patch-For-Review: Setup newer machines and replace all old misc (m*) and x1 eqiad machines - https://phabricator.wikimedia.org/T183469#4014605 (ops-monitoring-bot) Script wmf-auto-reimage was launched by marostegui on neodymium.eqiad.wmnet for hosts: ``` ['db1114.eqiad.wmnet'] ``` Th...
[17:44:34] DBA, Operations, Patch-For-Review: Setup newer machines and replace all old misc (m*) and x1 eqiad machines - https://phabricator.wikimedia.org/T183469#4014652 (ops-monitoring-bot) Completed auto-reimage of hosts: ``` ['db1114.eqiad.wmnet'] ``` and were **ALL** successful.
[17:55:35] copying data from dbstore1001 -> labsdb1010 ETA 4:18:25
[17:56:24] I am not sure if all data will be on those 1.8 TB compressed, we'll see
[17:56:46] that is why I didn't want to remove the data already
[19:13:48] DBA, Data-Services: Re-institute query killer for the analytics WikiReplica - https://phabricator.wikimedia.org/T183983#4015262 (jcrespo) This is still happening, at least on labsdb1009: ``` | 118438410 | s51772 | 10.64.37.14:50886 | enwiki_p | Execute | 264724 | User sleep...
[19:14:57] DBA, Data-Services: Re-institute query killer for the analytics WikiReplica - https://phabricator.wikimedia.org/T183983#4015263 (jcrespo) apparently not because bug, query killer was disabled.
[19:17:17] DBA, MediaWiki-Platform-Team (MWPT-Q3-Jan-Mar-2018), Patch-For-Review, Schema-change: Fix WMF schemas to not break when comment store goes WRITE_NEW - https://phabricator.wikimedia.org/T187089#4015265 (Anomie) This seems to be moving along reasonably quickly so far, although s4 with the huge `com...
[19:55:44] DBA, Operations, ops-eqiad: Degraded RAID on db1068 - https://phabricator.wikimedia.org/T188187#4015403 (Marostegui) It worked this time! ``` root@db1068:~# megacli -LDPDInfo -aAll Adapter #0 Number of Virtual Disks: 1 Virtual Drive: 0 (Target Id: 0) Name : RAID Level : Prim...
[23:33:52] DBA, Community-Tech, MediaWiki-extensions-GlobalPreferences, Patch-For-Review, Schema-change: DBA review for GlobalPreferences schema - https://phabricator.wikimedia.org/T184666#3891821 (Reedy) Ok, so let's see where we're at here. This extension is basically copying the core user_preference...