[06:40:43] <wikibugs>	 DBA, Cloud-Services: db1009 overloaded - https://phabricator.wikimedia.org/T188589#4012890 (Marostegui)
[06:44:18] <wikibugs>	 DBA, Cloud-Services: db1009 overloaded - https://phabricator.wikimedia.org/T188589#4012901 (Marostegui) I am killing sleeping connections to nova database in a screen on db1009 for now.
[06:45:01] <wikibugs>	 DBA, Cloud-Services, Operations: db1009 overloaded - https://phabricator.wikimedia.org/T188589#4012902 (Marostegui) p:Triage>High
[06:45:12] <wikibugs>	 DBA, Cloud-Services, Operations: db1009 overloaded - https://phabricator.wikimedia.org/T188589#4012890 (Marostegui)
[06:48:31] <wikibugs>	 DBA, Cloud-Services, Operations: db1009 overloaded - https://phabricator.wikimedia.org/T188589#4012915 (Marostegui)
[07:03:16] <wikibugs>	 DBA, Cloud-Services, Operations: db1009 overloaded - https://phabricator.wikimedia.org/T188589#4012947 (chasemp) ```root@labcontrol1001:~# OS_TENANT_NAME=admin-monitoring openstack server list +--------------------------------------+-----------------------+--------+---------------------+ | ID...
[07:03:20] <wikibugs>	 DBA, Cloud-Services, Operations: db1009 overloaded - https://phabricator.wikimedia.org/T188589#4012890 (madhuvishy) Things seem a lot better now since  06:57 madhuvishy: Restart nova-conductor on labcontrol1001 06:59 chasemp: restart nova-api on labnet1001
[07:03:48] <wikibugs>	 DBA, Cloud-Services, Operations: db1009 overloaded - https://phabricator.wikimedia.org/T188589#4012950 (Marostegui) >>! In T188589#4012901, @Marostegui wrote: > I am killing sleeping connections to nova database in a screen on db1009 for now.  This was stopped at 06:56AM as @madhuvishy started to res...
[07:04:05] <wikibugs>	 DBA, Cloud-Services, Operations: db1009 overloaded - https://phabricator.wikimedia.org/T188589#4012951 (chasemp) @madhuvishy restarted nova-conductor and I restarted nova-api  shortly thereafter.  nova-conductor restart seems to have calmed things down.  I restarted nova-api as it has a tendency to "g...
[07:16:31] <wikibugs>	 DBA, Cloud-Services, Operations: db1009 overloaded - https://phabricator.wikimedia.org/T188589#4012970 (chasemp) It took some patience but post restarts I cleaned up nova-fullstack's mess  ``` 2007  OS_TENANT_NAME=admin-monitoring openstack server list  2008  OS_TENANT_NAME=admin-monitoring openstack s...
[07:18:11] <wikibugs>	 DBA, Cloud-Services, Operations: db1009 overloaded - https://phabricator.wikimedia.org/T188589#4012971 (Marostegui) p:High>Normal Decreasing the task back to Normal priority as things look stable and leaving it open as per: ``` ˜/chasemp 8:16> marostegui: no worries and let's leave it open ti...
[07:21:59] <wikibugs>	 DBA, Operations, ops-codfw: Degraded RAID on db2048 - https://phabricator.wikimedia.org/T188286#4012976 (Marostegui) Open>Resolved Let's see if this lasts for long this time ``` root@db2048:~# hpssacli controller all show config  Smart Array P420i in Slot 0 (Embedded)    (sn: 0014380337E3350)...
[07:25:23] <wikibugs>	 DBA, Cloud-Services, Operations: db1009 overloaded - https://phabricator.wikimedia.org/T188589#4012978 (chasemp) ```root@labcontrol1001:~# OS_TENANT_NAME=contintcloud openstack server list +--------------------------------------+----------------------------+--------+---------------------+ | ID...
[07:26:33] <wikibugs>	 DBA, Cloud-Services, Operations: db1009 overloaded - https://phabricator.wikimedia.org/T188589#4012981 (madhuvishy) Some logs from nova-conductor corresponding to the time of incident, doesn't seem like the root cause but correlates with the db spike. https://phabricator.wikimedia.org/P6770
[10:11:30] <wikibugs>	 DBA, Cloud-Services, Operations: db1009 overloaded - https://phabricator.wikimedia.org/T188589#4013230 (jcrespo) This was warned in advance at T188210
[10:20:57] <wikibugs>	 DBA, Cloud-Services, Operations: db1009 overloaded - https://phabricator.wikimedia.org/T188589#4013246 (jcrespo) The immediate solution is T183469
[10:24:17] <wikibugs>	 DBA, Wikimedia-Incident: Investigate why query killer didn't kill 1-hour long queries - https://phabricator.wikimedia.org/T188505#4013250 (jcrespo)
[11:50:32] <wikibugs>	 DBA, Data-Services, Patch-For-Review: labsdb1010 crashed - https://phabricator.wikimedia.org/T186579#4013426 (jcrespo) labsdb1011 is back up, but needs to be upgraded and catch up with replication. The next steps after that are:  * pool labsdb1011 back * depool labsdb1010 * Recover it from generated...
[12:16:33] <wikibugs>	 DBA, Wikidata, Patch-For-Review, Technical-Debt: Migrate wb_terms to using prefixed entity IDs instead of numeric IDs - https://phabricator.wikimedia.org/T114903#4013579 (thiemowmde) p:Triage>Normal
[12:17:39] <wikibugs>	 DBA, Wikidata, Technical-Debt: Migrate wb_terms to using prefixed entity IDs instead of numeric IDs - https://phabricator.wikimedia.org/T114903#1709272 (thiemowmde)
[13:03:41] <wikibugs>	 DBA, Operations, Patch-For-Review: Setup newer machines and replace all old misc (m*) and x1 eqiad machines - https://phabricator.wikimedia.org/T183469#4013727 (jcrespo)
[13:23:29] <Amir1>	 jynus: marostegui: Hey, the Wikidata team decided to pick this up: https://phabricator.wikimedia.org/T184485. DB-wise, it means the logging table in Wikidata (and most wikis like commons) will be cut to either: 1- half 2- or one percent. That will free up lots of storage. The ETA is next month. Any considerations? I will definitely let you know so you can optimize the tables
[13:23:29] <Amir1>	 The table itself is 600M rows with an average of 180 bytes per row (not counting indexes)
[13:23:47] <Amir1>	 I sent this yesterday too but you weren't around I guess
[13:34:03] <wikibugs>	 DBA, Cloud-Services, Operations: db1009 overloaded - https://phabricator.wikimedia.org/T188589#4012890 (Andrew) Sorry I slept through this last night!  I'm catching up.  A few facts:  nova-api seems to connect directly to the database.  Other than nova-api, nova-conductor is the service that marshals...
[13:43:47] <wikibugs>	 DBA, Cloud-Services, Operations: db1009 overloaded - https://phabricator.wikimedia.org/T188589#4013834 (jcrespo) The problem I see is that each openstack (I think it is openstack) application, has its own pool of connections- which has some issues for our infrastructure- first because it "reserves" r...
[13:49:33] <jynus>	 we read it
[13:49:53] <jynus>	 it is just we don't have any actionables there-
[13:50:04] <jynus>	 no need to ping us, it is ok to write it here so we are in the loop
[13:50:32] <jynus>	 but with so many things in our worklog, just create a ticket for defragmenting when there is an actionable
[13:51:34] <jynus>	 in fact, if you want us in the loop- but are not asking to own something, you can add the DBA tag with the "not dba team/blocked external"
[13:51:39] <jynus>	 column
[13:52:00] <jynus>	 that will tell us "hey, I am doing X, this is just a notification"
[13:52:23] <jynus>	 (this is only a suggestion, because IRC discussions get lost)
[13:57:11] <Amir1>	 Sure, will follow up on phabricator
[14:02:14] <Hauskatze>	 Amir1: ORES filters on eswiki <3 <3
[14:23:59] <wikibugs>	 DBA, Patch-For-Review: Finish the database backups generation script to create consistent logical backups in CODFW - https://phabricator.wikimedia.org/T184696#4013907 (jcrespo)
[14:35:27] <wikibugs>	 DBA, Cloud-Services, Operations: db1009 overloaded - https://phabricator.wikimedia.org/T188589#4013954 (chasemp) This was in my email twice from last night.  I am suspicious of this cron, but unsure if (part of) cause or effect really.  `12:40 AM (7 hours ago)` (my local time)  > Cron <root@labcontro...
[14:36:30] <wikibugs>	 DBA, Cloud-Services, Operations: db1009 overloaded - https://phabricator.wikimedia.org/T188589#4013958 (jcrespo)
[14:52:34] <wikibugs>	 DBA, Cloud-Services, Operations: db1009 overloaded - https://phabricator.wikimedia.org/T188589#4014007 (chasemp) >>! In T188589#4013954, @chasemp wrote: > This was in my email twice from last night.  I am suspicious of this cron, but unsure if (part of) cause or effect really. >  > `12:40 AM (7 hours...
[15:01:45] <wikibugs>	 DBA, Cloud-Services, Operations: db1009 overloaded - https://phabricator.wikimedia.org/T188589#4014023 (chasemp)
[15:15:53] <wikibugs>	 DBA, Cloud-Services, Operations: db1009 overloaded - https://phabricator.wikimedia.org/T188589#4014054 (Marostegui) I agree with Jaime here - it is key to find what is causing this overload. Even though we have to replace this old host, we really need to find out what is causing this overload, otherw...
[15:23:53] <wikibugs>	 DBA, MediaWiki-Platform-Team (MWPT-Q3-Jan-Mar-2018), Patch-For-Review, Schema-change: Fix WMF schemas to not break when comment store goes WRITE_NEW - https://phabricator.wikimedia.org/T187089#4014082 (Marostegui)
[15:29:00] <wikibugs>	 DBA, Data-Services, Patch-For-Review: labsdb1010 crashed - https://phabricator.wikimedia.org/T186579#4014114 (Marostegui) Is it worth fixing the current s7 replication breakage on labsdb1010, or are you planning to put it down soon?
[15:32:03] <wikibugs>	 DBA, Data-Services, Patch-For-Review: labsdb1010 crashed - https://phabricator.wikimedia.org/T186579#4014122 (jcrespo) The replication catch-up and copy will take at the very least >24 hours, always assuming there was no data loss forcing us to start from the beginning.
[15:34:53] <wikibugs>	 DBA, Data-Services, Patch-For-Review: labsdb1010 crashed - https://phabricator.wikimedia.org/T186579#4014127 (Marostegui) >>! In T186579#4014122, @jcrespo wrote: > The replication catch-up and copy will take at the very least >24 hours, always assuming there was no data loss forcing us to start from...
[15:57:16] <wikibugs>	 DBA, Data-Services, Patch-For-Review: labsdb1010 crashed - https://phabricator.wikimedia.org/T186579#4014182 (jcrespo) > if it is worth to fix it or not  I don't know, that is why I was giving information on how much time is left for labsdb1010 to be restored (>24 hours).
[16:00:43] <wikibugs>	 DBA, Data-Services, Patch-For-Review: labsdb1010 crashed - https://phabricator.wikimedia.org/T186579#4014189 (Marostegui) >>! In T186579#4014182, @jcrespo wrote: >> if it is worth to fix it or not >  > I don't know, that is why I was giving information on how much time is left for labsdb1010 to be re...
[16:02:40] <wikibugs>	 DBA, Data-Services, Patch-For-Review: labsdb1010 crashed - https://phabricator.wikimedia.org/T186579#4014192 (jcrespo) I plan to transfer the compressed package first- I am not 100% sure it is in a good state- will try to start decompressing it without deleting anything for as long as there is space...
[16:12:57] <wikibugs>	 DBA, Data-Services, Patch-For-Review: labsdb1010 crashed - https://phabricator.wikimedia.org/T186579#4014225 (Marostegui) replication is now flowing on s7
[16:13:18] <wikibugs>	 DBA, Data-Services, Patch-For-Review: labsdb1010 crashed - https://phabricator.wikimedia.org/T186579#4014226 (jcrespo) Thanks
[16:24:30] <wikibugs>	 Blocked-on-schema-change, DBA, Patch-For-Review, Schema-change: Perform schema change to add externallinks.el_index_60 to all wikis - https://phabricator.wikimedia.org/T153182#4014286 (Marostegui) s2: pending the master, which I will alter on Monday
[16:24:35] <wikibugs>	 Blocked-on-schema-change, DBA, MediaWiki-Database, Multi-Content-Revisions, and 3 others: Schema change to prepare for dropping archive.ar_text and archive.ar_flags - https://phabricator.wikimedia.org/T185128#4014288 (Marostegui) s2: pending the master, which I will alter on Monday
[16:25:25] <wikibugs>	 DBA, MediaWiki-Platform-Team (MWPT-Q3-Jan-Mar-2018), Patch-For-Review, Schema-change: Fix WMF schemas to not break when comment store goes WRITE_NEW - https://phabricator.wikimedia.org/T187089#4014291 (Marostegui) s2 is finished. Only pending the master, which will be done during the DC switch.
[16:28:27] <wikibugs>	 DBA, Patch-For-Review: Prepare and indicate proper master db failover candidates for all eqiad database sections (s1-s8, x1) - https://phabricator.wikimedia.org/T186321#4014298 (Marostegui)
[16:40:29] <wikibugs>	 DBA, Cloud-Services, Operations: db1009 overloaded by idle connections to the nova database - https://phabricator.wikimedia.org/T188589#4014352 (bd808)
[16:46:52] <wikibugs>	 DBA, Operations, Patch-For-Review: Setup newer machines and replace all old misc (m*) and x1 eqiad machines - https://phabricator.wikimedia.org/T183469#4014404 (Marostegui) I would like to do the following to be able to replace db1009.  Get db1114 (512G) to replace db1073 (160G API in s1) as API in s...
[16:47:07] <wikibugs>	 DBA, Cloud-Services, Operations: db1009 overloaded by idle connections to the nova database - https://phabricator.wikimedia.org/T188589#4014408 (jcrespo) "by idle connections to the nova database"  I don't think that is accurate- that is making things worse, but probably not the root cause. max_conne...
[16:49:02] <wikibugs>	 DBA, Patch-For-Review: Prepare and indicate proper master db failover candidates for all eqiad database sections (s1-s8, x1) - https://phabricator.wikimedia.org/T186321#4014426 (Marostegui)
[16:49:52] <wikibugs>	 DBA, Operations, Patch-For-Review: Setup newer machines and replace all old misc (m*) and x1 eqiad machines - https://phabricator.wikimedia.org/T183469#4014428 (jcrespo) I was thinking of making db1114 a multi-misc. But at this point, anything goes as long as we get rid of db1009- the problem is the...
[16:52:02] <wikibugs>	 DBA, Operations, Patch-For-Review: Setup newer machines and replace all old misc (m*) and x1 eqiad machines - https://phabricator.wikimedia.org/T183469#4014447 (jcrespo) We need to update the original plan: https://gerrit.wikimedia.org/r/#/c/399792/3/wmf-config/db-eqiad.php
[16:52:13] <wikibugs>	 DBA, Operations, Patch-For-Review: Setup newer machines and replace all old misc (m*) and x1 eqiad machines - https://phabricator.wikimedia.org/T183469#4014448 (Marostegui) >>! In T183469#4014428, @jcrespo wrote: > I was thinking of making db1114 a multi-misc. But at this point, anything goes as long...
[16:53:43] <wikibugs>	 DBA, Operations, Patch-For-Review: Setup newer machines and replace all old misc (m*) and x1 eqiad machines - https://phabricator.wikimedia.org/T183469#4014452 (jcrespo) +1. If you have time to update a db-eqiad.php, that would be great to know pending steps. If not, I will do it when I have the time...
[16:54:23] <wikibugs>	 DBA, Operations, Patch-For-Review: Setup newer machines and replace all old misc (m*) and x1 eqiad machines - https://phabricator.wikimedia.org/T183469#4014453 (Marostegui) >>! In T183469#4014452, @jcrespo wrote: > +1. If you have time to update a db-eqiad.php, that would be great to know pending ste...
[16:55:36] <wikibugs>	 DBA, Cloud-Services, Operations: db1009 overloaded by idle connections to the nova database - https://phabricator.wikimedia.org/T188589#4014473 (Andrew)  >  > Is this something that would be more safely done with `keystone-manage token_flush` or is that unrelated?  These are two different things.  To...
[16:59:06] <wikibugs>	 DBA, Operations, Patch-For-Review: Setup newer machines and replace all old misc (m*) and x1 eqiad machines - https://phabricator.wikimedia.org/T183469#4014490 (Marostegui) Let's go for mariadb 10.1+ stretch for db1114 so we can have a 10.1 as API in s1 - (we already have the two rc slaves as 10.1)
[17:18:51] <wikibugs>	 DBA, Operations, ops-eqiad: Degraded RAID on db1068 - https://phabricator.wikimedia.org/T188187#4014579 (Marostegui) Thanks Chris! ``` root@db1068:~# megacli -PDRbld -ShowProg -PhysDrv [32:2] -aALL  Rebuild Progress on Device at Enclosure 32, Slot 2 Completed 6% in 8 Minutes. ```
[17:23:09] <jynus>	 I am doing a full backup run on es2001, so we can see if something fails ahead of next week's programmed one
[17:23:41] <marostegui>	 cool!
[17:24:12] <jynus>	 things are looking good- but there will be typical errors like wrong hosts, permissions, etc
[17:24:47] <marostegui>	 yeah, the usual ones in that regard :)
[17:25:02] <jynus>	 there is 3.5T free, so that should be enough to leave it over the night
[17:26:58] <wikibugs>	 DBA, Operations, Patch-For-Review: Setup newer machines and replace all old misc (m*) and x1 eqiad machines - https://phabricator.wikimedia.org/T183469#4014605 (ops-monitoring-bot) Script wmf-auto-reimage was launched by marostegui on neodymium.eqiad.wmnet for hosts: ``` ['db1114.eqiad.wmnet'] ``` Th...
[17:44:34] <wikibugs>	 DBA, Operations, Patch-For-Review: Setup newer machines and replace all old misc (m*) and x1 eqiad machines - https://phabricator.wikimedia.org/T183469#4014652 (ops-monitoring-bot) Completed auto-reimage of hosts: ``` ['db1114.eqiad.wmnet'] ```  and were **ALL** successful.
[17:55:35] <jynus>	 copying data from dbstore1001 -> labsdb1010 ETA 4:18:25
[17:56:24] <jynus>	 I am not sure if all data will be on those 1.8 TB compressed, we'll see
[17:56:46] <jynus>	 that is why I didn't want to remove the data already
[19:13:48] <wikibugs>	 DBA, Data-Services: Re-institute query killer for the analytics WikiReplica - https://phabricator.wikimedia.org/T183983#4015262 (jcrespo) This is still happening, at least on labsdb1009:   ``` | 118438410 | s51772          | 10.64.37.14:50886 | enwiki_p           | Execute |  264724 | User sleep...
[19:14:57] <wikibugs>	 DBA, Data-Services: Re-institute query killer for the analytics WikiReplica - https://phabricator.wikimedia.org/T183983#4015263 (jcrespo) apparently not because bug, query killer was disabled.
[19:17:17] <wikibugs>	 DBA, MediaWiki-Platform-Team (MWPT-Q3-Jan-Mar-2018), Patch-For-Review, Schema-change: Fix WMF schemas to not break when comment store goes WRITE_NEW - https://phabricator.wikimedia.org/T187089#4015265 (Anomie) This seems to be moving along reasonably quickly so far, although s4 with the huge `com...
[19:55:44] <wikibugs>	 DBA, Operations, ops-eqiad: Degraded RAID on db1068 - https://phabricator.wikimedia.org/T188187#4015403 (Marostegui) It worked this time! ``` root@db1068:~# megacli -LDPDInfo -aAll  Adapter #0  Number of Virtual Disks: 1 Virtual Drive: 0 (Target Id: 0) Name                : RAID Level          : Prim...
[23:33:52] <wikibugs>	 DBA, Community-Tech, MediaWiki-extensions-GlobalPreferences, Patch-For-Review, Schema-change: DBA review for GlobalPreferences schema - https://phabricator.wikimedia.org/T184666#3891821 (Reedy) Ok, so let's see where we're at here.  This extension is basically copying the core user_preference...