[01:10:53] DBA, Operations, hardware-requests, ops-codfw, Patch-For-Review: Decommission db2012 - https://phabricator.wikimedia.org/T187543#4138212 (Papaul)
[01:11:11] DBA, Operations, Patch-For-Review: Setup newer machines and replace all old misc (m*) and x1 codfw machines - https://phabricator.wikimedia.org/T183470#4138214 (Papaul)
[01:11:16] DBA, Operations, hardware-requests, ops-codfw, Patch-For-Review: Decommission db2012 - https://phabricator.wikimedia.org/T187543#3978461 (Papaul) Open>Resolved
[04:56:01] DBA, Operations, ops-codfw, Patch-For-Review: Move masters away from codfw C6 - https://phabricator.wikimedia.org/T191193#4138339 (Marostegui) @jcrespo what do you feel is being missed?
[05:07:04] DBA, Operations, netops, ops-eqiad, Patch-For-Review: db1114 connection issues - https://phabricator.wikimedia.org/T191996#4138360 (Marostegui) Changing the network cable didn't have any effect. Errors are still there
[05:11:02] DBA, Operations, netops, ops-eqiad, Patch-For-Review: db1114 connection issues - https://phabricator.wikimedia.org/T191996#4138365 (Marostegui)
[05:22:08] DBA, Operations, netops, ops-eqiad, Patch-For-Review: db1114 connection issues - https://phabricator.wikimedia.org/T191996#4138369 (Marostegui) RX buffers changed: ``` root@db1114:~# ethtool -g eno1 Ring parameters for eno1: Pre-set maximums: RX: 2047 RX Mini: 0 RX Jumbo: 0 TX: 511 Current...
[05:25:00] DBA, Operations, netops, ops-eqiad, Patch-For-Review: db1114 connection issues - https://phabricator.wikimedia.org/T191996#4138370 (Marostegui)
[05:30:22] Blocked-on-schema-change, DBA, MediaWiki-Database, Multi-Content-Revisions, and 3 others: Schema change to prepare for dropping archive.ar_text and archive.ar_flags - https://phabricator.wikimedia.org/T185128#4138372 (Marostegui) Open>Resolved This is all done!
[05:30:41] DBA, MediaWiki-API, MediaWiki-Database, MW-1.29-release-notes, and 3 others: ApiQueryExtLinksUsage::run query has crazy limit - https://phabricator.wikimedia.org/T59176#4138377 (Marostegui)
[05:30:44] Blocked-on-schema-change, DBA, Patch-For-Review, Schema-change: Perform schema change to add externallinks.el_index_60 to all wikis - https://phabricator.wikimedia.org/T153182#4138375 (Marostegui) Open>Resolved All done!
[05:38:14] Blocked-on-schema-change, DBA, User-Ladsgroup, Wikidata-Ministry-Of-Magic: Schema change for rc_namespace_title_timestamp index - https://phabricator.wikimedia.org/T191519#4138402 (Marostegui) a:Marostegui
[05:55:44] Blocked-on-schema-change, DBA, User-Ladsgroup, Wikidata-Ministry-Of-Magic: Schema change for rc_namespace_title_timestamp index - https://phabricator.wikimedia.org/T191519#4138406 (Marostegui)
[05:55:58] Blocked-on-schema-change, DBA, Data-Services, MediaWiki-Platform-Team (MWPT-Q4-Apr-Jun-2018): Schema change for refactored actor storage - https://phabricator.wikimedia.org/T188299#4138407 (Marostegui)
[05:56:11] Blocked-on-schema-change, DBA, Multi-Content-Revisions, User-Addshore: Change DEFAULT 0 for rev_text_id on production DBs - https://phabricator.wikimedia.org/T190148#4138408 (Marostegui)
[06:11:26] DBA, Operations, netops, ops-eqiad, Patch-For-Review: db1114 connection issues - https://phabricator.wikimedia.org/T191996#4138424 (Marostegui) For the record, this is the number of dropped packets per server, for all the servers that are on that switch: ``` ores1008 RX errors 0 dropp...
[06:32:02] DBA, Operations, netops, ops-eqiad, Patch-For-Review: db1114 connection issues - https://phabricator.wikimedia.org/T191996#4138427 (Marostegui)
[06:58:23] DBA, Operations, ops-codfw, Patch-For-Review: Move masters away from codfw C6 - https://phabricator.wikimedia.org/T191193#4138448 (jcrespo) Were the right interfaces disabled after the revert?
[07:00:00] DBA, Operations, ops-codfw, Patch-For-Review: Move masters away from codfw C6 - https://phabricator.wikimedia.org/T191193#4138459 (Marostegui) >>! In T191193#4138448, @jcrespo wrote: > Were the right interfaces disabled after the revert? Yeah: >>! In T191193#4136764, @ayounsi wrote: > asw-c6-...
[07:03:00] DBA, Operations, ops-codfw, Patch-For-Review: Move masters away from codfw C6 - https://phabricator.wikimedia.org/T191193#4138462 (jcrespo) Open>Resolved Okay, I feel we should check what went wrong (was it the clarity of the communication, was it a one-time mistake that will unlikely hap...
[07:22:36] DBA, Operations, ops-codfw, Patch-For-Review: Move masters away from codfw C6 - https://phabricator.wikimedia.org/T191193#4138484 (jcrespo) For example, as a procedure, could activity be checked on the port before being disabled to check the host is down/moved away?
[07:24:16] DBA, Operations, ops-codfw, Patch-For-Review: Move masters away from codfw C6 - https://phabricator.wikimedia.org/T191193#4138486 (Marostegui) >>! In T191193#4138484, @jcrespo wrote: > For example, as a procedure, could activity be checked on the port before being disabled to check the host is do...
[07:45:31] <_joe_> jynus marostegui volans should we talk about db data on etcd?
[07:46:28] _joe_: can you give me 15 minutes?
[07:47:13] <_joe_> I wanted to schedule a time for an IRC talk
[07:47:20] <_joe_> not do it now necessarily
[07:47:29] <_joe_> actually, I'd prefer to take some time tomorrow
[07:48:04] ok, propose a time
[07:48:09] will answer later
[07:56:53] I have managed to stop the drops in db1114 apparently
[07:56:57] although connection errors are still there
[07:57:09] I will update the task with some more findings in a bit
[07:57:59] _joe_: sure, let me know when and I'll be there
[08:29:39] DBA, Operations, netops, ops-eqiad, Patch-For-Review: db1114 connection issues - https://phabricator.wikimedia.org/T191996#4138553 (Marostegui) >>! In T191996#4129814, @ayounsi wrote: > > And switch is now seeing received MAC pause frames. Which confirms that the server is receiving bursts of...
[08:32:01] DBA, Operations, netops, ops-eqiad, Patch-For-Review: db1114 connection issues - https://phabricator.wikimedia.org/T191996#4138555 (Marostegui)
[08:57:21] DBA, Operations, hardware-requests, ops-codfw, Patch-For-Review: Decommission db2011 - https://phabricator.wikimedia.org/T187886#4138587 (Marostegui) After the two servers that were decommissioned yesterday.
This is the last host to decommission in codfw as part of T176243 \o/
[09:05:38] Blocked-on-schema-change, DBA, Multi-Content-Revisions, User-Addshore: Change DEFAULT 0 for rev_text_id on production DBs - https://phabricator.wikimedia.org/T190148#4138615 (Marostegui)
[09:05:43] <_joe_> I sent an invitation for tomorrow
[09:05:50] Blocked-on-schema-change, DBA, User-Ladsgroup, Wikidata-Ministry-Of-Magic: Schema change for rc_namespace_title_timestamp index - https://phabricator.wikimedia.org/T191519#4138616 (Marostegui)
[09:06:05] Blocked-on-schema-change, DBA, Data-Services, MediaWiki-Platform-Team (MWPT-Q4-Apr-Jun-2018): Schema change for refactored actor storage - https://phabricator.wikimedia.org/T188299#4138617 (Marostegui)
[09:07:33] Blocked-on-schema-change, DBA, User-Ladsgroup, Wikidata-Ministry-Of-Magic: Schema change for rc_namespace_title_timestamp index - https://phabricator.wikimedia.org/T191519#4108322 (Marostegui) This is what the table looks like after the alter: ``` CREATE TABLE `recentchanges` ( `rc_id` int(8) N...
[09:11:18] Blocked-on-schema-change, DBA, Data-Services, MediaWiki-Platform-Team (MWPT-Q4-Apr-Jun-2018): Schema change for refactored actor storage - https://phabricator.wikimedia.org/T188299#4138635 (Marostegui) This is what the tables look like after all the schema changes: ``` ====Table:actor==== ********...
[09:42:14] I'm deleting things in small dblist, is it fine?
[09:42:48] fine by me
[10:13:56] DBA, Operations, netops, ops-eqiad, Patch-For-Review: db1114 connection issues - https://phabricator.wikimedia.org/T191996#4138764 (Marostegui) I have captured iostat output during two bursts of errors. There are some reads and a CPU spike in both of them, but nothing too worrying or too mass...
[10:15:35] DBA, Operations, netops, ops-eqiad, Patch-For-Review: db1114 connection issues - https://phabricator.wikimedia.org/T191996#4138768 (Marostegui)
[10:54:57] DBA, Operations, netops, ops-eqiad, Patch-For-Review: db1114 connection issues - https://phabricator.wikimedia.org/T191996#4138836 (Marostegui)
[10:55:44] DBA, Operations, netops, ops-eqiad, Patch-For-Review: db1114 connection issues - https://phabricator.wikimedia.org/T191996#4126641 (Marostegui)
[13:20:53] Blocked-on-schema-change, DBA, Multi-Content-Revisions, Patch-For-Review, User-Addshore: Change DEFAULT 0 for rev_text_id on production DBs - https://phabricator.wikimedia.org/T190148#4139166 (Marostegui) s5 eqiad progress: [] labsdb1009 [] labsdb1010 [] labsdb1011 [] db1095 [] dbstore1002 [...
[13:20:58] Blocked-on-schema-change, DBA, Patch-For-Review, User-Ladsgroup, Wikidata-Ministry-Of-Magic: Schema change for rc_namespace_title_timestamp index - https://phabricator.wikimedia.org/T191519#4139167 (Marostegui) s5 eqiad progress: [] labsdb1009 [] labsdb1010 [] labsdb1011 [] db1095 [] dbstore...
[13:21:13] Blocked-on-schema-change, DBA, Data-Services, MediaWiki-Platform-Team (MWPT-Q4-Apr-Jun-2018), Patch-For-Review: Schema change for refactored actor storage - https://phabricator.wikimedia.org/T188299#4139168 (Marostegui) s5 eqiad progress: [] labsdb1009 [] labsdb1010 [] labsdb1011 [] db1095 [...
[13:23:12] Blocked-on-schema-change, DBA, Multi-Content-Revisions, Patch-For-Review, User-Addshore: Change DEFAULT 0 for rev_text_id on production DBs - https://phabricator.wikimedia.org/T190148#4139173 (Marostegui)
[13:23:26] Blocked-on-schema-change, DBA, Patch-For-Review, User-Ladsgroup, Wikidata-Ministry-Of-Magic: Schema change for rc_namespace_title_timestamp index - https://phabricator.wikimedia.org/T191519#4139174 (Marostegui)
[13:23:46] Blocked-on-schema-change, DBA, Data-Services, MediaWiki-Platform-Team (MWPT-Q4-Apr-Jun-2018), Patch-For-Review: Schema change for refactored actor storage - https://phabricator.wikimedia.org/T188299#4139175 (Marostegui)
[13:32:56] DBA, Operations, netops, ops-eqiad, Patch-For-Review: db1114 connection issues - https://phabricator.wikimedia.org/T191996#4139205 (Marostegui) For the record, the irq for eno1 is balanced across CPUs, so I don't think it is the bottleneck here: ``` root@db1114:/srv/tmp# for i in `cat /proc/i...
[13:33:31] DBA, Operations, netops, ops-eqiad, Patch-For-Review: db1114 connection issues - https://phabricator.wikimedia.org/T191996#4139207 (Marostegui)
[14:03:01] Blocked-on-schema-change, DBA, Data-Services, MediaWiki-Platform-Team (MWPT-Q4-Apr-Jun-2018), Patch-For-Review: Schema change for refactored actor storage - https://phabricator.wikimedia.org/T188299#4139368 (Anomie) >>! In T188299#4138635, @Marostegui wrote: > This is how the tables look like...
[14:04:14] Blocked-on-schema-change, DBA, Data-Services, MediaWiki-Platform-Team (MWPT-Q4-Apr-Jun-2018), Patch-For-Review: Schema change for refactored actor storage - https://phabricator.wikimedia.org/T188299#4139381 (Marostegui) >>! In T188299#4139368, @Anomie wrote: >>>! In T188299#4138635, @Marosteg...
[14:24:48] DBA, Operations, netops, ops-eqiad, Patch-For-Review: db1114 connection issues - https://phabricator.wikimedia.org/T191996#4139429 (Marostegui) During the error spike I captured the CPU stats, and it is interesting to see that some sys or usr CPU get totally overloaded some seconds befor...
[14:41:29] DBA, Operations, netops, ops-eqiad, Patch-For-Review: db1114 connection issues - https://phabricator.wikimedia.org/T191996#4139494 (Marostegui) At the time of the errors (14:30:10), this is what I saw running for a couple of seconds before the errors: 14:30:06 ``` 9476 root 0 -20 29...
[15:23:51] marostegui: until we have proper rotated binary backups, should we delay dbstore200* hosts 1 day?
[15:24:07] Yeah, I wouldn't mind that
[15:24:40] it is not perfect, but it could be better than nothing?
[15:24:42] If not a day, maybe 6h or 12h
[15:24:48] until a proper solution is in place
[15:24:51] But I wouldn't mind a delay for now, no
[15:25:55] ... as long as it is not done with events
[15:26:03] +1000000
[15:26:04] XD
[15:43:42] Blocked-on-schema-change, DBA, Patch-For-Review, Schema-change: Perform schema change to add externallinks.el_index_60 to all wikis - https://phabricator.wikimedia.org/T153182#4139692 (Anomie) \o/ Thanks!
[15:48:14] DBA, Operations, netops, ops-eqiad, Patch-For-Review: db1114 connection issues - https://phabricator.wikimedia.org/T191996#4139734 (ayounsi) That's some great investigation! >>! In T191996#4138553, @Marostegui wrote: > @ayounsi does that mean that the switch is the one not being able to cope...
[15:50:55] DBA, Operations, netops, ops-eqiad, Patch-For-Review: db1114 connection issues - https://phabricator.wikimedia.org/T191996#4139750 (Marostegui) >>! In T191996#4139734, @ayounsi wrote: > That's some great investigation! > >>>! In T191996#4138553, @Marostegui wrote: >> @ayounsi does that mean...
[16:04:48] Blocked-on-schema-change, DBA, MediaWiki-Database, Multi-Content-Revisions, and 3 others: Schema change to prepare for dropping archive.ar_text and archive.ar_flags - https://phabricator.wikimedia.org/T185128#4139793 (Anomie) \o/ Thanks!
[16:31:05] DBA, Operations, ops-eqiad: Rack and setup db1116 - db1123 - https://phabricator.wikimedia.org/T191792#4139871 (Cmjohnson)
[19:43:27] DBA, Data-Services, Dumps-Generation, MediaWiki-Platform-Team, Patch-For-Review: Configure Toolforge replica views and dumps for the new MCR tables - https://phabricator.wikimedia.org/T184446#4140554 (Bstorm) When are these tables available in the prod dbs? Are they up now? I can run the ad...
[20:39:44] DBA, Operations, ops-eqiad, Patch-For-Review: Rack and setup db1116 - db1123 - https://phabricator.wikimedia.org/T191792#4140710 (Cmjohnson)
[23:52:49] DBA, Cloud-Services, Product-Analytics, Hindi-Sites, User-Jayprakash12345: Prepare and check storage layer for hiwikimedia - https://phabricator.wikimedia.org/T188490#4141412 (Jayprakash12345) stalled>Open
[23:53:14] DBA, Cloud-Services, Product-Analytics, User-Urbanecm: Prepare and check storage layer for hiwikimedia - https://phabricator.wikimedia.org/T189124#4141416 (Jayprakash12345)
[23:53:17] DBA, Cloud-Services, Product-Analytics, Hindi-Sites, User-Jayprakash12345: Prepare and check storage layer for hiwikimedia - https://phabricator.wikimedia.org/T188490#4141420 (Jayprakash12345)