[01:00:53] 10DBA, 10Flow, 03Collab-Team-Q1-July-Sep-2016, 13Patch-For-Review: Cleanup ptwikibooks conversion - https://phabricator.wikimedia.org/T119509#2661115 (10Mattflaschen-WMF)
[01:01:17] 10DBA, 10Flow, 03Collab-Team-Q1-July-Sep-2016, 13Patch-For-Review: Cleanup ptwikibooks conversion - https://phabricator.wikimedia.org/T119509#2321533 (10Mattflaschen-WMF)
[05:17:10] jynus, marostegui: all EventLogging tables seem to be lagging currently ... from some minutes up to >7 hours
[05:17:13] https://www.irccloud.com/pastebin/kFpNywXW/EventLogging%20tables%20update%20lags
[05:17:31] e.g. MobileWikiAppFeed_15734713 (the last on this list, with 7.5h since the last update) is an active table with 2023 events yesterday (Sept 22)
[05:36:24] ... Grafana on the other hand shows events still coming in with no interruptions, e.g. https://grafana.wikimedia.org/dashboard/db/eventlogging-schema?var-schema=SaveTiming
[05:36:33] CC milimetric ^
[06:39:52] 10DBA, 06Operations, 10ops-eqiad: db1060: Degraded RAID - https://phabricator.wikimedia.org/T146449#2661404 (10Marostegui)
[07:04:44] HaeB, https://gerrit.wikimedia.org/r/312471
[07:08:47] it will be fixed forever in a few minutes; I didn't realize that there was a deeper issue on dbstores
[07:08:56] jynus: cool... so it was just a master/slave replication issue?
[07:09:18] (should have mentioned that the pastebin was from analytics-store.eqiad.wmnet)
[07:10:17] 07Blocked-on-schema-change, 06Community-Tech, 13Patch-For-Review, 07Schema-change, 05WMF-deploy-2016-09-13_(1.28.0-wmf.19): Add local_user_id and global_user_id fields to localuser table in centralauth database - https://phabricator.wikimedia.org/T141951#2661430 (10Marostegui) 05Open>03Resolved
[07:11:11] HaeB, there was a permission issue- it was brought to my attention by Dan yesterday
[07:11:34] but I didn't realize that the puppet code was forcing the wrong config again
[07:11:44] now that I fixed puppet, it should never happen again
[07:12:12] there was a refactoring of code for an unrelated security issue
[07:12:24] so I broke some things by doing it in a rush
[07:12:34] but everything looks good now
[07:13:00] backfills are also working in case some events were skipped
[07:17:00] 10DBA, 10Beta-Cluster-Infrastructure, 06Release-Engineering-Team, 13Patch-For-Review, 07WorkType-Maintenance: Upgrade mariadb in deployment-prep from Precise/MariaDB 5.5 to Jessie/MariaDB 5.10 - https://phabricator.wikimedia.org/T138778#2661458 (10jcrespo) There was a refactoring of mariadb's puppet code...
[07:21:51] marostegui, check this patch I merged yesterday: https://gerrit.wikimedia.org/r/301076
[07:22:06] checking
[07:23:06] Wow, nice clean up
[07:35:39] 10DBA, 10MediaWiki-extensions-ORES, 10Revision-Scoring-As-A-Service-Backlog, 15User-Ladsgroup: Ensure ORES data violating constraints do not affect production - https://phabricator.wikimedia.org/T145356#2661490 (10jcrespo) Sorry, I checked and there is less than 1000 rows to delete per wiki; things should...
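(Editor's note: a minimal sketch of the kind of per-table lag check discussed above, assuming the usual EventLogging layout on analytics-store, that is, a "log" database with one table per schema revision and a 14-digit timestamp column. The table name is taken from the log itself; the rest is illustrative and not necessarily the exact query that was run.)
```
-- Estimate how stale one EventLogging table is on analytics-store.
-- MobileWikiAppFeed_15734713 appears in the conversation above; the query
-- shape is an assumption, not the original paste.
SELECT
    MAX(timestamp) AS last_event,
    TIMESTAMPDIFF(
        MINUTE,
        STR_TO_DATE(MAX(timestamp), '%Y%m%d%H%i%S'),
        UTC_TIMESTAMP()
    ) AS minutes_since_last_event
FROM log.MobileWikiAppFeed_15734713;
```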
[08:34:54] jynus, marostegui: I had a look at the mysqld_safe wrapper, the hardening changes are only missing on systems which are in the process of decommissioning: db1010 (T129395), es200[1-4] (T129452) and labsdb1002 (mentioned at T128821)
[08:34:55] T129452: Decommission es2001-es2010 - https://phabricator.wikimedia.org/T129452
[08:34:55] T128821: reclaim and return all cisco servers - https://phabricator.wikimedia.org/T128821
[08:34:55] T129395: Decommission db1010 - https://phabricator.wikimedia.org/T129395
[08:35:04] so I'll close the hardening task
[08:35:26] es2001-4 are not mysql systems
[08:35:26] moritzm: sounds good! thank you
[08:35:41] labsdb1002 does not exist
[08:37:24] db1010 should be down
[08:37:55] jynus: labs1002 is still up and running
[08:38:00] what?
[08:38:01] no
[08:38:08] it was literally unracked
[08:38:09] labsdb1002 I meant
[08:38:18] db1010 is running as well
[08:38:43] and db1010 also has a running mysqld (but probably not replicating etc.)
[08:38:56] apparently not
[08:41:31] ah, es200[2-4] are now prometheus hosts, but they still have mysqld running in the wmf-mariadb10 version, so should we apply mariadb::custom_mysqld_safe there as well?
[08:41:39] no
[08:41:59] 10DBA: Reimage dbstore2001 as jessie - https://phabricator.wikimedia.org/T146261#2661547 (10Marostegui) I have been educated by Elukey about the differences between the two scripts to reimage and we have realised that I need to be in the pwstore to be able to use Ricciardo's wrapper. Once my key is signed we wil...
[08:42:02] es2001-4 are hosts without roles
[08:42:37] they are there to store data on the filesystem, until we have more space for backups
[08:43:05] ok
[08:43:16] we can apply it there just in case
[08:43:34] but they should not have mysql running
[08:44:12] I will shut down the others
[08:44:32] which I supposed had been taken off the rack
[08:45:40] 10DBA: Reimage dbstore2001 as jessie - https://phabricator.wikimedia.org/T146261#2661565 (10Marostegui) ˜/jynus 9:28> my home only has an old backup when I reimported x1 there, can be killed now
[08:45:55] ok!
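(Editor's note: a minimal sketch, not a command from the log, of how one might confirm that a leftover mysqld instance such as the one found on db1010 is not actively replicating before shutting it down.)
```
-- Run against the stray instance; the interpretation comments are assumptions
-- about how such a check would be read, not output from the log.
SHOW SLAVE STATUS\G
-- An empty result set, or Slave_IO_Running and Slave_SQL_Running both 'No',
-- indicates the instance is not replicating from any master.
```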
[08:55:09] 10DBA, 06Operations: Decommission es2005-es2010 - https://phabricator.wikimedia.org/T129452#2661591 (10jcrespo)
[08:55:42] 10DBA, 06Operations, 10hardware-requests, 10ops-codfw, 13Patch-For-Review: Decommission es2005-es2010 - https://phabricator.wikimedia.org/T134755#2661595 (10jcrespo)
[08:55:44] 10DBA, 06Operations: Decommission es2005-es2010 - https://phabricator.wikimedia.org/T129452#2106136 (10jcrespo)
[08:56:27] This was wrong, I fixed it and merged (it was already done, and es2001-4 reused with the same name) https://phabricator.wikimedia.org/T129452
[09:19:07] https://gerrit.wikimedia.org/r/#/c/312485/ moritzm this should clarify the es2001-4 status
[09:20:03] thanks, +1ed in gerrit
[09:20:51] I have created https://phabricator.wikimedia.org/T146455 and will shut it down
[09:21:12] and I will shut down db1010
[09:21:30] both of which should no longer be there
[09:26:14] thanks, db1010 already has a ticket, BTW: https://phabricator.wikimedia.org/T129395
[09:26:49] 10DBA: Decommission db1010 - https://phabricator.wikimedia.org/T129395#2661696 (10jcrespo) a:03jcrespo
[12:08:13] sorry I should've sent an update when I saw the same thing yesterday, HaeB
[12:09:02] it should be fixed forever now
[12:09:11] I didn't realize my fix was only temporary
[12:09:19] sorry for the issues
[13:32:17] 10DBA, 06Operations, 10ops-eqiad: db1060: Degraded RAID - https://phabricator.wikimedia.org/T146449#2661404 (10Cmjohnson) Replaced the failed disk
[13:36:48] 10DBA, 06Operations, 10ops-eqiad: db1060: Degraded RAID - https://phabricator.wikimedia.org/T146449#2662173 (10Marostegui) Thanks - it is rebuilding now! ``` root@db1060:~# megacli -PDRbld -ShowProg -PhysDrv [32:4] -aALL Rebuild Progress on Device at Enclosure 32, Slot 4 Completed 2% in 5 Minutes. Exit C...
[13:39:01] 10DBA, 06Operations: Multiple pages with no revisions - https://phabricator.wikimedia.org/T112282#2662175 (10Aklapper) Does anyone plan to fix those pages? Or is this task rather low priority?
[13:48:53] 10DBA, 06Operations: Multiple pages with no revisions - https://phabricator.wikimedia.org/T112282#2662225 (10jcrespo) > My opinion on this is that there could be data loss, but all of them in 2012 or before, it just happens that the page was "touched" recently. This makes this issue less of an imminent problem...
[15:08:47] 10DBA, 06Operations: db1034 lag - https://phabricator.wikimedia.org/T139280#2662497 (10jcrespo) This was not hardware related. However, it is on the list of soon-to-decom servers. Stealing it for now.
[15:09:25] 10DBA, 06Operations, 13Patch-For-Review: Decommission old coredb machines (<=db1050) - https://phabricator.wikimedia.org/T134476#2662499 (10jcrespo)
[15:09:54] 10DBA, 06Operations: db1034 lag - https://phabricator.wikimedia.org/T139280#2662501 (10jcrespo)
[15:10:00] 10DBA, 06Operations, 13Patch-For-Review: Decommission old coredb machines (<=db1050) - https://phabricator.wikimedia.org/T134476#2266762 (10jcrespo)
[15:10:16] 10DBA, 06Operations: db1034 decommission - https://phabricator.wikimedia.org/T139280#2425354 (10jcrespo)
[18:26:33] 10DBA, 10Beta-Cluster-Infrastructure, 06Release-Engineering-Team, 13Patch-For-Review, 07WorkType-Maintenance: Upgrade mariadb in deployment-prep from Precise/MariaDB 5.5 to Jessie/MariaDB 5.10 - https://phabricator.wikimedia.org/T138778#2663057 (10dduvall) LGTM. Thanks, @jcrespo!
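(Editor's note: a minimal sketch of how pages without any revisions, the subject of T112282 above, can be located, assuming the standard MediaWiki schema of that era with page and revision tables joined on rev_page; not necessarily the query the DBAs used.)
```
-- List pages that have no corresponding rows in the revision table.
-- Run per wiki database; the query shape is an assumption for illustration.
SELECT p.page_id, p.page_namespace, p.page_title
FROM page AS p
LEFT JOIN revision AS r ON r.rev_page = p.page_id
WHERE r.rev_page IS NULL;
```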
[18:26:53] 10DBA, 06Performance-Team, 07Availability, 07Epic, and 2 others: MASTER_POS_WAIT() alternative that works cross-DC - https://phabricator.wikimedia.org/T135027#2663065 (10dduvall)
[18:26:55] 10DBA, 10Beta-Cluster-Infrastructure, 06Release-Engineering-Team, 13Patch-For-Review, 07WorkType-Maintenance: Upgrade mariadb in deployment-prep from Precise/MariaDB 5.5 to Jessie/MariaDB 5.10 - https://phabricator.wikimedia.org/T138778#2663062 (10dduvall) 05Open>03Resolved
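(Editor's note: an illustration of the problem behind T135027, not a solution taken from the log. MASTER_POS_WAIT() takes binlog file/offset coordinates that are only meaningful on one specific master, so they cannot be reused in another data centre; a GTID-based wait such as MariaDB 10's MASTER_GTID_WAIT() uses positions that stay valid on any replica of the same replication stream. The file name, offset and GTID below are made up for the example.)
```
-- Classic wait: tied to one master's binlog (hypothetical file and offset).
SELECT MASTER_POS_WAIT('db1052-bin.002345', 109817239, 10);
-- GTID-based wait: position is globally comparable across replicas
-- (hypothetical MariaDB GTID in domain-serverid-sequence form).
SELECT MASTER_GTID_WAIT('0-171970580-12345678', 10);
```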