[06:52:37] I am going to disable puppet on all databases and deploy https://gerrit.wikimedia.org/r/450314 [06:58:37] ok [07:02:51] good morning [07:04:00] I am going to reclone db1092 [07:04:19] ok, can you share your plan? [07:08:33] I find the perfect donor [07:08:53] I check which alerts has to be disabled [07:09:00] log the hosts [07:09:15] stop the mariadb on the done [07:09:19] *donor [07:09:25] *do the recloning* [07:09:33] restart mariadb on donor [07:09:41] "I find the perfect donor" is not a plan but a wish [07:09:41] https://tools.wmflabs.org/bash/quip/AWYZ3Is2fM03vZ1oCizI [07:09:55] find it, then report back [07:10:16] lol @legoktm [07:10:26] we start the db1108 patching a bit earlier [07:10:28] 9:30 [07:10:56] banyek: also, check if the donor needs kernel/mariadb upgrade once you've found which one you'll use [07:11:15] :)) [07:13:11] I'd say db1104 for a donor, because that host needs an upgrade (mariadb at least) [07:13:59] banyek: db1104 isn't part of s4 [07:14:45] 1092 is part of s8 [07:15:06] as well as db1104 [07:15:14] what look I wrong? [07:15:17] Oh sorry, I was confused with db1091 [07:15:18] marostegui: buuuuuh [07:15:29] you are fired, marostegui [07:15:46] how you dare to confuse db1091 and db1092! [07:15:59] we only have 200 of these! [07:16:01] * marostegui goes to assign all his tasks to jynus [07:16:19] * jynus goes to assign all his tasks to marostegui [07:16:42] * volans disable his phab account, just in case [07:16:55] * marostegui goes to assign either his tasks and jynus' tasks to banyek [07:17:17] banyek: db1104 looks good [07:17:21] * banyek banyek just start thinking about a new carrier as a barista [07:17:39] cool :) [07:17:58] I mute all the notifications for db1104 and db1092 as well [07:18:23] db1092 I believe are disabled - jaime did it yesterday when it crashed, but double check [07:18:50] That's my intention too, but better look twice [07:19:49] I did it on code [07:19:57] so it will need a commit to remove those [07:20:55] db1104 muted [07:20:58] db1108 muted [07:21:46] ping me if you reboot servers [07:21:54] because of puppet being disabled [07:23:32] ok [07:23:43] where can I check which dbproxy is for m4? [07:24:21] found it [07:30:42] dbproxy1004 and dbproxy1009 - I muted them in Icinga [07:36:28] did you see any issue with databases on eqiad? [07:36:37] I think the change is a noop everywhere [07:37:05] so I may deploy everywhere unless you see any traffic or connection error [07:37:55] Sorry, I was busy with the parsercache [07:38:00] I didn't see anything on logtash [07:38:18] Maybe deploy it on eqiad first? [07:38:35] I already did that, you were indeed distracted [07:40:08] I think I have like 30 tabs open now XD [07:41:19] we may have issues with network for es1014 [07:41:30] lots of binary log I/O errors [07:41:44] uhh [07:42:02] although they stopped at Sep 26 22:37:01 [07:42:39] it shows as unpollable for prometheus [07:48:16] sorry, I am back, a package arrived [07:48:29] I've reported it at https://phabricator.wikimedia.org/T201139#4621337 [08:03:27] legoktm: how are the: https://tools.wmflabs.org/bash/top selected? 
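For reference, the recloning plan sketched above (find a donor, silence alerts, stop MariaDB on the donor, copy the data, restart) boils down to roughly the following. This is only a sketch: transfer.py and its flags are the ones used later in this log, while the surrounding commands, hostnames and paths are illustrative assumptions rather than the exact procedure used.

```
# Sketch of the recloning plan discussed above: donor db1104, target db1092.
# Silencing Icinga notifications and downtiming the hosts is assumed to have
# been done already (see the conversation above).

# 1. Stop replication and MariaDB on the donor, and MariaDB on the target
ssh db1104.eqiad.wmnet 'mysql -e "STOP SLAVE"; systemctl stop mariadb'
ssh db1092.eqiad.wmnet 'systemctl stop mariadb'

# 2. Clear the target's old datadir and copy the donor's datadir over
ssh db1092.eqiad.wmnet 'rm -rf /srv/sqldata'
transfer.py --no-encrypt --no-checksum \
    db1104.eqiad.wmnet:/srv/sqldata db1092.eqiad.wmnet:/srv

# 3. Start MariaDB again on both hosts and resume replication
#    (with a cold datadir copy the replication coordinates travel with it)
ssh db1104.eqiad.wmnet 'systemctl start mariadb; mysql -e "START SLAVE"'
ssh db1092.eqiad.wmnet 'systemctl start mariadb; mysql -e "START SLAVE"'
```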
take a look at the one from jynus hahaha [08:09:06] 10Blocked-on-schema-change, 10MediaWiki-Change-tagging, 10MediaWiki-Database, 10Wikidata-Campsite, 10User-Ladsgroup: Schema change for adding indexes of ct_tag_id - https://phabricator.wikimedia.org/T203709 (10Marostegui) [08:20:45] maintenance of db1108 completed [08:21:08] elukey is checking if there will any problem emerge, but probably won't [08:21:18] I go back to db1092 recloning [08:21:25] banyek: good job! [08:21:58] first I'll upgrade the db1104 (db1092 has newer mysql than db1104 and so I upgrade the donor first) [08:21:59] to be fair, the abraham lincoln saying is quite true [08:22:31] everything looks good, thanks! [08:22:35] thank you [08:23:07] 10DBA, 10User-Banyek: Maintenance M4 cluster - https://phabricator.wikimedia.org/T205288 (10Banyek) 05Open>03Resolved The maintenance on M4 is completed [08:23:32] banyek: haproxy reloaded too? [08:23:57] I am not sure the proxy is related here, it is a master proxy, not a replica one? [08:24:40] it complained yesterday about db1107 being down [08:24:55] So I assume it would have complained about db1108 being down but as it was downtimed... [08:25:07] I dod not reload it, because the db1108 is a secondary host, so there was no failover [08:25:11] *did [08:25:14] *them [08:25:31] but I muted them [08:25:32] ah, db1108 is the replica, true [08:25:43] so nevermind! [08:27:26] the next new hire, we will have to troll him with running mysqld with the --run-faster parameter [08:30:23] 10DBA, 10Operations, 10ops-eqiad, 10Patch-For-Review, 10User-Banyek: db1092 crashed - BBU broken - https://phabricator.wikimedia.org/T205514 (10Banyek) The donor host will be db1104 for recloning, and I update that first [08:31:17] I think my favorite mysql parameter is --i-am-a-dummy [08:31:49] yes, that is the inspiration [08:31:56] but more absurd [08:32:27] first because a --run-faster option shouldn't exist, and if it did, it shoudl be de default [08:33:06] maybe you wouldn't want to enable it to force developers to optimize their queries and not only rely on —run-faster [08:33:19] do you remember the 'TURBO' switch on the IBM AT-s? [08:33:52] but that slowed down the server to make it more original IBM pc compatible, it didn't make it faster [08:34:04] to be XT compatible [08:35:16] https://media1.tenor.com/images/992ace05caa1ba90852787824f17d1da/tenor.gif [08:35:45] in gawker on the cli tools we made always used '--iddqd' instead of '--force' it felt way better to type it. :( [08:35:58] and not idbehold?! [08:36:32] HAH I didn't knew that [08:36:51] stopping replication on db1104 [08:37:25] I always remember iddqd, idkfa and idbehold [08:37:38] stopping mariadb on db1104 [08:39:41] when I worked as an office IT guy I decided that it is time to quit when I was typing the windows and office keys from my head because I remembered them :( [08:40:01] (actually later I used those keys as passwords) [08:40:09] mariadb is stopped [08:40:57] CRITICAL 2018-09-27 08:39:57 0d 0h 0m 31s 1/3 We could not find any completed backup for s8 at eqiad [08:43:41] I've just invented Test-driven sysoping. I create the icinga check before the work is done or the hardware purchased^ [08:43:47] 10Blocked-on-schema-change, 10MediaWiki-Change-tagging, 10MediaWiki-Database, 10Patch-For-Review: Schema change for ct_tag_id field to change_tag - https://phabricator.wikimedia.org/T195193 (10Marostegui) labtestwiki wasn't done. I just did it. 
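On the --run-faster joke above: --i-am-a-dummy really does exist, and it is simply an alias for the mysql client's --safe-updates option. A minimal illustration (database and table names here are placeholders):

```
# With --i-am-a-dummy (= --safe-updates) the client rejects UPDATE/DELETE
# statements that have neither a key-based WHERE clause nor a LIMIT, and
# caps SELECT result sizes.
mysql --i-am-a-dummy mydb -e "DELETE FROM mytable"
# fails with something like:
# ERROR 1175 (HY000): You are using safe update mode and you tried to update
# a table without a WHERE that uses a KEY column
```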
[08:44:15] 10Blocked-on-schema-change, 10MediaWiki-Change-tagging, 10MediaWiki-Database, 10Wikidata-Campsite, 10User-Ladsgroup: Schema change for adding indexes of ct_tag_id - https://phabricator.wikimedia.org/T203709 (10Marostegui) [08:44:20] I am going to now deliver conferences and a whole movement about that [08:44:33] updates installed, now I am rebooting the host db1104 [08:44:57] marostegui: banyek: https://icinga.wikimedia.org/cgi-bin/icinga/status.cgi?host=db1115 [08:45:03] Just saw it :-) [08:45:31] Goal completed! :p [08:45:40] lol [08:46:02] hm did I broke something? [08:46:09] ah no [08:46:49] db1104 is back [08:46:58] (I mean the host) [08:49:25] marostegui: I would honestly remove the "Backups for s2 at eqiad are up to date" [08:49:34] it is redundant to the name [08:49:55] and mostly to the "color" [08:50:10] what do you mean? [08:50:25] if it is green, is an explanation why, if it is yellow or red, it is an explanation of what failed [08:50:30] ah [08:50:54] what do you think about making the text more readable? [08:50:59] the upgrade finished, not I am starting to reclone [08:51:09] don't know, I don't think it hurts to leave an explanation of why it is green [08:51:38] 10Blocked-on-schema-change, 10MediaWiki-Change-tagging, 10MediaWiki-Database, 10Wikidata-Campsite, 10User-Ladsgroup: Schema change for adding indexes of ct_tag_id - https://phabricator.wikimedia.org/T203709 (10Marostegui) [08:51:59] what about making it more human- time in a human readable format "7 days ago" and size like "128G" [08:52:17] Yeah, that also works [08:52:30] Or "backups within limits, 7 days ago and 128G" [08:52:32] or something like that [08:52:59] I don't like "backups within limits" (which limits?) [08:53:10] a week older [08:53:11] maybe "fresh" or not too old [08:53:13] or something [08:53:24] "recent" [08:53:53] recent to me is similar to "up to date" :) [08:53:55] Backups for m1 at codfw are recent: Last one taken 2 days ago (200GB) [08:54:18] I accept alternative suggetions [08:54:19] yeah, but then we need to express what is the limit, is 6 days recent? [08:54:36] "Last one taken 2 days ago, expected not older than 8" [08:54:38] or something like that? [08:54:42] backup for m1 at codfw is newer than 8 days [08:54:49] that works [08:54:52] yea^ [08:55:03] last one taken 3 days ago (4GB) [08:55:52] last one taken from dbstore1001 3 days ago (4GB) [08:56:06] not sure about the size though [08:56:16] if fails if size is too small [08:56:18] As I have no idea if I should expect 100GB or 190GB [08:56:28] Yeah, I mean on the description [08:56:32] backup for m1 at codfw is newer than 8 days and larger than 10GB [08:56:37] yeah [08:56:54] that would give a better idea, to me at least [08:59:37] stopping slave on db1104 again [08:59:45] mariadb instances shutting down [08:59:57] banyek: you didn't depool db1104? [09:01:29] fsck [09:01:30] no [09:01:36] db1092 is alerting [09:02:11] now on -operations [09:02:40] didn't it have notificaitons disabled? [09:03:28] it did [09:03:37] it has them disabled for everything but not for the lag, io and sql thread :| [09:03:39] aparrently something happened, maybe puppet restarte it? [09:04:11] but it should have restarted all the notifications, not just those 3 [09:05:52] you can see ir for yourself cat hieradata/hosts/db1092.yaml [09:05:58] yeah I did [09:06:03] banyek: maybe you enabled them by mistake? 
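The backup-freshness wording discussed earlier ends up looking like this. The script name, flags and the sample output line below are taken from later in this log; the particular flag combination shown is assembled from those examples rather than quoted verbatim.

```
# Manually running the Icinga backup-freshness check for one section:
sudo -u nagios python3 check_mariadb_backups.py -s s5 -d eqiad -f1000000
# Expected OK output, in the agreed human-readable form:
# Backup for s5 at eqiad taken less than 8 days ago and larger than 10 GB:
# Last one 2018-09-25 23:14:00 from db1102.eqiad.wmnet:3315 (54 GB)
```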
[09:06:06] even if it was restarted, by default, all services shoudl start them [09:06:09] with them [09:06:18] I only see a manual enable or something [09:06:26] or an icinga bug [09:06:51] Modified Attributes: notifications_enabled [09:06:57] so it was a manual enable [09:07:30] the others have "Modified Attributes: None" with it disabled [09:10:16] I didn't touched the notifications on db1092 at all [09:13:40] that's weird, only those 3 are enabled [09:15:40] I mean, not that our icinga doesn't have know bugs [09:15:47] but yes, those are very specific [09:16:17] however, if it was one of us, we would have changed the read only and mysql process as we do many times [09:17:59] I am checking icinga logs for db1092, but don't see anything [09:18:10] (relevant) [09:19:26] me neither [09:19:30] commands are not logged [09:19:35] only status changes [09:56:48] I silenced db1092 until tomorrow [09:57:19] (I don't know how long the reclone will be because of the cache battery error) [09:57:36] shouldn't take too long, maybe 2h or 3h [09:58:05] I am also thinking on enabling the write cache regardless the state of the BBU - the host is out from business anyways, so I can enable it for the reclone and disable after [09:58:30] I don't think it will have so much impact on the transfer [09:58:50] Just check the transfer rate and then do some calculations about how long it will take, so we can see, but I assume 2-3h of transfer [09:58:59] ok [09:59:29] I also remove the sqldata directory on db1092 from /srv, because it seems it doesn't have enough free space to have that duplicated [09:59:40] yep [10:03:18] I also expanded the downtime for db1104 - I'll finish it when repool [10:05:31] stopping slave on db1104 [10:06:27] mariadb stopping on db1092 and db1104 [10:07:52] removing sqldata directory on db1092 [10:08:11] (not enough space to have it twice) [10:09:41] running ```transfer.py --no-encrypt --no-checksum db1104.eqiad.wmnet:/srv/sqldata db1092.eqiad.wmnet:/srv``` [10:10:49] the copy started [10:11:21] I think I'll have a break soon, and continue after the clone [10:29:44] "Backup for s5 at eqiad taken less than 8 days ago and larger than 10 GB: Last one 2018-09-25 23:14:00 from db1102.eqiad.wmnet:3315 (54 GB)" https://gerrit.wikimedia.org/r/463228 [10:36:04] 10DBA, 10Patch-For-Review: Create Icinga alerts on backup generation failure - https://phabricator.wikimedia.org/T203969 (10jcrespo) ``` root@db1115:~$ for section in s1 s2 s3 s4 s5 s6 s7 s8 x1 m1 m2 m3 m5; do sudo -u nagios python3 check_mariadb_backups.py -s $section -d codfw -f1000000; done Backup for s1 at... [10:51:18] did we have a spare db? could we maybe setup backups for the remaining hosts on that idle one until we get more hardware? 
[10:53:51] yes we do [10:54:17] https://phabricator.wikimedia.org/T196376 [10:54:24] either that or db1118 :-) [11:29:53] I think I am goint to take that, leave db1118 for mariadb 10.3 testing [11:30:15] there is more hw coming anyway soon [11:30:24] so no reason to keep it idle [11:31:38] there is still some TBs available on dbstore1002 for more backups [11:31:43] *dbstore1001 [11:36:12] still copying [11:36:21] however it's almost done [11:49:37] the copy finished [12:01:08] db1092 replicating again [12:02:06] db1104 replicating again [12:12:52] 10DBA, 10Operations, 10ops-eqiad, 10Patch-For-Review, 10User-Banyek: db1092 crashed - BBU broken - https://phabricator.wikimedia.org/T205514 (10Banyek) recloning finished, the hosts are replicating again [12:15:54] 10DBA, 10Goal: Monitor backup generation for failure or incorrect generation - https://phabricator.wikimedia.org/T198447 (10jcrespo) [12:15:59] 10DBA, 10Patch-For-Review: Create Icinga alerts on backup generation failure - https://phabricator.wikimedia.org/T203969 (10jcrespo) 05Open>03Resolved a:03jcrespo Some icinga checks are on db1115, everthing will need refactoring soon, but ok for a first iteration. [12:23:38] marostegui: fyi there was a small increase in the rate of those PC rejections 11:00 and 11:11, but I guess all is fine, it went from around 5k per minuite to 8k per min and dropped again [12:27:48] I need to vanish again for like 15 minutes brb [12:28:07] * banyek runs errands [12:35:54] addshore: all fine on this front [12:38:12] 10DBA, 10Goal: Monitor backup generation for failure or incorrect generation - https://phabricator.wikimedia.org/T198447 (10jcrespo) [12:38:17] 10DBA, 10Patch-For-Review: Gather statistics about the backups on a database - https://phabricator.wikimedia.org/T198987 (10jcrespo) 05Open>03Resolved The zarcillo database (temporary code name for what will be the tendril replacement) now has 3 extra tables: - backups - backup_files - backup_objects (not... [12:40:24] As a homework, we need to add an index to optimize https://phabricator.wikimedia.org/P7598 [12:40:54] haha homework [12:41:18] give a guess and have a chance to win huge prizes! [12:41:29] XDDDDDDD [12:41:45] where's the ceremony? percona live? [12:42:40] I do this all day, so I am tired [12:44:26] I am not sure (X, Y, Z, start_date) will really work [12:44:33] because of the conditions [12:44:46] except with some specific optimizations [12:45:15] we could do start_date only inverse, as normally the scaning would be quite low [12:45:37] but it would be bad for queries with no results [12:59:26] can you do better than a 4th grade DBA? https://phabricator.wikimedia.org/P7598#44446 [13:01:09] 10DBA, 10JADE, 10Operations, 10MW-1.32-release-notes (WMF-deploy-2018-09-25 (1.32.0-wmf.23)), and 3 others: Write our anticipated "phase two" schemas and submit for review - https://phabricator.wikimedia.org/T202596 (10Marostegui) >>! In T202596#4604545, @awight wrote: > Thanks for all the attention given... [13:16:04] 10DBA, 10Patch-For-Review: Create Icinga alerts on backup generation failure - https://phabricator.wikimedia.org/T203969 (10jcrespo) Index optimization: P7598#44446 [13:21:24] BTW, banyek proxysql was considered, but rejected for labs due to being an L7 proxy [13:21:36] while we needed something more transparent [13:21:40] as we don't control the client [13:22:03] so we would face client issues - we do everytime we upgrade mariadb! 
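On the index "homework" above: the actual query and the winning answer live in P7598 and P7598#44446, which are not quoted in this log, so everything below is hypothetical apart from start_date and the zarcillo backups table; it only illustrates the (X, Y, Z, start_date) shape being debated.

```
# Hypothetical composite index with the equality-filtered columns first and
# the range column (start_date) last. col_x/col_y/col_z are placeholders for
# whatever columns the real query in P7598 filters on.
mysql zarcillo -e "
  ALTER TABLE backups
    ADD INDEX hypothetical_idx (col_x, col_y, col_z, start_date);
"
```

As noted in the conversation, whether such an index actually helps depends on the query's conditions: if the leading columns are not all equality predicates, the trailing start_date part may go unused, which is why a reverse start_date-only index was also floated.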
[13:22:40] mostly because people don't upgrade drivers and stuff [13:24:40] :( Ok, that makes sense [13:25:10] that is why it is more likely to use it on production than on labs [13:26:32] hm [13:26:45] the other thing I am wondering is the etcd [13:27:13] ? [13:27:15] if can change the service to pull some config down from etcd and then we can attach those roles to the server [13:27:58] I just don't think it is worth it, I think those servers are upgraded twice a year at most [13:30:23] it would only worth in that way if etcd could be the single source of thruth [13:30:25] truth [13:47:51] I forgot to merge the repool : [13:47:53] :/ [13:48:00] ah right [13:48:01] I was asking [13:48:04] :) [13:49:38] http://buttersafe.com/comics/2008-10-23-Detour.jpg [13:59:43] 10DBA, 10Patch-For-Review: Productionize old/temporary eqiad sanitariums - https://phabricator.wikimedia.org/T196376 (10jcrespo) a:03jcrespo I am going to take the spare host and use it to finish eqiad backups (temporary) setup. [13:59:58] 10DBA, 10DC-Ops, 10Operations, 10ops-eqiad, 10User-Banyek: db1069 has errored disk in slot 7 - https://phabricator.wikimedia.org/T205253 (10Banyek) [14:02:13] iirc we got the green light from bstorm, but I am not sure if I want to deploy the change today, not even on friday is that ok, to postpone it to Monday? https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/458810/ [14:03:07] sounds sane to me [14:03:33] I also commented that maybe even disable puppet on the three hosts, merge and then run it manually on one, check the output and all that [14:03:47] yes, I've seen that as well, and agree [14:04:02] I think I start with this on Monday [14:14:43] 10DBA, 10Analytics, 10User-Banyek: dbstore1002 /srv filling up - https://phabricator.wikimedia.org/T205544 (10Banyek) [14:20:04] 10DBA, 10Analytics, 10User-Banyek: dbstore1002 /srv filling up - https://phabricator.wikimedia.org/T205544 (10Banyek) I start the conversion with /srv/sqldata/srwiki/pagelinks.ibd it is 51G let's see what we get at the end [14:29:07] 10DBA: Document clearly the mariadb backup and recovery setup - https://phabricator.wikimedia.org/T205626 (10jcrespo) p:05Triage>03Normal [14:31:09] 10DBA: Document clearly the mariadb backup and recovery setup - https://phabricator.wikimedia.org/T205626 (10jcrespo) [14:36:18] 10DBA: Purge old metadata for the mariadb backups database - https://phabricator.wikimedia.org/T205627 (10jcrespo) p:05Triage>03Normal [14:43:32] 10DBA: Handle object metadata backups and compare it with stored database object inventory - https://phabricator.wikimedia.org/T205628 (10jcrespo) p:05Triage>03Normal [14:44:03] 10DBA: Handle object metadata backups and compare it with stored database object inventory - https://phabricator.wikimedia.org/T205628 (10jcrespo) [14:44:12] 10DBA, 10Datasets-General-or-Unknown, 10Patch-For-Review, 10Release-Engineering-Team (Watching / External), and 2 others: Automate the check and fix of object, schema and data drifts between mediawiki HEAD, production masters and slaves - https://phabricator.wikimedia.org/T104459 (10jcrespo) [14:50:51] I muted the replication lag for s3 in dbstore1002 as long as the tokuDB conversion runs. [14:51:43] Today I'll leave at 5 but I'll check the state of the conversion later. 
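For context on the dbstore1002 work above: the "conversion" started on srwiki's pagelinks (its .ibd file is about 51G) is the TokuDB conversion referred to a bit further down, i.e. rebuilding the table with a compressing engine to free space on /srv. A minimal sketch of what that looks like; the session option is an assumption, and compression settings are left at their defaults.

```
# Rebuild the table with the TokuDB engine on the local replica only.
# sql_log_bin=0 (an assumption) keeps the ALTER out of the binary log so it
# stays local to dbstore1002; the rebuild itself is just an engine change.
mysql srwiki -e "SET SESSION sql_log_bin = 0; ALTER TABLE pagelinks ENGINE=TokuDB;"
```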
[14:56:11] 10DBA, 10Epic, 10Wikimedia-Incident: Improve regular production database backups handling - https://phabricator.wikimedia.org/T138562 (10jcrespo) [14:56:13] 10DBA, 10Goal: Monitor backup generation for failure or incorrect generation - https://phabricator.wikimedia.org/T198447 (10jcrespo) 05Open>03Resolved a:03jcrespo With the disclaimers that {T205626} {T205627} and {T205628} were not done (but also were not part of the original scope either), we do have a... [15:02:06] bye [17:05:58] 10DBA, 10DC-Ops, 10Operations, 10ops-eqiad, 10User-Banyek: db1069 has errored disk in slot 7 - https://phabricator.wikimedia.org/T205253 (10Cmjohnson) @Marostegui The disk on slot 7 has been replaced, please resolve after rebuild [17:06:27] marostegui: if you log in, there are up and down arrows to vote with :) [17:06:39] 10DBA, 10Operations, 10ops-eqiad, 10Patch-For-Review, 10User-Banyek: db1092 crashed - BBU broken - https://phabricator.wikimedia.org/T205514 (10Cmjohnson) the HP required AHS log has been uploaded to their dropbox. Waiting on their response. [17:06:52] 10DBA, 10DC-Ops, 10Operations, 10ops-eqiad, 10User-Banyek: db1069 has errored disk in slot 7 - https://phabricator.wikimedia.org/T205253 (10Banyek) thanks, I'll look after this [17:07:02] thanks Banyek! [17:07:28] I thank YOU Chris :) [18:38:29] Lag on dbstore1002 is gone. [20:04:43] 10DBA: Handle object metadata backups and compare it with stored database object inventory - https://phabricator.wikimedia.org/T205628 (10Peachey88) [21:12:25] 10DBA, 10Core Platform Team, 10SDC Engineering, 10Wikidata, and 4 others: Deploy MCR storage layer - https://phabricator.wikimedia.org/T174044 (10Jdforrester-WMF)