[00:12:27] 10DBA, 10SDC Engineering, 10Wikidata, 10Core Platform Team Goals (MCR: Uncategorized), and 5 others: Deploy MCR storage layer - https://phabricator.wikimedia.org/T174044 (10CCicalese_WMF)
[02:45:56] 10DBA, 10SDC Engineering, 10Wikidata, 10Core Platform Team (MCR: Uncategorized), and 5 others: Deploy MCR storage layer - https://phabricator.wikimedia.org/T174044 (10CCicalese_WMF)
[03:15:43] 10DBA, 10SDC Engineering, 10Wikidata, 10Core Platform Team (MCR: Tech Debt), and 5 others: Deploy MCR storage layer - https://phabricator.wikimedia.org/T174044 (10CCicalese_WMF)
[05:20:45] 10DBA: Drop ct_ indexes on change_tag - https://phabricator.wikimedia.org/T205913 (10Marostegui) As I did with T203709 I will deploy this change on some enwiki slaves of codfw to see if we catch queries with any of those indexes hardcoded.
[07:04:43] Yesterday I debugged the wmf-pt-kill problem, and I can fix it easily
[07:04:45] yay
[07:05:05] https://phabricator.wikimedia.org/T203674
[07:21:10] 10DBA, 10Patch-For-Review: Drop ct_ indexes on change_tag - https://phabricator.wikimedia.org/T205913 (10Marostegui) s1 progress [] labsdb1011 [] labsdb1010 [] labsdb1009 [] dbstore2002 [] dbstore1002 [] dbstore1001 [] db2094 [] db2092 [] db2088 [] db2085 [] db2072 [] db2071 [] db2070 [] db2062 [] db2055 []...
[07:25:08] 10DBA, 10Patch-For-Review: Drop ct_ indexes on change_tag - https://phabricator.wikimedia.org/T205913 (10Marostegui)
[07:47:47] 10DBA, 10Operations, 10cloud-services-team, 10wikitech.wikimedia.org, 10Release-Engineering-Team (Watching / External): Move some wikis to s5 - https://phabricator.wikimedia.org/T184805 (10jcrespo) As of 2018-10-02: ``` ls -rw-r--r-- 1 dump dump 3.2G Sep 25 21:37 mgwiktionary.gz.tar -rw-r--r-- 1 dump...
[07:47:51] ^FYI
[07:49:41] Thanks - yesterday the change_tag issue made us realise there is another schema change needed: https://phabricator.wikimedia.org/T205913 so I am focusing on that today and probably tomorrow and thursday, so if you can lead the wiki movement, that'd be nice. I will obviously help, but if you can do the planning that'd help
[07:50:00] I am going to start with test-imporing those wikis at db1123
[07:50:05] *importing
[07:50:12] ah great <3
[07:50:25] to make a call if we can do it in the limited time available
[07:50:32] great
[07:50:37] thanks so much for taking it
[07:51:48] actually, not that one
[07:51:55] but one on s5
[07:52:33] db1110
[07:54:06] 10DBA, 10Patch-For-Review: Drop ct_ indexes on change_tag - https://phabricator.wikimedia.org/T205913 (10Marostegui)
[07:56:20] 10DBA, 10Patch-For-Review: Drop ct_ indexes on change_tag - https://phabricator.wikimedia.org/T205913 (10Marostegui) @Ladsgroup I have altered the following active slaves in codfw: db2088 - recentchanges db2071 - api+small main traffic db2072 - main traffic The idea is to monitor those in the next few hours...
[07:56:22] Amir1: ^
[07:56:28] qq for you guys - I'd like to add the prometheus mysqld exporter to analytic1003 and matomo1001, where we (analytics) are running mariadb. Then poll those exporters from the analytics prometheus instance, and then add a datasource in https://grafana.wikimedia.org/dashboard/db/mysql to reuse all these graphs. Is it reasonable?
[07:59:40] elukey: of course
[08:00:14] we should add them to tendril/zarcillo too
[08:00:45] elukey: you can add them as misc servers or create a new group for them
[08:01:14] jynus: thanks! I'll check later on, never played with tendril
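For context on the exporter question above, a minimal sketch of wiring prometheus-mysqld-exporter to a local mariadb instance. The monitoring account name, the password placeholder and the default exporter port 9104 are assumptions for illustration, not the actual analytics/puppet setup:

```
# Sketch only: manual install and run of the exporter; in production this
# would be puppetized, and the 'prometheus'@'localhost' account is hypothetical.
sudo apt-get install prometheus-mysqld-exporter

mysql --skip-ssl -e "
  CREATE USER 'prometheus'@'localhost' IDENTIFIED BY 'REDACTED';
  GRANT PROCESS, REPLICATION CLIENT, SELECT ON *.* TO 'prometheus'@'localhost';"

# The exporter reads its connection string from DATA_SOURCE_NAME.
export DATA_SOURCE_NAME='prometheus:REDACTED@(localhost:3306)/'
prometheus-mysqld-exporter &

# Verify the metrics endpoint before pointing the analytics prometheus instance at it.
curl -s http://localhost:9104/metrics | grep -c '^mysql_'
```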
[08:01:38] (the reasoning to add them to zarcillo is that soon puppet will not be the source of truth for the grouping, the metadata database will)
[08:01:55] soon(TM) == at some point in the future
[08:01:59] :)
[08:02:56] also query monitoring is disabled on prometheus because queries could contain sensitive data, so that is done on tendril/zarcillo
[08:04:49] ah very nice!
[08:06:21] make sure performance_schema is enabled on new setups or after a restart to get proper monitoring, and not only long running queries
[08:08:04] marostegui: hey, I just woke up. On my way to the office
[08:08:15] Will check asap
[08:08:38] Amir1: sure, thanks, it was a heads up mostly, if you can help monitor it, it'll be nice :)
[08:09:28] 10DBA, 10Patch-For-Review: Drop ct_ indexes on change_tag - https://phabricator.wikimedia.org/T205913 (10Marostegui)
[08:10:31] 10DBA, 10Operations, 10cloud-services-team, 10wikitech.wikimedia.org, 10Release-Engineering-Team (Watching / External): Move some wikis to s5 - https://phabricator.wikimedia.org/T184805 (10Petar.petkovic)
[08:17:54] Attempting to recover "dump.s3.2018-09-25--18-57-33" ...
[08:20:43] 10DBA, 10Operations, 10cloud-services-team, 10wikitech.wikimedia.org, 10Release-Engineering-Team (Watching / External): Move some wikis to s5 - https://phabricator.wikimedia.org/T184805 (10jcrespo) Running to check timing and correctness: ``` # time recover_section.py s3 --database enwikivoyage --host d...
[08:22:54] banyek: I have a proposal for you due to constraints of import- disabling backups (temp.) only on codfw for hw maintenance
[08:23:36] as I may need fresh logical backups on eqiad ASAP
[08:29:34] jynus: I'll disable the backups then
[08:36:27] 10DBA, 10Patch-For-Review: Drop ct_ indexes on change_tag - https://phabricator.wikimedia.org/T205913 (10Marostegui)
[08:44:14] jynus: done
[08:44:29] marostegui: Thanks!
[09:12:59] jynus: hey, given this https://phabricator.wikimedia.org/T201009#4631202 do you think it's okay to continue running the script?
[09:13:09] I monitor lag
[09:14:13] where is that lag from?
[09:14:18] eqiad or codfw?
[09:15:26] codfw
[09:15:30] master dc
[09:16:09] if it is codfw it is not ok because of T180918
[09:17:46] tell tgr to either reduce the batch size or otherwise slow down the edits
[09:18:09] let me check if the batch size is configurable
[09:18:11] actually
[09:18:13] let me see
[09:18:18] because it may be misleading
[09:18:24] as long as core is at 0 it is ok
[09:18:33] I ran it on the default batch size
[09:18:35] it may be 3 on a dbstore
[09:18:39] which is ok
[09:19:07] I can reduce the batch size in case it's not a dbstore
[09:19:17] you should check https://grafana.wikimedia.org/dashboard/db/mediawiki-mysql-loadbalancer?orgId=1 instead
[09:19:33] if the lag there is <1 you should be ok
[09:20:22] this is yesterday's incident: https://grafana.wikimedia.org/dashboard/db/mediawiki-mysql-loadbalancer?panelId=1&fullscreen&orgId=1&from=1538390088439&to=1538392585199
[09:21:28] Ok. Noted. Thanks
[09:21:32] the mysql lag is an ops view
[09:21:41] it may include non-mw hosts we don't care about
[09:21:49] the one you used?
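A quick sketch of the two checks touched on above (whether performance_schema is enabled, and the replica-side lag value behind the dashboards). The host name below is only a placeholder taken from later in this log:

```
# performance_schema cannot be turned on at runtime: it has to be set in my.cnf
# ([mysqld] performance_schema = on) and only takes effect after a restart.
mysql --skip-ssl -e "SELECT @@performance_schema;"

# Replica-side lag for one host; the mediawiki-mysql-loadbalancer dashboard is
# what matters for MediaWiki, this only shows the raw per-host value.
mysql --skip-ssl -h db2056.codfw.wmnet -e "SHOW SLAVE STATUS\G" | grep Seconds_Behind_Master
```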
[09:21:58] the link I used is mw's view
[09:22:10] which is the only one that matters for the mw application
[09:22:56] I see db2056 with 6.5 seconds of lag
[09:23:22] but that is s2
[09:27:21] the db1110 import is taking a long time because most of the time it is stuck importing the 20M text table serially
[09:28:18] I am also generating some seconds of delay ONLY in eqiad
[09:30:36] not sure if I should try to start s3 multisource replication next or try to import the large cebwiki
[09:31:15] I would start it, you can always stop it later and import cebwiki
[09:31:31] but if something breaks, maybe detect it before wasting time on cebwiki (which is the largest?)
[09:39:01] jynus: I used the mysql-aggregated dashboard
[09:39:15] but I can find out what host had the highest lag
[09:41:18] if you use that, make sure you select only the core group
[09:44:08] 10DBA, 10Patch-For-Review: Drop ct_ indexes on change_tag - https://phabricator.wikimedia.org/T205913 (10Marostegui)
[09:44:29] Amir1: regarding change_tag, so far no errors on enwiki on any of those hosts, I might alter another one to have more hosts so we can catch issues faster
[09:44:58] marostegui: nice, thanks. I put the patch that fixes the error to be deployed at one pm
[09:49:17] great!
[09:49:42] Amir1: can you maybe git grep for those 4 indexes we are dropping now? to see if something pops up?
[09:50:19] banyek: did dbstore2002:s2 finish already?
[09:50:45] sure
[09:50:50] thank you
[09:51:40] no, there are 2 tables to go, but they're huge, and I was not sure if the compression would end before the hw maintenance, so I stopped the compression and resumed replication
[09:51:47] cool!
[09:51:49] thanks
[09:51:52] np
[09:53:07] and I see there are a bunch of new tables there
[09:54:23] those are probably the ones created on the train
[09:54:28] if they are related to ipblocks
[09:54:33] they are empty
[09:54:42] Nothing popped up
[09:54:54] Amir1: Cool!
[09:55:09] I will keep monitoring the logs for enwiki for another 24h
[09:55:21] Who knows..
[10:27:06] 10DBA, 10Patch-For-Review: Drop ct_ indexes on change_tag - https://phabricator.wikimedia.org/T205913 (10Marostegui) Altered also db2092 on enwiki, so we have another host to look for errors with hard coded queries (if any) : https://logstash.wikimedia.org/goto/2695c9487fb884e9001c8155dbce12e5
[10:30:28] Amir1: ^ another host as a "canary"
[10:33:05] Cool, thanks!
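A sketch of the canary approach being used for T205913: grep the code for hard-coded index hints first, then drop the index on a couple of replicas only and watch the error logs before altering the rest. The index name below is a placeholder for illustration, not one of the actual ct_ indexes being dropped:

```
# 1. Look for queries that reference a change_tag index explicitly; a
#    hard-coded hint would start erroring once the index is gone.
git grep -nE 'FORCE INDEX|USE INDEX' -- includes/ extensions/ | grep -i change_tag

# 2. Drop the index on a single "canary" replica only (ct_some_index is
#    a hypothetical name used here for the example).
mysql --skip-ssl -h db2088.codfw.wmnet enwiki \
    -e "ALTER TABLE change_tag DROP INDEX ct_some_index;"

# 3. Watch logstash for DB errors for a while before rolling the change
#    out to the remaining hosts.
```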
[10:38:10] I'm running the script on commonswiki with 500 as the batch size, it's okay
[10:45:50] I'll merge the wmf-pt-kill related changes (and build the package) only after lunch, I am getting hungry
[11:54:59] can I ask for a code review on this: https://gerrit.wikimedia.org/r/#/c/operations/debs/wmf-pt-kill/+/463931/
[12:10:45] I added the grants for wmf-pt-kill and merged to puppet, but I am not sure if the user got created, because I can't find whether the file 'modules/role/templates/mariadb/grants/wiki-replicas.sql' is used anywhere
[12:12:15] Also I am not sure if it wouldn't be better to create that user with the package's postinst (and remove it with the postrm) script, because if wmf-pt-kill will be used outside the wikireplicas, then it needs more files to maintaing
[12:12:19] *maintain
[12:13:42] tbh I was thinking of putting that system user in production.sql to have it everywhere, but I don't think it is a good idea because that user has SUPER privilege, which may be too much to have on those hosts where the wmf-pt-kill package won't be used
[12:14:09] I am a little puzzled now
[12:16:35] jynus: There is no SWAT in the next 11 hours so I can't deploy the fix: https://gerrit.wikimedia.org/r/c/mediawiki/extensions/FlaggedRevs/+/463943 is it fine or should I deploy right now given that it's fataling on a user facing special page
[12:32:48] Amir1: sorry again, my previous comment still applies - I just report the issue, I don't have any special interest in what or when unless it is blocked by me (not that I don't care, just that I really don't have an opinion)
[12:33:28] you should ask releng as they should be the ones handling releases and are affected by errors, etc.
[12:34:12] banyek: let's wait for a production wide deployment to check it works as intended on labsdb
[12:34:28] deployment to core may need more work
[12:34:43] and the killing is already handled there by existing events
[12:34:58] e.g. we had had issues with memory leaks in the past
[12:35:07] with pt-kill
[12:37:19] jynus: ok, noted
[12:37:39] that doesn't mean we won't use it
[12:38:02] just that it is not a high priority, and needs resources we are low on :-)
[12:38:16] HUMAN RESOURCES!!! :D
[12:38:39] I was going to propose, however, to implement a similar thing but for pt-heartbeat
[12:39:00] a systemd service for easier handling of start and stop
[12:39:11] + wikimedia patches
[12:39:28] ^^ I would be happy to work on that, I like that kind of work
[12:39:33] but I don't want to confuse you now
[12:39:40] and I didn't know if you liked pt-kill
[12:39:49] or hated that kind of work
[12:39:59] I loved it
[12:40:08] but it would make sense for it to be you as you are already familiar with the process
[12:40:18] and it is very very similar but for another utility
[12:40:54] it is also a key part of the mediawiki infrastructure
[12:41:05] The thing I prefer doing the most is when I am able to sew components together, building systems/tools
[12:41:28] let's wait for more comments on your patch
[12:41:37] and when it is finished I can tell you more about that
[12:41:46] cool!
[12:42:08] So, what about the user creation then? Shall I do it from the postinst file (`mysql --skip-ssl -e "GRANT..."`) or do I need to apply the wiki-replicas.sql?
[12:42:33] yeah, that is not automated on purpose
[12:42:54] it will be, but needs the inventory and manual deploy etc.
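For reference, the kind of grant being discussed: pt-kill needs PROCESS to see other connections and SUPER to be able to kill threads owned by other accounts, which is why putting this user in production.sql everywhere felt excessive. A sketch only, with placeholder account name and password, not the actual wiki-replicas.sql content:

```
# Applied manually on each wikireplica; the account/password are placeholders.
mysql --skip-ssl -e "
  CREATE USER IF NOT EXISTS 'wmf-pt-kill'@'localhost' IDENTIFIED BY 'REDACTED';
  -- PROCESS to read the processlist, SUPER to kill threads of other users
  GRANT PROCESS, SUPER ON *.* TO 'wmf-pt-kill'@'localhost';"
```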
[12:43:18] for now just apply the grant (only the new stuff) locally on the 3 wikireplicas
[12:43:29] log when grants change, too
[12:43:53] OK
[12:43:55] the idea is to eventually move the grants handling to a mysql database
[12:44:25] so we can dynamically create diffs and detect account errors
[12:45:03] while keeping the passwords on the private puppet repo
[12:46:21] Amir1: I agree with Jaime, up to you and releng. There are not many errors (138 in 24h), so whatever you decide is good with us. Regarding the other 4 indexes removal, I am not seeing any errors, we are good on that front. I am going to fully deploy it in eqiad
[12:47:00] Nice thanks!
[12:50:41] banyek: riccardo may have given you some insightful comments on your patch, but he is on vacation right now
[12:51:19] hopefully moritz can take a quick look as he is way better at debian packaging than many of us
[12:52:13] banyek: one more question, /var/log/wmf-pt-kill is written directly by the process?
[12:53:10] `/var/log/wmf-pt-kill/wmf-pt-kill.log`: /var/log is not writable by the user, but the /var/log/wmf-pt-kill directory is chown'd to the wmf-pt-kill system user
[12:54:35] not complaining about that
[12:54:39] but about rotation
[12:54:54] aka "disk is full" because of logs
[12:58:35] ^^ I was thinking that we need to set up logrotate, but I haven't gotten there so far
[13:20:47] 10DBA, 10Operations, 10cloud-services-team, 10wikitech.wikimedia.org, and 2 others: Move some wikis to s5 - https://phabricator.wikimedia.org/T184805 (10jcrespo)
[13:23:47] 10DBA, 10SDC Engineering, 10Wikidata, 10Core Platform Team (MCR: Deployment), and 5 others: Deploy MCR storage layer - https://phabricator.wikimedia.org/T174044 (10CCicalese_WMF)
[13:26:42] 10DBA, 10Patch-For-Review: Drop ct_ indexes on change_tag - https://phabricator.wikimedia.org/T205913 (10Marostegui)
[13:29:05] I start to prepare dbstore2002 for power off at 4pm
[13:29:16] great
[13:29:22] 10DBA, 10Core Platform Team Kanban, 10SDC Engineering, 10Wikidata, and 5 others: Deploy MCR storage layer - https://phabricator.wikimedia.org/T174044 (10CCicalese_WMF)
[13:48:47] 10DBA, 10Patch-For-Review: Drop ct_ indexes on change_tag - https://phabricator.wikimedia.org/T205913 (10Marostegui)
[13:54:32] marostegui: so the idea is to rename the tables ahead of time (or delete them) and put a replication filter while we are at codfw
[13:54:37] 10DBA, 10Patch-For-Review: Drop ct_ indexes on change_tag - https://phabricator.wikimedia.org/T205913 (10Marostegui)
[13:54:52] that way the dc switch procedure is not modified
[13:55:14] jynus: Rename them where?
[13:55:16] or we could replicate from s5 in a "loop"
[13:55:33] so remove (or rename) cebwiki from s3
[13:55:45] because it is already replicating on s5
[13:55:53] from codfw
[13:55:55] so you mean renaming those tables on s3 but only on eqiad
[13:55:57] right?
[13:56:05] yes
[13:56:20] right, so putting a filter to ignore those wikis and then rename those on db1075
[13:56:23] right?
[13:56:26] remove/rename anything that avoids a split brain
[13:56:33] yes
[13:56:37] now I get it
[13:56:38] +1
[13:56:48] the same but without touching the dc procedure
[13:56:56] there are still some issues
[13:57:06] like easy recovery on failure
[13:57:22] but I think that should have worked the same
[13:57:28] e.g. rename -> rename back
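A sketch of the "filter + rename" idea above, as it could look on the s3 master that must stop carrying the moved wikis: ignore their replicated changes and rename the local tables out of the way so nothing can write to them (avoiding a split brain), with rename-back as the rollback. The wiki/table names and suffix are illustrative, and this assumes the MariaDB replicate_* filter variables can be set dynamically while the slave is stopped; otherwise the filter would go into my.cnf:

```
# Sketch only - not the exact procedure used for T184805.
mysql --skip-ssl -e "STOP SLAVE;"

# Ignore replicated changes for the wikis that now live in s5
# (wiki names here are just examples from the move).
mysql --skip-ssl -e "SET GLOBAL replicate_wild_ignore_table = 'cebwiki.%,mgwiktionary.%';"

# Rename the local copies so accidental local writes are impossible;
# renaming back undoes it. One table shown as an example.
mysql --skip-ssl cebwiki -e "RENAME TABLE page TO page_T184805_moved;"

mysql --skip-ssl -e "START SLAVE;"
```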
[13:57:34] remove filters
[13:57:37] so db1070 multi-source to s3 codfw, with a replication filter to replicate those wikis, and then on db1075 a filter to ignore and then rename the tables
[13:57:47] I will start the preparation of dbstore2002 soon, but a few minutes late
[13:57:49] I think yes
[13:58:03] I am going to test it on db1110
[13:58:06] to be 100% sure
[13:58:13] and leave it replicating for a day
[13:58:23] jynus: that should work yes, the only drawback is that if we have to move to eqiad, let's say on saturday for an emergency, we would need to reimport those wikis, which is unlikely anyway
[13:58:26] and maybe do some compare.py
[13:58:39] ?
[13:58:44] I didn't get that last part
[13:58:58] no, rename and filter the same day
[13:59:05] Ah ok
[13:59:06] Then yes
[13:59:09] in advance, but not too much in advance
[13:59:13] yeah
[13:59:13] just not blocking the read only
[13:59:17] That should work yes
[13:59:18] like the warmup
[14:00:24] and if something goes bad, we replicate the missing hours just for those tables
[14:00:27] then reenable them
[14:00:43] the problem is what if we do a switch again
[14:00:51] and what happens with codfw
[14:01:01] We'd need to apply binlogs for those wikis
[14:01:01] another filter?
[14:01:20] I mean to not break replication
[14:01:26] on s5
[14:01:29] yeah, a filter is the only thing we can do
[14:01:43] so 99% of the wikis at least should be unaffected
[14:02:07] and as long as we only have one read-write server, we can recover the others
[14:02:12] on the other hand, those are the bigger ones, so probably the ones with more traffic
[14:02:14] or do the inverse switch
[14:04:40] so if we do codfw -> eqiad and stay there for let's say 5 minutes and then go back to codfw
[14:04:50] We'd have lost 5 minutes of writes to s3 codfw
[14:05:06] Which we could just replicate with binlogs
[14:05:23] (for those wikis)
[14:07:10] or with multi-source
[14:07:39] we could also set up the codfw s3 master with multisource to replicate only certain changes?
[14:08:16] from s5?
[14:08:23] but that can only happen after migration
[14:08:29] after being passive
[14:08:44] and that will not work with mediawiki
[14:08:58] (heartbeat replication control)
[14:09:04] but will keep the data fresh
[14:09:22] yes, from eqiad s5
[14:09:27] now that I think about it, s5 eqiad will have multi source, will that mess up the dc checks (I remember there is a check to verify all the masters are up to date)
[14:09:54] we will need to ask v0lans
[14:09:55] which checks?
[14:10:14] isn't there a check during the failover to check if the passive masters are up-to-date, so no lag?
[14:10:15] the dc-switch
[14:10:25] yes
[14:10:32] it shouldn't because it should check SHOW SLAVE STATUS only, but we can check
[14:10:50] ok I am here
[14:11:00] so on s5 you'll create an additional replication channel but also leave the main one?
[14:11:05] So I start to prepare dbstore2002
[14:11:06] so show slave status will still work?
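The multi-source setup being tested on db1110 uses MariaDB named replication connections, which is also why the question about SHOW SLAVE STATUS and the dc-switch check matters: the plain statement only reports the default connection. A minimal sketch (the master host name, credentials and GTID mode are placeholders; the real command run appears in the T184805 comment further down in this log):

```
# Add a second, named replication stream next to the default one.
mysql --skip-ssl -e "
  CHANGE MASTER 's3' TO
    MASTER_HOST='s3-codfw-master.placeholder',
    MASTER_USER='repl',
    MASTER_PASSWORD='REDACTED',
    MASTER_USE_GTID=slave_pos;
  START SLAVE 's3';"

# The default connection is unaffected; the named one needs its own checks.
mysql --skip-ssl -e "SHOW SLAVE 's3' STATUS\G"
mysql --skip-ssl -e "SHOW ALL SLAVES STATUS\G"
```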
[14:11:07] I wonder also if circular replication will be a problem too
[14:11:18] marostegui: I need to check the code
[14:11:44] if it does a show slave status and we don't create s3 and s5 channels, and only an s3 channel
[14:11:47] that should keep working
[14:11:51] but that's to be checked
[14:12:04] I put the host into downtime in icinga first
[14:13:19] now I disable puppet
[14:14:01] banyek: check if there's a kernel upgrade needed
[14:14:09] good call
[14:14:27] yeah, it would be nice to get a mariadb upgrade too
[14:14:34] yep
[14:14:36] although not a huge priority
[14:14:59] marostegui: let me finish the import to a replica of s5
[14:15:02] ok
[14:15:09] and I will test replication
[14:15:17] and then I can do a detailed plan
[14:15:23] great, thank you
[14:16:30] banyek: it is running 10.1.35 so if you do the full-upgrade you'll get 10.1.36 too
[14:16:44] I'll do it then
[14:17:08] it gets shut down and powered off anyway, it would be a mistake not to do it
[14:17:29] btw. I am searching for the user who creates the backups but I can't find it
[14:17:36] dumps
[14:17:48] not that one
[14:17:55] ```at /etc/passwd | grep dumps
[14:17:55] root@dbstore2002:~#```
[14:18:01] it is called dump
[14:18:17] it's not `at` it's `cat`
[14:18:20] but don't grep it here :-)
[14:18:46] why do you need it banyek?
[14:18:50] Or what do you need to check?
[14:18:53] ```root@dbstore2002:~# sudo -u dump crontab -l
[14:18:53] sudo: unknown user: dump
[14:18:53] sudo: unable to initialize policy plugin```
[14:19:10] banyek: that is on es2001, not on dbstore2001
[14:19:15] *dbstore2002
[14:19:21] what are you trying to check?
[14:19:22] on dbstore2002 you will only have a mysql user
[14:19:30] dbstore is a dbstore role host
[14:19:33] my bad
[14:19:37] es2001 is a backup role host
[14:19:55] dbstore == source backup == temporary storage
[14:20:14] then there is dbstore1001 which is both
[14:20:35] then I am missing a point
[14:20:43] banyek: when in doubt, check manifests/site.pp which might help
[14:20:49] (or confuse you more)
[14:21:02] where does the backup script itself run, the one which does the backup on dbstore2002?
[14:21:10] banyek: es2001
[14:21:18] I mean I'll double check it ofc. but I don't want to fsck this up
[14:21:31] the crontab you are looking for is on es2001
[14:21:38] banyek: you just disabled the backups before, or at least you told me so
[14:21:46] I told you
[14:21:49] I checked, and it is disabled yes
[14:22:05] I asked to do that to allow them to run on eqiad unchanged
[14:22:16] banyek: check the crontab on es2001
[14:22:23] that is what you disabled
[14:22:41] I don't think disabling puppet on dbstore2002 is needed
[14:23:00] well, I do it on reboot because I unmount /srv
[14:23:04] but that is just me
[14:23:14] and puppet writes on /srv/sqldata*
[14:23:35] yes, I checked, it is disabled indeed
[14:23:42] not saying you should do that, just explaining why I do it
[14:24:14] for a dbstore host I would definitely do it
[14:24:16] Jynus: after the bbu is changed and the host is booted back up, I can enable it and start, right? (I mean, you finished what you needed earlier?)
[14:24:38] yeah, I just wanted backups on eqiad
[14:24:50] ok
[14:24:56] so I have them tomorrow morning for our little side project here
[14:25:00] so I am going forward
[14:25:18] did you stop all instances?
[14:25:35] not yet, I am stopping replication for all the sections on dbstore2002 first
[14:25:36] (or you are about to?)
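What "stopping replication for all the sections" can look like on a multi-instance host like dbstore2002, reusing the same per-socket loop that banyek pastes later in this log; the mariadb@<section> unit names and the section list are assumptions about the multi-instance setup, shown only as a sketch:

```
# Stop replication on every local instance first...
for socket in /run/mysqld/*; do
    mysql --skip-ssl --socket="$socket" -e "STOP ALL SLAVES;"
done

# ...then stop the instances themselves (s1 is the slow one, it has to
# flush a large buffer pool before the daemon exits). Section names are
# illustrative.
for section in s1 s2 s3 s4 s5; do
    systemctl stop "mariadb@${section}"
done
```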
[14:25:40] sure
[14:25:45] and then stopping the instances
[14:25:55] I didn't know if forward was shutdown or what? XD
[14:26:07] banyek: remember the upgrade
[14:26:11] forward was to do the next step
[14:26:42] banyek: basically the steps I shared with you are a good base
[14:26:49] db2058 failed finally
[14:26:55] yaaay
[14:26:55] not even a week as you said, marostegui
[14:27:13] maybe not replication works well
[14:27:15] *now
[14:27:18] xdddd
[14:27:22] https://www.irccloud.com/pastebin/jRAbv3z2/
[14:27:26] 10DBA, 10Operations, 10ops-codfw: db2058: Disk #1 predictive failure - https://phabricator.wikimedia.org/T205872 (10Papaul) a:05Papaul>03Marostegui @Marostegui Disk replacement complete.
[14:27:30] ha!
[14:27:36] it didn't fail!
[14:27:49] shutting down instanes
[14:27:51] shutting down instances
[14:27:56] jynus: ^!
[14:27:57] :)
[14:28:08] lol
[14:28:21] 10DBA, 10Operations, 10ops-codfw: db2058: Disk #1 predictive failure - https://phabricator.wikimedia.org/T205872 (10Marostegui)
[14:28:43] the bug identifies rebuilding as failed
[14:28:46] *the bot
[14:28:57] because the raid is technically degraded
[14:29:05] sure
[14:29:43] 10DBA, 10Operations, 10ops-codfw: db2058: Disk #1 predictive failure - https://phabricator.wikimedia.org/T205872 (10Marostegui) Thanks - I see it rebuilding: ``` rroot@db2058:~# hpssacli controller all show config Smart Array P420i in Slot 0 (Embedded) (sn: 0014380337DC560) Port Name: 1I Port Na...
[14:29:59] Recovering, 2% complete
[14:30:28] do we have news about db1092 BBU?
[14:30:42] nope no, no updates on the ticket
[14:31:02] hopefully we get some updates before the end of the week
[14:31:31] This is the last thing: https://phabricator.wikimedia.org/T205514#4622889
[14:31:36] waiting for their response
[14:34:19] it's not fast, I am still waiting for s1 to flush the buffer pool
[14:34:52] nope, it will take time
[14:35:15] banyek: HDs on a very loaded host
[14:35:36] I think s1 takes around 20 minutes to stop
[14:35:46] the others are faster
[14:36:08] well I hope 1 hour is enough for shutting down the whole server - 35 minutes have already passed
[14:36:09] probably compression also affects the slowdown
[14:37:00] not to be done in a hurry :-)
[14:39:35] only 2 instances left
[14:41:36] only s1 left
[14:41:45] (I can see it on the monitoring)
[14:41:58] thank god it is systemd doing this in parallel
[14:42:36] actually it would be worth checking once if this is faster one-by-one. I mean maybe the storage is a bit overloaded
[14:42:45] yeah, it could be
[14:42:50] too much io
[14:42:59] on a host that really doesn't have much
[14:44:03] hear, hear
[14:44:14] peak on writes: https://grafana.wikimedia.org/dashboard/db/prometheus-machine-stats?panelId=6&fullscreen&orgId=1&var-server=dbstore2002&var-datasource=codfw%20prometheus%2Fops&from=now-6h&to=now-1m
[14:45:04] Oct 02 14:44:44 dbstore2002 systemd[1]: Stopped mariadb database server.
[14:45:57] cool
[14:46:24] I am doing the apt upgrade now
[14:48:06] https://www.irccloud.com/pastebin/9jOnXKJj/
[14:49:14] ask papaul
[14:50:01] to verify the idrac for you
[14:51:36] in the meantime I unmounted /srv to make sure there's nothing not flushed there
[14:52:02] actually I can shut down the host, and he can power it back up, but what happens if it doesn't boot up?
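For completeness, the controller checks used elsewhere in this log to confirm rebuild and BBU/cache state, depending on whether the host has an HP (hpssacli) or LSI/MegaRAID (megacli) controller; only the BBU subcommand is an addition here, taken from the standard megacli tooling:

```
# HP Smart Array (e.g. db2058): logical drive, rebuild and controller status.
hpssacli controller all show config
hpssacli controller all show status

# LSI/MegaRAID (e.g. db1067): disk/rebuild status and battery (BBU) state.
megacli -LDPDInfo -aAll
megacli -AdpBbuCmd -aAll
```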
[14:52:14] As I said, ask papaul to check it
[14:52:23] (I already did)
[14:52:56] he probably needs to shut the server down to check/power cycle the idrac
[14:53:06] or drain it or something
[14:55:00] I'll shut down the machine and then he can do everything he needs to do
[14:55:21] great
[14:56:18] so es2001 is the one needing the serial port power drain
[14:56:29] dbstore2002 in theory only needed the battery change?
[14:56:53] 10DBA, 10MediaWiki-Watchlist, 10Growth-Team (Current Sprint), 10Wikimedia-production-error: Deleting large watchlist takes > 4 seconds causing rollback due to write time limit - https://phabricator.wikimedia.org/T171898 (10kostajh) a:03kostajh
[15:01:14] 10DBA, 10User-Banyek: BBU problems dbstore2002 - https://phabricator.wikimedia.org/T205257 (10Banyek) @papaul I stopped the machine you can work on it. Btw. I was not able to access the idrac console , so maybe you could take a look on that too: ``` banyek ~ $ ssh dbstore2002.mgmt.codfw.wmnet -lroot Unable...
[15:03:39] 10DBA, 10User-Banyek: BBU problems dbstore2002 - https://phabricator.wikimedia.org/T205257 (10Papaul) @banyek having problem with my irssi server so can not connect to IRC rebooting my server now will ping you when i get on IRC
[15:04:11] 10DBA, 10User-Banyek: BBU problems dbstore2002 - https://phabricator.wikimedia.org/T205257 (10Banyek) ok
[15:07:28] the error you are having is probably a mac-only error
[15:08:49] not that it is mac's fault (those mgm ssh are horrible)
[15:09:35] but try connecting to neodymium/sarin and then to the host to work around it
[15:09:49] let me check that
[15:10:23] FYI, it works ok on my stretch
[15:10:44] yes, it works from neodymium
[15:10:55] but I have sometimes seen mac users complaining about certain combinations of ssh configs
[15:11:00] at wmf infra
[15:11:16] either because they are outdated or, on the other side, the server is outdated
[15:19:32] the server is powered down and papaul told me that he'll upgrade the server firmware too - hopefully that will fix the ssh issue. I think I'll go afk now, and finish this when the server is available again
[15:21:04] I am moving too, returning when my import finishes in case you need help
[15:21:41] if I need something I'll ping you, but hopefully no need for that
[15:21:52] I will log here what I am doing/done
[15:58:43] 10DBA, 10User-Banyek: BBU problems dbstore2002 - https://phabricator.wikimedia.org/T205257 (10Marostegui) @papaul I just saw your irc ping but I'm not next to my computer. Please talk to @banyek who is coordinating this. Thanks!
[15:59:18] we are already working on that
[16:00:21] Banyek you around?
[16:00:27] yes
[16:00:35] Papaul pinged me on the other channel
[16:00:39] we are already working on that
[16:00:42] Ah cool
[16:00:45] I answered that call
[16:00:47] I'm on my mobile
[16:00:52] the server is not back yet
[16:00:55] Not sure why he pinged me :)
[16:00:59] and I can't login via ilo
[16:01:10] but no worries
[16:01:13] I'll solve this
[16:01:14] :)
[16:01:25] :-)
[16:01:29] Good luck!
[16:01:34] I'm doing paperwork crap
[16:01:47] that's the worst.
[16:02:00] I knooow
[16:02:13] See you tomorrow!
[16:07:42] it's weird. the host is up, but the network seems inaccessible
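On the mgmt SSH problem: the workaround suggested above is simply to hop through neodymium/sarin. If the failure from the laptop is the usual algorithm negotiation error against an old iLO/DRAC, forcing legacy options on the client is a common fix, but that is only a guess here since the full error message is truncated in the ticket:

```
# Workaround from the conversation: go through a production host whose ssh
# client still negotiates happily with the old management controllers.
ssh neodymium.eqiad.wmnet
ssh dbstore2002.mgmt.codfw.wmnet -lroot

# Possible direct fix from a modern client (assumption: a key exchange /
# cipher mismatch is the cause) - enable legacy algorithms for this host only.
ssh -o KexAlgorithms=+diffie-hellman-group1-sha1 \
    -o Ciphers=+aes128-cbc \
    dbstore2002.mgmt.codfw.wmnet -lroot
```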
[16:16:57] ok, the reason was that the ethernet was not connected yet, but checking the controller doesn't comfort me:
[16:17:04] https://www.irccloud.com/pastebin/QaRM2MHF/
[16:39:03] I am swapping the failed disk on db1067
[16:39:07] FYI ^
[16:41:54] 10DBA, 10Operations, 10ops-eqiad: db1067 (enwiki master) disk #7 with errors - https://phabricator.wikimedia.org/T205780 (10Cmjohnson) The disk has been swapped
[16:42:43] 10DBA, 10Operations, 10ops-eqiad, 10Patch-For-Review, 10User-Banyek: db1092 crashed - BBU broken - https://phabricator.wikimedia.org/T205514 (10Cmjohnson) Waiting on the part still
[16:55:40] there are now 5 extra databases at db1110 replicating from s3, I wonder if I should also allow the changes from heartbeat
[17:01:43] 10DBA, 10Operations, 10ops-eqiad: db1067 (enwiki master) disk #7 with errors - https://phabricator.wikimedia.org/T205780 (10jcrespo) Thank you! Our alerting detects the rebuild as a down so I had worried at first without context :-) Will close when I can assure the rebuild completed successfully.
[17:03:51] 10DBA, 10Operations, 10cloud-services-team, 10wikitech.wikimedia.org, and 2 others: Move some wikis to s5 - https://phabricator.wikimedia.org/T184805 (10jcrespo) The full import process took around 7-8 hours (although it was quite inefficient to re-import just one database at a time, as the last minutes ar...
[17:04:39] 10DBA, 10Operations, 10ops-eqiad: db1067 (enwiki master) disk #7 with errors - https://phabricator.wikimedia.org/T205780 (10jcrespo)
[17:09:54] 10DBA, 10User-Banyek: BBU problems dbstore2002 - https://phabricator.wikimedia.org/T205257 (10Papaul) Replacing the Server BBU with the one in db2064 didn't fix the problem. I had to put the original BBU back in the server and after doing that the error went away. This doesn't make sense for me. Can we please...
[17:16:40] 10DBA, 10Operations, 10cloud-services-team, 10wikitech.wikimedia.org, and 2 others: Move some wikis to s5 - https://phabricator.wikimedia.org/T184805 (10jcrespo) Command executed, for posterity: ``` root@db1110.eqiad.wmnet[(none)]> CHANGE MASTER 's3' TO ... root@db1110.eqiad.wmne...
[17:17:05] dbstore2002 is back again, now I start the mariadb instances on it
[17:19:48] thanks, I will be around for some minutes if you need help, but otherwise I'll be mostly idle
[17:20:38] BTW, did you talk with papaul? On the ticket it said he wasn't very convinced it was fixed
[17:21:12] all the instances are now back up, starting mysql_upgrade
[17:22:06] yes, we did talk: he removed the old BBU, put in the spare one, the cache was down, and then put back the old one and the host now says "it's ok". I have a bad feeling about this too.
[17:23:29] also SSH dbstore2002.mgmt seems down now?
[17:24:16] yes, but I have a different error message :D
[17:24:34] (checking from neod.)
[17:24:48] well, it worked before
[17:24:55] at least it responded
[17:24:57] and does not work now
[17:25:05] https://www.irccloud.com/pastebin/ylaDhmQB/
[17:25:59] ferm seems to have failed too - restart it when you have finished the maintenance
[17:26:43] I am just regretting that I ran mysql_upgrade without screen, it takes a lot of time
[17:27:05] (on s3 iirc)
[17:27:06] yep, I learned that too :-)
[17:27:30] it is over 100K objects
[17:31:36] thank god I am at phase 4 now
[17:32:01] (but still at amwiktionary)
[18:17:38] it's done!
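The mysql_upgrade remark above, turned into a sketch: long per-instance maintenance is better run inside screen so a dropped SSH session does not kill it. The per-socket loop mirrors the one banyek pastes right after this; the log path is arbitrary:

```
# Run the upgrade of every local instance inside a detached screen session
# so it survives the terminal going away; reattach later with `screen -r upgrade`.
screen -dmS upgrade bash -c '
  for socket in /run/mysqld/*; do
      echo "== ${socket}" >> /root/mysql_upgrade.log
      mysql_upgrade --socket="${socket}" >> /root/mysql_upgrade.log 2>&1
  done
'
```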
[18:20:01] I restarted the replication
[18:20:08] ```root@dbstore2002:~# for socket in /run/mysqld/*; do mysql --skip-ssl --socket=$socket -e "SHOW SLAVE STATUS\G" | grep Seconds ; done
[18:20:08] Seconds_Behind_Master: 15222
[18:20:08] Seconds_Behind_Master: 13963
[18:20:08] Seconds_Behind_Master: 13979
[18:20:08] Seconds_Behind_Master: 13936
[18:20:08] Seconds_Behind_Master: 13128```
[18:20:32] now I go to the family and will start the backup later
[18:42:10] 10DBA, 10Operations, 10ops-eqiad: db1067 (enwiki master) disk #7 with errors - https://phabricator.wikimedia.org/T205780 (10Marostegui) 05Open>03Resolved The alert cleared and the RAID is back to optimal! ``` root@db1067:~# megacli -LDPDInfo -aAll Adapter #0 Number of Virtual Disks: 1 Virtual Drive: 0...
[18:43:18] 10DBA, 10Operations, 10ops-codfw: db2058: Disk #1 predictive failure - https://phabricator.wikimedia.org/T205872 (10Marostegui) 05Open>03Resolved All good! Thank you! ``` root@db2058:~# hpssacli controller all show config Smart Array P420i in Slot 0 (Embedded) (sn: 0014380337DC560) Port Name: 1I...
[18:46:14] 10DBA, 10User-Banyek: BBU problems dbstore2002 - https://phabricator.wikimedia.org/T205257 (10Marostegui) This is not the first time I see a BBU behaving like that after a reboot or a power drain, the error clears for a few days or even weeks before failing again. Sometimes it lasts for a few hours and other...
[19:55:31] 10DBA, 10MediaWiki-Watchlist, 10Growth-Team (Current Sprint), 10Wikimedia-production-error: Deleting large watchlist takes > 4 seconds causing rollback due to write time limit - https://phabricator.wikimedia.org/T171898 (10kostajh) Looking more closely at the logs I linked to above, I believe the only prob...
[21:11:44] I re-enabled and started the backups: almost all caught up - except s1. But s1 is a bit further ahead than it would be if the backup had started at the 'normal' time
[21:17:34] and the backup is running on es2001
[21:17:39] now I go to sleep
[21:18:14] tomorrow I'll show up a little later, I have errands to run with my son, I'll be around between 10:00 and 10:30 hopefully. You can reach me on irc
[22:24:00] 10DBA, 10User-Banyek: BBU problems dbstore2002 - https://phabricator.wikimedia.org/T205257 (10Banyek) the host is back in replication, and the backups were enabled