[03:21:51] 10DBA, 10SRE: Investigate and repool db1134 - https://phabricator.wikimedia.org/T274472 (10RLazarus) p:05Triage→03High
[03:26:25] 10DBA, 10SRE: Investigate and repool db1134 - https://phabricator.wikimedia.org/T274472 (10colewhite) `racadm getsel` ` Record: 11 Date/Time: 02/11/2021 01:38:37 Source: system Severity: Critical Description: Correctable memory error rate exceeded for DIMM_B3. -----------------------------------...
[03:28:12] 10DBA, 10SRE, 10ops-eqiad: Investigate and repool db1134 - https://phabricator.wikimedia.org/T274472 (10colewhite)
[08:33:48] 10DBA, 10SRE, 10ops-eqiad: Investigate and repool db1134 - https://phabricator.wikimedia.org/T274472 (10jcrespo) <3 the response, you not only did exactly what manuel would have done (depool from traffic), you also discovered the core reason why mysql failed (hw memory errors). Thank you a lot!
[09:03:46] 10DBA, 10Orchestrator, 10Patch-For-Review, 10User-Kormat: Enable communication between orchestrator and clouddb hosts - https://phabricator.wikimedia.org/T273606 (10Kormat) 05Open→03Resolved a:03Kormat Patch merged, clouddb101[5,9] showed up on orchestrator as soon as puppet ran. \o/
[09:12:36] 10DBA, 10SRE, 10ops-eqiad: Investigate and repool db1134 - https://phabricator.wikimedia.org/T274472 (10jcrespo) db1134 is likely to be unavailable for a long period of time due to T274472#6821332. It was the candidate master, which means we have to choose another one for that. * ~~db1083~~ * ~~db2112~~ (and...
[09:18:30] jynus: uff.
we'd have to reimage a host back to 10.1 :/
[09:18:40] not sure about that
[09:18:57] or I would say, I would let manuel take that decision
[09:19:38] i guess it depends on how long we expect it to take to get db1134 back in service vs the chances that the s1 master dies
[09:19:44] for now I will mark a new 10.4 host as candidate
[09:19:58] and if master dies, assume the upgrade to 10.4
[09:20:03] or we say that labsdb* are acceptable casualties
[09:20:06] exactly
[09:20:12] * kormat nods
[09:20:20] not as a long term decision, but as a "now" decision without manuel
[09:20:39] and next week with him, he can take that decision
[09:21:02] 10DBA, 10SRE, 10ops-eqiad: Investigate and repool db1134 - https://phabricator.wikimedia.org/T274472 (10Kormat) ` $ mw-section-groups s1 eqiad db1083 0 db1084 200 api db1099:3311 50 contributions,logpager,recentchanges,recentchangeslinked,watchlist db1105:3311...
[09:21:11] I also feel a bit "constrained" with resources on s1 to do very long changes right now
[09:21:23] ack
[09:21:42] I was thinking of making db1118 a 10.4 candidate
[09:22:06] it is the most reliable one, other than db1084, but that is going to be decommissioned
[09:22:22] any thoughts?
[09:22:39] I won't even depool it, just change row -> statement in a hot way
[09:22:44] i'd suggested db1118 on the task, too
[09:22:50] ah, sorry
[09:22:54] I didn't see that yet
[09:23:17] np :)
[09:23:23] that means we are 100% in sync
[09:23:51] what is this command mw-section-groups ?
[09:23:56] looks very useful
[09:24:36] and I would really like to have an extra host on s1 with none :-(
[09:24:47] which would fix our 10.1 issues
[09:25:01] do we happen to have any not productionized yet?
[09:26:08] 10DBA, 10SRE, 10ops-eqiad: Investigate and repool db1134 - https://phabricator.wikimedia.org/T274472 (10jcrespo) db1118 it is, then, seems also the most reliable one of the list (based on no past crashes/longevity serving traffic).
[09:31:58] jynus: it's ~kormat/bin/mw-section-groups on cumin1001. it's an adhoc script using a bunch of other adhoc scripts i've written
[09:31:58] kormat: https://gerrit.wikimedia.org/r/c/operations/puppet/+/663531
[09:32:12] thanks, kormat, I will steal it :-D
[09:32:30] or please upload it to operations/software/dbtools, it seems super useful
[09:32:59] (we use that repo for adhoc scripts)
[09:33:12] i've put a number of my existing scripts in there :)
[09:33:18] cool, thanks
[09:33:33] so please help me with the candidate issue
[09:33:49] i've been kinda waiting around for volans to release a better version of cumin, so i can stop doing terribly hacky things like ~kormat/bin/nodelist
[09:33:53] my plan is to deploy 663531
[09:34:09] and then hot-change db1118 to statement
[09:34:32] and we won't be in an ideal situation
[09:34:55] but at least something until next week
[09:35:00] +1'd
[09:35:52] and rather than downgrading db1118, I would like to see if we have a spare host to pool with 10.1 meanwhile
[09:36:17] i don't know the status of any new machines that aren't in use yet. maybe sobanski has a better idea?
[09:36:55] yeah, if he doesn't know, we may have to do some discovery :-D
[09:37:34] I may need help because I am rusty with db handling
[09:37:44] especially mw side/dbctl
[09:38:59] buf, at peak time db1134 did 21K QPS, so I would prefer to have something extra
[09:40:55] for example, kormat, could you edit the dbctl json to move the candidate master property on dbctl once it is deployed on the host?
[09:41:13] sure
[09:41:15] I am probably less comfortable with it than you
[09:41:39] let me deploy and I will tell you when to commit
[09:50:08] Some of the hosts Manuel is working on in https://phabricator.wikimedia.org/T258361 are not in use yet
[09:50:53] there's one there that's slated to replace the s1 master
[09:51:05] ah, that would be a good backup!
[09:55:31] ok, so I will do the changes live, I need someone on logs/alerts to check I am not breaking the host
[09:56:49] let me know when ready
[09:56:50] we could always depool the host, make the change, and repool it.
[09:57:06] i can watch for alerts; i honestly have no idea how to interpret the logstash stuff
[09:57:47] ha he, really we don't either => more logs == worry
[09:59:03] ok :) ready
[09:59:26] Is it just a logstash search for db1118 or something more complex?
[09:59:43] nah, just using the DBError dashboard should be enough
[09:59:51] sobanski: there are 3 dashboards that manuel looks at:
[09:59:55] https://logstash.wikimedia.org/goto/356e11dc66d9bceefce54e1e9aef819a
[09:59:59] ip for db1118 is
[09:59:59] https://logstash.wikimedia.org/goto/53f2c7bb0430903f1ad7bfe949438117
[10:00:04] https://logstash.wikimedia.org/goto/13f58d07671f48992ccdcfda0c5b6921
[10:00:14] Thanks
[10:00:21] 10.64.16.12
[10:00:30] as the app often logs ips, not hostnames
[10:00:38] 10DBA, 10SRE: Decom dbmonitor2001 - https://phabricator.wikimedia.org/T274496 (10MoritzMuehlenhoff)
[10:02:16] sobanski: for watching icinga, i recommend this url: https://icinga.wikimedia.org/cgi-bin/icinga/status.cgi?search_string=^\(db\|es\|pc\)[12]&style=detail&servicestatustypes=29
[10:02:27] ok, will log at #operations , about to do it
[10:03:21] checking the change went through
[10:06:15] nope, it is doing row format still
[10:07:36] even after flushing them
[10:08:47] so we may have to depool after all or wait for sessions to expire
[10:11:31] :-(
[10:11:36] i can handle that
[10:12:44] I think because it has so much load, it always has ongoing temporary tables
[10:13:10] so you want to take care of that, and I try to restore the "new" server?
[10:13:40] a depool or a depool + restart should do it
[10:13:57] By "new" server, do you mean db1163?
[10:14:17] I am not going to say confidently yes, but something like that
[10:16:18] db1163 has buster, I reimage it to stretch, recover an s1 backup and then we see from there
[10:16:37] I accept better plans :-)
[10:18:12] jynus: sounds reasonable to me
[10:18:22] No objections here
[10:18:39] i'll take care of db1118 as above
[10:19:37] thanks to both!
[10:25:28] 10DBA, 10SRE, 10Patch-For-Review: Productionize db1155-db1175 and refresh and decommission db1074-db1095 (22 servers) - https://phabricator.wikimedia.org/T258361 (10jcrespo) a:05Marostegui→03jcrespo I am taking db1163 to, at least temporarily, substitute db1134 due to T274472.
[10:27:18] jynus: how were you checking to see if the binlog_format change took effect?
[10:27:29] sorry, I didn't tell
[10:27:40] mysqlbinlog --start-datetime="$(date +"%Y-%m-%d %H:%M:%S")" /srv/sqldata/db1118-bin.002700 | less
[10:27:55] I have an alias for mysqlbinlog
[10:28:59] alias mysqlbinlog='mysqlbinlog -vv --base64-output=DECODE-ROWS --skip-ssl'
[10:29:58] if it outputs @1= commands that is ROW format decoded, otherwise they should be "normal SQL statements"
[10:32:14] great, thanks
[10:32:22] i needed to do `stop slave; start slave`
[10:32:27] and now it's taken effect
[10:32:28] oh!
[10:32:29] cool!
[10:32:38] I didn't think about that, I am silly
[10:32:45] it was just a guess on my side :)
[10:33:13] we should document all that
[10:33:30] ok, this gives us a balloon of air
[10:33:35] :-)
[10:35:11] see why I say I am rusty with day-to-day db operations?
[10:43:47] Team work FTW :)
[11:06:21] Am I forgetting something? https://gerrit.wikimedia.org/r/c/operations/puppet/+/663549
[11:06:59] I guess we will also need a dbctl patch later
[11:12:54] I really need help with reviews ^ for this, I don't do 100s like manuel, and when I do I normally ask for his review
[11:17:32] These look sane, looking at the rest of the config but I can't authoritatively say if it's everything.
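The hot binlog-format change and its verification, pieced together from the exchange above, amount to the following runbook fragment (a sketch only: host paths and the binlog file name are the ones quoted in the log, and the exact invocations may need adjusting for the target host):

```shell
# Hot-change the binlog format; only *new* sessions pick this up:
sudo mysql -e "SET GLOBAL binlog_format = 'STATEMENT';"

# The replication SQL thread is a long-lived session, so restart it for
# the change to take effect there (the fix kormat found):
sudo mysql -e "STOP SLAVE; START SLAVE;"

# Verify by decoding recent binlog events: ROW events decode to @1=...
# column assignments, STATEMENT events appear as plain SQL.
mysqlbinlog -vv --base64-output=DECODE-ROWS --skip-ssl \
  --start-datetime="$(date +'%Y-%m-%d %H:%M:%S')" \
  /srv/sqldata/db1118-bin.002700 | less
```

A busy host keeps many long-lived connections (and ongoing temporary tables, as noted above), which is why the change appeared not to stick until the replication threads were restarted.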
[11:17:59] yeah :-(, ok deploying and we can patch it later
[11:18:04] kormat: any thoughts?
[11:18:14] or I can wait
[11:19:45] looking
[11:20:23] Thanks
[11:20:26] LGTM
[11:20:35] FWIW they are in line with what Manuel did previously in the ticket
[11:20:41] BRB
[11:20:41] cool
[11:21:08] I would like to have something, not necessarily in production, but production-ready by peak time
[11:21:16] just in case
[11:26:52] 10DBA, 10SRE, 10ops-eqiad, 10Patch-For-Review: Investigate and repool db1134 - https://phabricator.wikimedia.org/T274472 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by jynus on cumin1001.eqiad.wmnet for hosts: ` db1163.eqiad.wmnet ` The log can be found in `/var/log/wmf-auto-reimage/2021021...
[11:32:44] 10Data-Persistence-Backup, 10SRE, 10SRE-swift-storage, 10Traffic, 10netops: Depool codfw swift cluster - https://phabricator.wikimedia.org/T267338 (10jcrespo)
[11:33:22] 10Data-Persistence-Backup, 10SRE, 10SRE-swift-storage, 10Traffic, 10netops: Depool codfw swift cluster - https://phabricator.wikimedia.org/T267338 (10jcrespo) I asked filippo to delay the maintenance 1 week due to unexpected workload on my side, which would prevent me from being ready by next week.
[11:36:01] oh, this happened again today, prometheus monitoring showing lots of (fictitious) lag, this time on db1106: 4.169e+05 ge 2
[11:36:51] could be mariadb, could be the exporter or grafana
[11:38:47] https://phab.wmfusercontent.org/file/data/blhuxhn5dovevdtctxew/PHID-FILE-35lgnomig7rvzfmdw2qj/Screenshot_from_2021-02-11_12-37-48.png
[11:41:47] jynus: do you want me to create a task for this?
[11:42:11] not sure how bad it really is
[11:42:37] definitely not very important, wondering if important enough to track on a task
[11:43:30] I see it's recovered now
[11:43:38] I guess it doesn't hurt to create one, and close it if we are not really going to work on it any time soon?
[11:43:47] Exactly
[11:43:51] +1 then
[11:44:23] that way if it alerts once, we can point other members to it
[11:45:08] I've seen show slave status returning something like max_int in the past
[11:45:38] so we could patch to have a ceiling (e.g. 50 years of lag) and in the future move to pt-heartbeat
[11:47:29] although 416900 is almost 5 days
[11:47:59] 10DBA: Investigate intermittent replica lag alarms - https://phabricator.wikimedia.org/T274513 (10LSobanski)
[11:48:13] 10DBA: Investigate intermittent replica lag alarms - https://phabricator.wikimedia.org/T274513 (10LSobanski) 05Open→03Resolved p:05Triage→03Low a:03LSobanski
[11:48:23] 10DBA: Investigate intermittent replica lag alarms - https://phabricator.wikimedia.org/T274513 (10LSobanski) 05Resolved→03Open
[11:48:32] Oops ;)
[11:48:51] 10DBA, 10SRE, 10ops-eqiad, 10Patch-For-Review: Investigate and repool db1134 - https://phabricator.wikimedia.org/T274472 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['db1163.eqiad.wmnet'] ` and were **ALL** successful.
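The ceiling idea floated above could be a small clamp applied to the raw Seconds_Behind_Master value before it is exported; a minimal shell sketch (the function name and the 50-year constant are illustrative, not an existing exporter feature):

```shell
# Clamp implausible Seconds_Behind_Master readings (e.g. max_int garbage
# from SHOW SLAVE STATUS) to a fixed ceiling before exporting them.
CEILING=$((50 * 365 * 24 * 3600))   # ~50 years, as suggested above

clamp_lag() {
    lag=$1
    if [ "$lag" -gt "$CEILING" ]; then
        echo "$CEILING"
    else
        echo "$lag"
    fi
}

clamp_lag 416900        # ~5 days: plausible, passed through unchanged
clamp_lag 4294967295    # max_uint garbage: clamped to the ceiling
```

Note that a clamp only catches values above the ceiling; the 416900s reading above is under any sane ceiling, which is why pt-heartbeat (measuring wall-clock delta of a replicated timestamp) is mentioned as the real long-term fix.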
[11:48:55] 10DBA: Investigate intermittent replica lag alarms - https://phabricator.wikimedia.org/T274513 (10LSobanski) a:05LSobanski→03None
[11:51:17] 10DBA: Investigate intermittent replica lag alarms - https://phabricator.wikimedia.org/T274513 (10jcrespo) a:03LSobanski
[11:51:49] thank you, sobanski, I have added info about the 2 times I noticed it
[11:52:07] 10DBA: Investigate intermittent replica lag alarms - https://phabricator.wikimedia.org/T274513 (10jcrespo)
[11:52:09] 10DBA, 10observability, 10Epic: Improve database alerting (tracking) - https://phabricator.wikimedia.org/T172492 (10jcrespo)
[11:56:38] 2021-02-11 11:54:29 INFO: About to transfer /srv/backups/snapshots/latest/snapshot.s1.2021-02-10--19-00-01.tar.gz from dbprov1002.eqiad.wmnet to ['db1163.eqiad.wmnet']:['/srv/sqldata'] (424536075870 bytes)
[11:56:56] provisioning is ongoing, it will take a while, so time for lunch
[11:57:59] if someone is available while I am out, I know how to add to tendril and zarcillo, but I have never done an etcd host addition
[11:59:17] https://grafana.wikimedia.org/d/000000377/host-overview?viewPanel=28&orgId=1&from=1613044190000&to=1613048352306&var-server=db1163&var-datasource=thanos&var-cluster=mysql&refresh=5m
[12:26:10] This process: https://wikitech.wikimedia.org/wiki/Dbctl#Add_a_new_host_(ie:_a_new_provisioned_host)_to_a_section?
[12:57:48] sobanski: yep
[12:57:50] jynus: i can do it
[12:58:44] I have a dbconfig patch ready to send for review
[12:58:55] sobanski: have you made the prereq puppet change?
[12:59:22] looks like no
[12:59:35] sobanski: in the puppet repo, the new host needs to be added to conftool-data/dbconfig-instance/instances.yaml
[12:59:52] That's what I'm talking about
[13:00:00] I only got that far :)
[13:00:02] ahh i see
[13:00:13] in that case, set me as reviewer :)
[13:02:05] it will take some extra time to have db1163 ready, it will have to catch up on replication, too
[13:03:34] Looks like I failed to add the task number correctly :(
[13:06:04] that is easy to amend, see my question also
[13:25:16] he: commit-message-validator: Line 7: Unexpected blank line
[13:29:35] This took significantly more effort than if any of you had made the change. I appreciate it :)
[13:30:07] the entertainment value was also significantly higher, so it balances out :)
[13:33:53] so the next step I guess will be: https://wikitech.wikimedia.org/wiki/Dbctl#Add_a_new_host_(ie:_a_new_provisioned_host)_to_a_section
[13:34:02] I have added it to zarcillo
[13:34:13] I cannot add it to tendril until it is up
[13:37:34] I like how the instruction says "Fill out all the data (template provided)" and provides no template. I'm assuming one is opened in the editor when the command is executed?
[13:38:09] I hope so
[13:45:14] sobanski: yes
[13:45:22] it's dynamically created on the fly
[13:46:44] Do you mind if I change the description to "Fill out the data using the auto-generated template that will open in your $EDITOR"?
[13:46:45] it's not necessarily a _good_ template
[13:47:29] sure
[13:47:39] sobanski: always good to improve the docs
[13:47:48] thx
[13:48:20] sobanski: let me know when you've completed the template (and on which machine you're editing it)
[13:48:25] i can look at the diff
[13:48:35] whoever edits it, make sure you add it depooled and with low percentage
[13:49:03] sobanski: do you have access to merge your puppet CR?
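The add-a-host flow discussed above (puppet conftool-data entry first, then dbctl) can be sketched roughly like this; command names follow the wikitech Dbctl page linked in the log, but the flags and commit messages here are illustrative, not verbatim:

```shell
# 1) Puppet: add the host to conftool-data/dbconfig-instance/instances.yaml,
#    get the change reviewed and merged, then refresh the cumin hosts:
sudo run-puppet-agent                 # on both cumin hosts

# 2) Create the instance object; dbctl opens an auto-generated template
#    in $EDITOR -- fill it in depooled and with a low percentage:
sudo dbctl instance db1163 edit

# 3) Review and commit the pending change to etcd:
sudo dbctl config diff
sudo dbctl config commit -m "Add db1163 to s1, depooled - T274472"
```

The host also needs to be registered in zarcillo and tendril separately, as done above; dbctl only covers the etcd/MediaWiki configuration side.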
[13:54:42] Merged and deployed
[13:55:39] sobanski: i'd `run-puppet-agent` on both cumin hosts
[13:56:52] starting mysql, no errors
[14:02:25] kormat: was that a statement of activity or a suggestion of activity?
[14:03:00] sobanski: yes
[14:03:23] (the latter)
[14:10:21] added on tendril now: https://tendril.wikimedia.org/host/view/db1163.eqiad.wmnet/3306
[15:42:20] db1163 is caught up, sobanski told me
[15:43:10] buffer pool is not yet fully loaded, but maybe we can pool it with 1% load to test it?
[15:44:28] sure, on it.
[15:44:46] then I will be checking for errors
[15:45:05] done
[15:45:38] if there is some misconfiguration, we should see it at: https://grafana.wikimedia.org/d/000000273/mysql?viewPanel=10&orgId=1&var-server=db1163&var-port=9104&from=1613047525973&to=1613058325973
[15:46:37] and I think we have, let's depool
[15:49:16] I did it
[15:52:34] ahh. no grants?
[15:53:20] yeah, fixing that
[15:53:42] 👍
[15:59:05] I think we should be ok now
[15:59:23] let me try to log in from mwdebug
[16:01:41] it says 'Error: Host not configured: "db1163"'
[16:02:04] but it works with other hosts
[16:02:15] do we need a mw deployment?
[16:03:39] I think there must be another issue aside from the grants
[16:05:49] huh. i'm at a loss there
[16:06:49] look: https://phabricator.wikimedia.org/P14322
[16:07:57] who knows what the heck "Host not configured" means..
[16:08:14] I will check grants again, maybe I missed something
[16:09:05] oh, maybe that is normal if it is depooled
[16:09:19] which would make sense
[16:09:36] let me double check the grants and we can try again?
[16:09:44] 👍
[16:12:16] webrequests and admin passwords look good
[16:12:25] so let's do it, do you do it or do I?
[16:12:53] i'll pool it now, at 1% again
[16:13:09] cool, be ready to depool too :-)
[16:13:11] done
[16:13:16] checking
[16:13:42] no instant errors so far
[16:14:04] and sql works now
[16:14:11] it was because it was depooled :-)
[16:14:17] phew. nice catch!
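The pool-at-1% / depool dance above would look roughly like this with dbctl (a sketch; the `-p` percentage option is assumed from the "1%" in the log and the wikitech docs, and the commit messages are illustrative):

```shell
# Pool the new replica at 1% of its weight and watch the DBError dashboard:
sudo dbctl instance db1163 pool -p 1
sudo dbctl config commit -m "Pool db1163 at 1% - T274472"

# If errors show up (e.g. missing grants), back out immediately:
sudo dbctl instance db1163 depool
sudo dbctl config commit -m "Depool db1163 - T274472"
```

Worth noting from the exchange above: a depooled host is not in the MediaWiki config at all, so `Error: Host not configured` from mwdebug is expected while depooled, not a grants problem.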
[16:14:32] ok, let's keep it with very low traffic
[16:14:40] until it warms up
[16:16:04] low error rate (normal rate) so far
[16:16:24] 0 errors on server log
[16:16:35] 🤞
[16:16:51] we learn by doing :-D
[16:22:57] 10DBA, 10SRE, 10Patch-For-Review: Productionize db1155-db1175 and refresh and decommission db1074-db1095 (22 servers) - https://phabricator.wikimedia.org/T258361 (10jcrespo)
[16:23:15] 10DBA, 10SRE, 10Patch-For-Review: Productionize db1155-db1175 and refresh and decommission db1074-db1095 (22 servers) - https://phabricator.wikimedia.org/T258361 (10jcrespo) a:05jcrespo→03Marostegui
[16:25:47] we need to wait until hit rate improves/buffer pool is full and we can increase weight: https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&from=1613052733344&to=1613060660438&var-server=db1163&var-port=9104&viewPanel=13 https://grafana.wikimedia.org/d/000000273/mysql?viewPanel=1&orgId=1&from=1613052733344&to=1613060660438&var-server=db1163&var-port=9104
[16:26:16] weight as in "weight * percentage pooled"
[16:26:31] we're in no rush. we can leave it running overnight
[16:26:52] yeah
[16:27:09] as long as we have an emergency option, we are now healthy
[16:27:23] having it was the only priority
[16:27:49] I will now send db1134 to dc-ops
[16:28:11] 👍
[16:28:26] ah, and enable notifications on the new host!
[16:28:33] (doing)
[16:29:29] question, when pooling it, did you pool it with the candidate comment?
[16:29:44] what is the status of that?
[16:29:55] (on etcd)
[16:34:23] as in, should I comment it on db1118 or db1163?
[16:34:31] let me check
[16:40:10] buf, it is not all in sync, so I will just make a note for manuel to give it a look next week
[17:27:11] 10DBA, 10DC-Ops, 10SRE, 10ops-codfw: (Need By: TBD) rack/setup/install db21[45-52] - https://phabricator.wikimedia.org/T273568 (10Papaul)
[17:46:35] 10DBA, 10SRE, 10ops-eqiad: Investigate and repool db1134 - https://phabricator.wikimedia.org/T274472 (10jcrespo) a:03wiki_willy Chris or John, please help us with this. Based on hw logs, it seems a typical case of a memory stick going wrong. Host is depooled from service and can be rebooted/serviced in any...
[17:49:28] 10DBA, 10SRE, 10ops-eqiad: Investigate and repool db1134 - https://phabricator.wikimedia.org/T274472 (10jcrespo)
[17:51:12] 10DBA, 10SRE, 10ops-eqiad: Investigate and repool db1134 - https://phabricator.wikimedia.org/T274472 (10wiki_willy) a:05wiki_willy→03Cmjohnson @Cmjohnson /@jclark-ctr - just a heads up, this is higher priority and the server is still under warranty, through November 2021. Thanks, Willy
[18:01:36] 10DBA, 10DC-Ops, 10SRE, 10ops-codfw: (Need By: TBD) rack/setup/install db21[45-52] - https://phabricator.wikimedia.org/T273568 (10Papaul)
[18:02:50] 10DBA, 10DC-Ops, 10SRE, 10ops-codfw: (Need By: TBD) rack/setup/install db21[45-52] - https://phabricator.wikimedia.org/T273568 (10Papaul)
[18:08:39] 10DBA, 10DC-Ops, 10SRE, 10ops-codfw: (Need By: TBD) rack/setup/install db21[45-52] - https://phabricator.wikimedia.org/T273568 (10Papaul)
[20:27:27] 10DBA, 10SRE, 10ops-eqiad: Investigate and repool db1134 - https://phabricator.wikimedia.org/T274472 (10Cmjohnson) pasting system event log Record: 10 Date/Time: 02/10/2021 23:35:50 Source: system Severity: Non-Critical Description: Correctable memory error rate exceeded for DIMM_B3. -------...
[20:33:23] 10DBA, 10SRE, 10ops-eqiad: Investigate and repool db1134 - https://phabricator.wikimedia.org/T274472 (10Cmjohnson) a ticket has been created with Dell for a new DIMM. Ticket number SR1051489398
[23:45:52] 10DBA, 10Patch-For-Review, 10Performance-Team (Radar): Productionize x2 databases - https://phabricator.wikimedia.org/T269324 (10aaron) >>! In T269324#6815030, @Marostegui wrote: > Thanks @Kormat > @Krinkle @aaron - let's go for the x1 approach but with local masters being writable then? LGTM.