[01:44:20] DBA, Operations, ops-codfw: db2097 memory errors leading to crash - https://phabricator.wikimedia.org/T252492 (Papaul) I will be onsite tomorrow
[02:57:08] DBA: Improve output message readabiliy of transfer.py - https://phabricator.wikimedia.org/T252802 (Privacybatm) Open→Resolved
[02:57:10] DBA, Google-Summer-of-Code (2020): GSoC 2020 Proposal: Improve the framework to transfer files over the LAN - https://phabricator.wikimedia.org/T248256 (Privacybatm)
[04:24:39] DBA, cloud-services-team (Kanban): Reimage labsdb1011 to Buster and MariaDB 10.4 - https://phabricator.wikimedia.org/T249188 (Marostegui) So looks like the restart is the one causing issues. The stop slaves reported no issues, neither did the stop mysql. Starting mysql brought no issues, but as soon as I...
[04:37:44] DBA, cloud-services-team (Kanban): Reimage labsdb1011 to Buster and MariaDB 10.4 - https://phabricator.wikimedia.org/T249188 (Marostegui) I have repooled the host and the queries are arriving. The errors stopped and they are definitely not happening as fast as they used to. This was the last one: ` May...
[04:43:23] DBA, cloud-services-team (Kanban): Reimage labsdb1011 to Buster and MariaDB 10.4 - https://phabricator.wikimedia.org/T249188 (Marostegui) Still no more errors, the only new message was, interestingly: ` May 26 04:41:29 labsdb1011 mysqld[19806]: 2020-05-26 4:41:29 0 [Note] InnoDB: Buffer pool(s) load com...
[04:51:51] DBA, cloud-services-team (Kanban): Reimage labsdb1011 to Buster and MariaDB 10.4 - https://phabricator.wikimedia.org/T249188 (Marostegui) Interesting, I just saw this: https://jira.mariadb.org/browse/MDEV-22497 which was closed a few days ago This is exactly the error we are seeing and seems to be fixe...
[04:57:14] DBA, cloud-services-team (Kanban): Reimage labsdb1011 to Buster and MariaDB 10.4 - https://phabricator.wikimedia.org/T249188 (Marostegui) >>! 
In T249188#6164194, @Marostegui wrote: > Interesting, I just saw this: > https://jira.mariadb.org/browse/MDEV-22497 which was closed a few days ago > > This is e...
[05:01:38] DBA, Patch-For-Review, cloud-services-team (Kanban): Reimage labsdb1011 to Buster and MariaDB 10.4 - https://phabricator.wikimedia.org/T249188 (Marostegui) Situation as of now: I have repooled labsdb1011, it keeps having some of those errors, but it is not crashing. I want to see what happens once it...
[05:36:09] DBA, Patch-For-Review, Upstream, cloud-services-team (Kanban): Reimage labsdb1011 to Buster and MariaDB 10.4 - https://phabricator.wikimedia.org/T249188 (Marostegui)
[06:23:32] Blocked-on-schema-change, DBA: Apply Babel schema change expanding babel_lang in Wikimedia production - https://phabricator.wikimedia.org/T253342 (Marostegui)
[06:24:41] Blocked-on-schema-change, DBA: Apply Babel schema change expanding babel_lang in Wikimedia production - https://phabricator.wikimedia.org/T253342 (Marostegui)
[06:26:07] Blocked-on-schema-change, DBA: Apply Babel schema change expanding babel_lang in Wikimedia production - https://phabricator.wikimedia.org/T253342 (Marostegui)
[06:43:36] Blocked-on-schema-change, DBA: Apply Babel schema change expanding babel_lang in Wikimedia production - https://phabricator.wikimedia.org/T253342 (Marostegui)
[06:44:49] Blocked-on-schema-change, DBA: Apply Babel schema change expanding babel_lang in Wikimedia production - https://phabricator.wikimedia.org/T253342 (Marostegui)
[06:47:35] Blocked-on-schema-change, DBA: Apply Babel schema change expanding babel_lang in Wikimedia production - https://phabricator.wikimedia.org/T253342 (Marostegui)
[07:01:55] i'm trying to make a cumin host pattern that matches all the db hosts, but so far i'm failing. "O:mariadb::core" doesn't catch labsdb or parsercaches
[07:03:15] check aliases.yaml.erb line 64
[07:03:36] ohh, look at that.
[07:04:09] * marostegui hopes the command to execute won't be a poweroff
[07:06:35] why do db2093 and db1115 have software raid?
[07:06:48] (zarcillo and tendril hosts)
[07:07:20] cause we didn't really want to spend lots of money on them
[07:07:29] just to store tendril
[07:07:31] ah - so they are specially set aside for this
[07:08:31] DBA, Operations: In-place conversion from LVM to normal partition - https://phabricator.wikimedia.org/T252195 (Kormat) Scan of db fleet complete: ` kormat@cumin1001:~(0:98)$ sudo cumin 'A:db-all-codfw or A:db-all-eqiad' "lvs -o lv_layout --noheadings" 185 hosts will be targeted: db[2071-2092,2094-2101,21...
[07:09:03] kormat: yep
[07:12:37] mm. `db-all-codfw` doesn't match db2093
[07:13:08] ah maybe cause it has the mariadb::misc::zarcillo role?
[07:13:12] and that's not included?
[07:13:21] I think that's a relatively new role
[07:13:27] only misc::tendril?
[07:13:49] ahh. `O:mariadb::misc::tendril_and_zarcillo`
[07:13:50] probably it can be added to line 66
[07:13:52] the alias uses the old name
[07:14:02] cause there I don't see the zarcillo one by itself
[07:14:03] i'll send a CR
[07:14:07] thanks :*
[08:36:05] DBA, Upstream, cloud-services-team (Kanban): Reimage labsdb1011 to Buster and MariaDB 10.4 - https://phabricator.wikimedia.org/T249188 (Marostegui) The host keeps performing fine, serving queries and having no lag -it has had no crashes despite the fact that it keeps logging those errors from time to...
[08:59:46] marostegui: hey, can I open this paste up? It doesn't have any private info: https://phabricator.wikimedia.org/P9611
[09:00:21] Amir1: yep
[09:00:28] Thanks!
[09:00:30] :*
[10:09:24] DBA, Operations, ops-codfw: db2097 memory errors leading to crash - https://phabricator.wikimedia.org/T252492 (jcrespo) >>! In T252492#6164093, @Papaul wrote: > > I will be onsite tomorrow Will stop backup processes and stop the server.
[10:21:12] DBA, Operations, ops-codfw: db2097 memory errors leading to crash - https://phabricator.wikimedia.org/T252492 (jcrespo) ` $ ssh db2097.mgmt User:root logged-in to ILOMXQ91304KD.(10.193.2.204 / FE80::8230:E0FF:FE3E:F9A2) iLO Standard 1.40 at Feb 05 2019 Server Name: Server Power: Off ` host is down...
[10:25:52] heads up that db2097 is down for maintenance ^, meaning codfw db backups of s1 and s6 will fail until it is back up
[10:26:14] I made sure that the one running today finished correctly, though
[10:27:55] kormat: re zarcillo, recently db2093/db1115 roles were changed, was that on the spicerack repo?
[10:28:23] I thought I grepped all usages on puppet, and operations/software, but I guess I missed that one
[10:33:14] jynus: this was on the puppet repo
[10:33:36] modules/profile/templates/cumin/aliases.yaml.erb
[10:36:52] did you do the CR already?
[10:37:22] then my fault, I thought I checked all hiera keys and files referencing it, but I must have missed that one
[10:37:47] tendril_and_zarcillo disappeared and now there is tendril and zarcillo separately
[10:38:13] as you know, what that will be in the end is in limbo
[10:39:00] yep, already done
[10:39:30] (https://gerrit.wikimedia.org/r/598674)
[10:41:28] thanks
[10:41:28] jynus: i have a bacula question for you: is it possible to have the console give me the details about a job? e.g. 231444 was the test restore we did. `status dir` just says it was a restore, nothing else. 
i can see the details in the logs (source, destination, etc), but i'm wondering if the console can also display that
[10:41:39] yes to both
[10:42:07] one thing you can do is
[10:42:12] check_bacula.py dbprov2001.codfw.wmnet-Monthly-1st-Wed-Databases-mysql-srv-backups-dumps-latest
[10:42:20] and it will give you a summary
[10:43:17] which underneath just does list job
[10:43:47] example "llist job jobid=230734"
[10:44:10] that is what the icinga and prometheus checks use underneath
[10:45:10] I did the check_bacula.py --verbose + check_bacula.py jobname which normally is enough for a quick overview
[10:46:34] hmm. `llist jobid=xxx` is close to what i'm looking for
[10:47:13] everything structured is easy to check on bconsole/mysql
[10:47:21] for actual error messages, you need the log
[10:48:18] status dir & status storage etc are just the "current status"
[10:48:36] for history you have to check the job details
[10:58:06] DBA, MediaWiki-General, MW-1.35-notes (1.35.0-wmf.34; 2020-05-26), Schema-change, Technical-Debt: Normalise MW Core database language fields length - https://phabricator.wikimedia.org/T253276 (Marostegui) Is this ready to go?
[11:45:18] DBA, MediaWiki-General, MW-1.35-notes (1.35.0-wmf.34; 2020-05-26), Schema-change, Technical-Debt: Normalise MW Core database language fields length - https://phabricator.wikimedia.org/T253276 (Reedy)
[11:48:06] DBA, MediaWiki-General, MW-1.35-notes (1.35.0-wmf.34; 2020-05-26), Schema-change, Technical-Debt: Normalise MW Core database language fields length - https://phabricator.wikimedia.org/T253276 (Marostegui) Thank you @Reedy
[12:04:01] marostegui: in site.pp some sections have comments like "see also db1090, db1103, db1105 below", and some do not
[12:04:23] i'd like to make that consistent. do you have a preference for whether i add it where it's missing, or remove it?
[12:04:59] kormat: The sections that do not have it lack it because they don't have multi-instance hosts (or that should be the only reason why they don't)
[12:05:10] We added it because it was pretty easy to miss multi-instance hosts
[12:05:15] marostegui: s1 doesn't have it
[12:05:21] it should then
[12:05:26] ok. i'll fix that.
[12:05:34] the only one that doesn't have multi-instance in core is s3
[12:05:38] as far as I remember
[12:06:33] i'll also check the existing comments for correctness
[12:06:41] thanks :)
[12:57:14] DBA: tendril_purge_global_status_log_5m and global_status_log needs more frequent purging - https://phabricator.wikimedia.org/T252331 (Marostegui) I have been playing with an event, similar to the one that purges `global_status_log_5m` with kinda the same options and seems to keep `general_status_log` table...
[14:21:00] btw kormat did you see https://gerrit.wikimedia.org/r/c/operations/software/conftool/+/597631 ?
[14:21:14] don't look at the code, just the pastes in the commit message ;)
[14:21:47] I am going to release this today or tomorrow
[14:21:58] i have indeed. <3
[14:22:16] I also filed a bug with Python upstream, but it hasn't gotten any attention yet
[15:03:55] there is still a high rate of INSERT statements since the train rolled https://grafana.wikimedia.org/d/000000273/mysql?panelId=2&fullscreen&orgId=1&var-dc=eqiad%20prometheus%2Fops&var-server=db1122&var-port=9104
[15:04:27] but I don't see any lag on https://dbtree.wikimedia.org/
[15:04:44] so I guess we can let it flow and claim that the train is a success / close the blocker task ?
[15:04:45] marostegui: ^ :]
[15:13:46] he is away
[15:14:00] but yes, the issue is with the deployment/old code
[15:14:05] not with the new code
[15:29:54] jynus: thx ;) I am just closing the deployment blocker task, but leaving the INSERT rate task open
[15:31:56] DBA, Operations, ops-codfw: db2097 memory errors leading to crash - https://phabricator.wikimedia.org/T252492 (Papaul) Open→Resolved memory replacement and firmware upgrade complete Return label information below {F31842805}
[15:33:02] cdanis: the level of hackiness of that extension is growing by the day ;)
[15:38:29] volans: hey, icdiff was your idea ;)
[15:38:37] but yes, it needs a refactor at some point
[15:44:15] cdanis: what worries me about this patch (that I totally understand) is that the paste is often used to get a review of the diff from someone else
[15:44:32] so if I get the nice icdiff but then I have to show the weird difflib to someone else for approval
[15:44:37] we didn't gain that much
[15:45:02] the icdiff is much less readable on phab, because you don't get the terminal colorizing there, and instead have to rely on its syntax highlighting
[15:45:48] yes I know
[15:48:50] so, in practice, there's not any difference between how the two generate diff blocks
[15:48:55] I just think it's a bit nicer to look at on a console
[15:49:12] we shouldn't have the case where the icdiff looks wildly different from the unified_diff, they invoke a lot of the same code internally
[15:50:01] wait, I might have missed a bit
[15:50:12] the unified_diff is anyway the "fixed" version
[15:50:14] not the ugly one
[15:50:18] yes
[15:50:21] this is a further improvement
[15:50:24] to console output
[15:50:31] where i think two columns is more readable anyway
[15:50:40] ok fair enough, I thought for a moment that was an alternative
[15:50:44] but it builds on the already-merged change, where we recurse into every subsection, which seems to trick difflib
[16:23:06] DBA, MediaWiki-General, 
MW-1.35-notes (1.35.0-wmf.34; 2020-05-26), Schema-change, Technical-Debt: Normalise MW Core database language fields length - https://phabricator.wikimedia.org/T253276 (Krinkle)
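
Editor's note on the 15:33-15:50 diff discussion: the idea of recursing into every subsection and running `unified_diff` on each leaf can be sketched roughly as below. This is only an illustration under stated assumptions, not the code from the conftool patch: the function name `diff_sections`, the `(live)`/`(edited)` labels, the JSON rendering of leaves, and the section, host names and weights in the sample data are all hypothetical. Only `difflib.unified_diff` and `json.dumps` are real standard-library calls.

```python
import difflib
import json


def diff_sections(old, new, path=""):
    """Yield unified-diff lines, recursing into nested dict sections.

    Hypothetical helper: keys are assumed to be strings, and a missing
    key is treated the same as a key whose value is None.
    """
    if isinstance(old, dict) and isinstance(new, dict):
        # Recurse into every subsection so each leaf gets its own small diff.
        for key in sorted(set(old) | set(new)):
            sub_path = f"{path}/{key}" if path else str(key)
            yield from diff_sections(old.get(key), new.get(key), sub_path)
        return
    if old == new:
        return  # unchanged leaf: emit nothing
    # Render both leaves as text lines and let difflib compare them.
    old_lines = json.dumps(old, indent=2, sort_keys=True).splitlines()
    new_lines = json.dumps(new, indent=2, sort_keys=True).splitlines()
    yield from difflib.unified_diff(
        old_lines,
        new_lines,
        fromfile=f"{path} (live)",
        tofile=f"{path} (edited)",
        lineterm="",
    )


# Hypothetical sample data: section -> host -> weight.
live = {"s1": {"db1001": 100, "db1002": 200}, "s2": {"db1003": 50}}
edited = {"s1": {"db1001": 100, "db1002": 0}, "s2": {"db1003": 50}}

for line in diff_sections(live, edited):
    print(line)
```

Diffing leaf by leaf keeps each hunk small and labelled with its path, which is one plausible reason a per-subsection diff reads better than a single diff over the whole serialized object.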