[05:00:23] 10DBA, 10Operations, 10ops-eqiad: db1145 crashed - memory issues - https://phabricator.wikimedia.org/T258249 (10Marostegui) Thanks @Jclark-ctr - I am going to depool this host so it is ready for when you arrive at the DC.
[05:08:53] 10DBA, 10Operations, 10ops-eqiad: db1145 crashed - memory issues - https://phabricator.wikimedia.org/T258249 (10Marostegui) This being a backup source, it doesn't require depooling, but we need to check with @jcrespo when this host can be powered off.
[05:13:14] 10DBA, 10Patch-For-Review: Upgrade dbproxyXXXX to Buster - https://phabricator.wikimedia.org/T255408 (10Marostegui)
[05:23:08] 10DBA, 10Product-Infrastructure-Team-Backlog, 10Push-Notification-Service, 10Patch-For-Review, 10User-Marostegui: DBA review for Echo push notification subscription tables - https://phabricator.wikimedia.org/T246716 (10Marostegui) Thanks for reaching out, this got buried in a pile of emails :-) So I am not...
[05:57:51] 10DBA, 10Patch-For-Review: Relocate "old" s4 hosts - https://phabricator.wikimedia.org/T253217 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by marostegui on cumin1001.eqiad.wmnet for hosts: ` ['db1084.eqiad.wmnet'] ` The log can be found in `/var/log/wmf-auto-reimage/202007220557_marostegui_1700...
[05:58:27] 10DBA, 10Patch-For-Review: Relocate "old" s4 hosts - https://phabricator.wikimedia.org/T253217 (10Marostegui)
[06:16:52] 10DBA, 10Patch-For-Review: Relocate "old" s4 hosts - https://phabricator.wikimedia.org/T253217 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['db1084.eqiad.wmnet'] ` and were **ALL** successful.
[06:28:27] 10DBA: Upgrade m2 to Buster and Mariadb 10.4 - https://phabricator.wikimedia.org/T257540 (10Marostegui)
[07:39:17] 10DBA: Show transfer time once successfully completed - https://phabricator.wikimedia.org/T258559 (10Marostegui)
[07:39:31] 10DBA: Show transfer time once successfully completed - https://phabricator.wikimedia.org/T258559 (10Marostegui) p:05Triage→03Low
[07:45:32] 10DBA, 10Operations, 10ops-eqiad: db1145 crashed - memory issues - https://phabricator.wikimedia.org/T258249 (10jcrespo) Backups were taken from db1145 today and the host put down. Please ping here when maintenance is complete.
[07:45:55] 10DBA, 10wikitech.wikimedia.org: Move database for wikitech (labswiki) to a main cluster section - https://phabricator.wikimedia.org/T167973 (10Marostegui) I have done a quick check using mysqldump (and not mydumper), as the test was done from a 10.1 host to a 10.4 one: - Extracting the DB locally: 4 minutes...
[07:46:46] 10DBA: Show transfer time once successfully completed - https://phabricator.wikimedia.org/T258559 (10Privacybatm) Yeah, I think it is a good idea to have the transfer time. Thank you!
[07:47:04] 10DBA: Show transfer time once successfully completed - https://phabricator.wikimedia.org/T258559 (10Privacybatm) a:03Privacybatm
[07:51:53] 10DBA: Show transfer time once successfully completed - https://phabricator.wikimedia.org/T258559 (10jcrespo) Note that on the latest HEAD version, soon to be packaged, the date is shown at the start and completion of the backup, like this: ` root@cumin1001:~/transferpy$ mkdir ~/test; PYTHONPATH=. python3 trans...
[07:54:23] 10DBA: Show transfer time once successfully completed - https://phabricator.wikimedia.org/T258559 (10Marostegui) Yeah, it is something with very low priority (as stated at T258559#6325133).
Having both timestamps is indeed useful
[08:04:44] 10DBA, 10wikitech.wikimedia.org: Move database for wikitech (labswiki) to a main cluster section - https://phabricator.wikimedia.org/T167973 (10jcrespo) Not sure how large wikitech is, but 67 minutes looks to me like a long time, given I was able to import much larger wikis in less time in the past. Consider t...
[08:08:27] 10DBA, 10wikitech.wikimedia.org: Move database for wikitech (labswiki) to a main cluster section - https://phabricator.wikimedia.org/T167973 (10Marostegui) Yeah, the problem is at the `text` table from what I can see, so we end up with the usual issue: even though we have many threads, we end up just waiting f...
[08:12:50] 10DBA: Show transfer time once successfully completed - https://phabricator.wikimedia.org/T258559 (10Privacybatm) Sounds good to me. And yeah, I am currently concentrating on the Gerrit comments and documentation. Thank you!
[08:15:18] 10DBA: pl_namespace index on pagelinks is unique only in s8 - https://phabricator.wikimedia.org/T256685 (10Marostegui) 05Open→03Stalled Only the master is pending - stalling this until the DC switchover is done and eqiad is on standby.
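The feature requested in T258559 amounts to recording a wall-clock timestamp at the start and the end of the transfer and printing the difference. A minimal sketch of the idea, assuming a generic transfer callable (`timed_transfer` and the callback are hypothetical names, not transferpy's actual API):

```python
import time
from datetime import datetime, timezone


def timed_transfer(run_transfer):
    """Run a transfer callable, printing start/finish timestamps
    and the elapsed wall-clock time once it completes."""
    start = time.monotonic()
    print("started: ", datetime.now(timezone.utc).isoformat(timespec="seconds"))
    result = run_transfer()
    elapsed = time.monotonic() - start
    print("finished:", datetime.now(timezone.utc).isoformat(timespec="seconds"))
    print(f"transfer took {elapsed:.1f}s")
    return result
```

Since HEAD already prints dates at start and completion (per jcrespo's comment above), the only real addition is the subtraction itself.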
[08:15:27] 10DBA, 10Datasets-General-or-Unknown, 10Patch-For-Review, 10Sustainability (Incident Prevention), 10WorkType-NewFunctionality: Automate the check and fix of object, schema and data drifts between mediawiki HEAD, production masters and slaves - https://phabricator.wikimedia.org/T104459 (10Marostegui)
[08:39:39] 10DBA, 10Operations, 10User-Kormat: Refactor tendril+zarcillo roles/profiles - https://phabricator.wikimedia.org/T258566 (10Kormat)
[08:39:57] marostegui: ^ needs some input from you
[08:40:08] kormat: will check later, I am a bit overloaded atm
[08:40:19] * kormat depools marostegui
[08:40:23] (no problem :)
[08:48:19] 10DBA, 10Operations, 10Epic, 10User-Kormat: Use zarcillo as an authoritative inventory of db instances/roles - https://phabricator.wikimedia.org/T257814 (10Kormat)
[08:48:24] 10DBA, 10Operations, 10Patch-For-Review, 10User-Kormat: Set up replication for zarcillo - https://phabricator.wikimedia.org/T257816 (10Kormat) 05Open→03Resolved Monitoring is not properly in place, but going to track that in T258566.
[09:07:59] db2102 host is completely broken
[09:08:10] what happened?
[09:08:12] ERROR 1356 (HY000) at line 7: View 'mysql.user' references invalid table(s) or column(s) or function(s) or definer/invoker of view lack rights to use them
[09:08:30] ah that is the core test one
[09:08:31] root@db2102[(none)]> select user, host from mysql.user where user='root';
[09:08:32] ERROR 1449 (HY000): The user specified as a definer ('mariadb.sys'@'localhost') does not exist
[09:09:04] did I or any of you try to upgrade it/import to it?
[09:09:21] I don't think I have touched it in a long time
[09:09:47] it replicates from s1 codfw?
[09:10:37] yes it does
[09:11:39] I think I am going to put it down, I cannot fix it
[09:11:50] to rebuild it?
[09:12:03] to at least operate with it somehow
[09:12:09] yeah, makes sense
[09:12:18] no idea what could've happened with it
[09:12:19] the user table is broken, and I cannot handle accounts there
[09:12:33] maybe it was used to test an upgrade or something
[09:12:37] maybe yeah
[09:13:51] I am going to run a select on the mysql.user table on all hosts to confirm it is only that one
[09:14:11] I don't want to find out too late that the 10.1 -> 10.4 upgrade has an irreparable problem
[09:14:21] +100
[09:14:54] most likely explanation is some import of 10.1 data into a 10.4 host during a test
[09:15:01] given it is the test host
[09:17:23] a broken mysql.user? i swear it wasn't me 😅
[09:25:22] no other host presented that error, only db2102
[09:25:35] for now I will delete it and bring it up empty
[09:25:40] \o/
[09:25:41] pheeew
[09:26:41] will take care of it as it is one of ""my"" hosts, big quotes
[09:28:24] :)
[09:28:59] not urgent to provision it as it is going to be used to test backups anyway
[09:29:43] 10DBA, 10Operations, 10User-Kormat: Refactor tendril+zarcillo roles/profiles - https://phabricator.wikimedia.org/T258566 (10Marostegui) > Do we need to cover the case where db1115 is the active tendril node, but db2093 is the active zarcillo one? If so i'm not sure we can easily use mariadb::monitor_readonly...
[09:35:16] marostegui: ty for the feedback :)
[09:35:37] in the meantime i started looking at the prometheus mysql puppet code, and.... 😿
[09:37:37] kormat: https://jynus.com/gif/calm.gifv
[09:37:51] :D
[09:48:18] you've a gift for picking the best parts of our infra-code :-P
[09:48:38] the word you're looking for is "curse" :)
[09:51:55] lol
[10:06:01] 10DBA, 10wikitech.wikimedia.org: Move database for wikitech (labswiki) to a main cluster section - https://phabricator.wikimedia.org/T167973 (10jcrespo) > Yeah, the problem is at the text table from what I can see I was able to export the table in smaller chunks by setting the backups like this: `name=backup...
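The sweep at [09:13:51] boils down to running the same probe query on every host and collecting the ones that error out. A rough sketch of that loop, where `query_host` is a stand-in for whatever client wrapper is actually in use (an assumption, not an existing tool):

```python
# Probe that failed on db2102 with ERROR 1449 (missing 'mariadb.sys' definer).
PROBE = "SELECT user, host FROM mysql.user WHERE user='root'"


def find_broken_hosts(hosts, query_host):
    """Return (host, error) pairs for hosts where the probe fails.

    query_host(host, sql) is a hypothetical helper that raises on
    any MySQL error and returns rows otherwise.
    """
    broken = []
    for host in hosts:
        try:
            query_host(host, PROBE)
        except Exception as exc:
            broken.append((host, str(exc)))
    return broken
```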
[10:10:06] 10DBA, 10wikitech.wikimedia.org: Move database for wikitech (labswiki) to a main cluster section - https://phabricator.wikimedia.org/T167973 (10Marostegui) That is excellent news! Good point, db2093 is the tendril host, which has a very low innodb buffer pool because of the OOM we had, that's a huge limiting fa...
[10:11:24] marostegui: I would like to try to sell you the in-house tools, but I am not sure how. I know that ours are not as well documented/externally used
[10:11:39] 10DBA, 10Operations, 10User-Kormat: Refactor tendril+zarcillo roles/profiles - https://phabricator.wikimedia.org/T258566 (10Kormat) p:05Triage→03Medium a:03Kormat
[10:12:00] jynus: yeah, I just forgot we can use those for similar cases like wikitech
[10:12:08] Because I don't operate them often enough :(
[10:12:19] sure, my question is: what is it that I am failing at?
[10:12:34] flexibility? habit? not matching the use case?
[10:12:41] I don't think you are failing, it is simply that we don't have many use cases on a daily basis
[10:13:01] Like, we have moved wikis 2 times: once for wikidata, once for the split from s3 to s5, and now this
[10:13:44] so my point was awareness: "hey, you can use these tools and they can be tuned for performance"
[10:14:09] and also because I was worried about performance, because if labswiki took so long to import, others would take much longer
[10:14:21] yeah, I think db2093 was key there
[10:14:21] but it just needed some extra tuning
[10:14:51] the concurrency was needed too because mydumper generated a single multi-gigabyte file
[10:14:55] by default
[10:15:25] as part of the goal I can fine-tune performance to speed up load times
[10:15:54] *my backup goal
[10:17:13] I wonder how long it would take in production, with the normal replication traffic + reads
[10:17:40] multiply by 2 as a rule of thumb
[10:17:53] it all depends on desired import concurrency and ongoing load
[10:17:57] so still 1h RO would get us there
[10:18:00] is this for m5?
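The single multi-gigabyte file mentioned at [10:14:51] is what mydumper's `--rows` option addresses: it splits each table's dump into chunks of roughly that many rows, so the `--threads` workers can load a big table like `text` in parallel. A sketch that only builds the command line (the numbers are illustrative, not the values used for the wikitech test):

```python
def mydumper_cmd(database, threads=8, rows=500_000, outdir="dump"):
    """Build a mydumper invocation that chunks big tables by row count
    instead of dumping each table into one huge file."""
    return [
        "mydumper",
        "--database", database,
        "--threads", str(threads),
        "--rows", str(rows),        # split tables into ~rows-row chunk files
        "--outputdir", outdir,
    ]
```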
[10:18:07] sorry
[10:18:09] s5?
[10:18:25] marostegui: another factor might be that db2093 isn't using hw raid
[10:18:50] jynus: I was thinking about s6
[10:18:50] s5 load at off-peak times is almost none
[10:18:52] kormat: definitely
[10:18:58] same for s6
[10:19:13] jynus: I thought about s6 cause it only has 3 wikis (s5 doesn't have many more, but...)
[10:19:16] although if I am not mistaken s6
[10:19:27] has a different peak time due to ja + ru
[10:19:34] probably, yeah
[10:19:51] jawiki is very small though
[10:20:02] note that in writes, s5 was still behind, as it was the first to catch up very quickly on labsdb
[10:20:22] before s6?
[10:20:32] let me check
[10:21:06] again, it depends what you want to optimize for
[10:21:10] reads vs writes
[10:21:30] but look for yourself: https://grafana.wikimedia.org/d/000000273/mysql?panelId=6&fullscreen&from=1589174669941&to=1590538099087&var-server=labsdb1011&var-port=9104&orgId=1
[10:23:39] of course, wikitech is a special wiki, and I believe it is on group0
[10:24:13] so not having any opinion, just mentioning there are a few factors: read load, write load, deployment risk, "specialness"
[10:24:21] I thought s5 wasn't that far away from s6 in writes
[10:24:23] interesting
[10:24:37] it is labsdb replication, could be misleading
[10:24:49] but I think it matches my experience on backup time
[10:25:28] we could and probably should move more things from s3 there
[10:25:42] s3's bottleneck is in objects
[10:25:43] yeah, we need to get stuff out of s3 again
[10:25:44] yeah
[10:25:55] when did we move things out, 3 years ago?
[10:26:14] I wonder if we can make s5 the default for when new wikis are created actually
[10:26:21] either s5 or s6
[10:26:29] so at least we don't place new stuff on s3 anymore
[10:26:33] so the issue is that, as mediawiki works now
[10:26:45] the config is to put some things in other places
[10:26:46] yeah, but I mean in the wiki's initialization script
[10:26:50] but by default on s3
[10:26:55] to stop taking default => s3
[10:26:58] so if we move the default, we would have to list 800 wikis
[10:27:06] :(
[10:27:27] I thought if we can create wikis manually somewhere else then we only have to start listing those new ones
[10:27:33] doable, and probably there could be a better option, but I am talking about the current config setup/codebase
[10:27:47] yeah, but listing 800 wikis isn't realistic
[10:27:54] yeah, I think if you are going to create a wiki
[10:27:55] But there's lots of black magic in the creation scripts
[10:27:59] defining it manually is possible
[10:28:04] so maybe there's an option to specify where
[10:28:13] and not take the default values
[10:28:14] just add it to the non-default config and it should work
[10:28:27] I am talking about existing ones
[10:28:33] no, yeah, the existing ones are different
[10:28:40] I was thinking about the new ones
[10:28:41] https://phabricator.wikimedia.org/T184805
[10:28:44] so we stop creating them on s3
[10:29:01] wow 2018, I thought it was a lot older
[10:30:02] some refactoring could be done, but don't expect it soon
[10:30:19] there have been talks about making configuration more dynamic for a long time
[10:30:34] should I create a task for MW devs?
[10:31:08] so not against it, but I would wait for us to decide what to do
[10:31:24] what to do as in... where to place new ones?
[10:31:26] as in, what is the problem and the proposed solution
[10:31:40] the specific ask, I mean
[10:31:48] ah
[10:32:01] I don't think it is as urgent for the current movement
[10:32:02] I was thinking about a very simple thing: can we stop putting _new_ wikis on s3 and put them on s5 instead?
[10:32:25] and that'll develop into multiple refactorings XD
[10:32:28] I am liking the idea of a new section more
[10:32:46] a new section?
[10:32:54] because something tells me that if we just move the default from s3 to s5
[10:33:10] we will just move the problem and not really solve it :-D, ending up with 2 problems
[10:33:25] Yeah, but the first step is to be able to create new things in places other than s3
[10:33:31] sure
[10:33:40] we don't have to tell them where for now
[10:33:44] just hey, we need this feature
[10:33:56] like ./create_new_wiki --section sX
[10:34:11] like that is already possible
[10:34:23] is it?
[10:34:42] Then maybe the procedure for creating new wikis should include asking us which section
[10:34:49] they just need to create them on a non-default section
[10:35:00] yeah, something like that
[10:35:18] I will start an email thread with the usual new wiki creators I think
[10:35:19] I don't think there is really a reason not to do that, except being the default
[10:35:21] Rather than a task
[10:35:28] expose the problem
[10:35:29] Do you want to be CC'ed?
[10:35:35] sure, it interests me
[10:35:40] for backups
[10:35:41] cool
[10:35:46] I will start it later
[10:36:07] so what I mean is that what cannot be done now
[10:36:14] is changing the default wiki
[10:36:20] *default section
[10:36:21] yeah, that's fine
[10:36:34] but I think there is nothing obliging us to use the default for new wikis
[10:36:54] yeah, I just don't know if --section sX really exists or not when creating a new one
[10:36:59] we'll see what they say
[10:37:10] yeah, they would just have to change the eqiad.php file
[10:37:15] yep
[10:37:16] and the dblist
[10:37:36] so what I meant is that starting a conversation was ok
[10:37:58] but that we weren't sure how we really wanted it done
[10:38:08] other than "avoid overpopulating s3"
[10:38:15] yes
[10:39:29] last thing
[10:39:59] db2087 IPMI Sensor Status UNKNOWN 2020-07-22 10:19:21 23d 8h 42m 29s 3/3 ipmi_sdr_cache_open: /root/.freeipmi/sdr-cache/sdr-cache-db2087.localhost: internal IPMI error
[10:40:24] yep, I am aware, I tried a soft reset and it never worked
[10:40:34] can you create a task for it?
[10:40:36] ok
[10:40:48] I guess it needs on-site love
[10:41:04] just checking because I didn't search the history for that
[10:41:08] will report
[10:41:11] thanks!
[12:08:06] 10DBA, 10Upstream, 10cloud-services-team (Kanban): Reimage labsdb1011 to Buster and MariaDB 10.4 - https://phabricator.wikimedia.org/T249188 (10Marostegui) I wouldn't be surprised if labsdb1009 crashes sometime "soon". I was checking it for some bad performance lately and I have seen this in the logs: ` Jul...
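The "800 wikis" constraint discussed above comes from how the section mapping resolves a wiki: only non-default assignments are listed explicitly, and everything else falls through to s3. A toy Python model of that lookup (the mapping contents are illustrative; the real data lives in MediaWiki's db-*.php `sectionsByDB` config):

```python
# Only wikis on non-default sections are listed explicitly;
# any wiki missing from the map lands on the default, s3.
SECTIONS_BY_DB = {
    "enwiki": "s1",
    "wikidatawiki": "s8",
}


def section_for(dbname, default="s3"):
    """Resolve a wiki database name to its section."""
    return SECTIONS_BY_DB.get(dbname, default)
```

This is why a new wiki can already be created off s3 just by adding it to the map before creation, while changing the default itself would mean enumerating every existing s3 wiki.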
[13:11:58] 10DBA, 10Wikimedia-General-or-Unknown: Refactor duplicate db-*.php sectionsByDB - https://phabricator.wikimedia.org/T258586 (10Reedy)
[14:22:10] 10DBA, 10Operations, 10ops-eqiad: db1145 crashed - memory issues - https://phabricator.wikimedia.org/T258249 (10Jclark-ctr) @jcrespo maintenance is completed
[14:42:57] there is a bug in 0ee48ea6
[14:44:38] find the 1 difference: "Version 10.1.44-MariaDB, Uptime 953s, read_only: True, read_only: True, 8615.95 QPS, connection latency: 0.002024s, query latency: 0.000531s"
[14:46:30] 10DBA, 10Operations, 10ops-eqiad, 10Patch-For-Review: db1145 crashed - memory issues - https://phabricator.wikimedia.org/T258249 (10jcrespo) 05Open→03Resolved Everything looking good. Thanks, @Jclark-ctr!
[14:46:59] have to step out for a second; if nobody is faster than me, I will prepare a patch
[14:47:27] (when I come back)
[14:49:36] jynus: fixed at https://gerrit.wikimedia.org/r/c/operations/puppet/+/615506/
[14:49:49] well, pending merge :)
[17:17:02] yay "read_only: True, event_scheduler: True"
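The duplicated `read_only: True` at [14:44:38] (fixed to `event_scheduler: True` by [17:17:02]) looks like the classic copy-paste hazard of assembling a status string label by label. A small sketch of one way to make a repeated label structurally impossible, by rendering the line from a mapping (this is an illustration, not the actual code that was patched in Gerrit 615506):

```python
def status_line(fields):
    """Render an ordered mapping of status fields as a one-line summary,
    e.g. 'read_only: True, event_scheduler: True'. Because dict keys are
    unique, the same label cannot appear twice."""
    return ", ".join(f"{label}: {value}" for label, value in fields.items())
```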