[05:00:23] 10DBA, 10Operations, 10ops-eqiad: db1145 crashed - memory issues - https://phabricator.wikimedia.org/T258249 (10Marostegui) Thanks @Jclark-ctr - I am going to depool this host so it is ready for when you arrive at the DC.
[05:08:53] 10DBA, 10Operations, 10ops-eqiad: db1145 crashed - memory issues - https://phabricator.wikimedia.org/T258249 (10Marostegui) This being a backup source, it doesn't require depooling, but we need to check with @jcrespo when this host can be powered off.
[05:13:14] 10DBA, 10Patch-For-Review: Upgrade dbproxyXXXX to Buster - https://phabricator.wikimedia.org/T255408 (10Marostegui)
[05:23:08] 10DBA, 10Product-Infrastructure-Team-Backlog, 10Push-Notification-Service, 10Patch-For-Review, 10User-Marostegui: DBA review for Echo push notification subscription tables - https://phabricator.wikimedia.org/T246716 (10Marostegui) Thanks for reaching out, this got buried in a pile of emails :-) So I am not...
[05:57:51] 10DBA, 10Patch-For-Review: Relocate "old" s4 hosts - https://phabricator.wikimedia.org/T253217 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by marostegui on cumin1001.eqiad.wmnet for hosts: ` ['db1084.eqiad.wmnet'] ` The log can be found in `/var/log/wmf-auto-reimage/202007220557_marostegui_1700...
[05:58:27] 10DBA, 10Patch-For-Review: Relocate "old" s4 hosts - https://phabricator.wikimedia.org/T253217 (10Marostegui)
[06:16:52] 10DBA, 10Patch-For-Review: Relocate "old" s4 hosts - https://phabricator.wikimedia.org/T253217 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['db1084.eqiad.wmnet'] ` and were **ALL** successful.
[06:28:27] 10DBA: Upgrade m2 to Buster and Mariadb 10.4 - https://phabricator.wikimedia.org/T257540 (10Marostegui)
[07:39:17] 10DBA: Show transfer time once successfully completed - https://phabricator.wikimedia.org/T258559 (10Marostegui)
[07:39:31] 10DBA: Show transfer time once successfully completed - https://phabricator.wikimedia.org/T258559 (10Marostegui) p:05Triage→03Low
[07:45:32] 10DBA, 10Operations, 10ops-eqiad: db1145 crashed - memory issues - https://phabricator.wikimedia.org/T258249 (10jcrespo) Backups were taken from db1145 today and the host put down. Please ping here when maintenance is complete.
[07:45:55] 10DBA, 10wikitech.wikimedia.org: Move database for wikitech (labswiki) to a main cluster section - https://phabricator.wikimedia.org/T167973 (10Marostegui) I have done a quick check using mysqldump (and not mydumper), as the test was done from a 10.1 host to a 10.4 one: - Extracting the DB locally: 4 minutes...
[07:46:46] 10DBA: Show transfer time once successfully completed - https://phabricator.wikimedia.org/T258559 (10Privacybatm) Yeah, I think it is a good idea to have the transfer time. Thank you!
[07:47:04] 10DBA: Show transfer time once successfully completed - https://phabricator.wikimedia.org/T258559 (10Privacybatm) a:03Privacybatm
[07:51:53] 10DBA: Show transfer time once successfully completed - https://phabricator.wikimedia.org/T258559 (10jcrespo) Note that on the latest HEAD version, soon to be packaged, the date is shown at the start and completion of the backup, like this: ` root@cumin1001:~/transferpy$ mkdir ~/test; PYTHONPATH=. python3 trans...
[07:54:23] 10DBA: Show transfer time once successfully completed - https://phabricator.wikimedia.org/T258559 (10Marostegui) Yeah, it is something with very low priority (as stated at T258559#6325133).
Having both timestamps is indeed useful
[08:04:44] 10DBA, 10wikitech.wikimedia.org: Move database for wikitech (labswiki) to a main cluster section - https://phabricator.wikimedia.org/T167973 (10jcrespo) Not sure how large wikitech is, but 67 minutes looks to me like a long time, given I was able to import much larger wikis in less time in the past. Consider t...
[08:08:27] 10DBA, 10wikitech.wikimedia.org: Move database for wikitech (labswiki) to a main cluster section - https://phabricator.wikimedia.org/T167973 (10Marostegui) Yeah, the problem is at the `text` table from what I can see, so we end up with the usual issue: even though we have many threads, we end up just waiting f...
[08:12:50] 10DBA: Show transfer time once successfully completed - https://phabricator.wikimedia.org/T258559 (10Privacybatm) Sounds good to me. And yeah, I am currently concentrating on the Gerrit comments and documentation. Thank you!
[08:15:18] 10DBA: pl_namespace index on pagelinks is unique only in s8 - https://phabricator.wikimedia.org/T256685 (10Marostegui) 05Open→03Stalled Only the master is pending - stalling this until the DC switchover is done and eqiad is on standby.
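The feature requested in T258559 amounts to recording a wall-clock timestamp at the start and the end of the transfer and printing the difference. A minimal sketch of the idea, assuming a generic transfer callable (`timed_transfer` and the callback are hypothetical names, not transferpy's actual API):

```python
import time
from datetime import datetime, timezone


def timed_transfer(run_transfer):
    """Run a transfer callable, printing start/finish timestamps
    and the elapsed wall-clock time once it completes."""
    start = time.monotonic()
    print("started: ", datetime.now(timezone.utc).isoformat(timespec="seconds"))
    result = run_transfer()
    elapsed = time.monotonic() - start
    print("finished:", datetime.now(timezone.utc).isoformat(timespec="seconds"))
    print(f"transfer took {elapsed:.1f}s")
    return result
```

Since HEAD already prints dates at start and completion (per jcrespo's comment above), the only real addition is the subtraction itself.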
[08:15:27] 10DBA, 10Datasets-General-or-Unknown, 10Patch-For-Review, 10Sustainability (Incident Prevention), 10WorkType-NewFunctionality: Automate the check and fix of object, schema and data drifts between mediawiki HEAD, production masters and slaves - https://phabricator.wikimedia.org/T104459 (10Marostegui)
[08:39:39] 10DBA, 10Operations, 10User-Kormat: Refactor tendril+zarcillo roles/profiles - https://phabricator.wikimedia.org/T258566 (10Kormat)
[08:39:57] marostegui: ^ needs some input from you
[08:40:08] kormat: will check later, I am a bit overloaded atm
[08:40:19] * kormat depools marostegui
[08:40:23] (no problem :)
[08:48:19] 10DBA, 10Operations, 10Epic, 10User-Kormat: Use zarcillo as an authoritative inventory of db instances/roles - https://phabricator.wikimedia.org/T257814 (10Kormat)
[08:48:24] 10DBA, 10Operations, 10Patch-For-Review, 10User-Kormat: Set up replication for zarcillo - https://phabricator.wikimedia.org/T257816 (10Kormat) 05Open→03Resolved Monitoring is not properly in place, but going to track that in T258566.
[09:07:59] db2102 host is completely broken
[09:08:10] what happened?
[09:08:12] ERROR 1356 (HY000) at line 7: View 'mysql.user' references invalid table(s) or column(s) or function(s) or definer/invoker of view lack rights to use them
[09:08:30] ah that is the core test one
[09:08:31] root@db2102[(none)]> select user, host from mysql.user where user='root';
[09:08:32] ERROR 1449 (HY000): The user specified as a definer ('mariadb.sys'@'localhost') does not exist
[09:09:04] did I or any of you try to upgrade it/import to it?
[09:09:21] I don't think I have touched it in a long time
[09:09:47] it replicates from s1 codfw?
[09:10:37] yes it does
[09:11:39] I think I am going to put it down, I cannot fix it
[09:11:50] to rebuild it?
[09:12:03] to at least operate with it somehow
[09:12:09] yeah, makes sense
[09:12:18] no idea what could've happened with it
[09:12:19] the user table is broken, and I cannot handle accounts there
[09:12:33] maybe it was used to test an upgrade or something
[09:12:37] maybe yeah
[09:13:51] I am going to run a select on the mysql.user table on all hosts to confirm it is only that one
[09:14:11] I don't want to find out too late that the 10.1 -> 10.4 upgrade has an irreparable problem
[09:14:21] +100
[09:14:54] most likely explanation is some import of 10.1 data into a 10.4 host during a test
[09:15:01] given it is the test host
[09:17:23] a broken mysql.user? i swear it wasn't me 😅
[09:25:22] no other host presented that error, only db2102
[09:25:35] for now I will delete it and bring it up empty
[09:25:40] \o/
[09:25:41] pheeew
[09:26:41] will take care of it as it is one of ""my"" hosts, big quotes
[09:28:24] :)
[09:28:59] not urgent to provision it as it is going to be used to test backups anyway
[09:29:43] 10DBA, 10Operations, 10User-Kormat: Refactor tendril+zarcillo roles/profiles - https://phabricator.wikimedia.org/T258566 (10Marostegui) > Do we need to cover the case where db1115 is the active tendril node, but db2093 is the active zarcillo one? If so i'm not sure we can easily use mariadb::monitor_readonly...
[09:35:16] marostegui: ty for the feedback :)
[09:35:37] in the meantime i started looking at the prometheus mysql puppet code, and.... 😿
[09:37:37] kormat: https://jynus.com/gif/calm.gifv
[09:37:51] :D
[09:48:18] you've a gift for picking the best parts of our infra-code :-P
[09:48:38] the word you're looking for is "curse" :)
[09:51:55] lol
[10:06:01] 10DBA, 10wikitech.wikimedia.org: Move database for wikitech (labswiki) to a main cluster section - https://phabricator.wikimedia.org/T167973 (10jcrespo) > Yeah, the problem is at the text table from what I can see I was able to export the table in smaller chunks by setting the backups like this: `name=backup...
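The sweep at [09:13:51] boils down to running the same probe query on every host and collecting the ones that error out. A rough sketch of that loop, where `query_host` is a stand-in for whatever client wrapper is actually in use (an assumption, not an existing tool):

```python
# Probe that failed on db2102 with ERROR 1449 (missing 'mariadb.sys' definer).
PROBE = "SELECT user, host FROM mysql.user WHERE user='root'"


def find_broken_hosts(hosts, query_host):
    """Return (host, error) pairs for hosts where the probe fails.

    query_host(host, sql) is a hypothetical helper that raises on
    any MySQL error and returns rows otherwise.
    """
    broken = []
    for host in hosts:
        try:
            query_host(host, PROBE)
        except Exception as exc:
            broken.append((host, str(exc)))
    return broken
```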
[10:10:06] 10DBA, 10wikitech.wikimedia.org: Move database for wikitech (labswiki) to a main cluster section - https://phabricator.wikimedia.org/T167973 (10Marostegui) That is excellent news! Good point, db2093 is the tendril host, which has a very low innodb buffer pool because of the OOM we had, that's a huge limiting fa...
[10:11:24] marostegui: I would like to try to sell you the in-house tools, but I am not sure how. I know that ours are not as well documented/externally used
[10:11:39] 10DBA, 10Operations, 10User-Kormat: Refactor tendril+zarcillo roles/profiles - https://phabricator.wikimedia.org/T258566 (10Kormat) p:05Triage→03Medium a:03Kormat
[10:12:00] jynus: yeah, I just forgot we can use those for similar cases like wikitech
[10:12:08] Because I don't operate them often enough :(
[10:12:19] sure, my question is: what is it that I am failing at?
[10:12:34] flexibility? habit? not matching the use case?
[10:12:41] I don't think you are failing, it is simply that we don't have many use cases on a daily basis
[10:13:01] Like, we have moved wikis 2 times: once for wikidata, once for the split from s3 to s5, and now this
[10:13:44] so my point was awareness: "hey, you can use these tools and they can be tuned for performance"
[10:14:09] and also because I was worried about performance, because if labswiki took so long to import, others would take much longer
[10:14:21] yeah, I think db2093 was key there
[10:14:21] but it just needed some extra tuning
[10:14:51] the concurrency was needed too because mydumper generated a single multi-gigabyte file
[10:14:55] by default
[10:15:25] as part of the goal I can fine-tune performance to speed up load times
[10:15:54] *my backup goal
[10:17:13] I wonder how long it would take in production, with the normal replication traffic + reads
[10:17:40] multiply by 2 as a rule of thumb
[10:17:53] it all depends on desired import concurrency and ongoing load
[10:17:57] so still 1h RO would get us there
[10:18:00] is this for m5?
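The single multi-gigabyte file mentioned at [10:14:51] is what mydumper's `--rows` option addresses: it splits each table's dump into chunks of roughly that many rows, so the `--threads` workers can load a big table like `text` in parallel. A sketch that only builds the command line (the numbers are illustrative, not the values used for the wikitech test):

```python
def mydumper_cmd(database, threads=8, rows=500_000, outdir="dump"):
    """Build a mydumper invocation that chunks big tables by row count
    instead of dumping each table into one huge file."""
    return [
        "mydumper",
        "--database", database,
        "--threads", str(threads),
        "--rows", str(rows),        # split tables into ~rows-row chunk files
        "--outputdir", outdir,
    ]
```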
[10:18:07] sorry
[10:18:09] s5?
[10:18:25] marostegui: another factor might be that db2093 isn't using hw raid
[10:18:50] jynus: I was thinking about s6
[10:18:50] s5 load at off-peak times is almost none
[10:18:52] kormat: definitely
[10:18:58] same for s6
[10:19:13] jynus: I thought about s6 cause it only has 3 wikis (s5 doesn't have many more, but...)
[10:19:16] although if I am not mistaken s6
[10:19:27] has a different peak time due to ja + ru
[10:19:34] probably, yeah
[10:19:51] jawiki is very small though
[10:20:02] note that in writes, s5 was still behind, as it was the first to catch up very quickly on labsdb
[10:20:22] before s6?
[10:20:32] let me check
[10:21:06] again, it depends what you want to optimize for
[10:21:10] reads vs writes
[10:21:30] but look for yourself: https://grafana.wikimedia.org/d/000000273/mysql?panelId=6&fullscreen&from=1589174669941&to=1590538099087&var-server=labsdb1011&var-port=9104&orgId=1
[10:23:39] of course, wikitech is a special wiki, and I believe it is on group0
[10:24:13] so not having any opinion, just mentioning there are a few factors: read load, write load, deployment risk, "specialness"
[10:24:21] I thought s5 wasn't that far away from s6 in writes
[10:24:23] interesting
[10:24:37] it is labsdb replication, could be misleading
[10:24:49] but I think it matches my experience on backup time
[10:25:28] we could and probably should move more things from s3 there
[10:25:42] s3's bottleneck is in objects
[10:25:43] yeah, we need to get stuff out of s3 again
[10:25:44] yeah
[10:25:55] when did we move things out, 3 years ago?
[10:26:14] I wonder if we can make s5 the default for when new wikis are created actually
[10:26:21] either s5 or s6
[10:26:29] so at least we don't place new stuff on s3 anymore
[10:26:33] so the issue is that, as mediawiki works now
[10:26:45] the config is to put some things in other places
[10:26:46] yeah, but I mean in the wiki's initialization script
[10:26:50] but by default on s3
[10:26:55] to stop taking default => s3
[10:26:58] so if we move the default, we would have to list 800 wikis
[10:27:06] :(
[10:27:27] I thought if we can create wikis manually somewhere else then we only have to start listing those new ones
[10:27:33] doable, and probably there could be a better option, but I am talking about the current config setup/codebase
[10:27:47] yeah, but listing 800 wikis isn't realistic
[10:27:54] yeah, I think if you are going to create a wiki
[10:27:55] But there's lots of black magic in the creation scripts
[10:27:59] defining it manually is possible
[10:28:04] so maybe there's an option to specify where
[10:28:13] and not take the default values
[10:28:14] just add it to the non-default config and it should work
[10:28:27] I am talking about existing ones
[10:28:33] no, yeah, the existing ones are different
[10:28:40] I was thinking about the new ones
[10:28:41] https://phabricator.wikimedia.org/T184805
[10:28:44] so we stop creating them on s3
[10:29:01] wow 2018, I thought it was a lot older
[10:30:02] some refactoring could be done, but don't expect it soon
[10:30:19] there have been talks about making configuration more dynamic for a long time
[10:30:34] should I create a task for MW devs?
[10:31:08] so not against it, but I would wait for us to decide what to do
[10:31:24] what to do as in... where to place new ones?
[10:31:26] as in, what is the problem and the proposed solution
[10:31:40] the specific ask, I mean
[10:31:48] ah
[10:32:01] I don't think it is as urgent for the current movement
[10:32:02] I was thinking about a very simple thing: can we stop putting _new_ wikis on s3 and put them on s5 instead?
[10:32:25] and that'll develop into multiple refactorings XD
[10:32:28] I am liking the idea of a new section more
[10:32:46] a new section?
[10:32:54] because something tells me that if we just move the default from s3 to s5
[10:33:10] we will just move the problem and not really solve it :-D, ending up with 2 problems
[10:33:25] Yeah, but the first step is to be able to create new things in places other than s3
[10:33:31] sure
[10:33:40] we don't have to tell them where for now
[10:33:44] just hey, we need this feature
[10:33:56] like ./create_new_wiki --section sX
[10:34:11] like that is already possible
[10:34:23] is it?
[10:34:42] Then maybe the procedure for creating new wikis should include asking us which section
[10:34:49] they just need to create them on a non-default section
[10:35:00] yeah, something like that
[10:35:18] I will start an email thread with the usual new wiki creators I think
[10:35:19] I don't think there is really a reason not to do that, except being the default
[10:35:21] Rather than a task
[10:35:28] expose the problem
[10:35:29] Do you want to be CC'ed?
[10:35:35] sure, it interests me
[10:35:40] for backups
[10:35:41] cool
[10:35:46] I will start it later
[10:36:07] so what I mean is that what cannot be done now
[10:36:14] is changing the default wiki
[10:36:20] *default section
[10:36:21] yeah, that's fine
[10:36:34] but I think there is nothing obliging us to use the default for new wikis
[10:36:54] yeah, I just don't know if --section sX really exists or not when creating a new one
[10:36:59] we'll see what they say
[10:37:10] yeah, they would just have to change the eqiad.php file
[10:37:15] yep
[10:37:16] and the dblist
[10:37:36] so what I meant is that starting a conversation was ok
[10:37:58] but that we weren't sure how we really wanted it done
[10:38:08] other than "avoid overpopulating s3"
[10:38:15] yes
[10:39:29] last thing
[10:39:59] db2087 IPMI Sensor Status UNKNOWN 2020-07-22 10:19:21 23d 8h 42m 29s 3/3 ipmi_sdr_cache_open: /root/.freeipmi/sdr-cache/sdr-cache-db2087.localhost: internal IPMI error
[10:40:24] yep, I am aware, I tried a soft reset and it never worked
[10:40:34] can you create a task for it?
[10:40:36] ok
[10:40:48] I guess it needs on-site love
[10:41:04] just checking because I didn't search the history for that
[10:41:08] will report
[10:41:11] thanks!
[12:08:06] 10DBA, 10Upstream, 10cloud-services-team (Kanban): Reimage labsdb1011 to Buster and MariaDB 10.4 - https://phabricator.wikimedia.org/T249188 (10Marostegui) I wouldn't be surprised if labsdb1009 crashes sometime "soon". I was checking it for some bad performance lately and I have seen this in the logs: ` Jul...
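The "800 wikis" constraint discussed above comes from how the section mapping resolves a wiki: only non-default assignments are listed explicitly, and everything else falls through to s3. A toy Python model of that lookup (the mapping contents are illustrative; the real data lives in MediaWiki's db-*.php `sectionsByDB` config):

```python
# Only wikis on non-default sections are listed explicitly;
# any wiki missing from the map lands on the default, s3.
SECTIONS_BY_DB = {
    "enwiki": "s1",
    "wikidatawiki": "s8",
}


def section_for(dbname, default="s3"):
    """Resolve a wiki database name to its section."""
    return SECTIONS_BY_DB.get(dbname, default)
```

This is why a new wiki can already be created off s3 just by adding it to the map before creation, while changing the default itself would mean enumerating every existing s3 wiki.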
[13:11:58] 10DBA, 10Wikimedia-General-or-Unknown: Refactor duplicate db-*.php sectionsByDB - https://phabricator.wikimedia.org/T258586 (10Reedy)
[14:22:10] 10DBA, 10Operations, 10ops-eqiad: db1145 crashed - memory issues - https://phabricator.wikimedia.org/T258249 (10Jclark-ctr) @jcrespo maintenance is completed
[14:42:57] there is a bug in 0ee48ea6
[14:44:38] find the 1 difference: "Version 10.1.44-MariaDB, Uptime 953s, read_only: True, read_only: True, 8615.95 QPS, connection latency: 0.002024s, query latency: 0.000531s"
[14:46:30] 10DBA, 10Operations, 10ops-eqiad, 10Patch-For-Review: db1145 crashed - memory issues - https://phabricator.wikimedia.org/T258249 (10jcrespo) 05Open→03Resolved Everything looking good. Thanks, @Jclark-ctr!
[14:46:59] have to step out for a second; if nobody is faster than me, I will prepare a patch
[14:47:27] (when I come back)
[14:49:36] jynus: fixed at https://gerrit.wikimedia.org/r/c/operations/puppet/+/615506/
[14:49:49] well, pending merge :)
[17:17:02] yay "read_only: True, event_scheduler: True"
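The duplicated `read_only: True` at [14:44:38] (fixed to `event_scheduler: True` by [17:17:02]) looks like the classic copy-paste hazard of assembling a status string label by label. A small sketch of one way to make a repeated label structurally impossible, by rendering the line from a mapping (this is an illustration, not the actual code that was patched in Gerrit 615506):

```python
def status_line(fields):
    """Render an ordered mapping of status fields as a one-line summary,
    e.g. 'read_only: True, event_scheduler: True'. Because dict keys are
    unique, the same label cannot appear twice."""
    return ", ".join(f"{label}: {value}" for label, value in fields.items())
```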