[04:06:07] 10DBA, 10Data-Services, 10cloud-services-team (Kanban): Add systemd timer to run `maintain-meta_p` daily on all Wiki Replica servers - https://phabricator.wikimedia.org/T246948 (10bd808) Announcement of a non-ready database is a problem that can happen even under manual control. I think we would want to add...
[06:09:33] 10DBA, 10Cloud-Services, 10Core Platform Team, 10CPT Initiatives (Developer Portal): Prepare and check storage layer for dev.wikimedia.org - https://phabricator.wikimedia.org/T246946 (10Marostegui) Is this going to be a public wiki? Does it need to be replicated to labs?
[06:24:03] 10DBA: Install 1 buster+10.4 host per section - https://phabricator.wikimedia.org/T246604 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by marostegui on cumin1001.eqiad.wmnet for hosts: ` ['db1103.eqiad.wmnet'] ` The log can be found in `/var/log/wmf-auto-reimage/202003050623_marostegui_244782.log`.
[06:30:50] 10DBA: Compress new Wikibase tables - https://phabricator.wikimedia.org/T232446 (10Marostegui) Due to the load issues we are experiencing on Wikidata I am not compressing the remaining hosts till the load has decreased or till we've done the DC switchover and eqiad becomes passive for a few weeks. Whatever come...
[06:32:57] 10DBA, 10OTRS, 10Operations, 10Recommendation-API, 10Research: Upgrade and restart m2 primary database master (db1132) - https://phabricator.wikimedia.org/T246098 (10Marostegui) @leila would you be able to discuss this ticket with your team to try to find some suitable dates? If your service is resilien...
[06:41:10] 10DBA: Install 1 buster+10.4 host per section - https://phabricator.wikimedia.org/T246604 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['db1103.eqiad.wmnet'] ` and were **ALL** successful.
[07:33:49] 10DBA, 10Patch-For-Review: Install 1 buster+10.4 host per section - https://phabricator.wikimedia.org/T246604 (10Marostegui)
[07:35:09] 10DBA, 10Patch-For-Review: Test MariaDB 10.4 in production - https://phabricator.wikimedia.org/T242702 (10Marostegui)
[07:35:17] 10DBA, 10Patch-For-Review: Test MariaDB 10.4 in production - https://phabricator.wikimedia.org/T242702 (10Marostegui)
[07:48:06] o/ I upped the write rate of the terms migration and am keeping my eyes wide open
[07:48:36] I did some maths and we could get this done much sooner than in 1 month :)
[07:49:38] If codfw gets any lag I'll slow it down again :)
[08:02:25] ok
[08:02:40] addshore: much sooner meaning...?
[08:02:56] writing could be caught up by the end of next week
[08:04:04] so, some of the code improvements and fixes we made over the past weeks made the "work done" smaller, which means I have removed the sleep in between batches with no lag being created anywhere (I'm still watching though), and not leaving it run like this while I sleep :P
[08:04:34] oh wow!
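For reference, a minimal sketch of the lag-aware batching pattern described above: no fixed sleep between batches, but the loop pauses whenever a replica reports lag. The real terms migration is a MediaWiki (PHP) maintenance script that relies on MediaWiki's own lag handling (which only watches the active DC, hence the manual codfw watching mentioned here), so this Python is only illustrative; the table and column names (`old_term_store`, `row_id`), hosts and credentials are hypothetical.

```python
import time
import pymysql

BATCH_SIZE = 1000        # rows copied per iteration
MAX_LAG = 5              # seconds of replica lag we are willing to tolerate


def replica_lag(conn):
    """Seconds_Behind_Master for one replica, or None if it is not replicating."""
    with conn.cursor(pymysql.cursors.DictCursor) as cur:
        cur.execute("SHOW SLAVE STATUS")
        row = cur.fetchone()
        return row["Seconds_Behind_Master"] if row else None


def migrate_batch(master, last_id):
    """Copy one small batch from the old store and return the new high-water mark."""
    with master.cursor() as cur:
        cur.execute(
            "SELECT row_id FROM old_term_store WHERE row_id > %s "
            "ORDER BY row_id LIMIT %s",          # hypothetical table/columns
            (last_id, BATCH_SIZE),
        )
        ids = [r[0] for r in cur.fetchall()]
        # ... write the corresponding rows to the new store here ...
    master.commit()
    return ids[-1] if ids else last_id


master = pymysql.connect(host="master.example", user="migrate", password="***", database="wikidatawiki")
replicas = [pymysql.connect(host=h, user="watch", password="***") for h in ["replica1.example", "replica2.example"]]

last_id = 0
while True:
    # No fixed sleep between batches; only back off while replicas are lagging.
    while any((replica_lag(r) or 0) > MAX_LAG for r in replicas):
        time.sleep(10)
    new_id = migrate_batch(master, last_id)
    if new_id == last_id:
        break            # nothing left to migrate
    last_id = new_id
```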
[08:04:36] nice one
[08:04:52] I ran it like this for a couple of hours last night and figured that it averages doing 35000 items per 5 mins, ish
[08:05:02] This morning we have 21648633 + 2000000 to go
[08:05:11] let's be careful yeah
[08:05:16] we are sensitive to lag
[08:05:16] yupp
[08:05:40] the script already waits for eqiad lag, and if I see codfw lag I'll stop the script
[08:05:57] but the # of writes coming from the script is nothing compared with the changes in writes due to other things :)
[08:06:47] Also, it makes sense that the migration should slowly get faster, as although it's still migrating the same number of items, it has to write fewer rows and, as a part of that, fewer rows including "text" / strings etc
[08:07:40] so, once the migration is fully done, what's the next step?
[08:08:29] so, the maths at this rate is actually 2.5 days
[08:08:50] Let me write the steps out quickly
[08:09:56] 2.5 :o
[08:10:00] yup
[08:10:28] If mediawiki had a way of sleeping for lag in codfw, I expect we would have finished this weeks ago :P
[08:11:04] haha
[08:12:16] 1) Migrate items that have holes or were missed by the first migration script (2-4 days?)
[08:12:16] 2) Increase new write mode to just behind the point of new item creation and migrate those remaining 2-3 items (1 day)
[08:12:16] 3) Make new item creation write to the new store and slowly migrate the less than 1 million remaining items (0.2 of a day)
[08:12:16] 4) Get reads up from Q25 million to max (+60 million) 1 day
[08:12:16] 5) Stop writing to old stores
[08:12:17] 6) drop
[08:12:20] ^^ something like that
[08:12:59] Again, before we rewrote a bunch of stuff we ended up with this chicken and egg problem, trying to catch up with new item creation
[08:13:13] but I have a hunch that with the rewrites of bits that we have done that will be less of an issue
[08:13:29] so I'm going to try turning on writes for new item creation today, and that would again save us a day probably
[08:14:15] Oh thanks, that helps with understanding the path
[08:15:04] np :) and the moment we stop having to run this migration script will be a great victory for all
[08:15:56] yeah, that should decrease load too
[08:15:57] I guess?
[08:17:23] Yup
[08:17:37] and then when we stop the old store inserts and deletes, that'll be the next win
[08:18:12] yeah, just writing to the new store instead of both
[08:18:17] And looking at read ops, the old store is at 20k and the new one is at 100-150k now, so that shouldn't be a painful migration
[08:18:47] Probably still want to do the cache warming to keep the cpu load down however, which takes time
[08:19:37] yeah, but I think that's needed
[08:19:41] yup
[08:21:30] On a different note, regarding the deadlocks that we saw yesterday, they are indeed a symptom of the super high edit rate we were seeing, I'll backport a patch this morning that should make the situation better
[08:21:50] marostegui: we as a team had a general sql question too, is there generally any / much overhead with transactions?
[08:22:33] IE, is there anything bad about some super small transactions that do little but in a controlled way?
[08:22:53] My view is no, we should probably be doing that more, but would appreciate a note from you / pointing to some reading :)
[08:25:31] I don't get the question I think
[08:25:42] You mean small fast transactions?
[08:29:58] marostegui: yus
[08:30:34] It is very specific to the type of transaction, but as a general rule, if a transaction is fast, that's generally good
[08:30:45] ack! :)
[08:30:48] Whether it is big or small it "doesn't matter" as long as it is fast
[08:30:57] Generally smaller ones tend to be faster of course
[08:32:19] So this cleanup currently ends up doing: select from replica, lock master, select replica, lock master, select replica, lock master, commit
[08:32:27] I deployed the prometheus change and checked it solved the issue
[08:32:41] and I imagine what was happening during the edit spikes is that the replicas get slower, so the transactions and locks are longer
[08:32:49] jynus: thank you so much <3
[08:32:59] step 1 is I do that cycle but for 1 ID / starting cleanup row at a time
[08:33:08] step 2 is to have each part of that cycle in its own transaction
[08:33:13] will prepare the single instance at a later time, but that is not affected because that uses the old password still
[08:33:19] so each one would be replica select, lock, do work, commit
[08:33:23] jynus: sounds good, thank you :)
[08:33:29] even if the password no longer works- but needs checking
[08:33:42] thank you for your patience
[08:34:01] I got subscribed to the debian bug, I am curious to see how that goes
[08:34:20] shout if you see any instance prometheus data gone
[08:34:26] haha will do
[08:35:05] it should alarm on systemd if it fails anyway
[09:36:12] o/ I was just looking at some of my wikidata related scaling documents again. Is there currently any automated way of watching how close to the auto increment limits we are on tables?
[09:36:39] Nope
[09:36:46] As in automated, no
[09:36:56] Shall I make one? Should only take me 5 mins? :)
[09:37:04] sending the data to graphite dily
[09:37:08] *DAILY
[09:37:10] sure thing!
[09:37:14] great!
[09:37:41] We normally query it from sys
[09:37:42] I'll start with just commons and wikidata, and track things with an auto increment usage of 25% and higher?
[09:56:50] meh, maybe the user that I normally run these scripts from can't get to INFORMATION_SCHEMA
[09:57:22] yeah, we have this https://phabricator.wikimedia.org/T195578
[10:00:19] My patch was https://gerrit.wikimedia.org/r/#/c/analytics/wmde/scripts/+/577202/ , but that uses the user include ::passwords::mysql::research
[10:04:43] I think we'd need to wait for the above ticket to be resolved
[10:06:20] please don't do a cron for that on your own- those queries can cause unwanted contention
[10:07:09] (on production)
[10:07:19] yeah, we should maybe query codfw
[10:07:24] or whatever the passive dc is
[10:07:38] I just want me or manuel to be aware of it
[10:07:56] e.g. I am literally now writing the logic to get that from the dumps
[10:08:03] ooh
[10:08:14] with no mysql extra load
[10:08:36] primarily for es incrementals, but happens to work on all tables
[10:08:40] do you plan to add it to zarcillo maybe?
[10:08:48] so we can have reports or something
[10:08:53] that is the thing I wanted to show you, do you have some minutes before the meeting?
[10:09:00] yeah!
[10:09:02] (wanted yesterday)
[10:09:11] now or before midday CET?
[10:09:21] 11:30?
[10:09:29] ok
[10:09:43] is that ok?
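Picking up the 08:32-08:33 exchange above (splitting the cleanup cycle into one short transaction per ID), a rough sketch of that "step 2" shape. The table `cleanup_queue` and its columns are hypothetical stand-ins, and the real code is PHP inside MediaWiki/Wikibase rather than this Python.

```python
import pymysql


def cleanup_one(replica, master, row_id):
    """Replica select first, then a short lock/do-work/commit cycle on the master."""
    # Cheap read on the replica, outside any master transaction.
    with replica.cursor() as cur:
        cur.execute("SELECT 1 FROM cleanup_queue WHERE id = %s", (row_id,))  # hypothetical table
        if cur.fetchone() is None:
            return

    # Short master transaction: lock a single row, do the work, commit immediately,
    # so locks are held for milliseconds instead of spanning the whole cycle.
    master.begin()
    try:
        with master.cursor() as cur:
            cur.execute("SELECT id FROM cleanup_queue WHERE id = %s FOR UPDATE", (row_id,))
            cur.execute("DELETE FROM cleanup_queue WHERE id = %s", (row_id,))
        master.commit()
    except Exception:
        master.rollback()
        raise
```

The point made above is what this shape is protecting: each transaction stays fast regardless of size, whereas a cycle that reads the replica while holding master locks keeps those locks open for the whole round trip when replicas slow down during edit spikes.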
[10:10:33] yes
[10:10:41] it shouldn't take more than 10 minutes
[10:11:16] cool
[10:11:25] addshore: in particular querying information_schema can be very taxing on current versions of mariadb
[10:11:45] I just tested the query on a codfw host, it took around 20 minutes for sys
[10:11:51] it got better recently, at least on mysql, by adding indexes and other stuff
[10:11:57] yup, those queries would be running on one of the analytics replicas
[10:12:06] but not all the tables are there
[10:12:07] yeah, I think the sys one uses I_S too
[10:12:10] ie: text
[10:12:14] (same place all of my other slow taxing queries run :)
[10:12:32] and there are still the security considerations- probably not for wikidata
[10:12:40] but yes for small wikis
[10:13:16] addshore: https://phabricator.wikimedia.org/T63111#5482762
[10:13:56] ack!
[10:14:11] there is already a task^ for communication
[10:14:22] so just to be clear, we are not telling you to not do anything
[10:14:32] for me on dbstore1005 (analytics) it returns for wikidata in 0.02s
[10:14:37] 10DBA: Install 1 buster+10.4 host per section - https://phabricator.wikimedia.org/T246604 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by marostegui on cumin1001.eqiad.wmnet for hosts: ` ['db2109.codfw.wmnet'] ` The log can be found in `/var/log/wmf-auto-reimage/202003051014_marostegui_24656.log`.
[10:14:50] we are just asking to have communication because we may be solving it in a better way with more privs
[10:15:04] Yupp! I'll link my patch to that ticket too
[10:15:04] addshore: Ah, I thought you were using labs
[10:15:06] sorry
[10:15:07] or to avoid unnecessary taxing queries
[10:16:02] addshore: example, dbstore1004 just got delayed
[10:16:04] are you running it there?
[10:16:28] nope
[10:17:22] Something similar running from stat1007
[10:17:31] to dbstore1004:3313
[10:17:33] (s3)
[10:17:58] * addshore looks at what s3 is
[10:18:22] 800 wikis XD
[10:18:27] :P not me ;)
[10:31:59] 10DBA: Install 1 buster+10.4 host per section - https://phabricator.wikimedia.org/T246604 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by marostegui on cumin1001.eqiad.wmnet for hosts: ` ['db2109.codfw.wmnet'] ` The log can be found in `/var/log/wmf-auto-reimage/202003051031_marostegui_27019.log`.
[10:45:08] Going to increase reads from 25 million to 30 million in the next hour (cache warming already nearly done). I don't really expect any change in load for this config change
[10:55:21] 10DBA: Install 1 buster+10.4 host per section - https://phabricator.wikimedia.org/T246604 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['db2109.codfw.wmnet'] ` and were **ALL** successful.
[11:10:05] config done, there was a spike in load, waiting for it to subside
[11:10:38] again mainly on db1126
[11:11:25] meeting
[12:24:26] 10DBA, 10WMDE-Analytics-Engineering, 10Wikidata, 10Wikidata-Campsite, and 3 others: [Story] Monitor size of some Wikidata database tables - https://phabricator.wikimedia.org/T68025 (10jcrespo) > perhaps it might be an idea for more people to have access to this zarcillo thing? Stress on "The reasons why t...
[12:54:08] * addshore continues cache warming, but in smaller batches and with some sleeps to keep the load as low as possible
[12:56:59] scrap that, I think load is just a bit too high right now, so I'll not warm the caches and come back to it later
[13:24:34] I might try and do this cache warming over EU night time next time .... seems that is when load is least
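Going back to the auto_increment monitoring idea from 09:36-10:15: a sketch of what a daily check against an analytics replica could look like, pushing anything above the 25% threshold to graphite over the plaintext protocol. It assumes the stock sys schema view `schema_auto_increment_columns` is installed (as "We normally query it from sys" suggests); the graphite host, metric prefix and credentials are placeholders.

```python
import socket
import time
import pymysql

THRESHOLD = 0.25   # only report auto_increment columns past 25% of their range

# The stock sys schema does the information_schema join for us; on hosts without
# sys, the same data comes from joining information_schema.TABLES and COLUMNS.
QUERY = """
SELECT table_schema, table_name, column_name, auto_increment_ratio
FROM sys.schema_auto_increment_columns
WHERE auto_increment_ratio >= %s
"""


def report(db_host="dbstore1005.eqiad.wmnet", graphite=("graphite.example", 2003)):
    conn = pymysql.connect(host=db_host, user="research", password="***",
                           cursorclass=pymysql.cursors.DictCursor)
    with conn.cursor() as cur:
        cur.execute(QUERY, (THRESHOLD,))
        rows = cur.fetchall()
    now = int(time.time())
    lines = [
        f"daily.mysql.auto_increment.{r['table_schema']}.{r['table_name']}.{r['column_name']}"
        f" {float(r['auto_increment_ratio']):.4f} {now}\n"
        for r in rows
    ]
    if lines:
        with socket.create_connection(graphite) as sock:
            sock.sendall("".join(lines).encode())


if __name__ == "__main__":
    report()
```

As the discussion above stresses, this is exactly the kind of query that needs coordinating with the DBAs first: run it against an analytics or passive-DC replica (or derive the numbers from the dumps, as jynus describes), never as an ad-hoc cron against production masters.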
[13:51:26] 10DBA: Install 1 buster+10.4 host per section - https://phabricator.wikimedia.org/T246604 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by marostegui on cumin1001.eqiad.wmnet for hosts: ` ['db1078.eqiad.wmnet'] ` The log can be found in `/var/log/wmf-auto-reimage/202003051351_marostegui_63201.log`.
[14:54:38] 10DBA: Install 1 buster+10.4 host per section - https://phabricator.wikimedia.org/T246604 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['db1078.eqiad.wmnet'] ` and were **ALL** successful.
[15:50:44] 10DBA, 10Operations, 10Goal, 10Patch-For-Review: Implement logic to be able to perform full and incremental backups of ES hosts - https://phabricator.wikimedia.org/T244884 (10jcrespo) {F31667036}
[16:19:03] marostegui: issues with tendril?
[16:19:27] no updates in the last 1h
[16:20:21] I see, I had it opened on the tree page
[16:20:26] let's check
[16:20:31] maybe db1115 needs its usual restart?
[16:20:37] db1115 seems unusually idle
[16:20:52] maybe
[16:21:04] let me give it a go
[16:21:41] the graphs show a big drop all of a sudden
[16:21:47] memory utilization dropped 1 h ago
[16:22:00] let me restart it
[16:22:13] ok, should I downtime it?
[16:23:01] I did :)
[16:23:13] ok, standing by, letting you drive it
[16:23:27] yep, I am doing an upgrade now
[16:23:29] won't reboot
[16:23:50] prometheus hosts may alert with a systemd error now
[16:24:03] (just commenting in case it happens)
[16:24:07] starting
[16:25:46] I had to manually create /run/mysqld
[16:25:48] mysql started
[16:25:50] checking events
[16:26:03] event scheduler running
[16:26:13] oh
[16:26:59] marostegui: that may be the bug of the old package
[16:27:06] that deletes it on stop
[16:27:19] but the new one shouldn't have the issue
[16:27:27] weird :-/
[16:27:43] spike on queries
[16:27:44] let's see
[16:27:54] Looks like we are back
[16:28:07] yep
[16:28:16] I will try to see if there is something in the logs
[16:29:37] 10DBA: Investigate possible memory leak on db1115 - https://phabricator.wikimedia.org/T231769 (10Marostegui) MySQL had to be restarted as tendril stopped updating its data. Memory activity {F31667108}
[16:30:10] I did a quick check before restarting and didn't see anything obvious there for the events
[16:32:08] ha
[16:32:09] Mar 5 15:15:17 db1115 puppet-agent[14295]: (/Stage[main]/Exim4/Service[exim4]) Could not evaluate: Cannot allocate memory - fork(2)
[16:32:09] Mar 5 15:15:17 db1115 puppet-agent[14295]: (/Stage[main]/Ulogd/Service[ulogd2]) Could not evaluate: Cannot allocate memory - fork(2)
[16:32:18] the usual
[16:32:26] mmm
[16:32:33] the graphs didn't show much
[16:32:39] Mar 5 15:15:15 db1115 puppet-agent[14295]: (/Stage[main]/Smart/Cron[export_smart_data_dump]) Could not evaluate: Failed to read root's records when prefetching them. Reason: Could not read crontab for root: Cannot allocate memory - fork(2)
[16:32:42] it is full of that
[16:33:15] 10DBA: Investigate possible memory leak on db1115 - https://phabricator.wikimedia.org/T231769 (10Marostegui) At the time of the memory and connections drop, the logs show the following, so it could be the usual memory issues: ` Mar 5 15:15:17 db1115 puppet-agent[14295]: (/Stage[main]/Exim4/Service[exim4]) Could not evaluate:...
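The "no updates in the last 1h" symptom above is the kind of thing a small freshness probe could catch before the graphs flatline. A sketch, assuming monitoring credentials on db1115: the event_scheduler check is a standard MariaDB global variable (tendril's collection runs through events, as the "checking events" step above shows), while `tendril.global_status_log` and its `stamp` column are placeholder names for whichever tendril table is written most frequently.

```python
import pymysql

STALE_AFTER = 3600   # seconds without new rows before we consider tendril stalled


def check_tendril(host="db1115.eqiad.wmnet"):
    conn = pymysql.connect(host=host, user="monitor", password="***",
                           cursorclass=pymysql.cursors.DictCursor)
    with conn.cursor() as cur:
        # Tendril's data collection is driven by MariaDB events, so the scheduler must be ON.
        cur.execute("SELECT @@GLOBAL.event_scheduler AS scheduler")
        scheduler_on = cur.fetchone()["scheduler"] == "ON"

        # Placeholder freshness check: seconds since the newest collected row.
        cur.execute(
            "SELECT TIMESTAMPDIFF(SECOND, MAX(stamp), NOW()) AS age "
            "FROM tendril.global_status_log"
        )
        age = cur.fetchone()["age"]

    if not scheduler_on or age is None or age > STALE_AFTER:
        print(f"tendril on {host} looks stalled (event_scheduler={scheduler_on}, last write {age}s ago)")


if __name__ == "__main__":
    check_tendril()
```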
[16:33:43] I would like to have some better dashboard for memory and swapping
[16:33:59] not only for db1115, but in general
[16:34:35] I may create one for dba-focus if I cannot find one
[16:34:57] or add more host metrics to the mysql one
[17:06:33] * addshore is off for the day
[17:07:16] * addshore is going to keep watching db load even while away
[18:32:40] 10DBA, 10Cloud-Services, 10Core Platform Team, 10CPT Initiatives (Developer Portal): Prepare and check storage layer for dev.wikimedia.org - https://phabricator.wikimedia.org/T246946 (10bd808) 05Open→03Stalled Marking as stalled per T246945#5943263 in the parent task