[05:07:50] 10DBA, 10Operations: Decommission db1061-db1073 - https://phabricator.wikimedia.org/T217396 (10Marostegui)
[05:09:35] 10DBA, 10Operations, 10Patch-For-Review: Decommission db1073.eqiad.wmnet - https://phabricator.wikimedia.org/T231892 (10Marostegui)
[05:39:43] 10DBA, 10Patch-For-Review: Productionize dbproxy101[2-7].eqiad.wmnet and dbproxy200[1-4] - https://phabricator.wikimedia.org/T202367 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by marostegui on cumin1001.eqiad.wmnet for hosts: ` dbproxy1017.eqiad.wmnet ` The log can be found in `/var/log/wmf-au...
[06:12:37] 10DBA: Productionize dbproxy101[2-7].eqiad.wmnet and dbproxy200[1-4] - https://phabricator.wikimedia.org/T202367 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['dbproxy1017.eqiad.wmnet'] ` and were **ALL** successful.
[07:32:52] 10DBA, 10Operations: Migrate MySQLs to use ROW-based replication - https://phabricator.wikimedia.org/T109179 (10jcrespo) @TK-999 Please note that this is an infrastructure limitation, which means it is mostly related to Wikimedia servers, not mediawiki. As I see it, our main limitations are: * Compatibility f...
[07:58:42] 10DBA, 10Operations: Decommission dbproxy1005.eqiad.wmnet - https://phabricator.wikimedia.org/T231967 (10Marostegui)
[07:59:33] 10DBA, 10Operations: Decommission dbproxy1005.eqiad.wmnet - https://phabricator.wikimedia.org/T231967 (10Marostegui) p:05Triage→03Normal Nothing uses dbproxy1005, but I am going to stop haproxy and leave it stopped for some hours before fully decommissioning this host just in case.
[07:59:48] 10DBA, 10Operations: Decommission dbproxy1005.eqiad.wmnet - https://phabricator.wikimedia.org/T231967 (10Marostegui)
[07:59:50] 10DBA: Remove grants for the old dbproxy hosts from the misc databases - https://phabricator.wikimedia.org/T231280 (10Marostegui)
[08:02:51] 10DBA, 10Performance-Team, 10Wikimedia-Rdbms, 10Patch-For-Review: SHOW SLAVE STATUS as a health check should have a low timeout - https://phabricator.wikimedia.org/T129093 (10jcrespo) BTW, I consider this a smaller issue once replication control was migrated to heartbeat- I am guessing some show slave stat...
[08:48:42] 10DBA, 10Operations: Predictive failures on disk S.M.A.R.T. status - https://phabricator.wikimedia.org/T208323 (10Marostegui)
[09:06:26] 10DBA, 10Operations, 10Patch-For-Review, 10User-notice: Switchover m1 primary master: db1063 to db1135: Tuesday 10th September at 16:00 UTC - https://phabricator.wikimedia.org/T231403 (10Trizek-WMF) Added for Tech News, since Etherpad service is quite used, and 16:00 UTC is a common meetings hour.
[09:08:13] 10DBA, 10Operations, 10Patch-For-Review, 10User-notice: Switchover m1 primary master: db1063 to db1135: Tuesday 10th September at 16:00 UTC - https://phabricator.wikimedia.org/T231403 (10Marostegui) >>! In T231403#5464235, @Trizek-WMF wrote: > Added for Tech News, since Etherpad service is quite used, and...
[09:17:49] 10DBA, 10MediaWiki-File-management, 10MW-1.34-notes (1.34.0-wmf.20; 2019-08-27), 10Patch-For-Review, 10Performance-Team (Radar): Drop filejournal table from WMF - https://phabricator.wikimedia.org/T51195 (10Marostegui)
[10:43:42] 10DBA: Investigate possible memory leak on db1115 - https://phabricator.wikimedia.org/T231769 (10Marostegui) These were the figures before I stopped mysql over the last 2 (I gather data every 8 hours, so 3 times a day) days - we can see MySQL memory growing every day: ` 29136 mysql 20 0 0.107t 0.063t 25...
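The db1115 memory-leak comments above and below describe sampling mysqld's memory use every 8 hours (three snapshots a day) and watching the resident set grow between samples. The ticket does not say how the figures are collected; as an illustration only, a minimal Python sketch of such a sampler could look like the following, where the script itself and the log path are hypothetical, not something taken from the ticket:

```python
#!/usr/bin/env python3
# Hypothetical sketch: periodically record the mysqld resident-set size so that
# day-over-day growth like the one reported on db1115 (T231769) can be compared.
# The actual collection method used on the host is not described in the ticket.
import datetime
import subprocess

LOGFILE = "/var/tmp/mysqld-rss.log"  # assumed path, not from the ticket

def snapshot_mysqld_rss():
    # `ps -C mysqld -o rss=` prints the resident set size in KiB, one line per process.
    out = subprocess.run(
        ["ps", "-C", "mysqld", "-o", "rss="],
        capture_output=True, text=True, check=True,
    ).stdout.split()
    rss_gib = sum(int(kib) for kib in out) / (1024 * 1024)
    now = datetime.datetime.now(datetime.timezone.utc).isoformat()
    line = f"{now} mysqld_rss_gib={rss_gib:.3f}"
    with open(LOGFILE, "a") as fh:
        fh.write(line + "\n")
    return line

if __name__ == "__main__":
    # Run from cron every 8 hours to get the three samples a day mentioned above.
    print(snapshot_mysqld_rss())
```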
[12:28:53] 10DBA: Investigate possible memory leak on db1115 - https://phabricator.wikimedia.org/T231769 (10Marostegui) 12:27:04 ` 1630 mysql 20 0 67.313g 0.046t 24936 S 366.7 37.8 394:46.56 mysqld ` So almost 20GB more in less than 2 hours after enabling `event_scheduler`
[14:57:55] marostegui: jynus https://grafana.wikimedia.org/d/000000278/mysql-aggregated?orgId=1&var-dc=eqiad%20prometheus%2Fops&var-group=core&var-shard=s8&var-role=All that's me running the script on up to Q2m, it will probably take two days and it's on a screen on mwmaint1002 in case you need to stop it
[14:58:10] for the rest, I will start making a patch on puppet
[14:58:49] I guess there is replication control, right?
[15:00:21] jynus: yes, plus 2 seconds on each 250-item batch for secondary replicas to catch up
[15:00:30] let me know if 2 seconds is not enough
[15:00:31] cool
[15:00:40] then the only worry would be it affecting performance
[15:00:52] but I don't know if there are any probes specifically for s8
[15:01:24] right now it's writes mostly, but we will get there when we are flipping the switch on reads
[15:01:30] (if I understand you correctly)
[15:01:39] oh, I was thinking of the batch job
[15:01:51] as writes have some impact on the overall performance
[15:01:57] that is ok
[15:02:11] just it would be nice to have, if it is significant
[15:02:20] and a good metric for that
[15:02:45] not only for that - I just don't know if there are already uncached wikidata.org metrics
[15:03:17] look at: https://grafana.wikimedia.org/d/000000431/webpagereplay?refresh=15m&orgId=1
[15:03:27] there are for several wikis, but not for wikidata, I think
[15:03:52] well, group1, but you get the idea
[15:05:54] yeah
[15:06:10] nothing actionable, just speaking my mind
[15:08:18] I may create a ticket about commons and wikidata being on group1; not sure if it makes sense in all cases
[20:24:41] 10DBA, 10Performance-Team, 10Wikimedia-Rdbms, 10Patch-For-Review: SHOW SLAVE STATUS as a health check should have a low timeout - https://phabricator.wikimedia.org/T129093 (10Krinkle) p:05Triage→03Normal
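The 14:57-15:01 exchange above describes the maintenance run on s8: the script writes in batches of 250 items and then pauses 2 seconds so that secondary replicas can catch up before the next batch. The actual script running on mwmaint1002 is not shown in this log; the following Python sketch only illustrates that batching pattern, with `items` and `apply_batch` as hypothetical placeholders:

```python
# Illustrative sketch of the batching pattern described in the log: write items
# in batches of 250 and sleep 2 seconds after each batch so secondary replicas
# can catch up. Not the actual maintenance script; apply_batch/items are
# hypothetical placeholders.
import time

BATCH_SIZE = 250
PAUSE_SECONDS = 2  # raise this if replicas still lag, per the discussion above

def run_in_batches(items, apply_batch):
    batch = []
    for item in items:
        batch.append(item)
        if len(batch) == BATCH_SIZE:
            apply_batch(batch)          # perform the writes for this batch
            time.sleep(PAUSE_SECONDS)   # give secondary replicas time to catch up
            batch = []
    if batch:
        apply_batch(batch)              # flush the final, smaller batch
```

A fixed sleep is the simplest form of replication control; a production job would more likely poll replica lag and wait until it drops below a threshold, but the log only mentions the fixed 2-second pause.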