[06:49:42] DBA, Operations, ops-codfw, Patch-For-Review: db2060 not accessible - https://phabricator.wikimedia.org/T156161#3004471 (Marostegui) Thanks @jcrespo! Unfortunately the last thing I heard from @Papaul was that the HP technician didn't show up (he was still waiting for him) so I asked him to turn t...
[07:03:04] DBA, Labs, Patch-For-Review: Add and sanitize s2, s4, s5, s6 and s7 to sanitarium2 and new labsdb hosts - https://phabricator.wikimedia.org/T153743#3004482 (Marostegui) labsdb1011 has now commonswiki imported. I have restarted a couple of times MySQL there without any issues. Also I have done a SELEC...
[07:29:43] DBA, Patch-For-Review: Remove partitions from metawiki.pagelinks in s7 - https://phabricator.wikimedia.org/T153300#3004574 (Marostegui) Pending hosts: db2029 - codfw master (can be done if we downtime all the slaves in codfw) db1045 - eqiad master (cannot be done until we switch DCs)
[08:14:51] DBA, Operations: Move db1073 to B3 - https://phabricator.wikimedia.org/T156126#3004681 (Marostegui) Hey @Cmjohnson let's move db1073 to B3 on Wednesday if you'd have time?
[08:53:15] DBA, Wikidata: Repeated reports of wikidatawiki (s5) API going read only - https://phabricator.wikimedia.org/T123867#3004724 (Ladsgroup) I did [[https://www.wikidata.org/w/index.php?title=Wikidata:Database_reports/birthday_today&diff=443270676&oldid=443121815 | this edit]] that might help in the short te...
[10:29:13] DBA, Patch-For-Review: Deploy gtid_domain_id flag in our mysql hosts - https://phabricator.wikimedia.org/T149418#3004834 (Marostegui) I was doing some tests to refresh my mind and play with the new versions (10.1.21) as this is a critical change etc and I have filled this: https://jira.mariadb.org/browse...
[10:32:17] jynus ^ :_(
[11:09:15] marostegui: jynus: if you get some spare time, I could use a final review of mariadb puppet module change https://gerrit.wikimedia.org/r/#/c/331329/
[11:09:25] that adds puppet-lint / puppet parser validate etc on the module so CI runs it for you automagically
[11:09:50] though Jaime mentioned that touches pt-heartbeat, it might be invasive / restart all dos :/
[11:10:06] all dos -> all dbs
[11:10:50] the review is ok
[11:10:57] what I need to do is to deploy it
[11:11:01] and that takes some time
[11:12:21] yeah I seen your comment about that patches touching pt-heartbeat
[11:12:28] though I am not sure how it has that side effect :-D
[11:12:55] pt-heartbeat controls the replication lag
[11:13:15] if for any reason it crashes, all slaves will be lagged and not used anymore
[11:13:23] so things will go into read only
[11:13:26] or worse
[11:13:31] saturate the master
[11:14:16] I need to disable puppet on the eqiad masters
[11:14:23] and deploy it everywhere else
[11:14:33] then on the masters, one at a time
[11:14:37] and that would be triggered by my change on the nrpe monitor "mariadb_slave_sql_lag_${name}" ?
[11:14:44] no
[11:14:59] by the change on pt-heartbeat
[11:15:51] where is that change?
[11:16:35] but the patch does seem to deal with pt-heartbeat, unless that is some magic side effect
[11:16:55] marostegui: https://gerrit.wikimedia.org/r/#/c/331329/
[11:16:59] that one what can cause is to generate 500 pages to all ops
[11:17:15] No, sorry, the actual pt-heartbeat change that jynus mentions :)
[11:18:03] this is a linting change and can wait
[11:18:16] because there are other more important changes pending
[11:49:18] DBA, Operations, Patch-For-Review: Restart pending mysql hosts with old TLS cert - https://phabricator.wikimedia.org/T152188#3005005 (ops-monitoring-bot) Script wmf_auto_reimage was launched by jynus on neodymium.eqiad.wmnet for hosts: ``` ['db2046.codfw.wmnet'] ``` The log can be found in `/var/log/...
[11:55:26] I think db1067 went a bit slow at first, but it seem ok now
[12:17:15] DBA, Operations, Patch-For-Review: Restart pending mysql hosts with old TLS cert - https://phabricator.wikimedia.org/T152188#3005059 (ops-monitoring-bot) Completed auto-reimage of hosts: ``` ['db2046.codfw.wmnet'] ``` and were **ALL** successful.
[13:40:19] DBA, Labs, Patch-For-Review: Add and sanitize s2, s4, s5, s6 and s7 to sanitarium2 and new labsdb hosts - https://phabricator.wikimedia.org/T153743#3005360 (Marostegui) All the tables have been imported to labsdb1010. Stopping and starting mysql worked fine, doing a SELECT over all the tables worked...
[14:38:02] I am going to start logging queries on db1089
[14:38:09] for a performance report
[14:38:21] ok!
[14:38:49] I will log to /tmp to prevent filling up the data partition
[14:40:01] you going to log all the queries?
[14:40:29] no, 1/100
[14:40:31] of all
[14:40:48] see methodology: https://wikitech.wikimedia.org/wiki/MariaDB/query_performance
[14:41:28] ah nice and useful!!
[14:41:39] this is only for a day
[14:41:53] aaron asked for an updated report and I think it is due
[14:43:07] we will have better monitoring when we setup history for the p_s
[14:43:29] yeah, it will be a good improvement
[15:07:15] Re: db1089- I will leave it there for 24 hours, it should take 4 GB
[15:07:28] there is 31 GB free
[15:07:33] ok
[15:07:37] is that s1?
[15:07:44] yes, and if it fills up /
[15:07:49] it shouldn't affect /srv
[15:08:00] alerts enabled?
[15:08:05] but just a heads up if something goes wrong
[15:08:08] yeah
[15:08:09] yes, as usual
[15:08:12] ok :)
[15:08:29] probably only look at warnings during your morning
[15:08:37] haha "your morning"
[15:09:11] SET GLOBAL slow_query_log := OFF;
[15:09:16] in case
[15:09:41] I've also put a calendar reminder for myself
[15:10:03] one last thing
[15:10:08] dbstore1001
[15:10:38] we plan for a reimage next between the end of this week and next one?
[15:10:54] if dbstore2001 catches up...
[15:10:58] after the latest backups are done and synced to dbstore2001
[15:11:00] ah, true
[15:11:06] I forget about that
[15:11:13] let me see
[15:11:23] Seconds_Behind_Master: 167124
[15:11:33] the rest of the shards are up to date now
[15:11:40] I checked in the morning and it was 180k delayed
[15:11:42] so it is going fine
[15:11:55] we should have the derivative of the seconds behind master there, too
[15:13:03] 30K seconds in 12 hours
[15:13:32] that is 3 days left
[15:13:41] maybe for the following week
[15:13:47] I checked the BBU by the way and it is fine XD
[15:13:55] We definitely need to look into parallel replication for this host
[15:13:59] but let's make a plan with chris
[15:14:45] https://phabricator.wikimedia.org/T153768 we can discuss there
[15:14:49] with chris I mean
[15:14:53] +1
[15:18:39] https://video.fosdem.org/2017/H.1309/
[15:18:55] oh, they are there already!!
[15:18:58] some of them
[15:19:02] yes, only some
[15:38:30] DBA, Labs, Labs-Infrastructure, Operations: labsdb1005 (mysql) maintenance for reimage - https://phabricator.wikimedia.org/T157358#3005848 (jcrespo)
[15:40:27] DBA, Labs, Labs-Infrastructure, Operations, User-notice: labsdb1005 (mysql) maintenance for reimage - https://phabricator.wikimedia.org/T157358#3002516 (jcrespo) Adding user notice. In theory, no end users should be affected, but if some tools have not been properly programmed to reconnect, t...
[16:04:41] DBA, Operations, ops-codfw, Patch-For-Review: db2060 not accessible - https://phabricator.wikimedia.org/T156161#3005962 (Papaul) Unfortunately the HP tech didn't show up. I'm following up with HP on the case.
[16:09:58] DBA, Operations, ops-codfw, Patch-For-Review: db2060 not accessible - https://phabricator.wikimedia.org/T156161#3005970 (Marostegui) Thanks @Papaul - I will leave the server depooled so we can shut it down anytime once you've arranged another day and time
[16:56:12] DBA, Operations, ops-codfw, Patch-For-Review: db2060 not accessible - https://phabricator.wikimedia.org/T156161#3006101 (Papaul) The service was canceled, according to HP they couldn't get in touch with me; which is not true because I didn't receive any calls or emails from them. Another service...
[17:22:09] DBA, Operations, ops-codfw, Patch-For-Review: db2060 not accessible - https://phabricator.wikimedia.org/T156161#3006166 (Marostegui) Thanks for the heads up! I will get the server ready by Thursday then! Thank you!
[17:25:49] DBA, Operations, Patch-For-Review: Restart pending mysql hosts with old TLS cert - https://phabricator.wikimedia.org/T152188#3006176 (ops-monitoring-bot) Script wmf_auto_reimage was launched by jynus on neodymium.eqiad.wmnet for hosts: ``` ['db2036.codfw.wmnet'] ``` The log can be found in `/var/log/...
[17:47:58] DBA, Operations, Patch-For-Review: Restart pending mysql hosts with old TLS cert - https://phabricator.wikimedia.org/T152188#3006248 (ops-monitoring-bot) Completed auto-reimage of hosts: ``` ['db2036.codfw.wmnet'] ``` Of which those **FAILED**: ``` set(['db2036.codfw.wmnet']) ```
[18:25:37] inconclusively, our latest mariadb package could be 30% faster- https://grafana.wikimedia.org/dashboard/db/mysql?panelId=40&fullscreen&from=1486481720848&to=1486491243304&var-dc=eqiad%20prometheus%2Fops&var-server=db1022
[21:35:12] DBA, Operations, Patch-For-Review: Restart pending mysql hosts with old TLS cert - https://phabricator.wikimedia.org/T152188#3007368 (ops-monitoring-bot) Script wmf_auto_reimage was launched by jynus on neodymium.eqiad.wmnet for hosts: ``` ['db2043.codfw.wmnet'] ``` The log can be found in `/var/log/...
[22:04:07] DBA, Operations, Patch-For-Review: Restart pending mysql hosts with old TLS cert - https://phabricator.wikimedia.org/T152188#3007460 (ops-monitoring-bot) Completed auto-reimage of hosts: ``` ['db2043.codfw.wmnet'] ``` and were **ALL** successful.
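
Note on the dbstore1001 catch-up estimate from 15:11 to 15:13: the "3 days left" figure can be reproduced by dividing the current Seconds_Behind_Master by the rate at which the lag has been shrinking. The sketch below is only an illustration of that arithmetic. The function name and the assumption that the lag keeps shrinking linearly are not from the channel; the inputs are the figures quoted there (167124 seconds behind, about 30K seconds recovered in 12 hours).

```python
def estimate_catch_up_hours(current_lag_s, lag_recovered_s, elapsed_hours):
    """Linearly extrapolate the hours left until replication lag reaches zero.

    Hypothetical helper: assumes the replica keeps recovering lag at the
    same average rate it showed over the observed window.
    """
    if lag_recovered_s <= 0 or elapsed_hours <= 0:
        raise ValueError("lag must be shrinking over a positive time window")
    recovery_rate_s_per_hour = lag_recovered_s / elapsed_hours
    return current_lag_s / recovery_rate_s_per_hour


# Figures quoted in the channel: Seconds_Behind_Master of 167124,
# with roughly 30K seconds of lag recovered over 12 hours.
hours = estimate_catch_up_hours(167124, 30000, 12)
print(f"{hours:.1f} hours (~{hours / 24:.1f} days)")
```

With those inputs the estimate comes out just under three days, in line with the figure given in the conversation. A real estimate would vary with write load on the master, which is why a graph of the derivative of the lag was suggested.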