[05:05:50] 10DBA, 10cloud-services-team (Kanban): Reimage labsdb1011 to Buster and MariaDB 10.4 - https://phabricator.wikimedia.org/T249188 (10Marostegui) >>! In T249188#6143576, @Stashbot wrote: > {nav icon=file, name=Mentioned in SAL (#wikimedia-operations), href=https://sal.toolforge.org/log/r6knJnIBj_Bg1xd3wtnZ} [202... [05:08:34] 10DBA, 10cloud-services-team (Kanban): Reimage labsdb1011 to Buster and MariaDB 10.4 - https://phabricator.wikimedia.org/T249188 (10Marostegui) labsdb1011 finished catching up past Saturday and has been replicating fine for 5 days. I am copying its content to backup1002 so we have a binary copy just in case. O... [05:13:12] 10DBA, 10AbuseFilter: Find a good way to run the updateVarDumps script on large wikis - https://phabricator.wikimedia.org/T252696 (10Marostegui) Actually taking a day isn't that bad if the script is that safe. 500 rows can probably be changed to 1000, but other than that, if we want to find the perfect batch,... [05:19:51] 10DBA, 10DC-Ops, 10Operations, 10ops-codfw, 10Patch-For-Review: (Need By: 31st May) rack/setup/install db213[6-9] and db2140 - https://phabricator.wikimedia.org/T251639 (10Marostegui) This looks good on all the hosts: ` ----- OUTPUT of 'df -hT /srv' ----- Filesystem Type Size Used Avail Use%... [05:21:13] 10DBA: Productionize db213[6-9] and db2140 - https://phabricator.wikimedia.org/T252985 (10Marostegui) [05:21:30] 10DBA: Productionize db213[6-9] and db2140 - https://phabricator.wikimedia.org/T252985 (10Marostegui) p:05Triage→03Medium [05:28:55] 10DBA: Degraded performance on parsercache with buster/mariadb upgrade - https://phabricator.wikimedia.org/T252761 (10Marostegui) >>! In T252761#6139643, @jcrespo wrote: > Disk latency metrics are indeed higher, but client sql latency is significantly lower. Makes sense that the SQL client isn't seeing anythin... 
[05:46:20] 10DBA, 10Epic: Upgrade WMF database-and-backup-related hosts to buster - https://phabricator.wikimedia.org/T250666 (10Marostegui) Maybe this has something to do: The `sql_mode` by default is different on 10.4 and includes the `STRICT_TRANS_TABLES` ` -sql_mode NO_AUTO_CREATE_USER,NO_ENGINE_SUBSTITUTION +sql_m... [06:08:58] 10DBA, 10CheckUser, 10Growth-Team, 10Thanks, 10User-DannyS712: Thanks activities should be logged by CheckUser - https://phabricator.wikimedia.org/T252226 (10Marostegui) From what I can see these are the number of rows on `cu_changes`: enwiki: 18305102 commons: 24673959 wikidata: 67864207 So the numbers... [06:18:54] 10DBA, 10CheckUser, 10Growth-Team, 10Thanks, 10User-DannyS712: Thanks activities should be logged by CheckUser - https://phabricator.wikimedia.org/T252226 (10DannyS712) >>! In T252226#6143629, @Marostegui wrote: > From what I can see these are the number of rows on `cu_changes`: > enwiki: 18305102 > comm... [06:44:20] 10DBA, 10Patch-For-Review, 10Upstream: Events set to SLAVESIDE_DISABLED when upgrading from 10.1 to 10.4 - https://phabricator.wikimedia.org/T247728 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by marostegui on cumin1001.eqiad.wmnet for hosts: ` ['db2088.codfw.wmnet'] ` The log can be found in... [07:06:42] 10DBA, 10Patch-For-Review, 10Upstream: Events set to SLAVESIDE_DISABLED when upgrading from 10.1 to 10.4 - https://phabricator.wikimedia.org/T247728 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['db2088.codfw.wmnet'] ` and were **ALL** successful. [07:47:46] 10DBA, 10Upstream: Events set to SLAVESIDE_DISABLED when upgrading from 10.1 to 10.4 - https://phabricator.wikimedia.org/T247728 (10Marostegui) 05Open→03Resolved a:03Marostegui This is confirmed fixed on 10.4.13. I have reimaged db2088 and installed 10.4.13 and after starting mysql: ` mysql:root@localho... 
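[Editor's note] The sql_mode difference flagged at [05:46:20] can be checked mechanically. A minimal sketch (plain Python, not part of any WMF tooling; the "10.4" string below only reflects the STRICT_TRANS_TABLES addition called out in the comment, while the real 10.4 default contains further flags):

```python
# Illustrative helper: diff two comma-separated sql_mode strings to see
# which flags an upgrade turned on or off.
def diff_sql_mode(old, new):
    old_flags = {f.strip() for f in old.split(",") if f.strip()}
    new_flags = {f.strip() for f in new.split(",") if f.strip()}
    return new_flags - old_flags, old_flags - new_flags

added, removed = diff_sql_mode(
    "NO_AUTO_CREATE_USER,NO_ENGINE_SUBSTITUTION",                    # 10.1 config
    "STRICT_TRANS_TABLES,NO_AUTO_CREATE_USER,NO_ENGINE_SUBSTITUTION",  # 10.4 (partial)
)
print(sorted(added))  # flags newly enforced after the upgrade
```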
[07:47:48] 10DBA, 10Epic: Upgrade WMF database-and-backup-related hosts to buster - https://phabricator.wikimedia.org/T250666 (10Marostegui) [07:50:29] 10DBA, 10Operations: In-place conversion from LVM to normal partition - https://phabricator.wikimedia.org/T252195 (10Marostegui) p:05Triage→03Medium [07:58:26] 10DBA, 10Operations, 10Patch-For-Review: Cleanup or remove mysql puppet module; repurpose mariadb module to cover misc use cases - https://phabricator.wikimedia.org/T162070 (10jcrespo) [08:45:06] 10DBA, 10Operations, 10Patch-For-Review: Cleanup or remove mysql puppet module; repurpose mariadb module to cover misc use cases - https://phabricator.wikimedia.org/T162070 (10Dzahn) [09:12:05] 10Blocked-on-schema-change, 10DBA, 10Anti-Harassment (The Letter Song), 10Patch-For-Review: ipb_address_unique has an extra column in production but not in the code (WAS: ipb_address_unique has an extra column in the code but not in production) - https://phabricator.wikimedia.org/T251188 (10Marostegui) 05... [09:16:01] 10DBA: Productionize db213[6-9] and db2140 - https://phabricator.wikimedia.org/T252985 (10Marostegui) [09:25:40] 10DBA, 10AbuseFilter: Find a good way to run the updateVarDumps script on large wikis - https://phabricator.wikimedia.org/T252696 (10Daimona) >>! In T252696#6143593, @Marostegui wrote: > Actually taking a day isn't that bad if the script is that safe. > 500 rows can probably be changed to 1000, but other than... [09:27:47] 10DBA, 10AbuseFilter: Find a good way to run the updateVarDumps script on large wikis - https://phabricator.wikimedia.org/T252696 (10Marostegui) >>! In T252696#6144133, @Daimona wrote: >>>! In T252696#6143593, @Marostegui wrote: >> Actually taking a day isn't that bad if the script is that safe. >> 500 rows c... [09:32:18] 10DBA: Productionize db213[6-9] and db2140 - https://phabricator.wikimedia.org/T252985 (10Marostegui) a:03Kormat [10:01:48] kormat: meeting? 
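[Editor's note] The updateVarDumps batch-size discussion in T252696 ([05:13:12], [09:25:40]) is the standard batched-maintenance pattern. A hypothetical sketch, with `fetch_batch` and `process_rows` standing in for the real script's logic (the 500-vs-1000 row sizes are the ones debated in the task):

```python
import time

# Process rows in fixed-size batches with an optional pause between batches
# so replication can keep up. Not the real updateVarDumps code.
def run_batched(fetch_batch, process_rows, batch_size=500, pause=0.0):
    total = 0
    while True:
        rows = fetch_batch(batch_size)
        if not rows:
            break
        process_rows(rows)
        total += len(rows)
        time.sleep(pause)  # let replicas catch up between batches
    return total
```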
[10:54:18] 10DBA: Degraded performance on parsercache with buster/mariadb upgrade - https://phabricator.wikimedia.org/T252761 (10Kormat) TODO: investigate if changing the version and config of prometheus-mysql-exporter has had a major impact on performance. [11:00:36] hey [11:00:43] in a toolforge daemon I see [11:01:03] /usr/local/sbin/maintain-dbusers[17979]: Could not connect to labsdb1011.eqiad.wmnet due to (2003, "Can't connect to MySQL server on 'labsdb1011.eqiad.wmnet' ([Errno 111] Connection refused)"). Skipping. [11:01:08] is this known? [11:01:11] yep [11:01:17] labsdb1011 is down for maintenance [11:01:22] it is depooled [11:01:27] shall we use other server meanwhile? [11:01:36] that host has been down for a long time [11:01:44] As it has been crashing [11:01:49] Mysql was up, but had no data [11:02:01] what is trying to use it? [11:02:13] it will be in a few hours I reckon [11:02:24] maintain-dbusers is the daemon that maintains toolforge accounts [11:03:10] arturo: How does it work if the host doesn't have data? (which was the case last week, mysql was up, but the data was being re-imported) [11:03:32] I'm not sure, I'm currently trying to understand https://wikitech.wikimedia.org/wiki/Portal:Data_Services/Admin/Labstore#maintain-dbusers [11:04:05] arturo: it must be hardcoded, cause that host was out for weeks already unfortunately [11:04:11] and only got mysql up around 10 days ago [11:04:15] (but without all the data) [11:04:27] ok [11:04:40] so maybe it was up but not even working? [11:04:41] what is the host that we should be using instead? 
[11:04:51] you can use labsdb1010 [11:04:57] ok [11:05:03] labsdb1011 will be up in a 2-3h [11:06:06] in modules/profile/manifests/wmcs/nfs/maintain_dbusers.pp there are several labsdb servers referenced [11:07:49] I think I will drop labsdb1011 from the list there [11:08:03] that list generates a config file that is later loaded by our daemon [11:08:16] but once it is back, we should revert it [11:08:21] sure [11:08:29] as I believe that populates all the dev accounts [11:10:24] arturo: this is the task to track this maintenance https://phabricator.wikimedia.org/T249188 [11:10:33] so maybe you want to get subscribed so you know when labsdb1011 is back [11:10:34] ack [11:11:50] wait, the user refreshed the page and now it works [11:12:01] I was preparing this https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/597040 [11:12:49] i guess it fails over to another host or something? [11:12:51] don't know [11:17:59] 10DBA: Degraded performance on parsercache with buster/mariadb upgrade - https://phabricator.wikimedia.org/T252761 (10Marostegui) I have done a quick check and looks like this is what prometheus executes: ` 158534 Connect prometheus@localhost as anonymous on 158535 Connect prometheus@localhost as anonymous on... [11:18:58] yes, reading the code again (and the error message) it just skips a db host if it can't connect [11:25:35] 10DBA: Degraded performance on parsercache with buster/mariadb upgrade - https://phabricator.wikimedia.org/T252761 (10Marostegui) Another interesting test to see what happens to the disk latency and query latency would be on pc2007: - stop slave - leave it for a few hours (pt-heartbeat will be running unless we... 
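[Editor's note] The maintain-dbusers behaviour worked out above ("it just skips a db host if it can't connect") looks roughly like this. A simplified stand-in (the real daemon is deployed from Puppet); `connect` is any callable that raises OSError, e.g. ConnectionRefusedError, when a replica host is down:

```python
import logging

# Iterate over the configured labsdb hosts, skipping any that refuse
# connections, matching the log line "Could not connect to ... Skipping."
def maintain_on_hosts(hosts, connect, apply_accounts):
    reached = []
    for host in hosts:
        try:
            conn = connect(host)
        except OSError as exc:
            logging.warning("Could not connect to %s due to %s. Skipping.", host, exc)
            continue
        apply_accounts(conn)
        reached.append(host)
    return reached
```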
[11:52:13] 10DBA: Upgrade and restart s2 and s8 (wikidatawiki) primary database masters: Tue 19th May - https://phabricator.wikimedia.org/T251981 (10Marostegui) a:03Marostegui [11:53:24] 10DBA: Upgrade and restart s2 and s8 (wikidatawiki) primary database masters: Tue 19th May - https://phabricator.wikimedia.org/T251981 (10Marostegui) package updated on db1122 and db1109. [12:13:42] 10DBA: Degraded performance on parsercache with buster/mariadb upgrade - https://phabricator.wikimedia.org/T252761 (10Kormat) I don't know if it's relevant, but mariadb changed the default value of `lock_wait_timeout` in 10.2: https://mariadb.com/docs/reference/es/system-variables/lock_wait_timeout/ pc2007 (bus... [12:21:07] 10DBA: Upgrade and restart s2 and s8 (wikidatawiki) primary database masters: Tue 19th May - https://phabricator.wikimedia.org/T251981 (10Marostegui) Procedure: Maintenance day: - Silence all hosts in s2 and s8 - Set read only on s2 and s8: ` dbctl --scope eqiad section s2 ro "Maintenance on s2 and s8 T251981"... [12:24:10] 10DBA: Degraded performance on parsercache with buster/mariadb upgrade - https://phabricator.wikimedia.org/T252761 (10Marostegui) >>! In T252761#6144456, @Kormat wrote: > I don't know if it's relevant, but mariadb changed the default value of `lock_wait_timeout` in 10.2: https://mariadb.com/docs/reference/es/sys... [13:48:29] 10DBA, 10AbuseFilter, 10Patch-For-Review: Find a good way to run the updateVarDumps script on large wikis - https://phabricator.wikimedia.org/T252696 (10Daimona) >>! In T252696#6144137, @Marostegui wrote: > If you can add it to: https://wikitech.wikimedia.org/wiki/Deployments that's enough. But also give us... [13:48:37] 10DBA, 10Patch-For-Review, 10cloud-services-team (Kanban): Reimage labsdb1011 to Buster and MariaDB 10.4 - https://phabricator.wikimedia.org/T249188 (10Marostegui) The copy to backup1002 is done. I have started mysql again on labsdb1011, once it has caught up, I will pool it back. Fingers crossed. 
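[Editor's note] The T251981 maintenance-day steps quoted at [12:21:07] lend themselves to a dry-run planner. A hypothetical sketch that only assumes the `dbctl ... ro` command shape shown in the task; the rest of the procedure (commit, master restart, setting rw again) is deliberately left out:

```python
# Build the dbctl read-only command strings for each section to be silenced.
def plan_read_only(sections, scope="eqiad", reason="Maintenance on s2 and s8 T251981"):
    return [f'dbctl --scope {scope} section {s} ro "{reason}"' for s in sections]

for cmd in plan_read_only(["s2", "s8"]):
    print(cmd)
```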
[13:49:23] 10DBA, 10AbuseFilter, 10Patch-For-Review: Find a good way to run the updateVarDumps script on large wikis - https://phabricator.wikimedia.org/T252696 (10Marostegui) Thank you @Daimona :-) [13:56:10] I will depool several read-only es1XXX hosts when I start the backups, probably leaving them running during the afternoon [13:57:37] sounds good [13:59:08] I think I will run the ro ones during the night and do the others, which take much less time, tomorrow morning [15:12:07] jynus: it looks like transfer.py does an md5sum of all source files before it starts copying. could something like rsync (that does the checksumming and copying in a single process) be used instead? [15:15:24] 10DBA, 10cloud-services-team (Kanban): Reimage labsdb1011 to Buster and MariaDB 10.4 - https://phabricator.wikimedia.org/T249188 (10Marostegui) labsdb1011 has caught up. I am waiting for @bstorm to drop the wb_terms views (T251598) before repooling this host back finally! If the host replicates well during th... [15:26:09] kormat: the student is implementing that, you can do --no-checksum and --no-encrypt within the dc to speed up the process [15:28:18] "all source files"? what are you cloning? [15:28:22] mm. might be worth killing it and restarting it with those options [15:28:34] kormat: yeah, definitely [15:28:34] jynus: i'm copying /srv/sqldata from db2073 to db2136 [15:29:25] I see, transfer.py is optimized to do a decompress from dbprov, cloning will take some time [15:29:59] kormat: this is what I normally do: ./transfer.py --no-encrypt --no-checksum source_host dest_host [15:31:54] the destination dir handling of transfer.py is... "interesting" :P [15:32:02] why? 
[15:34:05] my first attempt: transfer.py db2073.codfw.wmnet:/srv/sqldata/ db2136.codfw.wmnet:/srv/sqldata/ [15:34:13] which created /srv/sqldata/sqldata [15:34:27] yes, it's modeled after cp -R [15:34:35] i was kinda assuming it would honor the rsync convention of the trailing `/` on the source determining how the copy works [15:34:41] it is a copy command, not an rsync [15:35:14] then i tried with `/srv` as the destination, but it errored out, saying the dest directory already exists [15:35:24] then i tried with /srv/clone to have it make a new dir and i'll rename it after, [15:35:33] but it won't proceed if the remote dir doesn't exist [15:35:40] yes, that is intentional [15:35:42] so eventually i gave up and rm'd `/srv/sqldata` on the destination [15:35:58] you back up the original sqldata to sqldata.bak [15:36:02] then copy it [15:36:11] it prevents accidents [15:36:57] if you don't like it, I can always assign it to you :-D https://phabricator.wikimedia.org/T156462 [15:38:09] rsync over ssh was very slow and doesn't allow streaming backups [15:40:02] what do you mean by streaming backups? [15:40:10] xtrabackup [15:40:39] we need to use a unix pipe [15:41:00] how do checksums work in that mode? 
[15:41:16] they don't [15:41:21] ah ok [15:41:45] but we may be able to tee them into something, as I said, our GSoC student is working on that [15:41:55] 10DBA, 10Data-Services, 10Operations: Replace labsdb (wikireplicas) dbproxies: dbproxy1010 and dbproxy1011 - https://phabricator.wikimedia.org/T231520 (10Cmjohnson) [15:41:57] 10DBA: Remove grants for the old dbproxy hosts from the misc databases - https://phabricator.wikimedia.org/T231280 (10Cmjohnson) [15:43:09] * kormat nods [15:43:10] rsync was very slow for hundreds of thousands of files [15:43:45] just to be clear, we recommend you use that, but if you propose (and implement) something better, I am totally open to that :-D [15:44:33] transfer.py was just a productionization of nc oneliners [15:44:42] understood :) [15:44:54] so it could saturate the link [15:45:03] other options based on ssh didn't [15:45:27] and worked for the functionalities we needed (streaming, port handling, service handling, etc.) [15:47:39] to be fair, gzip should have integrated crc checking [15:55:16] So normally, the way to provision a host (in an emergency) is to use the pre-compressed dbprov tarballs [15:55:55] if you think of it that way (1 source file, precompressed), transfer.py (--decompress) will make more sense [15:56:01] jynus: for 3 of the hosts i'll be doing that method [15:56:20] I believe manuel wanted to clone the host 1:1 as it is not an emergency [15:56:33] or let me not think on behalf of it [15:56:37] *him [15:56:44] So I proposed to try each method [15:56:47] I can see a db -> db clone makes sense [15:56:54] A normal cloning db -> db for the master and the sanitarium master [15:56:56] to avoid corruption [15:57:03] and then using xtrabackup from the backups for the other hosts [15:57:12] so he can try both ways [15:57:14] sure, a variety of methods :-D [15:57:25] correct [15:57:27] what I mean is that I would agree to copy from different source hosts [15:57:39] precisely to prevent the same issue as labsdb1001 on 
production [15:57:51] different sources, less option for corruption [15:58:06] but in an emergency, the pre-packaged backups is the fastest way [15:58:20] yes, that's why I wanted him to try both ways [15:58:30] db to db and then dbprov to db [15:58:35] cool [15:58:55] as I said above, I suggested master and sanitarium master to do a clone, the others, a dbprov population [16:01:21] yeah, last thing we want is a single source of corruption infecting everything else [16:01:57] also rsync won't have integrated hot mysql backup :-D [17:29:40] 10DBA: kill_job function in remote execution module of transfer framework does not close the ports instantly - https://phabricator.wikimedia.org/T252950 (10jcrespo) > The remote execution module of this framework has kill_job function and it does not kill/close the port used by the netcat instantly. This ticket... [17:39:36] 10DBA, 10Operations, 10ops-codfw: db2097 memory errors leading to crash - https://phabricator.wikimedia.org/T252492 (10Papaul) Case Reference ID: 5347351645 Status: Case is generated and in Progress Product: HPE ProLiant DL360 Gen10 8SFF Configure-to-order Server Product number: 867959-B21 Serial number: Su... [17:46:28] 10DBA: kill_job function in remote execution module of transfer framework does not close the ports instantly - https://phabricator.wikimedia.org/T252950 (10Privacybatm) Actually I'm talking about the kill_job function. I will give you a context to understand my question correctly. ` def _kill_use_ports(self, h... [18:00:33] 10DBA, 10Operations, 10ops-codfw: db2097 memory errors leading to crash - https://phabricator.wikimedia.org/T252492 (10Papaul) Will be receiving the DIMM tomorrow. The HP engineer recommended to update the firmware after the DIMM has been replaced. 
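[Editor's note] The checksum discussion above (transfer.py doing a separate md5sum pass at [15:12:07], versus "tee them into something" at [15:41:45]) boils down to hashing inline while the data streams. A minimal sketch of that idea, not the actual transfer.py or GSoC code:

```python
import hashlib
import io

# Compute the md5 as chunks flow from reader to writer, so no second pass
# over the source files is needed.
def stream_with_md5(read_chunk, write_chunk):
    digest = hashlib.md5()
    while True:
        chunk = read_chunk()
        if not chunk:
            break
        digest.update(chunk)  # checksum accumulates as bytes flow
        write_chunk(chunk)
    return digest.hexdigest()

# Demo with in-memory buffers standing in for the network pipe:
src, dst = io.BytesIO(b"some table data"), io.BytesIO()
checksum = stream_with_md5(lambda: src.read(4096), dst.write)
```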
[18:04:43] 10DBA: kill_job function in remote execution module of transfer framework does not close the ports instantly - https://phabricator.wikimedia.org/T252950 (10jcrespo) > The reason behind that is, the remote_executor.kill_job() does not close the port instantly (takes more than 30s in my machine). Interesting, I w... [18:14:42] 10DBA: kill_job function in remote execution module of transfer framework does not close the ports instantly - https://phabricator.wikimedia.org/T252950 (10jcrespo) I've checked and both a manual "kill -9" and a "kill -15" should make the port available almost instantly, so probably it is not that. Maybe kill_j... [18:27:42] 10DBA, 10Operations, 10ops-codfw: db2097 memory errors leading to crash - https://phabricator.wikimedia.org/T252492 (10Papaul) Hello Papaul, Greetings from Hewlett Packard Enterprise! As discussed, as per the AHS logs: Memory Failure is seen on Proc 2 DIMM 4. Uncorrectable Machine Check exception is s... [19:27:18] 10DBA, 10cloud-services-team (Kanban): Reimage labsdb1011 to Buster and MariaDB 10.4 - https://phabricator.wikimedia.org/T249188 (10bd808) >>! In T251598#6145566, @Stashbot wrote: > {nav icon=file, name=Mentioned in SAL (#wikimedia-operations), href=https://sal.toolforge.org/log/SaytKHIBj_Bg1xd3eZOe} [2020-05-...
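[Editor's note] jcrespo's observation in T252950 (a manual kill -9 or kill -15 releases the port almost instantly) can be demonstrated end to end. A hypothetical sketch, not the transfer framework's kill_job: a child process holds a listening port, the parent SIGTERMs it, then polls until it can bind the port itself. Port 47654 is an arbitrary choice for the demo.

```python
import errno
import multiprocessing
import socket
import time

def wait_port_free(port, timeout=5.0):
    """Poll until we can bind the port ourselves, i.e. it has been released."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
        s.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
        try:
            s.bind(("127.0.0.1", port))
            return True
        except OSError as e:
            if e.errno != errno.EADDRINUSE:
                raise
            time.sleep(0.05)
        finally:
            s.close()
    return False

def hold_port(port):
    """Child process: grab the port and sit on it, like a netcat listener."""
    s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    s.bind(("127.0.0.1", port))
    s.listen(1)
    time.sleep(60)

def demo(port=47654):
    p = multiprocessing.Process(target=hold_port, args=(port,))
    p.start()
    time.sleep(0.5)   # give the child time to bind
    p.terminate()     # SIGTERM, like a manual `kill -15`
    p.join()
    return wait_port_free(port)

if __name__ == "__main__":
    print(demo())  # prints True when the port is released promptly after the kill
```

If this returns quickly, the 30-second delay seen in the task is more likely kill_job not actually terminating the netcat process than the kernel holding the port.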