[05:05:50] 10DBA, 10cloud-services-team (Kanban): Reimage labsdb1011 to Buster and MariaDB 10.4 - https://phabricator.wikimedia.org/T249188 (10Marostegui) >>! In T249188#6143576, @Stashbot wrote: > {nav icon=file, name=Mentioned in SAL (#wikimedia-operations), href=https://sal.toolforge.org/log/r6knJnIBj_Bg1xd3wtnZ} [202... [05:08:34] 10DBA, 10cloud-services-team (Kanban): Reimage labsdb1011 to Buster and MariaDB 10.4 - https://phabricator.wikimedia.org/T249188 (10Marostegui) labsdb1011 finished catching up past Saturday and has been replicating fine for 5 days. I am copying its content to backup1002 so we have a binary copy just in case. O... [05:13:12] 10DBA, 10AbuseFilter: Find a good way to run the updateVarDumps script on large wikis - https://phabricator.wikimedia.org/T252696 (10Marostegui) Actually taking a day isn't that bad if the script is that safe. 500 rows can probably be changed to 1000, but other than that, if we want to find the perfect batch,... [05:19:51] 10DBA, 10DC-Ops, 10Operations, 10ops-codfw, 10Patch-For-Review: (Need By: 31st May) rack/setup/install db213[6-9] and db2140 - https://phabricator.wikimedia.org/T251639 (10Marostegui) This looks good on all the hosts: ` ----- OUTPUT of 'df -hT /srv' ----- Filesystem Type Size Used Avail Use%... [05:21:13] 10DBA: Productionize db213[6-9] and db2140 - https://phabricator.wikimedia.org/T252985 (10Marostegui) [05:21:30] 10DBA: Productionize db213[6-9] and db2140 - https://phabricator.wikimedia.org/T252985 (10Marostegui) p:05Triage→03Medium [05:28:55] 10DBA: Degraded performance on parsercache with buster/mariadb upgrade - https://phabricator.wikimedia.org/T252761 (10Marostegui) >>! In T252761#6139643, @jcrespo wrote: > Disk latency metrics are indeed higher, but client sql latency is significantly lower. Makes sense that the SQL client isn't seeing anythin... 
[05:46:20] 10DBA, 10Epic: Upgrade WMF database-and-backup-related hosts to buster - https://phabricator.wikimedia.org/T250666 (10Marostegui) Maybe this has something to do: The `sql_mode` by default is different on 10.4 and includes the `STRICT_TRANS_TABLES` ` -sql_mode NO_AUTO_CREATE_USER,NO_ENGINE_SUBSTITUTION +sql_m... [06:08:58] 10DBA, 10CheckUser, 10Growth-Team, 10Thanks, 10User-DannyS712: Thanks activities should be logged by CheckUser - https://phabricator.wikimedia.org/T252226 (10Marostegui) From what I can see these are the number of rows on `cu_changes`: enwiki: 18305102 commons: 24673959 wikidata: 67864207 So the numbers... [06:18:54] 10DBA, 10CheckUser, 10Growth-Team, 10Thanks, 10User-DannyS712: Thanks activities should be logged by CheckUser - https://phabricator.wikimedia.org/T252226 (10DannyS712) >>! In T252226#6143629, @Marostegui wrote: > From what I can see these are the number of rows on `cu_changes`: > enwiki: 18305102 > comm... [06:44:20] 10DBA, 10Patch-For-Review, 10Upstream: Events set to SLAVESIDE_DISABLED when upgrading from 10.1 to 10.4 - https://phabricator.wikimedia.org/T247728 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by marostegui on cumin1001.eqiad.wmnet for hosts: ` ['db2088.codfw.wmnet'] ` The log can be found in... [07:06:42] 10DBA, 10Patch-For-Review, 10Upstream: Events set to SLAVESIDE_DISABLED when upgrading from 10.1 to 10.4 - https://phabricator.wikimedia.org/T247728 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['db2088.codfw.wmnet'] ` and were **ALL** successful. [07:47:46] 10DBA, 10Upstream: Events set to SLAVESIDE_DISABLED when upgrading from 10.1 to 10.4 - https://phabricator.wikimedia.org/T247728 (10Marostegui) 05Open→03Resolved a:03Marostegui This is confirmed fixed on 10.4.13. I have reimaged db2088 and installed 10.4.13 and after starting mysql: ` mysql:root@localho... 
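[Editor's note] The sql_mode difference flagged at [05:46:20] can be checked mechanically. A minimal sketch (plain Python, not part of any WMF tooling; the "10.4" string below only reflects the STRICT_TRANS_TABLES addition called out in the comment, while the real 10.4 default contains further flags):

```python
# Illustrative helper: diff two comma-separated sql_mode strings to see
# which flags an upgrade turned on or off.
def diff_sql_mode(old, new):
    old_flags = {f.strip() for f in old.split(",") if f.strip()}
    new_flags = {f.strip() for f in new.split(",") if f.strip()}
    return new_flags - old_flags, old_flags - new_flags

added, removed = diff_sql_mode(
    "NO_AUTO_CREATE_USER,NO_ENGINE_SUBSTITUTION",                    # 10.1 config
    "STRICT_TRANS_TABLES,NO_AUTO_CREATE_USER,NO_ENGINE_SUBSTITUTION",  # 10.4 (partial)
)
print(sorted(added))  # flags newly enforced after the upgrade
```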
[07:47:48] 10DBA, 10Epic: Upgrade WMF database-and-backup-related hosts to buster - https://phabricator.wikimedia.org/T250666 (10Marostegui) [07:50:29] 10DBA, 10Operations: In-place conversion from LVM to normal partition - https://phabricator.wikimedia.org/T252195 (10Marostegui) p:05Triage→03Medium [07:58:26] 10DBA, 10Operations, 10Patch-For-Review: Cleanup or remove mysql puppet module; repurpose mariadb module to cover misc use cases - https://phabricator.wikimedia.org/T162070 (10jcrespo) [08:45:06] 10DBA, 10Operations, 10Patch-For-Review: Cleanup or remove mysql puppet module; repurpose mariadb module to cover misc use cases - https://phabricator.wikimedia.org/T162070 (10Dzahn) [09:12:05] 10Blocked-on-schema-change, 10DBA, 10Anti-Harassment (The Letter Song), 10Patch-For-Review: ipb_address_unique has an extra column in production but not in the code (WAS: ipb_address_unique has an extra column in the code but not in production) - https://phabricator.wikimedia.org/T251188 (10Marostegui) 05... [09:16:01] 10DBA: Productionize db213[6-9] and db2140 - https://phabricator.wikimedia.org/T252985 (10Marostegui) [09:25:40] 10DBA, 10AbuseFilter: Find a good way to run the updateVarDumps script on large wikis - https://phabricator.wikimedia.org/T252696 (10Daimona) >>! In T252696#6143593, @Marostegui wrote: > Actually taking a day isn't that bad if the script is that safe. > 500 rows can probably be changed to 1000, but other than... [09:27:47] 10DBA, 10AbuseFilter: Find a good way to run the updateVarDumps script on large wikis - https://phabricator.wikimedia.org/T252696 (10Marostegui) >>! In T252696#6144133, @Daimona wrote: >>>! In T252696#6143593, @Marostegui wrote: >> Actually taking a day isn't that bad if the script is that safe. >> 500 rows c... [09:32:18] 10DBA: Productionize db213[6-9] and db2140 - https://phabricator.wikimedia.org/T252985 (10Marostegui) a:03Kormat [10:01:48] kormat: meeting? 
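[Editor's note] The updateVarDumps batch-size discussion in T252696 ([05:13:12], [09:25:40]) is the standard batched-maintenance pattern. A hypothetical sketch, with `fetch_batch` and `process_rows` standing in for the real script's logic (the 500-vs-1000 row sizes are the ones debated in the task):

```python
import time

# Process rows in fixed-size batches with an optional pause between batches
# so replication can keep up. Not the real updateVarDumps code.
def run_batched(fetch_batch, process_rows, batch_size=500, pause=0.0):
    total = 0
    while True:
        rows = fetch_batch(batch_size)
        if not rows:
            break
        process_rows(rows)
        total += len(rows)
        time.sleep(pause)  # let replicas catch up between batches
    return total
```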
[10:54:18] 10DBA: Degraded performance on parsercache with buster/mariadb upgrade - https://phabricator.wikimedia.org/T252761 (10Kormat) TODO: investigate if changing the version and config of prometheus-mysql-exporter has had a major impact on performance. [11:00:36] hey [11:00:43] in a toolforge daemon I see [11:01:03] /usr/local/sbin/maintain-dbusers[17979]: Could not connect to labsdb1011.eqiad.wmnet due to (2003, "Can't connect to MySQL server on 'labsdb1011.eqiad.wmnet' ([Errno 111] Connection refused)"). Skipping. [11:01:08] is this known? [11:01:11] yep [11:01:17] labsdb1011 is down for maintenance [11:01:22] it is depooled [11:01:27] shall we use other server meanwhile? [11:01:36] that host has been down for a long time [11:01:44] As it has been crashing [11:01:49] Mysql was up, but had no data [11:02:01] what is trying to use it? [11:02:13] it will be in a few hours I reckon [11:02:24] maintain-dbusers is the daemon that maintains toolforge accounts [11:03:10] arturo: How does it work if the host doesn't have data? (which was the case last week, mysql was up, but the data was being re-imported) [11:03:32] I'm not sure, I'm currently trying to understand https://wikitech.wikimedia.org/wiki/Portal:Data_Services/Admin/Labstore#maintain-dbusers [11:04:05] arturo: it must be hardcoded, cause that host was out for weeks already unfortunately [11:04:11] and only got mysql up around 10 days ago [11:04:15] (but without all the data) [11:04:27] ok [11:04:40] so maybe it was up but not even working? [11:04:41] what is the host that we should be using instead? 
[11:04:51] you can use labsdb1010 [11:04:57] ok [11:05:03] labsdb1011 will be up in a 2-3h [11:06:06] in modules/profile/manifests/wmcs/nfs/maintain_dbusers.pp there are several labsdb servers referenced [11:07:49] I think I will drop labsdb1011 from the list there [11:08:03] that list generates a config file that is later loaded by our daemon [11:08:16] but once it is back, we should revert it [11:08:21] sure [11:08:29] as I believe that populates all the dev accounts [11:10:24] arturo: this is the task to track this maintenance https://phabricator.wikimedia.org/T249188 [11:10:33] so maybe you want to get subscribed so you know when labsdb1011 is back [11:10:34] ack [11:11:50] wait, the user refreshed the page and now it works [11:12:01] I was preparing this https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/597040 [11:12:49] i guess it fails over to another host or something? [11:12:51] don't know [11:17:59] 10DBA: Degraded performance on parsercache with buster/mariadb upgrade - https://phabricator.wikimedia.org/T252761 (10Marostegui) I have done a quick check and looks like this is what prometheus executes: ` 158534 Connect prometheus@localhost as anonymous on 158535 Connect prometheus@localhost as anonymous on... [11:18:58] yes, reading the code again (and the error message) it just skips a db host if it can't connect [11:25:35] 10DBA: Degraded performance on parsercache with buster/mariadb upgrade - https://phabricator.wikimedia.org/T252761 (10Marostegui) Another interesting test to see what happens to the disk latency and query latency would be on pc2007: - stop slave - leave it for a few hours (pt-heartbeat will be running unless we... 
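[Editor's note] The maintain-dbusers behaviour worked out above ("it just skips a db host if it can't connect") looks roughly like this. A simplified stand-in (the real daemon is deployed from Puppet); `connect` is any callable that raises OSError, e.g. ConnectionRefusedError, when a replica host is down:

```python
import logging

# Iterate over the configured labsdb hosts, skipping any that refuse
# connections, matching the log line "Could not connect to ... Skipping."
def maintain_on_hosts(hosts, connect, apply_accounts):
    reached = []
    for host in hosts:
        try:
            conn = connect(host)
        except OSError as exc:
            logging.warning("Could not connect to %s due to %s. Skipping.", host, exc)
            continue
        apply_accounts(conn)
        reached.append(host)
    return reached
```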
[11:52:13] 10DBA: Upgrade and restart s2 and s8 (wikidatawiki) primary database masters: Tue 19th May - https://phabricator.wikimedia.org/T251981 (10Marostegui) a:03Marostegui [11:53:24] 10DBA: Upgrade and restart s2 and s8 (wikidatawiki) primary database masters: Tue 19th May - https://phabricator.wikimedia.org/T251981 (10Marostegui) package updated on db1122 and db1109. [12:13:42] 10DBA: Degraded performance on parsercache with buster/mariadb upgrade - https://phabricator.wikimedia.org/T252761 (10Kormat) I don't know if it's relevant, but mariadb changed the default value of `lock_wait_timeout` in 10.2: https://mariadb.com/docs/reference/es/system-variables/lock_wait_timeout/ pc2007 (bus... [12:21:07] 10DBA: Upgrade and restart s2 and s8 (wikidatawiki) primary database masters: Tue 19th May - https://phabricator.wikimedia.org/T251981 (10Marostegui) Procedure: Maintenance day: - Silence all hosts in s2 and s8 - Set read only on s2 and s8: ` dbctl --scope eqiad section s2 ro "Maintenance on s2 and s8 T251981"... [12:24:10] 10DBA: Degraded performance on parsercache with buster/mariadb upgrade - https://phabricator.wikimedia.org/T252761 (10Marostegui) >>! In T252761#6144456, @Kormat wrote: > I don't know if it's relevant, but mariadb changed the default value of `lock_wait_timeout` in 10.2: https://mariadb.com/docs/reference/es/sys... [13:48:29] 10DBA, 10AbuseFilter, 10Patch-For-Review: Find a good way to run the updateVarDumps script on large wikis - https://phabricator.wikimedia.org/T252696 (10Daimona) >>! In T252696#6144137, @Marostegui wrote: > If you can add it to: https://wikitech.wikimedia.org/wiki/Deployments that's enough. But also give us... [13:48:37] 10DBA, 10Patch-For-Review, 10cloud-services-team (Kanban): Reimage labsdb1011 to Buster and MariaDB 10.4 - https://phabricator.wikimedia.org/T249188 (10Marostegui) The copy to backup1002 is done. I have started mysql again on labsdb1011, once it has caught up, I will pool it back. Fingers crossed. 
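[Editor's note] The T251981 maintenance-day steps quoted at [12:21:07] lend themselves to a dry-run planner. A hypothetical sketch that only assumes the `dbctl ... ro` command shape shown in the task; the rest of the procedure (commit, master restart, setting rw again) is deliberately left out:

```python
# Build the dbctl read-only command strings for each section to be silenced.
def plan_read_only(sections, scope="eqiad", reason="Maintenance on s2 and s8 T251981"):
    return [f'dbctl --scope {scope} section {s} ro "{reason}"' for s in sections]

for cmd in plan_read_only(["s2", "s8"]):
    print(cmd)
```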
[13:49:23] 10DBA, 10AbuseFilter, 10Patch-For-Review: Find a good way to run the updateVarDumps script on large wikis - https://phabricator.wikimedia.org/T252696 (10Marostegui) Thank you @Daimona :-) [13:56:10] I will depool several read-only es1XXX hosts when I start the backups, probably leaving them running during the afternoon [13:57:37] sounds good [13:59:08] I think I will run the ro ones during the night and do the others, which take much less time, tomorrow morning [15:12:07] jynus: it looks like transfer.py does an md5sum of all source files before it starts copying. could something like rsync (that does the checksumming and copying in a single process) be used instead? [15:15:24] 10DBA, 10cloud-services-team (Kanban): Reimage labsdb1011 to Buster and MariaDB 10.4 - https://phabricator.wikimedia.org/T249188 (10Marostegui) labsdb1011 has caught up. I am waiting for @bstorm to drop the wb_terms views (T251598) before repooling this host back finally! If the host replicates well during th... [15:26:09] kormat: the student is implementing that, you can do --no-checksum and --no-encrypt within the dc to speed up the process [15:28:18] "all source files"? what are you cloning? [15:28:22] mm. might be worth killing it and restarting it with those options [15:28:34] kormat: yeah, definitely [15:28:34] jynus: i'm copying /srv/sqldata from db2073 to db2136 [15:29:25] I see, transfer.py is optimized to do a decompress from dbprov, cloning will take some time [15:29:59] kormat: this is what I normally do: ./transfer.py --no-encrypt --no-checksum source_host dest_host [15:31:54] the destination dir handling of transfer.py is... "interesting" :P [15:32:02] why? 
[15:34:05] my first attempt: transfer.py db2073.codfw.wmnet:/srv/sqldata/ db2136.codfw.wmnet:/srv/sqldata/ [15:34:13] which created /srv/sqldata/sqldata [15:34:27] yes, it's modeled after cp -R [15:34:35] i was kinda assuming it would honor the rsync convention of the trailing `/` on the source determining how the copy works [15:34:41] it is a copy command, not an rsync [15:35:14] then i tried with `/srv` as the destination, but it errored out, saying the dest directory already exists [15:35:24] then i tried with /srv/clone to have it make a new dir and i'll rename it after, [15:35:33] but it won't proceed if the remote dir doesn't exist [15:35:40] yes, that is intentional [15:35:42] so eventually i gave up and rm'd `/srv/sqldata` on the destination [15:35:58] you back up the original sqldata to sqldata.bak [15:36:02] then copy it [15:36:11] it prevents accidents [15:36:57] if you don't like it, I can always assign it to you :-D https://phabricator.wikimedia.org/T156462 [15:38:09] rsync over ssh was very slow and doesn't allow streaming backups [15:40:02] what do you mean by streaming backups? [15:40:10] xtrabackup [15:40:39] we need to use a unix pipe [15:41:00] how do checksums work in that mode? 
[15:41:16] they don't [15:41:21] ah ok [15:41:45] but we may be able to tee them into something, as I said, our GSoC student is working on that [15:41:55] 10DBA, 10Data-Services, 10Operations: Replace labsdb (wikireplicas) dbproxies: dbproxy1010 and dbproxy1011 - https://phabricator.wikimedia.org/T231520 (10Cmjohnson) [15:41:57] 10DBA: Remove grants for the old dbproxy hosts from the misc databases - https://phabricator.wikimedia.org/T231280 (10Cmjohnson) [15:43:09] * kormat nods [15:43:10] rsync was very slow for hundreds of thousands of files [15:43:45] just to be clear, we recommend you use that, but if you propose (and implement) something better, I am totally open to that :-D [15:44:33] transfer.py was just a productionization of nc oneliners [15:44:42] understood :) [15:44:54] so it could saturate the link [15:45:03] other options based on ssh didn't [15:45:27] and worked for the functionalities we needed (streaming, port handling, service handling, etc.) [15:47:39] to be fair, gzip should have integrated crc checking [15:55:16] So normally, the way to provision a host (in an emergency) is to use the pre-compressed dbprov tarballs [15:55:55] if you think of it that way (1 source file, precompressed), transfer.py (--decompress) will make more sense [15:56:01] jynus: for 3 of the hosts i'll be doing that method [15:56:20] I believe manuel wanted to clone the host 1:1 as it is not an emergency [15:56:33] or let me not think on behalf of it [15:56:37] *him [15:56:44] So I proposed to try each method [15:56:47] I can see a db -> db clone makes sense [15:56:54] A normal cloning db -> db for the master and the sanitarium master [15:56:56] to avoid corruption [15:57:03] and then using xtrabackup from the backups for the other hosts [15:57:12] so he can try both ways [15:57:14] sure, a variety of methods :-D [15:57:25] correct [15:57:27] what I mean is that I would agree to copy from different source hosts [15:57:39] precisely to prevent the same issue as labsdb1001 on 
production [15:57:51] different sources, less option for corruption [15:58:06] but in an emergency, the pre-packaged backups is the fastest way [15:58:20] yes, that's why I wanted him to try both ways [15:58:30] db to db and then dbprov to db [15:58:35] cool [15:58:55] as I said above, I suggested master and sanitarium master to do a clone, the others, a dbprov population [16:01:21] yeah, last thing we want is a single source of corruption infecting everything else [16:01:57] also rsync won't have integrated hot mysql backup :-D [17:29:40] 10DBA: kill_job function in remote execution module of transfer framework does not close the ports instantly - https://phabricator.wikimedia.org/T252950 (10jcrespo) > The remote execution module of this framework has kill_job function and it does not kill/close the port used by the netcat instantly. This ticket... [17:39:36] 10DBA, 10Operations, 10ops-codfw: db2097 memory errors leading to crash - https://phabricator.wikimedia.org/T252492 (10Papaul) Case Reference ID: 5347351645 Status: Case is generated and in Progress Product: HPE ProLiant DL360 Gen10 8SFF Configure-to-order Server Product number: 867959-B21 Serial number: Su... [17:46:28] 10DBA: kill_job function in remote execution module of transfer framework does not close the ports instantly - https://phabricator.wikimedia.org/T252950 (10Privacybatm) Actually I'm talking about the kill_job function. I will give you a context to understand my question correctly. ` def _kill_use_ports(self, h... [18:00:33] 10DBA, 10Operations, 10ops-codfw: db2097 memory errors leading to crash - https://phabricator.wikimedia.org/T252492 (10Papaul) Will be receiving the DIMM tomorrow. The HP engineer recommended to update the firmware after the DIMM has been replaced. 
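[Editor's note] The checksum discussion above (transfer.py doing a separate md5sum pass at [15:12:07], versus "tee them into something" at [15:41:45]) boils down to hashing inline while the data streams. A minimal sketch of that idea, not the actual transfer.py or GSoC code:

```python
import hashlib
import io

# Compute the md5 as chunks flow from reader to writer, so no second pass
# over the source files is needed.
def stream_with_md5(read_chunk, write_chunk):
    digest = hashlib.md5()
    while True:
        chunk = read_chunk()
        if not chunk:
            break
        digest.update(chunk)  # checksum accumulates as bytes flow
        write_chunk(chunk)
    return digest.hexdigest()

# Demo with in-memory buffers standing in for the network pipe:
src, dst = io.BytesIO(b"some table data"), io.BytesIO()
checksum = stream_with_md5(lambda: src.read(4096), dst.write)
```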
[18:04:43] 10DBA: kill_job function in remote execution module of transfer framework does not close the ports instantly - https://phabricator.wikimedia.org/T252950 (10jcrespo) > The reason behind that is, the remote_executor.kill_job() does not close the port instantly (takes more than 30s in my machine). Interesting, I w... [18:14:42] 10DBA: kill_job function in remote execution module of transfer framework does not close the ports instantly - https://phabricator.wikimedia.org/T252950 (10jcrespo) I've checked and both a manual "kill -9" and a "kill -15" should make the port available almost instantly, so probably it is not that. Maybe kill_j... [18:27:42] 10DBA, 10Operations, 10ops-codfw: db2097 memory errors leading to crash - https://phabricator.wikimedia.org/T252492 (10Papaul) Hello Papaul, Greetings from Hewlett Packard Enterprise! As discussed, as per the AHS logs: Memory Failure is seen on Proc 2 DIMM 4. Uncorrectable Machine Check exception is s... [19:27:18] 10DBA, 10cloud-services-team (Kanban): Reimage labsdb1011 to Buster and MariaDB 10.4 - https://phabricator.wikimedia.org/T249188 (10bd808) >>! In T251598#6145566, @Stashbot wrote: > {nav icon=file, name=Mentioned in SAL (#wikimedia-operations), href=https://sal.toolforge.org/log/SaytKHIBj_Bg1xd3eZOe} [2020-05-...
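[Editor's note] jcrespo's observation in T252950 (a manual kill -9 or kill -15 releases the port almost instantly) can be demonstrated end to end. A hypothetical sketch, not the transfer framework's kill_job: a child process holds a listening port, the parent SIGTERMs it, then polls until it can bind the port itself. Port 47654 is an arbitrary choice for the demo.

```python
import errno
import multiprocessing
import socket
import time

def wait_port_free(port, timeout=5.0):
    """Poll until we can bind the port ourselves, i.e. it has been released."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
        s.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
        try:
            s.bind(("127.0.0.1", port))
            return True
        except OSError as e:
            if e.errno != errno.EADDRINUSE:
                raise
            time.sleep(0.05)
        finally:
            s.close()
    return False

def hold_port(port):
    """Child process: grab the port and sit on it, like a netcat listener."""
    s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    s.bind(("127.0.0.1", port))
    s.listen(1)
    time.sleep(60)

def demo(port=47654):
    p = multiprocessing.Process(target=hold_port, args=(port,))
    p.start()
    time.sleep(0.5)   # give the child time to bind
    p.terminate()     # SIGTERM, like a manual `kill -15`
    p.join()
    return wait_port_free(port)

if __name__ == "__main__":
    print(demo())  # prints True when the port is released promptly after the kill
```

If this returns quickly, the 30-second delay seen in the task is more likely kill_job not actually terminating the netcat process than the kernel holding the port.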