[03:21:41] 10DBA, 10Growth-Team, 10PageCuration, 10User-DannyS712: updatePageTriageQueue doesn't remove old records if the page doesn't exist - https://phabricator.wikimedia.org/T256704 (10DannyS712)
[04:03:17] 10DBA, 10Data-Services, 10Projects-Cleanup: Drop DB tables for now-deleted fixcopyrightwiki from production - https://phabricator.wikimedia.org/T246055 (10Marostegui) Everything :) We just removed the tables from labs infra. Dropping wikis isn't something trivial. We could truncate them though
[04:44:32] 10DBA: pl_namespace index on pagelinks is unique only in s8 - https://phabricator.wikimedia.org/T256685 (10Marostegui) p:05Triage→03Medium
[04:44:43] 10DBA: pl_from index still lingers in random hosts - https://phabricator.wikimedia.org/T256684 (10Marostegui) p:05Triage→03Medium
[04:44:58] 10DBA: page_restrictions indexes have been majestically drifting from code - https://phabricator.wikimedia.org/T256682 (10Marostegui) p:05Triage→03Medium
[04:46:58] 10DBA, 10Operations, 10Wikimedia-Mailing-lists: Create databases for mailman3 - https://phabricator.wikimedia.org/T256538 (10Marostegui) >>! In T256538#6265626, @herron wrote: >>>! In T256538#6262958, @Marostegui wrote: >> @herron any idea how big these DBs can be and how many writes we'd be expecting? >> Wh...
[04:51:39] 10DBA, 10Patch-For-Review: Relocate "old" s4 hosts - https://phabricator.wikimedia.org/T253217 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by marostegui on cumin1001.eqiad.wmnet for hosts: ` ['db1080.eqiad.wmnet'] ` The log can be found in `/var/log/wmf-auto-reimage/202006300451_marostegui_8504...
[04:53:01] 10DBA: imagelinks has index mismatch on s8 - https://phabricator.wikimedia.org/T256680 (10Marostegui) p:05Triage→03Medium
[04:57:22] 10DBA: pl_from index still lingers in random hosts - https://phabricator.wikimedia.org/T256684 (10Marostegui) db1096:3316 and db1098:3316 had `pl_from` on jawiki and ruwiki - I have removed them.
[04:59:08] 10DBA: pl_from index still lingers in random hosts - https://phabricator.wikimedia.org/T256684 (10Marostegui) 05Open→03Resolved a:03Marostegui Remove from s4: ` root@cumin1001:~# for i in db1141 db1121 db1148; do mysql.py -h$i commonswiki -e "show create table pagelinks\G" | grep -i UNIQUE ; done root@cumi...
[04:59:13] 10DBA, 10Datasets-General-or-Unknown, 10Patch-For-Review, 10Sustainability (Incident Prevention), 10WorkType-NewFunctionality: Automate the check and fix of object, schema and data drifts between mediawiki HEAD, production masters and slaves - https://phabricator.wikimedia.org/T104459 (10Marostegui)
[05:01:46] 10DBA, 10Patch-For-Review: Relocate "old" s4 hosts - https://phabricator.wikimedia.org/T253217 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['db1080.eqiad.wmnet'] ` Of which those **FAILED**: ` ['db1080.eqiad.wmnet'] `
[05:04:23] 10DBA: pl_namespace index on pagelinks is unique only in s8 - https://phabricator.wikimedia.org/T256685 (10Marostegui)
[05:05:04] 10DBA, 10Patch-For-Review: Relocate "old" s4 hosts - https://phabricator.wikimedia.org/T253217 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by marostegui on cumin1001.eqiad.wmnet for hosts: ` ['db1080.eqiad.wmnet'] ` The log can be found in `/var/log/wmf-auto-reimage/202006300504_marostegui_1043...
[05:07:37] 10DBA: imagelinks has index mismatch on s8 - https://phabricator.wikimedia.org/T256680 (10Marostegui)
[05:09:32] 10DBA: imagelinks has index mismatch on s8 - https://phabricator.wikimedia.org/T256680 (10Marostegui)
[05:11:37] 10DBA: pl_namespace index on pagelinks is unique only in s8 - https://phabricator.wikimedia.org/T256685 (10Marostegui)
[05:12:51] 10DBA: imagelinks has index mismatch on s8 - https://phabricator.wikimedia.org/T256680 (10Marostegui)
[05:14:18] 10DBA: imagelinks has index mismatch on s8 - https://phabricator.wikimedia.org/T256680 (10Marostegui)
[05:14:56] 10DBA: imagelinks has index mismatch on s8 - https://phabricator.wikimedia.org/T256680 (10Marostegui) codfw done: ` # /home/marostegui/section s8 | grep codfw | while read host port; do echo "$host:$port" ; mysql.py -h$host:$port wikidatawiki -e "show create table imagelinks\G" | egrep UNIQUE ; done db2100.codf...
[05:15:40] 10DBA, 10Patch-For-Review: Relocate "old" s4 hosts - https://phabricator.wikimedia.org/T253217 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['db1080.eqiad.wmnet'] ` Of which those **FAILED**: ` ['db1080.eqiad.wmnet'] `
[05:16:16] 10DBA: imagelinks has index mismatch on s8 - https://phabricator.wikimedia.org/T256680 (10Marostegui) eqiad progress [] labsdb1012 [] labsdb1011 [] labsdb1010 [] labsdb1009 [x] dbstore1005 [] db1126 [] db1124 [x] db1116 [] db1114 [] db1111 [] db1109 [] db1104 [x] db1101 [x] db1099 [x] db1092 [] db1087
[06:31:54] 10DBA, 10Operations: db1097 (m1 master) crashed due to memory issues. - https://phabricator.wikimedia.org/T256717 (10Marostegui)
[06:33:22] 10DBA, 10Operations: db1097 (m1 master) crashed due to memory issues. - https://phabricator.wikimedia.org/T256717 (10Marostegui)
[06:33:49] 10DBA, 10Operations: db1097 (m1 master) crashed due to memory issues. - https://phabricator.wikimedia.org/T256717 (10Marostegui) p:05Triage→03High I was in the process of moving db1080 to m2, but I will move it to m1 instead so we can replace and decommission this host.
[06:34:35] were the dimms disabled, does it have less memory now?
[06:35:05] no, it doesn't
[07:59:53] 10DBA, 10Patch-For-Review: Relocate "old" s4 hosts - https://phabricator.wikimedia.org/T253217 (10Marostegui)
[08:10:11] jynus: I am seeing something strange with transfer.py
[08:10:29] let me know
[08:11:30] https://phabricator.wikimedia.org/P11705
[08:14:27] that's weird
[08:15:52] Yeah, I don't get it
[08:16:34] everything is against me today :(
[08:24:15] I will revert until I debug, because I don't see what is going on, that part of the code didn't change at all
[08:24:36] I can also use the old one
[08:24:41] so you don't have to rush
[08:24:53] but it might cause backups to fail
[08:25:01] no, I need a working version in production always
[08:25:06] right :)
[08:29:25] which transfer.py and try again?
[08:30:04] it works now
[08:33:02] I will debug to see what went wrong
[08:33:09] thanks
[08:33:17] but I suspect a packaging issue
[08:33:19] not a code issue
[08:34:21] may I ask if the --no-compress option gives you better speed or something?
[08:34:41] nah, I thought it wasn't worth it for just 500G
[08:34:44] ok
[08:38:46] yeah, it is packaging because with a local clone it works
[08:39:03] may I ask you to test a silly transfer with the latest HEAD?
[08:39:12] sure
[08:39:16] to confirm that HEAD works for you too?
[08:39:27] let me see
[08:40:08] e.g. try the same command to a separate server, then cancell
[08:40:11] *cancel
[08:40:55] git clone "https://gerrit.wikimedia.org/r/operations/software/transferpy"
[08:41:28] PYTHONPATH=clone_dir transfer.py ...
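The clone-over-package test suggested above relies on `PYTHONPATH` taking precedence over the system-wide install. A minimal, self-contained sketch of that mechanism (the one-line stand-in module and scratch directories below are hypothetical, not the real transferpy layout):

```shell
#!/bin/bash
# Sketch: directories on PYTHONPATH are searched before site-packages,
# so a local clone shadows an installed package of the same name.
# "transferpy" here is a stand-in module created in a scratch dir.
set -e
clone=$(mktemp -d)
mkdir -p "$clone/transferpy"
printf 'VERSION = "local-clone"\n' > "$clone/transferpy/__init__.py"

# Python resolves the import from the clone, not any global install:
ver=$(PYTHONPATH="$clone" python3 -c 'import transferpy; print(transferpy.VERSION)')
echo "$ver"
```

This is why running `PYTHONPATH=clone_dir transfer.py ...` is a quick way to separate "bug in HEAD" from "bug in the deployed package": the same entry point runs against the clone's code while the packaged install stays untouched.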
[08:42:19] because I got the same error as you with the package install, but not with HEAD
[08:42:27] 18 bytes correctly transferred from db1117.eqiad.wmnet to db1080.eqiad.wmnet
[08:42:45] and you are sure it is using the clone version, right?
[08:42:58] I think that one disables cumin output
[08:43:10] yes, I just recloned it
[08:43:34] ok, so good and bad news- code is ok, but I f* the packaging
[08:43:36] the testing is done with the recloned one
[08:44:02] which I can see happening with missing paths, but that error is weird
[08:44:23] will test locally and report
[08:44:45] yeah, I don't see why the packaging could cause that, cause it looked like a code error
[08:44:58] I could have understood path issues or similar, but that particular error...
[08:47:23] it may still be a path + bad logic showing the wrong error
[08:55:45] 10DBA: Execution error after moving to debian package - https://phabricator.wikimedia.org/T256725 (10jcrespo)
[08:59:14] 10DBA: Execution error after moving to debian package - https://phabricator.wikimedia.org/T256725 (10Privacybatm) In our testing environment, I am currently using the Debian package only. Let me see what this issue could be!
[09:00:53] 10DBA: Execution error after moving to debian package - https://phabricator.wikimedia.org/T256725 (10Privacybatm) Can we get a --verbose output? That would tell us if it is a problem with Cumin.
[09:02:41] 10DBA: Execution error after moving to debian package - https://phabricator.wikimedia.org/T256725 (10jcrespo) --verbose was the first thing I tried, but it didn't work either, it just showed the same error. Maybe the package I created was faulty? I will try to build it again with HEAD.
[09:14:52] I will use cumin2001 to install the new package, ok?
[09:15:08] use cumin1001 for work
[09:17:15] good
[09:17:36] I did nothing and I think it now works
[09:17:45] :-/
[09:21:10] I am going to re-revert
[09:23:15] let me know when the transfer finishes
[09:23:26] will do, should be done in a few minutes
[09:24:13] I think I may have uploaded an old version of the package?
[09:25:28] 10DBA, 10Patch-For-Review: Execution error after moving to debian package - https://phabricator.wikimedia.org/T256725 (10jcrespo) I rebuilt, and it works again. Maybe I had uploaded an earlier version of the package?
[09:29:43] 10DBA, 10Patch-For-Review: Execution error after moving to debian package - https://phabricator.wikimedia.org/T256725 (10Privacybatm) Actually, I never got that error. I will look into how that error could happen.
[09:36:18] jynus: transfer finished
[09:39:52] could you remove all local repos and try the global install execution again?
[09:40:07] 10DBA, 10Patch-For-Review: Relocate "old" s4 hosts - https://phabricator.wikimedia.org/T253217 (10Marostegui)
[09:40:11] yep
[09:42:17] it works
[09:42:22] 26 bytes correctly transferred from db1117.eqiad.wmnet to db1080.eqiad.wmnet
[09:42:35] did you copy the alphabet? :)
[09:42:41] hahaha
[09:42:51] so I changed nothing
[09:43:14] I wonder if it was a leftover of calling the locally installed class, not being purged
[09:43:31] jynus: out of interest - did puppet remove the old transfer.py from the cumin machines?
[09:43:38] no, I did it manually
[09:43:45] ah ok
[09:43:53] I could have added an ensure => removed
[09:43:58] but it is only 2 hosts
[09:44:56] ensure => absent
[09:46:25] 10DBA, 10Patch-For-Review: Execution error after moving to debian package - https://phabricator.wikimedia.org/T256725 (10jcrespo) 05Open→03Resolved a:03jcrespo So I think this was a deployment issue only, something about building the package or configured paths leftover from the previous installations. T...
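A likely contributor to "stale path" behavior after swapping a deployed script is bash's command-location cache, discussed next in the log. A minimal sketch of the effect, using only scratch directories and a hypothetical `greet` script (not the production layout):

```shell
#!/bin/bash
# Sketch of bash command hashing: once bash finds "greet" on $PATH it
# remembers that location and will not re-search $PATH, even if a
# better candidate appears earlier in $PATH afterwards.
set -e
tmpdir=$(mktemp -d)
mkdir -p "$tmpdir/first" "$tmpdir/second"
PATH="$tmpdir/first:$tmpdir/second:$PATH"

printf '#!/bin/sh\necho old\n' > "$tmpdir/second/greet"
chmod +x "$tmpdir/second/greet"
greet > "$tmpdir/run1"    # found in second/, location now cached

printf '#!/bin/sh\necho new\n' > "$tmpdir/first/greet"
chmod +x "$tmpdir/first/greet"
greet > "$tmpdir/run2"    # cached second/greet still wins

hash -r                   # forget all remembered locations
greet > "$tmpdir/run3"    # fresh PATH search: first/greet wins
```

`hash -r` (or starting a new shell) clears the table; bash also offers `shopt -s checkhash` to re-verify a hashed path exists before using it, which covers the deleted-binary case raised below.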
[09:46:27] 10DBA: Package transferpy framework - https://phabricator.wikimedia.org/T253736 (10jcrespo)
[09:46:35] I think there is some weird caching with paths until you log out
[09:49:04] `hash -r` will fix that in bash
[09:49:29] the first time bash has to search $PATH for a binary, it will remember the location until told to forget it
[09:49:47] even if deleted?
[09:49:56] I didn't know about that functionality
[09:50:08] I thought it was dynamic every time
[09:50:26] yep
[09:50:52] or let me rephrase it- I would expect it to fall back to a dynamic search if not found
[09:52:06] anyway, nothing to see here- I failed to do the deploy properly
[09:52:21] but I don't think it will happen again now that we have migrated to package-based deployment
[09:52:32] jynus: a demonstration: https://phabricator.wikimedia.org/P11707
[09:52:55] oh, I believe you because I saw it before
[09:53:02] I was just complaining
[09:53:06] haha
[09:54:55] anyway, tonight will be the stress test, when backups run with the new "binary"
[10:01:12] look we are very much away from the 8 puppet deployments and reverts some SRE did of a patch
[10:01:17] *far away
[10:03:21] 10DBA, 10Operations, 10Patch-For-Review: db1097 (m1 master) crashed due to memory issues. - https://phabricator.wikimedia.org/T256717 (10Marostegui) db1080 is ready. Now we just need to schedule another m1 failover to promote db1080 to master.
[10:16:41] marostegui: i'm going to change our mysql dashboard to use thanos (https://wikitech.wikimedia.org/wiki/Thanos#Porting_dashboards_to_Thanos)
[10:17:14] kormat: ah good, which one will you change first?
[10:18:12] that's a moderately good question
[10:18:27] i was thinking of the main "MySQL" one. but i'm going to make a copy of it first and work on that
[10:18:36] and some alerts will need to be updated
[10:18:42] i'll open a task for this
[10:18:47] sounds good
[10:19:12] you can also use: https://grafana.wikimedia.org/d/XyoE_N_Wz/wikidata-database-cpu-saturation?orgId=1 which is no longer super critical
[10:19:21] as we have extra capacity and we are not peaking that much
[10:19:27] and 10.4 works better with compression
[10:21:24] ok, cool
[10:21:46] dashboards like that just scream out to be auto-generated
[10:23:38] 10DBA, 10Operations, 10User-Kormat: Port DBA dashboards to thanos - https://phabricator.wikimedia.org/T256730 (10Kormat)
[10:23:46] marostegui: let me know if i've missed any ^
[10:24:39] 10DBA, 10Operations, 10User-Kormat: Port DBA dashboards to thanos - https://phabricator.wikimedia.org/T256730 (10Marostegui)
[10:25:12] 10DBA, 10Operations, 10User-Kormat: Port DBA dashboards to thanos - https://phabricator.wikimedia.org/T256730 (10Kormat) p:05Triage→03Medium
[10:27:08] 10DBA, 10MediaWiki-General, 10MW-1.35-notes (1.35.0-wmf.34; 2020-05-26), 10Patch-For-Review, and 2 others: Normalise MW Core database language fields length - https://phabricator.wikimedia.org/T253276 (10Marostegui)
[11:14:56] 10DBA, 10Operations, 10CAS-SSO, 10Patch-For-Review, 10User-jbond: Request new database for idp-test.wikimedia.org - https://phabricator.wikimedia.org/T256120 (10Marostegui) @jbond is the scope of this task done or is there anything else left?
[12:07:11] 10DBA, 10Operations, 10Patch-For-Review: db1097 (m1 master) crashed due to memory issues. - https://phabricator.wikimedia.org/T256717 (10Marostegui) Actually I just realised that this host won't be replaced next FY, as we are replacing up to db1095.
[12:49:41] 10DBA, 10Patch-For-Review: Make checksum parallel to the data transfer in transferpy package - https://phabricator.wikimedia.org/T254979 (10Privacybatm) I have run benchmarks with the new cloud test machines. bigfile: 1.4TB manySmallFiles300: 293GB (150 000 files) I have run the benchmark in the following ord...
[12:57:06] jynus: re: storing the list of sections in puppet, i know what you're talking about. on the other hand, puppet currently depends on knowing what hosts are in what sections to know how to set up the host (particularly for multi-section)
[12:57:29] if in the future something like zarcillo could act as an ENC and be the single source of truth, that would be great
[12:57:45] but in the meantime i'd like to make the management of sections in puppet a bit more.. controlled
[12:59:03] 10DBA: transferpy --checksum wrongly outputs `checksums do not match` message - https://phabricator.wikimedia.org/T256755 (10Privacybatm)
[13:00:24] 10DBA: transferpy --checksum wrongly outputs `checksums do not match` message - https://phabricator.wikimedia.org/T256755 (10Privacybatm)
[13:17:01] kormat: yeah, it wasn't a "don't do this", more of a notice of the issues/direction we were going in
[13:17:20] +1
[13:18:49] 10DBA: transferpy --checksum wrongly outputs `checksums do not match` message - https://phabricator.wikimedia.org/T256755 (10jcrespo)
[13:22:41] 10DBA: transferpy --checksum wrongly outputs `checksums do not match` message - https://phabricator.wikimedia.org/T256755 (10jcrespo) Indeed an issue with a clear reason why it happens (the directory traversal order in the method used is not deterministic). There are several options on how to go about this. I think...
[13:25:43] 10DBA: transferpy --checksum wrongly outputs `checksums do not match` message - https://phabricator.wikimedia.org/T256755 (10Privacybatm) Yeah, I will look into this. Let's leave this ticket here so that we can keep an eye on it!
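The T256755 root cause (non-deterministic directory traversal order) can be reproduced with a toy example. This is an illustration of the failure mode and one possible fix, not transferpy's actual checksum code:

```shell
#!/bin/bash
# Identical data checksummed in two different file orders yields two
# different digests, i.e. a spurious "checksums do not match".
# All files live in a scratch dir.
set -e
d=$(mktemp -d)
echo one > "$d/a"
echo two > "$d/b"

sum_ab=$(cat "$d/a" "$d/b" | md5sum | cut -d' ' -f1)
sum_ba=$(cat "$d/b" "$d/a" | md5sum | cut -d' ' -f1)
[ "$sum_ab" != "$sum_ba" ]    # same files, different checksum

# Deterministic variant: fix the path order before hashing.
stable=$(find "$d" -type f | sort | xargs cat | md5sum | cut -d' ' -f1)
[ "$stable" = "$sum_ab" ]     # sorted order is a then b
```

Sorting the file list before hashing is one of the options hinted at in the ticket; alternatives include checksumming per-file and comparing the resulting order-independent set.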
[13:31:32] 10DBA, 10Patch-For-Review: Make checksum parallel to the data transfer in transferpy package - https://phabricator.wikimedia.org/T254979 (10jcrespo) A preliminary result from this suggests that parallel checksum should be able to be disabled, but be enabled by default (unless cpu usage increased a lot). Now i...
[13:35:42] 10DBA, 10Patch-For-Review: Make checksum parallel to the data transfer in transferpy package - https://phabricator.wikimedia.org/T254979 (10Privacybatm) >>! In T254979#6267654, @jcrespo wrote: > A preliminary result from this suggests that parallel checksum should be able to be disabled, but be enabled by defa...
[13:37:22] 10DBA, 10Patch-For-Review: Make checksum parallel to the data transfer in transferpy package - https://phabricator.wikimedia.org/T254979 (10jcrespo) > I feel like user enabling will be the best idea, right? Fair enough. > I will try with source parallel checksum as we discussed in the last meeting. Thanks.
[14:10:00] 10DBA, 10Gerrit, 10Patch-For-Review: Make sure both `reviewdb-test` (used for gerrit upgrade testing) and `reviewdb` (formerly production) databases get torn down - https://phabricator.wikimedia.org/T255715 (10jcrespo) @Dzahn @QChris Can this start happening? Database backups will be available for some time a...
[14:11:14] 10DBA, 10Patch-For-Review: Make checksum parallel to the data transfer in transferpy package - https://phabricator.wikimedia.org/T254979 (10Privacybatm) Sorry, I forgot to give the `sysbench` outputs. **sysbench --test=fileio --file-total-size=150G prepare** 161061273600 bytes written in 373.16 seconds (411.6...
[14:12:51] andrewbogott: ^ this may be interesting for the cloud team as a benchmark of ceph?
[14:15:44] subscribed!
[16:05:14] 10DBA, 10Operations, 10CAS-SSO, 10Patch-For-Review, 10User-jbond: Request new database for idp-test.wikimedia.org - https://phabricator.wikimedia.org/T256120 (10jcrespo) The proposed refactoring could break the core/misc separation. We have deployed a far-from-ideal misc::idp_test (which I still have to...