[09:49:47] Traffic, SRE: interface-rps crashes on lsv4007 - https://phabricator.wikimedia.org/T274103 (Vgutierrez)
[09:50:45] Traffic, SRE: interface-rps crashes on lsv4007 - https://phabricator.wikimedia.org/T274103 (Vgutierrez) p:Triage→Unbreak!
[09:55:41] Traffic, SRE: interface-rps crashes on lsv4007 - https://phabricator.wikimedia.org/T274103 (Volans) It seems to me that this is related to https://gerrit.wikimedia.org/r/plugins/gitiles/operations/puppet/+/refs/heads/production/modules/interface/files/interface-rps.py#186 ` >>> a = 'foo %d' >>> b = re.s...
[09:57:04] Traffic, SRE: interface-rps crashes on lsv4007 - https://phabricator.wikimedia.org/T274103 (elukey) https://phabricator.wikimedia.org/T273918 was filed by Effie last week, it also happened for some MW servers.
[10:04:40] Traffic, SRE, Patch-For-Review: interface-rps crashes on lsv4007 - https://phabricator.wikimedia.org/T274103 (Volans) I did just a quick look but I think that if you use `\\d` it should do the right thing: ` >>> a = 'foo %d bar' >>> b = re.sub('%d', r'(\\d+)', a) >>> c = re.compile(r'^\s*([0-9]+):.*...
[10:33:19] netops, SRE: Upgrade Fastnetmon to 1.2.0 - https://phabricator.wikimedia.org/T271228 (ayounsi)
[10:51:53] vgutierrez: I will merge 662637
[10:52:40] effie: <3 thanks
[10:53:14] I'll restart lvs4007 again after it's merged to double check that everything works as expected
[11:08:45] Traffic, SRE, Patch-For-Review: interface-rps crashes on lsv4007 - https://phabricator.wikimedia.org/T274103 (Vgutierrez) Open→Resolved a:jijiki Thanks @Volans and @jijiki
[12:35:02] hmmm this raises the question: why only lvs4007?
[12:35:09] or is it only lvs4007, I guess?
[12:35:23] (or have they all been doing this since buster upgrade?)
[12:35:39] or did python go from 3.6 to 3.7 without an OS upgrade?
[12:35:46] vgutierrez: ^
[12:36:22] keep in mind interface-rps re-runs may blip interface traffic, it's not generally-safe to just run it for kicks, I don't think.
[12:36:37] All of them
[12:36:42] Not only 4007
[12:36:55] so we've got no interface-rps effects on any LVSes since buster, basically?
[12:37:05] that could explain some issues! :)
[12:37:12] we had.. that's fixed already
[12:37:35] what do you mean?
[12:38:51] (by, "that's fixed already", do you mean the patch just merged today above, or do you mean that they've all been re-run since that merge today, or that they were all somehow otherwise-fixed earlier?)
[12:38:58] Effie fixed interface-rps this morning
[12:39:28] And I'm restarting the lvs servers today to upgrade the kernels
[12:39:40] yeah I get that, my question is: have all our LVSes been running without the benefits of interface-rps since their buster upgrades? (it sounds like yes, in which case a lot of our questions about their poor performance in certain recent situations are answered)
[12:39:50] I don't think so
[12:40:15] it seems like either that has to be the case, or python moved from 3.6 to 3.7 in the middle of buster, which would also surprise me
[12:40:16] It looks like a buster update triggered this
[12:40:43] hmmm
[12:40:45] no, the script was recently switched to Py3
[12:40:52] let me find the commit
[12:41:06] https://gerrit.wikimedia.org/r/c/operations/puppet/+/652575/
[12:42:22] ok, now things make more sense!
[12:42:26] thanks!
[12:49:03] I think I may have actually hit this and been confused into thinking it was something else recently, too. It's possible anyways, now that I look back.
[12:53:29] hmmm I donno, maybe not. it's hard to say at this point, but time with a fixed script will tell I guess.
[13:27:40] so yeah the various py2->py3 helper tools don't cover this case, I checked
[13:27:57] which sorta-makes-sense, it's not one of the known 2->3 things, it's a subtle 3.6 -> 3.7 thing.
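For readers following along, here is a minimal sketch of the kind of 3.6 -> 3.7 difference being discussed, assuming the failure is the `re.sub` replacement-template change that Volans's snippet points at. The function names and the `TEMPLATE` value are hypothetical and are not taken from interface-rps.py.

```python
import re

TEMPLATE = 'foo %d bar'  # hypothetical input, modelled on the snippet quoted in the task


def build_pattern_unescaped(template):
    # r'(\d+)' used as a replacement template contains the unknown escape \d:
    # a DeprecationWarning on Python 3.6, re.error on 3.7 and later.
    return re.sub('%d', r'(\d+)', template)


def build_pattern_escaped(template):
    # A doubled backslash in the template yields one literal backslash,
    # so the result contains the two characters "\d".
    return re.sub('%d', r'(\\d+)', template)


def build_pattern_callable(template):
    # A callable replacement is inserted verbatim; template escapes
    # are not processed at all.
    return re.sub('%d', lambda match: r'(\d+)', template)
```

Either of the last two forms should behave the same on 3.6 and 3.7+; the callable form avoids template escaping entirely, at the cost of being a little less obvious to read.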
[13:28:28] I think this was linked in one of our bugs or the patch, but it's the best upstream-python reference I found:
[13:28:31] https://bugs.python.org/issue34304
[13:28:58] which seems to indicate that the viewpoint of upstream is that developers should check for deprecation warnings to know about these things
[13:29:13] but then deprecation warnings are non-default warnings in python in general :/
[13:29:58] is there maybe a best-practice we should be employing that would've snagged this, kinda equivalent to perl's old pattern of starting scripts out with "use strict; use warnings"
[13:30:32] I know python has a warnings module, but looking more for where the advice is of "add these lines to the top of every python script if you care about quality and being warned of impending breakage"
[13:31:20] bblack: test before deploying and/or write tests :-P
[13:32:59] for the warnings you can use either -W or set PYTHONWARNINGS to either 'default' (show all) or 'error' (convert warnings to errors)
[13:33:17] yeah I meant more in-script, so that it's not reliant on a way of executing
[13:33:52] interface-rps isn't a very-testable kind of script :/
[13:34:15] import warnings
[13:34:25] warnings.simplefilter('default') # or error
[13:34:42] right
[13:34:44] set also os.environ["PYTHONWARNINGS"] if subprocess is used
[13:34:50] see https://docs.python.org/3/library/warnings.html#overriding-the-default-filter
[13:34:56] that kind of thing, except apparently "default" warnings won't tell you about a breaking change to string parsing :P
[13:35:00] but yeah
[13:35:29] bblack: can't be tested on the sretest* hosts?
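The in-script approach suggested at 13:34 might look roughly like the sketch below. `enable_strict_warnings` and `legacy_call` are hypothetical names used only for illustration; whether 'error' or 'default' is the right filter for a production script is a judgment call this sketch does not settle.

```python
import os
import warnings


def enable_strict_warnings():
    # Turn every warning, including DeprecationWarning, into an exception
    # for the current interpreter.
    warnings.simplefilter('error')
    # Child Python interpreters started later (e.g. via subprocess) read
    # this variable at startup and apply the same filter.
    os.environ['PYTHONWARNINGS'] = 'error'


def legacy_call():
    # Hypothetical stand-in for any code path that emits a deprecation warning.
    warnings.warn('legacy_call is deprecated', DeprecationWarning, stacklevel=2)
    return 'ok'
```

Calling `enable_strict_warnings()` as the first statement of a script makes `legacy_call()` raise `DeprecationWarning` instead of returning, regardless of how the script was launched.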
[13:36:29] will surely not have the fancy network cards of the LVSes, but maybe it's enough for a sanity check
[13:36:39] I meant more like, can't be tested in the abstract
[13:36:53] since it reads and touches system-level stuff
[13:37:05] sure
[13:37:19] but yes, could be at least somewhat tested on any live host, as root, accepting that it might blip interface traffic for the whole host as you test
[13:37:27] or if the test goes really badly, might kill network access heh
[13:37:41] or maybe in a VM against a virtual network driver?
[13:38:13] can at least emulate the interface to the US, and although that part could go out of sync, the rest of the script would be tested
[13:39:06] this kind of thing (the string incompat) could in theory be caught by a parse-only test without execution
[13:39:09] in this particular case, running the function in /any/ test would have caught it
[13:39:24] but my vague understanding is that python doesn't really parse lines until it executes them, so that model may not apply here in general
[13:39:43] (e.g. I've seen syntax errors in unexecuted if-branches go ignored when a script runs)
[13:39:47] just spoon feed it a mock /proc/interrupts file I guess
[13:40:39] and everything else
[13:40:59] it would be simpler to feed it a "mock" /proc + /sys in general by executing in an emulated kernel you don't care about
[13:41:25] assuming the virtual interface there has enough scaffolding to meaningfully test at least some of it.
[13:42:06] I'm mostly just trying to see if there's some easy answer to have prevented this, that we could apply in the future
[13:42:59] the warnings thing would've caught it, if we applied non-default warnings and raised them all to errors, inside the script itself with the warnings module.
[13:43:37] well maybe. I guess all that would've done is make it suddenly break on an even earlier version of python
[13:43:58] (nobody's going to go looking for just the warnings output in production on an auto-executed script)
[13:51:45] most of the easy answers (even that CI or a human should just try executing the script after a change, or try it with -W or whatever), it all kinda falls apart due to the whole system-scope thing. Probably the best way would be to refactor the whole script into a class that has some "root path" argument that defaults to '/'
[13:52:12] and ship a test that somehow sets up a mock of everything for a common scenario in an alternate path?
[13:52:29] I don't think the mock data could emulate some of the effects, but it would at least let the script "run" and catch these kinds of things.
[13:54:41] it needs reads from fake versions of at least /sys/devices/system/node , /sys/devices/system/cpu , /sys/class/net/ (lots of subpaths here) , /proc/interrupts , /proc/irq/
[13:55:21] well some of those are writes, not reads, but I think all the writes are pretty blind and wouldn't fail as long as the directory existed
[13:55:34] hmmm
[13:58:14] it's complicated, but probably much better than saying "this can only be tested in some emulation container"
[15:14:38] Traffic, UploadWizard: Uploading via UploadWizard gets stuck for a 11 MB JPG - https://phabricator.wikimedia.org/T274150 (Urbanecm)
[17:28:51] Traffic, SRE: Investigate unusual media traffic pattern for AsterNovi-belgii-flower-1mb.jpg on Commons - https://phabricator.wikimedia.org/T273741 (Milimetric) +1 to @Gilles's idea. Reverse image searches don't yield anything obvious.
[18:00:40] Traffic, SRE: Investigate unusual media traffic pattern for AsterNovi-belgii-flower-1mb.jpg on Commons - https://phabricator.wikimedia.org/T273741 (mforns) The crazy request volume starts on July 2020 https://pageviews.toolforge.org/mediaviews/?project=commons.wikimedia.org&platform=&referer=all-referers...
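Going back to the "root path" idea from 13:51:45, a rough sketch of how it could work follows. Everything here is hypothetical (the `IrqReader` class, the regex, and the mock file contents) and it covers only the /proc/interrupts read, not the /sys paths or the writes mentioned at 13:54-13:55.

```python
import re
import tempfile
from pathlib import Path


class IrqReader:
    """Hypothetical helper: finds IRQ numbers under a configurable root."""

    def __init__(self, root='/'):
        # Defaults to the real filesystem; tests pass a mock tree instead.
        self.root = Path(root)

    def irqs_for(self, name_pattern):
        """Return IRQ numbers whose /proc/interrupts line ends with name_pattern."""
        line_re = re.compile(r'^\s*([0-9]+):.*\s' + name_pattern + r'\s*$')
        irqs = []
        text = (self.root / 'proc' / 'interrupts').read_text()
        for line in text.splitlines():
            match = line_re.match(line)
            if match:
                irqs.append(int(match.group(1)))
        return irqs


def make_mock_root(tmpdir):
    """Create a minimal fake /proc/interrupts under tmpdir (invented contents)."""
    proc = Path(tmpdir) / 'proc'
    proc.mkdir()
    (proc / 'interrupts').write_text(
        '           CPU0       CPU1\n'
        ' 34:        100          0   PCI-MSI 524288-edge      eth0-TxRx-0\n'
        ' 35:          0        200   PCI-MSI 524289-edge      eth0-TxRx-1\n'
        ' 36:          5          5   PCI-MSI 524290-edge      eth1-TxRx-0\n'
    )
    return Path(tmpdir)
```

A test would build the mock tree in a `tempfile.TemporaryDirectory()` and call `IrqReader(root).irqs_for(r'eth0-TxRx-(\d+)')`, which exercises the regex-building code path without touching the live host.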
[18:17:20] Traffic, SRE: Investigate unusual media traffic pattern for AsterNovi-belgii-flower-1mb.jpg on Commons - https://phabricator.wikimedia.org/T273741 (elukey) >>! In T273741#6812144, @mforns wrote: > The crazy request volume starts on July 2020 > https://pageviews.toolforge.org/mediaviews/?project=commons.w...
[19:04:33] Traffic, SRE: Investigate unusual media traffic pattern for AsterNovi-belgii-flower-1mb.jpg on Commons - https://phabricator.wikimedia.org/T273741 (mforns) If it's an app, it would need to be **very** popular. Maybe Aarogya Setu, the app for reducing Covid infections? IIUC it's mandatory in India.
[19:08:29] Traffic, SRE: Investigate unusual media traffic pattern for AsterNovi-belgii-flower-1mb.jpg on Commons - https://phabricator.wikimedia.org/T273741 (Samwalton9) > Maybe Aarogya Setu, the app for reducing Covid infections? If it is, it isn't part of the initial app setup process, which I just tested out o...
[19:21:32] Traffic, SRE: Investigate unusual media traffic pattern for AsterNovi-belgii-flower-1mb.jpg on Commons - https://phabricator.wikimedia.org/T273741 (nshahquinn-wmf) >>! In T273741#6812424, @mforns wrote: > If it's an app, it would need to be **very** popular. > Maybe Aarogya Setu, the app for reducing Cov...
[22:42:01] Traffic, SRE: Investigate unusual media traffic pattern for AsterNovi-belgii-flower-1mb.jpg on Commons - https://phabricator.wikimedia.org/T273741 (Michaelrhanson) I found several places where this URL is being used in sample code, which might have been picked up by somebody and built into an app: https...
[22:42:58] Traffic, SRE: Investigate unusual media traffic pattern for AsterNovi-belgii-flower-1mb.jpg on Commons - https://phabricator.wikimedia.org/T273741 (Daniel.gayo) Could it be this app? https://apps.apple.com/hk/app/iclass-corporate/id1439400748?l=en The picture appears in a screenshot...
[23:08:06] Traffic, SRE: Investigate unusual media traffic pattern for AsterNovi-belgii-flower-1mb.jpg on Commons - https://phabricator.wikimedia.org/T273741 (Legoktm) @spinda found that this image is used in quite a few different places: * https://github.com/triniwiz/nativescript-image-cache-it/issues/11 * https:...
[23:10:23] Traffic, SRE: Investigate unusual media traffic pattern for AsterNovi-belgii-flower-1mb.jpg on Commons - https://phabricator.wikimedia.org/T273741 (Michaelrhanson) Hm! It is included in the imagenet URL list, I think. Could we be looking at some CV training pipeline that's not caching properly? http:/...
[23:14:13] Traffic, SRE: Investigate unusual media traffic pattern for AsterNovi-belgii-flower-1mb.jpg on Commons - https://phabricator.wikimedia.org/T273741 (ssingh) >>! In T273741#6813531, @Michaelrhanson wrote: > I found several places where this URL is being used in sample code, which might have been picked up...
[23:14:39] Traffic, SRE: Investigate unusual media traffic pattern for AsterNovi-belgii-flower-1mb.jpg on Commons - https://phabricator.wikimedia.org/T273741 (ssingh) >>! In T273741#6813536, @Daniel.gayo wrote: > Could it be this app? > > https://apps.apple.com/hk/app/iclass-corporate/id1439400748?l=en > > The pi...
[23:18:46] Traffic, SRE: Investigate unusual media traffic pattern for AsterNovi-belgii-flower-1mb.jpg on Commons - https://phabricator.wikimedia.org/T273741 (fdans) >>! In T273741#6813616, @Michaelrhanson wrote: > Hm! It is included in the imagenet URL list, I think. Could we be looking at some CV training pipel...
[23:29:40] Traffic, SRE: Investigate unusual media traffic pattern for AsterNovi-belgii-flower-1mb.jpg on Commons - https://phabricator.wikimedia.org/T273741 (fdans) As was suggested on Twitter, this surge coincides almost perfectly with the ban of TikTok, as well as other 223 Chinese apps, in India
[23:46:23] Traffic, Commons, SRE: Investigate unusual media traffic pattern for AsterNovi-belgii-flower-1mb.jpg on Commons - https://phabricator.wikimedia.org/T273741 (AntiCompositeNumber)