Touhou-Project.com

Conceptualizing it

Added 2023-01-12 17:57:26 +0000 UTC

Hi guys, hope that you had a very happy new year and are excited for another year of girls with silly hats. Last time around, I sort of teased that I’d talk about the Matrix side of things but I’ve decided to punt that for the time being. I’ll get into it eventually but it partially depends on external factors. I’ve been busy doing all sorts of stuff, however, and think I have enough that I can share in a coherent post.

I’m going to talk about something of a recurring topic: spam. There are multiple ways that THP fights it and, as on all other websites open to public posting, it will be an eternal issue. I rewrote one of the ways that spam is handled in the past few weeks to give the system greater flexibility and accuracy while keeping an eye on performance, code complexity, and the risk of it being applied overzealously. While I could talk about what I did exactly (and I will briefly at the end), I want to first walk those of you who may have never thought about this sort of issue through the kinds of things that come up when designing these systems.

Let’s assume that a user makes a post on the site. That information, you may recall from previous explanations, is processed in multiple ways to figure out the poster’s name, where text is supposed to be bold, if there’s an image, an update, etc. What then if, instead of someone posting the latest on velvety crow tengu down, the poster is a bot that means to spam a link to a site that promises to enlarge this or that part of your anatomy? Even the non-technical among you may figure out the obvious answer: have a check where the user inputs are analyzed; perhaps have some sort of list of forbidden words or patterns.

Following this proposed solution (one of many, but let’s limit ourselves to this for the time being), a check is made when posts are parsed and a list of terms or words is read and then compared to the content of the post. If it matches, that naughty poster is banned or at least prevented from posting. Rational enough, right?
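As a rough illustration of that naive check (not the actual THP code), here is a minimal Python sketch. The term list and the function name `is_spam` are made up for the example; the post doesn't say where the real terms are stored or how they're compared.

```python
# Hypothetical blocklist; the real terms and where they live are not
# described in this post.
BLOCKED_TERMS = ["enlarge your", "cheap pills"]


def is_spam(message: str, blocked_terms=BLOCKED_TERMS) -> bool:
    """Return True if any blocked term appears in the message.

    casefold() makes the comparison case-insensitive.
    """
    text = message.casefold()
    return any(term.casefold() in text for term in blocked_terms)
```

Calling `is_spam("Click here to ENLARGE YOUR wingspan")` returns `True`, while an ordinary post about tengu returns `False`. A match would then lead to a ban or a rejected post, depending on policy.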

What happens then if this poster gets wise to this simple filter and decides to up their game, say by introducing spaces between letters? A human can still read the message and figure it out so, as far as they’re concerned, the spam works. Well then, the counter to that is to get rid of spaces in the message and evaluate whether the filter matches on that. That would catch it. But, ah, perhaps you’ve noticed an obvious flaw with that: if the filter was set to, for example, “penis land” and a normal user were to post “pen island” they would get caught by the filter as well! The filter is therefore overzealous and might disrupt normal posting.
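To make the problem concrete, here is a small Python sketch of the space-stripping idea. The function name `matches_ignoring_spaces` is hypothetical and only meant to show why this check is both effective and overzealous.

```python
import re


def matches_ignoring_spaces(message: str, term: str) -> bool:
    """Compare message and term after removing all whitespace from both."""
    squashed_message = re.sub(r"\s+", "", message.casefold())
    squashed_term = re.sub(r"\s+", "", term.casefold())
    return squashed_term in squashed_message
```

With the term “penis land”, this catches the spaced-out `"p e n i s  l a n d"`, but it also flags `"I visited pen island last week"`, because both collapse to the same run of letters.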

This can be circumvented to an extent by making a second, stricter check run only for certain kinds of users, say, users that post very short messages or users that haven’t posted at all before. Most people would remain unaffected by that kind of measure. But it’s not bulletproof. What if, additionally, our would-be spammer uses characters that look like normal characters but aren’t? Might be characters with diacritics or something else in the vast range of Unicode characters. Hell, maybe it’s even more out there than that and there’s suddenly spam in other languages with different sorts of glyphs. How do you catch that?
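Before getting to the look-alike characters, the gating idea from the start of that paragraph can be sketched roughly like this. Everything here is an assumption made for illustration: the `Poster` class, the 40-character threshold, and the single blocked term are invented, and the real conditions on the site may be quite different.

```python
from dataclasses import dataclass

BLOCKED_TERMS = ["penis land"]  # hypothetical term list
SHORT_MESSAGE_LIMIT = 40        # hypothetical cutoff for a "very short" message


@dataclass
class Poster:
    previous_posts: int  # assumed to be known to the site somehow


def squash(text: str) -> str:
    """Lowercase the text and drop all whitespace."""
    return "".join(text.casefold().split())


def needs_strict_check(poster: Poster, message: str) -> bool:
    """Only first-time posters or very short messages get the strict check."""
    return poster.previous_posts == 0 or len(message.strip()) < SHORT_MESSAGE_LIMIT


def is_spam(poster: Poster, message: str) -> bool:
    # Basic check: runs for everyone, matches terms exactly as written.
    text = message.casefold()
    if any(term in text for term in BLOCKED_TERMS):
        return True
    # Strict check: ignores spacing, but only runs for "suspicious" posts.
    if needs_strict_check(poster, message):
        squashed = squash(message)
        return any(squash(term) in squashed for term in BLOCKED_TERMS)
    return False
```

A regular with 120 previous posts writing a long message about “pen island” gets through, while a brand-new poster sending `"p e n i s l a n d"` is caught. A brand-new poster mentioning “pen island” would still be flagged, which is the “not bulletproof” part.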

The basic filtering system in place would likely prove to be inadequate as those characters tend to be multibyte characters which, to oversimplify, are comparatively more complex data-wise, so dealing with them takes more care. This means that your filtering function needs to account for them, as matching isn’t quite so straightforward and even things like spaces are technically different for full-width characters. There also has to be a check for special characters, like the ones that control the flow of text from right to left and similar, because they’d break the page. Now the filter has to reference another function that does that.
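One possible way to handle some of this, sketched in Python with the standard `unicodedata` module. This is an assumption about approach, not a description of what the site does: it uses Unicode NFKD normalization to fold full-width characters and diacritics down to plain letters, and a hand-picked (not exhaustive) set of bidirectional control characters. The function names are hypothetical.

```python
import unicodedata

# Common bidirectional control characters (not an exhaustive list).
BIDI_CONTROLS = {
    "\u200e", "\u200f",                                # left-to-right / right-to-left marks
    "\u202a", "\u202b", "\u202c", "\u202d", "\u202e",  # embeddings, overrides, pop
    "\u2066", "\u2067", "\u2068", "\u2069",            # isolates
}


def has_bidi_controls(text: str) -> bool:
    """Return True if the text contains any of the listed control characters."""
    return any(ch in BIDI_CONTROLS for ch in text)


def normalize_for_matching(text: str) -> str:
    """Fold look-alike characters toward plain letters before filtering."""
    # NFKD maps full-width letters and spaces to their ordinary forms and
    # splits accented letters into a base letter plus combining marks.
    decomposed = unicodedata.normalize("NFKD", text)
    kept = [
        ch for ch in decomposed
        if not unicodedata.combining(ch) and ch not in BIDI_CONTROLS
    ]
    return "".join(kept).casefold()
```

Here `normalize_for_matching("ｐéｎ")` gives `"pen"`, and a full-width space (U+3000) becomes an ordinary one, so the earlier kinds of checks can run on the result. It does nothing for look-alikes from other scripts (a Cyrillic “а” stays Cyrillic), so those would need a separate mapping.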

And, well, we’ve been focusing on just a message but what about other things that can be abused by spam? Like a file name. Well, run it through a similar filter. What about a file itself? Have database entries that check against a known signature of a bad image, like an MD5 hash. Before too long the simple filter has many moving parts. Perhaps too many. Shouldn’t we be concerned about performance? If it does all this checking for every post then it’s extra work for the server and may make the experience for normal users more painful.
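The file check could look something like the following Python sketch. The in-memory set stands in for the database entries mentioned above, and the sample bytes and the name `is_known_bad_file` are invented for the example.

```python
import hashlib

# Stand-in for database entries of known bad files. In practice these would
# be stored hex digests; the sample bytes here are made up.
KNOWN_BAD_MD5 = {hashlib.md5(b"example spam image bytes").hexdigest()}


def is_known_bad_file(data: bytes, bad_hashes=KNOWN_BAD_MD5) -> bool:
    """Return True if the file's MD5 digest matches a known bad signature."""
    return hashlib.md5(data).hexdigest() in bad_hashes
```

A set lookup like this is cheap, though hashing the upload still costs something for every file. An exact hash also only catches byte-for-byte copies, so a spammer who changes a single pixel would slip past it.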

Though I haven’t been exhaustive about the concerns, I think that this thought experiment should illustrate the challenges in getting something this basic but important just right. Add to this that a lot of the code is old and that features have been added on (or mushroomed even) over the years and you’ve got a right recipe for an unwieldy mess. At a certain point adding complexity or introducing more features is liable to make stuff break. If there had ever been an original design vision (which in our particular case is doubtful due to the chaotic nature of Kusaba X development), it has long been subsumed by years of alterations.

So then, that answers the implicit question that some of you may have been asking at the beginning of this post: why did you spend precious dev time rewriting functionality that was already working well enough? I wanted to have a clearer design and structure for things. Make it less of a pain to change or add to it as needed. I also wanted to reduce overhead wherever I could while I was at it, as well as ensure that things like false positives (which had happened rarely) became a thing of the past.

To this end, there’s a lot more code in common now among the various “filters” and checks of that sort. Not only that but there are more thorough, less error-prone checks where appropriate. The latter means (slightly) slower execution, sure, but it happens with less frequency as the conditions for these checks are more particular. I need to underscore that this is all still going to be a continuing work in progress. Yeah, there’s a lot I can keep tinkering with to make things more robust. But it should be understood that the war against spam is eternal and, should spammers do more unusual things in the future, I’ll have to respond appropriately. It should be easier to do that now and I think that the couple of days I spent on analysis, design, execution, testing, and iterating were worth it.

I planned to add more to this post detailing some other work I’ve been doing, including stuff that’s already live on the site. But I fear that the post has gotten too long. So I’m going to split things up and call it a day here. This time, however, I don’t expect there to be much of a gap between posts and you should expect another one maybe next week. I’ll tease that, while it’s nothing really flashy in and of itself, there’s good cause to celebrate as it directly helps with some of the exciting work I’ve got planned further down the line.

Until next time, take it easy!