Keyword Filters Let Through What Regex Catches

A practical regex cheatsheet for YouTube comment moderation: spam patterns, Unicode bypass detection, and the rules that actually hold up against modern bots

By CommentShark Team · March 20, 2026 · 12 min read

There is a moment most creators hit where their keyword-based comment filter stops working. They add "crypto" to the blocked list, and the spam disappears for a week, then reappears as "crуpto" with a Cyrillic "у" instead of a Latin "y". They add "telegram", and the next round shows up as "tele.gram" or "t.e.l.e.g.r.a.m". Every time they close one gap, the spammers find the next one. Keyword filters assume spammers write the words in straightforward English. They do not.

This is where regex earns its keep. A regex pattern can match "crypto" whether it is spelled with Latin or Cyrillic letters, whether there are random spaces or punctuation between the letters, whether the word is capitalized, lowercase, or alternating. A single regex rule replaces dozens of keyword rules and stays effective for longer. The catch is that regex looks intimidating if you have never used it, and most creators never push through the learning curve.

This cheatsheet is the shortest path from keyword-based filtering to regex-based filtering that actually works on YouTube comments. It covers the basic regex syntax you need (and only what you need), a library of field-tested patterns for common spam categories, and the advanced techniques that handle Unicode bypass attempts. You will not become a regex expert, but you will have a working set of patterns you can deploy tomorrow.

Quick answer: if keyword filters are letting spam through, the four highest-leverage regex patterns are URL detection with protocol and bare-domain variants, phone number detection including obfuscated forms, Unicode homoglyph matches for common spam words, and patterns for emoji flooding, repeated characters, and zero-width characters. Together they catch 70-80% of spam categories that evade simple keyword matching.

Why Regex Beats Keyword Filtering on YouTube Comments

Keyword filtering works by checking if a comment contains a specific string. It is fast to set up, easy to understand, and completely helpless against any spammer who modifies their text. The three failure modes of keyword filtering are well documented.

Obfuscation. Spammers insert punctuation, spaces, or special characters into words that should trigger your filter, or swap a letter for a look-alike digit. "telegram" becomes "te1egram", "tele gram", or "t.e.l.e.g.r.a.m". Your keyword filter sees different text, does not match, and the comment goes through.

Substitution. Spammers replace letters with lookalikes from other Unicode blocks. "scam" becomes "sсam" where the "c" is Cyrillic. The comment looks identical to a human reader but is invisible to keyword matching.

Variation. Spammers rephrase their core message constantly. A keyword filter tuned for "check my profile" does not catch "peep the bio" or "look at my page". You end up with an ever-growing list of keywords that never keeps up.

Regex addresses the first two directly. It can match patterns with optional characters between letters, handle multiple character classes at once, and express constraints keyword filters cannot. The third (variation) still benefits from AI classification rather than pattern matching, which is why the best comment moderation setups use both. For the broader picture of how these layers fit together, see how to automatically moderate YouTube comments.
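
To make the first two failure modes concrete, here is a minimal Python sketch of a keyword filter. The BLOCKED list and the keyword_filter helper are hypothetical stand-ins, not how YouTube or any particular tool implements matching.

```python
# Hypothetical keyword filter: a plain, case-insensitive substring check.
BLOCKED = ["telegram", "scam"]

def keyword_filter(comment: str) -> bool:
    """Return True if the comment contains any blocked word verbatim."""
    text = comment.lower()
    return any(word in text for word in BLOCKED)

plain = "total scam, join my telegram"
obfuscated = "join my t.e.l.e.g.r.a.m"   # periods between letters
substituted = "total s\u0441am"           # U+0441 is Cyrillic "с", not Latin "c"

for comment in (plain, obfuscated, substituted):
    print(keyword_filter(comment), repr(comment))
```

Only the plain comment is flagged. The other two read the same to a person but contain different characters, which is all a substring check sees.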


The Regex Basics You Need for YouTube Comment Filtering

Regex is a large topic but most comment filtering only uses a small subset of it. Here is the minimum you need. If you want the full reference, MDN's regex guide is the canonical source, and regex101.com is the indispensable tool for testing patterns before deploying them.

Character Classes

  • [abc] matches any one of a, b, or c
  • [a-z] matches any lowercase letter
  • [A-Z] matches any uppercase letter
  • [0-9] or \d matches any digit
  • \s matches any whitespace (space, tab, newline)
  • \w matches any word character (letters, digits, underscore)
  • . matches any single character except newline

Quantifiers

  • * zero or more of the preceding element
  • + one or more of the preceding element
  • ? zero or one of the preceding element (optional)
  • {n} exactly n of the preceding element
  • {n,} n or more
  • {n,m} between n and m

Anchors and Groups

  • ^ start of the string
  • $ end of the string
  • (abc) a capture group for the sequence abc
  • (a|b) matches either a or b
  • (?:abc) non-capturing group (usually what you want for filters)
  • \b word boundary (useful for matching whole words)

Flags

  • i case-insensitive matching
  • u Unicode mode (important for non-Latin characters)
  • g global (find all matches, not just the first)

That is the vocabulary. Every pattern below uses these pieces. If any of them feels unfamiliar, open a test comment on regex101 and try the pattern against a few strings. Five minutes of experimentation is worth more than an hour of reading.
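
If you would rather experiment locally, the same vocabulary can be tried in a few lines of Python. This is a sketch using Python's re module, whose syntax for these basics matches what is listed above; the sample comments are made up, and re.IGNORECASE plays the role of the i flag.

```python
import re

# Each entry: (pattern, sample comment, what the pattern exercises).
EXAMPLES = [
    (r"\d{3,}", "call 5551234 now", "character class + quantifier"),
    (r"\b(?:win|claim)\s+\w+", "Claim prize today", "group, alternation, word boundary"),
    (r"^first!?$", "FIRST!", "anchors + optional character"),
]

for pattern, sample, note in EXAMPLES:
    matched = re.search(pattern, sample, re.IGNORECASE) is not None
    print(f"{note}: {pattern!r} on {sample!r} -> {matched}")

# Without the case-insensitive flag, case matters.
case_sensitive_hit = re.search(r"crypto", "CRYPTO") is not None
print("case-sensitive 'crypto' on 'CRYPTO':", case_sensitive_hit)
```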

A Regex Pattern Library for YouTube Comment Moderation

What follows is a tested library of patterns for the most common comment moderation needs. You can paste these directly into a regex-enabled rule in the Comment Assistant. All examples use case-insensitive mode unless noted.

Pattern: Any URL or Bare Domain

(?:https?:\/\/|www\.)\S+|\b[a-z0-9-]+\.(?:com|net|org|io|co|me|tv|xyz|live|link|site)\b

Catches both explicit URLs (with http/https/www) and bare domains like "example.com". Extend the TLD list to match your needs. Common spam TLDs to include: xyz, live, link, site, shop, top (the last two are not in the pattern above, so add them if they show up in your comments). Leave out any TLD you legitimately want to allow (for example, .edu if your channel is education-adjacent).
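
A quick way to see what this rule does and does not catch is to run it over a few sample comments. The sketch below uses Python's re module as a stand-in for the rule editor's engine (an assumption; flavors can differ), and find_links is a hypothetical helper name.

```python
import re

# The pattern from above, unchanged; re.IGNORECASE stands in for the i flag.
URL_RULE = re.compile(
    r"(?:https?:\/\/|www\.)\S+"
    r"|\b[a-z0-9-]+\.(?:com|net|org|io|co|me|tv|xyz|live|link|site)\b",
    re.IGNORECASE,
)

def find_links(comment: str) -> list:
    """Return every URL or bare domain the rule matches in the comment."""
    return [m.group(0) for m in URL_RULE.finditer(comment)]

for comment in (
    "check https://spam.example/win",
    "visit www.spam.xyz now",
    "go to cheap-followers.site today",
    "huge sale at deals.shop",          # .shop is not in the TLD list
    "great video, thanks",
):
    print(find_links(comment), "<-", comment)
```

The "deals.shop" comment slips through until shop is added to the TLD alternation, which is the kind of gap a sample run surfaces quickly.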

Pattern: Phone Numbers in Various Formats

(?:\+?\d[\s().-]?){7,15}

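Matches phone numbers formatted with spaces, dashes, parentheses, or dots between digits. Catches "555-123-4567" and "+1.555.123.4567" in full, and flags "+1 (555) 123-4567" by matching its "123-4567" tail, because the pattern allows only one separator character after each digit. To match multi-separator forms in full, change [\s().-]? to [\s().-]*. The 7-15 digit range covers national and international formats and skips short numeric strings, but any run of seven or more digits (an order number, a large view count) will also match, so test it against your own comments.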
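
It is worth checking exactly which part of a number the pattern grabs. The sketch below uses Python's re module as a stand-in engine; PHONE_RULE_LOOSE is a variant introduced here for comparison (it is not part of the library above), and matched_text is a hypothetical helper.

```python
import re

PHONE_RULE = re.compile(r"(?:\+?\d[\s().-]?){7,15}")
# Looser variant (an assumption, for comparison): allow several
# separator characters in a row after each digit.
PHONE_RULE_LOOSE = re.compile(r"(?:\+?\d[\s().-]*){7,15}")

def matched_text(rule, comment):
    """Return the first matched substring, or None if the rule does not fire."""
    m = rule.search(comment)
    return m.group(0) if m else None

for comment in ("555-123-4567", "+1.555.123.4567", "+1 (555) 123-4567",
                "top 10 videos of 2024"):
    print(repr(comment), "->", repr(matched_text(PHONE_RULE, comment)),
          "| loose:", repr(matched_text(PHONE_RULE_LOOSE, comment)))
```

With the single-separator pattern, "+1 (555) 123-4567" is still flagged, but only through its last seven digits, because " (" and ") " are two separators in a row. The looser variant matches the whole number at the cost of also swallowing trailing separators.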

Pattern: Telegram and WhatsApp Handles

(?:t\.?e\.?l\.?e\.?g\.?r\.?a\.?m|w\.?h\.?a\.?t\.?s\.?a\.?p\.?p)

The \.? pattern makes every period optional, which catches both the plain word and any obfuscated variant like "t.e.l.e.g.r.a.m" or "w.h.a.t.s.a.p.p". Add additional separators if needed: [\s.-]? covers spaces, periods, and dashes.
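
Typing the separator between every letter by hand gets error-prone, so one option is to generate the pattern. This is a minimal Python sketch; spaced and mentions_handle are hypothetical helper names, and the default separator is the [\s.-]? class suggested above.

```python
import re

def spaced(word: str, sep: str = r"[\s.-]?") -> str:
    """Join the letters of word with an optional-separator pattern."""
    return sep.join(re.escape(ch) for ch in word)

HANDLE_RULE = re.compile(
    "(?:" + "|".join(spaced(w) for w in ("telegram", "whatsapp")) + ")",
    re.IGNORECASE,
)

def mentions_handle(comment: str) -> bool:
    return HANDLE_RULE.search(comment) is not None

print(spaced("abc", r"\.?"))  # the period-only form used in the pattern above
for comment in ("DM me on T.E.L.E.G.R.A.M", "tele gram link in bio",
                "whats-app me", "I sent a telegraph"):
    print(mentions_handle(comment), "<-", comment)
```

Note that the class allows at most one separator between letters, so "t..e..l" style spacing would need a wider quantifier.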

Pattern: Cryptocurrency Wallet Addresses

\b(?:0x[a-f0-9]{40}|bc1[a-z0-9]{39,59}|[13][a-km-zA-HJ-NP-Z1-9]{25,34})\b

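Matches Ethereum (0x...), Bitcoin Bech32 (bc1...), and Bitcoin legacy wallet addresses. These are almost always spam on non-crypto channels. Even on crypto channels, consider flagging for review rather than auto-publishing. One caveat: in case-insensitive mode the legacy branch loosens to "1 or 3 followed by 25-34 letters or non-zero digits", so expect an occasional false positive on long alphanumeric IDs.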
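
The three branches are easier to read when split across lines. The sketch below does that in Python (re module as a stand-in engine) and tests against synthetic strings that merely have the right shape; none of them is a real wallet, and has_wallet is a hypothetical helper.

```python
import re

WALLET_RULE = re.compile(
    r"\b(?:0x[a-f0-9]{40}"                     # Ethereum: 0x + 40 hex characters
    r"|bc1[a-z0-9]{39,59}"                     # Bitcoin Bech32
    r"|[13][a-km-zA-HJ-NP-Z1-9]{25,34})\b",    # Bitcoin legacy (Base58 alphabet)
    re.IGNORECASE,
)

# Synthetic, shape-only examples (assumptions for illustration).
eth_like = "0x" + "ab12" * 10
bech32_like = "bc1q" + "a" * 38
legacy_like = "1" + "A1b2C3d4E5" * 3

def has_wallet(comment: str) -> bool:
    return WALLET_RULE.search(comment) is not None

for comment in (f"send to {eth_like} now", f"tip jar: {bech32_like}",
                f"donate {legacy_like}", "my id is 0x1234", "great video"):
    print(has_wallet(comment), "<-", comment[:40])
```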

Pattern: Emoji Flooding

(\p{Emoji_Presentation}\s*){6,}

Matches six or more consecutive emoji (with optional whitespace between them). Useful for catching bot accounts that spam pure emoji reactions. Requires Unicode mode enabled. Adjust the {6,} threshold based on what feels like emoji flooding on your channel.
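
Not every engine supports \p{Emoji_Presentation}. Python's built-in re module, for one, does not. The sketch below is a rough approximation that swaps the property for two broad code point ranges; the ranges are an assumption chosen for illustration and will both miss some emoji and include some non-emoji symbols.

```python
import re

# Rough approximation: two broad code point ranges instead of the
# \p{Emoji_Presentation} property, which Python's re module lacks.
EMOJI = r"[\U0001F300-\U0001FAFF\u2600-\u27BF]"
FLOOD_RULE = re.compile(r"(?:" + EMOJI + r"\s*){6,}")

def is_emoji_flood(comment: str) -> bool:
    return FLOOD_RULE.search(comment) is not None

fire = "\U0001F525"  # the fire emoji
print(is_emoji_flood(fire * 6))                # six in a row
print(is_emoji_flood(" ".join([fire] * 6)))    # six separated by spaces
print(is_emoji_flood("nice " + fire * 2))      # normal emphasis
```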

Pattern: Repeated Character Spam

(\w)\1{5,}

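Matches any word character (letter, digit, or underscore) repeated 6 or more times, such as "aaaaaa" or "loooooool". Because \w does not include punctuation, this pattern will not catch runs like "!!!!!!!!!"; use (\S)\1{5,} if you want those too. Both kinds of run are almost always bot output or low-value noise. Increase the {5,} if you want to allow some emphasis.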
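
Which runs get caught depends on the character class inside the capture group. This Python sketch compares the \w version with a \S variant (any non-whitespace character); the variant is offered here as one option, not as the only way to widen the rule.

```python
import re

WORD_RUN = re.compile(r"(\w)\1{5,}")   # letters, digits, underscore only
ANY_RUN = re.compile(r"(\S)\1{5,}")    # variant: any non-whitespace character

samples = ["sooooooo good", "WOW!!!!!!!!!", "hello!!", "aaaaaa"]
for text in samples:
    print(repr(text), bool(WORD_RUN.search(text)), bool(ANY_RUN.search(text)))
```

The punctuation run in "WOW!!!!!!!!!" only trips the \S variant, while short emphasis like "hello!!" passes both.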

Pattern: Homoglyph-Aware "Crypto"

[cс][rг][yу][pр][tт][oо]

Matches "crypto" even if written with Cyrillic homoglyphs: "с" for c, "г" for r (visually close in some fonts), "у" for y, "р" for p, "т" for t, "о" for o. This technique generalizes: for any keyword you care about, build a character class for each letter that includes both the Latin letter and its common Cyrillic or Greek lookalikes. For the full list of confusables, the Unicode confusables list is the reference source.
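
Building these character classes by hand for every keyword is tedious, so here is a sketch of generating them. The LOOKALIKES table is a deliberately small, hypothetical mapping (Latin to Cyrillic only), and homoglyph_rule is a made-up helper name; a production table would be derived from the Unicode confusables data.

```python
import re

# Deliberately small, hypothetical lookalike table (Latin -> Cyrillic).
LOOKALIKES = {
    "a": "\u0430", "c": "\u0441", "e": "\u0435", "o": "\u043e",
    "p": "\u0440", "r": "\u0433", "t": "\u0442", "y": "\u0443",
}

def homoglyph_rule(word: str):
    """Build one character class per letter: the letter plus its lookalike."""
    classes = ["[" + re.escape(ch) + LOOKALIKES.get(ch, "") + "]"
               for ch in word.lower()]
    return re.compile("".join(classes), re.IGNORECASE)

CRYPTO = homoglyph_rule("crypto")
print(CRYPTO.pattern)
for comment in ("buy cr\u0443pto now", "CRYPTO gains", "crisp toast"):
    print(bool(CRYPTO.search(comment)), "<-", comment)
```

Letters without an entry in the table simply get a single-character class, so the helper degrades to a plain keyword match for them.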

Pattern: Zero-Width Character Obfuscation

[\u200B-\u200D\uFEFF]

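Matches zero-width space, zero-width non-joiner, zero-width joiner, and byte-order mark. These invisible characters are used to break keyword filters by inserting hidden gaps inside spam words. They are rare in plain English text, but not absent from legitimate comments: the zero-width joiner (U+200D) is part of many multi-part emoji, and the joiner and non-joiner are used in Persian and several Indic scripts. Route matches to review rather than auto-hiding, or restrict the rule to zero-width characters that sit between two letters.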
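
Besides flagging, the same character class can be used to strip the invisible characters before other rules run. The sketch below shows both, plus a narrower variant; ZERO_WIDTH_IN_WORD and strip_zero_width are names introduced here for illustration, and Python's re module stands in for the rule editor's engine.

```python
import re

ZERO_WIDTH = re.compile(r"[\u200B-\u200D\uFEFF]")
# Narrower variant (an assumption): only zero-width characters that sit
# between two word characters, which is where they split a spam word.
ZERO_WIDTH_IN_WORD = re.compile(r"\w[\u200B-\u200D\uFEFF]+\w")

def strip_zero_width(comment: str) -> str:
    """Remove zero-width characters so other rules see the joined word."""
    return ZERO_WIDTH.sub("", comment)

hidden = "join my tele\u200bgram"
family_emoji = "\U0001F468\u200D\U0001F469\u200D\U0001F467"  # emoji joined by U+200D

print(strip_zero_width(hidden))
print(bool(ZERO_WIDTH.search(family_emoji)), bool(ZERO_WIDTH_IN_WORD.search(family_emoji)))
```

The broad rule fires on the family emoji because it is built from zero-width joiners, while the narrower variant ignores it and still catches the split word.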


Advanced Regex Techniques for Stubborn Spam Patterns

Lookarounds for Context-Sensitive Matching

Sometimes you want to match a word only when it is followed or preceded by something specific. Lookarounds let you do this without consuming characters. (?=abc) is a positive lookahead: match if "abc" follows. (?!abc) is a negative lookahead: match if "abc" does not follow. (?<=abc) and (?<!abc) are the corresponding lookbehinds.

Example: to match "free" only when it is followed by a suspicious-looking promise (like "free money", "free giveaway", "free iphone"), you can write \bfree\s+(?=money|giveaway|iphone|gift|cash|bitcoin). This catches the scam pattern without flagging benign uses of "free".
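
Here is that lookahead run against a few comments, as a Python sketch (the re module supports the same lookaround syntax; the sample comments are invented). It also shows that the lookahead checks what follows without including it in the match.

```python
import re

FREE_BAIT = re.compile(
    r"\bfree\s+(?=money|giveaway|iphone|gift|cash|bitcoin)", re.IGNORECASE
)

hit = FREE_BAIT.search("get FREE bitcoin here")
print(repr(hit.group(0)))  # only "FREE " is consumed, not "bitcoin"

for comment in ("feel free to disagree", "carefree money habits", "free gifts for all"):
    print(bool(FREE_BAIT.search(comment)), "<-", comment)
```

"carefree money habits" is skipped because \b requires "free" to start a word, and "free gifts" matches because the lookahead only needs the text to begin with "gift".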

Combining Unicode Blocks

You can match whole Unicode scripts in regex. In Unicode mode, \p{Script=Cyrillic} matches any character in the Cyrillic script. If your channel audience is primarily English-speaking and you never expect legitimate Cyrillic comments, a rule that flags comments containing Cyrillic characters can catch a significant share of homoglyph-based obfuscation. This is aggressive and will have false positives if you have any Russian, Ukrainian, or Bulgarian viewers, so use judgment.
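
The trade-off can be sketched in Python, with two caveats: the built-in re module has no script property, so the basic Cyrillic block (U+0400 to U+04FF) stands in for \p{Script=Cyrillic}, and MIXED_WORD is a possible refinement introduced here, not a rule from the library above.

```python
import re

# Stand-in for \p{Script=Cyrillic}: the basic Cyrillic block only.
CYRILLIC = re.compile(r"[\u0400-\u04FF]")
# Possible refinement (an assumption): flag only single words that mix
# Latin and Cyrillic letters, the usual homoglyph signature.
MIXED_WORD = re.compile(r"\b(?=\w*[a-z])(?=\w*[\u0400-\u04FF])\w+", re.IGNORECASE)

spam = "buy cr\u0443pto now"                      # Cyrillic letter inside a Latin word
russian = "\u041f\u0440\u0438\u0432\u0435\u0442"  # "Привет", entirely Cyrillic

for comment in (spam, russian, "plain english"):
    print(bool(CYRILLIC.search(comment)), bool(MIXED_WORD.search(comment)), "<-", comment)
```

The blanket rule flags the genuine Russian greeting along with the spam; the mixed-word variant flags only the word that blends both alphabets.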

Non-Greedy Matching

By default, quantifiers are greedy: .+ matches as much as possible. Non-greedy quantifiers match as little as possible: .+?. This matters when you are trying to extract a specific substring between delimiters and greedy matching would capture too much. For filter rules, you usually want greedy. For extraction rules (if you need to pull a handle or ID out of a comment), non-greedy is often safer.
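
A small extraction example shows the difference. This Python sketch pulls quoted handles out of an invented comment; the quoting convention is just an assumption for the demo.

```python
import re

comment = 'contact "alpha_trader" or "beta_signals" today'

greedy = re.findall(r'"(.+)"', comment)    # runs from the first quote to the last
lazy = re.findall(r'"(.+?)"', comment)     # stops at the nearest closing quote

print(greedy)
print(lazy)
```

The greedy version returns one long string spanning both handles, while the non-greedy version returns each handle separately.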

Testing and Debugging Your Regex Rules

Never deploy a regex rule without testing it against real comments first. The two most common failure modes are rules that match too aggressively (false positives, hiding legitimate comments) and rules that do not match what you expected (bugs in the pattern). Both are catchable by testing.

The workflow: paste your pattern into regex101.com, then paste in 20-30 example comments that should match and 20-30 that should not. Adjust the pattern until the match behavior is exactly what you want on the full sample. Then deploy in approval mode for at least 48 hours so you can spot-check real behavior against the intended behavior.

Keep a running list of false-positive examples you catch during approval mode and tune the pattern to exclude them. Regex rules are never "done". They evolve as spam evolves. For the systematic side of building out this kind of rule library, our auto-reply rules ideas post has the broader framework.
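
If you prefer to keep your sample comments in a file rather than in a browser tab, the same should-match / should-not-match check can be scripted. This is a minimal sketch; evaluate is a hypothetical helper, the samples are invented, and Python's re module may differ in small ways from the engine your rules run on.

```python
import re

def evaluate(pattern, should_match, should_not_match, flags=re.IGNORECASE):
    """Return (missed, false_positives) for a pattern against labeled samples."""
    rule = re.compile(pattern, flags)
    missed = [c for c in should_match if not rule.search(c)]
    false_positives = [c for c in should_not_match if rule.search(c)]
    return missed, false_positives

missed, false_positives = evaluate(
    r"t\.?e\.?l\.?e\.?g\.?r\.?a\.?m",
    should_match=["join my telegram", "T.E.L.E.G.R.A.M me", "tele gram link in bio"],
    should_not_match=["great video", "I sent a telegraph once"],
)
print("missed:", missed)
print("false positives:", false_positives)
```

Here the period-only pattern misses the space-separated variant, which is the cue to widen the separator class and rerun the samples.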

Common Mistakes with Regex Comment Filtering

Forgetting to Escape Special Characters

Characters like ., +, *, ?, (, ), [, ], and \ have special meaning in regex. If you are matching a literal period (for example, in a domain), you must escape it: \.. A pattern like example.com will also match "example_com", "exampleXcom", and any other single character between "example" and "com", which is almost never what you want.
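
When the literal text comes from a list (domains, handles), escaping can be automated. In Python that is re.escape; other languages have their own equivalent or need a small helper, so treat this as a sketch of the idea rather than something the rule editor does for you.

```python
import re

unescaped = re.compile(r"example.com")
escaped = re.compile(re.escape("example.com"))  # becomes example\.com

print(bool(unescaped.search("exampleXcom")))   # the bare dot matches "X"
print(bool(escaped.search("exampleXcom")))     # the escaped dot does not
print(bool(escaped.search("visit example.com")))
```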

Not Using Case-Insensitive Mode

Without the i flag, a pattern matching "crypto" will not match "Crypto" or "CRYPTO". Always enable case-insensitive mode for English-language content filters. The only exception is when you are deliberately trying to match all-caps spam (a legitimate signal) and want the case to matter.

Overly Aggressive Anchors

Anchoring a pattern with ^ and $ means it must match the whole comment. If your spam pattern can appear anywhere in a longer comment (it usually can), anchoring will cause you to miss matches. Use word boundaries (\b) instead when you want to match whole words without requiring them to be the entire comment.
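
The difference is easy to see side by side. A short Python sketch with invented comments:

```python
import re

comment = "hey everyone, join my telegram today"

anchored = re.search(r"^telegram$", comment, re.IGNORECASE)      # whole comment must be "telegram"
bounded = re.search(r"\btelegram\b", comment, re.IGNORECASE)     # whole word, anywhere
inside_word = re.search(r"\btelegram\b", "telegrams are old tech", re.IGNORECASE)

print(anchored is not None, bounded is not None, inside_word is not None)
```

The anchored pattern misses the spam entirely, the word-boundary pattern finds it, and neither fires on "telegrams".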

Writing One Giant Regex Instead of Several Simple Ones

A single 200-character regex is unreadable, hard to debug, and fragile. Split your filtering into multiple separate rules, each with a focused purpose. The Comment Assistant supports stacking rules, so three simple regex rules run as efficiently as one complex one. You gain clarity and maintainability without any performance cost.

Building a Complete Regex Filter Stack

A solid regex filter stack for a mid-sized channel usually looks like this: a URL-and-bare-domain rule, a phone-number rule, a messaging-handle rule (telegram/whatsapp), a cryptocurrency-address rule, a homoglyph-aware rule for each of your top 3-5 scam keywords, an emoji-flooding rule, and a zero-width-character rule. That is 9-11 rules total. Deploy them all in auto-hide mode or route them to a review queue while you validate.
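
As a rough picture of how such a stack behaves, here is a Python sketch with four of the rules kept as separate named entries. The RULES structure and triggered_rules helper are hypothetical; they illustrate the "several small rules" idea and say nothing about how any particular tool stores or evaluates rules.

```python
import re

# Hypothetical rule stack: (name, compiled pattern), one focused rule each.
RULES = [
    ("url", re.compile(
        r"(?:https?:\/\/|www\.)\S+"
        r"|\b[a-z0-9-]+\.(?:com|net|org|io|co|me|tv|xyz|live|link|site)\b", re.I)),
    ("phone", re.compile(r"(?:\+?\d[\s().-]?){7,15}")),
    ("handle", re.compile(
        r"t\.?e\.?l\.?e\.?g\.?r\.?a\.?m|w\.?h\.?a\.?t\.?s\.?a\.?p\.?p", re.I)),
    ("zero_width", re.compile(r"[\u200B-\u200D\uFEFF]")),
]

def triggered_rules(comment: str) -> list:
    """Return the names of every rule that matches, in stack order."""
    return [name for name, rule in RULES if rule.search(comment)]

for comment in ("t.e.l.e.g.r.a.m me at 555-123-4567", "visit www.spam.xyz",
                "great video, thanks for sharing"):
    print(triggered_rules(comment), "<-", comment)
```

Returning rule names rather than a single yes/no makes it easier to see which rule caused a false positive during the review period.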

Pair these regex rules with your AI classification rules and your simple keyword rules. Regex handles the pattern-based detection, AI handles the semantic detection, keywords handle the obvious cases. Together they cover far more ground than any single approach. For a broader view of how this fits into your overall comment strategy, see the comment triage matrix and how to stop spam comments on YouTube. For a ready-made blocked words list you can paste directly into YouTube Studio as a complement to your regex setup, use our Blocked Words List tool.

Deploy Regex-Based Comment Filtering with CommentShark

Regex is the single highest-leverage upgrade available to creators fighting spam with keyword lists. It catches what keywords miss, stays effective as spammers adapt, and takes less maintenance over time once you have a solid base stack in place. The creators who moderate at scale without drowning are almost all using regex as one layer of their system.

CommentShark's Comment Assistant supports regex patterns natively in rule definitions, with per-video scoping and approval workflows so you can validate patterns before letting them run autonomously. The pattern library above works directly in the rule editor. Paste, test, deploy.

Set up regex-based spam filters alongside AI classification and keyword rules to catch the spam your current setup is missing.
