AI recommended character set for API use

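import { customAlphabet } from 'nanoid'
const ACCOUNT_REF_PATTERN = /^[a-z0-9][a-z0-9-]{1,61}[a-z0-9]$/
// nanoid's default alphabet (A-Za-z0-9_-) contains characters the pattern rejects,
// so the suffix is drawn from lowercase letters and digits only
const suffix = customAlphabet('0123456789abcdefghijklmnopqrstuvwxyz', 3)
// Up to 59 chars of user input + hyphen + 3 random chars = 63 chars max
const accountRef = `${userInput.slice(0, 59)}-${suffix()}`
// Gives 46,656 combinations (36^3)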

Claude

For account references that need to work across URLs, cookies, and headers, I'd recommend sticking to the unreserved characters from RFC 3986:

Recommended character set for account references:

  • A-Z (uppercase letters)
  • a-z (lowercase letters)
  • 0-9 (digits)
  • - (hyphen)
  • _ (underscore)
  • . (period)
  • ~ (tilde)

This gives you 66 characters total and avoids encoding issues across all your use cases.

Characters to avoid by context:

| Context | Avoid | Reason |
|---|---|---|
| URL paths | / ? # [ ] @ | Path separators, query delimiters |
| Query params | & = ? # + | Query syntax, encoding issues |
| Cookies | ; , = \ space | Cookie syntax delimiters |
| Headers | Control chars (0-31, 127), high-bit chars (128-255) | HTTP header restrictions |
| General | % + <space> | Percent-encoding conflicts, ambiguity |
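
As a rough illustration, the table can be turned into a lookup that reports which contexts a candidate value would cause trouble in. This is only a sketch: the names `Context`, `AVOID`, `GENERAL`, and `unsafeContexts` are made up for this example, and the header rule is widened from "128-255" to all non-ASCII characters.

```typescript
// Characters to avoid per context, transcribed from the table above.
type Context = "urlPath" | "queryParam" | "cookie" | "header";

const AVOID: Record<Context, RegExp> = {
  urlPath: /[\/?#\[\]@]/,
  queryParam: /[&=?#+]/,
  cookie: /[;,=\\ ]/,
  // Control characters (0-31, 127) plus everything outside ASCII
  header: /[\x00-\x1F\x7F-\uFFFF]/,
};

// The "General" row: percent, plus, and space are flagged in every context.
const GENERAL = /[%+ ]/;

// Returns the contexts in which the value contains a character to avoid.
function unsafeContexts(value: string): Context[] {
  return (Object.keys(AVOID) as Context[]).filter(
    (ctx) => AVOID[ctx].test(value) || GENERAL.test(value)
  );
}
```

A value such as `a=b` is reported for `queryParam` and `cookie` only, while anything containing a space is reported for all four contexts.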

For account labels, I'd recommend:

  • All printable ASCII characters (32-126)
  • Exclude: ASCII control characters (0-31, 127)
  • Include: Letters, digits, punctuation, symbols, and spaces
  • This gives you readable text while avoiding terminal control sequences and non-printable characters

Regex patterns:

// Account reference
const ACCOUNT_REF_PATTERN = /^[A-Za-z0-9._~-]+$/
// Account label
const ACCOUNT_LABEL_PATTERN = /^[\x20-\x7E]+$/

The reference pattern is very conservative but guarantees no encoding headaches. The label pattern allows readable text including spaces while blocking problematic control characters.
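
A minimal usage sketch for the two patterns follows. The helper names (`isValidAccountRef`, `isValidAccountLabel`, `needsEncoding`) are hypothetical. The last helper illustrates the "no encoding headaches" claim: `encodeURIComponent` leaves the RFC 3986 unreserved characters untouched, so a valid reference comes back unchanged.

```typescript
const ACCOUNT_REF_PATTERN = /^[A-Za-z0-9._~-]+$/;
const ACCOUNT_LABEL_PATTERN = /^[\x20-\x7E]+$/;

function isValidAccountRef(ref: string): boolean {
  return ACCOUNT_REF_PATTERN.test(ref);
}

function isValidAccountLabel(label: string): boolean {
  return ACCOUNT_LABEL_PATTERN.test(label);
}

// True when percent-encoding would change the string, i.e. when it
// contains something other than unreserved characters.
function needsEncoding(value: string): boolean {
  return encodeURIComponent(value) !== value;
}
```

Both patterns reject the empty string because of the `+` quantifier, so a separate "required" check is not needed.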

Account labels with Emoji

For allowing emoji like smiley faces while maintaining security, I'd recommend a whitelist approach with specific Unicode blocks:

Recommended Unicode ranges for account labels:

  1. Basic Latin (U+0020-U+007E) - Standard printable ASCII
  2. Emoji blocks:
    • Emoticons (U+1F600-U+1F64F) - 😀😁😂😃😄😅😆😉😊😋😎😍😘😗😙😚
    • Miscellaneous Symbols (U+2600-U+26FF) - ☀⚡
    • Dingbats (U+2700-U+27BF) - ✅❌✨❤
    • Transport/Map Symbols (U+1F680-U+1F6FF) - 🚀
    • Miscellaneous Symbols and Pictographs (U+1F300-U+1F5FF) - 🌟🎉🎊🎯🏠

Security considerations to avoid:

  • Text direction controls (U+202A-U+202E, U+2066-U+2069) - Can cause spoofing
  • Variation selectors (U+FE00-U+FE0F) - Rendering inconsistencies
  • Zero-width characters (U+200B-U+200D) - Invisible characters
  • Private use areas (U+E000-U+F8FF) - Undefined behavior
  • Surrogates (U+D800-U+DFFF) - Can break string handling
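
One way to surface these categories during validation is a small scanner that names what it found, which can make rejection messages more helpful than a bare pattern mismatch. This is a sketch under the assumption that the ranges listed above are the full deny-list; `RISK_PATTERNS` and `findLabelRisks` are illustrative names.

```typescript
// Ranges copied from the list above. With the "u" flag a surrogate pair is
// read as one code point, so the last pattern only matches lone surrogates.
const RISK_PATTERNS: Array<[string, RegExp]> = [
  ["bidi-control", /[\u202A-\u202E\u2066-\u2069]/u],
  ["variation-selector", /[\uFE00-\uFE0F]/u],
  ["zero-width", /[\u200B-\u200D]/u],
  ["private-use", /[\uE000-\uF8FF]/u],
  ["lone-surrogate", /[\uD800-\uDFFF]/u],
];

// Returns the name of every risky category present in the label.
function findLabelRisks(label: string): string[] {
  return RISK_PATTERNS.filter(([, pattern]) => pattern.test(label)).map(
    ([name]) => name
  );
}
```

Note that rejecting U+FE0F also rejects emoji that rely on it for emoji presentation, such as a red heart typed from most keyboards, so stripping the selector before validating may be friendlier than refusing the input.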

TypeScript implementation:

// Conservative emoji-friendly pattern
const ACCOUNT_LABEL_PATTERN =
/^[\x20-\x7E\u{1F600}-\u{1F64F}\u{1F300}-\u{1F5FF}\u{1F680}-\u{1F6FF}\u{2600}-\u{26FF}\u{2700}-\u{27BF}]+$/u
// Or more restrictive - just emoticons
const SIMPLE_EMOJI_PATTERN = /^[\x20-\x7E\u{1F600}-\u{1F64F}]+$/u

Additional validation:

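  • Length limits - Emoji take 3-4 bytes each in UTF-8 and 1-2 UTF-16 code units in JavaScript strings, so decide whether a limit counts bytes, code units, or code points
  • Normalization - Use label.normalize('NFC') for consistent representation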
  • Homograph detection - Consider visually similar characters
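
The length and normalization points can be sketched as a single measuring step. The limits below (`MAX_CODE_POINTS`, `MAX_UTF8_BYTES`) are placeholder values chosen for illustration, not recommendations from the discussion above, and `measureLabel` is a hypothetical helper.

```typescript
// Hypothetical limits; pick values that match your storage layer.
const MAX_CODE_POINTS = 64;
const MAX_UTF8_BYTES = 256;

// UTF-8 size of one code point: 1 byte for ASCII, 4 for most emoji.
function utf8Length(codePoint: number): number {
  if (codePoint < 0x80) return 1;
  if (codePoint < 0x800) return 2;
  if (codePoint < 0x10000) return 3;
  return 4;
}

function measureLabel(raw: string) {
  const label = raw.normalize("NFC");
  // Array.from splits by code point, so a surrogate pair counts once.
  const codePoints = Array.from(label);
  const utf8Bytes = codePoints.reduce(
    (sum, ch) => sum + utf8Length(ch.codePointAt(0)!),
    0
  );
  return {
    label,
    utf16Units: label.length,
    codePoints: codePoints.length,
    utf8Bytes,
    withinLimits:
      codePoints.length <= MAX_CODE_POINTS && utf8Bytes <= MAX_UTF8_BYTES,
  };
}
```

For `"Hi 😀"` the three counts differ: 5 UTF-16 units, 4 code points, and 7 UTF-8 bytes, which is why a limit should state what it counts.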

Recommendation: Start with just the Emoticons block (U+1F600-U+1F64F) which covers most common smiley faces, then expand based on user needs. This minimizes attack surface while providing good UX.

The key is being explicit about what you allow rather than trying to block everything dangerous - Unicode has many edge cases that can cause security issues.

AWS Service naming considerations

Yes, AWS services have additional naming restrictions beyond general URL safety. Here are the key constraints:

AWS Resource Naming Restrictions:

S3 Buckets (most restrictive):

  • Only a-z, 0-9, -, .
  • No uppercase letters
  • No consecutive periods
  • Can't start/end with - or .
  • 3-63 characters

IAM Resources (roles, policies, users):

  • A-Z, a-z, 0-9, +, =, ,, ., @, -, _
  • No spaces
  • Path components can use /

Lambda Functions:

  • A-Z, a-z, 0-9, -, _
  • No periods, no spaces

CloudFormation Stack Names:

  • A-Z, a-z, 0-9, -
  • No underscores, no periods
  • Must start with letter

DynamoDB Tables:

  • A-Z, a-z, 0-9, -, _, .
  • 3-255 characters
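
The per-service rules above can be collected into one table of patterns, which makes it easy to see which services would reject a given name. This is a sketch: the length limits for IAM roles (64), Lambda functions (64), and CloudFormation stacks (128) are not stated above and are assumptions to verify against current AWS documentation, and `AWS_NAME_PATTERNS` and `rejectedBy` are illustrative names.

```typescript
const AWS_NAME_PATTERNS = {
  // 3-63 chars, lowercase only, no leading/trailing "-" or ".", no ".."
  s3Bucket: /^(?!.*\.\.)[a-z0-9][a-z0-9.-]{1,61}[a-z0-9]$/,
  // Length limit of 64 assumed (role names)
  iamRole: /^[A-Za-z0-9+=,.@_-]{1,64}$/,
  // Length limit of 64 assumed
  lambdaFunction: /^[A-Za-z0-9_-]{1,64}$/,
  // Must start with a letter; length limit of 128 assumed
  cloudFormationStack: /^[A-Za-z][A-Za-z0-9-]{0,127}$/,
  dynamoDbTable: /^[A-Za-z0-9_.-]{3,255}$/,
} as const;

type AwsResourceKind = keyof typeof AWS_NAME_PATTERNS;

// Returns every resource kind whose naming rules reject the candidate name.
function rejectedBy(name: string): AwsResourceKind[] {
  return (Object.keys(AWS_NAME_PATTERNS) as AwsResourceKind[]).filter(
    (kind) => !AWS_NAME_PATTERNS[kind].test(name)
  );
}
```

For example, `rejectedBy("Acme_Prod")` reports `s3Bucket` and `cloudFormationStack`, while `rejectedBy("acme-prod")` returns an empty list.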

CDK-Specific Considerations:

// CDK often generates resource names by concatenating
// account-ref + stack-name + resource-type + random-suffix
const stackName = `MyApp-${accountRef}-Stack`
const bucketName = `myapp-${accountRef.toLowerCase()}-bucket-${randomSuffix}`

Safest character set for AWS compatibility:

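// Safe for most AWS services (lowercase it before using it in S3 bucket names)
const AWS_SAFE_PATTERN = /^[A-Za-z0-9-]+$/
// Slightly more permissive (fine for IAM and DynamoDB; Lambda rejects periods,
// CloudFormation stack names reject periods and underscores)
const AWS_COMPATIBLE_PATTERN = /^[A-Za-z0-9._-]+$/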

Specific gotchas:

  • Case sensitivity: S3 bucket names must be lowercase
  • Hyphens: Can't start/end resource names in many services
  • Periods: Problematic in S3 bucket names (SSL cert issues)
  • Underscores: Not allowed in CloudFormation stack names
  • Length limits: S3 buckets (63), most others (64-255)

Recommended approach:

// For account references used in AWS resource names
const AWS_ACCOUNT_REF_PATTERN = /^[A-Za-z][A-Za-z0-9]*(?:-[A-Za-z0-9]+)*$/
// Ensures:
// - Starts with letter
// - Ends with letter or digit
// - Only contains letters, digits, hyphens
// - No consecutive hyphens

This pattern works across most AWS services and CDK constructs without encoding. S3 bucket names still need the reference lowercased, and each service's length limit applies.
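
Since S3 bucket names must be lowercase and at most 63 characters, a reference that is valid elsewhere may still need a small transformation before it goes into a bucket name. A possible sketch, where `toBucketName` and the `myapp` prefix are hypothetical:

```typescript
// Derives an S3-safe bucket name from an account reference.
function toBucketName(accountRef: string, prefix = "myapp"): string {
  const name = `${prefix}-${accountRef}`
    .toLowerCase()
    .replace(/[^a-z0-9-]+/g, "-") // anything outside a-z, 0-9, "-" becomes "-"
    .replace(/-{2,}/g, "-") // collapse runs of hyphens
    .slice(0, 63) // S3 length limit
    .replace(/^-+|-+$/g, ""); // no leading or trailing hyphen
  if (name.length < 3) {
    throw new Error(`Cannot derive a bucket name from "${accountRef}"`);
  }
  return name;
}
```

Lowercasing is lossy: `Acme` and `acme` map to the same bucket name, so references that differ only by case would need to be rejected at creation time or disambiguated with a suffix.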

Grok

To determine which additional characters might complicate handling with AWS services or AWS Cloud Development Kit (CDK) names, particularly in the context of your account references and labels (including emoji), let's explore the constraints imposed by AWS services and the CDK. The goal is to identify characters that could cause issues in contexts like resource naming, CloudFormation templates, or CDK constructs, and to ensure compatibility with your requirements for account references (used in URLs, cookies, headers) and account labels (including smiley faces/emoji). I'll guide you through the considerations, referencing AWS-specific restrictions and CDK behaviors, and provide a comprehensive view of problematic characters.

1. Context Recap: Account References and Labels

  • Account References: Used in URLs (path/query), cookies (keys/values), and headers. Recommended to use [A-Za-z0-9\-_] (alphanumeric, dash, underscore) for maximum compatibility.
  • Account Labels: More permissive, allowing printable ASCII (U+0020–U+007E) and emoji (e.g., U+1F600–U+1F64F for smiley faces, plus other emoji ranges like U+1F300–U+1F5FF, U+1F900–U+1F9FF, U+2700–U+27BF, U+2600–U+26FF). These are primarily for display but may appear in databases, JSON, or headers.
  • New Consideration: You want to allow emoji in labels and now need to know which characters (beyond those already discussed) could cause issues with AWS services or CDK naming conventions.

2. AWS Services: Character Restrictions

AWS services impose varying restrictions on resource names, which can affect account references and labels. These restrictions depend on the service (e.g., S3, Lambda, IAM, DynamoDB) and context (e.g., resource names, tags, ARNs). Below, I'll outline characters that complicate handling across AWS services, focusing on those relevant to your use case.

General AWS Resource Naming Constraints

AWS resource names often share common restrictions but vary by service. The most restrictive services provide a baseline for safe character sets. Based on AWS documentation and community insights, here are key considerations:

  • Common Allowed Characters: Most AWS services allow:

    • Alphanumeric: A-Z, a-z, 0-9
    • Safe symbols: - (hyphen), _ (underscore), sometimes . (dot)
  • Common Restrictions:

    • Case Sensitivity: Most services are case-sensitive, but some (e.g., RDS DB subnet groups) convert names to lowercase, which can cause issues if uppercase is used.

    • Length Limits: Vary by service (e.g., S3 bucket names: 3–63 characters; ElastiCache cluster names: 1–20 characters). Source: "What are the most restrictive aws resource name limitations" (https://stackoverflow.com/questions/46052869/what-are-the-most-restrictive-aws-resource-name-limitations-e-g-characters-and)

    • Special Characters: Many services prohibit or restrict special characters like /, :, ?, #, &, =, %, @, spaces, and control characters.

    • Unicode/Emoji: Most AWS resource names are limited to ASCII. Unicode (including emoji) is generally supported only in specific contexts like tags or metadata fields (e.g., AWS Elemental MediaConnect tags allow UTF-8 up to 128 characters for keys and 256 for values) (https://docs.aws.amazon.com/mediaconnect/latest/ug/tagging-restrictions.html).

Problematic Characters in AWS Services

The following characters are likely to cause issues in AWS resource names or related contexts:

| Character | Unicode/ASCII | Issue in AWS Services | Relevant Services |
|---|---|---|---|
| / | U+002F | Reserved in ARNs, URLs, and IAM role names; causes deployment errors in CDK. | IAM, S3, CloudFormation, CDK |
| :, ?, #, &, = | Various | URL reserved characters; invalid in most resource names; require encoding. | S3, Lambda, URLs |
| Space ( ) | U+0020 | Invalid in most resource names; requires encoding in URLs (%20) and cookies. | S3, Lambda, IAM, Cookies |
| Uppercase letters | A–Z (U+0041–U+005A) | Converted to lowercase in some services (e.g., RDS DB subnet groups); can cause mismatches. | RDS, DynamoDB |
| _ (underscore) | U+005F | Prohibited in S3 bucket names; allowed elsewhere but may cause UI rendering issues. | S3 |
| Consecutive hyphens (--) | U+002D | Reserved in some S3 bucket name forms (e.g., the xn-- prefix); may cause parsing issues in other services. | S3 |
| . (dot) | U+002E | Allowed in some services (e.g., S3 bucket names) but not others (e.g., Lambda function names). | Lambda, CloudFormation |
| Control chars | U+0000–U+001F, U+007F | Invalid in all resource names; cause parsing or security issues (e.g., log poisoning). | All services |
| Non-ASCII (e.g., emoji) | >U+007F | Invalid in most resource names; supported in tags/metadata but requires UTF-8 handling. | S3, Lambda, IAM (names); Tags |
| Bidirectional controls | U+202A–U+202E | Can cause text rendering issues or spoofing; not allowed in most contexts. | All services |

Service-Specific Examples

3. AWS CDK: Naming Conventions and Issues

The AWS CDK generates CloudFormation templates, which impose additional constraints on resource names and logical IDs. CDK's naming behavior can introduce complexities, especially when custom names (like account references or labels) are used.

CDK Naming Behavior

Problematic Characters in CDK

  • Slash (/): Causes deployment errors in IAM roles and other resources; reserved for ARNs and paths (https://github.com/aws/aws-cdk/issues/28502).

  • Colon (:): Used in ARNs; invalid in most resource names and logical IDs.

  • Spaces and Special Characters: Invalid in logical IDs and most physical names; require encoding in other contexts.

  • Uppercase: Converted to lowercase by some services, causing inconsistencies (ref: aws-cdk-things-you-should-know-d620a2de9669).

  • Unicode/Emoji: Not supported in logical IDs or most physical names; may work in tags but requires UTF-8 handling.

CDK-Specific Recommendations

4. Impact on Account References and Labels

Account References

Account Labels with Emoji

  • Emoji Support: Emoji are supported in tags and metadata fields (e.g., AWS Elemental MediaConnect tags). However, they are invalid in most resource names (e.g., S3 buckets, IAM roles, Lambda functions) (https://docs.aws.amazon.com/mediaconnect/latest/ug/tagging-restrictions.html).

  • Complications:

    • Resource Names: If labels are used to generate resource names (e.g., S3 buckets, DynamoDB tables), emoji will cause failures. Transform labels to ASCII (e.g., John 😊 → john-smiley) before using in resource names.

    • CDK Logical IDs: Logical IDs in CloudFormation templates must be alphanumeric. Emoji in labels used as construct IDs will cause synthesis errors.

    • Length Limits: Emoji are multi-byte (e.g., 😊 is 4 bytes in UTF-8). A 256-character label with emoji could exceed byte-length limits in tags (256 bytes) or other fields (https://docs.aws.amazon.com/mediaconnect/latest/ug/tagging-restrictions.html).

    • Rendering Issues: Some AWS consoles or tools may not render emoji correctly, especially older systems.

  • Mitigation:

    • Store labels as tags or metadata, not resource names.
    • Use UTF-8 encoding consistently across database, API, and UI.
    • Validate byte length (e.g., 256 bytes max) to avoid exceeding limits.
    • Escape special characters (<, >, &, ", ') in HTML/JSON contexts to prevent XSS or parsing issues.
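
The label-to-ASCII transformation mentioned above (John 😊 → john-smiley) could look roughly like this. The emoji-to-word map is entirely hypothetical and would need to be filled in for whatever emoji the product allows; unmapped emoji are simply dropped.

```typescript
// Hypothetical emoji-to-word map, for illustration only.
const EMOJI_NAMES: Record<string, string> = {
  "\u{1F60A}": "smiley", // 😊
  "\u{1F31F}": "star", // 🌟
};

// Turns a display label into a lowercase ASCII slug for resource names.
// Returns an empty string if nothing usable remains, so callers should
// fall back to the account reference in that case.
function labelToSlug(label: string): string {
  return Array.from(label.normalize("NFC"))
    .map((ch) => (EMOJI_NAMES[ch] !== undefined ? ` ${EMOJI_NAMES[ch]} ` : ch))
    .join("")
    .toLowerCase()
    .replace(/[^a-z0-9]+/g, "-") // spaces, punctuation, unmapped emoji
    .replace(/^-+|-+$/g, ""); // trim leading and trailing hyphens
}
```

Because several labels can collapse to the same slug, the slug is better treated as a readable hint inside a name that also contains the unique account reference.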

5. Additional Problematic Characters

Beyond the characters already discussed (control characters, URL reserved characters, bidirectional controls), consider these additional cases that could complicate AWS or CDK handling:

6. Recommendations for Safe Implementation

Account References

  • Character Set: Stick to [A-Za-z0-9\-_] (max 64 characters).
  • Validation: Use regex ^[a-zA-Z0-9\-_]{1,64}$ in CDK or API code.
  • CDK Usage: Avoid using references as construct IDs unless transformed to alphanumeric; let CDK generate physical names.
  • Example: user123, account-abc_456

Account Labels

  • Character Set: Allow printable ASCII (U+0020–U+007E) and emoji (U+1F600–U+1F64F, U+1F300–U+1F5FF, U+1F900–U+1F9FF, U+2700–U+27BF, U+2600–U+26FF).
  • Validation: Use regex like ^[\x20-\x7E\U0001F600-\U0001F64F\U0001F300-\U0001F5FF\U0001F900-\U0001F9FF\U00002700-\U000027BF\U00002600-\U000026FF]{0,256}$ (adjust for language-specific Unicode support).
  • Byte Length: Limit to 256 bytes to accommodate multi-byte emoji.
  • CDK Usage: Store labels as tags (e.g., Tags.of(resource).add("Label", label)) rather than physical names. Ensure UTF-8 encoding in CloudFormation templates.
  • Example: John's Account 😊, Team A 🌟
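
The `\U0001F600` escapes in the regex above are not valid in JavaScript or TypeScript, where the equivalent is `\u{1F600}` combined with the `u` flag. A sketch of the same rule in TypeScript follows; `LABEL_WITH_EMOJI_PATTERN` and `isValidLabel` are illustrative names, and the byte check relies on the `TextEncoder` global available in Node.js and browsers.

```typescript
// With the "u" flag, {0,256} counts code points rather than UTF-16 units.
const LABEL_WITH_EMOJI_PATTERN =
  /^[\x20-\x7E\u{1F600}-\u{1F64F}\u{1F300}-\u{1F5FF}\u{1F900}-\u{1F9FF}\u{2700}-\u{27BF}\u{2600}-\u{26FF}]{0,256}$/u;

const MAX_LABEL_BYTES = 256;

function isValidLabel(label: string): boolean {
  if (!LABEL_WITH_EMOJI_PATTERN.test(label)) return false;
  // Emoji take 3-4 bytes in UTF-8, so the byte limit can be hit well
  // before the 256-code-point limit.
  return new TextEncoder().encode(label).length <= MAX_LABEL_BYTES;
}
```

With these limits, 64 four-byte emoji fit exactly and the 65th pushes the label over the byte limit, even though it is far below 256 characters. The pattern also accepts an empty label, so a separate "required" rule is needed if labels are mandatory.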

AWS and CDK Best Practices

Grok part 2

To design an API where account references are used in URLs, cookies, headers, and potentially as cookie keys, while also defining a more permissive character set for account labels, we need to carefully consider which ASCII characters are safe and practical for each use case. Let's explore this step-by-step through a Socratic approach to guide your decision-making.

Account References in URLs, Cookies, and Headers

First, let's think about the account reference, which will appear in sensitive contexts like URL paths, query parameters, cookies, and headers. What properties do you think are essential for an account reference to ensure it's safe and interoperable across these contexts? For instance, should it be short, human-readable, or restricted to certain characters to avoid encoding issues?

Consider the constraints of URLs. URLs have specific rules defined by standards like RFC 3986. Which characters do you think might cause issues in a URL path or query parameter if not encoded? For example, would including a space or a question mark in an account reference create problems? Why might certain characters, like slashes or ampersands, complicate parsing in URLs?

Now, think about cookies and headers. Cookies (per RFC 6265) and HTTP headers have their own restrictions. What happens if you include a comma or semicolon in a cookie value? Could including non-printable ASCII control characters, like a null byte or line feed, break cookie parsing or pose security risks? Similarly, in headers, which characters might be problematic due to how servers or clients interpret whitespace or special characters?

Let's narrow it down. A common approach for identifiers like account references is to use a limited, safe subset of ASCII characters to minimize encoding and parsing issues. For example, alphanumeric characters (A-Z, a-z, 0-9) are generally safe. Would you consider adding other characters, like hyphens or underscores, to make the reference more readable or flexible? Why or why not? What about avoiding characters like percent signs (%) or equals signs (=), which have special meanings in URLs or cookies?

To help you reason through this, let's consider a potential character set for account references. A safe choice might be:

  • Alphanumeric: A-Z, a-z, 0-9
  • Optional safe symbols: - (hyphen), _ (underscore), . (dot)

Why might these be good choices? Are there any contexts where even these characters could cause issues? For instance, could a dot in a cookie key be misinterpreted? Would you want to enforce case sensitivity, or make the reference case-insensitive to simplify handling?

Account Labels

Now, let's explore the account labels, which you've indicated should be more permissive, allowing spaces but excluding ASCII control characters like the delete character. What do you think the purpose of the label is? Is it for display to users, storage in a database, or something else? Why would allowing spaces be useful, and what risks might come with including them in certain contexts, like URLs or JSON?

Since you want to exclude ASCII control characters (e.g., delete, null, carriage return), let's define what "control characters" means. In ASCII, control characters are typically those with codes 0–31 and 127 (delete). Do you think all of these should be excluded, or are there specific ones you're most concerned about? For example, why might a delete character (ASCII 127) or a null byte (ASCII 0) be particularly problematic?

For labels, a more permissive set might include:

  • Alphanumeric: A-Z, a-z, 0-9
  • Common punctuation: space, hyphen (-), underscore (_), dot (.), comma (,), apostrophe ('), etc.
  • Excluding control characters (ASCII 0–31, 127)

Does this seem like a reasonable starting point? What other characters might you want to allow for user-friendly labels? For example, would you include symbols like ampersands (&) or hashtags (#)? Why might these be problematic in some systems, like databases or web forms?

Reference Table for Characters to Avoid

You asked for a reference table listing characters to avoid in different contexts. Let's reason through what such a table might include. For each context (URL path, query parameters, cookies, headers, labels), what characters do you think should be avoided, and why? Here's a starting point to guide your thinking:

  1. URL Path (RFC 3986):
    • Reserved characters: /, ?, #, [, ], @, !, $, &, ', (, ), *, +, ,, ;, =
    • Why avoid? These have special meanings (e.g., / separates path segments, ? starts query parameters).
    • Non-printable characters (ASCII 0–31, 127): Can break parsing or pose security risks (e.g., injection attacks).
    • Space: Requires encoding (%20), which can complicate usage.
  2. URL Query Parameters:
    • Reserved: &, =, #, +, space
    • Why avoid? & separates key-value pairs, = separates keys from values, and spaces need encoding.
    • Control characters: Same risks as in paths.
  3. Cookies (RFC 6265):
    • Forbidden in values: ,, ;, ", \, space, control characters (ASCII 0–31, 127)
    • Forbidden in keys: Same as values, plus = and the other token separators: ( ) < > @ : / [ ] ? { }
    • Why avoid? These can break cookie parsing or lead to malformed requests.
  4. HTTP Headers:
    • Forbidden: Control characters (ASCII 0–31, 127), non-ASCII characters.
    • Restricted: Whitespace (space, tab) at the start or end, as it may be trimmed or cause parsing errors.
    • Why avoid? Headers are sensitive to formatting, and control characters can cause security issues.
  5. Account Labels:
    • Avoid: Control characters (ASCII 0–31, 127) due to risks in storage, display, or processing.
    • Caution: Characters like <, >, & (potential XSS risks in web contexts), or quotes (", ') that might break JSON or SQL if not escaped.
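
For the cookie and header rows, the grammar in RFC 6265 is specific enough to express as patterns. The sketch below checks whether a candidate reference could be used unencoded as a cookie name, a cookie value, and a header value; the names `COOKIE_NAME`, `COOKIE_VALUE`, `HEADER_VALUE`, and `checkReference` are illustrative.

```typescript
// Cookie names are HTTP tokens: printable ASCII minus separators
// such as ( ) < > @ , ; : \ " / [ ] ? = { } and space.
const COOKIE_NAME = /^[!#$%&'*+\-.^_`|~0-9A-Za-z]+$/;

// cookie-octet from RFC 6265 section 4.1.1: printable ASCII except
// space, double quote, comma, semicolon, and backslash.
const COOKIE_VALUE = /^[\x21\x23-\x2B\x2D-\x3A\x3C-\x5B\x5D-\x7E]*$/;

// Printable ASCII with no leading or trailing space.
const HEADER_VALUE = /^[\x21-\x7E](?:[\x20-\x7E]*[\x21-\x7E])?$/;

function checkReference(ref: string) {
  return {
    cookieName: COOKIE_NAME.test(ref),
    cookieValue: COOKIE_VALUE.test(ref),
    headerValue: HEADER_VALUE.test(ref),
  };
}
```

A reference like `acct-123_x.y` passes all three checks, whereas `a=b` is acceptable as a cookie value but not as a cookie name.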

Would you like to refine this table further? For example, should we specify which characters are safe rather than focusing on those to avoid? Would you want to test specific characters in each context to see how they behave in real systems?

For account references, a conservative approach is to use:

  • A-Z, a-z, 0-9, -, _, .
  • Why? These are safe across URLs, cookies, and headers, requiring minimal encoding. They're also human-readable and unlikely to cause parsing issues.
  • Questions: Should the length be limited (e.g., 1โ€“64 characters)? Should it be case-sensitive?

For account labels, a more permissive set could be:

  • A-Z, a-z, 0-9, space, -, _, ., ,, ', !, :
  • Exclude: ASCII 0–31, 127, and potentially risky characters like <, >, &, ", \ unless properly escaped in context.
  • Questions: Should labels have a maximum length? Are there specific display or storage constraints to consider?

Next Steps

To solidify your design, what specific use cases or systems (e.g., web frameworks, databases) are you targeting? Are there particular edge cases, like internationalization or legacy system compatibility, you want to explore? Would it help to create a sample account reference and label, then test how they behave in a URL, cookie, or header?

If you'd like a visual representation of safe vs. unsafe characters for each context, would you want me to generate a chart comparing the allowed characters across URLs, cookies, headers, and labels? This could help clarify the differences visually.

What do you think is the next question we should tackle to refine your API's character set choices?