How to tokenize data for AI

Using regular expressions and field-based value replacement to protect sensitive data before it reaches AI models.

When you work with AI capabilities in Frends, the customer data, support tickets, and similar information you process often contain Personally Identifiable Information (PII). Sending such sensitive data directly to AI models raises compliance concerns, especially under regulations like GDPR or HIPAA. This guide demonstrates how to implement reversible tokenization that protects PII while still allowing the AI to process your data effectively.

This approach replaces sensitive information with placeholder tokens before the data reaches the AI. After the AI processes the tokenized content, the system restores the original values, providing a seamless experience where privacy remains protected throughout the process.

Prerequisites

To follow along with this guide, you'll need at least the Editor role, or similar permission level, in Frends to create and edit Processes.

You should have some familiarity with regular expressions and basic C# scripting. If you plan to use the AI Connector, you'll need Frends Credits available or access to an alternative AI provider like Azure OpenAI or Ollama.

Understanding the Tokenization Workflow

Tokenization creates a temporary representation of sensitive data. Consider a customer support message that reads: "Contact John Smith at [email protected] or call +1-555-1234." Before this data reaches the AI, the system transforms it into: "Contact {{NAME_001}} at {{EMAIL_001}} or call {{PHONE_001}}."

The AI processes this tokenized version—summarizing it, extracting action items, or performing other operations. When the AI completes its work, the system uses the mapping created during tokenization to swap all placeholder tokens back to their original values. The end user sees the complete, original information in context, while the AI model never processes any actual PII.
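For the example above, the token map created during tokenization would record each replacement, for instance:

{
  "{{NAME_001}}": "John Smith",
  "{{EMAIL_001}}": "[email protected]",
  "{{PHONE_001}}": "+1-555-1234"
}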

This method works across different AI providers and doesn't require special model training or configuration. You maintain full control over what gets tokenized and how.

Creating the tokenization within a Process

Different types of data can use different tokenization strategies. If you're working with unstructured text like support messages, email bodies, or customer comments, you can use pattern-based tokenization with regular expressions. If you're working with structured data like JSON objects with defined schemas, you can instead use field-based tokenization that identifies PII by field names.

Pattern-Based Tokenization for Unstructured Text

When dealing with free-form text content, you need to identify PII by recognizing patterns. Email addresses follow predictable formats, as do phone numbers, social security numbers, and other common PII types. Using regular expressions, you can automatically detect and replace these patterns with tokens.

Consider a customer support message:

Hi, I noticed a charge on my account. Please contact me at [email protected] 
or call +1-555-1234 to discuss. My account manager Sarah Johnson 
([email protected]) mentioned this would be resolved.

This message contains email addresses, a phone number, and a person's name scattered throughout the text. You don't know in advance where these will appear or how many there will be.

Here's how you would tokenize this content using a C# Code task in Frends:

{
    var messageText = #var.MessageText;
    var tokenMap = new JObject();
    
    // Define patterns and their token prefixes
    var patterns = new Dictionary<string, string>
    {
        { @"[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}", "EMAIL" },
        { @"(?<![A-Za-z])(?:\+?\d{1,3}[-.\s]?)?(?:\d[-.\s]?){5,13}\d\b", "PHONE" },
        { @"\b([A-Z][a-z]+ [A-Z][a-z]+)\b", "NAME" },
        { @"\b\d{3}-\d{2}-\d{4}\b", "SSN" }
    };
    
    var counters = new Dictionary<string, int>();
    var tokenizedMessage = messageText;
    
    foreach (var pattern in patterns)
    {
        var regex = new System.Text.RegularExpressions.Regex(pattern.Key);
        var matches = regex.Matches(tokenizedMessage);
        
        if (!counters.ContainsKey(pattern.Value))
            counters[pattern.Value] = 1;
        
        foreach (System.Text.RegularExpressions.Match match in matches)
        {
            // Skip values already replaced by an earlier match of this pattern;
            // Replace below swaps every occurrence at once, so duplicates need no new token
            if (!tokenizedMessage.Contains(match.Value))
                continue;

            var token = $"{{{{{pattern.Value}_{counters[pattern.Value]:D3}}}}}";
            tokenMap[token] = match.Value;
            tokenizedMessage = tokenizedMessage.Replace(match.Value, token);
            counters[pattern.Value]++;
        }
    }
    
    return JObject.FromObject(new { 
        TokenizedText = tokenizedMessage, 
        TokenMap = tokenMap 
    });
}

Note: These regex patterns serve as examples. For production use, refine patterns to match your specific data format and reduce false positives. For example, the name pattern shown only captures simple "FirstName LastName" formats and may not detect compound names, prefixes, or suffixes.
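As an illustration of such refinement, the NAME entry in the dictionary could be swapped for a somewhat broader pattern that also allows an optional title and middle initial; treat this as a starting point rather than a complete solution:

// Broader, still illustrative name pattern: optional title and middle initial,
// still limited to capitalized two-part names
{ @"\b(?:(?:Dr|Mr|Mrs|Ms)\.\s+)?[A-Z][a-z]+(?:\s+[A-Z]\.)?\s+[A-Z][a-z]+\b", "NAME" }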

The code defines multiple regex patterns in a dictionary where each key is the pattern and each value is the token type prefix. It then processes the text for each pattern, finding all matches and replacing them with unique tokens like {{EMAIL_001}}, {{PHONE_001}}, or {{NAME_001}}.

After running this code, the message becomes:

Hi, I noticed a charge on my account. Please contact me at {{EMAIL_001}} 
or call {{PHONE_001}} to discuss. My account manager {{NAME_001}} 
({{EMAIL_002}}) mentioned this would be resolved.

The token map stores the relationships:

{
  "{{EMAIL_001}}": "[email protected]",
  "{{PHONE_001}}": "+1-555-1234",
  "{{NAME_001}}": "Sarah Johnson",
  "{{EMAIL_002}}": "[email protected]"
}

You can easily extend this approach by adding more patterns to the dictionary.
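For example, the following rough entries could be registered alongside the defaults. Tune them against your real data, and keep in mind that patterns are applied in insertion order, so a broad pattern like PHONE can swallow matches meant for a more specific one like CARD unless you reorder them:

// Illustrative additional patterns: rough shapes only, refine before production use
patterns.Add(@"\b(?:\d[ -]?){13,16}\b", "CARD");              // payment-card-like digit runs
patterns.Add(@"\b[A-Z]{2}\d{2}[A-Z0-9]{11,30}\b", "IBAN");    // IBAN-style account numbers
patterns.Add(@"\b(?:\d{1,3}\.){3}\d{1,3}\b", "IP");           // IPv4 addresses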

Field-Based Tokenization for Structured Data

When working with structured data like JSON objects, field names often tell you what contains PII. Fields named customerEmail, phoneNumber, or firstName clearly indicate sensitive data. Instead of searching for patterns, you walk through the JSON structure and tokenize values based on their field names.

Consider a customer record:

{
  "customerId": "12345",
  "firstName": "John",
  "lastName": "Smith",
  "email": "[email protected]",
  "phone": "+1-555-1234",
  "accountStatus": "active",
  "address": {
    "street": "123 Main St",
    "city": "Springfield",
    "zipCode": "12345"
  },
  "contactPreferences": {
    "preferredEmail": "[email protected]",
    "marketingOptIn": true
  }
}

This structure has PII scattered across multiple levels: top-level fields like email and phone, nested objects containing street and zipCode, and even duplicate values like the email appearing in two places.

Here's how you would tokenize this using field-based detection:

{
    var customerData = JObject.Parse(#var.CustomerJson);
    var tokenMap = new JObject();
    
    // Define fields to tokenize
    var fieldsToTokenize = new List<string>
    {
        "email", "phone", "name", "firstname", "lastname", "street", "zipcode"
    };
    
    // Counter for tokens
    int tokenCounter = 1;
    
    void TokenizeFields(JToken token)
    {
        if (token.Type == JTokenType.Object)
        {
            foreach (var prop in ((JObject)token).Properties().ToList())
            {
                var fieldName = prop.Name.ToLower();
    
                // Check if field should be tokenized
                if (fieldsToTokenize.Any(f => fieldName.Contains(f)) && prop.Value.Type == JTokenType.String)
                {
                    var originalValue = prop.Value.ToString();
                    var tokenKey = $"{{{{TOKEN_{tokenCounter:D3}}}}}"; // e.g. {{TOKEN_001}}
                    
                    tokenMap[tokenKey] = originalValue;
                    prop.Value = tokenKey;
                    tokenCounter++;
                }
                else
                {
                    // Recursively process nested objects
                    TokenizeFields(prop.Value);
                }
            }
        }
        else if (token.Type == JTokenType.Array)
        {
            foreach (var item in token)
            {
                TokenizeFields(item);
            }
        }
    }
    
    // Perform tokenization
    TokenizeFields(customerData);
    
    return JObject.FromObject(new {
        TokenizedData = customerData,
        TokenMap = tokenMap
    });
}

The code recursively walks through the JSON structure, examining each property name. When it finds a field that matches PII patterns, it tokenizes the value. The recursion handles nested objects and arrays automatically, so PII anywhere in the structure gets protected.

After running this on the customer record, you get the following tokenized JSON:

{
  "customerId": "12345",
  "firstName": "{{TOKEN_001}}",
  "lastName": "{{TOKEN_002}}",
  "email": "{{TOKEN_003}}",
  "phone": "{{TOKEN_004}}",
  "accountStatus": "active",
  "address": {
    "street": "{{TOKEN_005}}",
    "city": "Springfield",
    "zipCode": "{{TOKEN_006}}"
  },
  "contactPreferences": {
    "preferredEmail": "{{TOKEN_007}}",
    "marketingOptIn": true
  }
}

You also get the token map that tracks all the replacements:

{
  "{{TOKEN_001}}": "John",
  "{{TOKEN_002}}": "Smith",
  "{{TOKEN_003}}": "[email protected]",
  "{{TOKEN_004}}": "+1-555-1234",
  "{{TOKEN_005}}": "123 Main St",
  "{{TOKEN_006}}": "12345",
  "{{TOKEN_007}}": "[email protected]"
}

This approach scales well with complex nested structures and handles arrays of objects naturally. You can customize the field name matching logic to fit your specific data schemas.
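One possible refinement, shown purely as a sketch, is to replace the substring check with an explicit whitelist of exact field names mapped to token prefixes, so unrelated fields such as fileName or hostName are never tokenized by accident:

// Sketch of a stricter matching strategy: exact field names mapped to token prefixes.
// The field list is an assumption; adapt it to your own schema.
var fieldPrefixes = new Dictionary<string, string>(StringComparer.OrdinalIgnoreCase)
{
    { "firstName", "NAME" },
    { "lastName", "NAME" },
    { "email", "EMAIL" },
    { "preferredEmail", "EMAIL" },
    { "phone", "PHONE" },
    { "street", "ADDR" },
    { "zipCode", "ADDR" }
};

// Inside TokenizeFields, the membership check then becomes:
// if (fieldPrefixes.TryGetValue(prop.Name, out var prefix) && prop.Value.Type == JTokenType.String)
// {
//     var tokenKey = $"{{{{{prefix}_{tokenCounter:D3}}}}}";
//     ...
// }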

Restoring Original Values After AI Processing

Once the AI has processed your tokenized data and returned a response, you need to restore the original PII values. The detokenization process is straightforward: replace every token with its corresponding original value from the token map.

There are two methods for detokenizing AI responses: simple string replacement, which swaps tokens in plain-text responses back to their original values, and field-based detokenization for JSON data, which ensures the resulting content remains valid JSON.

Here is an example for performing a simple string replacement detokenization:

{
    var tokenMap = JObject.Parse(#var.TokenizedRegex["TokenMap"].ToString());
    var aiResponse = #result[Generate response with AI].Response.ToString();

    foreach (var kvp in tokenMap.Properties())
    {
        aiResponse = aiResponse.Replace(kvp.Name, kvp.Value.ToString());
    }

    return aiResponse;
}

And here is an example for JSON field replacement:

{
    var tokenMap = JObject.Parse(#var.TokenizedField["TokenMap"].ToString());
    var aiResponseJson = JObject.Parse(#result[Remap the Customer JSON with AI].Response.ToString());

    void DetokenizeJson(JToken token)
    {
        if (token.Type == JTokenType.Object)
        {
            foreach (var prop in ((JObject)token).Properties().ToList())
            {
                if (prop.Value.Type == JTokenType.String)
                {
                    var value = prop.Value.ToString();
                    foreach (var kvp in tokenMap.Properties())
                    {
                        value = value.Replace(kvp.Name, kvp.Value.ToString());
                    }
                    prop.Value = value;
                }
                else
                {
                    DetokenizeJson(prop.Value);
                }
            }
        }
        else if (token.Type == JTokenType.Array)
        {
            foreach (var item in token)
            {
                DetokenizeJson(item);
            }
        }
    }

    DetokenizeJson(aiResponseJson);

    return aiResponseJson;
}

The detokenization process works regardless of what the AI returns. Whether it's a summary, an analysis, or transformed data, any tokens that appear get replaced with their original values, making the output natural and complete while ensuring the AI never saw the actual PII.

Using AI and Tasks to customize the tokenization

While the examples in this guide cover basic tokenization scenarios, your specific requirements may differ. You might need to tokenize CSV or XML data, or perhaps you're working with an entirely different data structure.

Fortunately, Frends provides several tools to help you implement tokenization securely for your particular use case. The AI Code Generator is a powerful tool for generating custom tokenization logic tailored to your needs. What makes it particularly valuable is its security-first approach: only metadata and variable references are sent to the AI, meaning your actual data never leaves your environment.

This allows you to confidently use the AI Assistant to generate tokenization code without compromising data privacy. The AI Assistant is also approachable even if you're not deeply experienced with programming: you simply describe what you need in plain language, and it generates appropriate code for your specific situation.

When working with different data formats, it's worth noting that Frends handles JSON natively, which makes it the optimal format for data processing and tokenization. If your source data comes in a different format, it's generally a good practice to convert it to JSON first.

Frends provides useful Tasks for this purpose, such as CSV to JSON and XML to JSON, along with many other converters for different data formats. Converting your data to JSON before applying tokenization logic lets you use the examples in this guide almost directly, significantly reducing development time and complexity; because Frends processes JSON natively in the background, it also gives you the smoothest integration experience.
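If you prefer to keep such a conversion inside a C# Code task, a minimal sketch using Json.NET (the same library already used for the JObject handling above) might look like the following; the #var.SourceXml reference is a hypothetical Process variable holding the raw XML string:

{
    // Minimal sketch: convert an XML payload to JSON before tokenizing it.
    // #var.SourceXml is a hypothetical Process variable containing the XML as a string.
    var xmlDoc = new System.Xml.XmlDocument();
    xmlDoc.LoadXml(#var.SourceXml.ToString());

    // SerializeXmlNode produces a JSON string; parsing it into a JObject lets the
    // field-based tokenization shown earlier walk the structure directly.
    var json = Newtonsoft.Json.JsonConvert.SerializeXmlNode(xmlDoc);
    return JObject.Parse(json);
}

From there, the field-based tokenization and detokenization examples can be reused with little or no change.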

Conclusion

Reversible tokenization gives you practical ways to leverage AI capabilities while maintaining strict data privacy controls. Pattern-based tokenization using regular expressions works well for free-text content where PII appears in unpredictable locations. Field-based tokenization excels with structured data where field names indicate what contains sensitive information. Choose the approach that matches your data type, and you can confidently process sensitive information with AI while maintaining compliance and protecting customer privacy.
