CSV Parser Service

1. Executive Summary

The CSV Parser Service is a shared utility service that provides robust, standards-compliant CSV parsing capabilities for the Augment-It platform. It handles complex CSV scenarios including quoted fields with embedded commas and newlines, automatic data type inference, validation, and transformation. The service is designed to be consumed by multiple microfrontends, particularly the RecordCollector application, ensuring consistent data processing across the platform.

2. Background & Motivation

Problem Statement

CSV parsing requirements vary across different components in the Augment-It platform, leading to duplicate code and inconsistent behavior when processing customer data imports.

Current Limitations

Inline CSV parsing logic embedded in components
Inconsistent handling of edge cases (quoted fields, embedded commas)
No centralized validation or error handling
Limited support for data type inference and transformation
Difficulty in extending parsing capabilities

Why This Solution

Centralized, reusable CSV parsing logic
Consistent error handling and validation across all applications
Support for complex CSV scenarios required by enterprise data
Extensible architecture for future enhancements

3. Goals & Non-Goals

Goals

RFC 4180 Compliance: Full support for CSV standard including quoted fields
Type Inference: Automatically detect and convert data types (string, number, boolean, date)
Validation: Configurable validation rules for required fields and data integrity
Error Handling: Comprehensive error reporting with line-level details
Performance: Handle large CSV files efficiently with streaming support
Extensibility: Plugin architecture for custom validators and transformers

Non-Goals

Excel/XLSX Support: Focus only on CSV format (Excel support is separate service)
Real-time Processing: Designed for batch import operations
Database Integration: Parser only handles data transformation, not persistence
UI Components: Service-only implementation, no visual components

4. Technical Design

High-Level Architecture

graph TD A[CSV File|String Input] --> B[CSV Parser Service] B --> C[Line Parser] C --> D[Field Extractor] D --> E[Type Inference Engine] E --> F[Validation Engine] F --> G[Data Transformer] G --> H[Structured Output] I[Configuration] --> B J[Custom Validators] --> F K[Custom Transformers] --> G

Core Components

1. CSV Line Parser

Responsibility: Parse individual CSV lines respecting quote boundaries
Features:
- Handle escaped quotes ("" sequences)
- Support multi-line quoted fields
- Configurable delimiter support (comma, semicolon, tab)

2. Type Inference Engine

Responsibility: Automatically detect and convert data types
Supported Types:
- String (default fallback)
- Number (integer/float detection)
- Boolean (true/false, yes/no, 1/0)
- Date (ISO 8601, common formats)
- Email (basic validation pattern)
- URL (HTTP/HTTPS validation)

3. Validation Engine

Responsibility: Apply validation rules to parsed data
Built-in Validators:
- Required field validation
- Data type validation
- Length constraints
- Pattern matching (regex)
- Custom validation functions

API Specifications

Primary Interface

typescript

interface CSVParserOptions {
  delimiter?: string; // Default: ','
  hasHeader?: boolean; // Default: true
  requiredColumns?: string[];
  typeInference?: boolean; // Default: true
  skipEmptyRows?: boolean; // Default: true
  maxRows?: number; // For large file protection
  encoding?: string; // Default: 'utf-8'
  customValidators?: Record<string, ValidatorFunction>;
  customTransformers?: Record<string, TransformerFunction>;
}

interface ParseResult<T = Record<string, any>> {
  data: T[];
  headers: string[];
  errors: ParseError[];
  warnings: ParseWarning[];
  metadata: {
    totalRows: number;
    processedRows: number;
    skippedRows: number;
    processingTime: number;
  };
}

interface ParseError {
  row: number;
  column?: string;
  field?: string;
  message: string;
  code: ErrorCode;
  severity: 'error' | 'warning';
}

// Main parsing function
function parseCSV(input: string | File, options?: CSVParserOptions): Promise<ParseResult>;

// Stream-based parsing for large files
function parseCSVStream(input: ReadableStream, options?: CSVParserOptions): AsyncIterable<ParseResult>;

// Validation-only function
function validateCSV(input: string | File, schema: ValidationSchema): Promise<ValidationResult>;

Core Implementation

typescript

// Based on existing implementation from RecordList.tsx
class CSVParser {
  private parseCSVLine(line: string, delimiter: string = ','): string[] {
    const result: string[] = [];
    let current = '';
    let inQuotes = false;
    
    for (let i = 0; i < line.length; i++) {
      const char = line[i];
      
      if (char === '"') {
        if (inQuotes && line[i + 1] === '"') {
          // Handle escaped quotes
          current += '"';
          i++;
        } else {
          // Toggle quote mode
          inQuotes = !inQuotes;
        }
      } else if (char === delimiter && !inQuotes) {
        // End of field
        result.push(current.trim());
        current = '';
      } else {
        current += char;
      }
    }
    
    result.push(current.trim());
    return result;
  }

  private inferType(value: string): { type: string; convertedValue: any } {
    if (value === '' || value === null || value === undefined) {
      return { type: 'string', convertedValue: value };
    }

    // Number detection
    const numberValue = Number(value);
    if (!isNaN(numberValue) && value !== '') {
      return { 
        type: Number.isInteger(numberValue) ? 'integer' : 'float', 
        convertedValue: numberValue 
      };
    }

    // Boolean detection
    const lowerValue = value.toLowerCase();
    if (['true', 'false', 'yes', 'no', '1', '0'].includes(lowerValue)) {
      return { 
        type: 'boolean', 
        convertedValue: ['true', 'yes', '1'].includes(lowerValue) 
      };
    }

    // Date detection (basic ISO 8601 pattern)
    if (/^\d{4}-\d{2}-\d{2}T?/.test(value) && !isNaN(Date.parse(value))) {
      return { type: 'date', convertedValue: new Date(value) };
    }

    // Email detection
    if (/^[^\s@]+@[^\s@]+\.[^\s@]+$/.test(value)) {
      return { type: 'email', convertedValue: value };
    }

    // URL detection
    if (/^https?:\/\//.test(value)) {
      try {
        new URL(value);
        return { type: 'url', convertedValue: value };
      } catch {
        // Invalid URL, treat as string
      }
    }

    return { type: 'string', convertedValue: value };
  }

  public async parse(input: string, options: CSVParserOptions = {}): Promise<ParseResult> {
    const startTime = Date.now();
    const errors: ParseError[] = [];
    const warnings: ParseWarning[] = [];
    
    try {
      // Split lines while handling quoted newlines
      const lines = input.split(/\r?\n/).filter(line => 
        options.skipEmptyRows ? line.trim() : true
      );
      
      if (lines.length === 0) {
        throw new Error('Empty CSV file');
      }

      // Parse header
      const headers = this.parseCSVLine(lines[0], options.delimiter);
      
      // Validate required columns
      if (options.requiredColumns) {
        const missingColumns = options.requiredColumns.filter(col => 
          !headers.includes(col)
        );
        if (missingColumns.length > 0) {
          errors.push({
            row: 0,
            message: `Missing required columns: ${missingColumns.join(', ')}`,
            code: 'MISSING_REQUIRED_COLUMNS',
            severity: 'error'
          });
        }
      }

      // Parse data rows
      const data: Record<string, any>[] = [];
      const maxRows = options.maxRows || lines.length;
      
      for (let i = 1; i < Math.min(lines.length, maxRows + 1); i++) {
        const line = lines[i];
        if (!line.trim() && options.skipEmptyRows) continue;

        try {
          const values = this.parseCSVLine(line, options.delimiter);
          const record: Record<string, any> = {
            id: crypto.randomUUID() // Generate unique ID for each record
          };

          headers.forEach((header, index) => {
            const rawValue = values[index] || '';
            
            if (options.typeInference) {
              const { convertedValue } = this.inferType(rawValue);
              record[header] = convertedValue;
            } else {
              record[header] = rawValue;
            }
          });

          data.push(record);
        } catch (error) {
          errors.push({
            row: i,
            message: `Failed to parse row: ${error instanceof Error ? error.message : 'Unknown error'}`,
            code: 'PARSE_ERROR',
            severity: 'error'
          });
        }
      }

      const processingTime = Date.now() - startTime;
      
      return {
        data,
        headers,
        errors,
        warnings,
        metadata: {
          totalRows: lines.length - 1, // Exclude header
          processedRows: data.length,
          skippedRows: (lines.length - 1) - data.length,
          processingTime
        }
      };
    } catch (error) {
      errors.push({
        row: -1,
        message: `Fatal parsing error: ${error instanceof Error ? error.message : 'Unknown error'}`,
        code: 'FATAL_ERROR',
        severity: 'error'
      });
      
      return {
        data: [],
        headers: [],
        errors,
        warnings,
        metadata: {
          totalRows: 0,
          processedRows: 0,
          skippedRows: 0,
          processingTime: Date.now() - startTime
        }
      };
    }
  }
}

Error Handling

Expected Error Cases

File Format Errors
- Invalid file encoding
- Malformed CSV structure
- Inconsistent column counts
Data Validation Errors
- Missing required fields
- Type conversion failures
- Invalid data formats
System Errors
- File read errors
- Memory limitations
- Network timeouts (for URL-based inputs)

Error Recovery Strategies

Partial Success: Continue processing valid rows, report errors for invalid ones
Graceful Degradation: Fall back to string type if type inference fails
Detailed Reporting: Provide row and column-level error information

Security Considerations

Input Validation
- File size limits to prevent DoS attacks
- Content-type validation
- Malicious CSV injection prevention
Memory Management
- Streaming support for large files
- Configurable memory limits
- Garbage collection optimization

5. Implementation Plan

Phase 1: Core Functionality

Basic CSV Parser (Week 1)
- Line parsing with quote handling
- Header extraction
- Basic error reporting
Type Inference Engine (Week 1)
- Number, boolean, date detection
- Configurable type inference options
- Fallback to string type
Integration with RecordCollector (Week 2)
- Replace inline parsing logic
- Maintain backward compatibility
- Add comprehensive error handling

Phase 2: Advanced Features

Validation Engine (Week 3)
- Required field validation
- Custom validator support
- Pattern matching validation
Performance Optimizations (Week 3)
- Streaming support for large files
- Memory usage optimization
- Background processing for large datasets
Extended Type Support (Week 4)
- Email and URL validation
- Currency and percentage formats
- Custom data type plugins

Phase 3: Integration & Polish

Service Integration (Week 5)
- Module federation setup
- API documentation generation
- Comprehensive test coverage
Monitoring & Analytics (Week 5)
- Performance metrics collection
- Error tracking and reporting
- Usage analytics

Dependencies

Internal: Module federation framework, shared error handling service
External: Web File API, Crypto API for UUID generation
Development: TypeScript 5+, Jest for testing, ESLint for code quality

Testing Strategy

Unit Tests
- CSV parsing logic with various edge cases
- Type inference accuracy
- Validation rule enforcement
Integration Tests
- End-to-end parsing with RecordCollector
- Large file handling
- Error scenarios and recovery
Performance Tests
- Parsing speed benchmarks
- Memory usage profiling
- Concurrent parsing scenarios

6. Alternatives Considered

Third-Party Libraries

Papa Parse: Popular CSV parsing library
- Pros: Well-tested, comprehensive features
- Cons: Large bundle size, external dependency
- Decision: Rejected in favor of custom implementation for better control

Browser-Native CSV API

Pros: No external dependencies, potentially faster
Cons: Limited browser support, less control over parsing logic
Decision: Rejected due to compatibility requirements

Server-Side Processing

Pros: Better performance for large files, reduced client load
Cons: Network latency, requires backend infrastructure
Decision: Deferred to Phase 4 as optional enhancement

7. Open Questions

Large File Handling: What's the practical size limit for client-side processing?
Internationalization: How should we handle different locale-specific number/date formats?
Custom Delimiters: Should we support tab-separated values (TSV) and other delimiters?
Data Preview: Should the parser provide a preview mode for large files?
Encoding Detection: Should we automatically detect file encoding or require explicit specification?

8. Appendix

Glossary

RFC 4180: The standard specification for CSV file format
Type Inference: Automatic detection and conversion of data types from string values
Streaming: Processing data in chunks rather than loading everything into memory
Module Federation: Webpack feature allowing sharing of code between separate builds

References

Revision History

v0.1.0 (2025-08-12): Initial specification based on existing implementation
v0.0.0.1 (2025-08-09): Initial file creation