Document Parser

Parse PDF, DOCX, XLSX, and text files (plain text, Markdown, HTML) into structured data.

Installation

pnpm add @aicr/document-parser

Quick Start

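import { readFile } from 'node:fs/promises';
import { parseDocument, parsePDF, parseDOCX, parseXLSX } from '@aicr/document-parser';

// Read the document into a Buffer
const buffer = await readFile('report.pdf');

// Auto-detect format (the filename is used for format detection)
const result = await parseDocument(buffer, { filename: 'report.pdf' });
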
console.log(result.text);      // Full text content
console.log(result.pages);     // Page-by-page content
console.log(result.tables);    // Extracted tables
console.log(result.metadata);  // Document metadata

API Reference

parseDocument(input, options)

Automatically detects file format and parses the document.

Parameters:

  • input: Buffer | ArrayBuffer - Document content
  • options: ParseOptions
    • filename?: string - Filename for format detection
    • mimeType?: string - MIME type override
    • maxPages?: number - Limit pages to parse
    • extractTables?: boolean - Extract table data (default: true)
    • extractImages?: boolean - Extract image metadata (default: false)

Returns: Promise<ParseResult>

interface ParseResult {
  format: 'pdf' | 'docx' | 'xlsx' | 'text' | 'markdown' | 'html';
  text: string;
  metadata: DocumentMetadata;
  pages: PageContent[];
  tables: TableData[];
  images: ImageData[];
  stats: ParseStats;
}
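
ParseResult keeps tables in one flat array, and each table carries a pageNumber. As a minimal sketch of relating the two, the helper below groups tables by page. groupTablesByPage is a hypothetical name, not part of the package, and the sample data is invented for illustration.

```typescript
// Hypothetical helper (not part of @aicr/document-parser):
// group any objects that carry a pageNumber, such as TableData, by page.
function groupTablesByPage<T extends { pageNumber: number }>(
  tables: T[],
): Map<number, T[]> {
  const byPage = new Map<number, T[]>();
  for (const table of tables) {
    const bucket = byPage.get(table.pageNumber);
    if (bucket) {
      bucket.push(table);
    } else {
      byPage.set(table.pageNumber, [table]);
    }
  }
  return byPage;
}

// Invented sample data standing in for result.tables
const sampleTables = [
  { pageNumber: 1, tableIndex: 0 },
  { pageNumber: 2, tableIndex: 0 },
  { pageNumber: 2, tableIndex: 1 },
];

const grouped = groupTablesByPage(sampleTables);
console.log(grouped.get(2)?.length);
```

With a real result, the same call would be groupTablesByPage(result.tables).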

parsePDF(input, options)

Parse PDF documents specifically.

const result = await parsePDF(pdfBuffer, { maxPages: 10 });

parseDOCX(input, options)

Parse Word documents.

const result = await parseDOCX(docxBuffer);

parseXLSX(input, options)

Parse Excel spreadsheets. Returns structured table data.

const result = await parseXLSX(xlsxBuffer);
 
// Access sheet data
result.tables.forEach(table => {
  console.log(`Sheet ${table.pageNumber}:`, table.headers);
  table.rows.forEach(row => console.log(row));
});
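
Since each table exposes headers and rows separately, it is often convenient to zip them into one object per row. The sketch below shows one way to do that. tableToRecords is a hypothetical helper, not part of the package; the TableData interface is a local copy of the documented shape so the snippet runs on its own.

```typescript
// Local copy of the documented TableData shape (cells omitted for brevity).
interface TableData {
  pageNumber: number;
  tableIndex: number;
  headers: string[];
  rows: string[][];
}

// Hypothetical helper: turn each row into an object keyed by header.
function tableToRecords(table: TableData): Record<string, string>[] {
  return table.rows.map(row => {
    const record: Record<string, string> = {};
    table.headers.forEach((header, i) => {
      // Rows shorter than the header list get empty strings for missing cells.
      record[header] = row[i] ?? '';
    });
    return record;
  });
}

// Invented sample data standing in for one entry of result.tables
const sampleTable: TableData = {
  pageNumber: 1,
  tableIndex: 0,
  headers: ['name', 'qty'],
  rows: [['widget', '3'], ['gadget']],
};

console.log(tableToRecords(sampleTable));
```

Duplicate header names would overwrite each other in this sketch, so a real sheet with repeated column titles needs extra handling.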

parseText(input, options)

Parse plain text, markdown, or HTML.

const result = await parseText(textContent, { format: 'markdown' });

Types

DocumentMetadata

interface DocumentMetadata {
  pageCount: number;
  wordCount: number;
  charCount: number;
  title?: string;
  author?: string;
  createdAt?: Date;
  modifiedAt?: Date;
  extra?: Record<string, unknown>;
}

TableData

interface TableData {
  pageNumber: number;
  tableIndex: number;
  headers: string[];
  rows: string[][];
  cells?: CellData[][];
}
 
interface CellData {
  value: string;
  type: 'string' | 'number' | 'boolean' | 'date' | 'formula' | 'empty';
  raw?: unknown;
}
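
The type field on CellData makes it possible to convert the string value into a typed JavaScript value. The following is a sketch only: cellToValue is a hypothetical helper, and it assumes that value holds text that Number and Date can parse, which this page does not confirm. When conversion fails, it falls back to the original string.

```typescript
// Local copy of the documented CellData shape.
interface CellData {
  value: string;
  type: 'string' | 'number' | 'boolean' | 'date' | 'formula' | 'empty';
  raw?: unknown;
}

type CellValue = string | number | boolean | Date | null;

// Hypothetical helper: coerce a cell to a typed value based on its type tag.
function cellToValue(cell: CellData): CellValue {
  switch (cell.type) {
    case 'number': {
      const n = Number(cell.value);
      // Assumption: value is a plain numeric string; otherwise keep the text.
      return Number.isNaN(n) ? cell.value : n;
    }
    case 'boolean':
      return cell.value.trim().toLowerCase() === 'true';
    case 'date': {
      const d = new Date(cell.value);
      // Assumption: value is a date string that Date can parse.
      return Number.isNaN(d.getTime()) ? cell.value : d;
    }
    case 'empty':
      return null;
    default:
      // 'string' and 'formula' cells are returned as their text.
      return cell.value;
  }
}

console.log(cellToValue({ value: '42', type: 'number' }));
console.log(cellToValue({ value: '', type: 'empty' }));
```

If raw already holds the typed value for a given format, reading it directly would be simpler, but its contents are not specified here.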

ParseStats

interface ParseStats {
  durationMs: number;
  fileSizeBytes: number;
  pagesParsed: number;
  tablesExtracted: number;
  imagesFound: number;
  warnings: string[];
}
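
ParseStats is handy for logging. A minimal sketch of a one-line summary follows. summarizeStats is a hypothetical helper, not part of the package, and the sample numbers are invented.

```typescript
// Local copy of the documented ParseStats shape.
interface ParseStats {
  durationMs: number;
  fileSizeBytes: number;
  pagesParsed: number;
  tablesExtracted: number;
  imagesFound: number;
  warnings: string[];
}

// Hypothetical helper: format the stats as a single log line.
function summarizeStats(stats: ParseStats): string {
  const seconds = (stats.durationMs / 1000).toFixed(2);
  const kib = (stats.fileSizeBytes / 1024).toFixed(1);
  return (
    `${stats.pagesParsed} page(s), ` +
    `${stats.tablesExtracted} table(s), ` +
    `${stats.imagesFound} image(s) ` +
    `from ${kib} KiB in ${seconds} s, ` +
    `${stats.warnings.length} warning(s)`
  );
}

// Invented sample data standing in for result.stats
console.log(
  summarizeStats({
    durationMs: 1500,
    fileSizeBytes: 2048,
    pagesParsed: 3,
    tablesExtracted: 1,
    imagesFound: 0,
    warnings: ['example warning'],
  }),
);
```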

Error Handling

import { parseDocument, ParserError } from '@aicr/document-parser';
 
try {
  const result = await parseDocument(buffer);
} catch (error) {
  if (error instanceof ParserError) {
    switch (error.code) {
      case 'UNSUPPORTED_FORMAT':
        console.error('File format not supported');
        break;
      case 'ENCRYPTED_DOCUMENT':
        console.error('Document is password protected');
        break;
      case 'CORRUPTED_FILE':
        console.error('File appears to be corrupted');
        break;
      case 'PARSE_FAILED':
        console.error('Parse failed:', error.message);
        break;
    }
  } else {
    // Not a parser error: rethrow rather than silently swallow it
    throw error;
  }
}
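
When the same handling is needed in several places, the switch can be factored into a function that maps an error to a message. The sketch below does this with a stand-in class, ParserErrorLike, so that it runs without the package; in real code the check would use the imported ParserError. Only the four codes shown above are covered, and whether the package defines others is not stated here.

```typescript
// Stand-in for the package's ParserError (assumed to expose a string code).
class ParserErrorLike extends Error {
  readonly code: string;
  constructor(code: string, message: string) {
    super(message);
    this.code = code;
    this.name = 'ParserErrorLike';
    // Keeps instanceof working when compiled to older targets.
    Object.setPrototypeOf(this, new.target.prototype);
  }
}

// Messages for the codes that need no extra detail.
const FIXED_MESSAGES: Record<string, string> = {
  UNSUPPORTED_FORMAT: 'File format not supported',
  ENCRYPTED_DOCUMENT: 'Document is password protected',
  CORRUPTED_FILE: 'File appears to be corrupted',
};

// Hypothetical helper: turn any thrown value into a user-facing message.
function describeParseError(error: unknown): string {
  if (error instanceof ParserErrorLike) {
    if (error.code === 'PARSE_FAILED') {
      return `Parse failed: ${error.message}`;
    }
    return FIXED_MESSAGES[error.code] ?? `Unknown parser error (${error.code})`;
  }
  return error instanceof Error ? error.message : String(error);
}

console.log(describeParseError(new ParserErrorLike('ENCRYPTED_DOCUMENT', 'locked')));
```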

Supported Formats

Format    Extension  MIME Type
PDF       .pdf       application/pdf
Word      .docx      application/vnd.openxmlformats-officedocument.wordprocessingml.document
Excel     .xlsx      application/vnd.openxmlformats-officedocument.spreadsheetml.sheet
Text      .txt       text/plain
Markdown  .md        text/markdown
HTML      .html      text/html
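
To illustrate how the filename and mimeType options could map onto these formats, here is a small lookup sketch. It is not the package's actual detection logic: guessFormat is a hypothetical helper, and giving the MIME type precedence is an assumption based on its description as an override.

```typescript
type DocFormat = 'pdf' | 'docx' | 'xlsx' | 'text' | 'markdown' | 'html';

// Lookup tables built from the supported formats listed above.
const BY_EXTENSION: Record<string, DocFormat> = {
  '.pdf': 'pdf',
  '.docx': 'docx',
  '.xlsx': 'xlsx',
  '.txt': 'text',
  '.md': 'markdown',
  '.html': 'html',
};

const BY_MIME: Record<string, DocFormat> = {
  'application/pdf': 'pdf',
  'application/vnd.openxmlformats-officedocument.wordprocessingml.document': 'docx',
  'application/vnd.openxmlformats-officedocument.spreadsheetml.sheet': 'xlsx',
  'text/plain': 'text',
  'text/markdown': 'markdown',
  'text/html': 'html',
};

// Hypothetical helper: MIME type first (assumed override), then extension.
function guessFormat(opts: { filename?: string; mimeType?: string }): DocFormat | undefined {
  if (opts.mimeType) {
    // Drop parameters such as "; charset=utf-8" before the lookup.
    const mime = (opts.mimeType.split(';')[0] ?? '').trim().toLowerCase();
    const byMime = BY_MIME[mime];
    if (byMime) return byMime;
  }
  if (opts.filename) {
    const dot = opts.filename.lastIndexOf('.');
    if (dot >= 0) {
      return BY_EXTENSION[opts.filename.slice(dot).toLowerCase()];
    }
  }
  return undefined;
}

console.log(guessFormat({ filename: 'Report.PDF' }));
console.log(guessFormat({ filename: 'notes.md', mimeType: 'text/html; charset=utf-8' }));
```

A helper like this can be used to reject unsupported uploads early, before any buffer is handed to the parser.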