Document Parser
Parse PDF, DOCX, XLSX, and text files into structured data.
Installation
pnpm add @aicr/document-parser
Quick Start
import { parseDocument, parsePDF, parseDOCX, parseXLSX } from '@aicr/document-parser';
// Auto-detect format
const result = await parseDocument(buffer, { filename: 'report.pdf' });
console.log(result.text); // Full text content
console.log(result.pages); // Page-by-page content
console.log(result.tables); // Extracted tables
console.log(result.metadata); // Document metadata
API Reference
parseDocument(input, options)
Automatically detects file format and parses the document.
Parameters:
- input: Buffer | ArrayBuffer - Document content
- options: ParseOptions
  - filename?: string - Filename for format detection
  - mimeType?: string - MIME type override
  - maxPages?: number - Limit pages to parse
  - extractTables?: boolean - Extract table data (default: true)
  - extractImages?: boolean - Extract image metadata (default: false)
Returns: Promise<ParseResult>
interface ParseResult {
  format: 'pdf' | 'docx' | 'xlsx' | 'text' | 'markdown' | 'html';
  text: string;
  metadata: DocumentMetadata;
  pages: PageContent[];
  tables: TableData[];
  images: ImageData[];
  stats: ParseStats;
}
parsePDF(input, options)
Parse PDF documents specifically.
const result = await parsePDF(pdfBuffer, { maxPages: 10 });
parseDOCX(input, options)
Parse Word documents.
const result = await parseDOCX(docxBuffer);
parseXLSX(input, options)
Parse Excel spreadsheets. Returns structured table data.
const result = await parseXLSX(xlsxBuffer);
// Access sheet data
result.tables.forEach(table => {
  console.log(`Sheet ${table.pageNumber}:`, table.headers);
  table.rows.forEach(row => console.log(row));
});
parseText(input, options)
Parse plain text, markdown, or HTML.
const result = await parseText(textContent, { format: 'markdown' });
Types
DocumentMetadata
interface DocumentMetadata {
  pageCount: number;
  wordCount: number;
  charCount: number;
  title?: string;
  author?: string;
  createdAt?: Date;
  modifiedAt?: Date;
  extra?: Record<string, unknown>;
}
TableData
interface TableData {
  pageNumber: number;
  tableIndex: number;
  headers: string[];
  rows: string[][];
  cells?: CellData[][];
}
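
Because `headers` and `rows` are parallel arrays, a common next step is to pair them up into keyed records. The following is a minimal sketch of that idea, not part of the package: `tableToRecords` is a hypothetical helper, the `TableData` interface is a trimmed local copy so the snippet stands alone, and padding short rows with empty strings is an assumption.

```typescript
// Trimmed local copy of the documented shape so the sketch is self-contained.
interface TableData {
  pageNumber: number;
  tableIndex: number;
  headers: string[];
  rows: string[][];
}

// Hypothetical helper: pairs each row value with the header at the same index.
function tableToRecords(table: TableData): Record<string, string>[] {
  return table.rows.map((row) => {
    const record: Record<string, string> = {};
    table.headers.forEach((header, i) => {
      // Assumption: rows shorter than the header list are padded with ''.
      record[header] = row[i] ?? '';
    });
    return record;
  });
}

// Made-up sample data for illustration only.
const sample: TableData = {
  pageNumber: 1,
  tableIndex: 0,
  headers: ['name', 'qty'],
  rows: [['apple', '3'], ['pear']],
};

console.log(tableToRecords(sample));
```

If two headers share the same text, the later column overwrites the earlier one in this sketch, so real spreadsheets may need a de-duplication step first.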
interface CellData {
  value: string;
  type: 'string' | 'number' | 'boolean' | 'date' | 'formula' | 'empty';
  raw?: unknown;
}
ParseStats
interface ParseStats {
  durationMs: number;
  fileSizeBytes: number;
  pagesParsed: number;
  tablesExtracted: number;
  imagesFound: number;
  warnings: string[];
}
Error Handling
import { parseDocument, ParserError } from '@aicr/document-parser';
try {
  const result = await parseDocument(buffer);
} catch (error) {
  if (error instanceof ParserError) {
    switch (error.code) {
      case 'UNSUPPORTED_FORMAT':
        console.error('File format not supported');
        break;
      case 'ENCRYPTED_DOCUMENT':
        console.error('Document is password protected');
        break;
      case 'CORRUPTED_FILE':
        console.error('File appears to be corrupted');
        break;
      case 'PARSE_FAILED':
        console.error('Parse failed:', error.message);
        break;
    }
  }
}
Supported Formats
| Format | Extension | MIME Type |
|---|---|---|
| PDF | .pdf | application/pdf |
| Word | .docx | application/vnd.openxmlformats-officedocument.wordprocessingml.document |
| Excel | .xlsx | application/vnd.openxmlformats-officedocument.spreadsheetml.sheet |
| Text | .txt | text/plain |
| Markdown | .md | text/markdown |
| HTML | .html | text/html |
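
The table above can also be written as a lookup, which is useful for checking whether a file is supported before handing it to the parser. The sketch below is illustrative only: `detectFormat`, `FORMAT_BY_EXTENSION`, and `FORMAT_BY_MIME` are hypothetical names, not part of the package's API. It assumes that an explicit MIME type takes priority over the filename (since `mimeType` is described as an override) and that an unrecognized MIME type falls back to the extension; the library's actual detection logic may differ.

```typescript
// Hypothetical helper, not part of @aicr/document-parser.
type DocumentFormat = 'pdf' | 'docx' | 'xlsx' | 'text' | 'markdown' | 'html';

// Extension and MIME mappings copied from the Supported Formats table.
const FORMAT_BY_EXTENSION = new Map<string, DocumentFormat>([
  ['.pdf', 'pdf'],
  ['.docx', 'docx'],
  ['.xlsx', 'xlsx'],
  ['.txt', 'text'],
  ['.md', 'markdown'],
  ['.html', 'html'],
]);

const FORMAT_BY_MIME = new Map<string, DocumentFormat>([
  ['application/pdf', 'pdf'],
  ['application/vnd.openxmlformats-officedocument.wordprocessingml.document', 'docx'],
  ['application/vnd.openxmlformats-officedocument.spreadsheetml.sheet', 'xlsx'],
  ['text/plain', 'text'],
  ['text/markdown', 'markdown'],
  ['text/html', 'html'],
]);

function detectFormat(filename?: string, mimeType?: string): DocumentFormat | undefined {
  if (mimeType) {
    // Drop parameters such as "; charset=utf-8" before the lookup.
    const base = (mimeType.split(';')[0] ?? '').trim().toLowerCase();
    const byMime = FORMAT_BY_MIME.get(base);
    if (byMime) return byMime;
  }
  if (filename) {
    // Use the text from the last dot onward as the extension.
    const dot = filename.lastIndexOf('.');
    if (dot >= 0) return FORMAT_BY_EXTENSION.get(filename.slice(dot).toLowerCase());
  }
  return undefined;
}

console.log(detectFormat('report.pdf'));
console.log(detectFormat('upload.bin', 'text/html; charset=utf-8'));
```

A result of `undefined` here corresponds to the case where a caller would expect an `UNSUPPORTED_FORMAT` error from the parser.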