Writer PDF Parser

This notebook provides a quick overview for getting started with the Writer PDFParser document loader.

Writer's PDF Parser converts PDF documents into other formats, such as plain text or Markdown. This is useful when you need to extract text from PDF files for further analysis or to integrate it into your workflow. The langchain-writer package exposes Writer's PDF Parser as a LangChain document parser.

Overview

Integration details

| Class | Package | Local | Serializable | JS support | Package downloads | Package latest |
| --- | --- | --- | --- | --- | --- | --- |
| PDFParser | langchain-writer | ❌ | ❌ | ❌ | PyPI - Downloads | PyPI - Version |

Setup

The PDFParser is available in the langchain-writer package:

%pip install --quiet -U langchain-writer

Credentials

Sign up for Writer AI Studio to generate an API key (you can follow this Quickstart). Then, set the WRITER_API_KEY environment variable:

import getpass
import os

# Prompt for the key only if it is not already set in the environment
if not os.getenv("WRITER_API_KEY"):
    os.environ["WRITER_API_KEY"] = getpass.getpass("Enter your Writer API key: ")
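
In a non-interactive script or CI job, getpass cannot prompt for input, so it can help to fail early with a clear message instead. The following is a minimal sketch; the helper name require_api_key is hypothetical and not part of langchain-writer:

```python
import os


def require_api_key(env=os.environ, name="WRITER_API_KEY"):
    """Return the API key from the environment mapping, or raise if it is missing.

    Hypothetical helper for illustration only; it is not part of langchain-writer.
    """
    key = env.get(name, "").strip()
    if not key:
        raise RuntimeError(f"{name} is not set; export it before running this script.")
    return key


# Demonstrated with a plain dict so the sketch runs without a real key
print(require_api_key({"WRITER_API_KEY": "example-key"}))
```

Passing the environment as a parameter keeps the check easy to test without touching the real process environment.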

It's also helpful (but not required) to set up LangSmith for observability. To do so, set the LANGCHAIN_TRACING_V2 and LANGCHAIN_API_KEY environment variables:

# os.environ["LANGCHAIN_TRACING_V2"] = "true"
# os.environ["LANGCHAIN_API_KEY"] = getpass.getpass()

Instantiation

Next, create an instance of the Writer PDF Parser with the desired output format:

from langchain_writer.pdf_parser import PDFParser

parser = PDFParser(format="markdown")

Usage

You can use the PDF Parser synchronously or asynchronously. In either case, it returns a list of Document objects, each containing the parsed content of one page of the PDF file.

Synchronous usage

To invoke the PDF Parser synchronously, create a Blob object that references the PDF file you want to parse and pass it to the parse method:

from langchain_core.documents.base import Blob

file = Blob.from_path("../../data/page_to_parse.pdf")

parsed_pages = parser.parse(blob=file)
parsed_pages
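
Because the result is one Document per page, a common next step is to join the pages back into a single string. The sketch below assumes only that each item exposes a page_content attribute, as LangChain Document objects do. The join_pages helper is hypothetical, and SimpleNamespace objects stand in for real parsed pages so the example runs without an API call:

```python
from types import SimpleNamespace


def join_pages(pages, separator="\n\n"):
    """Concatenate the page_content of each page, skipping blank pages."""
    return separator.join(
        page.page_content for page in pages if page.page_content.strip()
    )


# Stand-in objects mimicking the page_content attribute of parsed pages
demo_pages = [
    SimpleNamespace(page_content="# Title", metadata={}),
    SimpleNamespace(page_content="   ", metadata={}),
    SimpleNamespace(page_content="Body text", metadata={}),
]

full_text = join_pages(demo_pages)
print(full_text)
```

With real output, the call would be join_pages(parsed_pages).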

Asynchronous usage

To invoke the PDF Parser asynchronously, pass a Blob object that references the PDF file you want to parse to the aparse method:

parsed_pages_async = await parser.aparse(blob=file)
parsed_pages_async
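
The top-level await above works in a notebook, where an event loop is already running. In a plain Python script, the call needs to be wrapped in a coroutine and started with asyncio.run. The following sketch shows that pattern; StubParser is a hypothetical stand-in with the same aparse(blob=...) call shape, used so the example runs without an API key:

```python
import asyncio


class StubParser:
    """Hypothetical stand-in for PDFParser; it returns placeholder strings."""

    async def aparse(self, blob):
        await asyncio.sleep(0)  # yield to the event loop, as a network call would
        return [f"parsed page from {blob}"]


async def main(parser, blob):
    # Inside a coroutine, await is allowed in any Python script
    return await parser.aparse(blob=blob)


# asyncio.run creates an event loop, runs main to completion, and closes the loop
parsed_pages_async = asyncio.run(main(StubParser(), "page_to_parse.pdf"))
print(parsed_pages_async)
```

In real use, the parser and Blob created earlier would replace the stub and the file name string.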

API reference

For detailed documentation of all PDFParser features and configurations, head to the API reference.

Additional resources

You can find information about Writer's models (including costs, context windows, and supported input types) and tools in the Writer docs.
