echohive42

How to install Unstructured for Langchain Document loaders

Added 2023-02-21 23:40:29 +0000 UTC

This post is for video: https://youtu.be/svzd5d1LXGk

If you're looking to install Unstructured for LangChain's document loaders, then you're in the right place. In this guide, we'll walk you through the step-by-step process of installing Unstructured and its dependencies, including LangChain and OpenAI.

Create a new environment

To begin, create a new environment using a Python version below 3.11. You can do this by running the following command:

conda create -n unstructured python=3.10

Once the environment is created, activate it using:

conda activate unstructured
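
To confirm that the activated environment really runs a Python version below 3.11, a small check like the one below may help. It is only a sketch; the helper name `is_supported_python` is made up for this example and is not part of any package used in this guide.

```python
import sys


def is_supported_python(version_info=None):
    """Return True when the version is below 3.11, as this guide requires."""
    if version_info is None:
        version_info = sys.version_info
    # Compare only (major, minor); the patch level does not matter here.
    return tuple(version_info[:2]) < (3, 11)


if __name__ == "__main__":
    print("Python", sys.version.split()[0], "supported:", is_supported_python())
```

Run it with the `unstructured` environment active; if it reports that the version is not supported, the environment was probably not activated.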

Upgrade pip and setuptools

Before installing any packages, it's a good idea to upgrade pip and setuptools. To do this, run the following commands:

pip install --upgrade setuptools
python.exe -m pip install --upgrade pip

If you encounter a permission error with pip upgrade, then try this command instead:

python.exe -m pip install --upgrade pip --user

Install LangChain and OpenAI

Next, install LangChain and the OpenAI package by running:

pip install langchain
pip install openai

Be sure to read the installation instructions for these two packages carefully before you begin.
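
One way to confirm that both packages landed in the active environment is to ask Python for their installed versions. The snippet below is a minimal sketch using the standard library's `importlib.metadata`; the helper name `installed_version` is an assumption made for illustration.

```python
from importlib import metadata


def installed_version(package_name):
    """Return the installed version of a pip package, or None if it is missing."""
    try:
        return metadata.version(package_name)
    except metadata.PackageNotFoundError:
        return None


if __name__ == "__main__":
    for name in ("langchain", "openai"):
        print(name, installed_version(name) or "not installed")
```

The same helper can be reused later for `unstructured`, `layoutparser`, or `pytesseract`.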

Install Git

Git is needed later to clone the Detectron2 repository. To install it, visit the following website and download the appropriate installer for your system: https://git-scm.com/download/win

Install unstructured[local-inference]

To install Unstructured, run the following command:

pip install unstructured[local-inference]

This command will take some time to complete. If you encounter any numpy errors, then upgrade numpy by running:

pip install numpy --upgrade

You'll also need to install Cython and torch:

pip install cython
pip3 install torch torchvision torchaudio

Install Detectron2

For the Detectron2 installation, you can follow the instructions at this link: https://haroonshakeel.medium.com/detectron2-setup-on-windows-10-and-linux-407e5382df1

Alternatively, you can clone the Detectron2 repository, install it from source, and then install OpenCV as follows:

git clone https://github.com/facebookresearch/detectron2.git
cd detectron2
pip install -e .
cd ..
pip install opencv-python

Install layoutparser

To install layoutparser, run:

pip install layoutparser[layoutmodels,tesseract]
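
At this point several large packages have been installed, and it can be useful to check that Python can locate all of them. The following is a rough sketch, not an official check: the helper name `missing_modules` is hypothetical, and it only confirms that each module can be found, not that it imports without errors.

```python
import importlib.util


def missing_modules(module_names):
    """Return the names that the import system cannot locate."""
    return [name for name in module_names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    # Import names can differ from pip names: opencv-python is imported as cv2.
    expected = ["unstructured", "numpy", "torch", "detectron2", "cv2", "layoutparser"]
    missing = missing_modules(expected)
    print("Missing:", ", ".join(missing) if missing else "none")
```

Any name reported as missing points back to the install step for that package.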

Install other dependencies

Install the other dependencies required by Unstructured by running the following commands:

pip install python-magic
pip install python-magic-bin

You'll also need to download and install Poppler and Tesseract. Download 7-Zip (https://www.7-zip.org/) and use it to extract the Poppler archive into your working directory, then add Poppler's bin folder to PATH. Next, download and install Tesseract from this page: https://github.com/UB-Mannheim/tesseract/wiki

Note the Tesseract installation path, add it to your PATH environment variable, and install the pytesseract package by running:

pip install pytesseract

NOTE: You must add both the path to Poppler's bin folder and Tesseract's installation path to your system PATH environment variable. The video explains this step more clearly.
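
To see whether the PATH changes took effect, you can look up the executables from Python. This is a minimal sketch using the standard library's `shutil.which`; the helper name `locate_tools` is made up for this example, and it assumes the Poppler build you downloaded includes the usual `pdftoppm` utility in its bin folder.

```python
import shutil


def locate_tools(tool_names):
    """Map each executable name to its full path on PATH, or None if not found."""
    return {name: shutil.which(name) for name in tool_names}


if __name__ == "__main__":
    # pdftoppm is one of Poppler's utilities; tesseract is the Tesseract OCR executable.
    for name, path in locate_tools(["pdftoppm", "tesseract"]).items():
        print(name, "->", path or "not on PATH")
```

Open a new terminal (or restart VS Code) before running it, because a terminal that was already open does not pick up changes to PATH.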

Install NLTK dependencies

Finally, run the following commands to install the NLTK dependencies:

python -c "import nltk; nltk.download('punkt')"
python -c "import nltk; nltk.download('averaged_perceptron_tagger')"
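
If you want to confirm that both downloads succeeded, NLTK can be asked whether the resources are present locally. The sketch below assumes the resources live under NLTK's usual `tokenizers/` and `taggers/` folders; the helper name `nltk_resources_present` is hypothetical.

```python
def nltk_resources_present(resource_paths):
    """Map each NLTK resource path to True if it is found locally, else False."""
    try:
        import nltk
    except ImportError:
        # NLTK itself is missing, so none of the resources can be available.
        return {path: False for path in resource_paths}
    status = {}
    for path in resource_paths:
        try:
            nltk.data.find(path)  # raises LookupError when the resource is absent
            status[path] = True
        except LookupError:
            status[path] = False
    return status


if __name__ == "__main__":
    wanted = ["tokenizers/punkt", "taggers/averaged_perceptron_tagger"]
    for path, found in nltk_resources_present(wanted).items():
        print(path, "found" if found else "missing")
```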

Restart VS Code

Restart VS Code and make sure the unstructured environment is active by checking the selected Python interpreter (press Ctrl + Shift + P and run "Python: Select Interpreter").
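
With the environment selected, a short smoke test can show whether LangChain's Unstructured loader is usable. Treat the following as a sketch under assumptions: the first import path matches LangChain releases from around the time of this post, newer releases moved the loaders into the separate langchain-community package, and `example.pdf` is a placeholder file name rather than a file that comes with this guide.

```python
import os


def get_unstructured_loader_class():
    """Return LangChain's UnstructuredFileLoader class, or None if unavailable."""
    try:
        # Import path used by LangChain releases from around the time of this post.
        from langchain.document_loaders import UnstructuredFileLoader
    except ImportError:
        try:
            # Newer releases keep the loaders in the langchain-community package.
            from langchain_community.document_loaders import UnstructuredFileLoader
        except ImportError:
            return None
    return UnstructuredFileLoader


if __name__ == "__main__":
    sample_path = "example.pdf"  # placeholder; point this at a real file
    loader_class = get_unstructured_loader_class()
    if loader_class is None:
        print("The Unstructured loader is not importable in this environment.")
    elif not os.path.exists(sample_path):
        print("Loader found, but", sample_path, "does not exist.")
    else:
        documents = loader_class(sample_path).load()
        print(len(documents), "document(s) loaded")
```

If the loader is reported as not importable, check that VS Code is really using the interpreter from the unstructured environment.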

By following these instructions, you should be able to successfully install Unstructured for LangChain's document loaders. If you encounter any issues, refer to the documentation and installation instructions for the packages involved.