How to install Unstructured for LangChain document loaders
Added 2023-02-21 23:40:29 +0000 UTC
This post is for video: https://youtu.be/svzd5d1LXGk
If you're looking to install Unstructured for LangChain's document loaders, then you're in the right place. In this guide, we'll walk you through the step-by-step process of installing Unstructured and its dependencies, including LangChain and OpenAI.
- Create a new environment
To begin, create a new environment using a Python version lower than 3.11. You can do this by running the following command:
conda create -n unstructured python=3.10
Once the environment is created, activate it using:
conda activate unstructured
- Upgrade pip and setuptools
Before installing any packages, it's a good idea to upgrade pip and setuptools. To do this, run the following commands:
pip install --upgrade setuptools
python.exe -m pip install --upgrade pip
If you encounter a permission error when upgrading pip, try this command instead:
python.exe -m pip install --upgrade pip --user
- Install LangChain and OpenAI
Next, install LangChain and OpenAI by running:
pip install langchain
pip install openai
Be sure to read the installation instructions for these two packages carefully before beginning.
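As a quick sanity check at this point, a short script along these lines can confirm that the active interpreter is older than 3.11 and that both packages can be found. This is only a sketch: the helper names `python_version_ok` and `missing_modules` are made up for illustration and are not part of LangChain or OpenAI.

```python
import importlib.util
import sys


def python_version_ok(version_info=sys.version_info, max_exclusive=(3, 11)):
    """Return True if the (major, minor) version is below max_exclusive."""
    return tuple(version_info[:2]) < max_exclusive


def missing_modules(names):
    """Return the module names that cannot be found in this environment."""
    return [name for name in names if importlib.util.find_spec(name) is None]


# Report on the environment that is currently active.
print("Python older than 3.11:", python_version_ok())
print("Missing packages:", missing_modules(["langchain", "openai"]))
```

Run it from the activated unstructured environment. An empty "Missing packages" list means both installs were picked up by this interpreter.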
- Install Git
To install Git, visit the following website and download the appropriate version for your operating system (this link is the Windows download page): https://git-scm.com/download/win
- Install unstructured[local-inference]
To install Unstructured, run the following command:
pip install unstructured[local-inference]
This command will take some time to complete. If you encounter any numpy errors, upgrade numpy by running:
pip install numpy --upgrade
You'll also need to install Cython and torch:
pip install cython
pip3 install torch torchvision torchaudio
- Install Detectron2
To install Detectron2, you can follow the instructions at this helpful link: https://haroonshakeel.medium.com/detectron2-setup-on-windows-10-and-linux-407e5382df1
Alternatively, you can clone the Detectron2 repository and install it with its requirements as follows:
git clone https://github.com/facebookresearch/detectron2.git
cd detectron2
pip install -e .
cd ..
pip install opencv-python
- Install layoutparser
To install layoutparser, run:
pip install layoutparser[layoutmodels,tesseract]
- Install other dependencies
Install the other dependencies required by Unstructured by running the following commands:
pip install python-magic
pip install python-magic-bin
You'll also need to download and install Poppler and Tesseract. Download 7-Zip (https://www.7-zip.org/) and use it to extract Poppler into your working directory, then add Poppler's bin folder to PATH. Next, download and install Tesseract from this website: https://github.com/UB-Mannheim/tesseract/wiki
Note the Tesseract installation path, add it to your PATH environment variable, and install the pytesseract package by running:
pip install pytesseract
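Once the PATH entries are in place, a small check like the one below can show whether a new terminal actually sees the external tools. It is a sketch under assumptions: the executable names `pdftoppm` (one of the programs shipped in Poppler's bin folder) and `tesseract` are what these tools are normally called, and `find_tools` is a hypothetical helper written for this example.

```python
import shutil

# Assumed executable names: "pdftoppm" comes with Poppler, "tesseract" with Tesseract.
REQUIRED_TOOLS = ["git", "pdftoppm", "tesseract"]


def find_tools(names):
    """Map each executable name to its full path, or None if it is not on PATH."""
    return {name: shutil.which(name) for name in names}


# Print where each tool was found, or flag it as missing.
for name, path in find_tools(REQUIRED_TOOLS).items():
    print(f"{name}: {path or 'NOT FOUND on PATH'}")
```

If a tool shows as not found, open a fresh terminal first, since PATH changes usually do not reach terminals that were already open.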
NOTE: You will have to add the path of Poppler's bin folder and Tesseract's installation path to your system PATH environment variable. This is explained much more clearly in the video!
- Install NLTK dependencies
Finally, run the following commands to install the NLTK dependencies:
python -c "import nltk; nltk.download('punkt')"
python -c "import nltk; nltk.download('averaged_perceptron_tagger')"
- Restart VS Code
Restart VS Code and ensure that the unstructured environment is active in VS Code by checking the Python interpreter (Ctrl + Shift + P).
By following these instructions, you should be able to successfully install Unstructured for LangChain's document loaders. If you encounter any issues, be sure to refer to the documentation and installation instructions.
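To confirm the whole setup end to end, a minimal loading test might look like the sketch below. It assumes the import path `langchain.document_loaders`, which matches LangChain releases from around the time of this post (newer releases have moved the loaders), and `example.txt` is a hypothetical file name standing in for one of your own documents.

```python
from pathlib import Path

try:
    # Import path assumed from early-2023 LangChain; newer releases differ.
    from langchain.document_loaders import UnstructuredFileLoader
except ImportError:
    UnstructuredFileLoader = None


def load_documents(path):
    """Load one file with UnstructuredFileLoader and return the resulting documents."""
    if UnstructuredFileLoader is None:
        raise RuntimeError("UnstructuredFileLoader could not be imported in this environment")
    return UnstructuredFileLoader(str(path)).load()


sample = Path("example.txt")  # hypothetical file; point this at your own document
if UnstructuredFileLoader is not None and sample.exists():
    for doc in load_documents(sample)[:1]:
        # Show the start of the extracted text as a smoke test.
        print(doc.page_content[:200])
```

If this prints the beginning of your file, the loader and its dependencies are wired up correctly for that file type.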
Comments
Hi Patrick. I think the problem can result from two things: Tesseract, which does OCR (optical character recognition), and/or Detectron, which detects the layout in an image-like object. Unless they update those libraries and Unstructured adopts their latest versions, this problem will persist. You can take a look at other PDF readers and see if they can extract the text better. If there is such a package, this approach should work. You would just have to write logic to deal with multiple PDF files. I hope this is helpful.
Echo Hive
2023-03-31 20:11:49 +0000 UTC
Hi everyone, I followed this tutorial and now I have a program that runs in VS Code on Windows 11. It uses a document loader to read text from URLs, images and all regular files like txt, md, py, etc. It mostly works, but images can sometimes be problematic. E.g., if I load a two-page infographic .jpg with UnstructuredFileLoader() it will only extract some of the text from the image. IDK whether UnstructuredFileLoader is causing the issue. Or is it an image segmentation problem (Detectron2)? Other possibilities: LayoutParser? Tesseract? PyTesseract? Opencv-python? Torchvision? Trawling for suggestions. Here, there, everywhere. Anyone?
Patrick Young
2023-03-31 18:49:03 +0000 UTC
Thank you very much for sharing your own installation experience and tips!
Echo Hive
2023-03-27 20:05:10 +0000 UTC
I haven’t used Unstructured much. But it is the default loader for almost all document loaders in LangChain. So I wanted to make a video about how to install it.
Echo Hive
2023-03-27 20:04:25 +0000 UTC
PS Echohive, do you use it a lot? What sort of use cases benefit?
Patrick Young
2023-03-27 17:15:48 +0000 UTC
Bingo. Works like a charm! Joined Patreon. Thx Echohive! NB the instruction post from Patreon wouldn't work for me. Most problems are caused by package incompatibility. NB my system: Windows 11 Home running on an Intel(R) Core(TM) i7-7700 CPU @ 3.60GHz with an Nvidia RTX 3050 GPU. As others have found, I also ran into problems pip installing unstructured[local-inference], pycocotools errors and ninja build compilation errors when trying to get detectron2 up & running. (You need C++ build tools; install them from Visual Studio.) Also, I installed detectron2 *first* using a .yml I found on Stack Overflow because I realised that using the latest versions of Python, detectron2 and pytorch wouldn't work. I could get it to work by using python=3.8, cudatoolkit=11.0, pytorch==1.7.1 and quite an old version of detectron2 from the facebookresearch/detectron2 git repo. If Echohive or anyone else wants the .yml file, see step 2 on https://stackoverflow.com/questions/60631933/install-detectron2-on-windows-10 in the answer posted by DV82XL (user:5752730).
Patrick Young
2023-03-27 17:13:00 +0000 UTC
Thank you for sharing that! 🙌
Echo Hive
2023-03-25 18:09:06 +0000 UTC
You need to install Microsoft build tools. I installed the Visual Studio C++ development module to get this & then it built pycocotools.
Patrick Young
2023-03-25 18:07:38 +0000 UTC
Make sure to upgrade setuptools and upgrade pip before pip installing:
pip install --upgrade setuptools
python.exe -m pip install --upgrade pip
Echo Hive
2023-03-03 22:33:16 +0000 UTC
I'm about to attempt to tackle the problem with ChatGPT, but thought I'd ask this awesome community alongside.
Kris Wilkinson
2023-03-03 11:33:00 +0000 UTC
note: This error originates from a subprocess, and is likely not a problem with pip.
ERROR: Failed building wheel for pycocotools
Successfully built unstructured-inference python-docx python-pptx unstructured iopath antlr4-python3-runtime
Failed to build pycocotools
ERROR: Could not build wheels for pycocotools, which is required to install pyproject.toml-based projects
Kris Wilkinson
2023-03-03 11:31:51 +0000 UTC
I'm coming across the below error whilst attempting to pip install unstructured[local-inference]
Kris Wilkinson
2023-03-03 11:31:38 +0000 UTC
This just may be the holy grail for self-taught developers like me as this is the part that takes the most skill, and most time to understand!
Kris Wilkinson
2023-03-03 11:30:47 +0000 UTC
You can definitely try, although I am not sure if they will work. My hunch is that they should.
Echo Hive
2023-02-25 19:55:59 +0000 UTC
Do you think I can use https://llamahub.ai/ functions as a loader, and use it as input to LangChain? I find using LlamaHub loaders and GPTIndex much simpler, and it has pdf, docx and many other loaders (the website says it can be used with LangChain too!).
Vipin Kasarla
2023-02-25 19:54:31 +0000 UTC
It probably won’t work with MS Word documents because, as far as I could tell, you need “LibreOffice” installed for that, which seemed complicated to me.
Echo Hive
2023-02-25 19:39:11 +0000 UTC
Thanks for this post, because I want to use this with your previous post on ChatGPT with documents. Q: Would this also work with MS Word documents? (If not, how do I make it work?)
Vipin Kasarla
2023-02-25 19:03:31 +0000 UTC