Indexes
This document outlines the process for scraping text from internet sources and document files, splitting text, and inserting documents into a vector database using the Prompt Engineers platform.
Select Source: From the 'Select Loader' dropdown, choose 'Files' or another source such as 'Text', 'YouTube', or 'URLs'.
Upload Files: Click 'Choose Files' to upload documents (supports .txt, .pdf, .csv).
View Document: If you upload a PDF, you can preview it in the interface.
Select Splitter: From the 'Select Splitter' dropdown, choose a method for splitting text (e.g., 'Spacy', 'Python', 'Markdown').
Set Overlap: Configure 'Chunk Overlap' to control how much text each chunk shares with the one before it. Overlap helps preserve context that would otherwise be cut at chunk boundaries.
Collect Text: Click 'Collect' to start the scraping process. The system may display progress, such as "Splitting page 12 into chunks."
Review Chunks: Once the document is processed, review the extracted text chunks.
Upsert Documents:
New Index: Select 'New', enter an 'Index Name', and click 'Create'.
Existing Index: Select 'Existing', choose the index, and upsert the text.
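To make the 'Chunk Overlap' setting concrete, here is a minimal sketch of fixed-size splitting with overlap. It is not the platform's splitter: the function name `split_with_overlap` and its parameters are hypothetical, and it assumes sizes are measured in characters, which the interface may or may not do.

```python
from typing import List


def split_with_overlap(text: str, chunk_size: int, chunk_overlap: int) -> List[str]:
    """Split text into fixed-size chunks that share `chunk_overlap` characters.

    Hypothetical helper for illustration only; sizes are counted in characters.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be >= 0 and smaller than chunk_size")

    # Each new chunk starts this many characters after the previous one.
    step = chunk_size - chunk_overlap
    chunks: List[str] = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        # Stop once a chunk has reached the end of the text.
        if start + chunk_size >= len(text):
            break
    return chunks


if __name__ == "__main__":
    for chunk in split_with_overlap("abcdefghij", chunk_size=4, chunk_overlap=2):
        print(chunk)
```

A larger overlap repeats more text between neighbouring chunks, which keeps more context per chunk at the cost of producing more chunks to store.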
Choose a loader and splitter that match the document type and the chunking strategy you want.
The 'Upsert' process allows for adding new documents or updating existing ones in the vector database.
Monitor the progress bar or messages to ensure that the scraping and upserting are proceeding correctly.
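The insert-or-update behaviour of an upsert can be sketched with a small in-memory stand-in for an index. This is an assumption-laden illustration, not the platform's or any vector database's API: the `InMemoryIndex` class, its `upsert` method, and the record layout are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class InMemoryIndex:
    """Hypothetical stand-in for a vector index, keyed by document ID."""

    name: str
    records: Dict[str, dict] = field(default_factory=dict)

    def upsert(self, doc_id: str, text: str, vector: List[float]) -> str:
        """Insert the record if doc_id is new, otherwise overwrite it."""
        action = "updated" if doc_id in self.records else "inserted"
        self.records[doc_id] = {"text": text, "vector": vector}
        return action


if __name__ == "__main__":
    index = InMemoryIndex(name="demo-index")
    print(index.upsert("doc-1", "first version", [0.1, 0.2]))
    print(index.upsert("doc-1", "second version", [0.3, 0.4]))
    print(len(index.records))
```

Keying records by a stable ID is what lets a repeated upsert replace an existing entry instead of creating a duplicate.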
This workflow enables users to efficiently process and store large amounts of text data, preparing it for advanced search and retrieval in vector databases.