A web-based platform for converting raw knowledge base files into deployable vector database snapshots for the Gaia Network. Upload your documents, get AI-powered embeddings, and deploy instantly to your Gaia Node.

- Multi-Format Support: Process TXT, Markdown, PDF, and CSV files
- AI-Powered Embeddings: Uses gte-Qwen2-1.5B model with 1536-dimensional vectors
- WasmEdge Acceleration: High-performance processing when available
- Drag & Drop Interface: Intuitive web-based file upload
- Real-Time Progress: Live updates during processing
- Auto-Deploy Ready: Generates Hugging Face URLs for instant Gaia Node deployment
- Cloud & Local: Works with Qdrant Cloud or local instances
- Python 3.8 or higher
- Qdrant instance (local or cloud)
- Hugging Face account (for uploads)
- Optional: WasmEdge for performance optimization
- Clone the repository:
git clone https://github.com/your-org/gaia-console.git
cd gaia-console
- Install dependencies:
pip install -r requirements.txt
- Set up environment variables:
cp .env.example .env
# Edit .env with your configuration
- Configure your environment:
# Qdrant Configuration
QDRANT_URL=http://localhost:6333
QDRANT_API_KEY=your_qdrant_api_key

# Hugging Face Configuration
HF_TOKEN=your_hugging_face_token
HF_DATASET_NAME=your_username/your_dataset_name

# Optional Settings
MAX_FILE_SIZE=10485760  # 10MB
- Start the server:
uvicorn main:app --host 0.0.0.0 --port 8000 --reload
- Open your browser and navigate to http://localhost:8000
- Upload Files: Drag and drop your knowledge base files (TXT, MD, PDF, CSV)
- Process: Click "Generate Snapshot" and watch real-time progress
- Deploy: Copy the generated Hugging Face URL and configuration commands
- Use with Gaia Node:
gaianet config \
  --snapshot YOUR_SNAPSHOT_URL \
  --embedding-url https://huggingface.co/gaianet/gte-Qwen2-1.5B-instruct-GGUF/resolve/main/gte-Qwen2-1.5B-instruct-f16.gguf \
  --embedding-ctx-size 8192
gaianet init
gaianet start
| Format | Description | Processing Method |
|---|---|---|
| TXT | Plain text files | Paragraph-based chunking |
| MD | Markdown documents | Header-based sectioning |
| PDF | PDF documents | Converted to Markdown via markitdown |
| CSV | Tabular data | Row-based processing |
- Maximum file size: 10MB per file
- No limit on number of files per session
- Batch processing supported
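A minimal sketch of how the size and format limits above might be enforced server-side. The `MAX_FILE_SIZE` setting comes from the configuration section below; the function name and error handling are illustrative, not the project's actual API:

```python
import os

# Illustrative defaults; the real limits come from the app's configuration
MAX_FILE_SIZE = int(os.environ.get("MAX_FILE_SIZE", 10_485_760))  # 10 MB
ALLOWED_EXTENSIONS = {".txt", ".md", ".pdf", ".csv"}

def validate_upload(filename: str, size_bytes: int) -> None:
    """Reject files that are too large or of an unsupported format."""
    ext = os.path.splitext(filename)[1].lower()
    if ext not in ALLOWED_EXTENSIONS:
        raise ValueError(f"Unsupported file type: {ext or '(none)'}")
    if size_bytes > MAX_FILE_SIZE:
        raise ValueError(f"File exceeds {MAX_FILE_SIZE} byte limit: {filename}")
```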
| Variable | Description | Required | Default |
|---|---|---|---|
| QDRANT_URL | Qdrant instance URL | Yes | http://localhost:6333 |
| QDRANT_API_KEY | Qdrant API key (for cloud) | No | - |
| HF_TOKEN | Hugging Face write token | Yes | - |
| HF_DATASET_NAME | Target HF dataset | Yes | - |
| MAX_FILE_SIZE | Max file size in bytes | No | 10485760 |
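Loading these variables in application code might look like the following sketch (the `Settings` class and `load_settings` helper are illustrative, not the project's actual code; the defaults mirror the table above):

```python
import os
from dataclasses import dataclass
from typing import Optional

@dataclass
class Settings:
    qdrant_url: str
    qdrant_api_key: Optional[str]
    hf_token: str
    hf_dataset_name: str
    max_file_size: int

def load_settings() -> Settings:
    """Read the variables from the table above, applying the documented defaults.

    Missing required variables raise KeyError at startup rather than
    failing later mid-request.
    """
    return Settings(
        qdrant_url=os.environ.get("QDRANT_URL", "http://localhost:6333"),
        qdrant_api_key=os.environ.get("QDRANT_API_KEY"),  # optional, cloud only
        hf_token=os.environ["HF_TOKEN"],                  # required
        hf_dataset_name=os.environ["HF_DATASET_NAME"],    # required
        max_file_size=int(os.environ.get("MAX_FILE_SIZE", 10_485_760)),
    )
```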
docker run -p 6333:6333 qdrant/qdrant
- Sign up at Qdrant Cloud
- Create a cluster
- Get your API key and URL
- Update your .env file
- Create a Hugging Face account at huggingface.co
- Generate an access token with write permissions
- Create a dataset or use an existing one for snapshots
- Update your .env with the token and dataset name
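The snapshot URLs the app hands back follow Hugging Face's resolve-style download pattern, so an upload helper can be sketched with `huggingface_hub` (the function names and paths here are placeholders, not the project's actual code):

```python
def snapshot_resolve_url(dataset: str, path_in_repo: str) -> str:
    """Build the direct-download URL a Gaia Node expects for a dataset file."""
    return f"https://huggingface.co/datasets/{dataset}/resolve/main/{path_in_repo}"

def upload_snapshot(local_path: str, dataset: str, token: str) -> str:
    """Upload a snapshot archive to a HF dataset and return its resolve URL."""
    from huggingface_hub import HfApi  # pip install huggingface_hub

    api = HfApi(token=token)
    path_in_repo = f"snapshots/{local_path.rsplit('/', 1)[-1]}"
    api.upload_file(
        path_or_fileobj=local_path,
        path_in_repo=path_in_repo,
        repo_id=dataset,
        repo_type="dataset",
    )
    return snapshot_resolve_url(dataset, path_in_repo)
```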
For optimal performance, install WasmEdge:
curl -sSf https://raw.githubusercontent.com/WasmEdge/WasmEdge/master/utils/install_v2.sh | bash
source ~/.bashrc
The application will work without WasmEdge but with reduced performance.
- GET / - Web interface
- POST /process - Process uploaded files
- GET /process-stream - Server-sent events for progress updates
- GET /check-wasm - Check WasmEdge availability
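A client can consume the /process-stream endpoint as a standard server-sent event stream. The sketch below assumes each event carries a JSON payload on a `data:` line (the payload shape shown in the test is hypothetical):

```python
import json
from typing import Iterator, Optional

def parse_sse_line(line: str) -> Optional[dict]:
    """Decode one 'data: {...}' line from an SSE stream; skip other fields."""
    if line.startswith("data:"):
        payload = line[len("data:"):].strip()
        if payload:
            return json.loads(payload)
    return None

def progress_events(lines: Iterator[str]):
    """Yield decoded progress events from an iterable of SSE lines,
    e.g. requests.get(url, stream=True).iter_lines(decode_unicode=True)."""
    for line in lines:
        event = parse_sse_line(line)
        if event is not None:
            yield event
```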
curl -X POST "http://localhost:8000/process" \
-F "[email protected]" \
-F "[email protected]" \
-F "session_id=my_session_123"
Response:
{
"status": "success",
"snapshot_url": "https://huggingface.co/datasets/your_dataset/resolve/main/snapshots/snapshot_20241201_143022.tar.gz",
"message": "Processed 2 files with 150 embeddings"
}
┌─────────────────┐ ┌──────────────┐ ┌─────────────┐
│ Web Browser │◄──►│ FastAPI │◄──►│ Qdrant │
│ (Frontend) │ │ (Backend) │ │ (VectorDB) │
└─────────────────┘ └──────────────┘ └─────────────┘
│
▼
┌──────────────────┐
│ WasmEdge + │
│ gte-Qwen2-1.5B │
│ (AI Processing) │
└──────────────────┘
│
▼
┌──────────────────┐
│ Hugging Face │
│ (Storage) │
└──────────────────┘
- File Upload → Temporary storage with validation
- Format Detection → Route to appropriate processor
- Content Extraction → Text extraction and cleaning
- Embedding Generation → AI model creates vectors
- Vector Storage → Qdrant collection creation
- Snapshot Creation → Database export and compression
- Upload & Deploy → Hugging Face hosting
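As one illustration of the paragraph-based chunking used for TXT files in steps 3-4, a sketch that splits on blank lines and packs paragraphs up to a size budget (the chunk size and packing strategy are assumptions, not the project's exact logic):

```python
def chunk_paragraphs(text: str, max_chars: int = 2000):
    """Split plain text on blank lines, packing paragraphs into chunks of at
    most max_chars (an oversized paragraph becomes its own chunk)."""
    paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]
    chunks, current = [], ""
    for para in paragraphs:
        if current and len(current) + len(para) + 2 > max_chars:
            chunks.append(current)
            current = para
        else:
            current = f"{current}\n\n{para}" if current else para
    if current:
        chunks.append(current)
    return chunks
```

Each resulting chunk is then embedded as a single 1536-dimensional vector before being written to the Qdrant collection.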
Problem: "Qdrant connection failed"
# Solution: Check if Qdrant is running
docker ps | grep qdrant
# Or test connection manually
curl http://localhost:6333/collections
Problem: "WasmEdge not available" warning
# Solution: Install WasmEdge (optional but recommended)
curl -sSf https://raw.githubusercontent.com/WasmEdge/WasmEdge/master/utils/install_v2.sh | bash
Problem: "Hugging Face upload failed"
# Solution: Check token permissions
huggingface-cli whoami
# Ensure dataset exists and you have write access
Problem: PDF processing fails
# Solution: Install additional markitdown dependencies
pip install markitdown[all]
Enable debug logging:
export PYTHONPATH=.
python -c "
import logging
logging.basicConfig(level=logging.DEBUG)
import main
"
- Use WasmEdge for 3-5x faster embedding generation
- Batch related files for better context understanding
- Clean documents before upload (remove headers/footers)
- Use local Qdrant for faster processing (if possible)
# Install test dependencies
pip install pytest pytest-asyncio httpx
# Run tests
pytest tests/
- Fork the repository
- Create a feature branch:
git checkout -b feature/amazing-feature
- Make your changes
- Add tests for new functionality
- Ensure all tests pass:
pytest
- Commit your changes:
git commit -m 'Add amazing feature'
- Push to the branch:
git push origin feature/amazing-feature
- Open a Pull Request
This project uses:
- Black for code formatting
- isort for import sorting
- flake8 for linting
# Format code
black .
isort .
flake8 .
- File size limits prevent DoS attacks
- Input validation on all file types
- Temporary file cleanup after processing
- No sensitive data stored permanently
- Secure token handling for external services
- Issues: GitHub Issues
- Discussions: GitHub Discussions
- Gaia Network: Official Website
Built with ❤️ for the Gaia Network ecosystem