Ultra-performant data transformation framework for AI, with its core engine written in Rust. Supports incremental processing and data lineage out of the box. Exceptional developer velocity. Production-ready from day 0.
⭐ Drop a star to help us grow!
CocoIndex makes it effortless to transform data with AI and keep source data and targets in sync. Whether you're building a vector index for RAG, creating knowledge graphs, or performing any custom data transformation, it goes beyond what SQL can express.
Just declare transformations in dataflow with ~100 lines of Python:
```python
# import
data['content'] = flow_builder.add_source(...)

# transform
data['out'] = (data['content']
    .transform(...)
    .transform(...))

# collect data
collector.collect(...)

# export to db, vector db, graph db ...
collector.export(...)
```
CocoIndex follows the dataflow programming model. Each transformation creates a new field solely based on input fields, without hidden state or value mutation. All data before and after each transformation is observable, with lineage out of the box.

In particular, developers don't explicitly mutate data by creating, updating, and deleting records; they just define transformations/formulas for a set of source data.
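A minimal plain-Python sketch of the idea (the `clean_text` and `summarize` helpers below are hypothetical, for illustration only):

```python
# Hypothetical helpers, for illustration only: each derives a new value
# from its input and never mutates it.
def clean_text(raw: str) -> str:
    return " ".join(raw.split())

def summarize(text: str) -> str:
    return text[:80]

# Dataflow style: every field is computed solely from earlier fields.
data = {}
data["raw"] = "  CocoIndex   transforms data with AI.  "
data["clean"] = clean_text(data["raw"])     # new field, no hidden state
data["summary"] = summarize(data["clean"])  # lineage: summary <- clean <- raw

# All data before/after each transformation stays observable.
print(data["raw"], data["clean"], data["summary"], sep="\n")
```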
Native built-ins for different sources, targets, and transformations. Standardized interfaces make switching between components a one-line code change.
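For example, here is a sketch of swapping the export target of a collector (the target specs shown are illustrative; the exact connector names and parameters available depend on your CocoIndex version):

```python
# Swapping the export target is a one-line change in the export step.
# `doc_embeddings` is a collector defined in a flow (see the full
# example below).
doc_embeddings.export(
    "doc_embeddings",
    cocoindex.targets.Postgres(),
    # cocoindex.targets.Qdrant(collection_name="doc_embeddings"),  # 1-line switch
    primary_key_fields=["filename", "location"],
)
```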
CocoIndex keeps source data and targets in sync effortlessly.

It has out-of-the-box support for incremental indexing, with minimal recomputation when source data or flow logic changes.
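For example, here is a sketch of triggering an incremental run from Python. It assumes the flow object (such as `text_embedding_flow` defined in the example below) exposes an `update()` method; the exact API may differ across CocoIndex versions, so check the docs:

```python
# A sketch of an incremental update run (method name is an assumption;
# see the CocoIndex docs for the exact API in your version).
stats = text_embedding_flow.update()
# Only data affected by source or logic changes is reprocessed;
# unchanged rows are skipped and stale target rows are removed.
print(stats)
```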
If you're new to CocoIndex, we recommend following the Quick Start Guide to define your first indexing flow. An example flow looks like this:
```python
@cocoindex.flow_def(name="TextEmbedding")
def text_embedding_flow(flow_builder: cocoindex.FlowBuilder, data_scope: cocoindex.DataScope):
    # Add a data source to read files from a directory
    data_scope["documents"] = flow_builder.add_source(
        cocoindex.sources.LocalFile(path="markdown_files"))

    # Add a collector for data to be exported to the vector index
    doc_embeddings = data_scope.add_collector()

    # Transform data of each document
    with data_scope["documents"].row() as doc:
        # Split the document into chunks, put into `chunks` field
        doc["chunks"] = doc["content"].transform(
            cocoindex.functions.SplitRecursively(),
            language="markdown", chunk_size=2000, chunk_overlap=500)

        # Transform data of each chunk
        with doc["chunks"].row() as chunk:
            # Embed the chunk, put into `embedding` field
            chunk["embedding"] = chunk["text"].transform(
                cocoindex.functions.SentenceTransformerEmbed(
                    model="sentence-transformers/all-MiniLM-L6-v2"))

            # Collect the chunk into the collector.
            doc_embeddings.collect(
                filename=doc["filename"], location=chunk["location"],
                text=chunk["text"], embedding=chunk["embedding"])

    # Export collected data to a vector index.
    doc_embeddings.export(
        "doc_embeddings",
        cocoindex.targets.Postgres(),
        primary_key_fields=["filename", "location"],
        vector_indexes=[
            cocoindex.VectorIndexDef(
                field_name="embedding",
                metric=cocoindex.VectorSimilarityMetric.COSINE_SIMILARITY)])
```
This defines an index flow: import markdown files, split each into chunks, embed each chunk, and export the embeddings to a Postgres vector index.
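Once the flow has run, the exported table can be queried with plain SQL plus pgvector. A sketch, assuming CocoIndex's default target-table naming (lowercased flow name, double underscore, export name) and a local Postgres with pgvector installed; adjust the connection string and table name to your setup:

```python
# A sketch of querying the exported embeddings with psycopg + pgvector.
from psycopg import connect
from sentence_transformers import SentenceTransformer

# Embed the query with the same model used in the flow.
model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")
query_vec = model.encode("How do I define a flow?").tolist()

with connect("postgresql://localhost/cocoindex") as conn:  # placeholder DSN
    rows = conn.execute(
        """
        SELECT filename, text, embedding <=> %s::vector AS distance
        FROM textembedding__doc_embeddings
        ORDER BY distance
        LIMIT 5
        """,
        (str(query_vec),),  # pgvector parses the '[...]' literal
    ).fetchall()

for filename, text, distance in rows:
    print(f"{distance:.3f}  {filename}: {text[:60]}")
```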
More examples are coming, so stay tuned 👀!
For detailed documentation, visit CocoIndex Documentation, including the Quick Start guide.
We love contributions from our community ❤️. For details on contributing or running the project for development, check out our contributing guide.
Welcome with a huge coconut hug 🥥⋆。˚🤗. We are super excited about community contributions of all kinds, whether it's code improvements, documentation updates, issue reports, feature requests, or discussions in our Discord.
Join our community here:
We are constantly improving, and more features and examples are coming soon. If you love this project, please drop us a star ⭐ on the GitHub repo to stay tuned and help us grow.
CocoIndex is Apache 2.0 licensed.