Replies: 3 comments
-
After some further thought, it makes sense to consider the "IPFS Controller" to be the main concept, and the "interfaces" to simply be clients for the controller, which can then be implemented in whatever way works best. The concept of creating a CDN using this system would be an application of the IPFS Controller, and can be separated from this write-up.
-
This design seems fine at a high level, assuming that your private nodes are peered with each other sufficiently. I'd take a use case (e.g. CDN), flesh out the specific requirements and metrics, and then iterate based on that. You may want to look at IPFS Search. It seems to have a system similar to your "controller" above, where you can query for CIDs based on certain metadata. I think the Filecoin Slack (https://filecoin.io/slack) is probably a good place to get some more feedback. It's used for both IPFS and Filecoin discussion.
-
IPFS Controller
IPFS Controller is a system that lets existing large-scale applications leverage the benefits of IPFS for data distribution and replication. Specifically, applications can query a centralized controller to learn about the current state of content in their network, and access the data relevant to their use case.
Motivation
Suppose you operate a thriving video-sharing platform, whose users view and upload short-lived video blobs that then spread virally throughout your user network. To achieve a snappy user experience, you aggregate and cache videos on servers located near the users they serve.
You don't know ahead of time which blobs you will need to download, only some specific parameters you'd like to search by.
For example, a server you operate in the Southwest United States would periodically run a query for content matching its parameters.
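As a purely hypothetical illustration (every field and argument name below is invented for this sketch), such a query might look like:

```graphql
# Hypothetical query; field and argument names are illustrative only.
query SouthwestRefresh {
  contents(
    region: "us-southwest"   # content relevant near this server
    contentType: VIDEO       # only the short-lived video blobs
    minViews: 10000          # a rough popularity floor
    limit: 500               # cap the refresh batch size
  ) {
    cid                      # the IPFS content identifier to download
    title
    views
  }
}
```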
The query returns a listing of content identifiers, along with the metadata and attributes associated with each piece of content. After querying for this data, your server has a list of all the content IDs it needs to download within that refresh cycle.
From this point forward, all the server needs to do is ask its local IPFS node to download those exact IDs.
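As a minimal sketch of that refresh step, assuming the controller exposes a GraphQL endpoint over HTTP (controller.internal is a made-up hostname) and each server runs a standard Kubo (go-ipfs) node:

```sh
# Ask the controller for the CIDs matching this server's parameters,
# then pin each one so the local IPFS node fetches and keeps it.
curl -s https://controller.internal/graphql \
  -H 'Content-Type: application/json' \
  -d '{"query": "query { contents(region: \"us-southwest\", limit: 500) { cid } }"}' \
  | jq -r '.data.contents[].cid' \
  | xargs -n1 ipfs pin add
```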
The beauty of going through IPFS for this is that your nodes no longer need to track or maintain the exact locations where content can be found. This provides a separation of concerns between the application logic, which cares about what content it is serving to users, and the data-fetching logic, which is only tasked with obtaining the desired content identifiers.
Architecture
Overview
The big picture of this architecture is summarized as follows:
Application <--> IPFS Controller
Of all the concepts presented in this outline, this is the most important one.
The communication between the application and the IPFS controller must be simple: it should take minimal effort for the application to request data from the controller and to publish new data.
Here's a rough outline of the interaction flow. There are three main actors: the application, the interface (a client for the controller), and the controller itself.
The application asks the interface for some data, say the 50 most downloaded movies in the United States, and the interface then asks the controller for content IDs matching that filter.
Once the interface has the content IDs, it downloads that data from the IPFS network and provides the results to the application. The downloaded content is then kept as a cache on the local IPFS node.
This means that not only will subsequent requests for the same content be served quickly from the cache, but other nodes downloading the same content will be able to fetch it from those that already have it.
Because nodes can download content from each other in addition to the central content store,
this has the added benefit of reducing load on the central store.
Content Orchestration and Events
Another aspect of this would be the ability for the IPFS controller, or another overseeing application, to orchestrate the data cached by a given server.
In theory, you can already do this using IPFS without a controller, but the bottleneck ends up being that you don't know details about content popularity within the network.
You may configure metrics exporters and loggers within your application logic, but where would those metrics end up going? A centralized database, of course, which would then dictate content popularity and, you guessed it, issue directives for the servers to download said content.
The solution to this problem ends up being the IPFS controller once again. The ability to orchestrate content should therefore be provided either through the controller itself or through an auxiliary application subscribed to the controller's events, as sketched below.
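For instance, assuming the GraphQL API described later supports subscriptions, a cache server or auxiliary orchestrator might subscribe to directives along these lines (the subscription and field names are hypothetical):

```graphql
# Hypothetical subscription: a cache server (or auxiliary orchestrator)
# listens for directives the controller derives from popularity metrics.
subscription CacheDirectives {
  contentDirectives(region: "us-southwest") {
    action   # e.g. PIN or UNPIN
    cid      # the content the directive applies to
    reason   # e.g. "views exceeded threshold in region"
  }
}
```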
Load on IPFS Controller
The IPFS Controller never stores any actual IPFS data. Instead, it is solely responsible for storing the records associated with IPFS content IDs.
Because these records are small, the controller can stay available while processing an intense volume of requests without breaking a sweat, and this opens up room to integrate an eventing system where clients can subscribe to receive certain events, such as the publication of new records matching some set of parameters.
IPFS Controller
This is essentially the "brains" of the operation. The IPFS Controller has the following responsibilities, detailed in the sections below:
- storing records that map IPFS content IDs to their metadata and attributes
- answering parameterized queries for content IDs
- emitting events, such as the publication of new records, that clients can subscribe to
- orchestrating which content gets cached on which server
- managing roles and permissions for content within the network
API
With respect to the following points, we represent the API using GraphQL, for its ability to easily express these types of queries.
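As a rough sketch of what that schema could contain (all type and field names here are illustrative, not a finalized design):

```graphql
# Hypothetical schema sketch for the controller's query/publish surface.
type Content {
  cid: String!              # the IPFS content identifier
  title: String
  contentType: ContentType
  region: String            # where this content is most relevant
  views: Int                # see the popularity section below
  createdAt: String
  attributes: [Attribute!]  # mutable metadata attached to the record
}

type Attribute {
  key: String!
  value: String!
}

enum ContentType {
  VIDEO
  IMAGE
  OBJECT_3D
}

type Query {
  # The core read path: a filtered listing of content records.
  contents(
    region: String
    contentType: ContentType
    minViews: Int
    limit: Int
  ): [Content!]!
}

type Mutation {
  # Publish a new record once the underlying data has been added to IPFS.
  publishContent(cid: String!, title: String, region: String): Content!
}
```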
The above schema is a sample of what it could be.
The following takes popularity data into account, allowing the controller to return IPFS CIDs based on how many views certain pieces of content are receiving.
IPFS Controller API w/ Popularity & Metrics
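Again hypothetically, the schema sketched above could be extended with popularity fields and a popularity-ordered query:

```graphql
# Hypothetical extension: the controller aggregates view metrics and
# exposes them for popularity-based queries.
extend type Content {
  viewsLastHour: Int!   # a recent-window popularity signal
}

extend type Query {
  # Most-viewed content for a region, e.g. the 50 most downloaded
  # movies in the United States.
  topContents(region: String, limit: Int = 50): [Content!]!
}
```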
Controller storage
Storing this data should ideally be handled by a horizontally scalable store: a NoSQL document database such as MongoDB, a distributed SQL database such as CockroachDB, or a key-value database such as Redis or etcd.
A traditional single-node SQL database should be avoided, due to the difficulty of scaling it in the way this type of system would require.
Assuming MongoDB as the database, the records would be stored in a primary collection titled content.
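A hypothetical document in that collection (all field names and values are invented for illustration) might have the following shape:

```json
{
  "_id": "64f0c2...",
  "cid": "QmYwAPJzv5CZsnAzt8auVZRn1pfejNyUfM8jxyJzCg3QVG",
  "title": "example-clip.mp4",
  "contentType": "video",
  "region": "us-southwest",
  "createdAt": "2023-01-01T00:00:00Z",
  "views": 18234,
  "attributes": {
    "campaign": "spring-launch",
    "ttlHours": 72
  }
}
```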
Just like in object storage, there are additional attributes associated with the data. These attributes would be mutable pieces of data which assist in knowing which data to interact with.
Use as a CDN
The IPFS Controller system in theory allows you to have CDN boxes which download data through a mechanism similar to Amazon S3. The benefit of this system is that, when it comes to storing immutable blobs of data (images, videos, 3D objects, etc.), CDN nodes would be able to spread the load of downloading from a central source across the other CDN servers carrying the same content.
In this instance, the controller would take on the role of talking with nodes and dictating which content gets stored on which server.
Comparison to Existing CDN Solutions
A roughly similar CDN system is AWS CloudFront, which commonly uses S3 as its storage origin; this leads me to think that something similar could be accomplished by leveraging IPFS as a storage layer.
The unknowns here are how exactly this system would provide a real benefit over current solutions, what problems current solutions have, and whether this system could be used to address any of them.
Roles & Permissions
By having the controller act as a central ledger of what content exists within a system, it's also possible to provide role-based access control (RBAC) for content and to manage all the data available within a network, while still leveraging the benefits of handling data over a decentralized network internally.
Data Visibility
The idea for this system is that all IPFS sidecar nodes would ideally be running on the same
private network, disconnected from the public IPFS DHT.
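For reference, Kubo (go-ipfs) already supports this kind of isolation: nodes sharing a pre-shared swarm key form a private network and reject peers that don't have it, and the default public bootstrap list can be cleared. A rough setup might look like:

```sh
# Generate a shared key once and distribute it to every sidecar node.
# (ipfs-swarm-key-gen is a small community helper; any pre-shared key
# in the expected format works.)
ipfs-swarm-key-gen > ~/.ipfs/swarm.key

# Drop the public bootstrap peers so nodes only discover each other.
ipfs bootstrap rm --all
ipfs bootstrap add /ip4/10.0.0.2/tcp/4001/p2p/<peer-id-of-a-private-node>

# Refuse to start unless the private network key is present.
export LIBP2P_FORCE_PNET=1
ipfs daemon
```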
It would also be possible to run this type of system in public; however, you would lose the ability to control access to content once it has been distributed.