updates to specification document after videoconference

0c9fa62a · Rok Roskar · 3c570489 · 0c9fa62a
Commit 0c9fa62a authored 8 years ago by Rok Roskar
--- a/docs/roadmap/bigdata-usecases.md
+++ b/docs/roadmap/bigdata-usecases.md
@@ -12,7 +12,7 @@ openBIS should be extended to support big-data use cases. This requires changing
 | Command       | Description                                                                                 |
 |---------------|---------------------------------------------------------------------------------------------|
 | obis init     | Register unmanaged data set with openBIS                                                    |
-| obis commit   | Commit local state to openBIS server                                                        |
+| obis commit   | Commit local state to openBIS server and synchronize database                               |
 | obis add-ref  | Reference a existing unmanaged data set                                                     |
 | obis clone    | Copy data to new location and register copy with openBIS                                    |
 | obis get      | Retrieve the files for an unmanaged data set                                                |
@@ -43,7 +43,7 @@ A user wishes to analyze a large data set on a cluster. Here is an overview of t
 |----------------------|-----------------------------------------------------------------------------------------|
 | [download data]      | Stage data on the server in folder "foo"                                                |
 | obis init data .     | Prepare an unmanaged (data) data set                                                    |
-| obis commit          | Inform the openBIS server about the current state                                       |
+| obis commit          | Inform the openBIS server about the current state and send a file listing for the database |
 | mkdir/cd ../bar      | Create a folder to contain the analysis code                                            |
 | obis init analysis . | Prepare an unmanaged (analysis) data set                                                |
 | obis add-ref  [path] | Indicate that the analysis data set references the data data set                        |
@@ -67,11 +67,11 @@ This would make it possible to re-run the analysis on different infrastructure.

 ## obis init [data/analysis]

-Create a new unmanaged data set in openBIS. This command has two variants: data and analysis. With the data argument, a git-annex is also initialized so that the (potentially large) data files can be managed. With the analysis argument only git is initialized, since the repository is assumed to hold just source code and analysis results (which are assumed to be small).
+Create a new unmanaged data set in openBIS. This command has two variants: data and analysis. With the data argument, a git-annex is also initialized so that the (potentially large) data files can be managed. With the analysis argument only git is initialized, since the repository is assumed to hold just source code and analysis results (which are assumed to be small). The user is also prompted for the openBIS instance to use for this repository/data set.

 ## obis commit

-Informs openBIS about the current state of the repository. If it is unknown to openBIS, a new data set is created. If is is known to openBIS, a new data set is created which is the child of the previous state of the data set. The unmanaged data set stores the git commit id as metadata. Unmanaged data sets may have copies.
+Informs openBIS about the current state of the repository. If it is unknown to openBIS, a new data set is created. If it is known to openBIS, a new data set is created which is the child of the previous state of the data set. The unmanaged data set stores the git commit id as metadata. Unmanaged data sets may have copies. A reference to the git repository (a git URL) is stored so that the data set can be cloned later. Once the openBIS server returns the data set ID, a marker file is created in the directories belonging to the data set so that the data set can be rediscovered if the link is lost.

 ## obis add-ref [path do data set]

@@ -79,11 +79,11 @@ Store a reference to another data set. This is, for example, used in analysis da

 ## obis clone [data set id]

-Clone a data set that is known to openBIS. This create a "copy" data set in openBIS.
+Clone a data set that is known to openBIS. By default, cloning does not copy any data; use `obis get` for that.

-## obis get
+## obis get [*file*]

-Retrieve any data from the annex and save it locally.
+Retrieve any data from the annex and save it locally. There is probably no need to inform openBIS about the copy, because `git-annex` tracks the copies itself and the currently known copies can be listed with `git annex whereis`.

 # Handling of Scenarios

@@ -109,3 +109,5 @@ This needs to be looked into in greater detail. Git-annex can manage data that i

 - Korolev et. al. use git tree IDs instead of commit IDs. Are these better?
 - How well does git-annex handle content in HDFS? Is some work necessary to improve this support?
+- Where does the copy count get updated: when `git annex sync` is run, or are all known remotes queried?
+- Do we need two different versions of `obis init` for data and analysis? Perhaps a single command would suffice and reduce complexity.
\ No newline at end of file
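
Review note: the marker-file step added to the `obis commit` description could look roughly like the sketch below. It is only an illustration under stated assumptions. The marker file name `.obis_dataset_id`, the two helper functions, and the ID format are hypothetical; the specification does not define any of them.

```python
from pathlib import Path

# Hypothetical marker file name; the specification does not define one.
MARKER_NAME = ".obis_dataset_id"


def write_marker(dataset_dir, dataset_id):
    """Store the data set ID returned by openBIS inside the data set directory."""
    marker = Path(dataset_dir) / MARKER_NAME
    marker.write_text(dataset_id + "\n", encoding="utf-8")
    return marker


def find_dataset_id(start_dir):
    """Search start_dir and its parents for a marker file.

    Returns the stored ID, or None if no marker file is found.
    """
    current = Path(start_dir).resolve()
    for candidate in [current, *current.parents]:
        marker = candidate / MARKER_NAME
        if marker.is_file():
            return marker.read_text(encoding="utf-8").strip()
    return None
```

With a marker like this, a tool started anywhere inside a data set directory could recover the data set ID without contacting openBIS, which is the "lost link" case the description mentions.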