Friday, August 12, 2016

Read the Docs




Over the past few days, I finished the documentation for MediCurator.


Below is the link:

http://medicurator.readthedocs.io/en/latest/

Wednesday, August 10, 2016

Documentation and a Bug Fix




This week, I wrote the documentation and fixed a bug.

Details:

Download only what is new:

https://bitbucket.org/BMI/medicurator/issues/6/download-only-what-is-new-when-attempting

The documentation:

https://bitbucket.org/BMI/medicurator/issues/5/medicurator-readthedocs-documentation

Avoid the need to hard-code the API key:

https://bitbucket.org/BMI/medicurator/issues/7/avoid-the-need-to-hard-code-the-api-key

Sunday, July 31, 2016

Research on DICOMweb Support in MediCurator




I have done some research on DICOMweb, and it can be applied to MediCurator. DICOMweb has three levels: study, series and instance. I can query it level by level and plug it into MediCurator as a data source, the same way as TCIA. The user only needs to add the URL of the server for it to work, and I can use the retrieve function to download the images, so MediCurator can be implemented on top of it. However, the only server I found with DICOMweb implemented is http://www.dicomserver.co.uk/DICOMRS.html, and its retrieve function is not finished yet, so I could not experiment with retrieval. In conclusion, MediCurator can use DICOMweb without too much modification.
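To make the level-by-level idea concrete, here is a minimal sketch of how the URLs for each level could be built. The path layout (studies / series / instances) follows the DICOMweb QIDO-RS and WADO-RS resource structure; the class name `DicomWebPaths` and the example server address are my own placeholders, not MediCurator code.

```java
import java.net.URI;

// Hypothetical helper (not part of MediCurator): builds DICOMweb resource
// URLs for the three levels: study, series and instance.
final class DicomWebPaths {
    private final String base; // service root supplied by the user, stored without trailing slash

    DicomWebPaths(String baseUrl) {
        // Strip trailing slashes so that joined paths never contain "//".
        this.base = baseUrl.replaceAll("/+$", "");
    }

    // Query level 1: list all studies on the server.
    URI studies() {
        return URI.create(base + "/studies");
    }

    // Query level 2: list the series of one study.
    URI series(String studyUid) {
        return URI.create(base + "/studies/" + studyUid + "/series");
    }

    // Query level 3: list the instances of one series.
    URI instances(String studyUid, String seriesUid) {
        return URI.create(series(studyUid) + "/" + seriesUid + "/instances");
    }

    // Retrieve: address of a single instance to download.
    URI instance(String studyUid, String seriesUid, String sopUid) {
        return URI.create(instances(studyUid, seriesUid) + "/" + sopUid);
    }

    public static void main(String[] args) {
        // Placeholder server and UIDs, for illustration only.
        DicomWebPaths paths = new DicomWebPaths("http://example.org/dicomweb/");
        System.out.println(paths.series("1.2.3"));
        System.out.println(paths.instance("1.2.3", "4.5.6", "7.8.9"));
    }
}
```

With something like this, the only server-specific input is the base URL, which matches the idea that the user just adds the URL of the server.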

Tuesday, July 26, 2016

MediCurator Refactoring and Near-Duplicate Detection




This week, I refactored MediCurator into three modules:

  1. medicurator-core for the core medicurator stuff.
  2. medicurator-server for the API.
  3. medicurator-client for the web app.

This way, I can avoid dependency conflicts.


I also added the near-duplicate detection function to the web app and the API:

    http://localhost:4567/duplicateSets?replicasetID1=***&replicasetID2=***

This completes the planned set of functions.

What's more, I wrote some scripts that make it easier for users to run the project.


Building

       ./compile.sh


Run the web app

       ./run_servlet.sh


Run the RESTful API

       ./run_api.sh


Wednesday, July 20, 2016

RESTful API and Delete-Download Workflow






First, this week I implemented the RESTful API.

As shown in the README, it includes the following endpoints:

http://localhost:4567/signup?username=***&password=***

http://localhost:4567/login?username=***&password=***

http://localhost:4567/getReplicaSets?userid=***

http://localhost:4567/createReplicaSets?userid=***&replicaName=***

http://localhost:4567/getDataSets?replicasetID=***

http://localhost:4567/addDataSet?replicasetID=***&datasetID=***

http://localhost:4567/removeDataSet?replicasetID=***&datasetID=***

http://localhost:4567/getRootDataSets

http://localhost:4567/getSubsets?datasetID=***

http://localhost:4567/downloadDataSets?datasetID=***


http://localhost:4567/downloadOneDataSets?datasetID=***

http://localhost:4567/deleteDataSets?datasetID=***

http://localhost:4567/deleteOneDataSet?datasetID=***


There is also

http://localhost:4567/duplicateSets?replicasetID1=***&replicasetID2=***

which is still to be done.
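As a small illustration of how a client could build these request URLs, here is a sketch that appends URL-encoded query parameters to an endpoint name. The class `ApiUrlSketch` and its method are hypothetical helpers of mine; only the base address and the endpoint and parameter names come from the list above.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical client-side helper; not part of the MediCurator code base.
final class ApiUrlSketch {
    static final String BASE = "http://localhost:4567/"; // address used in this post

    // Builds BASE + endpoint + "?k1=v1&k2=v2" with URL-encoded keys and values.
    static String build(String endpoint, Map<String, String> params) {
        StringBuilder url = new StringBuilder(BASE).append(endpoint);
        char separator = '?';
        for (Map.Entry<String, String> entry : params.entrySet()) {
            url.append(separator)
               .append(encode(entry.getKey()))
               .append('=')
               .append(encode(entry.getValue()));
            separator = '&';
        }
        return url.toString();
    }

    private static String encode(String value) {
        try {
            return URLEncoder.encode(value, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            // UTF-8 is always available on the JVM, so this should not happen.
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        Map<String, String> params = new LinkedHashMap<>();
        params.put("replicasetID", "42"); // made-up IDs for illustration
        params.put("datasetID", "7");
        System.out.println(build("addDataSet", params));
    }
}
```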

The second thing I did was fix the delete-download workflow. MediCurator now supports a user downloading a dataset, deleting it, and downloading it again.

I implemented this by separating the meanings of "remove" and "delete". "Remove" means taking the dataset out of the replica set, while "delete" means deleting it directly, after which it can be downloaded again.
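The difference between the two operations can be sketched like this. It is only a toy model with made-up names (`ReplicaSetSketch`, string IDs, in-memory sets), and it assumes that "remove" does not touch the downloaded state; the real implementation may differ.

```java
import java.util.HashSet;
import java.util.Set;

// Toy model of "remove" versus "delete"; all names here are illustrative.
final class ReplicaSetSketch {
    private final Set<String> members = new HashSet<>();    // dataset IDs in the replica set
    private final Set<String> downloaded = new HashSet<>(); // dataset IDs currently downloaded

    void add(String datasetId) {
        members.add(datasetId);
    }

    // Downloads only datasets that are members and not yet downloaded.
    // Returns true if a download actually happened.
    boolean download(String datasetId) {
        if (!members.contains(datasetId)) {
            return false;
        }
        return downloaded.add(datasetId);
    }

    // "delete": forget the downloaded copy but keep the dataset in the
    // replica set, so that it can be downloaded again later.
    boolean delete(String datasetId) {
        return downloaded.remove(datasetId);
    }

    // "remove": take the dataset out of the replica set.
    // Assumption: the downloaded state is left untouched here.
    boolean remove(String datasetId) {
        return members.remove(datasetId);
    }

    public static void main(String[] args) {
        ReplicaSetSketch replicaSet = new ReplicaSetSketch();
        replicaSet.add("d1");
        System.out.println(replicaSet.download("d1")); // first download
        replicaSet.delete("d1");
        System.out.println(replicaSet.download("d1")); // download again after delete
    }
}
```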




Wednesday, July 13, 2016

Implement Local File Source




This week, I implemented the local file source for MediCurator.

My tests show that it works well. The local test files are under the path medicurator/target/classes/image.

They can be downloaded through the web application written last week.

Downloaded files are now stored at medicurator/target/classes/local.test.
The path can be changed in Constant.java.

Next, I will finish the complete download-tracking workflow to solve the delete problem. I think I should implement a delete hook, so that the function is invoked whenever the user deletes something through the website or duplicate detection sends a delete message.

 

Wednesday, July 6, 2016

Applying HDFS to MediCurator




This week, I applied HDFS to MediCurator.

As we all know, the Hadoop Distributed File System (HDFS) is a distributed file system designed to run on commodity hardware. HDFS is highly fault-tolerant and is designed to be deployed on low-cost hardware. It provides high-throughput access to application data and is suitable for applications that have large data sets.

To make it easier for MediCurator to handle high throughput, I decided to use HDFS. I extended Storage and adapted the existing LocalStorage into an HdfsStorage. To do this, I mainly used the Hadoop API documented at https://hadoop.apache.org/docs/r2.6.1/api/overview-summary.html.

In my tests, it works well. So far I have only run it on a single computer; some work remains to make it run on a cluster.

In addition, users can choose localStorage or hdfsStorage according to their preference by changing STORAGE (hdfs/local) in Constants.java. To use HDFS, the user should configure HDFS_URI and HDFS_BASEDIR in Constants.java. For example, HDFS_URI = "hdfs://localhost:9000/" and HDFS_BASEDIR = "/user/xxx/medicurator/".

Source code and more information:
https://bitbucket.org/BMI/medicurator
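
To show how the storage settings above fit together, here is a rough sketch that works out where a file would end up for each storage type. It uses only the standard library (no Hadoop calls); the class `StorageLocationSketch`, the `locate` method and the `LOCAL_BASEDIR` value are assumptions of mine, while the two HDFS values are the example settings from this post.

```java
import java.net.URI;
import java.nio.file.Paths;

// Illustrative only: shows how STORAGE, HDFS_URI and HDFS_BASEDIR could
// combine into a target location. This is not the MediCurator implementation.
final class StorageLocationSketch {
    static final String HDFS_URI = "hdfs://localhost:9000/";      // example value from the post
    static final String HDFS_BASEDIR = "/user/xxx/medicurator/";  // example value from the post
    static final String LOCAL_BASEDIR = "local.test";             // hypothetical local directory

    // storage is "hdfs" or "local"; relativeName is the file name below the base directory.
    static String locate(String storage, String relativeName) {
        switch (storage) {
            case "hdfs":
                // Resolve the base directory against the file system URI, then the file name.
                return URI.create(HDFS_URI).resolve(HDFS_BASEDIR).resolve(relativeName).toString();
            case "local":
                return Paths.get(LOCAL_BASEDIR, relativeName).toString();
            default:
                throw new IllegalArgumentException("Unknown storage type: " + storage);
        }
    }

    public static void main(String[] args) {
        System.out.println(locate("hdfs", "series1/image1.dcm"));
        System.out.println(locate("local", "series1/image1.dcm"));
    }
}
```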

Wednesday, June 29, 2016

Web Application - MediCurator



I used the Tomcat 7 Maven plugin to implement a website.

This web application lets users work with MediCurator.

It runs at http://localhost:2222/index

It contains

  • Signup 
  • Login
  • Logout
  • Personal page that lists all the replica sets
  • Functions: 
    • add a replicaset
    • add a dataset
    • delete a dataset
    • download a dataset
In addition:
  1. All the images are organized in a hierarchy, which is shown directly level by level to help users easily get to what they want.
  2. The implementation is relatively robust. It is not affected by a server crash, which means it remembers all the users' information no matter what happens to the server.
To learn more about this:

https://bitbucket.org/BMI/medicurator

 

Wednesday, June 8, 2016

The code hierarchy of MediCurator version 1 (before the midterm evaluation)

/medicurator/src/main/java/edu/emory/bmi/medicurator/
.
├── dupdetect     ----------------- Near-duplicate detection module
│   │
│   ├── DetectImage.java    ----- Detect duplicate image pairs
│   │
│   ├── DetectMetadata.java  ----- Detect near-duplicate metadata pairs
│   │
│   ├── DupDetect.java   --------- Entry of the detection module
│   │
│   ├── DuplicatePair.java  ------ Define the data type of duplicate pair
│   │
│   └── Verify.java    ---------- Check if a pair is really near-duplicate


├── general    -------------------- Define the abstract data structures
│   │
│   ├── DataSet.java  ------------ A DataSet may contain several Images
│   │                      and sub DataSets. Maintained as a tree.
│   │
│   ├── DataSource.java  --------- DataSource has a root DataSet
│   │
│   │
│   ├── Metadata.java  ----------- Metadata is a collection of key-value
│   │                      pairs. Both key and value are Strings.
│   │
│   ├── ReplicaSet.java  --------- ReplicaSet contains many DataSets. The
│   │                      DataSets might come from different DataSources
│   │
│   └── User.java  --------------- User has username and password as well
│                          as several ReplicaSets.

├── image  ------------------------- Various image types
│   │
│   ├── DicomImage.java  --------- Implementation of DICOM image type.
│   │
│   └── Image.java  -------------- The abstraction of an image. An image
│                          consists of a Metadata and a byte[] of
│                          raw image data.


├── infinispan   ------------------- Contact with Infinispan
│   │
│   ├── ID.java  ----------------- Put and get various data with data id
│   │
│   ├── Manager.java  ------------ The single global DefaultCacheManager
│   │
│   └── StartInfinispan.java  ---- Just start an Infinispan node


├── storage   --------------------- Persist storage
│   │
│   ├── HdfsStorage.java   ------ (TODO) Store to HDFS
│   │
│   ├── LocalStorage.java  ------ Store to local disk
│   │
│   └── Storage.java   ---------- Interface of storage, save and load


└── tcia   ------------------------ Implementation of TCIA data source
    │
    ├── TciaAPI.java  ----------- Implementation of TCIA RESTful API
    │
    ├── TciaDataSet.java  ------- DataSet
    │
    ├── TciaDataSource.java  ---- DataSource
    │
    ├── TciaHierarchy.java  ----- The five hierarchy levels of a TCIA DataSet
    │  
    └── TciaQuery.java  --------- Generate and send request with HTTPS get

Wednesday, May 25, 2016

Medicurator Abstract Definition



MediCurator is composed of six main parts (more details on Bitbucket):

User

        A user has a username, a password, and replica sets. A user can save many replica sets.

ReplicaSet

        A replica set contains many datasets.
        It supports operations such as add, get, remove, and download on the datasets.

Datasource

        The data source is decided by the system. It currently has two functions: getRootDataSet and retrieveDataSet.

Dataset

        A dataset is a tree structure. Each dataset has a parent, children, a metadata ID, and the data.

Metadata

        tag: string
        attribute: key-value pair <string, string>

Data

        Represents the content requested by the user.
        Download process
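
The Dataset definition above (a tree where each node knows its parent, its children and a metadata ID) could be modeled roughly as follows. This is a simplified sketch with names I made up (`DataSetNode`, `path`, `countLeaves`); the real classes in the repository may look different.

```java
import java.util.ArrayList;
import java.util.List;

// Simplified illustration of a dataset tree; not the actual MediCurator classes.
final class DataSetNode {
    final String name;
    final String metadataId;                               // ID of the metadata stored elsewhere
    final DataSetNode parent;                              // null for the root dataset
    final List<DataSetNode> children = new ArrayList<>();

    DataSetNode(String name, String metadataId, DataSetNode parent) {
        this.name = name;
        this.metadataId = metadataId;
        this.parent = parent;
        if (parent != null) {
            parent.children.add(this); // register with the parent to keep the tree consistent
        }
    }

    // Path from the root down to this node, joined with "/".
    String path() {
        return parent == null ? name : parent.path() + "/" + name;
    }

    // Number of leaf datasets at or below this node.
    int countLeaves() {
        if (children.isEmpty()) {
            return 1;
        }
        int total = 0;
        for (DataSetNode child : children) {
            total += child.countLeaves();
        }
        return total;
    }

    public static void main(String[] args) {
        DataSetNode root = new DataSetNode("root", "m0", null);
        DataSetNode collection = new DataSetNode("Collection1", "m1", root);
        new DataSetNode("Patient1", "m2", collection);
        new DataSetNode("Patient2", "m3", collection);
        System.out.println(collection.path());
        System.out.println(root.countLeaves());
    }
}
```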




Wednesday, May 18, 2016

TCIA dataset hierarchy

The Cancer Imaging Archive (TCIA) is an open-access database of medical images for cancer research. People can download the entire contents of a collection in bulk or use the RESTful API to search or download within or across collections.

TCIA's data is managed in a hierarchical way:
         the Whole database
                Collection1  Collection2  Collection3 .......
         in a Collection
                Patient1  Patient2  Patient3 .......
         in a Patient
                Study1  Study2  Study3 ......
         in a Study
                 Series1  Series2  Series3 ......
         in a Series
                 image1  image2  image3 ...... (DICOM format)

         One can use the RESTful API to search for a set of Patients or a set of Studies in specific collections, or by some other filter conditions.
         One can also download the images of a Series as a zip file, or download one specific image as a DICOM file.
         Besides, TCIA supports some other options: for example, one can choose the format of the response (CSV/HTML/XML/JSON), the modality of the Study (CT/MR...), or the body part examined.


An image has a globally unique ID: SOPInstanceUID.
The image set of a Series has a globally unique ID: SeriesInstanceUID.
A Study also has a globally unique ID: StudyInstanceUID.

To avoid redownloading the same image, one can keep track of the SOPInstanceUIDs and SeriesInstanceUIDs that have been downloaded before, because images can only be downloaded by those two UIDs via the RESTful API.
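
A minimal sketch of that tracking idea, assuming the UIDs are kept in in-memory sets (a real system would need to persist them somewhere). The class `DownloadTracker` and its method names are hypothetical.

```java
import java.util.HashSet;
import java.util.Set;

// Hypothetical tracker: remembers which UIDs were already downloaded.
final class DownloadTracker {
    private final Set<String> downloadedSeries = new HashSet<>(); // SeriesInstanceUIDs
    private final Set<String> downloadedImages = new HashSet<>(); // SOPInstanceUIDs

    // Records the series and returns true only if it was not downloaded before.
    boolean markSeries(String seriesInstanceUid) {
        return downloadedSeries.add(seriesInstanceUid);
    }

    // Records the image and returns true only if it was not downloaded before.
    boolean markImage(String sopInstanceUid) {
        return downloadedImages.add(sopInstanceUid);
    }

    public static void main(String[] args) {
        DownloadTracker tracker = new DownloadTracker();
        String uid = "1.2.3.4"; // made-up UID for illustration
        if (tracker.markImage(uid)) {
            System.out.println("download " + uid);
        }
        if (!tracker.markImage(uid)) {
            System.out.println("skip " + uid + ", already downloaded");
        }
    }
}
```

The caller asks the tracker before each download and skips anything it has already seen.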


Wednesday, May 11, 2016

Infinispan and Spark

Today I found something amazing: Infinispan and Spark can work together! This may be very useful for my project. https://github.com/infinispan/infinispan-spark

Infinispan is a distributed in-memory key/value data storage system, which should be much faster than HDFS. It can be used in distributed computing tasks as fast temporary storage and cache. http://infinispan.org/

Spark is a distributed computing framework in the MapReduce style. Visit http://spark.apache.org/ for more information.

My Thoughts on MEDIator: A Data Sharing Synchronization Platform for Heterogeneous Medical Image Archives



MEDIator: A Data Sharing Synchronization Platform for Heterogeneous Medical Image Archives
This is a paper written by my mentor. Its content is connected to the project I am going to do.

Here, I list some things I learned from this paper.

While sharing data is encouraged in science, algorithms and architectures should be designed for mashing up and sharing the medical data efficiently. Hence, a data sharing synchronization system should be secured and minimize data duplication in client instances, in addition to the regular requirements of the data access integration platforms. A data sharing synchronization platform should let data consumers view subsets of data that satisfy user-defined search criteria, and share them with others using pointers to the actual data.

This paper presents MEDIator, a data sharing and synchronization middleware platform for heterogeneous medical image archives. MEDIator allows sharing pointers to medical data efficiently, while letting the consumers manipulate the pointers without modifying the raw medical data. MEDIator has been implemented for multiple data sources, including Amazon S3, The Cancer Imaging Archive (TCIA), caMicroscope, and metadata from CSV files for cancer images.


Also, an in-memory data grid can be an alternative to traditional storage for the replica sets, as it provides faster storage, access, and execution. The paper uses Infinispan as this platform. By the way, I plan to use Infinispan in my project too.

MEDIator lets the users create, update, retrieve, and delete replica sets, and share the replica sets with others. 
Higher Level Use Case View


MEDIator APIs: InterfaceAPI, PubConsAPI, Integrator
(details omitted)


Integration with Medical Data Sources 

Clinical data is deployed in multiple data sources such as TCIA, caMicroscope, and Amazon S3. Figure 3 depicts the deployment of the system with multiple medical data sources. This part can help us access different sources of data.


"MEDIator is multi-tenanted where multiple users co-exist without the knowledge of existence of the other users, sharing the same cache space. Involving a time stamp for the class extending PubConsAPI, downloaded items can be tracked, and the diffs can be produced for the user download. Thus a download can be paused and resumed later, downloading the images that have not been downloaded yet."

---I am not clear about this paragraph; to be discussed.

To conclude: first, I think I can apply the Representation of Medical Image Sources part to my project, since it can help me represent the source data. Second, I can build on MEDIator to access data and then do the near-duplicate detection work based on that.


Thursday, May 5, 2016

Hello, GSoC 2016!

I'm really happy to have been selected for Google Summer of Code 2016. Thanks to my mentors Pradeeban and Ashish: your patience and help gave me confidence when I wrote my proposal. I look forward to a good collaboration.

The name of my GSoC project is Near Duplicate Detection in Medical Image Archives. My proposal is here.

I've set up my code repository at https://bitbucket.org/BMI/medicurator.

Good luck and have fun!

Happy Birthday, Peking University!

Yesterday, May 4th, was Peking University's 118th birthday. Congratulations! I'm proud of you forever.