Irene: May 2016

Wednesday, May 25, 2016

Medicurator Abstract Definition

Medicurator is composed of six main parts (more details are in the Bitbucket repository).

User

A user has a username, a password, and replica sets. A user can save many replica sets.

ReplicaSet

A replica set contains many datasets.

It supports operations such as add, get, remove, and download on those datasets.

Datasource

The data source is determined by the system. It currently has two functions: getRootDataSet and retrieveDataSet.

Dataset

A dataset is a tree structure. Each dataset has a parent, children, a metadataID, and the data.

Metadata

tag: a string

attribute: a key-value pair <string, string>

Data

Represents the content requested by the user.
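The parts above can be sketched as a small data model. This is only a rough illustration in Python, not the Medicurator code: the class names follow the headings above, but every field and method name (`metadata_id`, `add_child`, `path`, and so on) is my own assumption.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Metadata:
    """A tag plus string-to-string attributes."""
    tag: str
    attributes: Dict[str, str] = field(default_factory=dict)


@dataclass(eq=False)  # identity equality, so parent/child cycles are harmless
class DataSet:
    """One node of the dataset tree (hypothetical field names)."""
    name: str
    metadata_id: Optional[str] = None
    data: Optional[bytes] = None
    parent: Optional["DataSet"] = field(default=None, repr=False)
    children: List["DataSet"] = field(default_factory=list)

    def add_child(self, child: "DataSet") -> "DataSet":
        """Attach child under this node and return it."""
        child.parent = self
        self.children.append(child)
        return child

    def path(self) -> str:
        """Names from the root down to this node, joined by '/'."""
        parts = []
        node: Optional["DataSet"] = self
        while node is not None:
            parts.append(node.name)
            node = node.parent
        return "/".join(reversed(parts))


@dataclass
class ReplicaSet:
    """A named collection of datasets, keyed by their tree path."""
    name: str
    datasets: Dict[str, DataSet] = field(default_factory=dict)

    def add(self, dataset: DataSet) -> None:
        self.datasets[dataset.path()] = dataset

    def get(self, path: str) -> Optional[DataSet]:
        return self.datasets.get(path)

    def remove(self, path: str) -> bool:
        """Return True if a dataset was stored under path and is now removed."""
        return self.datasets.pop(path, None) is not None
```

A replica set here only keeps references to tree nodes, so adding the same dataset to two replica sets does not copy its data.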

Download process

Wednesday, May 18, 2016

TCIA dataset hierarchy

The Cancer Imaging Archive (TCIA) is an open-access database of medical images for cancer research. People can download the entire contents of a collection in bulk, or use the RESTful API to search or download within or across collections.

TCIA's data is organized as a hierarchy:
the whole database
Collection1 Collection2 Collection3 ...
in a Collection
Patient1 Patient2 Patient3 ...
in a Patient
Study1 Study2 Study3 ...
in a Study
Series1 Series2 Series3 ...
in a Series
image1 image2 image3 ... (DICOM format)

One can use the RESTful API to search for the patients or the studies of specific collections, or to search by other filter conditions.
One can also download the images of a Series as a zip file, or download one specific image as a DICOM file.
TCIA also supports other options: one can choose the response format (CSV/HTML/XML/JSON), or filter by modality (CT/MR...) or by the body part examined.
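Such queries boil down to building a URL with filter parameters. Below is a small sketch that only builds the URLs and does not contact any server. The base URL is a made-up placeholder, and the endpoint and parameter names (`getSeries`, `getImage`, `Collection`, `Modality`, `format`) are written from my memory of the TCIA API guide, so treat them as assumptions and check them against the official documentation.

```python
from urllib.parse import urlencode

# Hypothetical placeholder: replace with the base URL from the TCIA API guide.
BASE_URL = "https://tcia.example.org/query"


def build_query(endpoint, fmt="json", **filters):
    """Build a query URL for one endpoint.

    Filters whose value is None are dropped. Pass fmt=None for endpoints
    that return binary data (for example a zip of a Series).
    """
    params = {key: value for key, value in filters.items() if value is not None}
    if fmt is not None:
        params["format"] = fmt
    # Sort the parameters so the same query always yields the same URL.
    return f"{BASE_URL}/{endpoint}?{urlencode(sorted(params.items()))}"


# Search the Series of one collection, restricted to one modality.
series_url = build_query("getSeries", Collection="TCGA-GBM", Modality="MR")

# Download all images of one Series as a zip (the UID is a dummy value).
zip_url = build_query("getImage", fmt=None, SeriesInstanceUID="1.2.3")
```

Sorting the parameters is just a convenience: it makes the URLs stable, which helps when they are used as cache keys.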

An image has a globally unique ID: SOPInstanceUID.
The image set of a Series has a globally unique ID: SeriesInstanceUID.
A Study also has a globally unique ID: StudyInstanceUID.

To avoid redownloading the same image, one can keep track of the SOPInstanceUIDs and SeriesInstanceUIDs that have already been downloaded, because images can only be downloaded by those two UIDs via the RESTful API.
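The bookkeeping can be as simple as two sets. This is a minimal sketch under my own assumptions: the class and method names are hypothetical, and a real version would persist the sets instead of keeping them in memory.

```python
class DownloadLog:
    """Remembers which Series and single images were already downloaded."""

    def __init__(self):
        self.series = set()  # SeriesInstanceUIDs downloaded as a whole zip
        self.images = set()  # SOPInstanceUIDs downloaded one by one

    def mark_series(self, series_uid):
        self.series.add(series_uid)

    def mark_image(self, sop_uid):
        self.images.add(sop_uid)

    def needs_download(self, series_uid, sop_uid):
        """True unless the image, or the whole Series it belongs to, is on record."""
        return series_uid not in self.series and sop_uid not in self.images


log = DownloadLog()
log.mark_series("series-A")  # dummy UIDs for illustration
log.mark_image("image-7")    # a single image that belongs to series-B
```

With this log, any image of `series-A` is skipped, `image-7` is skipped, and every other image of `series-B` is still downloaded.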

Wednesday, May 11, 2016

Infinispan and Spark

Today I found out something amazing: Infinispan and Spark can work together! This may be very useful for my project. https://github.com/infinispan/infinispan-spark

Infinispan is a distributed in-memory key/value data store, which should be much faster than HDFS. It can be used in distributed computing tasks as fast temporary storage and as a cache. http://infinispan.org/

Spark is a distributed computing framework in the MapReduce style. Visit http://spark.apache.org/ for more information.

My thoughts on MEDIator: A Data Sharing Synchronization Platform for Heterogeneous Medical Image Archives

MEDIator: A Data Sharing Synchronization Platform for Heterogeneous Medical Image Archives
This is a paper written by my mentor. Its content is related to the project I am going to do.

Here I list some things I learned from this paper.

While sharing data is encouraged in science, algorithms and architectures should be designed for mashing up and sharing the medical data efficiently. Hence, a data sharing synchronization system should be secure and minimize data duplication in client instances, in addition to meeting the regular requirements of data access integration platforms. A data sharing synchronization platform should let data consumers view subsets of data that satisfy user-defined search criteria, and share them with others using pointers to the actual data.

This paper presents MEDIator, a data sharing and synchronization middleware platform for heterogeneous medical image archives. MEDIator allows sharing pointers to medical data efficiently, while letting the consumers manipulate the pointers without modifying the raw medical data. MEDIator has been implemented for multiple data sources, including Amazon S3, The Cancer Imaging Archive (TCIA), caMicroscope, and metadata from CSV files for cancer images.

Also, an in-memory data grid can be an alternative to traditional storage for the replica sets, as it provides faster storage, access, and execution. The paper uses Infinispan as this platform. By the way, I plan to use Infinispan in my project too.

MEDIator lets the users create, update, retrieve, and delete replica sets, and share the replica sets with others.
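To make the idea of sharing pointers concrete, here is a toy sketch. It is not MEDIator's API: the class, its methods, and the pointer strings are all hypothetical, and a plain dict stands in for the in-memory data grid.

```python
import uuid


class ReplicaSetStore:
    """Stores replica sets as tuples of pointers, never the raw data."""

    def __init__(self):
        self._cache = {}  # replica set ID -> tuple of pointers

    def create(self, pointers):
        """Store the pointers under a fresh ID and return that ID."""
        rs_id = str(uuid.uuid4())
        self._cache[rs_id] = tuple(pointers)
        return rs_id

    def retrieve(self, rs_id):
        return self._cache[rs_id]

    def update(self, rs_id, pointers):
        if rs_id not in self._cache:
            raise KeyError(rs_id)
        self._cache[rs_id] = tuple(pointers)

    def delete(self, rs_id):
        """Return True if the replica set existed and was deleted."""
        return self._cache.pop(rs_id, None) is not None

    def share(self, rs_id):
        """Copy the pointers under a new ID that can be handed to another user."""
        return self.create(self._cache[rs_id])


store = ReplicaSetStore()
mine = store.create(["tcia://CollectionA/Patient1", "s3://bucket/scan-42"])
theirs = store.share(mine)
store.update(theirs, ["tcia://CollectionA/Patient1"])  # only the copy changes
```

Because only pointers are copied, the consumer can edit the shared replica set freely while the original replica set and the raw images stay untouched.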

Higher Level Use Case View

MEDIator APIs: InterfaceAPI, PubConsAPI, Integrator
(details omitted)

Integration with Medical Data Sources

Clinical data is deployed in multiple data sources such as TCIA, caMicroscope, and Amazon S3. Figure 3 depicts the deployment of the system with multiple medical data sources. This part can help us access different sources of data.

"MEDIator is multi-tenanted where multiple users co-exist without the knowledge of existence of the other users, sharing the same cache space. Involving a time stamp for the class extending PubConsAPI, downloaded items can be tracked, and the diffs can be produced for the user download. Thus a download can be paused and resumed later, downloading the images that have not been downloaded yet."

---I am not clear about this paragraph; it is to be discussed.
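One possible reading of the quoted paragraph, sketched in Python: each downloaded item gets a time stamp, and the "diff" is simply the set of requested items without a stamp yet. This is only my guess at the idea, not MEDIator's actual code, and all names here are hypothetical.

```python
import time


def pending(items, downloaded_at):
    """The diff: requested items that have no download time stamp yet."""
    return [item for item in items if item not in downloaded_at]


def download(items, downloaded_at, fetch, now=time.time):
    """Fetch only the pending items, stamping each one after it succeeds.

    If fetch raises, the stamps written so far are kept, so calling this
    function again later resumes where the previous run stopped.
    """
    for item in pending(items, downloaded_at):
        fetch(item)
        downloaded_at[item] = now()
    return downloaded_at


# A paused download: img1 was fetched earlier, img2 and img3 were not.
items = ["img1", "img2", "img3"]
stamps = {"img1": 1.0}
fetched = []
download(items, stamps, fetched.append, now=lambda: 2.0)
```

After the call, `fetched` holds only `img2` and `img3`, and the old stamp of `img1` is unchanged.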

To conclude: first, I think I can apply the Representation of Medical Image Sources part to my project, since it can help me represent the source data. Second, I can plug into MEDIator to access data and then do the Near Duplicate Detection work on top of that.

Thursday, May 5, 2016

Hello, GSoC 2016!

I'm really happy to have been selected for Google Summer of Code 2016. Thanks to my mentors Pradeeban and Ashish: your patience and help gave me confidence while I wrote my proposal. I look forward to working together.

The name of my GSoC project is Near Duplicate Detection in Medical Image Archives. My proposal is here.

I've set up my code repository at https://bitbucket.org/BMI/medicurator.

Good luck and have fun!

Happy Birthday, Peking University!

Yesterday, May 4th, was Peking University's 118th birthday. Congratulations! I'm proud of you forever.