A COMPACT IMPLEMENTATION OF INDEPENDENT COMPONENT ANALYSIS WITH GRAPHICAL PROCESSING UNIT
A Thesis
by
HARISH ANKAM
Submitted to the Office of Graduate Studies
Texas A&M University-Commerce
In partial fulfillment of the requirements
for the degree of
MASTER OF SCIENCE
May 2014
A COMPACT IMPLEMENTATION OF INDEPENDENT COMPONENT ANALYSIS WITH GRAPHICAL PROCESSING UNIT
A Thesis
by
HARISH ANKAM
Approved by:
Advisor: Mutlu Mete
Committee: Abdullah Arslan
Unal Sakoglu
Head of Department: Sang C. Suh
Dean of the College: Dan Edelman
Dean of Graduate Studies: Arlene Horne
ABSTRACT
A COMPACT IMPLEMENTATION OF INDEPENDENT COMPONENT ANALYSIS WITH GRAPHICAL PROCESSING UNIT
Harish Ankam, MS
Texas A&M University-Commerce, 2013
Advisor: Mutlu Mete, Ph.D.
Independent Component Analysis (ICA) is a blind source-separation method that has been applied in many fields. In brain imaging, such as EEG and fMRI, given only the observed signals, the main goal of ICA is to recover the independent sources, which are assumed to be linearly mixed, and thus decompose the information in the mixing system. However, for large data such as fMRI, the computation time of ICA can be quite long, which necessitates computationally efficient implementations.
We aimed to develop a minimally dependent and platform-independent ICA implementation using Graphical Processing Units (GPUs). GPUs are designed to rapidly manipulate and alter computer memory to accelerate the building of images in a frame buffer intended for output to a display. GPUs have a large number of multiprocessors, and
each multiprocessor has several cores. In this work, we implemented the serial portions of the ICA algorithm to run on the CPU, while parallel portions of ICA, such as matrix inversion and determinant calculation, run on the GPU. The data transfer between CPU and GPU, which generally slows performance, is also minimized in our implementation. To demonstrate the speed-up of the newly introduced approach, our Java-based package was tested on fMRI data of different sizes obtained from task-related neuro-experiments. The newly developed software was also validated using four sound files. Our software is integrated with the well-known data mining and machine learning package WEKA to increase its usability.
Although our implementation is not the first GPU-based ICA implementation, it is novel and preferable over previous implementations because our software is platform independent and does not depend on any obsolete libraries.
ACKNOWLEDGEMENTS
First and above all, I praise God, the Almighty, for providing me this opportunity and granting me the capability to proceed successfully. Without the assistance and guidance of Dr. Mutlu Mete, Dr. Unal Sakoglu, and Dr. Abdullah Arslan, it would not have been possible to complete this thesis. I would therefore like to offer my sincere thanks to all of them.
I wish to acknowledge my deepest gratitude to my advisor, Dr. Mutlu Mete, whose supervision, encouragement, advice, and guidance from the initial to the final level enabled me to develop an understanding of the subject.
Thanks to my friends, who helped me whenever I needed their support and understood my single-mindedness and my passion for the research.
Although I live far from my parents, communication with them provided me a good atmosphere and motivation to complete my research. I warmly thank my parents for their spiritual support in all aspects of my life. I would like to thank the Office of Graduate Studies at Texas A&M University-Commerce, which supported this study. I offer my regards and blessings to all of those who supported me in any respect during the completion of this thesis. Thank you.
TABLE OF CONTENTS
LIST OF TABLES ............................................................................................... viii
LIST OF FIGURES ............................................................................................... ix
CHAPTER
1. INTRODUCTION ..................................................................................1
Statement of the Problem .................................................................1
Purpose of the Study ........................................................................1
Research Questions ..........................................................................2
Significance of the Study .................................................................3
Method of Procedure........................................................................3
Definition of Terms..........................................................................4
Limitations and Delimitations ..........................................................5
Assumptions .....................................................................................6
2. REVIEW OF LITERATURE ..................................................................7
GPU Performance in General Purpose Applications .......................7
Advantages of Image Processing with GPU ..................................10
GPU in Medical Imaging ...............................................................10
ICA implementation with GPU .....................................................11
Blind Source Separation ................................................................11
Cocktail Party Problem ..................................................................12
3. METHOD ..............................................................................................14
JCUDA ...........................................................................17
WEKA............................................................................................18
WEKA Package Manager ..............................................................20
Design of the Study ........................................................................22
Infomax Strategy ............................................................................24
Entropy ...........................................................................................25
Gradient Ascent .............................................................................27
Gradient Ascent Algorithm for ICA ..............................................27
Matrix Determinant calculation in GPU ........................................27
Flow chart of ICA Infomax algorithm ...........................................30
4. EXPERIMENTAL RESULTS AND ANALYSIS ................................36
Implementation In Java ..................................................................36
Embedding ICA into WEKA ........................................................42
5. CONCLUSION AND FUTURE RESEARCH ......................................51
Future Study ...................................................................................52
REFERENCES ......................................................................................................53
APPENDICES .......................................................................................................57
Appendix
A. Java code for WEKA package: ICA in CPU and GPU .................58
VITA ......................................................................................................................77
LIST OF TABLES
TABLE
1. CPU and GPU execution time comparison for 30 iterations .....................39
2. CPU and GPU execution time comparison for 50 iterations .....................40
3. CPU and GPU execution time comparison for 100 iterations ...................41
LIST OF FIGURES
FIGURE
1. Cocktail Party Problem ..............................................................................13
2. CUDA Memory Architecture ....................................................................15
3. WEKA startup screen ................................................................................19
4. WEKA Explorer.........................................................................................20
5. Package Manager GUI of WEKA ..............................................................21
6. Infomax Strategy ........................................................................................25
7. Java implementation of Matrix Determinant .............................................29
8. Calculate Determinant Kernel Code ..........................................................30
9. Flow chart of Infomax ICA Algorithm ......................................................31
10. WEKA Explorer and File browser dialog box ...........................................32
11. WEKA Explorer Select attributes Tab and ICA option .............................33
12. ICA Attributes Window in the WEKA Explorer .......................................34
13. ICA Results in the WEKA Explorer ..........................................................34
14. CPU and GPU ICA execution comparison bar chart for 30 iterations ......40
15. CPU and GPU ICA execution comparison bar chart for 50 iterations ......41
16. CPU and GPU ICA execution comparison bar chart for 100 iterations ....42
17. Example Sound Signals .............................................................................46
18. Entropy (H) ................................................................................................47
19. Gradient Ascent (G) .................................................................................. 48
20. Extracted Sound Signals ............................................................................49
Chapter 1
INTRODUCTION
Statement of the Problem
Independent Component Analysis (ICA) is a blind source-separation method that has been applied in many fields. However, for large data such as fMRI, the computation time of ICA can be quite long, which necessitates computationally efficient implementations. This parallelizable method has so far been limited to CPU-based implementations and is not readily available to the machine learning community. WEKA (WEKA, 2009) is a machine learning and data mining library used by numerous scientists. Despite the popularity of WEKA, the neuroimaging analysis community has not yet used the versatile clustering, feature selection, machine learning, and data analysis tools found in WEKA. The most common analysis software packages used by the neuroimaging community (FSL, AFNI, and MATLAB-based SPM) lack machine learning tools. The MATLAB-based multi-voxel pattern analysis (MVPA) toolbox lacks an easy-to-use graphical user interface, requires an extensive amount of MATLAB scripting, and includes only a fraction of the machine learning methods available in WEKA. We aimed to develop a minimally dependent and platform-independent ICA implementation using Graphical Processing Units (GPUs) and extend the WEKA platform with this package.
Purpose of the Study
WEKA has not found use in the neuroimaging community for several reasons. WEKA has not been introduced to the neuroimaging community, and WEKA cannot handle NIFTI-1.1 (.nii, or nifti) or ANALYZE (.img/.hdr) images, the most commonly used
MRI/fMRI data formats in the neuroimaging community. NIFTI is currently the newer and more versatile of the two. WEKA's native data format is .arff, so the NIFTI and ANALYZE formats need to be converted into the .arff format, a conversion that is not intuitive for neuroscientists. Our software will close the gap between the neuroimaging research community and the versatile WEKA software by implementing CPU and GPU versions of ICA in a WEKA package.
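The conversion described above can be illustrated with a minimal sketch. The class name, the attribute naming scheme, and the voxel-by-time layout below are illustrative assumptions, not the thesis package's actual API; only the .arff header keywords (@relation, @attribute, @data) are WEKA's real format.

```java
import java.util.Locale;

// Hypothetical sketch: turn a voxel-by-time matrix into ARFF text, the format WEKA reads.
class ArffSketch {
    static String toArff(String relation, double[][] data) {
        StringBuilder sb = new StringBuilder();
        sb.append("@relation ").append(relation).append('\n');
        // one numeric attribute per time point (naming scheme "t0, t1, ..." is an assumption)
        for (int t = 0; t < data[0].length; t++) {
            sb.append("@attribute t").append(t).append(" numeric\n");
        }
        sb.append("@data\n");
        for (double[] voxel : data) {               // one data row per voxel
            for (int t = 0; t < voxel.length; t++) {
                if (t > 0) sb.append(',');
                sb.append(String.format(Locale.US, "%.6f", voxel[t]));
            }
            sb.append('\n');
        }
        return sb.toString();
    }
}
```

A real converter would additionally read the NIFTI header and voxel buffer, which this sketch deliberately omits.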
Additionally, in practical EEG analysis, the computational requirements are so large that ICA runs can take overnight or even more than a day. Hence, rapid iteration and examination of different procedures becomes completely impractical. To resolve this issue, a computationally efficient ICA implementation is needed. Our aim is to improve the speed of Infomax ICA. Although the GPU has some bottlenecks, such as copying data between CPU and GPU, vector-matrix and matrix-matrix operations account for almost all of the computational time; we therefore implemented these operations inside the GPU. This reduces the total computation time of the ICA algorithm considerably.
Research Questions
CPU-based implementations of ICA exist in many libraries. GPU-based ICA implementations have also been introduced, but they remain far from the machine learning and data mining community. Hence, we were motivated by these questions:
- How fast can ICA be sped up on a GPU?
- How can ICA be introduced to the machine learning and data mining community?
Significance of the Study
Our software will close the gap between the neuroimaging research community and this software by converting NIFTI data to the .arff format, which is the input format accepted by WEKA. We aim to include an ICA package in WEKA. The data conversion and ICA packages will be integrated into WEKA with a graphical user interface (GUI) to help neuroimaging researchers easily explore and apply the versatile algorithms of WEKA to their data.
The specific objectives of this work were as follows:
• Develop a data conversion package that converts NIFTI data to the .arff format readable by WEKA, and integrate it as a GUI package into WEKA's interface. The GUI will also let users open and visualize NIFTI data
• Implement the independent component analysis algorithm as a WEKA extension package and give users the ability to use a graphical processing unit (GPU) if supported by the hardware
• Make the developed packages publicly available for the research community
• Perform analysis on sample 4-D functional MRI NIFTI data sets from various fields, such as neuroimaging, chemistry, and biomedical engineering
Method of Procedure
WEKA is implemented in Java; therefore, all programs were implemented in Java. First, we created a package that converts the NIFTI-1.1 format to WEKA's .arff format. NIFTI is the most widely used functional imaging data format in the neuroimaging research community; it is a standardized way to store vector-valued imaging datasets over one-dimensional domains. The second objective involved integrating the ICA algorithm into WEKA. ICA is one of the
clustering/decomposition methods that have been used to decompose functional data into spatially independent components (each with a spatial map and associated time course), and it has been widely used during the last decade in fMRI research. ICs can be used as features for classification of subjects (e.g., schizophrenia vs. healthy).
We proposed to develop a package that applies the ICA algorithm to data in .arff format; the resulting components can then be used as features for subsequent classification analysis by WEKA and other software. Infomax-ICA is the most commonly employed ICA method for fMRI analysis. We implemented this version of the algorithm in our Java package so that it can be integrated into WEKA and used with the NIFTI format.
Since multi-subject 4-D NIFTI data is large (gigabytes), the main challenges in applying the ICA algorithm to this type of data are convergence and computational time. A GPU is therefore useful for speeding up the analysis. If a GPU is available in the system, our software will automatically detect and utilize it to extract the ICA components.
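As a rough illustration of the Infomax approach named above, the following is a didactic sketch of one Bell-Sejnowski weight update with the natural gradient, for a single sample and a logistic nonlinearity. It is not the thesis's CPU or GPU implementation; the class and method names are invented for the example.

```java
// Didactic sketch of one Infomax natural-gradient update for the unmixing matrix W:
// W <- W + lr * (I + (1 - 2*sigmoid(u)) u^T) W, where u = W x.
class InfomaxStep {
    // x: one observed mixture sample (length n); W: n-by-n unmixing matrix; lr: learning rate
    static double[][] update(double[][] W, double[] x, double lr) {
        int n = W.length;
        double[] u = new double[n];                 // u = W x, the current source estimate
        for (int i = 0; i < n; i++)
            for (int j = 0; j < n; j++) u[i] += W[i][j] * x[j];
        double[] y = new double[n];                 // y_i = 1 - 2*sigmoid(u_i)
        for (int i = 0; i < n; i++) y[i] = 1.0 - 2.0 / (1.0 + Math.exp(-u[i]));
        double[][] Wnew = new double[n][n];
        for (int i = 0; i < n; i++)
            for (int j = 0; j < n; j++) {
                double g = W[i][j];                 // the I*W term
                for (int k = 0; k < n; k++) g += y[i] * u[k] * W[k][j];  // (y u^T) W term
                Wnew[i][j] = W[i][j] + lr * g;
            }
        return Wnew;
    }
}
```

In practice the update is applied over many samples or batches until the change in W falls below a tolerance; the matrix-matrix products here are exactly the operations the thesis moves onto the GPU.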
Definition of Terms
fMRI – Functional magnetic resonance imaging (fMRI) is a procedure that can locate regions of the brain undergoing neural activity by detecting BOLD contrasts in the brain, which signify metabolic and hence neurologic activity.
Voxel – One point in the three-dimensional space of the brain, and therefore the smallest unit of data in fMRI analysis; it is analogous to a pixel, one point in two-dimensional space.
WEKA – Waikato Environment for Knowledge Analysis (WEKA) is a platform-independent software package, developed in response to the need for a unified workbench that would allow researchers easy access to up-to-date techniques in data mining and machine learning.
CUDA – Compute Unified Device Architecture, a platform developed by NVIDIA for programming the GPU.
JCUDA – Java Compute Unified Device Architecture, a set of Java bindings for calling CUDA code from Java.
Limitations and Delimitations
ICA can separate only linearly mixed sources. Since ICA deals with clouds of points, changing the order in which the points are plotted (the order of time points in EEG) has virtually no effect on the outcome of the algorithm. Changing the data order (for instance, swapping electrode locations in EEG) also has no effect; the ICA algorithm makes no a priori assumption about the order of observations. Since ICA separates sources by maximizing their non-Gaussianity, perfectly Gaussian sources cannot be separated. Even when the sources are not independent, ICA finds a space in which they are maximally independent. However, our ICA implementation is based on the CUDA architecture and is therefore specific to NVIDIA GPU devices; it does not work on AMD GPU devices. Also, in the GPU-specific ICA, the number of signals must be a multiple of four, since the minimum block size is set to four in our GPU code. The CPU-specific ICA can be used when the number of signals is not a multiple of four.
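The fallback rule described above can be sketched as a small helper. The class name and the way the device check is surfaced as a boolean are assumptions for illustration, not the thesis's actual dispatch code.

```java
// Hedged sketch: choose the GPU path only when an NVIDIA device is present and the
// signal count is a multiple of four (the minimum block size assumed in the text).
class IcaDispatch {
    static final int MIN_BLOCK = 4; // constraint from the text, not a CUDA constant

    static boolean canUseGpu(int numSignals, boolean nvidiaDevicePresent) {
        return nvidiaDevicePresent && numSignals % MIN_BLOCK == 0;
    }
}
```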
Assumptions
The key assumption in the use of ICA to separate a multivariate signal into subcomponents is that the subcomponents have been linearly mixed. There are also physiological and natural sources of noise that are not linear and can distort the results, but these are insignificant contributors.
Chapter 2
REVIEW OF LITERATURE
To perform this study, it is necessary to understand how large calculations can be accelerated through parallel processing on GPUs, how GPUs can be used efficiently to implement general matrix calculations, and how GPUs are used in machine learning algorithms for data mining and feature extraction. By providing a survey of related literature, this chapter motivates the study. Among the large body of literature on graphical processing units and their applications, this review summarizes the state of the art in GPU acceleration of computation on large data and its use in machine learning algorithms that operate on vector datasets. The chapter is organized as follows: first, literature on how general-purpose applications perform when implemented on GPUs; second, advantages of image processing with GPUs; third, GPUs in medical imaging; fourth, blind source separation; and finally, the cocktail party problem.
GPU Performance in General Purpose Applications
The technique of using graphics processing units for general-purpose applications is known as General-Purpose computing on Graphics Processing Units (GPGPU). Mark Harris coined the term GPGPU (Harris, 2002) and founded GPGPU.org when he recognized an early trend of using GPUs for non-graphics applications. GPGPU maps general algorithms to graphics hardware to exploit its data throughput and high computational capability. Much recent work has focused on using GPUs in general-purpose applications.
Once GPUs became faster at rendering early computer graphics, demand for more advanced techniques increased. New pipeline stages for these effects were added to the application programming interfaces (APIs) and quickly implemented in hardware. At that time, most functionality consisted of fixed functions tied to existing API calls and implemented in hardware. Each new effect required API and hardware changes, greatly limiting the possibilities open to graphics programmers.
This limitation resulted in a major architectural shift in GPUs. Small, moderately specialized processors replaced highly specialized function units. Such hardware contained some processors for vertex-specific calculations and a larger number of processors for pixel-specific calculations (as there are more pixels present than vertices). Simple, limited programmable execution units replaced completely fixed functions. This greatly increased the flexibility of the hardware and allowed GPU vendors to scale horizontally: adding more of these vertex and pixel processors, as well as other fixed-function hardware, could increase performance.
However, balancing the number of pixel and vertex processors is difficult, because the workload distribution depends on the individual application and scene. Generally, many more pixels are drawn per frame than vertices are calculated, but in some situations a scene may contain few vertices while many pixels are still drawn. In such a situation, almost all the processing power of the vertex processors is wasted, since they are idle most of the time. In contrast, a very complex model may keep all vertex processors busy but not produce enough pixels for all pixel processors, leaving some of them idle.
Since this led to a suboptimal and uneven work distribution, this aspect of the architecture was refined further. The specialized processor types, each with its own special instructions, were unified into one less specialized processor type able to handle all of these tasks. The processors became more complex but also more flexible. The refined architecture is efficient even under changing workloads, as the different tasks can be distributed over all processors.
A study by Che, Meng, Sheaffer, and Skadron (2008) on the performance of general-purpose applications on graphics hardware showed that computationally demanding applications such as traffic simulation, thermal simulation, and k-means benefit from parallel computing capabilities. They used the high-level programming language CUDA (NVIDIA, 2007) to analyze the performance of these applications. They showed that their applications were accelerated on the GPU, demonstrating a speed-up of 40x with generally available GPUs compared with a CPU implementation. They also examined the performance of their applications, noting the advantages and inefficiencies of the programming model.
The first version of the CUDA language was released in February 2007 by NVIDIA Corporation to enable general-purpose computing on a GPU (Sahinidis & Vouzis, 2010). The CUDA memory architecture contains five main types of memory: global memory, constant memory, texture memory, shared memory, and registers (Sanders & Kandrot, 2010). The memories available in CUDA are depicted in Figure 2. Global memory is the main means of communication in a GPU using CUDA; it has the lifetime of the application, and all threads in the grid can access the data stored in it. The size of global memory ranges from 256 MB to 6 GB as of March 2014. Shared memory and registers are on-chip memories; the size of shared memory is restricted to 48 KB per multiprocessor (NVIDIA, 2007). Constant memory and texture memory are parts of global memory, and all three are together known as device memory. The size of constant memory is restricted to 64 KB (NVIDIA, 2007).
Advantages of Image Processing with GPU
Image processing applications are used in various fields such as biomedical imaging, remote-sensed image interpretation, defense surveillance, moving-object tracking, automatic visual inspection systems, and others. Image processing draws on many specific algorithms, such as sorting and searching, graph algorithms, and statistical methods. Many of these are computationally intensive and are therefore well suited to parallel processing.
GPU in Medical Imaging
The term medical imaging covers the techniques and processes used to create images of the body (human or animal) for clinical purposes or for the study of biomedical dynamics, such as anatomy and physiology. With the rapid development of high-performance computing and the recent programmability of graphics hardware, the GPU has evolved into a compelling platform for a wide range of computationally demanding tasks, such as medical image processing, dose calculation and treatment plan optimization, and computer vision. Nowadays, GPUs are one of the standard tools in high-performance computing and are being widely adopted throughout industry and academia; many researchers and developers have become interested in utilizing the power of GPUs for general-purpose computing. The crucial advantages of GPU-based medical imaging are high-throughput computing, high memory bandwidth, support for 32-bit floating-point arithmetic, an excellent price-to-performance ratio, and specialized hardware for interpolation.
Independent Component Analysis implementation with GPU
Independent Component Analysis is one of the most effective methods for blind source separation. The most emblematic example is the separation of mixed audio signals recorded in a noisy environment. In the last few years, GPUs have moved from graphics-specific to general-purpose applications, which makes it possible to increase the computational efficiency of algorithms like ICA. GPUs operate on the single-instruction, multiple-threads (SIMT) model: a GPU has many multiprocessor cores that run the same instruction over large data by executing many threads in parallel.
To my knowledge, there are two implementations of the ICA algorithm on the GPU. The first is "Efficient Independent Component Analysis on a GPU" by Ramalho, Tomas, and Sousa (2010), who implemented the FastICA algorithm on the graphical processing unit. The second is "CUDAICA: GPU Optimization of Infomax-ICA EEG Analysis" by Raimondo (2012), who implemented the Infomax ICA algorithm on the GPU. Both implementations utilize the GPU's high parallel processing capability to achieve the best ICA computation times.
Blind Source Separation
In wireless communication, as many signals are introduced into the environment, there is a high chance of overlapping and mixing of signals. When the receiver receives the mixed signals, it is very difficult to demodulate them because of the influence of interfering signals on the original signal. In many applications, the ability to correctly demodulate the received signals affects the communication between source and receiver, so the ability to separate the received mixed signals plays a major role. The process of separating the source signals from a mixture of signals is called blind source separation.
Blind Source Separation (BSS) is the process of extracting source signals from observed signal mixtures with little or no information about the nature of the source signals. BSS has a variety of applications, including neural imaging, economic analysis, and signal processing. A classic example of blind source separation is the cocktail party problem.
Cocktail Party Problem
The cocktail party problem (Figure 1) considers the example of a room of speakers talking simultaneously, each near a microphone. Because the speakers talk at the same time, each microphone records a mixture of all the voices. The problem is to separate each speaker's voice signal using only the recorded mixtures of their voices. With n attendees at the cocktail party, separating the voices from the mixtures is complicated. ICA is one technique for blind source separation: since the speakers speak independently, we can separate the signals with Independent Component Analysis by assuming that the source signals are linearly mixed.
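The linear mixing assumption behind this setup can be stated compactly. Writing the microphone recordings as a vector x and the unknown source voices as s (the symbols here are chosen for illustration; the text above does not fix a notation), ICA assumes

```latex
\mathbf{x} = A\,\mathbf{s}, \qquad \hat{\mathbf{s}} = W\mathbf{x}, \qquad W \approx A^{-1},
```

where A is the unknown mixing matrix and W is the unmixing matrix that ICA estimates, with the recovered sources determined only up to permutation and scaling.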
Figure 1. Cocktail Party Problem
Chapter 3
METHOD
This study implemented Independent Component Analysis on graphical processing units to separate independent signals and then embedded it into WEKA, a collection of machine learning and data mining algorithms. The study was carried out using CUDA (NVIDIA, 2007), a programming language that enables parallel processing on a GPU in combination with C-language functions on the host side to invoke the GPU. To make the implementation platform independent, we call the CUDA functions from Java through JCUDA, a framework that connects Java to CUDA. This study also required installed GPU drivers in order to use NVIDIA GPUs (NVIDIA, 2007).
The CUDA execution model is organized into grids, blocks, and threads. The blocks and threads available on a GPU enable parallel processing, as each thread acts as a separate computing unit. A grid is organized as a 2-D array of blocks, which can be indexed in one or two dimensions, and each block is organized as a 3-D array of threads (Figure 2) (NVIDIA, 2007). A single grid dimension can hold up to 65,535 blocks, and each block has a unique ID known as the block ID. There is no synchronization mechanism between two blocks. The number of threads in a block cannot exceed 1,024 across all three dimensions (NVIDIA, 2007). A thread is the fundamental execution unit in CUDA; each thread has a unique thread ID within its block to distinguish it from the others and to access the appropriate data to process. In a GPU, multiple threads run at
the same time, which enables parallel processing. Threads of the same block can synchronize with each other to coordinate access to data; they are synchronized using the __syncthreads() function in GPU code.
Figure 2. CUDA Memory Architecture, courtesy of NVIDIA.
A GPU consists of two types of processors: streaming multiprocessors and streaming processors. A streaming multiprocessor contains a set of streaming processors. The Tesla C2075 graphics card used in this study contains 14 streaming multiprocessors, each containing 32 streaming processors (Green, 2008). The GPU is
implemented as a set of multiprocessors. One or more blocks are assigned to each multiprocessor and executed using time slicing (NVIDIA, 2007). Each block is further divided into SIMD (single instruction, multiple data) groups of threads called warps. The way a block is divided into warps is always the same: each warp contains threads with consecutive, increasing thread IDs, and all warps contain the same number of threads, known as the warp size, which is 32 threads (Kirk & Hwu, 2010). These warps are executed in SIMD fashion. Each multiprocessor has a warp scheduler that selects the warp to be executed on the streaming processors; at any given time, only one warp is executed per multiprocessor.
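The indexing rules above can be mirrored in plain Java arithmetic. This is a host-side illustration only, not device code; the formulas correspond to CUDA's built-in blockIdx, blockDim, and threadIdx variables for a 1-D launch and to the fixed warp size of 32.

```java
// Illustration of CUDA thread and warp indexing arithmetic in plain Java.
class CudaIndexing {
    static final int WARP_SIZE = 32; // fixed warp size discussed in the text

    // global thread id for a 1-D launch: blockIdx.x * blockDim.x + threadIdx.x
    static int globalThreadId(int blockIdx, int blockDim, int threadIdx) {
        return blockIdx * blockDim + threadIdx;
    }

    // warp number of a thread within its block: consecutive ids, 32 threads per warp
    static int warpIdWithinBlock(int threadIdx) {
        return threadIdx / WARP_SIZE;
    }
}
```

For example, thread 5 of block 2 in a launch with 256 threads per block has global id 2*256 + 5 = 517, and threads 0–31 of a block form warp 0 while thread 32 starts warp 1.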
A program containing CUDA functions and C code together can be compiled using NVCC, a compiler that simplifies the process of compiling CUDA code. The stages of the NVCC compiler include separating, preprocessing, compiling, and linking (NVIDIA, 2007). The objective of NVCC is to separate the code running on the GPU from the code running on the CPU and to compile the GPU code into binary form. The CPU code is compiled using the native compiler of the operating system, and both compiled files are then linked into one executable that runs the GPU code in parallel on the GPU and the serial code on the CPU. The NVCC compiler is provided with the CUDA SDK, available for free on NVIDIA's website. To execute the code from Java, we need a .ptx file of the CUDA code; we use NVCC to generate the .ptx file from .cu files. We can then call the functions present in the .ptx file from Java with the help of the JCUDA libraries. Generally, we refer to the CUDA code as the kernel to be executed on the GPU.
JCUDA
JCUDA stands for Java Compute Unified Device Architecture; it lets us call CUDA kernels from Java through a set of binding libraries. CUDA provides two different APIs, the Runtime API and the Driver API. Both APIs are similar for basic tasks like memory handling, and from CUDA 3.0 onward they are interoperable and can be mixed to some extent. The important difference between the APIs is how kernels are managed and executed. In the original CUDA Runtime API, kernels are defined and compiled together with the C files: the NVIDIA CUDA compiler, NVCC (NVIDIA, 2007), which uses another C compiler for the plain C parts of the source code, compiles the source and takes care of CUDA-specific parts such as kernels and kernel<<<…>>> calls.
However, NVCC cannot be used to compile a Java program: the kernel<<<…>>> call syntax cannot be used in Java, and after compilation there is no single executable file as there is in CUDA. Thus, it is not possible to call our own CUDA kernels with the JCUDA Runtime API; instead, the JCUDA Driver API has to be used for launching the kernel.
The JCUDA Runtime API is mainly intended for interaction with the Java bindings of the CUDA runtime libraries like JCublas and JCufft. If we want to use these libraries instead of creating our own kernels, we can use the JCUDA Runtime API to interact with them, but they provide only certain predefined functions. For fully customized code we have to write our own kernels.
Kernel code can be written exactly as it is written for CUDA. Usually, the kernel code is located in an individual file; in contrast with CUDA, our kernel code is separated from the main code for JCUDA, which provides modularity and individual code maintenance. NVCC compiles the kernel source code into a file that is then loaded and executed using the Driver API. There are basically two forms into which the kernel can be compiled. One option is a PTX file, a human-readable file containing a specific form of assembler source code. The second option is a CUBIN file, a CUDA binary containing compiled code that can be loaded and executed by a specific GPU. CUBIN files have an important drawback: they are specific to the compute capability of the GPU, so a CUBIN file created for one compute capability cannot be loaded on a GPU with a different compute capability. Hence we chose the .ptx option in our implementation. A PTX file is created with the simple command "nvcc -ptx filename.cu -o filename.ptx".
To load a compiled kernel we use the cuModuleLoad(moduleObject, PtxFile) function of the JCUDA Driver API, and to obtain a handle to a specific function we use cuModuleGetFunction(functionObject, moduleObject, functionName). In Java, the number of pointer indirections has to be verified carefully: we have to create a pointer to each of the kernel parameters, which are themselves pointers, so we need a pointer to pointers, i.e., a double pointer. Even though it may look difficult to pass a double pointer to the kernel, once a prototype is established we can use the same prototype for all kernel launches.
WEKA
WEKA is a collection of machine learning algorithms for data mining tasks. WEKA contains tools for data pre-processing, classification, regression, clustering, association rules, and visualization, and it is also well suited for developing new machine learning algorithms. WEKA is a software product of the University of Waikato, New Zealand, and was first implemented in its modern form in 1997. It uses the GNU General Public License (GPL). The software is written in the Java language and contains a GUI (Figure 3) for interacting with data files and producing visual results with tables and plots. It also has a general API, so WEKA can be embedded, like any other library, in one's own applications to do such things as automated server-side data-mining tasks.
Figure 3. WEKA startup screen (WEKA, 2009)
When WEKA starts, the GUI Chooser pops up and lets the user choose among four ways to work with WEKA and the data. We usually choose and work with the Explorer option (Figure 4).
Figure 4. WEKA Explorer (WEKA, 2009)
WEKA Package Manager
WEKA has the concept of packages as bundles of additional functionality that is not provided in WEKA's main jar file. A package consists of jar files, documentation, source code, and metadata. Packaging simplifies the core WEKA system and allows users to install only the specific additional functionality they require. It also provides a simple and reliable mechanism for contributing to WEKA.
WEKA includes a facility for the management of packages and a mechanism to load them dynamically at runtime. There are both command-line and GUI package managers. We use the weka.core.WekaPackageManager class to run the command-line package manager, which has the following options.
Usage: java weka.core.WekaPackageManager [option]
Options:
-list-packages <all | installed | available>
-package-info <repository | installed | archive> packageName
-install-package <packageName | packageZip | URL> [version]
-uninstall-package <packageName>
-refresh-cache
As well as the command-line client, there is also a graphical user interface to WEKA's package management system, available from the Tools menu of the GUI Chooser window. All the functionality of the command-line mode is also available in this GUI, along with the option to install or uninstall multiple packages at a time (Figure 5).
Figure 5. Package Manager GUI of WEKA (WEKA, 2009)
At the very top of the window there are three buttons. On the left-hand side is a button that refreshes the cached copy of the package repository metadata. The two buttons at the top right install and remove packages, respectively. Multiple packages can be installed or uninstalled by using a shift left-click combination to select a range of packages.
The package list in the above window is the official package list from WEKA; we have provided ICA as an unofficial package. It can be installed by clicking the "File/Url" button at the top right of the package manager window, which brings up an unofficial-package install dialog where we select the developed ICA zip file by browsing the file system. This installs the ICA package into WEKA; after installation completes, WEKA must be restarted to bring the installed packages into effect.
Design of the Study
Independent Component Analysis (ICA) attempts to extract signals y from the M mixed signals x by optimizing unmixing matrix W.
y = W x
Herault, Jutten, and Ans (1984), neurophysiologists, first developed the principle of ICA while studying muscle contraction in the early 1980s. They observed two responses x1(t) and x2(t), from which they obtained position and velocity signals y1(t) and y2(t). With the help of nonlinear decorrelation, they showed that independent components can be recovered as nonlinearly uncorrelated linear combinations. While this was the first known form of ICA, it handled only two signals and was much slower than present ICA techniques. In other early ICA work, Cichocki, Unbehauen, and Rummert (1994) developed an algorithm to solve the above y = W x equation and applied it in neural networks.
In the early 1990s, Principal Component Analysis (PCA) and Projection Pursuit, methods similar to ICA, were applied to blind source separation. Principal Component Analysis deals with sources that are Gaussian and uncorrelated, in contrast to ICA, which deals with non-Gaussian, independent signals. Uncorrelatedness is a weaker property than independence, but PCA is computationally less demanding than ICA, so PCA can be used as a precursor to the ICA algorithm. Projection Pursuit seeks one independent component at a time from a set of mixed signals by maximizing the kurtosis to find the most non-Gaussian signal. It thus differs from ICA in that Projection Pursuit extracts only one signal at a time, whereas ICA extracts all the signals at once.
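Projection Pursuit's non-Gaussianity criterion can be illustrated with a small pure-Java sketch (not thesis code; `Kurtosis` is a hypothetical class). The function returns excess kurtosis, which is zero for a Gaussian signal, positive for "peaky" signals, and negative for flat, sub-Gaussian signals:

```java
// Sketch: sample excess kurtosis, the non-Gaussianity measure that
// Projection Pursuit maximizes one component at a time.
public class Kurtosis {
    static double kurtosis(double[] x) {
        int n = x.length;
        double mean = 0;
        for (double v : x) mean += v;
        mean /= n;
        double m2 = 0, m4 = 0; // second and fourth central moments
        for (double v : x) {
            double d = v - mean;
            m2 += d * d;
            m4 += d * d * d * d;
        }
        m2 /= n;
        m4 /= n;
        return m4 / (m2 * m2) - 3.0; // 0 for a Gaussian signal
    }

    public static void main(String[] args) {
        // A uniformly distributed signal is sub-Gaussian: excess kurtosis < 0.
        double[] uniform = new double[10000];
        java.util.Random r = new java.util.Random(42);
        for (int i = 0; i < uniform.length; i++) uniform[i] = r.nextDouble();
        System.out.println(kurtosis(uniform) < 0); // true
    }
}
```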
In 1995, Bell and Sejnowski (1995) developed an information-maximization approach to blind separation and blind deconvolution, which is now referred to as Infomax. Infomax separates mutually independent signals by maximizing entropy, or information flow. Infomax yields the same result as another ICA technique, Maximum Likelihood Estimation (MLE) (Stone, 2004). MLE optimizes the parameter values, i.e., W, to find the best fit of the extracted signals y to the source signals s.
In Infomax, MLE, Projection Pursuit, and ICA generally, optimization is achieved by gradient ascent. Gradient ascent optimizes a function of multiple parameters by iteratively improving an initial guess using the gradient, which points in the direction of maximum slope. A disadvantage of this method is that if the step size is not chosen appropriately, the function does not converge properly. Also, if the initial guess is near a local maximum, there is a high chance of converging to the local maximum instead of the global maximum. Both of these cases produce erroneous results.
Slightly more advanced methods of ICA include Complexity Pursuit and FastICA (Ramalho, Tomas, & Sousa, 2010). Complexity Pursuit extracts the signals with the least complexity, since mixture signals are generally more complex than the source signals. FastICA uses a fixed-point algorithm in lieu of gradient ascent to perform more efficient calculations.
Recent extensions of the ICA model include applications to noisy environments, cases with fewer mixtures than independent components, and mixtures created by convolution. These are of special interest for using ICA in telecommunications, extending ICA and blind source separation to code-division multiplexing applications.
Infomax Strategy
Infomax is an ICA method rooted in information theory, which aims to find the independent source signals by maximizing entropy. The general Infomax strategy begins with the equation y = W x, where the extracted signals y are obtained from the source mixture signals x after optimization of the unmixing matrix W. Infomax holds that the extracted signals are the source signals if they are mutually independent. As independence is not directly measurable, we measure it through entropy: maximum entropy implies independent signals. Therefore, the objective of ICA is to optimize the unmixing matrix W by maximizing the entropy of the extracted signals y.
The entropy of the signal mixtures x is constant, but the change in entropy can be maximized by mapping the signals y = W x to an alternate set of signals Y = g(y) = g(W x). This mapping spreads out Y so that the change in entropy from x to Y can be maximized by optimizing the unmixing matrix W; when the entropy is maximized, the resulting signals are independent (Figure 6). The inverse y = g-1(Y) is then taken, so the extracted signals are also independent, and since they are independent they must be the source signals s. The Infomax strategy is depicted graphically in Figure 6.
Figure 6. Infomax Strategy
Mutual independence means that the outcome of one event has no influence on another event. A function y = f(x) is invertible if every value of x corresponds to only one value of y.
Entropy
In 1948, Claude Shannon introduced the concept of information entropy as a measure of the uncertainty associated with a random variable (Shannon, 1948). Stone (2004) describes entropy as a measure of the uniformity of a distribution, such that complete uniformity equals maximum entropy. Some also describe entropy as average surprise or average information.
Entropy for a set of events A with probabilities p(a_i) is H(A) = -Σ_i p(a_i) log2 p(a_i).
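Shannon's definition can be illustrated with a small Java sketch (not thesis code; `Entropy` is a hypothetical class):

```java
// Sketch: Shannon entropy of a discrete probability distribution,
// H(A) = -sum p(a_i) * log2 p(a_i), as introduced by Shannon (1948).
public class Entropy {
    static double shannon(double[] p) {
        double h = 0;
        for (double pi : p) {
            if (pi > 0) h -= pi * (Math.log(pi) / Math.log(2)); // log base 2 -> bits
        }
        return h;
    }

    public static void main(String[] args) {
        // Complete uniformity gives maximum entropy: 4 equal outcomes -> 2 bits.
        System.out.println(shannon(new double[]{0.25, 0.25, 0.25, 0.25})); // 2.0
        // A certain outcome carries no surprise: entropy 0.
        System.out.println(shannon(new double[]{1.0, 0.0, 0.0}));          // 0.0
    }
}
```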
The entropy of a multivariate probability density function Y is given as H(Y) = -∫ p(Y) ln p(Y) dY.
The Infomax expression for entropy (Stone, 2004) is H(Y) ≈ (1/N) Σ_t Σ_i ln p_s(y_i(t)) + ln |det W|, where p_s is the model source probability density function.
The W that maximizes this expression maximizes the entropy in Y, implying the rows of Y are independent. Since y is recovered from Y through the inverse mapping g-1, the rows of y are also independent, which makes W the unmixing matrix that recovers the original signals. To apply this entropy equation, we rely on the following properties and assumptions.
Properties
1. Bounded signals with a uniform probability density function have maximum Joint Entropy.
2. Signals with the Maximum Joint Entropy are Mutually Independent.
3. Any invertible function of Mutually Independent signals yields Mutually Independent Signals.
4. If a function is invertible, its Inverse is also invertible.
Assumptions
1. All time samples of each signal are independent.
2. All source signals can be approximated by the same probability density function.
3. The model probability density function is an exact match for the probability density function of the source signals. 27
Gradient Ascent
The objective of ICA is to find an unmixing matrix W that maximizes the entropy of Y, that is H(Y), where Y = g(y) = g(W x). Gradient ascent is the method used to optimize the unmixing matrix W: an iterative process of taking a step in the direction of maximum gradient until a local maximum is reached. Gradient ascent requires an expression for the gradient of entropy, ∇H.
Gradient Ascent algorithm for ICA
The optimal unmixing matrix W is found by maximizing entropy, that is, by iteratively following the gradient until a local maximum is reached. This is accomplished with the following update rule.
W_new = W + η·∇H, where η is a small constant (the step size). This equation is the general form of the Infomax algorithm. However, gradient ascent only finds a local maximum instead of the global maximum, so it is important to run multiple trials of the gradient ascent algorithm with different starting points.
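Gradient ascent can be illustrated with a one-parameter Java sketch (not thesis code; `GradientAscent` is a hypothetical class). The function f(w) = -(w - 3)^2 has its maximum at w = 3; the update w = w + η·f'(w) converges for a small η but overshoots and diverges when η is too large, the step-size pitfall described above:

```java
// Sketch: gradient ascent on f(w) = -(w - 3)^2, whose maximum is at w = 3.
// The same idea -- iteratively stepping along the gradient scaled by a small
// constant eta -- underlies the Infomax update W = W + eta * gradient.
public class GradientAscent {
    static double ascend(double w0, double eta, int iters) {
        double w = w0;
        for (int i = 0; i < iters; i++) {
            double grad = -2 * (w - 3); // derivative of f, points toward the maximum
            w += eta * grad;            // take a step of size eta along it
        }
        return w;
    }

    public static void main(String[] args) {
        // From a starting guess of 0, the iterate converges to the maximum at 3.
        System.out.println(ascend(0.0, 0.1, 200));
        // Too large a step size makes the iteration overshoot and diverge.
        System.out.println(ascend(0.0, 1.1, 200));
    }
}
```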
Matrix Determinant calculation in GPU
In Independent Component Analysis, matrix-matrix operations such as determinant and inverse calculations take most of the ICA execution time, so we concentrated on performing these computationally expensive calculations inside the GPU with the help of the JCUDA architecture. We wrote kernels in CUDA to run inside the GPU and invoked them from Java.
For the determinant calculation, we reduce the lower triangle of the matrix to zero; multiplying the diagonal elements then gives the determinant. All the operations on the matrix are written inside kernels, and we call the kernels from the Java code as shown below.
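The triangularize-then-multiply-the-diagonal idea can be sketched on the CPU in plain Java (an illustration, not the thesis GPU kernels, which organize the elimination block-wise; `Determinant` is a hypothetical class, and partial pivoting is added here for numerical robustness):

```java
// CPU sketch of the GPU approach: Gaussian elimination zeroes the lower
// triangle, after which the determinant is the product of the diagonal.
public class Determinant {
    static double det(double[][] a) {
        int n = a.length;
        double[][] m = new double[n][];
        for (int i = 0; i < n; i++) m[i] = a[i].clone(); // work on a copy
        double det = 1.0;
        for (int k = 0; k < n; k++) {
            // Partial pivoting: bring the largest remaining pivot to row k.
            int p = k;
            for (int i = k + 1; i < n; i++)
                if (Math.abs(m[i][k]) > Math.abs(m[p][k])) p = i;
            if (m[p][k] == 0.0) return 0.0; // whole column zero: singular matrix
            if (p != k) { double[] t = m[p]; m[p] = m[k]; m[k] = t; det = -det; } // swap flips sign
            for (int i = k + 1; i < n; i++) {
                double factor = m[i][k] / m[k][k];
                for (int j = k; j < n; j++) m[i][j] -= factor * m[k][j]; // zero below pivot
            }
            det *= m[k][k]; // multiply the diagonal as we go
        }
        return det;
    }

    public static void main(String[] args) {
        System.out.println(det(new double[][]{{2, 1}, {1, 3}})); // 2*3 - 1*1 = 5.0
    }
}
```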
Java Code for the Matrix Determinant
We implemented the matrix determinant algorithm in Java as shown in the code below (Figure 7).
// Copy the working matrices between GPU buffers (device-to-device).
cudaMemcpy(dDetW, dw, sizeWInBytes, cudaMemcpyDeviceToDevice);
cudaMemcpy(dwcopy, dw, sizeWInBytes, cudaMemcpyDeviceToDevice);

// Reduce the matrix block by block, zeroing the lower triangle (L).
for (int i = 0; i < rW; i += BLOCKSIZE)
{
    int offset = i * rW + i;

    // Step 1: calculate the triangle block and store the pivot
    // elements in the left part of the triangle.
    eliminateBlockL.setup(one, nThreads).call(at(dw, offset), rW);
    cudaThreadSynchronize();

    // Step 2: calculate the rest of the rows with the pivot
    // elements from step 1.
    adjustRowL.setup(nBlocks, nThreads).call(
        at(dw, i * rW), at(dw, offset),
        at(dDetW, i * rW), rW, i);
    cudaThreadSynchronize();

    // Step 3: fill the columns below the block with the pivot elements.
    // They are used to zero the columns and multiply with the row.
    eliminateColL.setup(nBlocks, nThreads).call(at(dw, i), rW, i);
    cudaThreadSynchronize();

    // Step 4: adjust the rest of the matrix with the calculated pivot
    // elements: El_new_0 -= (p0 + p1 + ... + p15) * El_piv_0
    eliminateRestL.setup(nBlocksRest, nThreads).call(dw, dDetW, rW, i);
    cudaThreadSynchronize();
}

// Calculate the determinant inside the GPU by multiplying the dDetW diagonal.
calDet.setup(one, one).call(dDetW, rW, dDetWValue);
cudaThreadSynchronize();

// Restore the original matrix from the copy.
cudaMemcpy(dw, dwcopy, sizeWInBytes, cudaMemcpyDeviceToDevice);
Figure 7. Java implementation of the matrix determinant
In the above code (Figure 7), eliminateBlockL, adjustRowL, eliminateColL, eliminateRestL, and calDet are the kernels. We call these kernels to perform operations inside the GPU. The kernel code is written in CUDA; the calDet kernel code is given below (Figure 8).
extern "C"
{
    __global__ void calDet(float *dDetW, int size, double *dDetWValue)
    {
        int tx = threadIdx.x;
        int ty = threadIdx.y;
        // A single thread multiplies the diagonal elements of the
        // triangularized matrix to produce the determinant.
        if (tx == 0 && ty == 0)
        {
            double detW = 1;
            for (int i = 0; i < size; i++)
            {
                detW = detW * dDetW[i * size + i];
            }
            dDetWValue[0] = detW;
        }
    }
}
Figure 8. Calculate-determinant kernel code
The above kernel code multiplies the diagonal elements of the matrix, which resides inside the GPU. When the lower triangle of the matrix is zero, multiplying the diagonal elements yields the matrix determinant. Similarly, we can implement matrix inversion inside the GPU.
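As a CPU sketch of the inversion (not Wagner's GPU kernels; `Inverse` is a hypothetical class, and pivoting is omitted for brevity), Gauss-Jordan elimination on the augmented matrix [A | I] leaves [I | A^-1]:

```java
// Sketch: matrix inversion by Gauss-Jordan elimination on [A | I].
public class Inverse {
    static double[][] invert(double[][] a) {
        int n = a.length;
        double[][] aug = new double[n][2 * n];
        for (int i = 0; i < n; i++) {
            System.arraycopy(a[i], 0, aug[i], 0, n);
            aug[i][n + i] = 1.0; // append the identity matrix
        }
        for (int k = 0; k < n; k++) {
            double pivot = aug[k][k];
            if (pivot == 0.0) throw new ArithmeticException("zero pivot"); // no pivoting here
            for (int j = 0; j < 2 * n; j++) aug[k][j] /= pivot; // normalize pivot row
            for (int i = 0; i < n; i++) {
                if (i == k) continue;
                double factor = aug[i][k];
                for (int j = 0; j < 2 * n; j++) aug[i][j] -= factor * aug[k][j]; // clear column
            }
        }
        double[][] inv = new double[n][n];
        for (int i = 0; i < n; i++) System.arraycopy(aug[i], n, inv[i], 0, n);
        return inv;
    }

    public static void main(String[] args) {
        double[][] inv = invert(new double[][]{{4, 7}, {2, 6}});
        System.out.println(inv[0][0] + " " + inv[0][1]); // approximately 0.6 and -0.7
    }
}
```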
Flowchart of Infomax ICA algorithm
Below is the Infomax ICA algorithm flowchart (Figure 9), in which the CPU serial execution blocks are shown in blue and the GPU parallel execution blocks in purple.
Figure 9. Flowchart of Infomax ICA Algorithm (Stone, 2004)
[Flowchart steps: W is initialized as the identity matrix; the extracted source signals are computed as y = xW from the input signal matrix x; Y = tanh(y) and detW = |W| are computed; the entropy H is evaluated from Y and log detW; the gradient G = W^-1 - (2/N) x^T Y is computed; if the entropy is not yet maximized, W is updated as W = W + η·G and the loop repeats. The serial steps execute on the CPU (blue); the matrix-heavy steps execute on the GPU (purple).]
In the above algorithm, we performed matrix multiplication, matrix determinant, and matrix inversion inside the GPU, as these are the computationally costly operations of the ICA Infomax algorithm. Performing these high-cost operations in parallel on the GPU reduces the total execution time of the ICA algorithm.
WEKA Results
Once the ICA package is installed into WEKA with the help of the package manager, we can perform ICA on input data. We open the WEKA Explorer to load the input file, which contains the source signal mixtures. In the WEKA Explorer we click the "Open file…" button at the top of the Explorer window; a dialog box opens, and we select the input file from the file browser. The WEKA Explorer and file chooser dialog box are shown in Figure 10.
Figure 10. WEKA Explorer and File browser dialog box
ICA is provided in the Select attributes tab, which is enabled automatically once the input file is loaded. We click the Choose button of the Attribute Evaluator to select ICA, as shown in Figure 11.
Figure 11. Select attributes tab in the WEKA Explorer and ICA option we implemented
Once ICA is selected, WEKA selects the Ranker search method by default and confirms this with the user; this search method has no effect on the outcome of the ICA analysis. After selecting ICA, we provide attributes such as the maximum number of iterations, whether to use the GPU or only the CPU, and the step size η. We can provide these attribute values by clicking on ICA, with its default parameters, next to the Choose button of the Attribute Evaluator; a window then appears for entering the attribute values, as shown in Figure 12.
Figure 12. ICA Attributes Window in the WEKA Explorer
After providing the attribute values, we click the Start button on the left side to begin the Independent Component Analysis. Once the analysis completes, the results are displayed in the window as shown in Figure 13.
Figure 13. ICA Results in the WEKA Explorer
We can perform Independent Component Analysis with different attribute values, and we can change the input in the Preprocess tab to perform ICA on different input signals.
Chapter 4
EXPERIMENTAL RESULTS AND DISCUSSION
After careful development and verification of the ICA algorithm and code, the programs were executed on the Central Processing Unit and, when available on the system, the Graphical Processing Unit. This chapter presents the ICA algorithm running times on the CPU and the GPU for various numbers of input signal mixtures and explains the significance of the results. Overall, the results illustrate the use of GPUs in ICA and the speed with which heavy computations can be processed. Finally, we embedded this code into the machine-learning tool WEKA as an additional package, to introduce WEKA to the neuroimaging field by enabling Independent Component Analysis on fMRI images within WEKA.
In the previous chapter, we specified the equations for the entropy and its gradient, which depend upon a model probability density function of the source signals. The extracted signals are transformed through a model cumulative distribution function, Y = g(y). Stone (2004) notes that this choice of cumulative distribution function determines the probability density function that Infomax assumes for the sources, so with careful selection of the model, the Infomax method is tuned toward the desired type of extracted signal. We therefore chose the function g to be the hyperbolic tangent, tanh().
Y = g (y) = tanh (y)
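As a minimal pure-Java sketch (not the thesis implementation, which runs these steps on the GPU; `InfomaxStep` is a hypothetical class), one Infomax gradient step with g = tanh can be written for two mixtures, following the flowchart's update rule y = xW, Y = tanh(y), G = W^-1 - (2/N) x^T Y, W = W + η·G; the 2×2 inverse and matrix sizes are hard-coded for brevity:

```java
// Sketch: a single Infomax gradient step for two mixed signals.
public class InfomaxStep {
    // x is N x 2 (two mixed signals in columns), w is 2 x 2, eta is the step size.
    static double[][] step(double[][] x, double[][] w, double eta) {
        int n = x.length;
        double[][] yTanh = new double[n][2];
        for (int t = 0; t < n; t++)
            for (int j = 0; j < 2; j++)
                yTanh[t][j] = Math.tanh(x[t][0] * w[0][j] + x[t][1] * w[1][j]); // Y = tanh(xW)
        // Closed-form 2x2 inverse of W.
        double det = w[0][0] * w[1][1] - w[0][1] * w[1][0];
        double[][] wInv = {{ w[1][1] / det, -w[0][1] / det},
                           {-w[1][0] / det,  w[0][0] / det}};
        // G = W^-1 - (2/N) x^T Y, then W = W + eta * G.
        double[][] wNew = new double[2][2];
        for (int i = 0; i < 2; i++)
            for (int j = 0; j < 2; j++) {
                double xty = 0;
                for (int t = 0; t < n; t++) xty += x[t][i] * yTanh[t][j];
                wNew[i][j] = w[i][j] + eta * (wInv[i][j] - (2.0 / n) * xty);
            }
        return wNew;
    }
}
```

In a full implementation this step repeats until the entropy stops increasing, with multiple restarts from different initial W to avoid local maxima.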
Implementation in Java
Initially we implemented the above iterative ICA algorithm in Java on the CPU only. For this CPU version, we used Jama-1.0.3.jar (Hicklin, Moler, & Webb, 2012). JAMA is a basic linear algebra package for Java that performs matrix operations efficiently; it was developed by Hicklin, Moler, and Webb of MathWorks (Mathworks, 2012). It supports basic matrix operations such as addition, multiplication, transpose, determinant, and inverse. With the help of the JAMA library, it is easy to develop our algorithm on the CPU.
Once the CPU version was implemented successfully, we started implementing the basic matrix functionality on the GPU. We used the JCublas library of the JCUDA Runtime API to implement addition and multiplication of matrices. To implement matrix inversion, we took the sample programs of JCUDA provided in the JCUDA tutorial; this inversion algorithm was developed by Wagner (JCuda, 2012) in a thesis study within the ZAFH-AMSER (JCuda, 2012) project, and we implemented matrix inversion inside the GPU with minimal modification to these kernels. Apart from matrix inversion, our ICA algorithm also needs the matrix determinant: after reducing the lower triangle of the matrix to zero, multiplication of the diagonal elements produces the determinant. This logic is implemented in Java with the help of the above kernels.
After all the basic operations were implemented in Java using the GPU, we began porting our algorithm to the GPU. We also developed additional kernels to compute the entropy and the gradient inside the GPU, which further increases the execution speed. To execute these kernels, we prepared .ptx files for each of them; since manual .ptx preparation requires separately compiling each kernel, we automated the generation of .ptx files from the CUDA kernels.
While developing the Java code, problems occurred in the handling of matrix singularity: a singular matrix has a zero determinant and hence no inverse. To resolve this, we added a singularity check on the matrix before computing its inverse, which fixed the issue.
We also had to solve a problem with large matrix determinant values. We used a double variable, whose highest possible value is 1.79769313486231570e+308, but with a large number of input signals and a high number of iterations (empirically, i > 100), the matrix determinant indeed crossed this double limit and produced Infinity and NaN values. To avoid this issue, we restricted our analysis to fewer than 300 input signals and fewer than 100 iterations.
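A standard alternative to restricting the problem size (not the approach taken in the thesis; `LogDet` is a hypothetical class) is to keep the determinant in the log domain, summing the logs of the diagonal pivots instead of multiplying them:

```java
// Sketch: accumulate log|det| (sum of logs of the diagonal's absolute values)
// instead of the raw product, which stays finite even when |det W| would
// exceed the double range of about 1.8e308.
public class LogDet {
    // Log of |product of diagonal entries| of an already-triangularized matrix.
    static double logAbsDetFromDiagonal(double[] diag) {
        double logDet = 0;
        for (double d : diag) logDet += Math.log(Math.abs(d));
        return logDet;
    }

    public static void main(String[] args) {
        // 400 diagonal entries of 1e2: the raw product 1e800 overflows to
        // Infinity, but 400 * ln(100) is perfectly representable.
        double[] diag = new double[400];
        java.util.Arrays.fill(diag, 100.0);
        double raw = 1.0;
        for (double d : diag) raw *= d;
        System.out.println(Double.isInfinite(raw));      // true
        System.out.println(logAbsDetFromDiagonal(diag)); // finite, about 1842.07
    }
}
```

This fits Infomax naturally, since the entropy expression uses log |det W| rather than the determinant itself.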
Initially we implemented logic to use the GPU only when there were large calculations on matrices, with a separate function for each matrix operation. But to perform any operation on the GPU, the data must be copied from CPU memory (RAM) to GPU device memory, and this is the bottleneck for GPU calculations: copying the matrix from CPU memory to device memory took more time than the entire calculation, and after completing the computations the matrix had to be copied back to CPU memory. This produced very high execution times for the GPU version of the code, even though the GPU calculations themselves are much faster than on the CPU alone.
After many attempts to reduce this bottleneck, we concluded that we should copy the initial matrix variables into the GPU once, execute the code for all iterations inside the GPU, and then copy only the final entropy, gradient, and unmixing matrix back to the CPU. This drastically reduced the memory-copying time between CPU and device. To accomplish this, we modified the kernels and kept the GPU device variable pointers without deleting them until execution completes; memory allocated inside the device persists until it is deleted or the program ends, after which all GPU device allocated memory is cleared automatically.
We then performed extensive experiments on fMRI signals obtained from 577 ventral temporal area voxels as the input signals (Haxby, Gobbini, Furey, & Pietrini, 2001). After the complete implementation of the algorithm inside the GPU along with the above modifications, we compared the execution times of the CPU and GPU by testing them with different numbers of input signals, first fixing the number of iterations to 30. Table 1 shows the CPU versus GPU execution time comparison for 30 iterations; all execution times are specified in milliseconds (ms).
Table 1. CPU and GPU execution time comparison for 30 iterations

Number of Input Signals    CPU (ms)    GPU (ms)
 32                             165         348
 80                             573         422
160                            1700         532
192                            2547         565
208                            3002         639
We plotted a bar chart of the above results, as shown in Figure 14.
Figure 14. CPU and GPU ICA execution comparison bar chart for 30 iterations
We also tested our GPU and CPU based implementations for 50 iterations. The execution times for 50 iterations are given in Table 2.
Table 2. CPU and GPU execution time comparison for 50 iterations

Number of Input Signals    CPU (ms)    GPU (ms)
 32                             309         458
 80                             992         701
160                            2920        1175
192                            4190        1359
208                            4633        1491
240                            6803        1735
Then, we prepared the bar chart for the above results as depicted in Figure 15.
Figure 15. CPU and GPU ICA execution comparison bar chart for 50 iterations
After obtaining the above results, we tested our GPU and CPU code for 100 iterations. The execution times for 100 iterations are shown below.
Table 3. CPU and GPU execution time comparison for 100 iterations

Number of Input Signals    CPU (ms)    GPU (ms)
 32                             571         597
 80                            1923        1076
160                            5486        1934
192                            7658        2347
208                            8892        2553
240                           11565        3014
Then, we prepared the bar chart for the above results in Figure 16.
Figure 16. CPU and GPU ICA execution comparison bar chart for 100 iterations
From the analysis of the above results on the fMRI signals obtained from 577 ventral temporal area voxels, the GPU based implementation produced a 5× speedup in the best case and a 4× speedup on average compared to the CPU based implementation. Especially when the number of desired ICA components is high (>32), our GPU based implementation produces the best results, as expected.
Embedding ICA into WEKA
Once satisfactory results had been produced by the GPU based implementation of the ICA algorithm, we started on our final aim: to embed ICA into the WEKA tool. As discussed earlier, we add ICA as a package to WEKA, so that ICA can be added to the WEKA tool on demand, which is very helpful to the neuroimaging community.
We downloaded the basic package development structure from the WEKA tutorial to develop the ICA package. We started by embedding the CPU based implementation of the ICA algorithm as the initial step. The CPU based implementation requires the
JAMA library to run the basic matrix operations like addition, multiplication, inverse, and transpose, so we added the JAMA jar to the library folder of the new WEKA package development structure.
While embedding the CPU based implementation of the ICA algorithm, we reused the class files developed previously. However, the previous implementation used many classes to separate different modules, whereas WEKA requires a single class file to be added to the package, so that whenever Independent Component Analysis is performed, that class is loaded into memory and invoked by WEKA automatically. To achieve this, we slightly modified the previous implementation by creating a single ICA class with all the required variables and functions, which resolved the problem of having many classes.
Once these modifications were done, we successfully embedded the CPU based implementation of the ICA algorithm into WEKA. This implementation performs Independent Component Analysis with timing similar to the previous Java based implementation. The next step was to embed the GPU based implementation of the ICA algorithm into WEKA.
To the author's knowledge, this is the first GPU based algorithm implementation in WEKA: even though WEKA has many additional packages, none of them has a GPU based implementation. Taking this as a challenge, we started by running basic GPU sample code inside WEKA. We had problems with loading the NVIDIA Runtime API jars and, since the kernel must be invoked from the Java code as specified earlier, we also had problems with the location of the kernel. To solve both problems, we copied the NVIDIA Runtime API jars and the kernels into a specified library path of WEKA, such as c:/weka-lib. To separate the kernels from the libraries, we placed them in a new folder named kernels. With the kernels and the NVIDIA Runtime API available to WEKA, we successfully ran the GPU sample code in WEKA.
Thereafter we started embedding the complete GPU based implementation of ICA into WEKA. As mentioned earlier, we have to write the entire code in one ICA class, and in doing so we had problems with serialization of the class variables. In new package development, all class variables of the class need to be serializable; only then can they be passed through a network. However, variables of the type Pointer in our code are not serializable because they are transient by default. To resolve this, we made the Pointer variables local to their functions; since local variables have narrower scope and lifetime, they are not transmitted through the network. This solved the serialization problem for pointer-type class variables.
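The transient-pointer issue can be illustrated with plain Java serialization (a sketch; `IcaState` and `devicePointer` are hypothetical names standing in for our ICA class and JCUDA's Pointer):

```java
// Sketch of the serialization problem described above: a non-serializable
// field must be excluded from Java serialization, e.g. by marking it
// transient. Serializable data survives the round trip; the transient
// field comes back as null.
import java.io.*;

public class IcaState implements Serializable {
    private static final long serialVersionUID = 1L;

    double[] unmixingMatrix = {1, 0, 0, 1};        // serializable: survives
    transient Object devicePointer = new Object(); // excluded from serialization

    // Serialize to a byte array and back, as would happen when the object
    // is sent over a network.
    static IcaState roundTrip(IcaState s) {
        try {
            ByteArrayOutputStream bos = new ByteArrayOutputStream();
            ObjectOutputStream oos = new ObjectOutputStream(bos);
            oos.writeObject(s);
            oos.flush();
            ObjectInputStream in =
                new ObjectInputStream(new ByteArrayInputStream(bos.toByteArray()));
            return (IcaState) in.readObject();
        } catch (Exception e) {
            throw new RuntimeException(e);
        }
    }

    public static void main(String[] args) {
        IcaState copy = roundTrip(new IcaState());
        System.out.println(copy.unmixingMatrix.length); // 4: data round-tripped
        System.out.println(copy.devicePointer);         // null: transient dropped
    }
}
```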
After resolving the above problems, we successfully ran the GPU based implementation of the ICA algorithm in WEKA. However, our code could run only on small input mixtures compared with the previous Java implementation. This happened because of the 8 MB default heap size for GPU memory allocation provided by WEKA. To solve this issue, we increased the heap size manually by specifying it inside the Java code before initiating any kernels from WEKA. With the increased heap size we were able to create larger matrices inside the GPU, so the GPU based implementation of the ICA algorithm now runs in WEKA for larger input mixtures as well.
We tested with many different sizes of input mixtures. The code produced results successfully for all sizes, but we faced a problem when running Independent Component Analysis many times on the GPU. This happened because of the one-time initialization of the static kernel addresses; to resolve it, we added a manual function call that re-initializes the static kernel addresses on every run. Finally, we resolved all remaining issues and successfully embedded the GPU based implementation of the ICA algorithm into the machine-learning tool WEKA.
We performed ICA on sound signals, each of size 10000 x 1. We mixed the chirp, gong, Handel, and laughter signals shown below in random linear combinations and fed the resulting mixtures to the Independent Component Analysis to test the results.
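A minimal sketch of such a linear mixing step (the matrix layout and seed are illustrative): X = S * A, where each column of the random matrix A mixes all sources into one observed signal.

```java
import java.util.Random;

class MixDemo {
    // S: n samples x m sources; returns X = S * A with A an m x m
    // matrix of random mixing constants.
    static double[][] mix(double[][] S, long seed) {
        int n = S.length, m = S[0].length;
        Random rnd = new Random(seed);
        double[][] A = new double[m][m];
        for (int i = 0; i < m; i++)
            for (int j = 0; j < m; j++)
                A[i][j] = rnd.nextDouble();   // random mixing constant
        double[][] X = new double[n][m];
        for (int i = 0; i < n; i++)
            for (int j = 0; j < m; j++)
                for (int t = 0; t < m; t++)
                    X[i][j] += S[i][t] * A[t][j];
        return X;
    }
}
```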
Figure 17. Example sound signals: (a) Chirp, (b) Gong, (c) Handel, (d) Laughter
The source signals are independent and non-Gaussian. They are linearly mixed using random constants as multipliers. We performed Independent Component Analysis on the mixture signals and monitored the entropy and its gradient: the goal is to maximize the entropy (H) while the magnitude of the entropy gradient (G) shrinks toward zero. When the entropy is maximal, the independence of the extracted signals is maximized, unmixing them. We plotted the entropy and gradient magnitude for all iterations (100 in this case) to observe how they change; the results are shown in Figures 18 and 19.
Figure 18. Entropy (H)
Figure 19. Gradient Ascent (G)
In the entropy plot, the entropy (H) increases at each iteration step, while the gradient magnitude (G), which represents the change in entropy, decreases. This shows that the entropy of the extracted signals is being maximized, making them independent source signals.
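The gradient magnitude G plotted in Figure 19 is the Frobenius norm of the entropy gradient, computed once per iteration; a minimal sketch matching the bookkeeping in the icaCode() listing of Appendix A:

```java
class IcaStats {
    // G = sqrt(sum of squared entries of the gradient matrix g)
    static double gradientNorm(double[][] g) {
        double sum = 0;
        for (double[] row : g)
            for (double v : row)
                sum += v * v;
        return Math.sqrt(sum);
    }
}
```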
After running Independent Component Analysis on the mixture signals, we were able to recover the source signals from the mixtures. The extracted signals are obtained by multiplying the unmixing matrix W with the mixture matrix X: W unmixes X to recover the source signals, since W was optimized to maximize the entropy. To compare the extracted signals with the original sources, we plotted them as shown below.
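A plain-Java sketch of this unmixing step (the package itself does it with Jama via x.times(w).transpose()):

```java
class Unmix {
    // X: n samples x m mixtures, W: m x m unmixing matrix.
    // Returns S = (X * W)^T, one recovered source per row.
    static double[][] unmix(double[][] X, double[][] W) {
        int n = X.length, m = W[0].length, k = W.length;
        double[][] S = new double[m][n];
        for (int i = 0; i < n; i++)
            for (int j = 0; j < m; j++) {
                double sum = 0;
                for (int t = 0; t < k; t++)
                    sum += X[i][t] * W[t][j];
                S[j][i] = sum;   // stored transposed
            }
        return S;
    }
}
```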
Figure 20. Extracted sound signals: (a) Signal-I, (b) Signal-II, (c) Signal-III, (d) Signal-IV
From the signal plots above, extracted source signal-I (Figure 20a) resembles the Chirp source, extracted source signal-II (Figure 20b) resembles the Gong source, extracted source signal-III (Figure 20c) resembles the Handel source, and extracted source signal-IV (Figure 20d) resembles the Laughter source. Although some noise remains in the extracted signals, they are very close to the original source sounds. From these results, we can confirm that our Independent Component Analysis successfully performs blind source separation.
Chapter 5
CONCLUSION AND FUTURE RESEARCH
The objective of this research was to implement Independent Component Analysis on the GPU/CPU and embed it into WEKA, thereby introducing WEKA to the neuro-imaging field. Graphical processing units can significantly increase the computational efficiency of Independent Component Analysis. With CUDA, the data can be processed faster on GPUs; the important point is to choose the GPU-based implementation when the number of input signal mixtures is greater than 32.
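The crossover rule above can be captured in a simple dispatch check (the helper is illustrative, not part of the package):

```java
class Dispatch {
    // Below ~32 mixtures, kernel-launch and transfer overheads outweigh
    // the GPU's throughput advantage in our measurements.
    static final int GPU_THRESHOLD = 32;

    static boolean shouldUseGPU(int numMixtures) {
        return numMixtures > GPU_THRESHOLD;
    }
}
```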
The Independent Component Analysis algorithm implemented in this thesis successfully extracts source signals from an equal number of mixtures. The algorithm is implemented on both CPU and GPU; the contribution of this work lies not in how Independent Component Analysis itself works but in how slow sequential code can be converted into fast parallel code on the GPU for the heavy computations that arise when many signals are mixed. The GPU and CPU implementations produced identical results, and no major problems arose in converting the serial implementation into a parallel one that fits the CUDA model.
Extensive experiments on fMRI signals obtained from 577 ventral temporal voxels showed that the GPU-based implementation speeds up the Independent Component Analysis computation. In general, data transfer between CPU and GPU memory is a bottleneck that slows down computation. To cope with this issue, we optimized the algorithm so that data transfer between CPU and GPU is required only at the beginning and the end. This made our algorithm more efficient and reduced computation time drastically by overcoming the bottleneck. When the number of components is high, i.e., above 32, our GPU implementation is up to 5X faster in the best case and about 4X faster than the CPU-based implementation in the average case.
This software is extended into the widely used data mining and machine learning package WEKA, which enables users to upload any kind of data into WEKA and perform Independent Component Analysis on any kind of input, not only fMRI. This is the first WEKA algorithm implemented on the GPU, opening the door for other algorithms to be implemented on the GPU for faster processing.
This is a novel implementation that does not depend on any obsolete libraries. It is implemented in Java and is therefore platform independent; it can also be embedded easily into advanced medical devices beyond conventional computers.
Future Study
We would like to extend our study to use multiple GPUs when available, in order to increase computation speed further. ICA calculation can also be sped up with more efficient matrix multiplication algorithms. Supporting other widely used ICA implementations, especially FastICA, is on our to-do list.
REFERENCES
About GPGPU.org. (2002). Retrieved from:
http://gpgpu.org/about
A. Hyvärinen and E. Oja, "Independent component analysis: algorithms and applications," Neural Networks, vol. 13, no. 4-5, pp. 411-430, 2000.
An Introduction to Independent Component Analysis: InfoMax and FastICA
algorithms. Retrieved from
http://www.tqmp.org/Content/vol06-1/p031/p031.pdf
C. E. Shannon, "A Mathematical Theory of Communication," The Bell System Technical Journal, vol. 27, July 1948. Retrieved from
http://www3.alcatel-lucent.com/bstj/vol27-1948/articles/bstj27-3-379.pdf
Cocktail party problem Introduction (2007). Retrieved from
http://en.wikipedia.org/wiki/Cocktail_party_effect
CUDA-GPU memory model. Retrieved from
http://geco.mines.edu/tesla/cuda_tutorial_mio/
CUDAICA: GPU Optimization of Infomax-ICA EEG Analysis, by Federico Raimondo, Juan E. Kamienkowski, Mariano Sigman, and Diego Fernandez Slezak (2012). Retrieved from
http://www.hindawi.com/journals/cin/2012/206972/
Kirk, D. B., & Hwu, W.-m. W. (2010). Programming Massively Parallel Processors: A Hands-on Approach. Burlington, MA: Morgan Kaufmann Publishers.
Efficient Independent Component Analysis on GPU, by Rui Ramalho, Pedro Tomas, and Leonel Sousa, Lisbon, Portugal (2010). Retrieved from
http://ieeexplore.ieee.org/xpls/abs_all.jsp?arnumber=5578558&tag=1
Elble, J. M., Sahinidis, N. V., & Vouzis, P. (2010). GPU computing with Kaczmarz's
and other iterative algorithms for linear systems. Parallel Computing archive, 215-231.
General Purpose Computation on Graphics Hardware by Mark Harris (2002). Retrieved from
http://gpgpu.org/about
Graphics Processing Unit (GPU). (n.d.). Retrieved from
http://www.nvidia.com/object/gpu.html
Green, S. (2008, December 2008). NVIDIA CUDA FAQ version 3.1. Retrieved on
March 10, 2014, from http://forums.nvidia.com/index.php?showtopic=84440
G. D. Brown, S. Yamada, and T. J. Sejnowski, “Independent component analysis at the neural cocktail party,” Trends in Neurosciences, vol. 24, no. 1, pp. 54–63, 2001.
Haxby, J., Gobbini, M., Furey, M., Ishai, A., Schouten, J., and Pietrini, P. (2001), Distributed and overlapping representations of faces and objects in ventral temporal cortex. Science 293, 2425–2430.
Herault, Jutten and B. Ans, 1984. Circuits neuronaux a synapses modifiables: Decodage de messages composites par apprentissage non supervise. Comptes Rendus de l’Academie des Sciences, 299(III-13), 525–528.
Independent Component Analysis for Dummies. Retrieved from
http://sccn.ucsd.edu/~arno/indexica.html
Independent Component Analysis: A Tutorial Introduction book by James V. Stone (2004),
MIT Press
JCublas Application Program Interface Documentation. Retrieved from
http://www.jcuda.org/jcuda/jcublas/doc/index.html
JCUDA programming guide. Retrieved from
http://www.jcuda.org/tutorial/TutorialIndex.html
JCUDA Application Program Interface. Retrieved from
http://www.jcuda.org/jcuda/doc/
Joe Hicklin, Cleve Moler, and Peter Webb from MathWorks (2012): JAMA Initial Design. Retrieved from
http://math.nist.gov/Javanumerics/jama/
K. E. Hild II, D. Erdogmus, and J. C. Principe, "An analysis of entropy estimators for blind source separation," Signal Process., vol. 86, no. 1, pp. 182-194, 2006.
Mark Hall, Eibe Frank, Geoffrey Holmes, Bernhard Pfahringer, Peter Reutemann, Ian H. Witten (2009); The WEKA Data Mining Software: An Update; SIGKDD Explorations, Volume 11, Issue 1.
Matrix inversion in GPU, JCUDA samples from NVIDIA. Retrieved from
http://www.jcuda.org/samples/samples.html
Nickolls, J. (2007). NVIDIA GPU parallel computing architecture. IEEE Hot Chips 19.
Nifti library. Retrieved from
http://niftilib.sourceforge.net
NVIDIA. (2007). CUDA C Programming Guide Version 1.0. Retrieved from
http://www.cs.berkeley.edu/~yelick/cs194f07/handouts/NVIDIA_CUDA_Programming_Guide.pdf
NVIDIA. (2007). The CUDA Compiler Driver NVCC. Retrieved on March 10, 2014, from http://sbel.wisc.edu/Courses/ME964/2008/Documents/nvccCompilerInfo.pdf
Sanders, J., & Kandrot, E. (2010). CUDA by Example: An Introduction to General Purpose GPU Programming. Boston, Mass.: Addison-Wesley.
S. Amari, "A new learning algorithm for blind signal separation," in Advances in Neural Information Processing Systems, 1996.
S. Makeig, A. J. Bell, T. P. Jung, and T. J. Sejnowski, “Independent component analysis of electroencephalographic data,” in Advances in Neural Information Processing Systems, pp. 145–151, MIT Press, Cambridge, Mass, USA, 1996.
WEKA Documentation. Retrieved from
http://www.cs.waikato.ac.nz/ml/weka/documentation.html
WEKA software and wiki. Retrieved from
http://weka.wikispaces.com/
WEKA package manager information. Retrieved from
http://weka.wikispaces.com/How+do+I+use+the+package+manager%3F
APPENDICES
APPENDIX A
JAVA CODE FOR WEKA PACKAGE: ICA IN CPU AND GPU
/**
* ICA.java
* @author: Harish Ankam
* @date: 1 January, 2014
*
*/
package weka.attributeSelection;
import weka.core.Capabilities;
import weka.core.Instances;
import weka.core.Option;
import weka.core.OptionHandler;
import weka.core.RevisionUtils;
import weka.core.Utils;
import java.text.DecimalFormat;
import java.util.Enumeration;
import java.util.Vector;
import static jcuda.runtime.JCuda.cudaMalloc;
import static jcuda.runtime.JCuda.cudaMemcpy;
import static jcuda.runtime.JCuda.cudaMemset;
import static jcuda.runtime.JCuda.cudaThreadSynchronize;
import static jcuda.runtime.JCuda.cudaDeviceSynchronize;
import static jcuda.runtime.cudaMemcpyKind.cudaMemcpyDeviceToHost;
import static jcuda.runtime.cudaMemcpyKind.cudaMemcpyDeviceToDevice;
import static jcuda.runtime.cudaMemcpyKind.cudaMemcpyHostToDevice;
import java.io.*;
import java.util.Scanner;
import java.util.StringTokenizer;
import jcuda.Pointer;
import jcuda.Sizeof;
import jcuda.driver.JCudaDriver;
import jcuda.jcublas.JCublas;
import jcuda.runtime.JCuda;
import jcuda.runtime.dim3;
import jcuda.utils.KernelLauncher;
//import weka.core.matrix.Matrix;
import Jama.Matrix;
public class ICA extends ASEvaluation implements AttributeEvaluator, OptionHandler
{
private int m_numComponents;
private int m_maxIteration;
private double m_eta;
private boolean m_useGPU;
private double[] m_stdDeviation;
public double[] hs;
public double[] gs;
private int executionTime;
// ICAStart ic = new ICAStart();
boolean useGPU=false;
int M; //M = number of source signals and signal mixtures.
int N; // N = number of data points per signal.
int maxiter;
float eta; // [0.25] Step size for gradient ascent.
Matrix w,x,y;
Matrix Y,g;
double[][] wArray,xArray,yArray,YArray,gArray;
float[] xf,yf,wf,xTf,invwT,wT,Yf,detW;
double[] detWValue = new double[1];
int rX,cX,rY,cY,rW,cW;
// The block size that is used in the kernels
public static final int BLOCKSIZE = 4;
// The kernel launchers for the individual kernel functions
private static KernelLauncher adjustRowL;
private static KernelLauncher adjustRowU;
private static KernelLauncher eliminateBlockL;
private static KernelLauncher eliminateBlockU;
private static KernelLauncher eliminateColL;
private static KernelLauncher eliminateColU;
private static KernelLauncher eliminateRestL;
private static KernelLauncher eliminateRestU;
private static KernelLauncher normalizeDiag;
private static KernelLauncher setIdentity;
private static KernelLauncher tanhy;
private static KernelLauncher calDet;
private static KernelLauncher calTransform;
private static KernelLauncher calSumWg;
private static KernelLauncher calGS;
private static KernelLauncher calHS;
public ICA()
{
resetOptions();
}
private void resetOptions()
{
m_numComponents = 2;
m_maxIteration = 10;
m_eta = 0.25;
m_useGPU = false;
}
@Override
public Enumeration listOptions ()
{
Vector newVector = new Vector(6);
newVector.addElement(new Option("\tNumber of components:", "C", 1, "-C <num>"));
newVector.addElement(new Option("\tMax Iteration:", "I", 1, "-I <num>"));
newVector.addElement(new Option("\tEta:", "L", 1, "-L <num>"));
newVector.addElement(new Option("\tUseGPU:", "V", 0, "-V"));
return newVector.elements();
}
@Override
public void setOptions (String[] options) throws Exception
{
resetOptions();
// Utils.getOptionPos returns the option's position, not its value;
// parse the option values instead.
String numComponents = Utils.getOption('C', options);
if (numComponents.length() != 0) setNumComponents(Integer.parseInt(numComponents));
String maxIteration = Utils.getOption('I', options);
if (maxIteration.length() != 0) setMaxIteration(Integer.parseInt(maxIteration));
String eta = Utils.getOption('L', options);
if (eta.length() != 0) setEta(Double.parseDouble(eta));
setUseGPU(Utils.getFlag('V', options));
}
@Override
public String[] getOptions ()
{
String[] options = new String[7];
int current = 0;
options[current++] = "-C";
options[current++] = "" + m_numComponents;
options[current++] = "-I";
options[current++] = "" + m_maxIteration;
options[current++] = "-L";
options[current++] = "" + m_eta;
if(getUseGPU())
{
options[current++] = "-V";
}
while (current < options.length)
{
options[current++] = "";
}
return options;
}
//A placeholder to be replaced.
public String numComponentsTipText()
{
return "Number of Components Tip Text.";
}
public void setNumComponents(int num)
{
m_numComponents = num;
}
public int getNumComponents()
{
return m_numComponents;
}
//A placeholder to be replaced.
public String maxIterationTipText()
{
return "Max Iteration Tip Text";
}
public void setMaxIteration(int num)
{
m_maxIteration = num;
}
public int getMaxIteration()
{
return m_maxIteration;
}
//A placeholder to be replaced.
public String etaTipText()
{
return "Learning Rate Tip Text";
}
public void setEta(double num)
{
m_eta = num;
}
public double getEta()
{
return m_eta;
}
//A placeholder to be replaced.
public String useGPUTipText()
{
return "UseGPU Tip Text";
}
public void setUseGPU(boolean b)
{
m_useGPU = b;
}
public boolean getUseGPU()
{
return m_useGPU;
}
@Override
public Capabilities getCapabilities()
{
Capabilities result = super.getCapabilities();
result.disableAll();
// attributes
result.enable(Capabilities.Capability.NOMINAL_ATTRIBUTES);
result.enable(Capabilities.Capability.NUMERIC_ATTRIBUTES);
result.enable(Capabilities.Capability.DATE_ATTRIBUTES);
result.enable(Capabilities.Capability.MISSING_VALUES);
// class
result.enable(Capabilities.Capability.NOMINAL_CLASS);
result.enable(Capabilities.Capability.MISSING_CLASS_VALUES);
return result;
}
@Override
public void buildEvaluator (Instances data) throws Exception
{
// can evaluator handle data?
getCapabilities().testWithFail(data);
int classIndex = data.classIndex();
int numInstances = data.numInstances();
int numAttributes = data.numAttributes();
int n = data.attributeToDoubleArray(0).length;
double[][] input = new double[numAttributes][n];
//get standard deviation
m_stdDeviation = new double[numAttributes];
for(int i = 0; i < numAttributes; i++)
{
double[] current = data.attributeToDoubleArray(i);
double total = 0;
for(int j = 0; j < current.length; j++)
{
total += current[j];
}
double average = total / current.length;
total = 0;
for(int j = 0; j < current.length; j++)
{
total += Math.pow(current[j]-average, 2);
}
m_stdDeviation[i] = Math.sqrt(total / current.length);
input[i]= current;
}
this.rY = numAttributes;
this.rX = data.attributeToDoubleArray(0).length;
//Perform Independent Component Analysis
long startTime = System.currentTimeMillis();
startICA(this.m_numComponents, n, input,this.m_useGPU, this.m_maxIteration, this.m_eta);
long endTime = System.currentTimeMillis();
this.executionTime = (int)(endTime-startTime) +1;
}
@Override
public double evaluateAttribute (int attribute) throws Exception
{
if(m_stdDeviation[attribute] > m_eta)
{
return m_stdDeviation[attribute];
}
else
{
return 0;
}
}
//A placeholder to be replaced.
@Override
public String toString ()
{
StringBuffer text = new StringBuffer();
text.append("Performing Independent Component Analysis...");
text.append("\n");
text.append("ICA Execution Time:"+this.executionTime+"ms");
text.append("\nGS:");
text.append("\n");
for(int i=0;i<gs.length;i++)
text.append(new DecimalFormat("####.###").format(gs[i])+"\t");
text.append("\n");
text.append("HS:");
text.append("\n");
for(int i=0;i<hs.length;i++)
text.append(new DecimalFormat("####.###").format(hs[i])+"\t");
text.append("\n");
// text.append("\n:X"+x.getRowDimension()+"\t"+this.x.getColumnDimension()+"\nW:"+this.w.getRowDimension()+"\t"+w.getColumnDimension());
// text.append("\nInput size:"+this.rX+"\t"+this.cX);
// text.append("\n 6 1 element"+this.xArray[6][1]);
text.append("UnMixing Matrix:\n");
for(int i=0;i<wArray.length;i++)
{
for(int j=0;j<wArray[0].length;j++)
{
text.append(new DecimalFormat("####.###").format(wArray[i][j])+"\t");
}
text.append("\n");
}
double[][] sArray;
w = new Matrix(wArray);
sArray =x.times(w).transpose().getArray();
text.append("Source signals Matrix:\n");
for(int i=0;i<sArray.length;i++)
{
for(int j=0;j<sArray[0].length;j++)
{
text.append(new DecimalFormat("####.###").format(sArray[i][j])+"\t");
}
text.append("\n");
}
return text.toString();
}
@Override
public String getRevision()
{
return RevisionUtils.extract("$Revision: $");
}
/**
* runs the evaluator with the given commandline options
*
* @param evaluator the evaluator to run
* @param options the commandline options
*/
public static void runEvaluator(ASEvaluation evaluator, String[] options) {
try {
System.out.println(
AttributeSelection.SelectAttributes(evaluator, options));
}
catch (Exception e) {
String msg = e.toString().toLowerCase();
if ( (msg.indexOf("help requested") == -1)
&& (msg.indexOf("no training file given") == -1) )
e.printStackTrace();
System.err.println(e.getMessage());
}
}
public double[] getHs() {
return hs;
}
public void setHs(double[] hs) {
this.hs = hs;
}
public double[] getGs() {
return gs;
}
public void setGs(double[] gs) {
this.gs = gs;
}
public void icaCode()
{
w = Matrix.identity(M,M);
wArray =w.getArray();
y=x.times(w);
yArray=y.getArray();
hs = new double[maxiter];
gs = new double[maxiter];
for(int iter=1;iter<=maxiter;iter++)
{
y=x.times(w);
YArray = new double[N][M];
for(int i=0;i<yArray.length;i++)
{
for(int j=0;j<yArray[0].length;j++)
{
YArray[i][j] = Math.tanh(yArray[i][j]);
}
}
Y = new Matrix(YArray);
double detW=0;
try{
detW = Math.abs(w.det());
}catch(Exception e)
{ System.out.println(e.getMessage());}
double sum=0;
for(int i=0;i<YArray.length;i++)
for(int j=0;j<YArray[0].length;j++)
sum+=YArray[i][j];
double h = (sum/N ) + 0.5 * Math.log(Math.abs(detW));
if(!useGPU)
{
long time1 = System.currentTimeMillis();
g = (w.transpose()).inverse();
}
g = g.minus(x.transpose().times(Y).times(2/(double)N));
// Update W to increase h ...
w = w.plus(g.times(eta));
//Record h and magnitude of gradient ...
hs[iter-1]=h;
double[][] gArray = g.getArray();
double norm=0;
for(int i=0;i<gArray.length;i++)
{
for(int j=0;j<gArray[0].length;j++)
{
norm += gArray[i][j] * gArray[i][j];
}
}
norm = Math.sqrt(norm);
gs[iter-1]= norm;
}
wArray = w.getArray();
}
public double[][] generateIdentityMatrix(int M)
{
double[][] ident = new double[M][M];
for(int i=0;i<M;i++)
{
for(int j=0;j<M;j++)
{
if(i==j)
ident[i][j]=1;
else
ident[i][j]=0;
}
}
return ident;
}
public double[][] matrixMultiply(double[][] a,int rA,int cA, double[][] b,int cB)
{
int rB = cA;
int rC = rA;
int cC = cB;
float[] aLinear = TwoDtoOneD(a,rA,cA);
float[] bLinear = TwoDtoOneD(b,rB,cB);
float[] cLinear = TwoDtoOneD(generateIdentityMatrix(rC),rC,cC);
Pointer dA = new Pointer();
Pointer dB = new Pointer();
Pointer dC = new Pointer();
JCublas.cublasInit();
JCublas.cublasAlloc(rA*cA, Sizeof.FLOAT, dA);
JCublas.cublasAlloc(rB*cB, Sizeof.FLOAT, dB);
JCublas.cublasAlloc(rC*cC, Sizeof.FLOAT, dC);
JCublas.cublasSetVector(rA*cA, Sizeof.FLOAT, Pointer.to(aLinear), 1, dA, 1);
JCublas.cublasSetVector(rB*cB, Sizeof.FLOAT, Pointer.to(bLinear), 1, dB, 1);
JCublas.cublasSetVector(rC*cC, Sizeof.FLOAT, Pointer.to(cLinear), 1, dC, 1);
// C = 1 * A * B + 0 * C,
JCublas.cublasSgemm('n', 'n', rA, cB, cA, 1, dA, rA, dB, rB, 0, dC, rC);
JCublas.cublasGetVector(rC*cC, Sizeof.FLOAT, dC, 1, Pointer.to(cLinear), 1);
JCublas.cublasShutdown();
return oneDtoTwoD(cLinear,rC,cC);
}
public double[][] matrixScalarMultiply(double[][] A,double x)
{
int rows= A.length;
int cols= A[0].length;
double[][] res = new double[rows][cols];
for(int i=0;i<rows;i++)
for(int j=0;j<cols;j++)
res[i][j]= x * A[i][j];
return res;
}
public double[][] matrixAdd(double[][] A, double[][] B)
{
int rowsA= A.length;
int colsA= A[0].length;
int rowsB = B.length;
int colsB = B[0].length;
double[][] res = new double[rowsA][colsA];
if(rowsA==rowsB && colsA==colsB)
{
for(int i=0;i<rowsA;i++)
for(int j=0;j<colsA;j++)
res[i][j]=A[i][j]+B[i][j];
}
else
System.out.println("addition not possible");
return res;
}
public double[][] matrixSub(double[][] A, double[][] B)
{
int rowsA= A.length;
int colsA= A[0].length;
int rowsB = B.length;
int colsB = B[0].length;
double[][] res = new double[rowsA][colsA];
if(rowsA==rowsB && colsA==colsB)
{
for(int i=0;i<rowsA;i++)
for(int j=0;j<colsA;j++)
res[i][j]=A[i][j]-B[i][j];
}
else
System.out.println("Subtraction not possible");
return res;
}
public void icaCodeWithGPU()
{
//wArray =generateIdentityMatrix(M);
//yArray = matrixMultiply(xArray,xArray.length,xArray[0].length, wArray,wArray[0].length);
hs = new double[maxiter];
gs = new double[maxiter];
performICA();
}
public double[][] matTranspose(double[][] a)
{
int rows=a.length;
int cols=a[0].length;
double[][] trans = new double[cols][rows];
for(int i=0;i<rows;i++)
{
for(int j=0;j<cols;j++)
{
trans[j][i] = a[i][j];
}
}
return trans;
}
public float[] TwoDtoOneD(double[][] a,int rA,int cA)
{
//System.out.println("rA:"+rA);
//System.out.println("cA:"+cA);
float[] res = new float[rA * cA];
for(int i=0;i<rA;i++)
{
for(int j=0;j<cA;j++)
{
res[(i*cA)+j]= (float)a[i][j];
}
}
return res;
}
public double[][] oneDtoTwoD(float[] a, int rowCount, int colCount)
{
double[][] res = new double[rowCount][colCount];
for(int i=0;i<rowCount;i++)
{
for(int j=0;j<colCount;j++)
{
res[i][j]=a[(i*colCount)+j];
}
}
return res;
}
private static void init()
{
if (adjustRowL != null)
{
// Intentionally no early return: the static kernel addresses must be
// re-initialized on every ICA run.
//return;
}
JCudaDriver.setExceptionsEnabled(true);
JCuda.setExceptionsEnabled(true);
String kernelsPath = "c:/lib/kernels/";
String args =
"-D BLOCKSIZE=" + BLOCKSIZE + " " +
"-D BLOCKSIZEMINUS1=" + (BLOCKSIZE - 1) + " " +
"-D AVOIDBANKCONFLICTS=" + 0 + " ";
// System.out.println("Loading kernels from GPUadjustRow_kernel.cu");
adjustRowL = KernelLauncher.create(
kernelsPath + "GPUadjustRow_kernel.cu",
"adjustRowL_kernel", args);
adjustRowU = adjustRowL.forFunction(
"adjustRowU_kernel");
// System.out.println("Loading kernels from GPUeliminateBlock_kernel.cu");
eliminateBlockL = KernelLauncher.create(
kernelsPath + "GPUeliminateBlock_kernel.cu",
"eliminateBlockL_kernel", args);
eliminateBlockU = eliminateBlockL.forFunction(
"eliminateBlockU_kernel");
// System.out.println("Loading kernels from GPUeliminateCol_kernel.cu");
eliminateColL = KernelLauncher.create(
kernelsPath + "GPUeliminateCol_kernel.cu",
"eliminateColL_kernel", args);
eliminateColU = eliminateColL.forFunction(
"eliminateColU_kernel");
// System.out.println("Loading kernels from GPUeliminateRest_kernel.cu");
eliminateRestL = KernelLauncher.create(
kernelsPath + "GPUeliminateRest_kernel.cu",
"eliminateRestL_kernel", args);
eliminateRestU = eliminateRestL.forFunction(
"eliminateRestU_kernel");
// System.out.println("Loading kernels from GPUnormalizeDiag_kernel.cu");
normalizeDiag = KernelLauncher.create(
kernelsPath + "GPUnormalizeDiag_kernel.cu",
"normalizeDiag_kernel", args);
// System.out.println("Loading kernels from GPUsetIdentity_kernel.cu");
setIdentity = KernelLauncher.create(
kernelsPath + "GPUsetIdentity_kernel.cu",
"GPUsetIdentity", args);
// System.out.println("Loading kernels from GPUtanhofy_kernel.cu");
tanhy = KernelLauncher.create(
kernelsPath + "GPUtanhofy_kernel.cu",
"tanhY", args);
// System.out.println("Loading kernels from GPUcalcDet_kernel.cu");
calDet = KernelLauncher.create(
kernelsPath + "GPUcalcDet_kernel.cu",
"calcDet", args);
calTransform = KernelLauncher.create(
kernelsPath + "GPUcalTransform_kernel.cu",
"calcTranform",args);
calSumWg = KernelLauncher.create(
kernelsPath + "GPUcalSumWPlusetaXg_kernel.cu",
"calcWetaXg",args);
calGS = KernelLauncher.create(
kernelsPath + "GPUcalGSg_kernel.cu",
"calcNormg",args);
calHS = KernelLauncher.create(
kernelsPath + "GPUcalHS_kernel.cu",
"calcHSwithdetWY",args);
}
/**
* Returns a pointer which is computed from the given pointer by
* adding the given number of float elements to its address.
*
* @param p The input pointer
* @param floatOffset The offset in number of float elements
* @return The resulting pointer
*/
private static Pointer at(Pointer p, int floatOffset)
{
return p.withByteOffset(floatOffset * Sizeof.FLOAT);
}
public void performICA()
{
Pointer dx = new Pointer();
Pointer dy = new Pointer();
Pointer dw = new Pointer();
Pointer dwcopy = new Pointer();
Pointer dh = new Pointer();
Pointer dg = new Pointer();
Pointer dDetW = new Pointer();
Pointer dwT = new Pointer();
Pointer dinvWT = new Pointer();
Pointer dxT = new Pointer();
Pointer dY = new Pointer();
Pointer dDetWValue = new Pointer();
Pointer dgs = new Pointer();
Pointer dhs = new Pointer();
// if (adjustRowL == null)
// {
init();
//}
//jcuda.runtime.JCuda.cudaDeviceSetLimit(jcuda.runtime.cudaLimit.cudaLimitMallocHeapSize, 512*1024*1024);
JCuda.cudaDeviceReset();
JCublas.cublasInit();
JCublas.cublasAlloc(rX*cX, Sizeof.FLOAT, dx);
JCublas.cublasAlloc(rW*cW, Sizeof.FLOAT, dw);
JCublas.cublasAlloc(rY*cY, Sizeof.FLOAT, dy);
JCublas.cublasSetVector(rX*cX, Sizeof.FLOAT, Pointer.to(xf), 1, dx, 1);
JCublas.cublasSetVector(rW*cW, Sizeof.FLOAT, Pointer.to(wf), 1, dw, 1);
JCublas.cublasSetVector(rY*cY, Sizeof.FLOAT, Pointer.to(yf), 1, dy, 1);
int sizeWInBytes = rW * cW * Sizeof.FLOAT;
// Allocating memory for the identity matrix
cudaMalloc(dY,rY*cY*Sizeof.FLOAT);
cudaMalloc(dDetW, sizeWInBytes);
cudaMalloc(dDetWValue, Sizeof.DOUBLE);
cudaMalloc(dwT,sizeWInBytes);
cudaMalloc(dwcopy,sizeWInBytes);
cudaMalloc(dinvWT,sizeWInBytes);
cudaMalloc(dxT,rX*cX*Sizeof.FLOAT);
cudaMalloc(dgs,maxiter * Sizeof.DOUBLE);
cudaMalloc(dhs,maxiter * Sizeof.DOUBLE);
// cudaMalloc(dxy, size2InBytes);
cudaMemcpy(dxT, Pointer.to(xTf),
rX*cX*Sizeof.FLOAT, cudaMemcpyHostToDevice);
cudaMemcpy(dgs,Pointer.to(gs),
maxiter*Sizeof.DOUBLE,cudaMemcpyHostToDevice);
cudaMemcpy(dhs,Pointer.to(hs),
maxiter*Sizeof.DOUBLE,cudaMemcpyHostToDevice);
// Used SP/MP for calculations
dim3 one = new dim3(1, 1, 1);
dim3 idyThreads = new dim3(BLOCKSIZE, 1, 1);
dim3 idyBlocks = new dim3(rW / BLOCKSIZE, 1, 1);
dim3 nThreads = new dim3(BLOCKSIZE, BLOCKSIZE, 1);
dim3 nBlocks = new dim3(rW / BLOCKSIZE, 1, 1);
dim3 nBlocksRest = new dim3(rW / BLOCKSIZE, rW / BLOCKSIZE, 1);
dim3 yBlocks = new dim3(rY,1,1);
dim3 yThreads = new dim3(cY,1,1);
dim3 wBlocks = new dim3(rW,1,1);
dim3 wThreads = new dim3(cW,1,1);
setIdentity.setup(idyBlocks, idyThreads).call(dw, rW);
cudaDeviceSynchronize();
for(int iter=1;iter<=maxiter;iter++)
{
setIdentity.setup(idyBlocks, idyThreads).call(dinvWT, rW);
cudaThreadSynchronize();
//Performing yf = xf * wf
// C = 1 * A * B + 0 * C
JCublas.cublasSgemm('n', 'n', rX, cW, cX, 1, dx, rX, dw, rW, 0, dy, rY);
// JCublas.cublasGetVector(rY*cY, Sizeof.FLOAT, dy, 1, Pointer.to(yf), 1);
//code for Yf = tanh(yf);
tanhy.setup(yBlocks, yThreads).call(
dy,dY,rY,cY);
cudaThreadSynchronize();
// Transfer the matrix from device to device
cudaMemcpy(dDetW, dw,
sizeWInBytes, cudaMemcpyDeviceToDevice);
cudaMemcpy(dwcopy, dw,
sizeWInBytes, cudaMemcpyDeviceToDevice);
// Calculate the left diagonal Matrix (L)
for (int i = 0; i < rW; i += BLOCKSIZE)
{
int offset = i * rW + i;
// Step 1:
// Calculate the triangle matrix
// Store the pivot elements to left part of the triangle
// Original call:
// eliminateBlockL_kernel <<< 1, nThreads >>>
// (dDataIn + offset, size);
//--------code to copy dw to dDetW
eliminateBlockL.setup(one, nThreads).call(
at(dw, offset), rW);
cudaThreadSynchronize();
// Step 2:
// Calculate the rest of the rows with the pivot
// elements from step 1
// Original call:
// adjustRowL_kernel <<< nBlocks, nThreads >>>
// (dDataIn + i * size, dDataIn + offset,
// dDataInv + i * size, size, i);
adjustRowL.setup(nBlocks, nThreads).call(
at(dw, i * rW), at(dw, offset),
at(dDetW, i * rW), rW, i);
cudaThreadSynchronize();
// Step 3:
// Fill the columns below the block with the pivot elements. They
// are used to get the columns to zero and multiply with the row
// Original call:
// eliminateColL_kernel <<< nBlocks, nThreads >>>
// (dDataIn + i, size, i);
eliminateColL.setup(nBlocks, nThreads).call(
at(dw, i), rW, i);
cudaThreadSynchronize();
// Step 4:
// Adjust the rest of the Matrix with the calculated pivot
// elements: El_new_0 -= (p0+p1+p2..+p15) * El_piv_0
// Original call:
// eliminateRestL_kernel <<< nBlocksRest, nThreads >>>
// (dDataIn, dDataInv, size, i);
eliminateRestL.setup(nBlocksRest, nThreads).call(
dw, dDetW, rW, i);
cudaThreadSynchronize();
}
//calculate det inside GPU from multiplying dDetW elements
calDet.setup(one, one).call(
dDetW,rW,dDetWValue);
cudaThreadSynchronize();
cudaMemcpy(dw, dwcopy,
sizeWInBytes, cudaMemcpyDeviceToDevice);
//--------calculate sum of Yf elements and then calculate hs
calHS.setup(one, one).call(
dY,dhs,iter-1,rY,cY,rX,dDetWValue);
cudaThreadSynchronize();
//---------calculate wT inside GPU.
calTransform.setup(wBlocks, wThreads).call(
dw,dwT,rW);
cudaThreadSynchronize();
//code for inverse of wT
// Calculate the left diagonal Matrix (L)
for (int i = 0; i < rW; i += BLOCKSIZE )
{
int offset = i * rW + i;
// Step 1:
// Calculate the triangle matrix
// Store the pivot elements to left part of the triangle
// Original call:
// eliminateBlockL_kernel <<< 1, nThreads >>>
// (dDataIn + offset, size);
eliminateBlockL.setup(one, nThreads).call(
at(dwT, offset), rW);
cudaThreadSynchronize();
// Step 2:
// Calculate the rest of the rows with the pivot
// elements from step 1
// Original call:
// adjustRowL_kernel <<< nBlocks, nThreads >>>
// (dDataIn + i * size, dDataIn + offset,
// dDataInv + i * size, size, i);
adjustRowL.setup(nBlocks, nThreads).call(
at(dwT, i * rW), at(dwT, offset),
at(dinvWT, i * rW), rW, i);
cudaThreadSynchronize();
// Step 3:
// Fill the columns below the block with the pivot elements. They
// are used to get the columns to zero and multiply with the row
// Original call:
// eliminateColL_kernel <<< nBlocks, nThreads >>>
// (dDataIn + i, size, i);
eliminateColL.setup(nBlocks, nThreads).call(
at(dwT, i), rW, i);
cudaThreadSynchronize();
// Step 4:
// Adjust the rest of the Matrix with the calculated pivot
// elements: El_new_0 -= (p0+p1+p2..+p15) * El_piv_0
// Original call:
// eliminateRestL_kernel <<< nBlocksRest, nThreads >>>
// (dDataIn, dDataInv, size, i);
eliminateRestL.setup(nBlocksRest, nThreads).call(
dwT, dinvWT, rW, i);
cudaThreadSynchronize();
}
// Zero out the entries below the diagonal (the stored L part)
for (int i = 1; i < rW; i++)
{
int offset = i * rW;
cudaMemset(at(dwT, offset), 0, i * Sizeof.FLOAT);
}
cudaThreadSynchronize();
// Calculate the right diagonal Matrix (U)
int size = cW;
for (int i = (size - BLOCKSIZE); i >= 0; i -= BLOCKSIZE)
{
int offset = i * size + i;
// Step 1:
// Calculate the triangle matrix
// Store the pivot elements to right part of the triangle
// Original call:
// eliminateBlockU_kernel <<< 1, nThreads >>>
// (dDataIn + offset, size);
eliminateBlockU.setup(one, nThreads).call(
at(dwT, offset), size);
cudaThreadSynchronize();
// Step 2:
// calculate the rest of the rows with the pivot
// elements from step 1
// Original call:
// adjustRowU_kernel <<< nBlocks, nThreads >>>
// (dDataIn + offset, dDataInv + i * size, size, i);
adjustRowU.setup(nBlocks, nThreads).call(
at(dwT, offset), at(dinvWT, i * size), size, i);
cudaThreadSynchronize();
// Step 3:
// Fill the columns below the block with the pivot elements. They
// are used to get the columns to zero and multiply with the row
// Original call:
// eliminateColU_kernel <<< nBlocks, nThreads >>>
// (dDataIn + i, size, i);
eliminateColU.setup(nBlocks, nThreads).call(
at(dwT, i), size, i);
cudaThreadSynchronize();
// Step 4:
// Adjust the rest of the Matrix with the calculated pivot
// elements: El_new_0 -= (p0+p1+p2..+p15) * El_piv_0
// Original call:
// eliminateRestU_kernel <<< nBlocksRest, nThreads >>>
// (dDataIn, dDataInv, size, i);
eliminateRestU.setup(nBlocksRest, nThreads).call(
dwT, dinvWT, size, i);
cudaThreadSynchronize();
}
// Force the diagonal entries to 1
for (int i = 0; i < rW; i += BLOCKSIZE)
{
int rowOffset = i * rW;
// Original call:
// normalizeDiag_kernel <<< nBlocks, nThreads >>>
// (dDataIn + rowOffset, dDataInv + rowOffset, size, i);
normalizeDiag.setup(nBlocks, nThreads).call(
at(dwT, rowOffset), at(dinvWT, rowOffset), rW, i);
cudaThreadSynchronize();
}
//--------code to perform dinvWT = dinvWT - (2/N) * xT * Y, where N = rX
// cublasSgemm computes C = alpha * A * B + beta * C (column-major);
// here alpha = -2/N and beta = 1, so the product accumulates into dinvWT
JCublas.cublasSgemm('n', 'n', rW, cW, rX, (-2 /(float) rX), dxT, cX, dY, rY, 1, dinvWT, rW);
//---------code to perform gs = norm(dinvWT)
//g=invWT
calGS.setup(one, one).call(
dinvWT,dgs,iter-1,rW);
cudaThreadSynchronize();
//-----------Code to perform wf = wf+ eta * g
// g = invWT
calSumWg.setup(wBlocks, wThreads).call(
dw,dinvWT,eta,rW);
cudaThreadSynchronize();
}
cudaMemcpy(Pointer.to(gs),dgs,
maxiter*Sizeof.DOUBLE,cudaMemcpyDeviceToHost);
cudaMemcpy(Pointer.to(hs),dhs,
maxiter*Sizeof.DOUBLE,cudaMemcpyDeviceToHost);
cudaMemcpy(Pointer.to(wf),dw,sizeWInBytes,cudaMemcpyDeviceToHost);
wArray = oneDtoTwoD(wf,rW,cW);
JCublas.cublasShutdown();
JCuda.cudaFree(dx);
JCuda.cudaFree(dy);
JCuda.cudaFree(dw);
JCuda.cudaFree(dwcopy);
JCuda.cudaFree(dh);
JCuda.cudaFree(dg);
JCuda.cudaFree(dDetW);
JCuda.cudaFree(dwT);
JCuda.cudaFree(dinvWT);
JCuda.cudaFree(dxT);
JCuda.cudaFree(dY);
JCuda.cudaFree(dDetWValue);
JCuda.cudaFree(dgs);
JCuda.cudaFree(dhs);
JCuda.cudaDeviceReset();
}
public void startICA(int m, int n, double[][] input, boolean useG, int iter, double etaVal)
{
    this.M = m;
    this.N = n;
    this.xArray = new double[n][m];
    for (int i = 0; i < n; i++)
        for (int j = 0; j < m; j++)
            this.xArray[i][j] = input[j][i];
    this.x = new Matrix(this.xArray);
    this.useGPU = useG;
    this.maxiter = iter;
    this.eta = (float) etaVal;
    rX = N;
    cX = M;
    rY = N;
    cY = M;
    rW = M;
    cW = M;
    xf = TwoDtoOneD(xArray, rX, cX);
    yf = TwoDtoOneD(xArray, rY, cY);
    xTf = TwoDtoOneD(matTranspose(xArray), cX, rX);
    w = Matrix.identity(M, M);
    wArray = w.getArray();
    wf = TwoDtoOneD(wArray, rW, cW);
    long time1 = System.currentTimeMillis();
    if (!useGPU)
        icaCode();
    else
        icaCodeWithGPU();
    long timeTaken = System.currentTimeMillis() - time1;
    //System.out.println("\nTime taken is: " + timeTaken + "ms");
    this.executionTime = (int) timeTaken;
}
}
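The blocked eliminateBlock/adjustRow/eliminateCol/eliminateRest kernels above, followed by normalizeDiag, amount to an in-place elimination that reduces wT to the identity while the same row operations build its inverse. A minimal CPU sketch of that idea, written as an unblocked Gauss-Jordan inversion in plain Java, is shown below. It is illustrative only: the class name `GaussJordanSketch` is hypothetical, and unlike the device code it uses no BLOCKSIZE tiling and no pivoting, so it assumes a well-conditioned matrix.

```java
// Minimal CPU sketch of the in-place inversion that the GPU kernels
// perform in blocked form. Illustrative only: no pivoting, no tiling.
public class GaussJordanSketch {
    // Returns the inverse of a (assumed square and nonsingular).
    public static double[][] invert(double[][] a) {
        int n = a.length;
        double[][] m = new double[n][n];   // working copy of a
        double[][] inv = new double[n][n]; // starts as the identity
        for (int i = 0; i < n; i++) {
            inv[i][i] = 1.0;
            System.arraycopy(a[i], 0, m[i], 0, n);
        }
        for (int col = 0; col < n; col++) {
            // Normalize the pivot row (cf. normalizeDiag).
            double piv = m[col][col];
            for (int j = 0; j < n; j++) {
                m[col][j] /= piv;
                inv[col][j] /= piv;
            }
            // Eliminate the pivot column from every other row
            // (cf. eliminateCol / eliminateRest).
            for (int row = 0; row < n; row++) {
                if (row == col) continue;
                double f = m[row][col];
                for (int j = 0; j < n; j++) {
                    m[row][j] -= f * m[col][j];
                    inv[row][j] -= f * inv[col][j];
                }
            }
        }
        return inv; // m is now the identity; inv holds a^-1
    }

    public static void main(String[] args) {
        double[][] a = {{4.0, 7.0}, {2.0, 6.0}};
        double[][] inv = invert(a);
        System.out.println(inv[0][0] + " " + inv[0][1]);
        System.out.println(inv[1][0] + " " + inv[1][1]);
    }
}
```

On [[4, 7], [2, 6]] this yields [[0.6, -0.7], [-0.2, 0.4]], the analytic inverse. The GPU version above differs mainly in that each of these row and column updates is distributed across thread blocks and processed in BLOCKSIZE tiles.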
VITA
Harish Ankam has been pursuing a Master of Science in Computer Science at Texas A&M University-Commerce in Commerce, TX, since January 2013; his expected graduation is in May 2014. He received a Bachelor of Technology in Electronics and Communication Engineering from Kakatiya University in Warangal, Andhra Pradesh, India, in May 2009. He worked as a Graduate Research Assistant under Dr. Mutlu Mete from January 2013 to August 2013 and conducted research on Independent Component Analysis with Graphical Processing Units. He presented a poster at the IEEE EMBS Medical Device Symposium 2013 at the University of Texas at Dallas, TX, in November 2013 and received the third-best poster award at the symposium. He also presented a poster at the Annual MCBIOS Conference (MCBIOS-XI) at Oklahoma State University, Stillwater, OK, in March 2014.
Harish Ankam can be reached at the Department of Computer Science, Texas A&M University-Commerce, Commerce, TX 75429. His email addresses are hankam@leomail.tamuc.edu and ankamharish@yahoo.co.in.