Interactive Scientific Image Analysis and Analytics using Spark
Kevin Mader
Spark Summit East, NYC, 19 March 2015
Outline
- Background: Our Technique (why we have big data)
  - X-Ray Tomographic Microscopy
  - Imaging in 2015
- The Problem(s)
- The Tools
  - Spark Image Layer
  - 3D Imaging
  - Hyperspectral Imaging
  - Interactive Analysis / Streaming
- The Science
  - Genome-Scale Studies
  - Large Datasets
- Outlook / Developments
Synchrotron-based X-Ray Tomographic Microscopy
The only technique which can do all of the following:
- peer deep into large samples
- achieve isotropic spatial resolution (< 1 μm) with a 1.8 mm field of view
- achieve >10 Hz temporal resolution, i.e. 8 GB/s of images [1]
[1] Mokso et al., J. Phys. D, 46(49), 2013
(Image courtesy of M. Pistone at U. Bristol)
Image Science in 2015: More and Faster
- X-Ray: Swiss Light Source (SRXTM) images at >1000 fps (8 GB/s); diffraction patterns (cSAXS) at 30 GB/s; Nanoscopium (Soleil) at 10 TB/day, with 10-500 GB file sizes and very heterogeneous data
- Optical: light-sheet microscopy (see the talk by Jeremy Freeman) produces images at 500 MB/s; high-speed confocal imaging at >200 fps (78 Mb/s)
- Geospatial: new satellite projects (Skybox, etc.) will measure hundreds of terabytes to petabytes of images a year
- Personal: GoPro 4 Black - 60 MB/s (3840 x 2160 x 30 fps) for $600; fps1000 - 400 MB/s (640 x 480 x 840 fps) for $400
How much is a TB, really?
If you looked at one 1000 x 1000 sized image every second, it would take you 139 hours to browse through a terabyte of data.

Year | Time to 1 TB | Man power to keep up | Salary Costs / Month
2000 | 4096 min     | 2 people             | 25 kCHF
2008 | 1092 min     | 8 people             | 95 kCHF
2014 | 32 min       | 260 people           | 3255 kCHF
2016 | 2 min        | 3906 people          | 48828 kCHF
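A minimal sketch of the arithmetic behind the 139-hour figure, assuming 16-bit (2 bytes per pixel) images; the bit depth is an assumption, not stated on the slide:

// Back-of-the-envelope check of the browsing-time figure.
val bytesPerImage = 1000L * 1000L * 2L        // one 1000 x 1000 image at 16 bit
val imagesPerTB   = 1e12 / bytesPerImage      // decimal terabyte
val hoursToBrowse = imagesPerTB / 3600        // at one image per second
// hoursToBrowse ≈ 139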
Computing has changed: Parallel
Moore's Law: Transistors ∝ 2^(T / 18 months)
(based on data from https://gist.github.com/humberto-ortiz/de4b3a621602b78bf90d)
There are now many more transistors inside a single computer, but the processing speed hasn't increased. How can this be?
- Multiple cores: many machines have multiple cores per processor, which can perform tasks independently
- Multiple CPUs: more than one chip is commonly present
- New modalities: GPUs provide many cores which operate at slow speed
Parallel code is important.
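A toy illustration (not from the talk) of the point: the same per-pixel operation written serially and with Scala's parallel collections, which spread the work over all available cores.

// Serial vs. parallel version of the same per-pixel operation.
val pixels = Array.fill(4 * 1000 * 1000)(math.random)
val serial   = pixels.map(math.sqrt)        // runs on a single core
val parallel = pixels.par.map(math.sqrt)    // split across all cores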
Cloud Computing Costs
(Figure: the range of cloud costs, determined by peak usage, compared to a local workstation, with utilization shown as the average number of hours the computer is used each week.)
(Figure: the cost of a cloud-based solution as a percentage of the cost of buying a single machine; values below 1 are labeled with the percentage. The panels distinguish the average time to replacement for the machines, in months.)
The Problem
- There is a flood of new data: what took an entire PhD 3-4 years ago can now be measured in a weekend, or even in several seconds. Analysis tools have not kept up, are difficult to customize, and are usually highly specific.
- Optimized data structures do not fit: data structures that were fast and efficient for computers with 640 kB of memory do not make sense anymore.
- Single-core computing is too slow: CPUs are not getting that much faster, but there are a lot more of them. Iterating through a huge array takes almost as long on 2014 hardware as on 2006 hardware.
Exploratory Image Processing Priorities
- Correctness: the most important job for any piece of analysis is to be correct. A powerful testing framework is essential; avoid repetition of code, which leads to inconsistencies; use compilers to find mistakes rather than users.
- Easily understood, changed, and used: almost all image processing tasks require a number of people to evaluate and implement them and are almost always moving targets, which calls for a flexible, modular structure.
- Fast: the last of the major priorities is speed, which covers scalability, raw performance, and development time. Long waits for processing discourage exploration; manual access to data on separate disks is a huge speed barrier; real-time image processing requires millisecond latencies; implementing new ideas should be possible quickly.
The Framework First
Rather than building an analysis as quickly as possible and then trying to hack it to scale up to large datasets, choose the framework first, then start making the necessary tools. Google, Amazon, Yahoo, and many other companies have made huge in-roads into these problems. The real need is a fast, flexible framework for robustly and scalably performing complicated analyses: a sort of Excel for big imaging data.
Apache Spark and Hadoop 2: the two frameworks provide a free, out-of-the-box solution for
- scaling to >10000 computers
- storing and processing exabytes of data
- fault tolerance (2/3rds of the computers can crash and a request still accurately finishes)
- hardware and software platform independence (Mac, Windows, Linux)
Spark -> Microscopy?
These frameworks are really cool and Spark has a big vocabulary, but flatMap, filter, aggregate, join, groupBy, and fold still do not sound like anything I want to do to an image.
I want to:
- filter out noise, segment, choose regions of interest
- contour, component label
- measure, count, and analyze
- ...
Spark Image Layer
Developed at 4Quant, ETH Zurich, and the Paul Scherrer Institut, the Spark Image Layer is a Domain Specific Language for Microscopy for Spark. It converts common imaging tasks into coarse-grained Spark operations.
Spark Image Layer
We have developed a number of commands for SIL handling standard image processing tasks. Fully extensible with ...
Use case: Hyperspectral Imaging
Hyperspectral imaging is a rapidly growing area with the potential for massive datasets and a severe deficit of usable tools. The scale of the data is large and standard image processing tools are ill-suited for handling it, although the ideas used in image processing (filtering, thresholding, segmentation, ...) are equally applicable to hyperspectral data, and distributed, parallel approaches make even more sense on such massive datasets.
Flexibility through Types
Developing in Scala brings additional flexibility through types [1]. In microscopy the standard formats are 2-, 3-, and even 4- or more dimensional arrays or matrices, which can be iterated through quickly using CPU and GPU code. While that is still possible in Scala, there is a great deal more flexibility in the data types, allowing anything to be stored as an image and then processed, as long as the basic functions make sense.
[1] Fighting Bit Rot with Types (Experience Report: Scala Collections), M. Odersky, FSTTCS 2009, December 2009
What is an image?
A collection of positions and values, maybe more (not an array of doubles). Arrays are efficient for storing in computer memory, but often a poor way of expressing scientific ideas and analyses.
- Filter noise? Combine information from nearby pixels.
- Find objects? Determine groups of pixels which are very similar to the desired result.
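One way this idea could be written down in Scala (an illustrative sketch with hypothetical names, not the actual Spark Image Layer types):

// An "image" as a collection of (position, value) pairs rather than an array;
// the value type T can be a Double, a spectrum, or anything with sensible math.
case class Position(x: Int, y: Int, z: Int)
case class ImageElement[T](pos: Position, value: T)

type SimpleImage[T] = Seq[ImageElement[T]]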
Making Coding Simpler with Types

trait BasicMathSupport[T] extends Serializable {
  def plus(a: T, b: T): T
  def times(a: T, b: T): T
  def scale(a: T, b: Double): T
  def negate(a: T): T = scale(a, -1)
  def invert(a: T): T
  def abs(a: T): T
  def minus(a: T, b: T): T = plus(a, negate(b))
  def divide(a: T, b: T): T = times(a, invert(b))
  def compare(a: T, b: T): Int
}
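For example, an instance for plain Double pixels could look like this (a sketch; the deck itself only shows the Array[Double] spectra instance on the next slide). The derived operations (negate, minus, divide) come for free from the trait.

implicit val DoubleBMS = new BasicMathSupport[Double] {
  def plus(a: Double, b: Double) = a + b
  def times(a: Double, b: Double) = a * b
  def scale(a: Double, b: Double) = a * b
  def invert(a: Double) = 1.0 / a
  def abs(a: Double) = math.abs(a)
  def compare(a: Double, b: Double) = java.lang.Double.compare(a, b)
}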
Continuing with Types
Simple filter implementation (assuming the DSL's Image, D3int, and runFilter):

def SimpleFilter[T](inImage: Image[T])(implicit wst: BasicMathSupport[T]) = {
  val width: Double = 1
  // weight each value by a Gaussian of its distance from the kernel centre
  val kernel = (pos: D3int, value: T) =>
    wst.scale(value, math.exp(-math.pow(pos.mag / width, 2)))
  // average neighbouring contributions
  val kernelReduce = (ptA: T, ptB: T) => wst.scale(wst.plus(ptA, ptB), 0.5)
  runFilter(inImage, kernel, kernelReduce)
}

Spectra as well-supported types:

implicit val SpectraBMS = new BasicMathSupport[Array[Double]] {
  def plus(a: Array[Double], b: Array[Double]) = a.zip(b).map { case (x, y) => x + y }
  ...
  def scale(a: Array[Double], b: Double) = a.map(_ * b)
}
Interactive Analysis
Combining many different components together inside of the Spark Shell, IPython, or Zeppelin makes it easier to assemble workflows.
Scientific Cases: Genome-Scale Imaging
We want to understand the relationship between genetic background and bone structure. With existing tools, analysis is possible and a number of publications have been made, even ones that show differences between strains of mice. But:
- n < 12
- time-consuming (years between measurement and publication)
- not flexible or reproducible
- not cloud-based
Genome-Scale Imaging
Genetic studies require hundreds to thousands of samples; in this case the difference between 717 and 1200 samples is the difference between finding the links and finding nothing.
2008 approach - 120 years:
- hand identification -> 30 s / object
- 30-40k objects per sample
- one sample in 6.25 weeks
2014 approach - 1.5 years:
- ImageJ macro for segmentation (2-4 hours / sample)
- Python script for shape analysis (3 hours / sample)
- Paraview macro for network and connectivity (2 hours / sample)
- Python script to pool results (3-4 hours)
- MySQL database storing results (5 minutes / query)
Genetic Studies using the Spark Image Layer
Analysis could be completed in several months instead of 120 years (in the cloud it could now be completed in days), and the data can be freely explored and analyzed.

val bones = sc.loadImages("work/f2_bones/*/bone.tif")
// Segment hard and soft tissues
val hardTissue = bones.threshold(OTSU)
val softTissue = hardTissue.invert
// Label cells
val cells = hardTissue.componentLabel.
  filter(c => c.size > 100 && c.size < 1000)
// Export results
cells.shapeAnalysis.WriteOutput("lacuna.csv")
Parallel Tools for Image and Quantitative Analysis

val cells = sqlContext.csvFile("work/f2_bones/*/cells.csv")
val avgVol = sqlContext.sql("SELECT SAMPLE, AVG(VOLUME) FROM cells GROUP BY SAMPLE")

Collaborators / competitors can verify results and extend the analyses.
Combine images with results:

avgVol.filter(_._2 > 1000).map(sampleToPath).joinByKey(bones)

See immediately, in datasets of terabytes, which image had the largest cells. New hypotheses and analyses can be done in seconds / minutes.

Task                  | Single Core Time | Spark Time (40 cores)
Load and Preprocess   | 360 minutes      | 10 minutes
Single Column Average | 4.6 s            | 400 ms
1 K-means Iteration   | 2 minutes        | 1 s
Science Problems: Full Brain Imaging
Collaboration with A. Astolfo and A. Patera.
- Measure a full mouse brain (1 cm³) with cellular resolution (1 μm)
- 10 x 10 x 10 scans at 2560 x 2560 x 2160 -> 14 TVoxels = 56 TB (the image shown is only 0.000004% of the entire dataset)
- Each scan needs to be registered and aligned with the others
- There are no computers with 56 TB of memory
- Even multithreaded approaches are not feasible and require many logistics
- Analysis of the stitched data is also of interest (segmentation, vessel analysis, distribution and network connectivity)
Science Problems: Big Stitching

Images: RDD[((x, y, z), Img[Double])] = [(x⃗, Img), ...]

val dispField = Images.cartesian(Images).map {
  case ((xA, ImA), (xB, ImB)) =>
    xcorr(ImA, ImB, in = xB - xA)
}
From Matching to Stitching
The tile positions are updated from the information provided by the cross correlations, applying appropriate smoothing criteria if necessary. The stitching itself, rather than rewriting the original data, can be done in a lazy fashion as certain regions of the image are read. This also ensures the original data is left unaltered and all analysis is reversible.

def getView(tPos, tSize) = stImgs.
  filter { case (x, img) => abs(x - tPos) < img.size }.
  map { case (x, img) =>
    val oImg = new Image(tSize)
    oImg.copy(img, x, tPos)
    oImg
  }.addImages(AVG)
Viewing Regions

getView(Pos(26.5, 13), Size(2, 2))
Real-time with Spark Streaming: Webcam
In the biological imaging community, the open-source tools ImageJ2 and Fiji are widely accepted and have a large number of readily available plugins and tools. We can integrate this functionality directly into Spark and perform operations on much larger datasets than a single machine could hold in memory. Additionally, these analyses can be performed on streaming data.
Streaming Analysis: Real-time Webcam Processing

// receive frames from the webcam as a stream
val wr = new WebcamReceiver()
val ssc = sc.toStreaming(strTime)
val imgList = ssc.receiverStream(wr)

// Filter images
val filtImgs = imgList.mapValues(_.run("Median...", "radius=3"))

// Create a background image
// (inImages here is the batch of collected frames used for background estimation)
val totImgs = inImages.count()
val bgImage = inImages.reduce(_ add _).multiply(1.0 / totImgs)
Identify Outliers in Streams

// Remove the background image and find the mean value
val eventImages = filtImgs.transform { inImages =>
  inImages.map { case (inTime, inImage) =>
    val corImage = inImage.subtract(bgImage)
    (corImage.getImageStatistics().mean, (inTime, corImage))
  }
}

// Show the outliers
eventImages.filter(iv => Math.abs(iv._1) > 20).
  foreachRDD(showResultsStr("outlier", _))
Streaming Demo with Webcam
As a scientist (not a data scientist)
Apache Spark is a brilliant platform, and utilizing GraphX, MLlib, and other packages there are unlimited possibilities. Scala can be a beautiful but not easy language; Python is an easier language. Both suffer from:
- Non-obvious workflows
- Scripts depending on scripts depending on scripts (can be very fragile)
- Although all analyses can be expressed as a workflow, this is often difficult to see from the code
- Non-technical persons have little ability to understand or make minor adjustments to the analysis
- Parameters require recompiling to change, or GUIs need to be placed on top
A basic image filtering operation
- Thanks to Spark, it is cached, in memory, approximate, cloud-ready
- Thanks to Map-Reduce, it is fault-tolerant, parallel, distributed
- Thanks to Java, it is hardware agnostic
But it is also not really so readable:

def spread_voxels(pvec: ((Int, Int), Double), windSize: Int = 1) = {
  val wind = -windSize to windSize
  val pos = pvec._1
  val scalevalue = pvec._2 / (wind.length * wind.length)
  for (x <- wind; y <- wind)
    yield ((pos._1 + x, pos._2 + y), scalevalue)
}

val filtImg = roiImg.
  flatMap(cvec => spread_voxels(cvec)).
  filter(roiFun).
  reduceByKey(_ + _)
Little blocks for big data
Here we use a KNIME-based workflow and our Spark Image Layer extensions to create a workflow without any Scala or programming knowledge, with an easily visible flow from one block to the next, and without any performance overhead from using other tools.
Reality Check
Spark is not the most performant option:
- Dedicated, optimized CPU and GPU codes will perform slightly to much better when evaluated by pixels per second per unit of processing power
- These codes will in turn be wildly outperformed by dedicated hardware / FPGA solutions
- Serialization overhead and network congestion are not negligible for large datasets
But:
- Scala / Python in Spark is substantially easier to write and test; highly optimized codes are very inflexible
- Human time is 400x more expensive than AWS time
- Mistakes due to poor testing can be fatal
- Spark scales smoothly to enormous datasets; GPUs rarely have more than a few gigabytes, and writing code that pages to disk is painful
- Spark is hardware agnostic (no drivers or vendor lock-in)
We have a cool tool, but what does this mean for me?
A spinoff - 4Quant: From images to insight
- Cloud Image Processing: use our distributed version of ImageJ in the cloud to analyze thousands of remote datasets using your own, our, or community-provided processing routines
- Custom Analysis Solutions: custom-tailored software to solve your problems
- One Stop Shop: measurement, analysis, and statistical analysis
- Consulting: advice on imaging techniques and analysis possibilities; development of new analysis tools and workflows
- Education / Training: workshops on image analysis; courses / training; Quantitative Big Imaging
Acknowledgements
- AIT at PSI and Scientific Computing at ETH
- TOMCAT Group
We are interested in partnerships and collaborations. Learn more at:
- 4Quant: From Images to Statistics - http://www.4quant.com
- X-Ray Imaging Group at ETH Zurich - http://bit.ly/1gD8wKb
- Quantitative Big Imaging Course at ETH Zurich
Feature Vectors
A pairing between spatial information (position) and some other kind of information (value): x⃗ → f⃗.
We are used to seeing images in a grid format, where the position indicates the row and column in the grid and the intensity (absorption, reflection, tip deflection, etc.) is shown as a different color.
The alternative form for this image is a list of positions and a corresponding value:

x | y | Intensity
1 | 1 | 12
2 | 1 | 68
3 | 1 | 81
4 | 1 | 89
5 | 1 | 87
1 | 2 | 40

This representation can be called the feature vector, and in this case it only has Intensity: Î = (x⃗, f⃗).
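A small, hypothetical sketch of moving from the grid form to the feature-vector form (the first few values match the table above; the rest are made up for illustration):

// A tiny 2D image as a grid of intensities (grid(row)(column)).
val grid = Array(
  Array(12.0, 68.0, 81.0, 89.0, 87.0),
  Array(40.0, 77.0, 83.0, 90.0, 85.0)
)
// Flatten into feature vectors (x, y, intensity), 1-indexed as in the table.
val featureVectors = for {
  (row, yIdx)       <- grid.zipWithIndex
  (intensity, xIdx) <- row.zipWithIndex
} yield (xIdx + 1, yIdx + 1, intensity)
// Array((1,1,12.0), (2,1,68.0), (3,1,81.0), (4,1,89.0), (5,1,87.0), (1,2,40.0), ...)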
Why Feature Vectors
If we use feature vectors to describe our image, we no longer worry about how the images will be displayed, and can focus on the segmentation/thresholding problem from a classification rather than an image-processing standpoint.
Example: we have an image of a cell and we want to identify the membrane (the ring) separately from the nucleus (the point in the middle). A simple threshold doesn't work because we identify the point in the middle as well. We could try to use morphological tricks to get rid of the point in the middle, or we could better tune our segmentation to the ring structure.
Adding a new feature
In this case we add a very simple feature to the image: the distance from the center of the image.

x   | y   | Intensity | Distance
-10 | -10 | 0.9350683 | 14.14214
-10 | -9  | 0.7957197 | 13.45362
-10 | -8  | 0.6045178 | 12.80625
-10 | -7  | 0.3876575 | 12.20656
-10 | -6  | 0.1692429 | 11.66190

We now have a more complicated image, which we can't as easily visualize, but we can incorporate these two pieces of information together.
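A minimal sketch (with a hypothetical helper) of computing such a distance feature for each (x, y, intensity) element:

// Augment each (x, y, intensity) feature vector with the distance from the
// image centre; coordinates are assumed to already be centred, as in the
// table above (-10 .. 10).
def addDistance(pixels: Seq[(Int, Int, Double)]): Seq[(Int, Int, Double, Double)] =
  pixels.map { case (x, y, intensity) =>
    (x, y, intensity, math.sqrt((x * x + y * y).toDouble))
  }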
Applying two criteria
Now instead of trying to find the intensity of the ring alone, we can combine intensity and distance to identify it:

if (5 < Distance < 10 and 0.5 < Intensity < 1.0)
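As a sketch, the same two criteria applied to the augmented feature vectors (pixels and withDistance are the hypothetical collections from the previous snippet):

// Keep only the pixels that satisfy both criteria -- the ring.
val withDistance = addDistance(pixels)
val ring = withDistance.filter { case (_, _, intensity, distance) =>
  distance > 5 && distance < 10 &&
  intensity > 0.5 && intensity < 1.0
}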
Common Features
The distance, while illustrative, is not a commonly used feature; more common are various filters applied to the image:
- Gaussian filter (information on the values of the surrounding pixels)
- Sobel / Canny edge detection (information on edges in the vicinity)
- Entropy (information on variability in the vicinity)

x | y  | Intensity | Sobel | Gaussian
1 | 1  | 0.94      | 0.32  | 0.53
1 | 10 | 0.48      | 0.50  | 0.45
1 | 11 | 0.50      | 0.50  | 0.46
1 | 12 | 0.48      | 0.64  | 0.46
1 | 13 | 0.43      | 0.78  | 0.45
1 | 14 | 0.33      | 0.94  | 0.42
Analyzing the feature vector
The distributions of the features appear very different and can thus likely be used for identifying different parts of the images. We can combine this with our a priori information (this is called supervised analysis).
Using Machine Learning
Now that the images are stored as feature vectors, they can be easily analyzed with standard machine learning tools. It is also much easier to combine them with training information.

x   | y | Absorb    | Scatter   | Training
700 | 4 | 0.3706262 | 0.9683849 | 0.0100140
704 | 4 | 0.3694059 | 0.9648784 | 0.0100140
692 | 8 | 0.3706371 | 0.9047878 | 0.0183156
696 | 8 | 0.3712537 | 0.9341989 | 0.0334994
700 | 8 | 0.3666887 | 0.9826912 | 0.0453049
704 | 8 | 0.3686623 | 0.8728824 | 0.0453049

We want to predict Training from x, y, Absorb, and Scatter -> MLlib: logistic regression, random forest, k-nearest neighbors, ...
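A hedged sketch of what this could look like with MLlib's logistic regression; the rows RDD and the thresholding of the continuous training value into a binary label are assumptions, not from the talk:

import org.apache.spark.mllib.classification.LogisticRegressionWithLBFGS
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.LabeledPoint

// rows: hypothetical RDD[(Double, Double, Double, Double, Double)]
// holding (x, y, absorb, scatter, training) as in the table above.
val labeled = rows.map { case (x, y, absorb, scatter, training) =>
  LabeledPoint(if (training > 0.5) 1.0 else 0.0, Vectors.dense(x, y, absorb, scatter))
}
val model = new LogisticRegressionWithLBFGS().setNumClasses(2).run(labeled)
val predictions = labeled.map(p => (p.label, model.predict(p.features)))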
Beyond Image Processing
For many datasets, processing, segmentation, and morphological analysis extract all the information needed. For many systems, however, such as bone tissue, cellular tissues, cellular materials, and many others, the structure is just the beginning, and the most interesting results come from applying physical, chemical, or biological rules inside these structures, for example a force balance on each element i:

m ẍ_i = Σ_j F⃗_ij

Such systems can be easily represented by a graph and analyzed using GraphX in a distributed, fault-tolerant manner.
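A hedged sketch of how such a structure could be handed to GraphX; the cellPositions and cellConnections RDDs are hypothetical outputs of the segmentation step:

import org.apache.spark.graphx.{Edge, Graph}

// cellPositions:   hypothetical RDD[(Long, (Double, Double, Double))] of cell id -> centroid
// cellConnections: hypothetical RDD[(Long, Long, Double)] of touching cells and contact area
val vertices = cellPositions
val edges = cellConnections.map { case (a, b, contactArea) => Edge(a, b, contactArea) }
val cellGraph = Graph(vertices, edges)

// e.g. the total contact area per cell, computed as a distributed aggregation
val totalContact = cellGraph.aggregateMessages[Double](
  ctx => { ctx.sendToSrc(ctx.attr); ctx.sendToDst(ctx.attr) },
  _ + _
)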
Hadoop Filesystem (HDFS, not HDF5)
The bottleneck is the filesystem connection: many nodes (10+) reading in parallel bring even a GPFS-based InfiniBand system to a crawl. One of the central tenets of MapReduce is data-centric computation: instead of moving data to the computation, move the computation to the data.
- Use fast local storage for storing everything redundantly -> less transfer and fault tolerance
- Largest file size: 512 yottabytes; Yahoo has a 14-petabyte filesystem in use
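As a sketch of what data-local reading could look like (the path is illustrative; decoding the bytes into images would use an appropriate reader):

// Read image files directly from HDFS so Spark can schedule each task
// on a node that already holds the corresponding blocks.
val rawImages = sc.binaryFiles("hdfs:///data/f2_bones/*/bone.tif")
val imageBytes = rawImages.mapValues(_.toArray())   // (path, Array[Byte]) pairs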
