1
March 2017
ACM Transactions on Reconfigurable Technology and Systems (TRETS) - Special Section on Field Programmable Logic and Applications 2015 and Regular Papers: Volume 10 Issue 2, April 2017
Publisher: ACM
Bibliometrics:
Citation Count: 0
Downloads (6 Weeks): 22, Downloads (12 Months): 32, Downloads (Overall): 32
Full text available:
PDF
High-level abstractions separate algorithm design from platform implementation, allowing programmers to focus on algorithms while building complex systems. This separation also provides system programmers and compilers an opportunity to optimize platform services on an application-by-application basis. In field-programmable gate arrays (FPGAs), platform-level malleability extends to the memory system: Unlike general-purpose ...
Keywords:
scalable cache, memory hierarchy, resource-aware optimization, FPGA
2
February 2017
FPGA '17: Proceedings of the 2017 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays
Publisher: ACM
Bibliometrics:
Citation Count: 0
Downloads (6 Weeks): 16, Downloads (12 Months): 105, Downloads (Overall): 105
Full text available:
PDF
Memory systems play a key role in the performance of FPGA applications. As FPGA deployments move towards design entry points that are more serial, memory latency has become a serious design consideration. For these applications, memory network optimization is essential in improving performance. In this paper, we examine the automatic, ...
Keywords:
network on chip, FPGA, compiler optimization, memory network, memory system, reconfigurable computing
3
CLARA: Circular Linked-List Auto and Self Refresh Architecture
Aditya Agrawal,
Mike O'Connor,
Evgeny Bolotin,
Niladrish Chatterjee,
Joel Emer,
Stephen Keckler
October 2016
MEMSYS '16: Proceedings of the Second International Symposium on Memory Systems
Publisher: ACM
Bibliometrics:
Citation Count: 0
Downloads (6 Weeks): 5, Downloads (12 Months): 47, Downloads (Overall): 47
Full text available:
PDF
With increasing DRAM densities, the performance and energy overheads of refresh operations are increasingly significant. When the system is active, refresh commands render DRAM banks unavailable for increasing periods of time. These refresh operations can interfere with regular memory operations and hurt performance. In addition, when the system is idle, ...
Keywords:
Self refresh, DRAM, Auto refresh
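The abstract above notes that refresh renders DRAM banks unavailable for increasing periods as densities grow. As a hedged back-of-the-envelope illustration of that overhead (not a figure from the paper), the fraction of time a rank spends in auto refresh can be approximated as tRFC/tREFI; the timing values below are typical DDR4-era assumptions used only for illustration.

    # Hedged estimate of auto-refresh overhead. tREFI (refresh interval) and
    # tRFC (refresh cycle time) values are illustrative assumptions, not
    # figures taken from the CLARA paper.
    def refresh_overhead(t_rfc_ns, t_refi_ns=7800.0):
        """Approximate fraction of time a rank is busy refreshing."""
        return t_rfc_ns / t_refi_ns

    for density, t_rfc in [("4Gb", 260.0), ("8Gb", 350.0), ("16Gb", 550.0)]:
        print(f"{density}: ~{refresh_overhead(t_rfc) * 100:.1f}% of time spent in refresh")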
4
Eyeriss: a spatial architecture for energy-efficient dataflow for convolutional neural networks
June 2016
ISCA '16: Proceedings of the 43rd International Symposium on Computer Architecture
Publisher: IEEE Press
Bibliometrics:
Citation Count: 7
Downloads (6 Weeks): 27, Downloads (12 Months): 240, Downloads (Overall): 240
Full text available:
PDF
Deep convolutional neural networks (CNNs) are widely used in modern AI systems for their superior accuracy but at the cost of high computational complexity. The complexity comes from the need to simultaneously process hundreds of filters and channels in the high-dimensional convolutions, which involve a significant amount of data movement. ...
Also published in:
October 2016
ACM SIGARCH Computer Architecture News - ISCA'16: Volume 44 Issue 3, June 2016
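The abstract above attributes CNN cost to high-dimensional convolutions over many filters and channels. The reference loop nest below makes that dimensionality concrete; the layer shapes and naming are generic assumptions and do not reflect Eyeriss's dataflow, which the paper maps onto a spatial array to reduce data movement.

    import numpy as np

    def conv_layer(inp, weights, stride=1):
        """Naive direct convolution: inp is (C, H, W), weights is (M, C, R, S).
        Returns output of shape (M, E, F). Illustrative only; real dataflows
        (including Eyeriss's row-stationary mapping) reorder these loops to
        minimize data movement."""
        C, H, W = inp.shape
        M, C2, R, S = weights.shape
        assert C == C2
        E = (H - R) // stride + 1
        F = (W - S) // stride + 1
        out = np.zeros((M, E, F))
        for m in range(M):                       # output channels (filters)
            for e in range(E):                   # output rows
                for f in range(F):               # output columns
                    for c in range(C):           # input channels
                        for r in range(R):       # filter rows
                            for s in range(S):   # filter columns
                                out[m, e, f] += (inp[c, e * stride + r, f * stride + s]
                                                 * weights[m, c, r, s])
        return out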
5
LMC: Automatic Resource-Aware Program-Optimized Memory Partitioning
February 2016
FPGA '16: Proceedings of the 2016 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays
Publisher: ACM
Bibliometrics:
Citation Count: 1
Downloads (6 Weeks): 6, Downloads (12 Months): 100, Downloads (Overall): 180
Full text available:
PDF
As FPGAs have grown in size and capacity, FPGA memory systems have become both richer and more diverse in order to support the increased computational capacity of FPGA fabrics. Using these resources, and using them well, has become commensurately more difficult, especially in the context of legacy designs ported from ...
Keywords:
resource-aware optimization, fpga memory partitioning
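The abstract above concerns automatically partitioning program memories across an FPGA's memory resources. As a hedged illustration of the underlying idea (not the LMC algorithm itself), the snippet below shows cyclic banking: a logical array index is mapped to a bank and a local offset so that consecutive accesses land in different physical memories.

    # Cyclic partitioning of a logical array across NUM_BANKS memories.
    # A generic banking scheme shown only to illustrate memory partitioning;
    # it is not the resource-aware optimization described in the LMC paper.
    NUM_BANKS = 4

    def bank_of(index):
        return index % NUM_BANKS        # which physical memory holds the word

    def local_offset(index):
        return index // NUM_BANKS       # address within that memory

    # Example: indices 0..7 spread round-robin over 4 banks.
    for i in range(8):
        print(f"index {i} -> bank {bank_of(i)}, offset {local_offset(i)}")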
6
A scalable architecture for ordered parallelism
December 2015
MICRO-48: Proceedings of the 48th International Symposium on Microarchitecture
Publisher: ACM
Bibliometrics:
Citation Count: 1
Downloads (6 Weeks): 15, Downloads (12 Months): 201, Downloads (Overall): 356
Full text available:
PDF
We present Swarm, a novel architecture that exploits ordered irregular parallelism, which is abundant but hard to mine with current software and hardware techniques. In this architecture, programs consist of short tasks with programmer-specified timestamps. Swarm executes tasks speculatively and out of order, and efficiently speculates thousands of tasks ...
Keywords:
ordered parallelism, speculative execution, multicore, fine-grain parallelism, irregular parallelism, synchronization
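The abstract above describes programs as short tasks carrying programmer-specified timestamps, which Swarm executes speculatively while preserving timestamp order. The sketch below shows only the sequential semantics such a program implies: a priority queue of timestamped tasks drained in order. The task and queue names are assumptions, and none of Swarm's speculation hardware is modeled.

    import heapq, itertools

    # Sequential reference semantics of a timestamp-ordered task model.
    # Swarm's contribution is running these tasks speculatively in parallel
    # while preserving this order; that machinery is not modeled here.
    task_queue = []                  # min-heap ordered by timestamp
    _tie = itertools.count()         # tie-breaker so heap entries never compare functions

    def spawn(timestamp, fn, *args):
        heapq.heappush(task_queue, (timestamp, next(_tie), fn, args))

    def run():
        while task_queue:
            ts, _, fn, args = heapq.heappop(task_queue)
            fn(ts, *args)

    # Toy example: a task that spawns a child with a later timestamp.
    def visit(ts, node):
        print(f"t={ts}: visiting {node}")
        if node < 3:
            spawn(ts + 1, visit, node + 1)

    spawn(0, visit, 0)
    run()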
7
A fast and accurate analytical technique to compute the AVF of sequential bits in a processor
Steven Raasch,
Arijit Biswas,
Jon Stephan,
Paul Racunas,
Joel Emer
December 2015
MICRO-48: Proceedings of the 48th International Symposium on Microarchitecture
Publisher: ACM
Bibliometrics:
Citation Count: 1
Downloads (6 Weeks): 3, Downloads (12 Months): 53, Downloads (Overall): 150
Full text available:
PDF
The rate of particle induced soft errors in a processor increases in proportion to the number of bits. This soft error rate (SER) can limit the performance of a system by placing an effective limit on the number of cores, nodes or clusters. The vulnerability of bits in a processor ...
Keywords:
fault injection, sequentials, AVF, fault simulation, soft error, ACE analysis, reliability
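The paper above builds on Architectural Vulnerability Factor (AVF) analysis for sequential state. As background (the standard ACE-based definition, not the paper's new analytical technique), a bit's AVF is the fraction of cycles during which it holds ACE (Architecturally Correct Execution) state, and a structure's AVF averages this over its bits.

    # Standard ACE-based AVF computation (background definition only, not the
    # paper's fast analytical technique). ace_cycles[i] is the number of
    # cycles bit i held state required for architecturally correct execution.
    def structure_avf(ace_cycles, total_cycles):
        num_bits = len(ace_cycles)
        return sum(ace_cycles) / (num_bits * total_cycles)

    # Example: a 4-bit structure observed for 1000 cycles.
    print(structure_avf([1000, 250, 0, 750], total_cycles=1000))   # -> 0.5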
8
Michael Pellauer,
Angshuman Parashar,
Michael Adler,
Bushra Ahsan,
Randy Allmon,
Neal Crago,
Kermin Fleming,
Mohit Gambhir,
Aamer Jaleel,
Tushar Krishna,
Daniel Lustig,
Stephen Maresh,
Vladimir Pavlov,
Rachid Rayess,
Antonia Zhai,
Joel Emer
September 2015
ACM Transactions on Computer Systems (TOCS): Volume 33 Issue 3, September 2015
Publisher: ACM
Bibliometrics:
Citation Count: 0
Downloads (6 Weeks): 13, Downloads (12 Months): 79, Downloads (Overall): 286
Full text available:
PDF
There has been recent interest in exploring the acceleration of nonvectorizable workloads with spatially programmed architectures that are designed to efficiently exploit pipeline parallelism. Such an architecture faces two main problems: how to efficiently control each processing element (PE) in the system, and how to facilitate inter-PE communication without the ...
Keywords:
Spatial programming, reconfigurable accelerators
9
LEAP Shared Memories: Automating the Construction of FPGA Coherent Memories
May 2014
FCCM '14: Proceedings of the 2014 IEEE 22nd International Symposium on Field-Programmable Custom Computing Machines
Publisher: IEEE Computer Society
Parallel programming has been widely used in many scientific and technical areas to solve large problems. While general-purpose processors have rich infrastructure to support parallel programming on shared memory, such as coherent caches and synchronization libraries, parallel programming infrastructure for FPGAs is limited. Thus, development of FPGA-based parallel algorithms remains ...
Keywords:
FPGA shared memory, coherency, synchronization
10
December 2013
ACM Transactions on Architecture and Code Optimization (TACO): Volume 10 Issue 4, December 2013
Publisher: ACM
Bibliometrics:
Citation Count: 1
Downloads (6 Weeks): 2, Downloads (12 Months): 23, Downloads (Overall): 323
Full text available:
PDF
As microprocessor designs integrate more cores, scalability of cache coherence protocols becomes a challenging problem. Most directory-based protocols avoid races by using blocking tag directories that can impact the performance of parallel applications. In this article, we first quantitatively demonstrate that state-of-the-art blocking protocols significantly constrain throughput at large core ...
Keywords:
tag directories, Cache coherence, nonblocking, synchronization
11
Triggered instructions: a control paradigm for spatially-programmed architectures
Angshuman Parashar,
Michael Pellauer,
Michael Adler,
Bushra Ahsan,
Neal Crago,
Daniel Lustig,
Vladimir Pavlov,
Antonia Zhai,
Mohit Gambhir,
Aamer Jaleel,
Randy Allmon,
Rachid Rayess,
Stephen Maresh,
Joel Emer
June 2013
ISCA '13: Proceedings of the 40th Annual International Symposium on Computer Architecture
Publisher: ACM
Bibliometrics:
Citation Count: 12
Downloads (6 Weeks): 19, Downloads (12 Months): 129, Downloads (Overall): 1,050
Full text available:
PDF
In this paper, we present triggered instructions, a novel control paradigm for arrays of processing elements (PEs) aimed at exploiting spatial parallelism. Triggered instructions completely eliminate the program counter and allow programs to transition concisely between states without explicit branch instructions. They also allow efficient reactivity to inter-PE communication ...
Keywords:
reconfigurable accelerators, spatial programming
Also published in:
June 2013
ACM SIGARCH Computer Architecture News - ISCA '13: Volume 41 Issue 3, June 2013
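The abstract above says triggered instructions eliminate the program counter: each PE holds a pool of instructions guarded by triggers over predicate registers and channel status, and an instruction whose trigger is true may fire. A hedged behavioral sketch of that scheduling idea follows; the data structures and trigger encoding are illustrative assumptions, not the ISA defined in the paper.

    # Behavioral sketch of a triggered-instruction PE: no program counter,
    # just a pool of guarded instructions.
    class PE:
        def __init__(self, instructions):
            self.preds = {}            # predicate registers (unused in this toy example)
            self.in_q = []             # input channel (head visible to triggers)
            self.out_q = []            # output channel
            self.instructions = instructions   # list of (trigger, action) pairs

        def step(self):
            for trigger, action in self.instructions:
                if trigger(self):      # first ready instruction fires this cycle
                    action(self)
                    return True
            return False               # nothing ready

    # Example: forward even inputs and drop odd ones, with no branches.
    pe = PE([
        (lambda p: p.in_q and p.in_q[0] % 2 == 0,
         lambda p: p.out_q.append(p.in_q.pop(0))),
        (lambda p: p.in_q and p.in_q[0] % 2 == 1,
         lambda p: p.in_q.pop(0)),
    ])
    pe.in_q.extend([1, 2, 3, 4])
    while pe.step():
        pass
    print(pe.out_q)   # -> [2, 4]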
12
A Hierarchical Architectural Framework for Reconfigurable Logic Computing
May 2013
IPDPSW '13: Proceedings of the 2013 IEEE 27th International Symposium on Parallel and Distributed Processing Workshops and PhD Forum
Publisher: IEEE Computer Society
Recently there has been growing interest in using Reconfigurable Logic (RL) for computation because of the significant performance gains that they can provide over traditional architectures on many classes of workloads. While there is a rich body of prior work proposing a variety of reconfigurable systems, we believe there hasn't ...
Keywords:
reconfigurable logic architecture taxonomy
13
Scheduling heterogeneous multi-cores through Performance Impact Estimation (PIE)
June 2012
ISCA '12: Proceedings of the 39th Annual International Symposium on Computer Architecture
Publisher: IEEE Computer Society
Bibliometrics:
Citation Count: 51
Downloads (6 Weeks): 13, Downloads (12 Months): 174, Downloads (Overall): 1,422
Full text available:
PDF
Single-ISA heterogeneous multi-core processors are typically composed of small (e.g., in-order) power-efficient cores and big (e.g., out-of-order) high-performance cores. The effectiveness of heterogeneous multi-cores depends on how well a scheduler can map workloads onto the most appropriate core type. In general, small cores can achieve good performance if the workload ...
Also published in:
September 2012
ACM SIGARCH Computer Architecture News - ISCA '12: Volume 40 Issue 3, June 2012
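The abstract above explains that scheduler effectiveness hinges on predicting which core type suits each workload. The skeleton below shows only the outer scheduling step: estimate each job's big-core benefit and give the big core to the job predicted to benefit most. The estimation function is a placeholder; PIE's actual model derives its estimates from CPI stacks, MLP and ILP measured on the currently running core, which is not reproduced here.

    # Skeleton of performance-impact-based scheduling on a 1-big / N-small
    # system. estimated_speedup() is a stand-in for the PIE model.
    def schedule(jobs, estimated_speedup):
        """jobs: list of job ids; estimated_speedup(job) -> predicted big-core
        speedup over a small core. Returns (big_core_job, small_core_jobs)."""
        ranked = sorted(jobs, key=estimated_speedup, reverse=True)
        return ranked[0], ranked[1:]

    # Toy profile: job 'b' is predicted to benefit most from the big core.
    profile = {"a": 1.3, "b": 2.1, "c": 1.1}
    big, small = schedule(list(profile), lambda j: profile[j])
    print(big, small)   # -> b ['a', 'c']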
14
CRUISE: cache replacement and utility-aware scheduling
March 2012
ASPLOS XVII: Proceedings of the seventeenth international conference on Architectural Support for Programming Languages and Operating Systems
Publisher: ACM
Bibliometrics:
Citation Count: 20
Downloads (6 Weeks): 7, Downloads (12 Months): 59, Downloads (Overall): 934
Full text available:
PDF
When several applications are co-scheduled to run on a system with multiple shared LLCs, there is opportunity to improve system performance. This opportunity can be exploited by the hardware, software, or a combination of both hardware and software. The software, i.e., an operating system or hypervisor, can improve system performance ...
Keywords:
cache replacement, shared cache, scheduling
Also published in:
April 2012
ACM SIGARCH Computer Architecture News - ASPLOS '12: Volume 40 Issue 1, March 2012
June 2012
ACM SIGPLAN Notices - ASPLOS '12: Volume 47 Issue 4, April 2012
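The abstract above argues that the OS or hypervisor can improve performance when co-scheduling applications onto multiple shared LLCs. A hedged sketch of the general idea follows: score each application by how sensitive it is to LLC capacity and spread the most sensitive ones across caches, pairing them with insensitive ones. This is a simplification for illustration, not the exact CRUISE policy.

    # Simplified cache-usage-aware co-scheduling: place the most LLC-sensitive
    # applications on different caches, then fill remaining slots with less
    # sensitive ones. Sensitivity scores are assumed inputs.
    def co_schedule(apps, sensitivity, num_llcs):
        """apps: list of app ids; sensitivity: app -> float (higher = more
        LLC-sensitive); returns a list of per-LLC app lists."""
        caches = [[] for _ in range(num_llcs)]
        # Most sensitive first, each assigned to the currently least-loaded LLC.
        for app in sorted(apps, key=sensitivity.get, reverse=True):
            min(caches, key=len).append(app)
        return caches

    sens = {"lbm": 0.9, "mcf": 0.8, "povray": 0.1, "gamess": 0.05}
    print(co_schedule(list(sens), sens, num_llcs=2))
    # -> [['lbm', 'povray'], ['mcf', 'gamess']]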
15
Leveraging latency-insensitivity to ease multiple FPGA design
February 2012
FPGA '12: Proceedings of the ACM/SIGDA international symposium on Field Programmable Gate Arrays
Publisher: ACM
Bibliometrics:
Citation Count: 6
Downloads (6 Weeks): 4, Downloads (12 Months): 34, Downloads (Overall): 323
Full text available:
PDF
Traditionally, hardware designs partitioned across multiple FPGAs have had low performance due to the inefficiency of maintaining cycle-by-cycle timing among discrete FPGAs. In this paper, we present a mechanism by which complex designs may be efficiently and automatically partitioned among multiple FPGAs using explicitly programmed latency-insensitive links. We describe the ...
Keywords:
high-level synthesis, switch architecture, FPGA, programming languages, DSP, compiler, design automation
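The abstract above describes partitioning designs across FPGAs over explicitly programmed latency-insensitive links, so that correctness does not depend on cycle-by-cycle timing between chips. Below is a behavioral sketch of such a link: a bounded FIFO with explicit flow control, where the producer stalls when no buffer space is available. Names and parameters are illustrative assumptions.

    from collections import deque

    # Behavioral model of a latency-insensitive channel: data moves only when
    # the receiver has buffer space, so adding cycles of link latency cannot
    # change functional behavior.
    class LIChannel:
        def __init__(self, depth=4):
            self.depth = depth
            self.fifo = deque()

        def can_enq(self):
            return len(self.fifo) < self.depth

        def enq(self, data):
            assert self.can_enq(), "producer must check can_enq() first"
            self.fifo.append(data)

        def can_deq(self):
            return bool(self.fifo)

        def deq(self):
            assert self.can_deq(), "consumer must check can_deq() first"
            return self.fifo.popleft()

    # Producer and consumer make progress independently of link timing.
    ch = LIChannel(depth=2)
    sent, received = 0, []
    while len(received) < 5:
        if sent < 5 and ch.can_enq():
            ch.enq(sent); sent += 1      # producer side
        if ch.can_deq():
            received.append(ch.deq())    # consumer side
    print(received)   # -> [0, 1, 2, 3, 4]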
16
The gradient-based cache partitioning algorithm
William Hasenplaugh,
Pritpal S. Ahuja,
Aamer Jaleel,
Simon Steely Jr.,
Joel Emer
January 2012
ACM Transactions on Architecture and Code Optimization (TACO) - Special Issue on High-Performance Embedded Architectures and Compilers: Volume 8 Issue 4, January 2012
Publisher: ACM
Bibliometrics:
Citation Count: 2
Downloads (6 Weeks): 5, Downloads (12 Months): 41, Downloads (Overall): 496
Full text available:
PDF
This paper addresses the problem of partitioning a cache between multiple concurrent threads and in the presence of hardware prefetching. Cache replacement designed to preserve temporal locality (e.g., LRU) will allocate cache resources proportional to the miss-rate of each competing thread irrespective of whether the cache space will be utilized ...
Keywords:
adaptive caching, chernoff bound, hill climbing, dynamic control, gradient descent, Cache replacement, dynamic cache partitioning, insertion policy
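The abstract and keywords above describe hill climbing / gradient descent over cache partition sizes. A hedged sketch of that control loop follows: repeatedly evaluate how total misses change when one thread's allocation grows at the other's expense, and move capacity in the direction that helps. The miss curves are made-up stand-ins for the runtime measurements the paper's mechanism uses.

    # Hill-climbing cache partitioner between two threads sharing total_ways.
    # misses(thread, ways) is a stand-in for runtime miss measurements.
    def partition(misses, total_ways, steps=32):
        ways_a = total_ways // 2                      # start with an even split
        for _ in range(steps):
            cur = misses("a", ways_a) + misses("b", total_ways - ways_a)
            # Try shifting one way toward A, then toward B; keep the best option.
            best = (cur, ways_a)
            for cand in (ways_a + 1, ways_a - 1):
                if 1 <= cand <= total_ways - 1:
                    m = misses("a", cand) + misses("b", total_ways - cand)
                    best = min(best, (m, cand))
            ways_a = best[1]
        return ways_a, total_ways - ways_a

    # Toy miss curves: thread a benefits from capacity, thread b saturates early.
    curves = {"a": lambda w: 1000 / w, "b": lambda w: 200 / min(w, 4)}
    print(partition(lambda t, w: curves[t](w), total_ways=16))   # -> (12, 4)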
17
PACMan: prefetch-aware cache management for high performance caching
December 2011
MICRO-44: Proceedings of the 44th Annual IEEE/ACM International Symposium on Microarchitecture
Publisher: ACM
Bibliometrics:
Citation Count: 22
Downloads (6 Weeks): 6, Downloads (12 Months): 69, Downloads (Overall): 874
Full text available:
PDF
Hardware prefetching and last-level cache (LLC) management are two independent mechanisms to mitigate the growing latency to memory. However, the interaction between LLC management and hardware prefetching has received very little attention. This paper characterizes the performance of state-of-the-art LLC management policies in the presence and absence of hardware prefetching. ...
Keywords:
prefetch-aware replacement, set dueling, reuse distance prediction, shared cache
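The abstract above characterizes the interaction between hardware prefetching and LLC replacement. A hedged sketch of one prefetch-aware variant of the idea follows: within an RRIP-style scheme, prefetched fills are inserted with a more distant re-reference prediction than demand fills, and prefetch hits do not promote a line, so inaccurate prefetches become eviction candidates quickly. The parameter choices are generic RRIP conventions, not the paper's full set-duelling mechanism.

    # One simple prefetch-aware insertion/promotion policy on top of
    # RRIP-style re-reference prediction values (RRPV).
    MAX_RRPV = 3          # 2-bit RRPV: 0 = near re-reference, 3 = distant

    def insertion_rrpv(is_prefetch):
        # Demand fills expect intermediate reuse; prefetch fills are inserted
        # as distant so useless prefetches are evicted sooner.
        return MAX_RRPV if is_prefetch else MAX_RRPV - 1

    def on_hit(line, is_prefetch_hit):
        # Promote on demand hits only; hits by the prefetcher do not refresh
        # the line's predicted re-reference interval.
        if not is_prefetch_hit:
            line["rrpv"] = 0

    line = {"rrpv": insertion_rrpv(is_prefetch=True)}
    on_hit(line, is_prefetch_hit=True)
    print(line)   # -> {'rrpv': 3}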
18
SHiP: signature-based hit predictor for high performance caching
December 2011
MICRO-44: Proceedings of the 44th Annual IEEE/ACM International Symposium on Microarchitecture
Publisher: ACM
Bibliometrics:
Citation Count: 30
Downloads (6 Weeks): 9, Downloads (12 Months): 195, Downloads (Overall): 969
Full text available:
PDF
The shared last-level caches in CMPs play an important role in improving application performance and reducing off-chip memory bandwidth requirements. In order to use LLCs more efficiently, recent research has shown that changing the re-reference prediction on cache insertions and cache hits can significantly improve cache performance. A fundamental challenge, ...
Keywords:
replacement, reuse distance prediction, shared cache
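The abstract above describes improving re-reference prediction on cache insertions using signatures. A hedged sketch of a signature-based hit predictor follows: a table of saturating counters indexed by a signature of the fill (here, a hash of the requesting PC) is trained on whether lines with that signature get reused, and insertion decisions consult it. The signature choice, table size, and counter width are illustrative assumptions.

    # Sketch of a signature-history counter table (SHCT) for insertion prediction.
    SHCT_SIZE, CTR_MAX = 1024, 7

    shct = [CTR_MAX // 2] * SHCT_SIZE      # saturating counters

    def signature(pc):
        return hash(pc) % SHCT_SIZE

    def train(sig, reused):
        # Called on eviction (reused=False if the line was never hit) and on hits.
        if reused:
            shct[sig] = min(CTR_MAX, shct[sig] + 1)
        else:
            shct[sig] = max(0, shct[sig] - 1)

    def predict_reuse_on_insert(pc):
        # Signatures with a history of reuse get a near re-reference prediction;
        # others are inserted as distant.
        return shct[signature(pc)] > 0

    train(signature(0x400abc), reused=False)
    print(predict_reuse_on_insert(0x400abc))   # counter decremented but still > 0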
19
February 2011
FPGA '11: Proceedings of the 19th ACM/SIGDA international symposium on Field programmable gate arrays
Publisher: ACM
Bibliometrics:
Citation Count: 18
Downloads (6 Weeks): 4, Downloads (12 Months): 96, Downloads (Overall): 577
Full text available:
PDF
Developers accelerating applications on FPGAs or other reconfigurable logic have nothing but raw memory devices in their standard toolkits. Each project typically includes tedious development of single-use memory management. Software developers expect a programming environment to include automatic memory management. Virtual memory provides the illusion of very large arrays and ...
Keywords:
memory management, caches, fpga
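The abstract above contrasts software's automatic memory management with the raw memory devices FPGA developers start from, and mentions virtual memory's illusion of very large arrays. As a behavioral sketch of that illusion only (not the paper's hardware design), the snippet below presents one large array interface fronted by a small direct-mapped cache over a larger backing store; sizes and organization are assumptions.

    # Behavioral sketch of an automatically managed memory: the client sees one
    # large array; a small direct-mapped cache fronts a big backing store.
    class ManagedMemory:
        def __init__(self, size, cache_lines=64):
            self.backing = [0] * size           # "host" or DRAM storage
            self.cache = {}                     # slot -> [address, value]
            self.cache_lines = cache_lines

        def _access(self, addr):
            slot = addr % self.cache_lines      # direct-mapped placement
            tag_ok = slot in self.cache and self.cache[slot][0] == addr
            if not tag_ok:                      # miss: write back, then refill
                if slot in self.cache:
                    old_addr, old_val = self.cache[slot]
                    self.backing[old_addr] = old_val
                self.cache[slot] = [addr, self.backing[addr]]
            return self.cache[slot]

        def read(self, addr):
            return self._access(addr)[1]

        def write(self, addr, value):
            self._access(addr)[1] = value

    mem = ManagedMemory(1 << 20)
    mem.write(123456, 42)
    print(mem.read(123456))   # -> 42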
20
HAsim: FPGA-based high-detail multicore simulation using time-division multiplexing
February 2011
HPCA '11: Proceedings of the 2011 IEEE 17th International Symposium on High Performance Computer Architecture
Publisher: IEEE Computer Society
In this paper we present the HAsim FPGA-accelerated simulator. HAsim is able to model a shared-memory multicore system including detailed core pipelines, cache hierarchy, and on-chip network, using a single FPGA. We describe the scaling techniques that make this possible, including novel uses of time-multiplexing in the core pipeline and ...
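The abstract above highlights time-division multiplexing as the technique that lets one FPGA model a detailed multicore. The sketch below shows only the scheduling skeleton of that idea: a single physical pipeline model is stepped once per virtual core per model cycle, round-robin, with per-core state swapped in. Names and structure are illustrative assumptions, not HAsim's implementation.

    # Skeleton of time-division multiplexing one physical pipeline model across
    # N virtual cores: each model cycle, the single pipeline is stepped once
    # per core with that core's context.
    class MultiplexedPipeline:
        def __init__(self, num_cores):
            self.contexts = [{"core": c, "cycles": 0} for c in range(num_cores)]

        def step_core(self, ctx):
            # Stand-in for one cycle of a detailed core model.
            ctx["cycles"] += 1

        def model_cycle(self):
            for ctx in self.contexts:            # round-robin over virtual cores
                self.step_core(ctx)

    sim = MultiplexedPipeline(num_cores=4)
    for _ in range(10):
        sim.model_cycle()
    print([c["cycles"] for c in sim.contexts])   # -> [10, 10, 10, 10]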