ALBERT

All Library Books, journals and Electronic Records Telegrafenberg

188 results · Articles · IEEE Transactions on Computers (T-C) · 2017 · Topic: Computer Science
  • 1
    Publisher: Institute of Electrical and Electronics Engineers (IEEE)
    Publication Date: 2017-08-23
    Description: In this paper we propose a scheme to perform homomorphic evaluations of arbitrary depth with the assistance of a special module called a recryption box. Existing somewhat homomorphic encryption schemes can only perform homomorphic operations until the noise in the ciphertexts reaches a critical bound that depends on the parameters of the scheme. The classical approach of bootstrapping also allows arbitrary-depth evaluations, but has a detrimental impact on the size of the parameters, making the whole setup inefficient. We describe two different instantiations of our recryption box for assisting homomorphic evaluations of arbitrary depth. The recryption box refreshes ciphertexts by lowering the inherent noise and can be used with any instantiation of the parameters, i.e., unlike bootstrapping, it imposes no minimum parameter size. To demonstrate the practicality of the proposal, we design the recryption box on a Xilinx Virtex 6 FPGA board ML605 to support the FV somewhat homomorphic encryption scheme. The recryption box requires 0.43 ms to refresh one ciphertext. Further, we use the recryption box to boost the performance of an encrypted search operation. On a 40-core Intel server, we can perform encrypted search in a table of $2^{16}$ entries in around 20 seconds, roughly 20 times faster than the implementation without the recryption box.
    Print ISSN: 0018-9340
    Electronic ISSN: 1557-9956
    Topics: Computer Science
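The noise-growth argument above can be illustrated with a toy model. The bound, base noise, and multiplicative growth rule below are illustrative assumptions, not FV's actual noise analysis:

```python
# Toy model of ciphertext noise growth in a somewhat homomorphic
# scheme; constants and growth rule are illustrative only.

NOISE_BOUND = 2 ** 20   # decryption fails once noise exceeds this
BASE_NOISE = 2 ** 4     # noise level of a fresh ciphertext

def recrypt(noise):
    """The recryption box refreshes a ciphertext back to base noise."""
    return BASE_NOISE

def max_depth(refresh_every=None, limit=100):
    """Multiplications performed before the noise bound is hit;
    each multiplication by a fresh ciphertext multiplies the noise."""
    noise, depth = BASE_NOISE, 0
    while depth < limit:
        noise = noise * BASE_NOISE          # homomorphic multiply
        if noise > NOISE_BOUND:
            break                           # ciphertext now undecryptable
        depth += 1
        if refresh_every and depth % refresh_every == 0:
            noise = recrypt(noise)          # periodic refresh
    return depth
```

Without refreshing, the evaluable depth is capped by the parameters; with the recryption box invoked periodically, evaluation continues up to any desired depth.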
  • 2
    Publisher: Institute of Electrical and Electronics Engineers (IEEE)
    Publication Date: 2017-09-09
    Description: Non-Uniform Cache Architecture (NUCA) is a viable solution to mitigate the problem of large on-chip wire delay due to the rapid increase in the cache capacity of chip multiprocessors (CMPs). By partitioning the last-level cache (LLC) into smaller banks connected by an on-chip network, the access latency exhibits a non-uniform distribution. Various works have explored the NUCA design space, including block migration, block replication and block searching. However, all of the previous mechanisms designed for NUCA are thread-oblivious when multi-threaded applications are deployed on CMP systems. Due to interference on shared resources, threads often demonstrate unbalanced progress, wherein the lagging threads with slow progress are more critical to overall performance. In this paper, we propose a novel NUCA design called thread Criticality Assisted Replication and Migration (CARM). CARM exploits runtime thread criticality information as hints to adjust block replication and migration in NUCA. Specifically, CARM aims at boosting parallel application execution by prioritizing block replication and migration for critical threads. Full-system experimental results show that CARM reduces the execution time of a set of PARSEC workloads by 13.7 and 6.8 percent on average compared with the traditional D-NUCA and Re-NUCA, respectively. Moreover, CARM also consumes much less energy than the evaluated schemes.
    Print ISSN: 0018-9340
    Electronic ISSN: 1557-9956
    Topics: Computer Science
  • 3
    Publisher: Institute of Electrical and Electronics Engineers (IEEE)
    Publication Date: 2017-09-09
    Description: Lightweight stream ciphers have received serious attention in the last few years. The present design paradigm considers a very small state (less than twice the key size) and use of the secret key bits during pseudo-random stream generation. One such effort, Sprout, was proposed two years ago and was broken almost immediately. After careful study of these attacks, a modified version named Plantlet has been designed very recently. While the designers of Plantlet do not provide any analysis of fault attacks, we note that Plantlet is even weaker than Sprout in terms of Differential Fault Attack (DFA). Our investigation, following ideas similar to those used in the analysis against Sprout, shows that only around 4 faults are required to break Plantlet by DFA in a few hours' time. While a fault attack is indeed difficult to implement and our result does not reveal any weakness of the cipher in normal mode, we believe that these initial results will be useful for further understanding of Plantlet.
    Print ISSN: 0018-9340
    Electronic ISSN: 1557-9956
    Topics: Computer Science
  • 4
    Publisher: Institute of Electrical and Electronics Engineers (IEEE)
    Publication Date: 2017-09-09
    Description: In the paradigm of stochastic computing, arithmetic functions are computed on randomized bit streams. The method naturally and effectively tolerates very high clock skew. Exploiting this advantage, this paper introduces polysynchronous clocking, a design strategy in which clock domains are split at a very fine level. Each domain is synchronized by an inexpensive local clock. Alternatively, the skew requirements for a global clock distribution network can be relaxed. This allows for a higher working frequency and thus lower latency. The benefits of both approaches are quantified. Polysynchronous clocking results in significant latency, area, and energy savings for a wide variety of applications.
    Print ISSN: 0018-9340
    Electronic ISSN: 1557-9956
    Topics: Computer Science
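The core idea behind skew tolerance can be sketched in a few lines: a value in [0, 1] is encoded as the probability that a stream bit is 1, and a single AND gate multiplies two independent streams. Because only the bit *rates* carry information, shifting one stream by a cycle barely changes the result. A toy sketch (stream length and seed are arbitrary choices):

```python
import random

def to_stream(p, n, rng):
    """Encode a value p in [0, 1] as a random bitstream of length n."""
    return [1 if rng.random() < p else 0 for _ in range(n)]

def stochastic_multiply(p, q, n=100_000, seed=0):
    """Multiply p and q by ANDing their independent bitstreams."""
    rng = random.Random(seed)
    a = to_stream(p, n, rng)
    b = to_stream(q, n, rng)
    return sum(x & y for x, y in zip(a, b)) / n

def skewed_multiply(p, q, n=100_000, seed=0):
    """Same computation with one stream rotated by a cycle,
    mimicking clock skew between the two inputs."""
    rng = random.Random(seed)
    a = to_stream(p, n, rng)
    b = to_stream(q, n, rng)
    b = b[1:] + b[:1]           # one-cycle skew
    return sum(x & y for x, y in zip(a, b)) / n
```

Both estimates converge to p·q; the skewed variant is statistically indistinguishable from the aligned one, which is the property polysynchronous clocking exploits.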
  • 5
    Publisher: Institute of Electrical and Electronics Engineers (IEEE)
    Publication Date: 2017-09-09
    Description: In today's internet, billions of computer systems are connected to one another in a global network. The internet provides an unsecured channel over which hundreds of terabytes of data are transmitted daily. Computer and software systems rely on encryption algorithms such as block ciphers to ensure that sensitive data remains confidential and secure. However, adversaries can leverage the statistical behavior of the underlying ciphers to recover encryption keys. Accurate evaluation of the security margins of these encryption algorithms remains a major challenge. In this paper, we tackle this issue by introducing several searching strategies based on differential cryptanalysis. By clustering differential paths, the searching algorithm derives more accurate distinguishers than examining individual paths, which in turn provides a more accurate estimation of cipher security margins. We verify the effectiveness of this technique on ciphers with generalized Feistel and SPN structures, whereby the best distinguishers for each of the investigated ciphers were obtained by discovering clusters with thousands of paths. With the KATAN block cipher family as a test case, we also show how to apply the searching algorithm alongside other cryptanalysis techniques such as the boomerang attack and the related-key model to obtain the best cryptanalytic results. This also illustrates the flexibility of the proposed searching scheme, which can be tailored to improve upon other differential attack variants. In short, the proposed searching strategy realizes an automated security evaluation tool with higher accuracy than previous techniques. In addition, it is applicable to a wide range of encryption schemes, which makes it a flexible tool for both academic research and industrial purposes.
    Print ISSN: 0018-9340
    Electronic ISSN: 1557-9956
    Topics: Computer Science
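The effect of clustering differential paths can be seen on a toy example: over two rounds of a miniature SPN (the 4-bit PRESENT S-box with a rotation as the linear layer, a construction chosen here purely for illustration), the probability of a differential is the sum over all intermediate differences, which is at least as large as any single path suggests:

```python
# Toy two-round SPN for illustrating differential-path clustering.
SBOX = [0xC, 0x5, 0x6, 0xB, 0x9, 0x0, 0xA, 0xD,
        0x3, 0xE, 0xF, 0x8, 0x4, 0x7, 0x1, 0x2]

def ddt():
    """Difference distribution table: ddt[din][dout] = pair count."""
    t = [[0] * 16 for _ in range(16)]
    for x in range(16):
        for din in range(16):
            t[din][SBOX[x] ^ SBOX[x ^ din]] += 1
    return t

def rot(d):
    """Linear layer: rotate a 4-bit difference left by one."""
    return ((d << 1) | (d >> 3)) & 0xF

def two_round_prob(din, dout, cluster=True):
    """Probability of the differential din -> dout over two rounds.
    cluster=True sums over every intermediate difference (a cluster
    of paths); cluster=False keeps only the single best path."""
    t = ddt()
    paths = [(t[din][m] / 16) * (t[rot(m)][dout] / 16)
             for m in range(16)]
    return sum(paths) if cluster else max(paths)
```

Summing a cluster instead of tracking one path is exactly what yields the tighter distinguisher probabilities the abstract describes.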
  • 6
    Publication Date: 2017-09-09
    Description: A time instant is said to be a critical instant for a task, if the task’s arrival at the instant makes the duration between the task’s arrival and completion the longest. Critical instants for a task, once revealed, make it possible to check the task’s schedulability by investigating situations associated with the critical instants. This potentially results in efficient and tight schedulability tests, which is important in real-time systems. For example, existing studies have discovered critical instants under preemptive fixed-priority scheduling (P-FP), which limit interference from carry-in jobs, yielding the state-of-the-art schedulability tests on both uniprocessor and multiprocessor platforms. However, studies on schedulability tests associated with critical instants have not matured yet for non-preemptive scheduling, especially on a multiprocessor platform. In this paper, we find necessary conditions for critical instants for non-preemptive global fixed-priority scheduling (NP-FP) on a multiprocessor platform, and develop a new schedulability test that takes advantage of the finding for reducing carry-in jobs’ interference. Evaluation results show that the proposed schedulability test finds up to 14.3 percent additional task sets schedulable by NP-FP, which are not deemed schedulable by the state-of-the-art NP-FP schedulability test.
    Print ISSN: 0018-9340
    Electronic ISSN: 1557-9956
    Topics: Computer Science
    Location Call Number Expected Availability
    BibTip Others were also interested in ...
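For the simpler uniprocessor preemptive fixed-priority case the abstract alludes to, the critical-instant argument yields the classic iterative response-time test. This sketch is that textbook test, not the paper's multiprocessor NP-FP analysis:

```python
from math import ceil

def response_time(task, higher, deadline):
    """Iterative response-time analysis under preemptive fixed-priority
    scheduling on a uniprocessor: release the task at a critical
    instant (together with all higher-priority tasks) and iterate to
    a fixed point. task and each entry of higher are (C, T) pairs."""
    c, _ = task
    r = c
    while r <= deadline:
        nxt = c + sum(ceil(r / t) * cj for cj, t in higher)
        if nxt == r:
            return r        # fixed point: worst-case response time
        r = nxt
    return None             # deadline exceeded: not schedulable
```

For example, a task with C = 3, T = 12 under higher-priority tasks (1, 4) and (2, 6) converges to a worst-case response time of 10.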
  • 7
    Publisher: Institute of Electrical and Electronics Engineers (IEEE)
    Publication Date: 2017-09-09
    Description: The design of high-performance adders has experienced renewed interest in the last few years; among high-performance schemes, parallel prefix adders constitute an important class. They require a logarithmic number of stages and are typically realized using AND-OR logic; moreover, with the emergence of new device technologies based on majority logic, new and improved adder designs are possible. However, the best existing majority gate-based prefix adder incurs a delay of $2\log_2(n) - 1$ (due to the $n$th carry); this is only marginally better than a design using only AND-OR gates (the latter design has a $2\log_2(n) + 1$ gate delay). This paper initially shows that this delay is caused by the output carry equation in majority gate-based adders, which is still largely defined in terms of AND-OR gates. In this paper, two new majority gate-based recursive techniques are proposed. The first technique relies on a novel formulation of the majority gate-based equations in the group generate and group propagate hardware; this results in a new definition for the output carry, thus reducing the delay. The second contribution utilizes recursive properties of majority gates (through a novel operator) to reduce the circuit complexity of prefix adder designs. Overall, the proposed techniques result in the calculation of the output carry of an $n$-bit adder with a majority gate delay of only $\log_2(n) + 1$. This leads to a reduction of 40 percent in delay and 30 percent in circuit complexity (in terms of the number of majority gates) for multi-bit addition in comparison with the best existing designs in the technical literature.
    Print ISSN: 0018-9340
    Electronic ISSN: 1557-9956
    Topics: Computer Science
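The starting point for majority-gate adders, that the output carry of a full adder is exactly one majority gate, can be checked exhaustively. This is a plain software sketch; the sum bit is written as XOR here rather than decomposed into majority gates:

```python
def maj(a, b, c):
    """Three-input majority gate: 1 iff at least two inputs are 1."""
    return (a & b) | (b & c) | (a & c)

def full_adder(a, b, cin):
    cout = maj(a, b, cin)   # the carry is exactly one majority gate
    s = a ^ b ^ cin         # sum bit, written directly as XOR here
    return s, cout

# Exhaustive check against integer addition.
for a in (0, 1):
    for b in (0, 1):
        for cin in (0, 1):
            s, cout = full_adder(a, b, cin)
            assert 2 * cout + s == a + b + cin
```

The paper's contribution goes further, reformulating the group generate/propagate recursion itself in majority gates; this sketch only demonstrates the basic identity that makes that possible.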
  • 8
    Publisher: Institute of Electrical and Electronics Engineers (IEEE)
    Publication Date: 2017-08-12
    Description: NAND flash memory is the major storage medium for both mobile storage cards and enterprise Solid-State Drives (SSDs). Log-block-based Flash Translation Layer (FTL) schemes have been widely used to manage NAND flash memory storage systems in industry. In log-block-based FTLs, a few physical blocks called log blocks are used to hold all page updates from a large number of data blocks. Frequent page updates in log blocks introduce significant overhead, so log blocks become the system bottleneck. To address this problem, this paper presents BLog, a block-level log-block management scheme for Multi-Level Cell (MLC) NAND flash memory storage systems. In BLog, with block-level management, the update pages of a data block can be collected together and put into the same log block as much as possible; therefore, we can effectively reduce the associativities of log blocks so as to reduce the garbage collection overhead. We also propose a novel partial merge operation strategy called reduced-order merge, by which we can effectively postpone the garbage collection of log blocks so as to maximally utilize valid pages and reduce unnecessary erase operations in log blocks. Based on BLog, we design an FTL called BLogFTL for MLC NAND flash. We conduct a set of experiments on a real hardware platform, in which both representative FTL schemes and the proposed BLogFTL have been implemented. The experimental results show that our scheme can effectively reduce the garbage collection operations and the system response time compared to previous log-block-based FTLs for MLC NAND flash.
    Print ISSN: 0018-9340
    Electronic ISSN: 1557-9956
    Topics: Computer Science
  • 9
    Publisher: Institute of Electrical and Electronics Engineers (IEEE)
    Publication Date: 2017-08-12
    Description: The delay upper-bound analysis problem is of fundamental importance to real-time applications in Networks-on-Chip (NoCs). In this paper, we revisit two state-of-the-art analysis models for real-time communication in wormhole NoCs with priority-based preemptive arbitration and show that these models only support specific router architectures with large buffer sizes. We then propose an extended analysis model that estimates delay upper-bounds for all router architectures and buffer sizes by identifying and analyzing the differences between upstream and downstream indirect interferences according to the relative positions of traffic flows, and by taking the buffer influence into consideration. Simulated evaluations show that our model supports one more router architecture and applies to small buffer sizes compared to the previous models.
    Print ISSN: 0018-9340
    Electronic ISSN: 1557-9956
    Topics: Computer Science
  • 10
    Publisher: Institute of Electrical and Electronics Engineers (IEEE)
    Publication Date: 2017-08-12
    Description: This paper focuses on parallel hash functions based on tree modes of operation for an inner Variable-Input-Length function. This inner function can be either a single-block-length (SBL), prefix-free MD hash function or a sponge-based hash function. We discuss the various forms of optimality that can be obtained when designing parallel hash functions based on trees where all leaves have the same depth. The first result is a scheme which optimizes the tree topology in order to decrease the running time. Then, without affecting the optimal running time, we show that the corresponding tree topology can be slightly changed so as to also minimize the number of required processors. Consequently, the resulting scheme minimizes first the running time and second the number of required processors.
    Print ISSN: 0018-9340
    Electronic ISSN: 1557-9956
    Topics: Computer Science
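A minimal sketch of such a tree mode with all leaves at the same depth, using SHA-256 as a stand-in for the inner function. The block size, arity, and the 0x00/0x01 domain-separation prefixes are illustrative choices, not the paper's parameters:

```python
import hashlib

def leaf_hash(block):
    """Hash one data block; the 0x00 prefix domain-separates leaves."""
    return hashlib.sha256(b"\x00" + block).digest()

def node_hash(children):
    """Hash the concatenation of child digests (0x01 = inner node)."""
    return hashlib.sha256(b"\x01" + b"".join(children)).digest()

def tree_hash(data, leaf_size=4, arity=2):
    """Tree mode with all leaves at the same depth: split the input
    into fixed-size blocks, hash every leaf (in parallel, in a real
    implementation), then combine level by level up to the root."""
    level = [leaf_hash(data[i:i + leaf_size])
             for i in range(0, max(len(data), 1), leaf_size)]
    while len(level) > 1:
        level = [node_hash(level[i:i + arity])
                 for i in range(0, len(level), arity)]
    return level[0]
```

All leaves can be hashed concurrently; the running-time and processor-count trade-offs the paper optimizes come from choosing the block size, arity, and tree shape.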
  • 11
    Publisher: Institute of Electrical and Electronics Engineers (IEEE)
    Publication Date: 2017-08-12
    Description: Aggressive technology scaling has enabled the fabrication of many-core architectures while triggering challenges such as limited power budget and increased reliability issues, like aging phenomena. Dynamic power management and runtime mapping strategies can be utilized in such systems to achieve optimal performance while satisfying power constraints. However, lifetime reliability is generally neglected. We propose a novel lifetime reliability/performance-aware resource co-management approach for many-core architectures in the dark silicon era. The approach is based on a two-layered architecture, composed of a long-term runtime reliability controller and a short-term runtime mapping and resource management unit. The former evaluates the cores’ aging status w.r.t. a target reference specified by the designer, and performs recovery actions on highly stressed cores by means of power capping. The aging status is utilized in runtime application mapping to maximize system performance while fulfilling reliability requirements and honoring the power budget. Experimental evaluation demonstrates the effectiveness of the proposed strategy, which outperforms most recent state-of-the-art contributions.
    Print ISSN: 0018-9340
    Electronic ISSN: 1557-9956
    Topics: Computer Science
  • 12
    Publisher: Institute of Electrical and Electronics Engineers (IEEE)
    Publication Date: 2017-08-12
    Description: Ballooning is a popular solution for dynamic memory balancing. However, existing solutions may perform poorly in the presence of heavy guest swapping. Furthermore, when the host has sufficient free memory, guest virtual machines (VMs) under memory pressure are not able to use it in a timely fashion. Even after the guest VM has been recharged with sufficient memory via ballooning, the applications running on the VM are unable to utilize the free memory in the guest VM to quickly recover from severe performance degradation. To address these problems, we present MemFlex, a shared memory swapper for improving guest swapping performance in virtualized environments, with three novel features: (1) MemFlex effectively utilizes host idle memory by redirecting VM swapping traffic to the host-guest shared memory area. (2) MemFlex provides a hybrid memory swapping model, which treats a fast but small shared-memory swap partition as the primary swap area whenever possible, and smoothly transitions to conventional disk-based VM swapping on demand. (3) Once ballooning has supplied sufficient VM memory, MemFlex provides a fast swap-in optimization, which enables the VM to proactively swap in pages from shared memory using an efficient batch implementation. Instead of relying on costly page faults, this optimization offers just-in-time performance recovery by enabling memory-intensive applications to quickly regain their runtime momentum. Performance evaluation results demonstrate the effectiveness of MemFlex compared with existing swapping approaches.
    Print ISSN: 0018-9340
    Electronic ISSN: 1557-9956
    Topics: Computer Science
  • 13
    Publisher: Institute of Electrical and Electronics Engineers (IEEE)
    Publication Date: 2017-08-12
    Description: The Unix shell dgsh provides an expressive way to construct sophisticated and efficient non-linear pipelines. Such pipelines can use standard Unix tools, as well as third-party and custom-built components. Dgsh allows the specification of pipelines that perform non-uniform non-linear processing. These form a directed acyclic process graph, which is typically executed by multiple processor cores, thus increasing the processing task’s throughput. A number of existing Unix tools have been adapted to take advantage of the new shell’s multiple pipe input/output capabilities. The shell supports visualization of the process graphs, which can also aid debugging. Dgsh was evaluated through a number of common data processing and domain-specific examples, and was found to offer an expressive way to specify processing topologies, while also generally increasing processing throughput.
    Print ISSN: 0018-9340
    Electronic ISSN: 1557-9956
    Topics: Computer Science
  • 14
    Publisher: Institute of Electrical and Electronics Engineers (IEEE)
    Publication Date: 2017-08-12
    Description: The size of the write unit in PCM, namely the number of bits allowed to be written concurrently at one time, is restricted due to high write energy consumption. Serving a cache line typically needs several serially executed write units when using PCM as the main memory, which results in long write latency and high energy consumption. To address this poor write performance, we propose a novel PCM write scheme called Min-WU (Minimize the number of Write Units). We observe data access locality in which a few frequent zero-extended values dominate the write data patterns of typical multi-threaded applications (more than 40 and 44.9 percent of all memory accesses in PARSEC workloads and SPEC 2006 benchmarks, respectively). By leveraging a carefully designed chip-level data redistribution method, the data amount is balanced and the data pattern is the same among all PCM chips. The key idea behind Min-WU is to minimize the number of serially executed write units in a cache line service after data redistribution, through sFPC (simplified Frequent Pattern Compression), eRW (an efficient Reordering Write operations method) and fWP (fine-tuned Write Parallelism circuits). Using Min-WU, the zero parts of write units can be indicated with predefined prefixes, and the residues can be reordered and written simultaneously under power constraints. Our design can improve the performance, energy consumption and endurance of PCM-based main memory with low space and time overhead. Experimental results on 12 multi-threaded PARSEC 2.0 workloads show that Min-WU reduces read latency by 44 percent, write latency by 28 percent, running time by 32.5 percent and energy by 48 percent, while achieving a 32 percent IPC improvement over the conventional write scheme, at a cost of few extra memory cycles and less than 3 percent storage space overhead. Evaluation results on 8 SPEC 2006 benchmarks demonstrate that Min-WU achieves a 57.8/46.0 percent read/write latency reduction, a 28.7 percent IPC improvement, a 28 percent running time reduction and a 62.1 percent energy reduction compared with the baseline under realistic memory hierarchy configurations.
    Print ISSN: 0018-9340
    Electronic ISSN: 1557-9956
    Topics: Computer Science
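The saving from marking zero-extended values can be sketched as follows. The write-unit width, word size, and example cache line below are assumptions for illustration, and the pattern-prefix bits are ignored for simplicity (real sFPC encodes a small prefix alongside the residue):

```python
# Illustrative write-unit accounting for zero-extended values;
# constants and the example line are assumptions, not the paper's.

WRITE_UNIT_BITS = 16    # bits a PCM chip writes at once (assumed)
WORD_BITS = 64          # memory word size (assumed)

def write_units_needed(word, compress=True):
    """Write units needed to store one word, with or without
    marking its all-zero upper part via a pattern prefix."""
    if not compress:
        return WORD_BITS // WRITE_UNIT_BITS
    significant = max(word.bit_length(), 1)     # non-zero residue bits
    return -(-significant // WRITE_UNIT_BITS)   # ceiling division

line = [0x2A, 0xFFFF, 0x12345678, 2 ** 63]      # one example cache line
plain = sum(write_units_needed(w, compress=False) for w in line)
packed = sum(write_units_needed(w) for w in line)
```

On this example line the serial write-unit count drops by half, which is the kind of latency and energy saving Min-WU targets; fully zero-extended words like 0x2A need only a single unit.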
  • 15
    Publisher: Institute of Electrical and Electronics Engineers (IEEE)
    Publication Date: 2017-08-12
    Description: In the post-silicon debug of multicore designs, the debug time has increased significantly because the number of cores undergoing debug has grown; however, the resources available to debug the design are limited. This paper proposes a new DRAM-based error detection method to overcome this challenge. The proposed method requires only three debug sessions even if multiple cores are present. The first debug session detects the error intervals of each core using golden signatures. The second session detects the error clock cycles in each core using a golden data stream. Instead of storing all of the golden data, the golden data stream is generated by selecting error-free debug data for each interval, which are guaranteed by the first session. Finally, the error data in all cores are captured only during the third session. The experimental results on various debug cases show significant reductions in total debug time and in the amount of DRAM usage compared to previous methods.
    Print ISSN: 0018-9340
    Electronic ISSN: 1557-9956
    Topics: Computer Science
  • 16
    Publisher: Institute of Electrical and Electronics Engineers (IEEE)
    Publication Date: 2017-08-12
    Description: Attribute-based encryption (ABE) has become a popular research topic in cryptography over the past few years. It can be used in various circumstances, as it provides a flexible way to conduct fine-grained data access control. Despite its great advantages in data access control, current ABE-based access control systems cannot satisfy the requirement well when the system judges access behavior by attribute comparison, such as “greater than $x$” or “less than $x$”; we call these comparable attributes in this paper. In this paper, based on a set of well-designed sub-attributes representing each comparable attribute, we construct a comparable attribute-based encryption scheme (CABE for short) to address the aforementioned problem. The novelty lies in a more efficient construction based on the generation and management of the sub-attributes with the notion of 0-encoding and 1-encoding. Extensive analysis shows that, compared with existing schemes, our scheme drastically decreases the storage, communication and computation overheads, and thus is more efficient in dealing with applications with comparable attributes.
    Print ISSN: 0018-9340
    Electronic ISSN: 1557-9956
    Topics: Computer Science
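The 0-encoding/1-encoding idea the construction builds on (due to Lin and Tzeng) reduces a greater-than comparison to a set-intersection test, which is what lets comparable attributes be expressed through sub-attributes. A plain-Python sketch of the encodings themselves:

```python
def one_encoding(x, n):
    """1-encoding of x: every prefix of x's n-bit string ending in 1."""
    bits = format(x, f"0{n}b")
    return {bits[:i + 1] for i in range(n) if bits[i] == "1"}

def zero_encoding(y, n):
    """0-encoding of y: each prefix up to a 0 bit, with that bit
    flipped to 1."""
    bits = format(y, f"0{n}b")
    return {bits[:i] + "1" for i in range(n) if bits[i] == "0"}

def greater_than(x, y, n=8):
    """Lin-Tzeng comparison: x > y iff the two encodings intersect."""
    return bool(one_encoding(x, n) & zero_encoding(y, n))
```

In CABE-style schemes each encoding element becomes a sub-attribute, so a "greater than x" policy is satisfied exactly when the user's and the policy's sub-attribute sets share an element.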
  • 17
    Publisher: Institute of Electrical and Electronics Engineers (IEEE)
    Publication Date: 2017-08-12
    Description: In this paper, the joint optimization problem of energy efficiency and effective resource utilization is investigated for heterogeneous and distributed multi-core embedded systems. The system model is a fully heterogeneous one: all nodes have different maximum speeds and power consumption levels from the hardware perspective, while they can employ different scheduling strategies from the application perspective. Since the concerned problem is by nature a multi-constrained, multi-variable optimization problem for which a closed-form solution cannot be obtained, our aim is to propose a power allocation and load balancing strategy based on Lagrange theory. Furthermore, when the problem cannot be fully solved by the Lagrange approach, a data fitting method is employed to obtain core speeds first, and then the load balancing schedule is solved by the Lagrange method. Several numerical examples are given to show the effectiveness of the proposed method and to demonstrate the impact of each factor on the optimization system. Finally, simulation and practical evaluations show that the theoretical results are consistent with the practical results. To the best of our knowledge, this is the first work that combines load balancing, energy efficiency, hardware heterogeneity and application heterogeneity in heterogeneous and distributed embedded systems.
    Print ISSN: 0018-9340
    Electronic ISSN: 1557-9956
    Topics: Computer Science
  • 18
    Publisher: Institute of Electrical and Electronics Engineers (IEEE)
    Publication Date: 2017-08-12
    Description: Extreme multi-threading and fast thread switching in modern GPGPUs require a large, power-hungry register file (RF), which quickly becomes one of the major obstacles to scaling up energy-efficient GPGPU computing. In this work, we propose a power-efficient GPGPU RF built on the newly emerged racetrack memory. Racetrack memory has a small cell area, low dynamic power, and nonvolatility. Its unique access mechanism, however, results in a long and location-dependent access latency, which offsets the energy-saving benefit it introduces and may harm performance. To conquer the adverse impacts of racetrack memory based RF designs, we first propose a register mapping scheme to reduce the average access latency. Based on the register mapping, we develop a racetrack memory aware warp scheduling (RMWS) algorithm to further suppress the access latency. The RMWS design includes a new write buffer structure that improves scheduling efficiency as well as energy saving. We also investigate and optimize the design where multiple concurrent RMWS schedulers are employed. Experimental results show that our proposed techniques keep GPGPU performance similar to the baseline with an SRAM-based RF, while RF energy is significantly reduced by 48.5 percent.
    Print ISSN: 0018-9340
    Electronic ISSN: 1557-9956
    Topics: Computer Science
  • 19
    Publisher: Institute of Electrical and Electronics Engineers (IEEE)
    Publication Date: 2017-08-12
    Description: We present a new type of bit-parallel non-recursive Karatsuba multiplier over $GF(2^m)$ generated by an arbitrary irreducible trinomial. This design effectively exploits the Mastrovito approach and the shifted polynomial basis (SPB) to reduce the time complexity, and the Karatsuba algorithm to reduce the space complexity. We show that this type of multiplier is only one $T_X$ slower than the fastest bit-parallel multiplier for all trinomials, where $T_X$ is the delay of one 2-input XOR gate. Meanwhile, its space complexity is roughly 3/4 of those multipliers'. To the best of our knowledge, our scheme is the first to reach this time delay bound. This result outperforms previously proposed non-recursive Karatsuba multipliers.
    Print ISSN: 0018-9340
    Electronic ISSN: 1557-9956
    Topics: Computer Science
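The Karatsuba step over $GF(2)$ is simpler than over the integers because addition is XOR and there are no carries: three half-size multiplications are recombined with shifts and XORs. A software sketch of the recursion (the reduction modulo the irreducible trinomial, and the Mastrovito/SPB machinery the paper optimizes, are omitted):

```python
def clmul(a, b):
    """Schoolbook carry-less multiplication in GF(2)[x]."""
    r = 0
    while b:
        if b & 1:
            r ^= a
        a <<= 1
        b >>= 1
    return r

def karatsuba_gf2(a, b, n=64):
    """Karatsuba over GF(2)[x]: three half-size multiplications
    recombined with XORs and shifts (no carries, so no subtraction
    terms are needed). Operands must fit in n bits."""
    if n <= 8:
        return clmul(a, b)
    h = n // 2
    mask = (1 << h) - 1
    a0, a1 = a & mask, a >> h
    b0, b1 = b & mask, b >> h
    lo = karatsuba_gf2(a0, b0, h)
    hi = karatsuba_gf2(a1, b1, h)
    mid = karatsuba_gf2(a0 ^ a1, b0 ^ b1, h) ^ lo ^ hi
    return (hi << (2 * h)) ^ (mid << h) ^ lo
```

The space saving comes from replacing four half-size multiplications with three; in the bit-parallel hardware setting this is what yields the roughly 3/4 gate-count ratio the abstract cites.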
  • 20
    Publisher: Institute of Electrical and Electronics Engineers (IEEE)
    Publication Date: 2017-08-12
    Description: The growing demand for video content is reshaping our view of the current Internet, and mandating a fundamental change for future Internet paradigms. A current focus on Information-Centric Networks (ICN) promises a novel approach to intrinsically handling large content dissemination, caching and retrieval. While ubiquitous in-network caching in ICNs can expedite video delivery, a pressing challenge lies in provisioning scalable video streaming over adaptive requests for different bit rates. In this paper, we propose novel video caching schemes in ICN to address variable bit rates and content sizes for best cache utilization. Our objective is to maximize overall throughput to improve the Quality of Service (QoS). To achieve this goal, we model the dynamic characteristics of rate adaptation, derive caps on average delay, and propose DaCPlace, which optimizes cache placement decisions. Building on DaCPlace, we further present a heuristic scheme, StreamCache, for low-overhead adaptive video caching. We conduct comprehensive simulations on NS-3 (specifically under the ndnSIM module). Results demonstrate that DaCPlace enables users to achieve the least delay per bit, and that StreamCache outperforms existing schemes, achieving performance close to that of DaCPlace.
    Print ISSN: 0018-9340
    Electronic ISSN: 1557-9956
    Topics: Computer Science
  • 21
    Institute of Electrical and Electronics Engineers (IEEE)
    Publication Date: 2017-09-09
    Description: In spite of escalating thermal challenges imposed by high power consumption, most reported 3D Network-on-Chip (NoC) systems that adopt the classic 3D cube (mesh) topology are unable to tackle thermal management issues directly at the architectural level. Rather, to avoid the chip overheating, tasks running on a “hot” node have to be migrated to a “cooler” one, resulting in increased distance between communicating nodes and ultimately poor performance. In this paper, we propose a new 3D NoC architecture that genuinely supports runtime thermal-aware task management. Dubbed Hierarchical Ring Cluster (HRC), this new hierarchical 3D NoC architecture has three levels across its entire network hierarchy: 1) nodes are grouped into rings, 2) rings are then grouped into cubes, and 3) multiple cubes are connected to form the whole network. Routing in an HRC system is also performed hierarchically: paths are set up within rings using low-latency circuit switching, and data that need to cross rings or cubes are routed following dimension-order routing supported by wormhole switching. In this organization, “hot” tasks that need to migrate can move along the rings without incurring increased communication distances. Our experimental results confirm that the proposed HRC architecture has a much lower network latency than other known 3D NoC architectures. When working with runtime thermal-aware task migration approaches, HRC can help reduce latency by as much as 80 percent compared to thermal-aware task migration approaches applied to 3D mesh NoC topologies.
    Print ISSN: 0018-9340
    Electronic ISSN: 1557-9956
    Topics: Computer Science
  • 22
    Institute of Electrical and Electronics Engineers (IEEE)
    Publication Date: 2017-09-09
    Description: Energy consumption is one of the prominent design constraints of multi-core embedded systems. Since the memory subsystem is responsible for a considerable portion of the energy consumption of embedded systems, Non-Volatile Memories (NVMs) have been proposed as candidates for replacing conventional memories such as SRAM and DRAM. The advantages of NVMs compared to conventional memories are that they consume less leakage power and provide higher density. However, these memories suffer from the increased overhead of write operations and limited lifetime. In order to address these issues, researchers have proposed NVM-aware memory management techniques that consider the characteristics of the memories of the system when deciding on the placement of the application data. In systems equipped with a memory management unit (MMU), the application data is partitioned into pages during the compile phase and the data is managed at page level during the runtime phase. In this paper we present an NVM-aware data partitioning and mapping technique for multi-core embedded systems equipped with an MMU that specifies the placement of the application data based on the access pattern of the data and the characteristics of the memories. The experimental results show that the proposed technique reduces the energy consumption of the system by 28.10 percent on average.
    Print ISSN: 0018-9340
    Electronic ISSN: 1557-9956
    Topics: Computer Science
  • 23
    Institute of Electrical and Electronics Engineers (IEEE)
    Publication Date: 2017-09-09
    Description: External sorting algorithms are commonly used by data-centric applications to sort quantities of data that are larger than main memory. Many external sorting algorithms have been proposed in state-of-the-art studies to take advantage of SSD performance properties to accelerate the sorting process. In this paper, we demonstrate that, unfortunately, many of those algorithms fail to scale when the dataset size increases under memory pressure. In order to address this issue, we propose a new sorting algorithm named MONTRES. MONTRES relies on an SSD performance model while decreasing the overall number of I/O operations. It does this by reducing the amount of temporary data generated during the sorting process, continuously evicting small values into the final sorted file. MONTRES scales well with growing datasets under memory pressure. We tested MONTRES using several data distributions, different amounts of main-memory workspace, and three SSD models. Results show that MONTRES outperforms state-of-the-art algorithms, as it reduces the sorting execution time of TPC-H datasets by more than 30 percent when the ratio of file size to main-memory size is high.
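    For context, the baseline that MONTRES improves upon is the classical two-phase external merge sort: generate sorted runs that fit in the memory workspace, then k-way merge them. A minimal sketch, with in-memory lists standing in for run files on the SSD (MONTRES additionally streams small keys directly into the final sorted output to cut temporary I/O, which is not modeled here):

```python
import heapq

def external_sort(values, workspace=4):
    """Two-phase external merge sort sketch.

    Phase 1: build sorted runs of at most `workspace` items
    (on a real system each run would be spilled to disk).
    Phase 2: k-way merge of the sorted runs via a min-heap.
    """
    runs, buf = [], []
    for v in values:
        buf.append(v)
        if len(buf) == workspace:
            runs.append(sorted(buf))
            buf = []
    if buf:
        runs.append(sorted(buf))
    return list(heapq.merge(*runs))
```

    The number of runs, and hence the merge fan-in, grows with the ratio of file size to workspace size, which is exactly the regime where the abstract reports the largest gains.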
    Print ISSN: 0018-9340
    Electronic ISSN: 1557-9956
    Topics: Computer Science
  • 24
    Institute of Electrical and Electronics Engineers (IEEE)
    Publication Date: 2017-09-09
    Description: For full-text indexing of massive data, the suffix and LCP (longest common prefix) arrays have been recognized as fundamental data structures, and there are at least two needs in practice for checking their correctness, i.e., program debugging and verifying arrays constructed by probabilistic algorithms. Two probabilistic methods are proposed to check the suffix and LCP arrays of constant or integer alphabets in external memory using a Karp-Rabin fingerprinting technique, where the check is wrong only with a negligible error probability. The first method checks the lexicographical order and the LCP-value of two suffixes by computing and comparing the fingerprints of their LCPs. This method is general in that it can verify any full or sparse suffix/LCP array of any order. The second method uses less space: it first employs the fingerprinting technique to verify a subset of the given suffix and LCP arrays, from which two new suffix and LCP arrays are induced and compared with the given arrays for verification; the induced suffix and LCP arrays can be discarded for constant alphabets to save space.
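    The first method's fingerprint comparison can be sketched as follows: to check that the LCP of two suffixes equals $l$, compare the Karp-Rabin fingerprints of the two length-$l$ prefixes and confirm that a mismatch (or the end of the text) follows. The base and modulus below are illustrative choices, and the paper's external-memory organization is not modeled:

```python
B, M = 1_000_003, (1 << 61) - 1  # illustrative base and Mersenne modulus

def check_lcp(text, i, j, l):
    """Probabilistically verify LCP(text[i:], text[j:]) == l."""
    n = len(text)
    # Precompute prefix hashes h and powers of the base pw.
    h, pw = [0], [1]
    for c in text:
        h.append((h[-1] * B + ord(c)) % M)
        pw.append(pw[-1] * B % M)

    def fingerprint(a, b):
        """Karp-Rabin fingerprint of text[a:b]."""
        return (h[b] - h[a] * pw[b - a]) % M

    # The length-l prefixes must match (with high probability) ...
    if fingerprint(i, i + l) != fingerprint(j, j + l):
        return False
    # ... and the next characters must differ (or one suffix must end).
    if i + l < n and j + l < n and text[i + l] == text[j + l]:
        return False
    return True
```

    A false "match" requires a fingerprint collision modulo $M$, which is the negligible error probability mentioned in the abstract.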
    Print ISSN: 0018-9340
    Electronic ISSN: 1557-9956
    Topics: Computer Science
  • 25
    Institute of Electrical and Electronics Engineers (IEEE)
    Publication Date: 2017-09-09
    Description: Phishing is a major problem on the Web. Despite the significant attention it has received over the years, there has been no definitive solution. While the state-of-the-art solutions have reasonably good performance, they suffer from several drawbacks, including the potential to compromise user privacy, difficulty of detecting phishing websites whose content changes dynamically, and reliance on features that are too dependent on the training data. To address these limitations we present a new approach for detecting phishing webpages in real-time as they are visited by a browser. It relies on modeling inherent phisher limitations stemming from the constraints they face while building a webpage. Consequently, the implementation of our approach, Off-the-Hook, exhibits several notable properties, including high accuracy, brand independence, good language independence, speed of decision, and resilience both to dynamic phish and to evolution in phishing techniques. Off-the-Hook is implemented as a fully client-side browser add-on, which preserves user privacy. In addition, Off-the-Hook identifies the target website that a phishing webpage is attempting to mimic and includes this target in its warning. We evaluated Off-the-Hook in two different user studies. Our results show that users prefer Off-the-Hook warnings to Firefox warnings.
    Print ISSN: 0018-9340
    Electronic ISSN: 1557-9956
    Topics: Computer Science
  • 26
    Institute of Electrical and Electronics Engineers (IEEE)
    Publication Date: 2017-09-09
    Description: Understanding the role of competition and cooperation among the multiple interacting species of microorganisms that constitute the microbiome, and deciphering how they enforce homeostasis or trigger diseases, requires the development of multi-scale computational models capable of capturing both intra-cell processing (i.e., gene-to-protein interactions) and inter-cell interactions. The multi-scale interdependency that governs the interactions from genes to proteins within a cell and from molecular messengers to cells to microbial communities within the environment raises numerous computation and communication challenges. Internal cell processing cannot be simulated without knowledge of the surroundings. Similarly, cell-cell communication cannot be fully abstracted without the state of internal processing and the diffusion effects of molecular messengers. To address the compute- and communication-intensive nature of modeling microbial communities, in this paper we propose a novel reconfigurable NoC-based manycore architecture capable of simulating a large-scale microbial community. The reconfiguration of the NoC topology is achieved through fractal analysis of NoC traffic and use of the on-chip wireless interfaces. More precisely, we analyze the computational and communication workloads and exploit the observed fractal characteristics to propose a mathematical strategy for NoC reconfiguration. Experimental results demonstrate that the proposed NoC architecture achieves 56.6 and 62.8 percent improvements in energy-delay product over conventional wireline mesh and flattened butterfly-based high-radix NoC architectures, respectively.
    Print ISSN: 0018-9340
    Electronic ISSN: 1557-9956
    Topics: Computer Science
  • 27
    Institute of Electrical and Electronics Engineers (IEEE)
    Publication Date: 2017-09-09
    Description: Multiple parallel queues are versatile hardware data structures that are extensively used in modern digital systems. To achieve maximum scalability, the multiple queues are built on top of a dynamically-allocated shared buffer that allocates the buffer space to the various active queues, based on a linked-list organization. This work focuses on dynamically-allocated multiple-queue shared buffers that allow their read and write ports to operate in different clock domains. The proposed dual-clock shared buffer follows a tightly-coupled organization that merges the tasks of signal synchronization across asynchronous clock domains and queueing (buffering) in a common hardware module. When compared to other state-of-the-art dual-clock multiple-queue designs, the new architecture is demonstrated to yield a substantially lower-cost implementation. Specifically, hardware area savings of up to 55 percent are achieved, while still supporting full-throughput operation.
    Print ISSN: 0018-9340
    Electronic ISSN: 1557-9956
    Topics: Computer Science
  • 28
    Institute of Electrical and Electronics Engineers (IEEE)
    Publication Date: 2017-09-09
    Description: Bloom filters have been used to reduce the delay in networking and computing applications when a set membership check is to be applied. Error sources can affect the behavior of Bloom filters, resulting in a wrong outcome of this membership test and a possible effect on the system's output. Single event transients are a type of temporary error altering the operation of combinational logic. A single event transient affecting the hash generation logic of a hardware-implemented Bloom filter can produce errors such as false negatives. This paper presents different approaches to build Bloom filters that are tolerant to single event transients occurring in the hash generation circuitry. They are compared to the use of traditional Modular Redundancy approaches. The results show that the new schemes can significantly reduce the circuit area needed to implement the Bloom filter.
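    As a reminder of the data structure being hardened, a minimal Bloom filter looks like the following; double hashing over SHA-256 stands in for the k independent hash circuits (an illustrative software choice, not the paper's hardware design):

```python
import hashlib

class BloomFilter:
    """Minimal Bloom filter: m-bit array, k hash positions per item."""

    def __init__(self, m_bits=1024, k=4):
        self.m, self.k = m_bits, k
        self.bits = 0  # the m-bit array, packed into a Python int

    def _positions(self, item):
        # Derive k positions from one digest (double hashing).
        d = hashlib.sha256(item.encode()).digest()
        h1 = int.from_bytes(d[:8], "big")
        h2 = int.from_bytes(d[8:16], "big") | 1
        return [(h1 + i * h2) % self.m for i in range(self.k)]

    def add(self, item):
        for p in self._positions(item):
            self.bits |= 1 << p

    def __contains__(self, item):
        # No false negatives in a fault-free filter; a transient in the
        # hash logic (the paper's concern) can break this guarantee.
        return all(self.bits >> p & 1 for p in self._positions(item))

bf = BloomFilter()
bf.add("alpha")
```

    A single event transient that flips a hash position during `add` or lookup yields disagreeing positions between insertion and query, which is how false negatives, normally impossible for a Bloom filter, can arise.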
    Print ISSN: 0018-9340
    Electronic ISSN: 1557-9956
    Topics: Computer Science
  • 29
    Institute of Electrical and Electronics Engineers (IEEE)
    Publication Date: 2017-09-09
    Description: We investigate the design of a new instruction set for the Keccak permutation, a cryptographic kernel for hashing, authenticated encryption, keystream generation and random-number generation. Keccak is the basis of the SHA-3 standard and the newly proposed Keyak and Ketje authenticated ciphers. We develop the instruction extensions for a 128-bit interface, commonly available in the vector-processing unit of many modern processors. We examine the trade-off between flexibility and efficiency, and we propose a set of six custom instructions to support a broad range of Keccak-based cryptographic applications. We motivate our custom-instruction selections using a design space exploration that considers various methods of partitioning the state and the operations of the Keccak permutation, and we demonstrate an efficient implementation of this permutation with the proposed instructions. To evaluate their performance, we integrate a simulation model of the proposed ARM NEON vector instructions into the GEM5 micro-architecture simulator. With this simulation model, we evaluate the performance improvement for several cryptographic operations that use the Keccak permutation. Compared to a state-of-the-art NEON software implementation, we demonstrate a performance improvement of 2.2x for SHA-3. Compared to optimized 32-bit assembly programming, we demonstrate performance improvements of 2.6x, 1.6x, and 1.4x for River Keyak, Ketje SR and Ketje JR, respectively. The proposed instructions require 4,658 gate equivalents (GE) in 90 nm, which represents only a tiny fraction of the hardware cost of a modern processor.
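    The Keccak permutation discussed above underlies the SHA-3 standard, whose standardized instances are exposed by Python's hashlib; this shows the kernel being accelerated, not the proposed NEON-style custom instructions:

```python
import hashlib

# SHA3-256 is Keccak with rate 1088 / capacity 512 and the 0x06 domain
# padding; hashlib provides the FIPS 202 instance directly.
digest = hashlib.sha3_256(b"abc").hexdigest()
```

    Each call absorbs the input into the 1600-bit Keccak state and applies the permutation the paper's six custom instructions are designed to speed up.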
    Print ISSN: 0018-9340
    Electronic ISSN: 1557-9956
    Topics: Computer Science
  • 30
    Institute of Electrical and Electronics Engineers (IEEE)
    Publication Date: 2017-09-09
    Description: Modern vehicles employ a large amount of distributed computation and require the underlying communication scheme to provide high bandwidth and low latency. Existing communication protocols like Controller Area Network (CAN) and FlexRay do not provide the required bandwidth, paving the way for adoption of Ethernet as the next generation network backbone for in-vehicle systems. Ethernet would co-exist with safety-critical communication on legacy networks, providing a scalable platform for evolving vehicular systems. This requires a high-performance network gateway that can simultaneously handle high bandwidth, low latency, and isolation; features that are not achievable with traditional processor based gateway implementations. We present VEGa, a configurable vehicular Ethernet gateway architecture utilising a hybrid FPGA to closely couple software control on a processor with dedicated switching circuit on the reconfigurable fabric. The fabric implements isolated interface ports and an accelerated routing mechanism, which can be controlled and monitored from software. Further, reconfigurability enables the switching behaviour to be altered at run-time under software control, while the configurable architecture allows easy adaptation to different vehicular architectures using high-level parameter settings. We demonstrate the architecture on the Xilinx Zynq platform and evaluate the bandwidth, latency, and isolation using extensive tests in hardware.
    Print ISSN: 0018-9340
    Electronic ISSN: 1557-9956
    Topics: Computer Science
  • 31
    Institute of Electrical and Electronics Engineers (IEEE)
    Publication Date: 2017-06-10
    Description: Deriving deadline and period for update transactions to maintain temporal consistency has long been recognized as an important problem in real-time database research. Despite years of active study, most of the past work only focuses on the scheduling of update transactions, and neglects the impact of control transactions by assuming an Update First policy where control transactions are always assigned lower priorities than the update transactions. On the other hand, most existing work on co-scheduling of update and control transactions has been focused on meeting the deadlines of all the control transactions while maximizing the quality of data of the real-time data objects. In this paper, we study the co-scheduling problem of update and control transactions by satisfying the deadline constraints of control transactions and the temporal validity constraints of update transactions simultaneously. Specifically, we consider the problem of how to derive deadline and period for update transactions to maintain the temporal consistency of real-time data objects, while guaranteeing the hybrid transaction set to be EDF-schedulable. To address this problem, we first borrow the idea from $\mathsf{minD}$ [14] to derive a solution called $\mathsf{minD}^\ast$, which can compute deadline and period for update transactions effectively. Next, based on a sufficient condition to derive the minimum possible deadline for each update transaction, we propose a more efficient algorithm, Minimum Deadline Calculation ($\mathsf{MDC}$), which is guaranteed to derive a solution, given that one exists. Finally, the effectiveness and efficiency of the proposed algorithms are validated through extensive simulation experiments.
    Print ISSN: 0018-9340
    Electronic ISSN: 1557-9956
    Topics: Computer Science
  • 32
    Institute of Electrical and Electronics Engineers (IEEE)
    Publication Date: 2017-06-10
    Description: Emerging non-volatile memories (NVMs) are notable candidates for replacing traditional DRAMs. Although NVMs are scalable, dissipate lower power, and do not require refreshes, they face new challenges including shorter lifetime and security issues. Efforts toward securing NVMs against probe attacks pose a serious downside in terms of lifetime. Cryptography algorithms increase the information density of data blocks and consequently handicap existing lifetime-enhancement solutions like Flip-N-Write. In this paper, based on the insight that compression can relax the constraints of the lifetime-security trade-off, we propose CryptoComp, an architecture that, taking advantage of the block-size reduction after compression, aims to enhance memory system lifetime and security. Our idea is to limit the avalanche effect caused by encryption algorithms to a smaller space through compression and selective encryption. This way, for highly compressible data blocks, we follow a full-encryption approach, while for poorly compressible data blocks, we rely on a non-deterministic fine-grain selective-encryption mechanism. Additionally, a simple, block-oriented wear-leveling scheme is presented to fairly distribute the bit flips over memory cells. Our experimental results show 3.59 $\times$ and 3.66 $\times$ lifetime improvements over two state-of-the-art schemes, DEUCE and i-NVMM, while imposing a negligible performance degradation of 2.1 percent, on average.
    Print ISSN: 0018-9340
    Electronic ISSN: 1557-9956
    Topics: Computer Science
  • 33
    Institute of Electrical and Electronics Engineers (IEEE)
    Publication Date: 2017-06-10
    Description: The significant process parameter variations occurring during the fabrication of high-performance sequential circuits, such as microprocessors, pose relevant uncertainties on the power that such circuits will consume in the field while executing workloads typical for the diverse products they are oriented to (e.g., cellular phones, notebooks, servers, etc.). On the other hand, different kinds of products have different constraints on the maximum power that can be consumed during the execution of typical workloads, due to diverse needs in terms of charge autonomy, heat dissipation, etc. Consequently, the power that microprocessors will consume during the execution of typical workloads in the field needs to be accurately characterized at the end of fabrication. Such a power consumption characterization (hereinafter referred to as “power binning”) enables microprocessors to be classified into “power bins”, each one containing microprocessors suitable for a different kind of product, thus enabling all of them to be introduced into the market. Based on these considerations, in this paper we propose an approach to characterize accurately at the end of fabrication, and at low cost (in terms of characterization time), the power that microprocessors will consume in the field during the execution of workloads typical for different kinds of products. Our approach exploits scan-based Logic Built-In Self-Test (LBIST) to apply to microprocessors' sequential blocks test vectors that induce on their internal nodes an activity factor (AF) similar to that experienced during the in-field execution of workloads typical for different kinds of products, thus enabling power binning to be performed by simply measuring their consumed power. 
Our approach enables the AF to be scaled from 0 percent up to 97.6 percent (on average for the considered benchmark circuits) compared to conventional LBIST, with a granularity of 2 percent, thus enabling accurate emulation of the AF induced by workloads typical of a wide range of products. We propose a hardware implementation of our approach requiring a limited area overhead (lower than 3 percent) over conventional LBIST.
    Print ISSN: 0018-9340
    Electronic ISSN: 1557-9956
    Topics: Computer Science
  • 34
    Institute of Electrical and Electronics Engineers (IEEE)
    Publication Date: 2017-06-10
    Description: The algorithm proposed in this article provides high accuracy and requires only double-precision arithmetic. The calculated value obtained by the algorithm is proved to be accurate to $0.785 \cdot \text{ulp}$ as long as the result does not underflow. In the case of subnormal results the error does not exceed $0.875 \cdot \text{ulp}$ . The testing results have also shown that the implementation of the proposed algorithm has reasonable computational time compared to other implementations.
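    The error bounds above are stated in ulps (units in the last place). A result's error can be measured in ulps of the exact value as follows (math.ulp requires Python 3.9+; this only illustrates the metric, not the paper's algorithm):

```python
import math

def error_in_ulps(computed, exact):
    """Distance between a computed result and the exact value,
    expressed in ulps of the exact value."""
    return abs(computed - exact) / math.ulp(exact)
```

    A bound of $0.785 \cdot \text{ulp}$ therefore means the computed value is never farther from the true result than 0.785 times the spacing of floating-point numbers around it, only slightly worse than the 0.5-ulp bound of a correctly rounded operation.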
    Print ISSN: 0018-9340
    Electronic ISSN: 1557-9956
    Topics: Computer Science
  • 35
    Institute of Electrical and Electronics Engineers (IEEE)
    Publication Date: 2017-06-10
    Description: Motivated by the significant storage footprint of OS/Apps in mobile devices, this paper studies the realization of OS/Apps transparent compression. In spite of its obvious advantage, this feature is not widely available in commercial mobile devices, due to the justifiable concern about the read latency penalty. In conventional implementations of transparent compression, read latency overhead comes from two sources: read amplification and decompression computational latency. This paper presents simple yet effective design solutions to eliminate the read amplification at the filesystem level and the computational latency overhead at the computer architecture level. To demonstrate its practical feasibility, we first implemented a prototype filesystem to empirically verify the realization of transparent compression with zero read amplification. We further demonstrated that the OS/Apps footprint can be reduced by up to 39 percent on a Nexus 7 tablet running Android 5.0. Through application-specific integrated circuit (ASIC) synthesis, we show that the proposed computer architecture level design solution can eliminate the decompression latency overhead with very small silicon cost.
    Print ISSN: 0018-9340
    Electronic ISSN: 1557-9956
    Topics: Computer Science
  • 36
    Institute of Electrical and Electronics Engineers (IEEE)
    Publication Date: 2017-06-10
    Description: Approximate computing is an emerging design paradigm that leverages the inherent error tolerance present in many applications to improve their power consumption and performance. Due to the forgiving nature of these error-resilient applications, precise input data is not always necessary for them to produce outputs of acceptable quality. This makes the memory subsystem (i.e., the place where data is stored), a suitable component for introducing approximations in return for substantial energy savings. Towards this end, this paper proposes a systematic methodology for constructing a quality configurable approximate DRAM system. Our design is based upon an extensive experimental characterization of memory errors as a function of the DRAM refresh-rate. Leveraging the insights gathered from this characterization, we propose four novel strategies for partitioning the DRAM in a system into a number of quality bins based on the frequency, location, and nature of bit errors in each of the physical pages, while also taking into account the property of variable retention time exhibited by DRAM cells. During data allocation, critical data is placed in the highest quality bin (that contains only accurate pages) and approximate data is allocated to bins sorted in descending order of quality, with the refresh rate serving as the quality control knob. We validate our proposed scheme on several error-resilient applications implemented using an Altera Stratix IV GX FPGA based Terasic TR4-230 development board containing a 1GB DDR3 DRAM module. Experimental results demonstrate a significant improvement in the energy-quality trade-off compared to previous work and show a reduction in DRAM refresh power of up to 73 percent on average with minimal loss in output quality.
    Print ISSN: 0018-9340
    Electronic ISSN: 1557-9956
    Topics: Computer Science
  • 37
    Institute of Electrical and Electronics Engineers (IEEE)
    Publication Date: 2017-06-10
    Description: The scaling problem in the Residue Number System (RNS) has been addressed in many publications. Significant attention has been given to scaling the three-moduli set $(2^{n}-1, 2^{n}, 2^{n}+1)$ by $2^n$, where $n$ is a positive integer. This paper presents a scaling structure design for the moduli sets $(2^{n}-1, 2^{n+p}, 2^{n}+1)$ scaled by $2^n$, where $0\leq p \leq n$. The new design has the delay of a full adder plus a modular adder. The proposed structure is smaller, faster and more power-efficient than the most recently published work for the same moduli set and the same scaling factor. VLSI synthesis results show an average area reduction of $(9.8-27.9)$ percent, an average time reduction of $(13.9-20.8)$ percent and an average power reduction of $(15.9-22.3)$ percent. The paper also presents a time-efficient scaling structure for the extended three-moduli set $(2^{n}-1, 2^{n+p}, 2^{n}+1)$ scaled by $2^{n+p}$, where $1\leq p \leq n$. The delay of this structure is one half-adder more than that of the aforementioned structure.
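    The arithmetic such a scaler must realize can be checked behaviorally: reconstruct $X$ from its residues by the Chinese Remainder Theorem, form $Y = \lfloor X / 2^n \rfloor$, and re-encode. This sketch verifies the function only, not the adder-based datapath of the paper (moduli values in the test are illustrative):

```python
from math import prod

def crt(residues, moduli):
    """Chinese Remainder Theorem reconstruction for pairwise-coprime moduli."""
    M = prod(moduli)
    x = 0
    for r, m in zip(residues, moduli):
        Mi = M // m
        x += r * Mi * pow(Mi, -1, m)  # pow(..., -1, m): modular inverse
    return x % M

def scale_by_2n(residues, n, p):
    """Scale an RNS number by 2^n for the moduli set (2^n-1, 2^(n+p), 2^n+1)."""
    moduli = (2**n - 1, 2**(n + p), 2**n + 1)
    y = crt(residues, moduli) >> n  # floor division by the scaling factor
    return tuple(y % m for m in moduli)
```

    The moduli $2^n-1$ and $2^n+1$ are odd and coprime, and $2^{n+p}$ is a power of two, so the CRT applies; the paper's contribution is realizing this mapping directly with adder delay instead of full reconstruction.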
    Print ISSN: 0018-9340
    Electronic ISSN: 1557-9956
    Topics: Computer Science
  • 38
    Institute of Electrical and Electronics Engineers (IEEE)
    Publication Date: 2017-06-10
    Description: We provide a framework to analyze multi-level checkpointing protocols, by formally defining a $k$ -level checkpointing pattern. We provide a first-order approximation to the optimal checkpointing period, and show that the corresponding overhead is in the order of $\sum _{\ell =1}^{k}\sqrt{2\lambda _\ell C_\ell}$ , where $\lambda _\ell$ is the error rate at level  $\ell$ , and $C_\ell$ the checkpointing cost at level  $\ell$ . This nicely extends the classical Young/Daly formula on single-level checkpointing. Furthermore, we are able to fully characterize the shape of the optimal pattern (number and positions of checkpoints), and we provide a dynamic programming algorithm to determine the optimal subset of levels to be used. Finally, we perform simulations to check the accuracy of the theoretical study and to confirm the optimality of the subset of levels returned by the dynamic programming algorithm. The results nicely corroborate the theoretical study, and demonstrate the usefulness of multi-level checkpointing with the optimal subset of levels.
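    For the single-level case, the Young/Daly formula referenced above gives the first-order optimal amount of work between checkpoints, $W = \sqrt{2C/\lambda}$, and the abstract's multi-level overhead estimate sums $\sqrt{2\lambda_\ell C_\ell}$ over the levels. A numerical sketch with illustrative rate and cost values:

```python
import math

def young_daly_period(C, lam):
    """Classical single-level Young/Daly period:
    work between checkpoints of cost C under error rate lam."""
    return math.sqrt(2 * C / lam)

def multilevel_overhead(levels):
    """First-order multi-level overhead estimate,
    sum over levels of sqrt(2 * lam_l * C_l); levels = [(lam_l, C_l), ...]."""
    return sum(math.sqrt(2 * lam * C) for lam, C in levels)
```

    For example, a checkpoint cost of 50 s and an error rate of $10^{-4}$ per second give a period of 1000 s; adding a slower, more reliable level only adds its own $\sqrt{2\lambda_\ell C_\ell}$ term to the overhead.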
    Print ISSN: 0018-9340
    Electronic ISSN: 1557-9956
    Topics: Computer Science
  • 39
    Institute of Electrical and Electronics Engineers (IEEE)
    Publication Date: 2017-06-10
    Description: Many distributed graph processing frameworks have emerged to help perform large-scale data analysis for applications including social networks and data mining. Existing frameworks usually focus on system scalability without considering local computing performance. We have observed two locality issues that greatly influence local computing performance in existing systems. One is the locality of the data associated with each vertex/edge. The data are often treated as a single indivisible logical unit and placed in contiguous memory. However, it is quite common that for some computing steps, only some portions of the data (called properties) are needed. The current data layout incurs a large amount of interleaved memory access. The other issue is that the execution engine applies computation at the granularity of a vertex; optimizing for the locality of the source vertex of each edge often hurts the locality of the target vertex, or vice versa. We have built a distributed graph processing framework called Photon to address the above issues. Photon employs a Property View that stores the same type of property for all vertices and edges together. This improves locality when computing with a subset of the properties. Photon also employs an edge-centric execution engine with Hilbert order that improves locality during computation. We have evaluated Photon with five graph applications using five real-world graphs and compared it with four existing systems. The results show that the Property View and edge-centric execution design improve graph processing performance by 2.4X.
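    The Property View idea is essentially a structure-of-arrays layout: each property lives in its own dense array, so a pass that reads one property streams one contiguous array instead of dragging whole vertex records through the cache. A toy illustration (the names are ours, not Photon's API):

```python
# Array-of-structs layout: every access to one property also loads the
# rest of the record, causing the interleaved memory access noted above.
vertices_aos = [{"rank": 0.25, "degree": 3},
                {"rank": 0.75, "degree": 1}]

# Property View (struct-of-arrays): one dense array per property.
rank   = [0.25, 0.75]
degree = [3, 1]

def sum_ranks_property_view():
    # Touches only the `rank` array: sequential, cache-friendly access.
    return sum(rank)
```

    In Photon this layout pays off whenever a computing step needs only a subset of properties; the edge-centric Hilbert-order traversal then keeps both endpoints' property accesses local.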
    Print ISSN: 0018-9340
    Electronic ISSN: 1557-9956
    Topics: Computer Science
  • 40
    Institute of Electrical and Electronics Engineers (IEEE)
    Publication Date: 2017-06-10
    Description: The compaction of test programs for processor-based systems is of utmost practical importance: Software-Based Self-Test (SBST) is nowadays increasingly adopted, especially for in-field test of safety-critical applications, and both the size and the execution time of the test are critical parameters. However, while compacting the size of binary test sequences has been thoroughly studied over the years, the reduction of the execution time of test programs is still a rather unexplored area of research. This paper describes a family of algorithms able to automatically enhance an existing test program, reducing the time required to run it and, as a side effect, its size. The proposed solutions are based on instruction removal and restoration, which is shown to be computationally more efficient than instruction removal alone. Experimental results demonstrate the compaction capabilities, and allow analyzing computational costs and effectiveness of the different algorithms.
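    The removal-and-restoration idea can be sketched as a greedy loop over instructions; the coverage oracle below is a toy stand-in, not the paper's fault simulator:

```python
def compact(program, coverage):
    """Greedy removal and restoration: drop each instruction in turn and
    restore it only if the coverage metric would degrade."""
    baseline = coverage(program)
    kept = list(program)
    i = 0
    while i < len(kept):
        candidate = kept[:i] + kept[i + 1:]
        if coverage(candidate) >= baseline:
            kept = candidate          # removal preserved coverage: accept
        else:
            i += 1                    # restore: keep instruction, move on
    return kept

# Toy oracle: each instruction detects a set of hypothetical fault ids;
# coverage is the number of distinct faults the sequence detects.
faults = {"nop": set(), "add": {1}, "mul": {2}, "sub": {1, 2}}
cov = lambda prog: len(set().union(*(faults[p] for p in prog)) if prog else set())
print(compact(["nop", "add", "mul", "sub"], cov))
```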
    Print ISSN: 0018-9340
    Electronic ISSN: 1557-9956
    Topics: Computer Science
  • 41
    Institute of Electrical and Electronics Engineers (IEEE)
    Publication Date: 2017-06-10
    Description: Heterogeneous System Architectures (HSA) that integrate cores of different types (CPU, GPU, etc.) on a single chip are gaining significance for many classes of applications that demand high performance. Networks-on-Chip (NoCs) in HSA are monopolized by high-volume GPU traffic, penalizing CPU application performance. In addition, building efficient interfaces between subsystems of different specifications while achieving optimal performance is a demanding task. Homogeneous NoCs, widely used for many-core systems, fall short of meeting these communication requirements. To achieve a high-performance interconnect for HSA, we propose the HyWin topology using mm-wave wireless links. The proposed topology implements sandboxed heterogeneous sub-networks, each designed to match the needs of a processing subsystem, which are then interconnected at a second level using a wireless network. The sandboxed sub-networks avoid conflicting network requirements, while providing optimal performance for their respective subsystems. The long-range wireless links provide a low-latency, low-energy inter-subsystem network that gives easy access to memory controllers and lower-level caches across the entire system. By implementing the proposed topology for a CPU/GPU HSA, we show that it improves application performance by 29 percent and reduces latency by 50 percent, while reducing energy consumption by 64.5 percent and area by 17.39 percent compared to a baseline mesh.
    Print ISSN: 0018-9340
    Electronic ISSN: 1557-9956
    Topics: Computer Science
  • 42
    Institute of Electrical and Electronics Engineers (IEEE)
    Publication Date: 2017-06-10
    Description: Conventional cache tag matching identifies the requested data based on a memory address. However, this address-based tag matching is inefficient because it requires unnecessarily many tag bits. Previous studies show that translation look-aside buffer (TLB) index-based tagging (TLBIT) can be adopted in instruction caches because there are not many different tags at a given moment due to spatial locality, and those tags can be captured by TLBs. For the TLBIT scheme, extra TLB indices are added to each TLB entry and conventional cache tags are replaced with TLB indices to identify the requested data in the cache. TLBIT reduces the number of required tag bits in tag arrays; therefore, the cache energy consumption and area are decreased. In this paper, we show that naively adopting TLBIT for data caches is inefficient, in terms of performance and energy consumption, because of cache line searches and invalidations on TLB misses. To achieve the true potential of TLBIT, we propose four novel techniques: search zone, c-LRU, TLB buffer and demand address fetching. The search zone reduces unnecessary cache line searching and c-LRU reduces the cache line invalidations. The TLB buffer prevents immediate cache line invalidations on TLB misses. Furthermore, we present demand address fetching to reduce energy consumption in the TLB. From our experiments, we observed that the proposed techniques reduce the overall dynamic energy consumption of the data cache by 14.3 percent on average. The overall tag array area and leakage power of the data cache are also reduced by 54 and 45 percent, respectively. The TLB energy consumption is reduced by 22.7 percent. The performance impact is small, less than 0.4 percent on average. We also demonstrate that TLBIT can be applied to large caches and set-associative TLBs.
    Print ISSN: 0018-9340
    Electronic ISSN: 1557-9956
    Topics: Computer Science
  • 43
    Institute of Electrical and Electronics Engineers (IEEE)
    Publication Date: 2017-06-10
    Description: This paper relies on the principles of inexact computing to alleviate the issues arising in static masking by voting for reliable computing at the nanoscales. Two schemes that utilize approximate voting in different manners are proposed. The first scheme is referred to as inexact double modular redundancy (IDMR). IDMR does not resort to triplication, thus saving the overhead due to modular replication. This scheme is crudely adaptive in its operation, i.e., it allows a threshold to determine the validity of the module outputs. IDMR operates by initially establishing the difference between the values of the outputs of the two modules; only if the difference is below a preset threshold does the voter calculate the average value of the two module outputs. The second scheme (ITDMR) combines IDMR with TMR (triple modular redundancy) by using novel conditions in the comparison of the outputs of the three modules. Within an inexact framework, the majority is established using different criteria; in ITDMR, adaptive operation is carried further than in IDMR to include approximate voting in a pairwise fashion: the validity of the three inputs is established, and when only two of the three inputs satisfy the threshold condition, the IDMR operation is utilized. An extensive analysis that includes the voting circuits as well as a probabilistic framework is included. The proposed IDMR and ITDMR schemes improve power dissipation and tolerance to variations compared to traditional TMR. To further validate the applicability of the proposed schemes, inexact voting has been used in two applications (image processing and FIR filtering); the simulation results show that performance is substantially improved over TMR.
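    The threshold-based voting conditions can be sketched directly; the mismatch handling (returning None) and the choice of which agreeing pair to average are simplifications of the paper's rules, not its exact circuits:

```python
def idmr_vote(a, b, threshold):
    """Inexact DMR: if the two module outputs agree within the threshold,
    output their average; otherwise signal a mismatch (here: None)."""
    if abs(a - b) <= threshold:
        return (a + b) / 2
    return None

def itdmr_vote(a, b, c, threshold):
    """Inexact TMR: pairwise threshold checks; if at least one pair agrees,
    fall back to IDMR-style averaging on an agreeing pair."""
    pairs = [(a, b), (a, c), (b, c)]
    agreeing = [p for p in pairs if abs(p[0] - p[1]) <= threshold]
    if not agreeing:
        return None
    x, y = agreeing[0]  # simplification: average the first agreeing pair
    return (x + y) / 2

print(idmr_vote(100, 102, 4))        # 101.0
print(itdmr_vote(100, 250, 103, 4))  # pair (100, 103) agrees -> 101.5
```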
    Print ISSN: 0018-9340
    Electronic ISSN: 1557-9956
    Topics: Computer Science
  • 44
    Institute of Electrical and Electronics Engineers (IEEE)
    Publication Date: 2017-06-10
    Description: DRAM systems are hierarchically organized: Channel-Rank-Bank. A channel is connected to multiple ranks, and each rank has multiple banks. This hierarchical structure facilitates creating parallelism in DRAM. The current DRAM architecture supports bank-level parallelism: as many rows as there are banks can be activated simultaneously. However, rank-level parallelism is not supported. For this reason, only one column can be accessed at a time, although each rank has its own data bus that can carry a column. Namely, current DRAM operations do not exploit the structural opportunity created by multiple ranks. We therefore propose a novel DRAM architecture supporting rank-level parallelism, whereby as many columns as there are ranks can be moved concurrently. In this paper, we illustrate rank-level parallelism and its benefit in DRAM operations.
    Print ISSN: 0018-9340
    Electronic ISSN: 1557-9956
    Topics: Computer Science
  • 45
    Institute of Electrical and Electronics Engineers (IEEE)
    Publication Date: 2017-06-10
    Description: Fault-tolerant minimal routing algorithms aim to find a Manhattan path between the source and destination nodes while routing around all faulty nodes. Additionally, some non-faulty nodes that cannot be part of any fault-tolerant minimal path must also be routed around. Labeling such non-faulty nodes efficiently is a major challenge, and state-of-the-art solutions do not address it well. We propose a path-counter method that labels every node that cannot be part of a fault-tolerant minimal path, with low time complexity. By counting the number of fault-tolerant minimal paths, it can support arbitrary fault distributions, check the existence of fault-tolerant minimal paths, and avoid sacrificing any available fault-tolerant minimal path.
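    The path-counter idea can be illustrated with a small dynamic program that counts minimal (monotone Manhattan) paths avoiding faulty nodes; a non-faulty node lying on zero such paths is exactly one that must be routed around. This is a sketch of the counting principle, not the paper's algorithm:

```python
def count_minimal_paths(rows, cols, faults, src=(0, 0), dst=None):
    """DP count of Manhattan (monotone) paths from src to dst that avoid
    faulty nodes. A zero count at dst means no fault-tolerant minimal
    path exists."""
    dst = dst or (rows - 1, cols - 1)
    paths = [[0] * cols for _ in range(rows)]
    paths[src[0]][src[1]] = 0 if src in faults else 1
    for r in range(src[0], dst[0] + 1):
        for c in range(src[1], dst[1] + 1):
            if (r, c) in faults or (r, c) == src:
                continue
            up = paths[r - 1][c] if r > src[0] else 0
            left = paths[r][c - 1] if c > src[1] else 0
            paths[r][c] = up + left
    return paths[dst[0]][dst[1]]

# 3x3 mesh with one faulty node in the center: 2 minimal paths remain.
print(count_minimal_paths(3, 3, {(1, 1)}))
```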
    Print ISSN: 0018-9340
    Electronic ISSN: 1557-9956
    Topics: Computer Science
  • 46
    Institute of Electrical and Electronics Engineers (IEEE)
    Publication Date: 2017-06-10
    Description: Soft real-time applications often show bursty memory access patterns—requiring high memory bandwidth for a short duration of time—that are often critical for timely data processing and performance. We call code sections that exhibit such characteristics Memory-Performance Critical Sections (MPCSs). Unfortunately, in multicore architectures, non-real-time applications on different cores may also demand high memory bandwidth at the same time. The resulting bandwidth contention can substantially increase the time spent in the MPCSs of soft real-time applications, which in turn could result in missed deadlines. In this paper, we present a memory access control framework called BWLOCK, which is designed to protect the MPCSs of soft real-time applications. BWLOCK consists of a user-level library and a kernel-level memory bandwidth control mechanism. The user-level library provides a lock-like API to declare MPCSs for real-time applications. When a real-time application enters an MPCS, the kernel-level bandwidth control mechanism dynamically throttles the memory bandwidth of the remaining cores to protect the MPCS until it completes. We evaluate BWLOCK using CortexSuite benchmarks. By selectively applying BWLOCK, based on the memory intensity of the code blocks in each benchmark, we achieve significant performance improvements, up to a 150 percent reduction in slowdown, at a controllable throughput impact to non-real-time applications.
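    A lock-like MPCS API of the kind described might look as follows; the throttle/release callbacks stand in for the kernel-level mechanism and are assumptions for illustration, not the actual BWLOCK interface:

```python
import contextlib

class BwLock:
    """Sketch of a BWLOCK-style user-level API: entering a
    memory-performance critical section asks a (mocked) kernel mechanism
    to throttle the other cores' memory bandwidth; leaving releases it."""
    def __init__(self, throttle, release):
        self._throttle = throttle
        self._release = release

    @contextlib.contextmanager
    def critical_section(self):
        self._throttle()
        try:
            yield
        finally:
            self._release()   # released even if the MPCS raises

log = []
lock = BwLock(lambda: log.append("throttle"), lambda: log.append("release"))
with lock.critical_section():
    log.append("mpcs work")
print(log)  # ['throttle', 'mpcs work', 'release']
```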
    Print ISSN: 0018-9340
    Electronic ISSN: 1557-9956
    Topics: Computer Science
  • 47
    Institute of Electrical and Electronics Engineers (IEEE)
    Publication Date: 2017-07-15
    Description: The emerging large-scale multi-socket systems make more sophisticated large-scale coherence management a necessity. Directory-based coherence has long been the de facto solution and a clear candidate for large-scale shared-memory systems. A vanilla directory design, however, uses storage inefficiently to keep coherence metadata, resulting in a high storage overhead for large-scale systems. In this paper, we propose a dynamic multi-grain directory for large multi-socket systems. The idea is to track coherence for regions of different sizes, which requires storing much less information in the directory than having a directory entry per data block. The directory dynamically refines granularity according to the application phase and therefore tracks coherence information for regions of varying sizes. The results show that the proposal reduces directory storage by an order of magnitude, while the loss of precision causes no performance penalty. The paper demonstrates that different applications, and different phases of the same application, have different requirements for the region size. Performance is compared against two state-of-the-art multi-grain directories, and the proposed design obtains the best results.
    Print ISSN: 0018-9340
    Electronic ISSN: 1557-9956
    Topics: Computer Science
  • 48
    Institute of Electrical and Electronics Engineers (IEEE)
    Publication Date: 2017-07-15
    Description: High reliability, efficient I/O performance, and flexible consistency provided at low storage cost are all desirable properties of cloud storage systems. Due to their inherent conflicts, however, simultaneously achieving the optimum in all these properties is impractical. N-way replication and erasure coding, two extensively applied storage schemes with high reliability, adopt opposite and unbalanced strategies in the tradeoff among these properties, considerably restraining their effectiveness on a wide range of workloads. To address this obstacle, we propose a novel storage scheme called ASSER, an ASSembling chain of Erasure coding and Replication. ASSER stores each object in two parts: a full copy and a certain number of erasure-coded segments. We establish dedicated read/write protocols for ASSER that leverage its unique structural advantages. On the basis of these elementary protocols, we implement sequential and PRAM (Pipeline-RAM) consistency to make ASSER feasible for various services with different performance/consistency requirements. Evaluation results demonstrate that under the same fault tolerance and consistency level, ASSER outperforms N-way replication and pure erasure coding in I/O throughput under diverse system and workload configurations, with superior performance stability. More importantly, ASSER delivers stably efficient I/O performance at much lower storage cost than the other schemes.
    Print ISSN: 0018-9340
    Electronic ISSN: 1557-9956
    Topics: Computer Science
  • 49
    Institute of Electrical and Electronics Engineers (IEEE)
    Publication Date: 2017-07-15
    Description: A binary de Bruijn sequence with period $2^n$ is a sequence in which every tuple of $n$ bits occurs exactly once. De Bruijn sequence generators have randomness properties that make them attractive for pseudorandom number generators and as building blocks for stream ciphers. Unfortunately, it is very difficult to find de Bruijn sequence generators with long periods (e.g., $2^{128}$ ) and most known de Bruijn sequence generators are computationally quite expensive. In this article, we present “OcDeb-$k$-$n$” and the first hardware implementation of de Bruijn sequence generators. OcDeb-$k$-$n$ efficiently computes a composited de Bruijn sequence where $k$ levels of composition are added to a de Bruijn sequence of period $2^n$ . Numerically, OcDeb reduces the bit operations used for computing the feedback function significantly from ${\Theta}(k^2+nk)$ to ${\Theta}(k\;\log {k} + \log {n})$ . Furthermore, it enables efficient parallelization and hardware retiming. Comprehensive result analysis is conducted for 65 nm ASIC technology. For example, OcDeb-32-32 has an area of 643 GE with 1.45 Gbps performance, and with parallelization it generates up to 25.4 Gbps at the cost of 4,787 GE. The area of OcDeb-512-32 generating a de Bruijn sequence of period $2^{544}$ is 7,304 GE and the performance is 1.25 Gbps.
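    For intuition, a (non-composited) de Bruijn sequence can be generated in software with the classic greedy "prefer one" construction; this illustrates the defining property, not the OcDeb hardware generator:

```python
def de_bruijn_prefer_one(n):
    """Classic 'prefer one' greedy construction of a binary de Bruijn
    sequence of order n: start from n zeros and append a 1 whenever the
    resulting n-bit window is new, else a 0. Read cyclically, the first
    2**n bits contain every n-bit tuple exactly once."""
    seq = [0] * n
    seen = {tuple(seq)}
    while len(seen) < 2 ** n:
        for bit in (1, 0):
            window = tuple(seq[-(n - 1):] + [bit]) if n > 1 else (bit,)
            if window not in seen:
                seen.add(window)
                seq.append(bit)
                break
    return seq[:2 ** n]  # one full period

print(de_bruijn_prefer_one(3))  # [0, 0, 0, 1, 1, 1, 0, 1]
```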
    Print ISSN: 0018-9340
    Electronic ISSN: 1557-9956
    Topics: Computer Science
  • 50
    Institute of Electrical and Electronics Engineers (IEEE)
    Publication Date: 2017-07-15
    Description: Redundancy repair is a commonly used technique for memory yield improvement. In order to ensure a high repair rate and final product yield, it is necessary to develop a repair scheme for the coming three-dimensional (3D) architecture of stacked DRAM. According to the JEDEC mobile memory technology roadmap, the interfaces of 3D DRAM, including Wide I/O and High-Bandwidth Memory (HBM), are mainly classified as channel-based memories. In this paper, we propose a built-off self-repair (BOSR) scheme at the controller level for channel-based 3D memory to enhance final product yield after the bonding of a memory cube to its corresponding logic die. The logic die contains the channel controller, in which the BOSR circuit resides. Experimental results show that the repair rate remains high even at higher clustered-failure ratios, due to the flexible algorithm we choose. The area overhead is low, and it decreases significantly as the memory size or channel count increases. The performance penalty is also low due to the parallel execution of address comparison and repair. Moreover, the manufacturing cost is lower than that of the conventional DRAM architecture due to allocator-based redundancies. Finally, the proposed scheme can easily be applied to other channel-based 3D memories.
    Print ISSN: 0018-9340
    Electronic ISSN: 1557-9956
    Topics: Computer Science
  • 51
    Institute of Electrical and Electronics Engineers (IEEE)
    Publication Date: 2017-07-15
    Description: While MapReduce is inherently designed for batch and high-throughput processing workloads, there is an increasing demand for non-batch processing on big data, e.g., interactive jobs, real-time queries, and stream computations. The emerging Apache Spark fills this gap: it can run on an established Hadoop cluster and take advantage of the existing HDFS. As a result, the Spark-on-YARN deployment model is widely applied by many industry leaders. However, we identify three key challenges in deploying Spark on YARN: inflexible reservation-based resource management, inter-task-dependency-blind scheduling, and locality interference between Spark and MapReduce applications. These three challenges cause inefficient resource utilization and significant performance deterioration. We propose and develop a cross-platform resource scheduling middleware, iKayak, which aims to improve resource utilization and application performance in multi-tenant Spark-on-YARN clusters. iKayak relies on three key mechanisms: reservation-aware executor placement to avoid long waits for resource reservation, dependency-aware resource adjustment to exploit under-utilized resources occupied by reduce tasks, and cross-platform locality-aware task assignment to coordinate locality competition between Spark and MapReduce applications. We implement iKayak in YARN. Experimental results on a testbed show that iKayak can achieve 50 percent performance improvement for Spark applications and 19 percent performance improvement for MapReduce applications, compared to two popular Spark-on-YARN deployment models, i.e., the YARN-client model and the YARN-cluster model.
    Print ISSN: 0018-9340
    Electronic ISSN: 1557-9956
    Topics: Computer Science
  • 52
    Institute of Electrical and Electronics Engineers (IEEE)
    Publication Date: 2017-07-15
    Description: In this paper, we present a fast implementation of QC-MDPC Niederreiter encryption. Existing high-speed implementations are considerably resource-intensive; the solution we propose here mitigates this while maintaining high throughput. In particular, new arithmetic for lightweight Hamming weight computation and a fast sorting network for MDPC decoding are proposed. A novel constant-weight coding unit is proposed to enable standard asymmetric encryptions. At present, the design presented in this work is the fastest of the existing QC-MDPC code-based encryption schemes in the public domain. The area-time product of this work drops by at least 53 percent compared to previous high-speed designs of QC-MDPC based encryptions. It is shown, for instance, that our encrypting engine can perform one encryption in 3.86 $\mu s$ on a Xilinx Virtex-6 FPGA with 3371 slices. Our iterative decrypting engine can decrypt one ciphertext in 114.64 $\mu s$ with 5271 slices, and our faster non-iterative decrypting engine can decrypt in 65.76 $\mu s$ with 8781 slices.
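    Lightweight Hamming-weight computation is central to MDPC bit-flipping decoders; the standard SWAR (tree-reduction) popcount below illustrates the kind of arithmetic involved, though it is not the paper's circuit:

```python
def hamming_weight(x):
    """Tree-reduction (SWAR) popcount on a 64-bit word: pairwise sums of
    1-bit fields, then 2-bit fields, then 4-bit fields, then a multiply
    to fold the byte counts into the top byte."""
    x = x - ((x >> 1) & 0x5555555555555555)
    x = (x & 0x3333333333333333) + ((x >> 2) & 0x3333333333333333)
    x = (x + (x >> 4)) & 0x0F0F0F0F0F0F0F0F
    return ((x * 0x0101010101010101) & 0xFFFFFFFFFFFFFFFF) >> 56

print(hamming_weight(0b1011_0010))  # 4
```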
    Print ISSN: 0018-9340
    Electronic ISSN: 1557-9956
    Topics: Computer Science
  • 53
    Publication Date: 2017-07-15
    Description: In this paper, we present new parallel polynomial multiplication formulas which result in subquadratic space complexity. The schemes are based on a recently proposed block recombination of a polynomial multiplication formula. The proposed two-way, three-way, and four-way split polynomial multiplication formulas achieve the smallest space complexities. Moreover, by providing an area-time tradeoff method, the proposed formulas enable one to choose a parallel polynomial multiplication formula suited to a given design environment.
    Print ISSN: 0018-9340
    Electronic ISSN: 1557-9956
    Topics: Computer Science
  • 54
    Institute of Electrical and Electronics Engineers (IEEE)
    Publication Date: 2017-07-15
    Description: Multipliers requiring large bit lengths have a major impact on the performance of many applications, such as cryptography, digital signal processing (DSP) and image processing. Novel, optimised designs of large integer multiplication are needed as previous approaches, such as schoolbook multiplication, may not be as feasible due to the large parameter sizes. Parameter bit lengths of up to millions of bits are required for use in cryptography, such as in lattice-based and fully homomorphic encryption (FHE) schemes. This paper presents a comparison of hardware architectures for large integer multiplication. Several multiplication methods and combinations thereof are analysed for suitability in hardware designs, targeting the FPGA platform. In particular, the first hardware architecture combining Karatsuba and Comba multiplication is proposed. Moreover, a hardware complexity analysis is conducted to give results independent of any particular FPGA platform. It is shown that hardware designs of combination multipliers, at a cost of additional hardware resource usage, can offer lower latency compared to individual multiplier designs. Indeed, the proposed novel combination hardware design of the Karatsuba-Comba multiplier offers lowest latency for integers greater than 512 bits. For large multiplicands, greater than 16,384 bits, the hardware complexity analysis indicates that the NTT-Karatsuba-Schoolbook combination is most suitable.
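    The Karatsuba half of the Karatsuba-Comba combination replaces four half-size multiplications with three; a minimal software sketch with a schoolbook (native) fallback below a cutoff, where the cutoff value is illustrative rather than the paper's tuned threshold:

```python
def karatsuba(x, y, cutoff=64):
    """Textbook Karatsuba: split each operand in half and compute three
    half-size products instead of four. Below `cutoff` bits, fall back
    to plain multiplication, mirroring the combination-multiplier idea."""
    n = max(x.bit_length(), y.bit_length())
    if n <= cutoff:
        return x * y
    half = n // 2
    mask = (1 << half) - 1
    x_hi, x_lo = x >> half, x & mask
    y_hi, y_lo = y >> half, y & mask
    hi = karatsuba(x_hi, y_hi, cutoff)
    lo = karatsuba(x_lo, y_lo, cutoff)
    mid = karatsuba(x_hi + x_lo, y_hi + y_lo, cutoff) - hi - lo
    return (hi << (2 * half)) + (mid << half) + lo

a, b = 3 ** 200, 7 ** 150
assert karatsuba(a, b) == a * b
```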
    Print ISSN: 0018-9340
    Electronic ISSN: 1557-9956
    Topics: Computer Science
  • 55
    Institute of Electrical and Electronics Engineers (IEEE)
    Publication Date: 2017-07-15
    Description: The fact that graphics processors (GPUs) are today’s most powerful computational hardware for the dollar has motivated researchers to utilize the ubiquitous and powerful GPUs for general-purpose computing. However, unlike CPUs, GPUs are optimized for processing 3D graphics (e.g., graphics rendering), a kind of data-parallel application, and consequently, several GPUs do not support strong synchronization primitives to coordinate their cores. This prevents the GPUs from being deployed more widely for general-purpose computing. This paper aims at bridging the gap between the lack of strong synchronization primitives in the GPUs and the need for strong synchronization mechanisms in parallel applications. Based on the intrinsic features of typical GPU architectures, we construct strong synchronization objects such as wait-free and $t$ -resilient read-modify-write objects for a general model of GPU architectures without hardware synchronization primitives such as test-and-set and compare-and-swap . Accesses to the wait-free objects have time complexity $O(N)$ , where $N$ is the number of processes. The wait-free objects have the optimal space complexity $O(N^2)$ . Our result demonstrates that it is possible to construct wait-free synchronization mechanisms for GPUs without strong synchronization primitives in hardware and that wait-free programming is possible for such GPUs.
    Print ISSN: 0018-9340
    Electronic ISSN: 1557-9956
    Topics: Computer Science
  • 56
    Institute of Electrical and Electronics Engineers (IEEE)
    Publication Date: 2017-07-15
    Description: With the proliferation of the Internet of Things (IoT), the IEEE 802.15.4 physical layer is becoming increasingly popular due to its low power consumption. However, secure data communication over the network is a challenging issue because vulnerabilities in the existing security primitives lead to several attacks, and mitigating these attacks separately adds a significant computing burden to the legitimate node. In this paper, we propose a secure IEEE 802.15.4 transceiver design that mitigates multiple attacks simultaneously by using a physical-layer encryption approach that reduces the computations at the upper layers. In addition to providing confidentiality and integrity services, the proposed transceiver presents sufficient complexity against various attacks, such as cryptanalysis and traffic-analysis attacks. It also significantly improves the lifetime of the node in the presence of a ghost attacker by preventing the legitimate node from processing bogus messages, and hence defends against energy depletion attacks. The simulation results show that a high symbol error rate at the adversary can be achieved using the proposed transceiver without affecting the throughput at the legitimate node. In this paper, we also analyze the hardware complexity by developing FPGA and ASIC prototypes of the proposed transceiver.
    Print ISSN: 0018-9340
    Electronic ISSN: 1557-9956
    Topics: Computer Science
  • 57
    Institute of Electrical and Electronics Engineers (IEEE)
    Publication Date: 2017-07-15
    Description: Approximate computing is an attractive design methodology to achieve low power, high performance (low delay) and reduced circuit complexity by relaxing the requirement of accuracy. In this paper, approximate Booth multipliers are designed based on approximate radix-4 modified Booth encoding (MBE) algorithms and a regular partial product array that employs an approximate Wallace tree. Two approximate Booth encoders are proposed and analyzed for error-tolerant computing. The error characteristics are analyzed with respect to the so-called approximation factor, which is related to the inexact bit width of the Booth multipliers. Simulation results at a 45 nm feature size in CMOS for delay, area and power consumption are also provided. The results show that the proposed 16-bit approximate radix-4 Booth multipliers with approximation factors of 12 and 14 are more accurate than existing approximate Booth multipliers with moderate power consumption. The proposed R4ABM2 multiplier with an approximation factor of 14 is the most efficient design when considering both the power-delay product and the error metric NMED. Case studies for image processing show the validity of the proposed approximate radix-4 Booth multipliers.
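    The accuracy/cost trade-off governed by an approximation factor can be modeled crudely in software; the sketch below simply truncates the low result columns to zero and is only an illustration of the error-distance idea, not the proposed radix-4 Booth encoder:

```python
def approx_multiply(a, b, f):
    """Toy model of an approximate multiplier with approximation factor f:
    the low-order f result columns are produced inexactly (here: simply
    zeroed), trading accuracy for hardware cost."""
    exact = a * b
    return (exact >> f) << f

def error_distance(a, b, f):
    """Absolute error of the approximate product vs. the exact product."""
    return a * b - approx_multiply(a, b, f)

print(approx_multiply(1234, 5678, 12), error_distance(1234, 5678, 12))
```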
    Print ISSN: 0018-9340
    Electronic ISSN: 1557-9956
    Topics: Computer Science
  • 58
    Institute of Electrical and Electronics Engineers (IEEE)
    Publication Date: 2017-07-15
    Description: This paper models 1-out-of-N standby computing systems with a dynamic checkpointing policy. The system performs a real-time mission task that has to be accomplished within an allowed mission time. During the mission, to facilitate effective failure recovery, the system undergoes checkpointing procedures according to a policy that dynamically determines a checkpointing frequency based on the activated element and the remaining work for completing the mission. System elements are heterogeneous; they can follow different, arbitrary types of time-to-failure distributions, have different performance, and wait in different standby modes before their activation. A new numerical algorithm based on state-space event transitions is first proposed to evaluate the mission success probability of the real-time standby systems considered in this work. Additional new contributions are made by formulating and solving optimal dynamic checkpointing policy problems, as well as an integrated optimization problem that finds the optimal combination of checkpointing policy and element activation sequence maximizing mission success probability. The advantages of using the dynamic checkpointing policy over fixed, evenly spaced checkpoints are demonstrated through examples. Examples and results are also provided to illustrate the effects of different mission and element parameters on mission success probability as well as on the optimal dynamic checkpointing policy.
    Print ISSN: 0018-9340
    Electronic ISSN: 1557-9956
    Topics: Computer Science
  • 59
    Institute of Electrical and Electronics Engineers (IEEE)
    Publication Date: 2017-07-15
    Description: Given the needs of data-intensive web services and cloud computing applications, storage centers play an important role in serving the demanded data accesses while jointly considering low cost, qualified performance, and good scalability. To manage peak workloads with performance requirements on read/write latencies, overprovisioning more storage nodes is common but increases total cost as well as power consumption. Recently, due to their growing capacity and dropping price, NAND-flash-based Solid-State Drives (SSDs) have become an attractive storage solution in datacenters. In this work, we exploit the write heterogeneity in Multi-Level-Cell (MLC) NAND flash memory to meet the Service-Level Objectives (SLOs) of applications and to avoid storage overprovisioning. In MLC NAND flash memory, a memory cell can be programmed as a Single-Level Cell (SLC) or a multi-level cell at runtime, and SLC writes take shorter latency at the cost of more consumed capacity. The proposed SLO-aware morphable SSD design seeks to meet the SLO requirement by deciding the write mode of each write request while minimizing the number of SLC writes. Experimental results show that the proposed design meets the SLO requirement for all of the tested I/O traces with less than 2.8 percent extra erase counts on average, while conventional MLC SSDs require up to 2.375 times storage overprovisioning to meet the SLO requirement.
    Print ISSN: 0018-9340
    Electronic ISSN: 1557-9956
    Topics: Computer Science
  • 60
    Institute of Electrical and Electronics Engineers (IEEE)
    Publication Date: 2017-07-15
    Description: DDR4 SDRAM introduced a new hierarchy in DRAM organization: the bank-group (BG). The main purpose of BG is to increase I/O bandwidth without growing the DRAM-internal bus width. We, however, found that other benefits can be derived from the new hierarchy. To achieve them, we propose a new DRAM architecture using the BG hierarchy, leading to the creation of BG-Level Parallelism (BGLP). By exploiting BGLP, the overall parallelism of DRAM operations grows. We also argue that BGLP is a feasible solution in the cost-sensitive DRAM industry because the additional cost is negligible and only cost-insensitive area needs to be modified.
    Print ISSN: 0018-9340
    Electronic ISSN: 1557-9956
    Topics: Computer Science
  • 61
    Institute of Electrical and Electronics Engineers (IEEE)
    Publication Date: 2017-07-15
    Description: A joint numerical representation based on both Gaussian and Eisenstein integers is proposed. This Gauss-Eisenstein representation maps complex numbers into four-tuples of integers with arbitrarily high precision. The representation furnishes the computation of the 3-, 6-, and 12-point discrete Fourier transform (DFT) at any desired accuracy. The associated fast algorithms based on the Gauss-Eisenstein integers are error-free up to the final reconstruction step (FRS), which can be realized in hardware as a multiplierless implementation. The introduced methods are compared with competing algorithms in terms of arithmetic complexity. We propose three FRS architectures based on the following methods: Dempster-McLeod representation, expansion factor, and addition-aware quantization. The Gauss-Eisenstein 12-point DFT is physically realized on a Xilinx Virtex 6 FPGA device with a maximum clock frequency of 302 MHz for the expansion-factor FRS, with a real-time throughput of $3.62\times 10^9$ coefficients/s. The FPGA-verified digital designs were synthesized, mapped, placed, and finally routed for 0.18 μm CMOS technology assuming a 1.8 V DC supply, employing the Austria Micro Systems (AMS) standard-cell library (hitkit version 4.11). The routed ASIC is predicted to operate at a maximum frequency of 505 MHz for the expansion-factor FRS, with a potential real-time throughput of $6.06\times 10^9$ coefficients/s.
    Print ISSN: 0018-9340
    Electronic ISSN: 1557-9956
    Topics: Computer Science
  • 62
    Institute of Electrical and Electronics Engineers (IEEE)
    Publication Date: 2017-10-11
    Description: Gaussian Mixture Models (GMMs) are widely used in many applications such as data mining, signal processing and computer vision, for probability density modeling and soft clustering. However, the parameters of a GMM need to be estimated from data by, for example, the Expectation-Maximization algorithm for Gaussian Mixture Models (EM-GMM), which is computationally demanding. This paper presents a novel design for the EM-GMM algorithm targeting reconfigurable platforms, with five main contributions. First, a pipeline-friendly EM-GMM with diagonal covariance matrices that can easily be mapped to hardware architectures. Second, a function evaluation unit for Gaussian probability density based on fixed-point arithmetic. Third, our approach is extended to support a wide range of dimensions and/or components by fitting multiple pieces of smaller dimensions onto an FPGA chip. Fourth, we derive a cost and performance model that estimates logic resources. Fifth, our dataflow design targeting the Maxeler MPC-X2000 with a Stratix-5SGSD8 FPGA can run over 200 times faster than a 6-core Xeon E5645 processor, and over 39 times faster than a Pascal TITAN-X GPU. Our design provides a practical solution for training GMMs on hundreds of millions of high-dimensional input instances and for exploring better GMM parameters, targeting low-latency and high-performance applications.
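    As a rough software reference for the computation the hardware pipelines, here is one EM iteration for a 1-D Gaussian mixture; the diagonal-covariance case the paper targets factorizes into per-dimension updates of exactly this form. This floating-point sketch omits the paper's fixed-point function evaluation units.

    ```python
    import math

    def gauss(x, mu, var):
        """Gaussian probability density (the quantity the paper's
        fixed-point function evaluation unit computes)."""
        return math.exp(-(x - mu) ** 2 / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)

    def em_step(data, w, mu, var):
        """One EM iteration for a K-component 1-D Gaussian mixture."""
        k_n = len(w)
        # E-step: responsibility of each component for each sample
        resp = []
        for x in data:
            p = [w[k] * gauss(x, mu[k], var[k]) for k in range(k_n)]
            s = sum(p)
            resp.append([pk / s for pk in p])
        # M-step: closed-form parameter updates
        nk = [sum(r[k] for r in resp) for k in range(k_n)]
        w = [nk[k] / len(data) for k in range(k_n)]
        mu = [sum(resp[i][k] * x for i, x in enumerate(data)) / nk[k]
              for k in range(k_n)]
        var = [sum(resp[i][k] * (x - mu[k]) ** 2 for i, x in enumerate(data)) / nk[k]
               + 1e-9  # floor to keep variances strictly positive
               for k in range(k_n)]
        return w, mu, var
    ```

    The E-step is embarrassingly parallel over samples, which is what makes the algorithm pipeline-friendly on an FPGA.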
    Print ISSN: 0018-9340
    Electronic ISSN: 1557-9956
    Topics: Computer Science
  • 63
    Institute of Electrical and Electronics Engineers (IEEE)
    Publication Date: 2017-10-11
    Description: Recent smart devices have adopted heterogeneous multi-core processors which have high-performance big cores and low-power small cores. Unfortunately, the conventional task scheduler for heterogeneous multi-core processors does not provide an appropriate amount of CPU resources for multimedia applications (whose QoS is important to users), resulting in energy waste; it often executes multimedia applications and non-multimedia applications on the same core. In this paper, we propose an advanced task scheduler for heterogeneous multi-core processors, which provides an appropriate amount of CPU resources for multimedia applications. Our proposed task scheduler isolates multimedia applications from non-multimedia applications at runtime, exploiting the fact that multimedia applications have a specific thread for video/audio playback (to play video/audio, a multimedia application must use a function that generates this specific thread). Since multimedia applications usually require a smaller amount of CPU resources than non-multimedia applications, thanks to dedicated hardware decoders, our proposed task scheduler allocates the former to the small cores and the latter to the big cores. In our experiments on an Android-based development board, our proposed task scheduler reduces system-wide (not just CPU) energy consumption by 8.9 percent, on average, compared to the conventional task scheduler, while preserving the QoS of multimedia applications. In addition, it improves the performance of non-multimedia applications by 13.7 percent, on average, compared to the conventional task scheduler.
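    The scheduling rule reduces to a simple classification step. In this hypothetical sketch, the `role` attribute stands in for the kernel-level playback-thread detection described above; real detection would hook the thread-creating function rather than inspect tags.

    ```python
    def assign_cluster(app):
        """Isolate multimedia apps on the LITTLE (small) cores; everything
        else competes for the big cores. Media apps are identified by the
        dedicated playback thread they must create to output video/audio."""
        is_multimedia = any(t.get("role") == "playback" for t in app["threads"])
        return "little" if is_multimedia else "big"
    ```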
    Print ISSN: 0018-9340
    Electronic ISSN: 1557-9956
    Topics: Computer Science
  • 64
    Institute of Electrical and Electronics Engineers (IEEE)
    Publication Date: 2017-10-11
    Description: Flash memory-based SSD-RAIDs are swiftly replacing conventional hard disk drives by exhibiting improved performance and stability, especially in I/O-intensive environments. However, the variations in latency and throughput caused by uncoordinated internal garbage collection hinder further performance gains. In addition, unwanted variations in each SSD can adversely influence the overall performance of the entire flash storage. This performance bottleneck can be substantially reduced by an internal write cache in the RAID controller, designed prudently with the crucial device characteristics in mind. State-of-the-art write caches for RAID controllers fail to incorporate the device characteristics of flash memory-based SSDs and thus limit the achievable performance gain. In this paper, we propose a novel cache design, namely the Layout-Aware Write Cache (LAWC), to overcome the performance barrier caused by independent garbage collections. LAWC implements (i) improved I/O scheduling for logically partitioned write caches, (ii) a destage write synchronization mechanism that allows individual write caches to flush write blocks into the SSD array in a coordinated manner, and (iii) a two-level hybrid cache algorithm utilizing a small front-level cache for improved write cache efficiency. LAWC reduces response time significantly, by 82.39 percent on RAID-0 and 68.51 percent on RAID-5 SSD arrays, when compared with state-of-the-art write cache algorithms.
    Print ISSN: 0018-9340
    Electronic ISSN: 1557-9956
    Topics: Computer Science
  • 65
    Institute of Electrical and Electronics Engineers (IEEE)
    Publication Date: 2017-10-11
    Description: In network-on-chip (NoC) based CMPs, dynamic voltage and frequency scaling (DVFS) is commonly used to co-optimize performance and power. To achieve optimal efficiency, it is important to gain performance growth proportional to power. However, power over/under-provisioning often exists. To properly evaluate and guide NoC DVFS techniques, it is highly desirable to formalize and quantify power over/under-provisioning. In this paper, we first show that application performance does not grow linearly with network power in an NoC-based CMP. Instead, their relationship is non-linear and can be captured using a performance-power characteristics curve (PPCC) with three distinct regions: an inertial region, a linear region, and a saturation region. We note that conventional DVFS metrics such as Performance Per Watt (PPW) cannot accurately evaluate such a non-linear relationship. Based on the PPCC, we propose a new figure of merit called Marginal Performance (MP), which evaluates the incremental performance per power increment after the inertial region. The MP concept enables us to formally define power over- and under-provisioning with reference to the linear region, in which an efficient NoC DVFS should operate. Applying the PPCC and MP concepts in full-system simulations with PARSEC and SPEC OMP2012 benchmarks, we identify power over/under-provisioning occurrences and measure and compare their statistics in two recent NoC DVFS techniques. Moreover, we show evidence that MP can accurately and consistently evaluate NoC DVFS techniques, avoiding the misjudgment and inconsistency of PPW-based evaluations.
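    The MP metric itself is easy to state in code: it is the slope between successive points of a measured PPCC. The sample curve and the saturation floor below are made-up illustrations, not the paper's data.

    ```python
    def marginal_performance(ppcc):
        """ppcc: list of (power, performance) points sorted by power.
        MP is the incremental performance gained per incremental watt."""
        return [(s2 - s1) / (p2 - p1)
                for (p1, s1), (p2, s2) in zip(ppcc, ppcc[1:])]

    def saturated(ppcc, mp_floor=0.5):
        """Flag intervals whose MP has collapsed below an (assumed) floor,
        i.e., where additional network power is over-provisioned."""
        return [mp < mp_floor for mp in marginal_performance(ppcc)]
    ```

    A single PPW number averaged over the whole curve would hide exactly the saturation region this per-interval slope exposes.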
    Print ISSN: 0018-9340
    Electronic ISSN: 1557-9956
    Topics: Computer Science
  • 66
    Institute of Electrical and Electronics Engineers (IEEE)
    Publication Date: 2017-10-11
    Description: Large-scale neural network accelerators are often implemented as a many-core chip and rely on a network-on-chip to manage the huge amount of inter-neuron traffic. The baseline and different variations of the well-known mesh and tree topologies are the most popular topologies in prior many-core implementations of neural networks. However, the grid-like mesh and hierarchical tree topologies suffer from high diameter and low bisection bandwidth, respectively. In this paper, we present ClosNN, a customized Clos topology for Neural Networks. The inherent capability of Clos to support multicast and broadcast traffic in a simple and efficient way, as well as its adaptable bisection bandwidth, is the major motivation behind proposing a customized version of this topology as the communication infrastructure of large-scale neural network implementations. We compare ClosNN with some state-of-the-art NoC topologies adopted in recent neural network hardware accelerators and show that it offers lower average message hop count and higher throughput, which directly translates to faster neural information processing.
    Print ISSN: 0018-9340
    Electronic ISSN: 1557-9956
    Topics: Computer Science
  • 67
    Institute of Electrical and Electronics Engineers (IEEE)
    Publication Date: 2017-10-11
    Description: Smartphones are getting increasingly high-performance with advances in mobile processors and larger main memories to support feature-rich applications. However, the storage subsystem has always been a prohibitive factor that slows down the pace of reaching even higher performance while maintaining good user experience. Although today's smartphones are equipped with larger-than-ever main memories, they consume more energy and still run out of memory. The slow NAND flash based storage rules out swapping, an important technique to extend main memory, and leaves a system that constantly terminates user applications under memory pressure. In this paper, we propose NVM-Swap, revisiting swapping for smartphones with fast, byte-addressable, non-volatile memory (NVM) technologies. Instead of using flash, we build the swap area with NVM, to allow high performance without sacrificing user experience. NVM-Swap supports Lazy Swap-in, which can reduce memory copy operations by giving swapped-out pages a second chance to stay in the byte-addressable NVM-backed swap area. To avoid premature wear-out of the NVM, we also propose Heap-Wear, a wear leveling algorithm that distributes writes in NVM more evenly. Evaluation results based on the Google Nexus 5 smartphone show that our solution can effectively enhance smartphone performance and achieve better wear-leveling of NVM.
    Print ISSN: 0018-9340
    Electronic ISSN: 1557-9956
    Topics: Computer Science
  • 68
    Publication Date: 2017-10-11
    Description: In recent years, Software Defined Routers (SDRs), i.e., programmable routers, have emerged as a viable solution to provide a cost-effective packet processing platform with easy extensibility and programmability. Multi-core platforms significantly promote SDRs' parallel computing capacities, enabling them to adopt artificial intelligence techniques, e.g., deep learning, to manage routing paths. In this paper, we explore new opportunities in packet processing with deep learning to inexpensively shift the computing needs from rule-based route computation to deep learning based route estimation for high-throughput packet processing. Even though deep learning techniques have been extensively exploited in various computing areas, researchers have, to date, not been able to effectively utilize deep learning based route computation for high-speed core networks. We envision a supervised deep learning system to construct the routing tables and show how the proposed method can be integrated with programmable routers using both Central Processing Units (CPUs) and Graphics Processing Units (GPUs). We demonstrate how our uniquely characterized input and output traffic patterns can enhance the route computation of the deep learning based SDRs through both analysis and extensive computer simulations. In particular, the simulation results demonstrate that our proposal outperforms the benchmark method in terms of delay, throughput, and signaling overhead.
    Print ISSN: 0018-9340
    Electronic ISSN: 1557-9956
    Topics: Computer Science
  • 69
    Institute of Electrical and Electronics Engineers (IEEE)
    Publication Date: 2017-10-11
    Description: Synchronous dynamic random access memories (SDRAMs) are widely employed in multi- and many-core platforms due to their high density and low cost. Nevertheless, their benefits come at the price of a complex two-stage access protocol, which reflects their bank-based structure and an internal level of explicitly managed caching. In scenarios in which requestors demand real-time guarantees, these features pose a predictability challenge and, in order to tackle it, several SDRAM controllers have been proposed. In this context, recent research shows that a combination of bank privatization and open-row policy (exploiting the caching over the boundary of a single request) represents an effective way to tackle the problem. However, such an approach uncovered a new challenge: the data bus turnaround overhead. In SDRAMs, a single data bus is shared by read and write operations. Alternating read and write operations is, consequently, highly undesirable, as the data bus must remain idle during a turnaround. Therefore, in this article, we propose an SDRAM controller that reorders read and write commands to minimize data bus turnarounds. Moreover, we compare our approach analytically and experimentally with existing real-time SDRAM controllers from both the worst-case latency and power consumption perspectives.
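    The reordering idea can be sketched as bounded same-direction batching. The batch cap is an assumed knob (bounding it limits worst-case latency for the opposite direction), and this toy model ignores banks, timing constraints, and the controller's real arbitration.

    ```python
    from collections import deque

    def reorder(cmds, cap=8):
        """Drain same-direction commands in batches of at most `cap`,
        so the shared data bus switches direction as rarely as possible."""
        reads = deque(c for c in cmds if c[0] == "R")
        writes = deque(c for c in cmds if c[0] == "W")
        out, cur = [], "R"
        while reads or writes:
            q = reads if cur == "R" else writes
            for _ in range(cap):
                if not q:
                    break
                out.append(q.popleft())
            cur = "W" if cur == "R" else "R"
        return out

    def turnarounds(seq):
        """Count read<->write direction switches on the shared data bus."""
        return sum(1 for a, b in zip(seq, seq[1:]) if a[0] != b[0])
    ```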
    Print ISSN: 0018-9340
    Electronic ISSN: 1557-9956
    Topics: Computer Science
  • 70
    Publication Date: 2017-10-11
    Description: To lower on-chip SRAM area overhead for chip multiprocessors (CMPs), this work treats a novel directory design which compresses p resent-bit v ectors (PVs) by dropping “runs of zeros” commonly existing and lets PVs be transformed to their variations after sharer relinquishment for hashing alternative table sets to lift table utilization. Featured with re linquishment c oherence and co mpressed s harer t racking (ReCoST), the proposed design attains superior directory efficiency and maintains “exact” directory representations, as a result of dropping abound long runs of zeros present in PVs. According to full-system simulation using gem5 for a range of core counts under PARSEC benchmarks, ReCoST is found to enjoy 3.21 $\times$ (or 2.64 $\times$ ) more efficiency in directory storage than conventional bit-tracking directories (or the best directory known so far, called SCD) for a 64-core CMP under monotasking (or multitasking) workloads while ensuring execution slowdowns to stay within 2.4 percent (or 3.3 percent).
    Print ISSN: 0018-9340
    Electronic ISSN: 1557-9956
    Topics: Computer Science
  • 71
    Institute of Electrical and Electronics Engineers (IEEE)
    Publication Date: 2017-10-11
    Description: Shingled Magnetic Recording (SMR) drives can benefit large-scale storage systems by reducing the Total Cost of Ownership (TCO) of dealing with explosive data growth. Among all existing SMR models, Host Aware SMR (HA-SMR) looks the most promising for its backward compatibility with legacy I/O stacks and its ability to use new SMR-specific APIs to support host I/O stack optimization. Building storage systems using HA-SMR drives calls for a deep understanding of the drive's performance characteristics. To accomplish this, we conduct in-depth performance evaluations on HA-SMR drives with a special emphasis on the performance implications of the SMR-specific APIs and how these drives can be deployed in large storage systems. We discover both favorable and adverse effects of using HA-SMR drives under various workloads. We also investigate the drive's performance under legacy production environments using real-world enterprise traces. Finally, we propose a novel host-controlled buffer that can help reduce the severity of the decline in HA-SMR performance under the unfavorable I/O access patterns we discovered. Even without a detailed, comprehensive design, we demonstrate the potential of the host-controlled buffer through a case study.
    Print ISSN: 0018-9340
    Electronic ISSN: 1557-9956
    Topics: Computer Science
  • 72
    Institute of Electrical and Electronics Engineers (IEEE)
    Publication Date: 2017-10-11
    Description: Approximate multipliers are gaining importance in energy-efficient computing and require careful error analysis. In this paper, we present the error probability analysis for recursive approximate multipliers with approximate partial products. Since these multipliers are constructed from smaller approximate multiplier building blocks, we propose to derive the error probability of an arbitrary bit-width multiplier from the probabilistic model of the basic building block and the probability distributions of the inputs. The analysis is based on common features of recursive multipliers identified by carefully studying the behavioral models of state-of-the-art designs. Building further upon the analysis, the Probability Mass Function (PMF) of the error is computed by individually considering all possible error cases and their inter-dependencies. We further discuss generalizations for approximate adder trees, signed multipliers, squarers, and constant multipliers. The proposed analysis is validated by applying it to several state-of-the-art approximate multipliers and comparing with corresponding simulation results. The results show that the proposed analysis serves as an effective tool for predicting, evaluating, and comparing the accuracy of various multipliers, and that for the majority of the recursive multipliers we obtain accurate error performance evaluation. We also predict the multipliers' performance in an image processing application to demonstrate the practical significance of the analysis.
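    For the smallest building block, the error PMF can be obtained by exhaustive enumeration, which is the ground truth any analytical model must match. The sketch below uses the well-known 2x2 approximate multiplier that computes 3*3 as 7 as an example block; the paper's analytical composition of block PMFs into larger recursive multipliers is not reproduced here.

    ```python
    from collections import Counter
    from fractions import Fraction

    def approx_mult_2x2(a, b):
        """Classic approximate 2x2 multiplier: exact on all inputs except
        3*3, which it computes as 7 (saving the most significant output bit)."""
        return 7 if a == 3 and b == 3 else a * b

    def error_pmf(mult, bits=2):
        """Exhaustive error PMF of a small building block, assuming
        uniformly distributed inputs."""
        n = 1 << bits
        errs = Counter(mult(a, b) - a * b for a in range(n) for b in range(n))
        return {e: Fraction(c, n * n) for e, c in errs.items()}
    ```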
    Print ISSN: 0018-9340
    Electronic ISSN: 1557-9956
    Topics: Computer Science
  • 73
    Institute of Electrical and Electronics Engineers (IEEE)
    Publication Date: 2017-10-11
    Description: In cryptography, counters (classically encoded as bit strings of a fixed size for all inputs) are employed to prevent collisions on the inputs of the underlying primitive, which helps us to prove security. In this paper we present a unified notion for counters, called the counter function family, and identify some necessary and sufficient conditions on counters which give (possibly) simple proofs of security for various counter-based cryptographic schemes. We observe that these conditions are trivially true for the classical counters. We also identify and study two variants of the classical counter which satisfy the security conditions. The first variant has a message-length-dependent counter size, whereas the second variant uses universal coding to generate a message-length-independent counter size. Furthermore, these variants provide better performance for shorter messages. For instance, when the message size is $2^{19}$ bits, AES-LightMAC with a $64$-bit (classical) counter takes $1.51$ cycles per byte (cpb), whereas it takes $0.81$ cpb and $0.89$ cpb for the first and second variant, respectively. We benchmark the software performance of these variants against the classical counter by implementing them in MACs and a HAIFA hash function.
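    A message-length-independent counter via universal coding can be illustrated with Elias gamma, a standard universal code (not necessarily the paper's exact choice): small counter values get short, self-delimiting encodings, so short messages pay only for short counters.

    ```python
    def elias_gamma(n):
        """Elias gamma code for a positive integer: a unary prefix giving
        the bit-length, followed by the value in binary. The encoded size
        grows with the counter's value, not a fixed worst-case bound."""
        assert n >= 1
        bits = bin(n)[2:]                    # binary, no '0b' prefix
        return "0" * (len(bits) - 1) + bits  # unary length prefix + value
    ```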
    Print ISSN: 0018-9340
    Electronic ISSN: 1557-9956
    Topics: Computer Science
  • 74
    Institute of Electrical and Electronics Engineers (IEEE)
    Publication Date: 2017-05-10
    Description: A novel hierarchical fault-tolerance methodology for reconfigurable devices is presented. A bespoke multi-reconfigurable FPGA architecture, the programmable analogue and digital array (PAnDA), is introduced, allowing finer-grained reconfiguration than any other FPGA architecture currently in existence. Fault-blind circuit repair strategies, which require no specific information about the nature or location of faults, are developed, exploiting architectural features of PAnDA. Two fault recovery techniques, stochastic and deterministic strategies, are proposed, and results for each, as well as a comparison of the two, are presented. Both approaches are based on algorithms performing fine-grained hierarchical partial reconfiguration on faulty circuits in order to repair them. While the stochastic approach provides insights into the feasibility of the method, the deterministic approach aims to generate optimal repair strategies for generic faults induced into a specific circuit. It is shown that both techniques successfully repair the benchmark circuits used after random faults are induced in random circuit locations, and the deterministic strategies are shown to operate efficiently and effectively after optimisation for a specific use case. The methods are shown to be generally applicable to any circuit on PAnDA and to be straightforwardly customisable for any FPGA fabric providing some regularity and symmetry in its structure.
    Print ISSN: 0018-9340
    Electronic ISSN: 1557-9956
    Topics: Computer Science
  • 75
    Institute of Electrical and Electronics Engineers (IEEE)
    Publication Date: 2017-05-10
    Description: This paper presents a silicon-proven fault-tolerant FPGA architecture that can repair a wide range of hardware faults. The proposed architecture does not require fine-grained fault location, and the error map is stored in non-volatile resistive memory that is monolithically integrated on top of the CMOS circuit. Redundancy operations are fully self-contained and do not affect data streaming in and out of the FPGA. A power gating scheme is implemented to save idle leakage power and fix hardware faults in the power network. Significant yield enhancement is expected using this architecture. The architecture has been verified in a test chip fabricated in 28 nm technology. Redundancy operation is solely controlled by on-chip fault locators, which are HfO$_2$-based resistive memories monolithically integrated after the CMOS process. The maximum shift in performance is about 2 percent when the redundancy is engaged, and the power footprint is unaffected.
    Print ISSN: 0018-9340
    Electronic ISSN: 1557-9956
    Topics: Computer Science
  • 76
    Publication Date: 2017-05-10
    Print ISSN: 0018-9340
    Electronic ISSN: 1557-9956
    Topics: Computer Science
  • 77
    Institute of Electrical and Electronics Engineers (IEEE)
    Publication Date: 2017-05-10
    Description: Runtime reconfigurable architectures based on Field-Programmable Gate Arrays (FPGAs) allow area- and power-efficient acceleration of complex applications. However, being manufactured in the latest semiconductor process technologies, FPGAs are increasingly prone to aging effects, which reduce the reliability and lifetime of such systems. Aging mitigation and fault tolerance techniques for the reconfigurable fabric become essential to realize dependable reconfigurable architectures. This article presents an accelerator diversification method that creates multiple configurations for runtime reconfigurable accelerators that are diversified in their usage of Configurable Logic Blocks (CLBs). In particular, it creates a minimal number of configurations such that all single-CLB and some multi-CLB faults can be tolerated. For each fault we ensure that there is at least one configuration that does not use that CLB. Second, a novel runtime accelerator placement algorithm is presented that exploits the diversity in resource usage of these configurations to balance the stress imposed by executions of the accelerators on the reconfigurable fabric. By tracking the stress due to accelerator usage at runtime, the stress is balanced both within a reconfigurable region as well as over all reconfigurable regions of the system. The accelerator placement algorithm also considers faulty CLBs in the regions and selects the appropriate configuration such that the system maintains high performance in the presence of multiple permanent faults. Experimental results demonstrate that our methods deliver up to 3.7$\times$ higher performance in the presence of faults at marginal runtime cost and 1.6$\times$ higher MTTF than state-of-the-art aging mitigation methods.
    Print ISSN: 0018-9340
    Electronic ISSN: 1557-9956
    Topics: Computer Science
  • 78
    Publication Date: 2017-05-10
    Description: The significant increase of static power in the nano-CMOS era and, subsequently, the end of Dennard scaling have put a Power Wall in front of further integration of CMOS technology in Field-Programmable Gate Arrays (FPGAs). An efficient solution to cope with this obstacle is power gating inactive fractions of a single die, resulting in Dark Silicon. Previous studies employing power gating on SRAM-based FPGAs have primarily focused on using large-input Look-up Tables (LUTs). The architectures proposed in such studies inherently suffer from poor logic utilization, which limits the benefits of power gating techniques. This paper proposes a Power-Efficient Architecture for FPGAs (PEAF) based on a combination of Reconfigurable Hard Logics (RHLs) and a small-input LUT. In the proposed architecture, we selectively turn off unused RHLs and/or LUTs within each logic block by employing a reconfigurable controller. By mapping a majority of logic functions to simple-design RHLs, PEAF is able to significantly improve power efficiency without deteriorating performance. Experimental results over a comprehensive set of benchmarks (MCNC, IWLS'05, and VTR) demonstrate that, compared with the baseline four-LUT architecture, PEAF reduces the total static power and Power-Delay-Product (PDP), on average, by 24.5 and 21.7 percent, respectively, while the overall system performance is also improved by 1.8 percent. PEAF increases total area by 18.9 percent; however, it still occupies 22.1 percent less area than the six-LUT architecture, with a 31.5 percent improvement in PDP.
    Print ISSN: 0018-9340
    Electronic ISSN: 1557-9956
    Topics: Computer Science
  • 79
    Institute of Electrical and Electronics Engineers (IEEE)
    Publication Date: 2017-05-10
    Description: Field-programmable gate-arrays (FPGAs) have evolved to include embedded memory, high-speed I/O interfaces and processors, making them both more efficient and easier to use for compute acceleration and networking applications. However, implementing on-chip communication is still a designer's burden wherein custom system-level buses are implemented using the fine-grained FPGA logic and interconnect fabric. Instead, we propose augmenting FPGAs with an embedded network-on-chip (NoC) to implement system-level communication. We design custom interfaces to connect a packet-switched NoC to the FPGA fabric and I/Os in a configurable and efficient way, and then define the necessary conditions to implement common FPGA design styles with an embedded NoC. Four application case studies highlight the advantages of using an embedded NoC. We show that access latency to external memory can be $\sim$1.5$\times$ lower. Our application case study with image compression shows that an embedded NoC improves frequency by 10-80%, reduces utilization of scarce long wires by 40%, and makes design easier and more predictable. Additionally, we leverage the embedded NoC in creating a programmable Ethernet switch that can support up to 819 Gb/s, 5$\times$ more switching bandwidth at 3$\times$ lower area compared to previous work. Finally, we design a 400 Gb/s NoC-based packet processor that is very flexible and more efficient than other FPGA-based packet processors.
    Print ISSN: 0018-9340
    Electronic ISSN: 1557-9956
    Topics: Computer Science
  • 80
    facet.materialart.
    Unknown
    Institute of Electrical and Electronics Engineers (IEEE)
    Publication Date: 2017-05-10
    Description: Reconfigurable systems are gaining increasing interest in the domain of safety-critical applications, for example in the space and avionic domains. In fact, the capability of reconfiguring the system during run-time execution and the high computational power of modern Field Programmable Gate Arrays (FPGAs) make these devices suitable for intensive data processing tasks. Moreover, such systems must also guarantee the abilities of self-awareness, self-diagnosis and self-repair in order to cope with errors due to the harsh conditions typically existing in some environments. In this paper we propose a self-repairing method for partially and dynamically reconfigurable systems, applied at a fine granularity. Our method is able to detect, correct, and recover from errors using the run-time capabilities offered by modern SRAM-based FPGAs. Fault injection campaigns have been executed on a dynamically reconfigurable system embedding a number of benchmark circuits. Experimental results demonstrate that our method achieves full detection of single and multiple errors, while significantly improving system availability with respect to traditional error detection and correction methods.
    Print ISSN: 0018-9340
    Electronic ISSN: 1557-9956
    Topics: Computer Science
  • 81
    Institute of Electrical and Electronics Engineers (IEEE)
    Publication Date: 2017-05-10
    Description: Providing low latency and high throughput is an important design feature of an I/O scheduler. Especially when multiple applications share and compete for a storage resource, the operating system is required to schedule the I/O requests for maximum throughput. Recently, flash-based storage in the form of the solid-state drive (SSD) has become popular in various computing systems, and traditional scheduling algorithms have been researched and tuned for the emerging flash-based storage. However, SSDs suffer from a contention problem caused by multiple I/O requests and experience significant performance degradation, mainly due to concurrent accesses to a finite set of flash memory chips. In this paper we propose Dynamic Load Balanced Queuing (DLBQ), which reorders the I/O requests and evenly distributes the accesses over the flash memory chips to avoid contention. For that purpose, we introduce a virtual time method that tracks the run-time status of the SSD. We have evaluated the throughput and latency of DLBQ against four existing I/O schedulers with micro-benchmarks and server benchmarks. The experimental results show that DLBQ improves throughput by 11 percent on a 128 GB SSD and 15 percent on a 256 GB SSD while ensuring bounded latency.
    Print ISSN: 0018-9340
    Electronic ISSN: 1557-9956
    Topics: Computer Science
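The virtual-time idea in the DLBQ abstract above can be sketched in a few lines. This is a toy illustration under our own assumptions (unit service time, free choice among pending requests), not the paper's algorithm:

```python
def dlbq_schedule(requests, n_chips, service_time=1.0):
    """Toy virtual-time dispatcher: each request is (request_id, chip).

    A per-chip 'virtual time' tracks the estimated backlog; at each step
    we dispatch a pending request whose target chip currently has the
    smallest virtual time, so accesses spread evenly over the chips.
    """
    vtime = [0.0] * n_chips      # estimated backlog per flash chip
    pending = list(requests)
    order = []
    while pending:
        # pick the pending request whose chip is least loaded
        nxt = min(pending, key=lambda r: vtime[r[1]])
        pending.remove(nxt)
        vtime[nxt[1]] += service_time
        order.append(nxt)
    return order
```

Note how three requests queued on chip 0 get interleaved with the single request on chip 1 instead of monopolizing the device.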
  • 82
    Institute of Electrical and Electronics Engineers (IEEE)
    Publication Date: 2017-05-10
    Description: Content-addressable memories (CAMs) are a type of memory that receives an input search key and compares it to every entry of a table of stored keys. If there is a match, they return the corresponding address at which the key was found. Alternatively, they can include an associated Random Access Memory (RAM) that is accessed with the matching address, returning the corresponding data values. To protect a CAM with associated RAM against errors, parity or error-correction codes (ECCs) are typically used. These usually protect the CAM and the RAM information separately, incurring additional storage overhead. This paper proposes a scheme to protect some configurations of a CAM with its associated RAM from errors with a single ECC code. This ECC code can be used to provide advanced error correction to the combination of the key and data values stored in the CAM and the RAM, but it can also be applied in a modular way to provide simpler protection to the key or to the values individually.
    Print ISSN: 0018-9340
    Electronic ISSN: 1557-9956
    Topics: Computer Science
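A single error-correcting code over the concatenation of key and data bits, as the abstract above describes, can be illustrated with a textbook Hamming code. The construction below is a generic single-error-correcting sketch, not the paper's exact code:

```python
def hamming_encode(data):
    """Single-error-correcting Hamming code over a bit list.

    Parity bits sit at power-of-two positions (1-indexed); each parity
    bit covers the positions whose index has that bit set, so the XOR
    of all set positions of a valid codeword is zero.
    """
    n_parity = 0
    while (1 << n_parity) < len(data) + n_parity + 1:
        n_parity += 1
    code, it, pos = [], iter(data), 1
    while len(code) < len(data) + n_parity:
        # reserve power-of-two positions for parity, fill the rest with data
        code.append(0 if (pos & (pos - 1)) == 0 else next(it))
        pos += 1
    for p in range(n_parity):
        mask = 1 << p
        par = 0
        for i, bit in enumerate(code, start=1):
            if i & mask:
                par ^= bit
        code[mask - 1] = par
    return code

def hamming_correct(code):
    """Return (corrected_code, error_position); position 0 means clean."""
    syndrome = 0
    for i, bit in enumerate(code, start=1):
        if bit:
            syndrome ^= i
    if syndrome:
        code = list(code)
        code[syndrome - 1] ^= 1   # flip the erroneous bit back
    return code, syndrome
```

In the paper's setting, `data` would be the key bits concatenated with the value bits, so one code word protects both fields at once.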
  • 83
    Institute of Electrical and Electronics Engineers (IEEE)
    Publication Date: 2017-05-10
    Description: We present a computationally efficient technique to build concise and accurate computational models for large (60 or more inputs, 1 output) Boolean functions, only a very small fraction of whose truth table is known during model building. We use Genetic Programming with Boolean logic operators, and enhance the accuracy of the technique using Reduced Ordered Binary Decision Diagram-based representations of Boolean functions, whereby we exploit their canonical forms. We demonstrate the effectiveness of the proposed technique by successfully modeling several common Boolean functions, and ultimately by accurately modeling a 63-input Physically Unclonable Function circuit design on a Xilinx Field Programmable Gate Array. We achieve better accuracy (at lower computational overhead) in predicting truth table entries not seen during model building than a previously proposed machine-learning-based modeling technique for similar Physically Unclonable Function circuits that uses Support Vector Machines. The success of this modeling technique has important implications for determining the acceptability of Physically Unclonable Functions as useful hardware security primitives, in applications such as anti-counterfeiting of integrated circuits.
    Print ISSN: 0018-9340
    Electronic ISSN: 1557-9956
    Topics: Computer Science
  • 84
    Institute of Electrical and Electronics Engineers (IEEE)
    Publication Date: 2017-05-10
    Description: In this paper, we deal with the problem of efficiently assessing the higher-order vulnerability of a hardware cryptographic circuit. Our main concern is to provide methods that allow a circuit designer to detect early in the design cycle if the implementation of a Boolean-additive masking countermeasure does not hold up to the required protection order. To achieve this goal, we promote the search for vulnerabilities from a statistical problem to a purely symbolical one, and then provide a method for reasoning about this new symbolical interpretation. Finally, we show, with a synthetic example, how the proposed conceptual tool can be used to explore the vulnerability space of a cryptographic primitive.
    Print ISSN: 0018-9340
    Electronic ISSN: 1557-9956
    Topics: Computer Science
  • 85
    Institute of Electrical and Electronics Engineers (IEEE)
    Publication Date: 2017-05-10
    Description: Collective I/O is a widely used middleware technique that exploits I/O access correlation among multiple processes to improve I/O system performance. However, most existing implementations of collective I/O strategies are designed and optimized for homogeneous I/O systems. In practice, the homogeneity assumptions do not hold in heterogeneous parallel I/O systems, which consist of both HDD- and SSD-based servers and are increasingly promising. In this paper, we propose a heterogeneity-aware collective-I/O (HACIO) strategy to enhance the performance of conventional collective I/O operations. HACIO reorders the I/O requests of each aggregator with awareness of the storage performance of the heterogeneous servers, so that the hardware of the systems can be better utilized. We have implemented HACIO in ROMIO, a widely used MPI-IO library. Experimental results show that HACIO can significantly increase the I/O throughput of heterogeneous I/O systems.
    Print ISSN: 0018-9340
    Electronic ISSN: 1557-9956
    Topics: Computer Science
  • 86
    Institute of Electrical and Electronics Engineers (IEEE)
    Publication Date: 2017-05-10
    Description: Since the discovery of identity-based encryption schemes in 2000, bilinear pairings have been used in the design of hundreds of cryptographic protocols. The most commonly used pairings are constructed from elliptic curves over finite fields with small embedding degree. These pairings can have different security, performance, and functionality characteristics, and were therefore classified into Types 1, 2, 3 and 4. In this paper, we observe that this conventional classification is not applicable to pairings from elliptic curves with embedding degree one. It is important to understand the security, efficiency, and functionality of these pairings in light of recent attacks on certain pairings constructed from elliptic curves with embedding degree greater than one. We define three kinds of pairings from elliptic curves with embedding degree one, discuss some subtleties with using them to implement pairing-based protocols, and provide an estimated cost of implementing them on modern processors.
    Print ISSN: 0018-9340
    Electronic ISSN: 1557-9956
    Topics: Computer Science
  • 87
    Institute of Electrical and Electronics Engineers (IEEE)
    Publication Date: 2017-05-10
    Description: Parallel file systems (PFSs) are commonly used in high-end computing systems. With the emergence of solid-state drives (SSDs), a hybrid PFS, which consists of both HDD and SSD servers, provides a practical I/O system solution for data-intensive applications. However, most existing data layout schemes are inefficient for hybrid PFSs due to their unawareness of server heterogeneity and of workload changes in different parts of a file. In this study, we propose a heterogeneity-aware region-level data layout scheme, HARL, to improve the data distribution of a hybrid PFS. HARL first divides a file into fine-grained regions of varying size according to the workload features of an application, then determines appropriate file stripe sizes on the servers for each region based on the performance of the heterogeneous servers. Furthermore, to further improve the performance of a hybrid PFS, we propose a dynamic region-level layout scheme, HARL-D, which creates multiple replicas of each region and redirects file requests to the replicas with the lowest access costs at runtime. Experimental results of representative benchmarks and a real application show that HARL can greatly improve I/O system performance, and demonstrate the advantages of HARL-D over HARL.
    Print ISSN: 0018-9340
    Electronic ISSN: 1557-9956
    Topics: Computer Science
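The region-level stripe sizing in the HARL abstract above can be sketched as a proportional-share heuristic: faster (e.g. SSD) servers absorb a larger stripe of each region. The rounding rule and granule size here are our assumptions, not the paper's exact policy:

```python
def region_stripe_sizes(region_size, bandwidths, granule=4096):
    """Split one region across heterogeneous servers.

    Each server gets a stripe proportional to its measured bandwidth,
    rounded down to a granule (and never below one granule), so SSD
    servers receive larger stripes than HDD servers.
    """
    total_bw = float(sum(bandwidths))
    raw = [region_size * bw / total_bw for bw in bandwidths]
    return [max(granule, int(r // granule) * granule) for r in raw]
```

For a 1 MiB region and servers with a 1:3 bandwidth ratio, the faster server receives three quarters of the region.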
  • 88
    Institute of Electrical and Electronics Engineers (IEEE)
    Publication Date: 2017-05-10
    Description: Fast non-volatile memory (NVM) technologies (e.g., phase change memory, spin-transfer torque memory, and MRAM) bring high performance to legacy storage systems. These NVM technologies have attractive features, such as low latency and high throughput, that satisfy application performance requirements. Accordingly, fast storage devices based on fast NVM are in rapidly increasing demand for diverse computer systems and environments (e.g., cloud platforms, web servers, and database systems), where they are expected to be used as primary storage. Despite the promised benefits of fast storage devices, modern file systems do not take advantage of the storage's full performance. In this article, we analyze and explore existing I/O strategies in the read, write, journal I/O, and recovery paths between the file system and the storage device. The analysis shows that existing I/O strategies are an obstacle to getting the maximum performance out of fast storage devices. To address this issue, we propose efficient I/O strategies that enable file systems to fully exploit the performance of fast storage devices. Our main idea is to transfer requests from discontiguous host memory buffers in the file system to discontiguous storage segments in one I/O request, maximizing I/O performance. We implemented our scheme in the read, write, journal I/O, and recovery operations of the EXT4 file system and the JBD2 module. We demonstrate the implications of our idea for application performance through well-known benchmarks. The experimental results show that our optimized file system achieves better performance than the existing file system, with improvements of up to 1.54×, 1.96×, and 2.28× in ordered mode, data journaling mode, and recovery, respectively.
    Print ISSN: 0018-9340
    Electronic ISSN: 1557-9956
    Topics: Computer Science
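The core idea above, moving several discontiguous host buffers in one I/O request, is what vectored (scatter-gather) I/O provides at the syscall level. A minimal Unix-only sketch using Python's os.writev; the paper works inside EXT4/JBD2 rather than from user space, so this only illustrates the batching concept:

```python
import os

def write_discontiguous(fd, buffers):
    """Submit several discontiguous buffers as a single vectored write
    (one os.writev syscall, Unix-only) instead of one write() per
    buffer. Returns the total number of bytes written."""
    return os.writev(fd, buffers)
```

Three separate buffers land contiguously on storage with a single request, which is the batching effect the paper exploits in the kernel I/O paths.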
  • 89
    Institute of Electrical and Electronics Engineers (IEEE)
    Publication Date: 2017-05-13
    Description: Deep neural networks (DNNs) have been shown to be very effective at solving challenging problems in several areas of computing, including vision, speech, and natural language processing. However, traditional platforms for implementing these DNNs are often very power hungry, which has led to significant efforts in the development of configurable platforms capable of implementing DNNs efficiently. One of these platforms, the IBM TrueNorth processor, has demonstrated very low operating power when performing visual computing and neural network classification tasks in real time. The neuron computation, synaptic memory, and communication fabrics are all configurable, so that a wide range of network types and topologies can be mapped to TrueNorth. This reconfigurability translates into the capability to support a wide range of low-power functions in addition to feed-forward DNN classifiers, including, for example, the audio processing functions presented here. In this work, we propose an end-to-end audio processing pipeline that is implemented entirely on a TrueNorth processor and designed specifically to leverage the highly parallel, low-precision computing primitives TrueNorth offers. As part of this pipeline, we develop an audio feature extractor (LATTE) designed for implementation on TrueNorth, and explore the tradeoffs among several design variants in terms of accuracy, power, and performance. We customize the energy-efficient deep neuromorphic network structures that our design uses as the classifier, and show how classifier parameters can trade power against accuracy. In addition to enabling a wide range of diverse functions, the reconfigurability of TrueNorth enables re-training and re-programming the system to satisfy varying energy, speed, area, and accuracy requirements. The resulting system's end-to-end power consumption can be as low as 14.43 mW, which would give up to 100 hours of continuous usage with button-cell batteries (CR3023, 1.5 Whr) or 450 hours with cellphone batteries (iPhone 6s, 6.55 Whr).
    Print ISSN: 0018-9340
    Electronic ISSN: 1557-9956
    Topics: Computer Science
  • 90
    Institute of Electrical and Electronics Engineers (IEEE)
    Publication Date: 2017-05-05
    Description: Most computer systems authenticate users only once at the time of initial login, which can lead to security concerns. Continuous authentication has been explored as an approach for alleviating such concerns. Previous methods for continuous authentication primarily use biometrics, e.g., fingerprint and face recognition, or behaviometrics, e.g., key stroke patterns. We describe CABA, a novel continuous authentication system that is inspired by and leverages the emergence of sensors for pervasive and continuous health monitoring. CABA authenticates users based on their BioAura, an ensemble of biomedical signal streams that can be collected continuously and non-invasively using wearable medical devices. While each such signal may not be highly discriminative by itself, we demonstrate that a collection of such signals, along with robust machine learning, can provide high accuracy levels. We demonstrate the feasibility of CABA through analysis of traces from the MIMIC-II dataset. We propose various applications of CABA, and describe how it can be extended to user identification and adaptive access control authorization. Finally, we discuss possible attacks on the proposed scheme and suggest corresponding countermeasures.
    Print ISSN: 0018-9340
    Electronic ISSN: 1557-9956
    Topics: Computer Science
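The BioAura ensemble idea above can be illustrated with a weighted average of weak per-signal match scores: no single biomedical stream is discriminative, but the ensemble is. The scoring and threshold below are hypothetical stand-ins for the paper's trained machine learning models:

```python
def ensemble_score(scores, weights=None):
    """Combine weak per-signal match scores (each in [0, 1]) into one
    ensemble score via a weighted average."""
    if weights is None:
        weights = [1.0] * len(scores)
    return sum(s * w for s, w in zip(scores, weights)) / sum(weights)

def authenticate(scores, threshold=0.7):
    """Accept the user when the ensemble score clears a threshold
    (threshold value is an illustrative assumption)."""
    return ensemble_score(scores) >= threshold
```

Mediocre individual scores (e.g. 0.6-0.9) can still yield a confident joint decision, while uniformly weak scores are rejected.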
  • 91
    Institute of Electrical and Electronics Engineers (IEEE)
    Publication Date: 2017-05-05
    Description: For large-scale graph analysis on a single PC, asynchronous processing methods are known to converge more quickly than the synchronous approach because of more efficient propagation of vertex state. However, current asynchronous methods are still very suboptimal in propagating state across different graph partitions. This creates a bottleneck for cross-partition state updates and slows down the convergence of the processing task. To tackle this problem, we propose a new method, named HotGraph, that speeds up graph processing by extracting a backbone structure, called a hot graph, that spans all the partitions of the original graph. With this approach, most of the cross-partition state propagations of traditional solutions now take place within only a few hot graph partitions, thus removing the cross-partition bottleneck. We also develop a partition scheduling algorithm that maximizes the hot graph's effectiveness by keeping it in memory and assigning it the highest processing priority as much as possible. A forward and backward sweeping execution strategy is then proposed to further accelerate convergence. Experimental results show that HotGraph reduces the number of vertex state updates by 51.5 percent compared with state-of-the-art schemes. Applying our optimizations further reduces this number by 72.6 percent and the execution time by 80.8 percent.
    Print ISSN: 0018-9340
    Electronic ISSN: 1557-9956
    Topics: Computer Science
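The backbone extraction described above can be sketched with a simple heuristic: take a few high-degree vertices from every partition, so the resulting hot graph spans all partitions and carries most cross-partition state propagation. The selection rule is our assumption, not the paper's exact construction:

```python
def extract_hot_graph(partitions, degree, budget_per_part=2):
    """Pick the highest-degree vertices of each partition as the
    backbone ('hot graph').

    partitions: list of vertex lists, one per partition.
    degree:     dict mapping vertex -> degree in the full graph.
    """
    hot = set()
    for part in partitions:
        ranked = sorted(part, key=lambda v: degree[v], reverse=True)
        hot.update(ranked[:budget_per_part])   # spans every partition
    return hot
```

Because the backbone is small, it can be pinned in memory and scheduled first, which is what the paper's partition scheduler then exploits.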
  • 92
    Institute of Electrical and Electronics Engineers (IEEE)
    Publication Date: 2017-05-05
    Description: Hybrid memory comprising a big SCM and a little DRAM (BSLD) is widely studied to address the growing power consumption of pure DRAM. However, the performance degradation, limited endurance, and immature mass production of ultra-high-density SCM remain pain points of BSLD. Here we propose, for the first time, a Retention-Aware Hybrid Main Memory (RAHMM) architecture with a big DRAM and a little SCM (BDLS). DRAM is refreshed at a much longer interval by using the SCM to store the small number of leaky tail bits in DRAM. A two-step search technique combined with outcome forecasting is put forward to achieve ultra-fast read access as well as to diminish the power and performance overheads. A hidden buffer strategy (HBS) is proposed to optimize write performance and mitigate endurance degradation. The experimental results show a 45 percent reduction in power consumption and a 30 percent performance improvement, which are significant gains over both serial and parallel BSLD of comparable capacity.
    Print ISSN: 0018-9340
    Electronic ISSN: 1557-9956
    Topics: Computer Science
  • 93
    Publication Date: 2017-05-05
    Description: Due to physical limitations, mobile devices are restricted in memory, battery, and processing power, among other resources. As a result, many applications cannot run on such devices. This problem is addressed by Edge Cloud Computing, where users offload tasks they cannot run locally to cloudlet servers at the edge of the network. The main requirement of such a system is a low Service Delay, which corresponds to a high Quality of Service. This paper presents a method for minimizing Service Delay in a scenario with two cloudlet servers. The method has a dual focus on computation and communication elements, controlling Processing Delay through virtual machine migration and improving Transmission Delay with Transmission Power Control. The foundation of the proposal is a mathematical model of the scenario, whose analysis is used in a comparison between the proposed approach and two other conventional methods; these methods have a single focus and only make an effort to improve either Transmission Delay or Processing Delay, but not both. As expected, the proposal presents the lowest Service Delay in all study cases, corroborating our conclusion that a dual-focus approach is the best way to tackle the Service Delay problem in Edge Cloud Computing.
    Print ISSN: 0018-9340
    Electronic ISSN: 1557-9956
    Topics: Computer Science
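The dual-focus argument above can be made concrete with a toy delay model: Service Delay is the sum of Transmission Delay and Processing Delay, so the best offloading choice minimizes the sum rather than either term alone. The M/M/1-style processing term below is our assumption, not the paper's exact equations:

```python
def service_delay(task_bits, rate_bps, load, capacity):
    """Service Delay = Transmission Delay + Processing Delay.

    Transmission: task size over link rate. Processing: simple
    M/M/1-style queueing term 1/(capacity - load), valid for
    load < capacity.
    """
    transmission = task_bits / rate_bps
    processing = 1.0 / (capacity - load)
    return transmission + processing

def best_cloudlet(task_bits, cloudlets):
    """Dual-focus choice: minimize the SUM of both delay components."""
    return min(cloudlets,
               key=lambda c: service_delay(task_bits,
                                           c["rate"], c["load"], c["cap"]))
```

A cloudlet with the fastest link is not necessarily best if its processor is heavily loaded, which is exactly why single-focus methods fall short.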
  • 94
    Institute of Electrical and Electronics Engineers (IEEE)
    Publication Date: 2017-05-05
    Description: GPU design trends show that the register file size will continue to increase to enable even more thread-level parallelism. As a result, the register file consumes a large fraction of the total GPU chip power. This paper explores register file data compression for GPUs to improve power efficiency. Compression reduces the width of the register file read and write operations, which in turn reduces dynamic power. This work is motivated by the observation that the register values of threads within the same warp are similar, namely that the arithmetic differences between two successive thread registers are small. Compression exploits this value similarity by removing data redundancy among register values. Without decompressing operand values, some instructions can be processed inside the register file, which saves further energy by minimizing data movement and processing in the power-hungry main execution unit. Evaluation results show that the proposed techniques save 25 percent of the total register file energy consumption and 21 percent of the total execution unit energy consumption with negligible performance impact.
    Print ISSN: 0018-9340
    Electronic ISSN: 1557-9956
    Topics: Computer Science
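The value-similarity observation above maps naturally onto base-delta compression: store one full base value per warp plus narrow per-lane deltas. A minimal sketch; the lane count and delta encoding are our assumptions, not the paper's hardware format:

```python
def compress_warp(values):
    """Base-delta compression of one warp's register values.

    Keep the first lane's value as the base and per-lane differences
    as deltas; 'width' is the bits needed per signed delta, which is
    far less than 32 when lanes hold similar values.
    """
    base = values[0]
    deltas = [v - base for v in values]
    width = max(abs(d).bit_length() + 1 for d in deltas)  # +1 sign bit
    return base, deltas, width

def decompress_warp(base, deltas):
    """Reconstruct the original per-lane values."""
    return [base + d for d in deltas]
```

Four similar 32-bit values compress to one base plus 4-bit deltas here, which is the width reduction that cuts read/write dynamic power.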
  • 95
    Institute of Electrical and Electronics Engineers (IEEE)
    Publication Date: 2017-03-11
    Description: Making modern computer systems energy-efficient is of paramount importance. Dynamic Voltage and Frequency Scaling (DVFS) is widely used to manage the energy and power consumption in modern processors; however, for DVFS to be effective, we need the ability to accurately predict the performance impact of scaling a processor's voltage and frequency. No accurate performance predictors exist for multithreaded applications, let alone managed language applications. In this work, we propose DEP+BURST, a new performance predictor for managed multithreaded applications that takes into account synchronization, inter-thread dependencies, and store bursts, which frequently occur in managed language workloads. Our predictor lowers the performance estimation error from 27 percent for a state-of-the-art predictor to 6 percent on average for a set of multithreaded Java applications when the frequency is scaled from 1 to 4 GHz. We also propose a novel energy management framework that uses DEP+BURST to reduce energy consumption. We first target reducing the processor's energy consumption by lowering its frequency, and hence its power consumption, while staying within a user-specified maximum slowdown threshold. For slowdown thresholds of 5 and 10 percent, our energy manager reduces the energy consumed by the memory-intensive benchmarks by 13 and 19 percent on average, respectively. We then use the energy manager to optimize total system energy, achieving an average reduction of 15.6 percent for a set of Java benchmarks. Accurate performance predictors are key to achieving high performance while keeping energy consumption low for managed language applications using DVFS.
    Print ISSN: 0018-9340
    Electronic ISSN: 1557-9956
    Topics: Computer Science
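A first-order version of the prediction problem above separates compute time, which scales with frequency, from memory stall time, which does not. This is the common baseline model only; DEP+BURST additionally accounts for synchronization, inter-thread dependencies, and store bursts:

```python
def predict_runtime(t_compute, t_memory, f_from_ghz, f_to_ghz):
    """First-order DVFS performance model.

    Compute-bound time stretches or shrinks with the clock
    (t_compute * f_from / f_to), while memory stall time is assumed
    frequency-independent.
    """
    return t_compute * (f_from_ghz / f_to_ghz) + t_memory
```

Doubling the frequency halves only the compute portion, so memory-intensive workloads see diminishing returns from higher clocks, which is exactly where accurate predictors let the energy manager lower the frequency safely.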
  • 96
    Publication Date: 2017-03-11
    Description: In the last decade, improvements in technology scaling have enabled the design of a novel generation of wearable bio-sensing monitors. These smart Wireless Body Sensor Nodes (WBSNs) are able to acquire and process biological signals, such as electrocardiograms, for periods of time extending from hours to days. The energy required for on-node digital signal processing (DSP) is a crucial limiting factor in the conception of these devices. To address this design challenge, we introduce a domain-specific ultra-low-power (ULP) architecture dedicated to bio-signal processing. The platform features a lightweight strategy to support different operating modes and synchronization among cores. Our approach effectively reduces the power consumption, harnessing the intrinsic parallelism and the workload requirements characterizing the target domain. Operation at low voltage levels is supported by a heterogeneous memory subsystem comprising a standard-cell-based, ultra-low-voltage reliable partition. Experimental results show that, when executing real-world bio-signal DSP applications, a state-of-the-art multi-core architecture can improve its energy efficiency by up to 50 percent by utilizing our proposed approach, outperforming traditional single-core alternatives.
    Print ISSN: 0018-9340
    Electronic ISSN: 1557-9956
    Topics: Computer Science
  • 97
    Institute of Electrical and Electronics Engineers (IEEE)
    Publication Date: 2017-03-11
    Description: Multi-core processors are increasingly popular because they yield higher performance, but they also present new challenges for hard real-time systems in that they make it much more difficult to estimate a task's worst-case execution time (WCET). Partitioned cache architecture is being used to ease the problem by providing an isolated execution environment for each thread. Although simple to implement and use, this method may be sub-optimal with respect to both energy consumption and performance since it prevents taking advantage of information shared across threads for both instructions and data. This work presents a new cache architecture termed SPACE (Semi-Partitioned CachE) that makes it possible to leverage information sharing, yielding in turn a tighter WCET. The SPACE architecture together with our new WCET algorithm can be used to maintain the predictability of the execution time of the parallel threads while reducing the overall energy consumption of the system. The new proposed cache architecture was implemented using Verilog and deployed on a Xilinx MicroBlaze multi-core design for testing, validation and measurements. The application level experiments were conducted using the Chronos tool for estimation and the Wattch/SimpleScalar simulator for execution. Using three real-time programs (a radar tracker, a DES encryption algorithm, and an FM radio), we showed that SPACE together with the enhanced WCET algorithm reduce the average system WCET of these applications by 31 percent and reduce the actual energy consumption by 18 percent in comparison with other cache architectures.
    Print ISSN: 0018-9340
    Electronic ISSN: 1557-9956
    Topics: Computer Science
  • 98
    Institute of Electrical and Electronics Engineers (IEEE)
    Publication Date: 2017-03-11
    Description: Current computer systems require large memory capacities to manage the tremendous volume of datasets. A DRAM cell consists of a transistor and a capacitor, and their size has a direct impact on DRAM density. While technology scaling can provide higher density, this benefit comes at the expense of low drivability, due to the increase in series resistance of the smaller transistor, which slows the process of restoring the charge in cells. DRAM operations require recovery processes due to the destructive nature of DRAM cells. Among such operations, the write recovery process has the most difficulty in meeting the timing constraints. In this paper, we explore an intrinsic mechanism in the DRAM write operation, and find a relation between restoration and retention times. Based on our observation, we propose a practical mechanism, Relaxed Refresh with Compensated Write Recovery (RRCW), which efficiently mitigates refresh overheads by providing longer restoration periods. Furthermore, to minimize the penalty of the longer restoration, we also introduce another mechanism, Refresh-Aware Write Recovery (RAWR), which appropriately curtails longer recovery time according to the waiting time until being refreshed. Lastly, we introduce a scheduling policy to efficiently utilize RAWR. Evaluations show that the benefits of our mechanisms increase as memory intensity increases.
    Print ISSN: 0018-9340
    Electronic ISSN: 1557-9956
    Topics: Computer Science
  • 99
    Institute of Electrical and Electronics Engineers (IEEE)
    Publication Date: 2017-03-14
    Description: This paper presents a fault-tolerance technique for H.264's Context-Adaptive Variable Length Coding (CAVLC) on unreliable computing hardware. The application-specific knowledge is leveraged at both algorithm and architecture levels to protect the CAVLC process (especially context adaptation and coding tables) in a reliable yet power-efficient manner. Specifically, the statistical analysis of coding syntax and video content properties are exploited for: (1) selective redundancy of coefficient/header data of video bitstreams; (2) partitioning the coding tables into various sub-tables to reduce the power overhead of fault tolerance; and (3) run-time power management of memory parts storing the sub-tables and their parity computations. Experimental results demonstrate that leveraging application-specific knowledge reduces area and performance overhead by 2x compared to a double-parity table protection technique. For functional verification and area comparison, the complete H.264 CAVLC architecture is prototyped on a Xilinx Virtex-5 FPGA (though not limited to it).
    Print ISSN: 0018-9340
    Electronic ISSN: 1557-9956
    Topics: Computer Science
    Location Call Number Expected Availability
    BibTip Others were also interested in ...
  • 100
    Institute of Electrical and Electronics Engineers (IEEE)
    Publication Date: 2017-03-11
    Description: Recent technological advances have led to an increasing gap between memory and processor performance, since memory bandwidth is progressing at a much slower pace than processor bandwidth. Pre-fetching techniques are traditionally used to bridge this gap and achieve high processor utilization while tolerating high memory latencies. Following this trend, new computational models have been proposed that split task execution into two consecutive phases: a memory phase in which the required instructions and data are pre-fetched to local memory (M-phase), and an execution phase in which the task is executed with no memory contention (C-phase). Decoupling memory and execution phases not only simplifies the timing analysis, but also allows more efficient (and predictable) pipelining of memory and execution phases through proper co-scheduling algorithms. This paper takes a further step towards the design of smart co-scheduling algorithms for sporadic real-time tasks complying with the memory-computation (M/C) model, by proposing a theoretical framework aimed at tightly characterizing the schedulability improvement obtainable with the adopted M/C task model on single-core systems. In particular, a critical instant is identified for M/C tasks scheduled with fixed priority, and an exact response-time analysis with pseudo-polynomial complexity is provided. Then, we investigate the problem of priority assignment for M/C tasks, showing that a necessary condition for optimality is to allow different priorities for the two phases. Our experiments show that the proposed techniques provide a significant schedulability improvement with respect to classic execution models, providing an important building block for the design of more efficient partitioned multi-core systems.
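    As a point of reference for the exact M/C analysis mentioned above, the classic fixed-priority response-time iteration it builds on can be sketched as follows. This is a baseline sketch under stated assumptions, not the paper's analysis: each task is a hypothetical `(M, C, T)` triple, and the sketch lumps `M + C` into a single WCET, whereas the paper's exact analysis pipelines the M- and C-phases (and can assign them different priorities).

    ```python
    import math

    def response_time(task, higher_prio):
        """Iterative RTA: R = C_i + sum over hp tasks j of ceil(R/T_j) * C_j.
        Tasks are (M, C, T) triples with implicit deadlines (D = T);
        returns None if the iteration exceeds the period (deadline miss)."""
        m, c, period = task
        wcet = m + c                      # baseline: lump the two phases
        r = wcet
        while True:
            interference = sum(math.ceil(r / t_j) * (m_j + c_j)
                               for (m_j, c_j, t_j) in higher_prio)
            r_next = wcet + interference
            if r_next == r:
                return r                  # fixed point reached
            if r_next > period:
                return None               # unschedulable at this priority
            r = r_next

    # Illustrative task set, ordered highest priority first.
    tasks = [(1, 2, 10), (2, 3, 20), (1, 4, 40)]
    resp = [response_time(tasks[i], tasks[:i]) for i in range(len(tasks))]
    assert resp == [3, 8, 16]
    ```

    The iteration converges in pseudo-polynomial time because the response-time candidate only ever grows and is bounded by the period; the paper's contribution is an analysis of the same flavor that remains exact when the M- and C-phases of different tasks overlap in a pipelined fashion.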
    Print ISSN: 0018-9340
    Electronic ISSN: 1557-9956
    Topics: Computer Science