Comparative Analysis of Floating-Point Multiplier Algorithms (Booth, Wallace, and Array)
Overview
Floating-point multiplication breaks into three main steps: sign, exponent (add and normalize), and significand (mantissa) multiplication. The core performance, area, and power trade-offs come from the significand multiplier. Common multiplier algorithms include Array (schoolbook), Booth (radix-4 or higher, partial-product recoding), and Wallace (or tree) multipliers. Below is a concise comparison and guidance for choosing between them.
Algorithm summaries
-
Array multiplier
- Structure: Regular grid of adders (partial-product generation and ripple or carry-save accumulation).
- Strengths: Simple, highly regular layout—easy to route and pipeline, predictable timing.
- Weaknesses: O(n^2) partial-product additions → larger area and delay for wide operands.
- Best use: Small bit-widths, simple FPGA/ASIC implementations, strong locality.
-
Booth multiplier (radix-4 or radix-8)
- Structure: Recodes multiplier bits to reduce number of partial products (encodes 2 or 3 bits per booth digit), then uses partial-product generation and reduction (often with carry-save adders).
- Strengths: Fewer partial products → reduced area and faster than naive array for same width; good for signed multiplication.
- Weaknesses: More complex control/encoding logic and sign-extension handling; partial-product generation requires shifting/negation corrections.
- Best use: Medium to wide multipliers where balancing area and speed matters; common in high-performance integer/FP units.
-
Wallace/tree multiplier (e.g., Wallace or Dadda)
- Structure: Generate all partial products, then reduce them using a tree of carry-save adders (CSA) to two rows, then final carry-propagate adder.
- Strengths: Logarithmic reduction depth → lower latency than array; excellent for high-speed designs; amenable to parallelism.
- Weaknesses: Irregular wiring and higher routing complexity and area for small widths; more complex to pipeline efficiently.
- Best use: Wide multipliers where latency is critical (high-performance FP units, CPUs, DSPs).
Comparative table
| Attribute | Array | Booth (radix-⁄8) | Wallace/Tree |
|---|---|---|---|
| Latency (significand mult) | High (linear) | Moderate | Low (logarithmic) |
| Area | Large (quadratic) | Reduced vs array | Medium–large |
| Power | Higher for wide widths | Lower than array | Can be higher due to switching/routing |
| Regularity & simplicity | High | Moderate | Low |
| Scalability to wide widths | Poor | Good | Excellent |
| Routing complexity | Low | Moderate | High |
| Ease of pipelining | Good | Good | Requires careful stage balancing |
| Best for | Small/regular designs, FPGAs | Balanced speed/area | High-speed processors, DSPs |
Precision, Rounding, and Normalization considerations
- Floating-point needs guard, round, and sticky bits. Multipliers typically produce a product with twice the significand width before normalization.
- Use of Booth/Wallace reduces raw multiplication latency but normalization and rounding logic must handle possible carries from LSBs; design pipelining must align stages (significand multiply → normalize → round).
- For IEEE-754 single precision (23-bit significand + implicit 1), implement a multiplier for 24×24 bits producing up to 48 bits; include at least 3 extra bits (guard, round, sticky) before rounding.
Implementation trade-offs & recommendations
- For FPGA prototypes or small FP units: prefer Array for simplicity or Booth with few recoding bits if area matters.
- For ASIC high-performance FP units: prefer Wallace/Dadda tree with Booth recoding for reduced partial products plus fast reduction—common combination is Booth recoding + CSA tree + final CPA.
- Pipeline aggressively: break multiplier into stages (partial-product gen, CSA tree reduction levels, final adder) to meet high clock rates; balance pipeline depth against latency and energy.
- Consider truncated multipliers or approximate methods when full precision not required (embedded ML inference).
Example concrete option (single-precision FP core)
- Use Booth-4 recoding for partial-product reduction (≈12–13 partial products for 24-bit multiplier).
- Reduce with a 3–4 level CSA tree (Wallace style) to two rows.
- Final carry-propagate adder for 48→24 normalization output.
- Provide guard/round/sticky bits and IEEE-754 rounding unit.
- Pipeline stages: PP gen → CSA levels → final CPA → normalization/rounding.
Final note
Combine techniques: real-world high-performance multipliers often use Booth recoding to reduce partial products and a Wallace/Dadda-style reduction tree to minimize latency, trading off routing complexity for speed.
Leave a Reply