ODELIX — M2 internship proposal


Title: Exploiting hardware matrix accelerations in computer algebra

Topics: high performance computing, fast arithmetic, hardware acceleration

Address

Laboratoire d'informatique de l'École polytechnique (LIX, UMR 7161 CNRS)
Bâtiment Alan Turing, CS35003
1 rue Honoré d'Estienne d'Orves
91120 Palaiseau, France

Research team: MAX, Algebraic modeling and symbolic computation

Contacts

Joris van der Hoeven <vdhoeven@lix.polytechnique.fr>
Grégoire Lecerf <lecerf@lix.polytechnique.fr>

Context

The MAX team is searching for PhD candidates on the themes of the “ODELIX” ERC Advanced Grant. The present M2 internship proposal allows applicants to familiarize themselves with these themes. Upon successful completion of the internship, there will be an opportunity to continue with a PhD.

Applying

Applications can be sent by email to the above contact persons; they should include a CV, a transcript of records, a letter of motivation, and optional recommendation letters. Note that we have no hard deadlines; new applications will be considered at any time of the year, as long as the ODELIX project runs.

Description

Driven by needs from artificial intelligence, modern hardware increasingly includes accelerators for matrix multiplication: Intel's AMX extensions [2], Apple's undocumented matrix extensions [1, 3, 10], tensor cores on various GPUs [9], etc.

The specifics of such hardware accelerators vary heavily when it comes to the supported data types (from 8-bit to 64-bit integer and/or floating point types) and the precise kind of operations that are supported (e.g. 8-bit unsigned integer matrix multiplication, accumulated into a 32-bit unsigned integer result).
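To make the last example concrete, here is a minimal sketch of that operation's semantics in plain Python (a software stand-in only, no actual accelerator is used; the function name is ours): entries are treated as unsigned 8-bit values and their products are summed into an accumulator that wraps modulo 2**32.

```python
U8_MASK = 0xFF
U32_MOD = 1 << 32

def u8_matmul_u32(A, B):
    """Multiply matrices of u8 entries, accumulating each dot product
    into a u32 result (wrapping modulo 2**32), as a matrix unit might."""
    m, k, n = len(A), len(B), len(B[0])
    C = [[0] * n for _ in range(m)]
    for i in range(m):
        for j in range(n):
            acc = 0
            for t in range(k):
                # mask inputs to 8 bits, accumulate modulo 2**32
                acc = (acc + (A[i][t] & U8_MASK) * (B[t][j] & U8_MASK)) % U32_MOD
            C[i][j] = acc
    return C
```

A real accelerator performs many such dot products in one instruction; the sketch only pins down the data-type contract.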

The main question for this internship is whether these new accelerators can also be beneficial for other applications besides AI. For instance, can they be used to develop faster algorithms for polynomial multiplication, multiple precision arithmetic, finite field arithmetic, etc.? More generally, we are motivated by applications to computer algebra [5] and reliable computing.

A first objective will be to design an interface for matrix accelerators on various types of hardware and to better understand their performance.
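One possible shape for such an interface (all names here are hypothetical, not an existing design): each backend declares the element types it supports and exposes a multiply kernel, with a pure-software reference backend serving as fallback and test oracle.

```python
from abc import ABC, abstractmethod

class MatMulBackend(ABC):
    """Abstract front end over a matrix accelerator (sketch)."""
    name: str
    input_bits: int  # bit width of input entries
    acc_bits: int    # bit width of the accumulator

    @abstractmethod
    def matmul(self, A, B):
        """Return A*B with entries accumulated modulo 2**acc_bits."""

class ReferenceBackend(MatMulBackend):
    """Pure-Python fallback, usable as a correctness oracle."""
    name, input_bits, acc_bits = "reference", 8, 32

    def matmul(self, A, B):
        mod = 1 << self.acc_bits
        # zip(*B) iterates over the columns of B
        return [[sum(a * b for a, b in zip(row, col)) % mod
                 for col in zip(*B)] for row in A]
```

Hardware-specific backends (AMX, SME, tensor cores, …) would then implement the same contract, letting higher-level algorithms be benchmarked uniformly across machines.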

We will then turn to a first application for which such accelerators should be beneficial: integer matrix multiplication via Chinese remaindering [4, 8]. We will investigate for which precisions and matrix sizes the hardware acceleration becomes beneficial.
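The idea behind this approach can be sketched as follows (an illustration in plain Python, with small primes chosen for readability; it assumes nonnegative entries whose true products stay below the product of the moduli): reduce the entries modulo several small primes, multiply the residue matrices independently — the step a hardware matrix unit could take over — and reconstruct the integer result by Chinese remaindering.

```python
from math import prod

PRIMES = [251, 241, 239, 233]  # illustrative pairwise-coprime moduli

def matmul_mod(A, B, p):
    """Multiply matrices of residues modulo p."""
    k = len(B)
    return [[sum(A[i][t] * B[t][j] for t in range(k)) % p
             for j in range(len(B[0]))] for i in range(len(A))]

def crt(residues, moduli):
    """Reconstruct 0 <= x < prod(moduli) from its residues."""
    M = prod(moduli)
    x = 0
    for r, p in zip(residues, moduli):
        Mi = M // p
        x = (x + r * Mi * pow(Mi, -1, p)) % M  # pow(., -1, p): modular inverse
    return x

def int_matmul_crt(A, B):
    """Integer matrix product via one modular product per prime, then CRT.
    Correct when all entries of A*B are nonnegative and < prod(PRIMES)."""
    mods = [matmul_mod([[a % p for a in row] for row in A],
                       [[b % p for b in row] for row in B], p)
            for p in PRIMES]
    m, n = len(A), len(B[0])
    return [[crt([mods[s][i][j] for s in range(len(PRIMES))], PRIMES)
             for j in range(n)] for i in range(m)]
```

In practice the moduli and residue representations would be chosen to match the accelerator's native data types (e.g. residues fitting in 8 or 16 bits), which is precisely the trade-off the internship would study.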

Depending on the available time, we will next investigate a wider range of hardware and/or other applications, such as polynomial multiplication and large integer multiplication [6, 7].
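These two applications are closely linked; one classical bridge between them is Kronecker substitution (shown here as background, not as the method the internship will necessarily use): a polynomial product reduces to one large integer product by evaluating both polynomials at a power of two large enough that the coefficients of the result do not overlap.

```python
def poly_mul_kronecker(f, g, bits):
    """Multiply polynomials with nonnegative coefficients, given as
    coefficient lists (lowest degree first). `bits` must exceed the
    bit length of every coefficient of the product f*g."""
    # evaluate f and g at x = 2**bits, packing coefficients into integers
    F = sum(c << (i * bits) for i, c in enumerate(f))
    G = sum(c << (i * bits) for i, c in enumerate(g))
    H = F * G  # one large integer multiplication
    # unpack the product's coefficients, `bits` bits at a time
    mask = (1 << bits) - 1
    out = []
    while H:
        out.append(H & mask)
        H >>= bits
    return out
```

For example, (1 + 2x)(3 + 4x) = 3 + 10x + 8x² is recovered from the single product 513 × 1027 when bits = 8. The reverse reduction (integers to polynomials, then to FFTs or modular matrix products) is where hardware matrix units could come into play.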

We seek excellent candidates with a background in computer science. Applicants are expected to have knowledge of algorithms, parallelism, high performance computing, and complexity. Programming skills will be useful for achieving efficient implementations.

References

[1] The elusive Apple matrix coprocessor (AMX). https://research.meekolab.com/the-elusive-apple-matrix-coprocessor-amx.

[2] What is Intel advanced matrix extensions? https://www.intel.com/content/www/us/en/products/docs/accelerator-engines/what-is-intel-amx.html.

[3] P. Cawley. Repository for exploring Apple's counterpart to AMX. https://github.com/corsix/amx?tab=readme-ov-file.

[4] J. Doliskani, P. Giorgi, R. Lebreton, and É. Schost. Simultaneous conversions with the Residue Number System using linear algebra. ACM Trans. Math. Softw., 44(3), 2018. Article 27.

[5] J. von zur Gathen and J. Gerhard. Modern Computer Algebra. Cambridge University Press, New York, NY, USA, 3rd edition, 2013.

[6] J. van der Hoeven and G. Lecerf. Faster FFTs in medium precision. In 22nd IEEE Symposium on Computer Arithmetic (ARITH), pages 75–82, June 2015.

[7] J. van der Hoeven and G. Lecerf. Implementing number theoretic transforms. Technical report, HAL, 2024. https://hal.science/hal-04841449.

[8] J. van der Hoeven, G. Lecerf, and G. Quintin. Modular SIMD arithmetic in Mathemagix. ACM Trans. Math. Softw., 43(1), 2016. Article 5.

[9] NVIDIA tensor cores. https://www.nvidia.com/en-us/data-center/tensor-cores/.

[10] T. Zakharko. Exploring SME performance of Apple M4. https://github.com/tzakharko/m4-sme-exploration.