Junfeng Fan ESAT/COSIC
• ECC implementation methods
• Multi-core systems
• Coarse-Grained Parallelism (CGP)
• Fine-Grained Parallelism (FGP)
• Two-Dimensional Parallelism (TDP)
• Results
• Conclusions
• Q = k ⋅ P: NAF, window method
• 2 ⋅ P2, P1 + P2: projective coordinates, weighted projective coordinates
• a^-1 mod p, a + b mod p, a * b mod p: Montgomery/Barrett reduction, Itoh-Tsujii inversion
• Fast multiplier, systolic array, super-scalar coprocessor
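As an illustration of the NAF recoding named at the top level, here is a minimal Python sketch. The function name `naf` and the least-significant-digit-first output order are choices made for this example, not taken from the slides.

```python
def naf(k):
    """Return the non-adjacent form of k > 0, least significant digit first.

    Each digit is -1, 0 or 1, and no two neighbouring digits are both nonzero.
    """
    digits = []
    while k > 0:
        if k & 1:
            d = 2 - (k % 4)   # 1 if k = 1 (mod 4), -1 if k = 3 (mod 4)
            k -= d            # makes k divisible by 4, so the next digit is 0
        else:
            d = 0
        digits.append(d)
        k >>= 1
    return digits


print(naf(7))  # 7 = 8 - 1, digits listed from the least significant end
```

Fewer nonzero digits mean fewer point additions when computing Q = k ⋅ P, since a digit of -1 only costs a point subtraction.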
[Mentens …]
Multi-Core Systems
• Atmel Diopsis
• ARM quad core
• Intel quad core
• AMD quad core
• Cell Processor
• You name it...
• Advantages
  - Powerful platform
  - Lower clock frequency
  - Energy efficient
• Challenges
  - Task partitioning
  - Communication between cores
  - Concurrency management
Single-core vs. multi-core (core-1, core-2, core-3)
[Figure: the operation sequence below, scheduled on a single core and distributed over three cores.]
t1 = X2 ⋅ X2
t1 = 3t1
t2 = Z2 ⋅ Z2
t2 = t2 ⋅ t2
t2 = at2 + t1
t1 = 2Y2
Z2 = Z2 ⋅ t1
t3 = t1 ⋅ t1
t4 = X2 ⋅ t3
t1 = t1 ⋅ t3
t1 = t1 ⋅ Y2
t5 = t2 ⋅ t2
X2 = 2t4
X2 = t5 − X2
t3 = t4 − X2
Y2 = t2 ⋅ t3 − t1
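The register sequence on this slide matches point doubling in weighted projective (Jacobian) coordinates. The Python sketch below runs it over a prime field. The ordering of the operations is one dependency-respecting reconstruction and is not necessarily the slide's. The function names and the toy curve in the usage lines (y^2 = x^3 + 2x + 3 over GF(97)) are assumptions made for illustration.

```python
def jacobian_double(X2, Y2, Z2, a, p):
    """Double the point (X2 : Y2 : Z2) on y^2 = x^3 + a*x + b over GF(p)."""
    # These two products are independent of each other, so they are
    # candidates for running on different cores.
    t1 = X2 * X2 % p
    t2 = Z2 * Z2 % p
    t1 = 3 * t1 % p
    t2 = t2 * t2 % p
    t2 = (a * t2 + t1) % p      # 3*X^2 + a*Z^4
    t1 = 2 * Y2 % p
    Z2 = Z2 * t1 % p            # new Z = 2*Y*Z
    t3 = t1 * t1 % p            # 4*Y^2
    t4 = X2 * t3 % p            # 4*X*Y^2
    t1 = t1 * t3 % p            # 8*Y^3
    t1 = t1 * Y2 % p            # 8*Y^4
    t5 = t2 * t2 % p
    X2 = 2 * t4 % p
    X2 = (t5 - X2) % p          # new X
    t3 = (t4 - X2) % p
    Y2 = (t2 * t3 - t1) % p     # new Y
    return X2, Y2, Z2


def to_affine(X, Y, Z, p):
    """Map Jacobian (X : Y : Z) to affine (X/Z^2, Y/Z^3)."""
    zi = pow(Z, -1, p)          # modular inverse, Python 3.8+
    return X * zi * zi % p, Y * zi ** 3 % p


# Toy example (hypothetical parameters): double (3, 6) on y^2 = x^3 + 2x + 3 mod 97.
print(to_affine(*jacobian_double(3, 6, 1, 2, 97), 97))
```

Coarse-grained parallelism would assign such independent field multiplications to different cores. The data dependencies between the t registers limit how many can run at once.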
Single-core vs. multi-core (core-1, core-2, core-3)
t1 = X2 ⋅ X2
t1 = 3t1
t2 = Z2 ⋅ Z2
t2 = t2 ⋅ t2
t2 = at2 + t1
t1 = 2Y2
Z2 = Z2 ⋅ t1
t3 = t1 ⋅ t1
…
How to efficiently perform modular multiplications with a multi-core system?
Mont(X, Y, M) = X ⋅ Y ⋅ R^-1 mod M
• C = X ⋅ Y, split as (C_1, C_0)
• T = C_0 ⋅ M' mod R
• Z = C + T ⋅ M, split as (Z_1, Z_0), where Z_0 = 0 and Z_1 gives X ⋅ Y ⋅ R^-1 mod M
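The three steps of Montgomery multiplication can be sketched in Python as follows. This is a minimal illustration, assuming M is odd, R = 2^r_bits > M and M' = -M^-1 mod R (the usual definition, which the slide does not spell out). The function name `mont` and the small test values are hypothetical.

```python
def mont(x, y, m, r_bits):
    """Return x*y*R^-1 mod m for odd m and x, y < m, with R = 2**r_bits > m."""
    r_mask = (1 << r_bits) - 1
    m_prime = (-pow(m, -1, 1 << r_bits)) & r_mask   # M' = -M^-1 mod R (Python 3.8+)
    c = x * y                                       # C = (C_1, C_0)
    t = ((c & r_mask) * m_prime) & r_mask           # T = C_0 * M' mod R
    z = (c + t * m) >> r_bits                       # low half Z_0 is zero; keep Z_1
    return z - m if z >= m else z                   # final conditional subtraction


# Compare against a direct computation with an explicit inverse of R.
m, k = 239, 8
print(mont(100, 200, m, k) == 100 * 200 * pow(1 << k, -1, m) % m)
```

The only division is by R, which is a shift, so no trial division by M is needed.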
512-bit MMM on a 16-bit core
• X = (x_31, x_30, …, x_1, x_0), multiplied by one word y_i
• C = (c_32, c_31, c_30, …, c_1, c_0)
• M = (m_31, m_30, …, m_1, m_0), multiplied by T
• Z = (z_32, z_31, z_30, …, z_1, z_0)
• Result row: c_32, c_31, c_30, c_29, …, c_0, 0
[Algorithm listing: illegible in the source; only the assignment arrows and the "mod" and "div" steps survive.]
[Figure: data flow of one iteration. For each word position j = 0, …, 31, the products x_j × y_i and m_j × T are added to c_j; the outputs form c_32, c_31, c_30, …, c_0, 0.]
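The per-word data flow can be sketched as a word-serial Montgomery multiplication in Python. The slide's own listing is not legible, so the loop below uses a common textbook form, T ← (c_0 + x_0 ⋅ y_i) ⋅ m' mod 2^16 followed by C ← (C + X ⋅ y_i + M ⋅ T) div 2^16, as an assumption. The names `mmm_words`, `to_words`, `W` and `N` are hypothetical.

```python
W = 16                      # word size of the core, in bits
N = 32                      # 512-bit operands = 32 words
MASK = (1 << W) - 1


def to_words(v):
    """Split v into N words of W bits, least significant first."""
    return [(v >> (W * j)) & MASK for j in range(N)]


def mmm_words(x, y, m):
    """Return x*y*2^(-W*N) mod m for odd m < 2^(W*N) and x, y < m."""
    xs, ys, ms = to_words(x), to_words(y), to_words(m)
    m_prime = (-pow(m, -1, 1 << W)) & MASK      # -m^-1 mod 2^W (Python 3.8+)
    c = [0] * (N + 1)                           # c_0 ... c_32
    for yi in ys:
        t = ((c[0] + xs[0] * yi) * m_prime) & MASK
        carry = 0
        for j in range(N):
            s = c[j] + xs[j] * yi + ms[j] * t + carry
            if j > 0:
                c[j - 1] = s & MASK             # write one word down: the "div 2^W"
            carry = s >> W                      # for j = 0 the low word is zero
        s = c[N] + carry
        c[N - 1] = s & MASK
        c[N] = s >> W
    z = sum(w << (W * j) for j, w in enumerate(c))
    return z - m if z >= m else z


m = (1 << 511) + 187                            # arbitrary odd 512-bit modulus
x, y = pow(3, 300, m), pow(7, 150, m)
print(mmm_words(x, y, m) == x * y * pow(1 << (W * N), -1, m) % m)
```

The inner loop over j is a single carry chain here. Splitting that chain is what fine-grained parallelism has to deal with.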
core-1, core-2, core-3, core-4
In each iteration: [two update steps, one ending in "mod" that produces T and one ending in "div"; operands illegible in the source]
[Figure: the words c_0, c_1, …, c_31, c_32 and T distributed over the four cores across iterations.]
Note:
1. Carry is used in local core.
core-1, core-2, core-3, core-4
In each iteration: [two update steps, one ending in "mod" that produces T and one ending in "div"; operands illegible in the source]

Core     Words            Kept with the core
core-1   c_0 … c_7        Carry_7
core-2   c_8 … c_15       Carry_15
core-3   c_16 … c_23      Carry_23
core-4   c_24 … c_31      c_32

Carry is not propagated!
Note:
1. Carry is used in local core.
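One way to read this partitioning is simulated below in Python. Each core owns eight words, runs its own carry chain, and keeps its carry-out (Carry_7, Carry_15, Carry_23) to add into its own top word in the next iteration instead of passing it to the neighbouring core. This is an interpretation of the slide, not a confirmed description of the authors' design. The iteration formulas are the common word-serial form, and all names (`mmm_partitioned`, `saved`, `BLOCK`) are hypothetical.

```python
W, N, CORES = 16, 32, 4
MASK = (1 << W) - 1
BLOCK = N // CORES                      # 8 words per core


def to_words(v):
    return [(v >> (W * j)) & MASK for j in range(N)]


def mmm_partitioned(x, y, m):
    """x*y*2^(-W*N) mod m with per-core carry chains (simulated sequentially)."""
    xs, ys, ms = to_words(x), to_words(y), to_words(m)
    m_prime = (-pow(m, -1, 1 << W)) & MASK
    c = [0] * (N + 1)
    saved = [0] * (CORES - 1)           # Carry_7, Carry_15, Carry_23
    for yi in ys:
        t = ((c[0] + xs[0] * yi) * m_prime) & MASK   # assumed to be sent to every core
        new_c = [0] * (N + 1)
        new_saved = [0] * (CORES - 1)
        for k in range(CORES):          # each pass models one core; passes are independent
            lo, hi = k * BLOCK, (k + 1) * BLOCK
            carry = 0                   # no carry-in from the neighbouring core
            for j in range(lo, hi):
                s = c[j] + xs[j] * yi + ms[j] * t + carry
                if j == hi - 1 and k < CORES - 1:
                    s += saved[k]       # carry kept from the previous iteration
                if j > 0:
                    new_c[j - 1] = s & MASK   # result moves one word down
                carry = s >> W
            if k < CORES - 1:
                new_saved[k] = carry    # stays local, used in the next iteration
            else:
                s = c[N] + carry
                new_c[N - 1] = s & MASK
                new_c[N] = s >> W
        c, saved = new_c, new_saved
    # Resolve the saved carries once, at the end.
    z = sum(w << (W * j) for j, w in enumerate(c))
    z += sum(cy << (W * (k * BLOCK + BLOCK - 1)) for k, cy in enumerate(saved))
    return z - m if z >= m else z


m = (1 << 511) + 187                    # arbitrary odd 512-bit modulus
x, y = pow(3, 300, m), pow(7, 150, m)
print(mmm_partitioned(x, y, m) == x * y * pow(1 << (W * N), -1, m) % m)
```

In this reading, the first word each core produces lands in its neighbour's range after the one-word shift, so one word per core still has to move between cores in every iteration. Carries do not.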
Multi-core system (core-1, core-2, core-3)
[Figure: operations such as t2 = t2 ⋅ t2, t1 = 3t1, Z2 = Z2 ⋅ t1, t3 = t1 ⋅ t1 and t5 = t2 ⋅ t2 assigned to the cores.]
[Figure: mapping of MMMs onto cores. Panels: (a) three MMMs in parallel; (c) two MMMs in parallel; (d) single MMM. Some cores are marked unused.]
[Figure: bar chart of performance [msec] against strategy for parallelism (case I to case IV); each bar is split into inversion and PA/PD chain. Bar labels: 14.5, 13.4, 10.2, 9.9.]
• Case I: only CGP
• Case II: TDP with up to three-way CGP
• Case III: TDP with up to two-way CGP
• Case IV: only FGP
…-bit ECC on the prototype processor with four 32-bit cores
• Conclusions
  - We describe a parallel computing method for ECC.
  - By using two-dimensional parallelism, it is …% faster than using only coarse-grained parallelism.
  - Applicable to other PKCs.
• Future work
  - Apply this method to off-the-shelf multi-core processors.
  - Improve the performance further with algorithm-architecture co-design methods.