Lab6

Introduction

In Lab 6 we touch on how sound is represented internally in our computers. As we can guess, sound does not get processed as analog signals within our programs. Instead, the typical representation of sound is a stream of int16_t (16-bit integer) samples. Should we decide to change the volume, each sample has to be re-scaled by the volume (scaling) factor.

We are going to use 3 different approaches to scale sound samples. For each, I am going to provide the code and explain how the algorithms differ from each other. The algorithms will be tested on two separate architectures: AArch64 and x86-64.

Algorithm 1 (AArch64)

As per the lab instructions, I wrote the following program:

//
//  lab6.c
//  SPO600-Lab6
//
//  Created by Evgeni Kolev on 2017-10-18.
//  Copyright © 2017 Evgeni Kolev. All rights reserved.
//

#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>

#define SIZE 250000  // array size
#define LOW -32767  // lowest value in the range
#define HIGH 32767   // highest value in the range

int main(void) {

    int16_t* samples = (int16_t*) malloc(SIZE * sizeof(int16_t));
    float* scaled = (float*) malloc(SIZE * sizeof(float)); // second array
    int div = (HIGH + 1 - LOW); // divider for the random number to be in the range
    float total = 0;
    int x;
    /* Generate the samples. */
    printf("Generating the samples");
    for (x = 0; x < SIZE; x++)
        samples[x] = (LOW + rand() % div);
    /* Scale the samples */
    printf("Scaling the samples");
    for (x = 0; x < SIZE; x++) {
        scaled[x] = samples[x] * 0.75;  //scale the sample
    }
    for(x = 0; x < SIZE; x++)
            total += scaled[x];
    printf("Total: %.2f\n ", total);
    return 0;
}

First, I defined the size of the sample arrays as well as the range of sample values. After allocating the arrays, I fill the first one with randomly generated samples. Then, I scale each sample and save the result in the second array, which holds the scaled samples.

Now, let’s run the solution with the “time” command to see how long it takes for the whole program to execute.

time ./lab6.1

Here are the results.
SC_algorithm1

Algorithm 2 (AArch64)

In the second approach, we amend the scaling operation. Instead of scaling each individual sample in an array of hundreds of thousands of values, we pre-scale every possible sample value in the range and save the results in a lookup table. This way, the scaled result for any sample is already available: in the code below, the scaled value for a sample s is stored at table[HIGH - s], so a sample of 32767 is found at table[0] and a sample of -32767 at table[65534].

Here is the code:

//  main.c
//  SPO600-Lab6
//
//  Created by Evgeni Kolev on 2017-10-17.
//  Copyright © 2017 Evgeni Kolev. All rights reserved.
//

#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>

#define SIZE 250000 // array
#define LOW -32767 // lowest value in the range
#define HIGH 32767 // highest value in the range



int main(void) {
    /* Declare a table and initialize the arrays. */
    float* scaled = (float*) malloc(SIZE * sizeof(float));
    float* table = (float*) malloc(SIZE * sizeof(float));
    int div = (HIGH + 1 - LOW);
    int16_t* samples = (int16_t*) malloc(SIZE * sizeof(int16_t));
    int x, j, i;
    float total = 0;

    /* Create the scales in the table. */
    printf("Creating a look up table\n");
    for (x = 0; x < 65535; x++) { // one entry for each possible sample value from HIGH down to LOW
        table[x] = (HIGH - x) * 0.75;
    }
    /* Create samples. */
    printf("Creating samples\n");
    for (x = 0; x < SIZE; x++)
        samples[x] = (LOW + rand() % div);
    printf("Scaling samples\n");
    for(x = 0; x < SIZE; x++){
      scaled[x] = table[HIGH - (samples[x])];
    }
    for(x = 0; x < SIZE; x++)
            total += scaled[x];
    printf("Total: %.2f\n", total);
    return 0;
}

Now, let’s see how this approach is going to perform. I am going to run the program again using the time command.

time ./lab6.2

Here is the output:

SC2_algorithm 2

We can see that there is a significant improvement in the user time. In the previous approach the user time was 0.18 seconds, whereas with the second algorithm it is 0.09 seconds. The second algorithm runs 2 times faster than the first. The reason is that, by using a pre-scaled table, we have limited the number of multiplication operations to the size of the samples’ range rather than the number of samples.

Algorithm 3 (AArch64)

In the third approach, we are going to turn the scaling into fixed-point integer math. The first step is to turn the scaling factor into an integer by multiplying 0.75 by 256. When scaling the samples, we multiply each sample by this integer factor and then shift the result 8 bits to the right (which divides by 256).
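As a quick worked example of my own (not part of the lab handout): take a sample value of 1000. The integer factor is 0.75 * 256 = 192, and (1000 * 192) >> 8 = 192000 / 256 = 750, which matches the floating-point result 1000 * 0.75 = 750, but uses only integer multiplication and a shift.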

Here is the code:

//
//  main.c
//  lab6.3
//
//  Created by Evgeni Kolev on 2017-10-18.
//  Copyright © 2017 Evgeni Kolev. All rights reserved.
//

#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>

#define SIZE 250000 // array size
#define SHIFT 8        // used to shift bits
#define FACTORIAL 0.75
#define LOW -32767   // lowest value in the range
#define HIGH 32767   // highest value in the range


int main() {
    int factorial = FACTORIAL * 256;
    int16_t* samples = (int16_t*) malloc(SIZE * sizeof(int16_t));
    int16_t* scaled = (int16_t*) malloc(SIZE * sizeof(int16_t)); // second array
    // divider for the random number to be in the range
    int div = (HIGH + 1) - LOW;
    int total = 0;
    int x;
   /* Initialize the samples in the array. */
    printf("Generating samples \n");
    for (x = 0; x < SIZE; x++)
        samples[x] = (LOW + rand() % div);
    /* Scale the samples */
    printf("Scaling the samples \n");
    for(x = 0; x < SIZE; x++)
        scaled[x] = (samples[x] * factorial) >> SHIFT;
    for(x = 0; x < SIZE; x++)
        total += scaled[x];
    printf("Total: %d\n", total);

    return 0;
}

Here are the results from the test:

SC3_algorithm 3.PNG

x86-64

Let’s see what results we are getting from running the same algorithms on a different architecture.

Algorithm 1

SC4_x-algorithm1

Algorithm 2

SC5_x-algorithm2

Algorithm 3

SC3-x-algorithm3

Conclusion

Based on the results from the 3 different algorithms, we can see that there are a number of dependencies. First, when we select a given algorithm for our application, we need to take into account the architecture our app is going to run on. Second, we should consider the amount of time it takes for the CPU to perform different operations; for example, multiplication is a more expensive operation than addition. Third, the approach we take should depend on the data load our application is going to handle.

Final Project Part 2

The source code for the final project is provided in Part 1. Also, the tool used to produce the graphs for profiling is called “gprof2dot”. Should you desire to use it, you can download the software from here.

In the second part of the final project, I will first try to optimise the code. I have a few approaches in mind. First, let’s have another look at the few hot spots of my program.

output2

Clearly, the “transform()” function is where most of the time is spent, as this function performs another 1,567,500 calls to 3 other hashing helper functions. Now, we should see how much of the total time spent in “transform()” each of these 3 functions takes.

The first one to look at is the “_md5_GG()” function. Here is what it looks like:

output

This function took 42.86% of the total time spent in the transform function, which is nearly half. The next two functions called from within “transform()” are not as significant.

output - Copy

output - Copy

They take 14.29% each; in total, that is another 28.58% of the time.

Optimization

The first approach of optimization I will take is to make the “_md5_GG()” function a lambda expression within the “transform()” function. Here is the code of the function itself:

inline void _md5_GG(uint32_t &a, uint32_t b, uint32_t c, uint32_t d, uint32_t x, uint32_t s, uint32_t ac) {
		a = _md5_rotate_left(a + _md5_G(b, c, d) + x + ac, s) + b;
	}

Now, here is what the lambda expression that I coded inside the “transform()” function looks like:

auto _md5_GG = [](uint32_t &a, uint32_t b, uint32_t c, uint32_t d, uint32_t x, uint32_t s, uint32_t ac) {
	  return _md5_rotate_left(a + _md5_G(b, c, d) + x + ac, s) + b;
	};

I compiled and profiled the program again and this is what I got in terms of how much time is spent in the lambda expression:

output - Copy (2)

Now, it looks like I have optimized it. However, the total amount of time spent in “transform()” is still about 28% of the program’s total execution time. Another simple way to test whether this approach provides an optimization that significantly affects performance is to increase the amount of data being hashed and compare the results produced by the “time” command.

First, I will test the program without the lambda expression.

time ./test

Here are the results:

Screen Shot 2018-01-08 at 11.46.52 пр.об.

Let’s have a look at how the optimized solution performs.

time ./test2

Screen Shot 2018-01-08 at 11.49.55 пр.об..png

Given the results, I would say I have managed to shave off 1.053 seconds. Unfortunately, this will not serve as a real-life optimization when the application is actually used. The MD5 hash algorithm typically hashes short strings, such as usernames and passwords. In my case, I am hashing a file with hundreds of characters, and I am hashing it 250 000 times. This will never occur in a real scenario, given the purpose of the “Telegram” application (refer to part 1 to download the application and get information on it). There is also something interesting to compare. Here is the piece of code in the “transform()” function that works with “_md5_GG()”.

_md5_GG(a, b, c, d, x[1] , 5 , 0xf61e2562);
_md5_GG(d, a, b, c, x[6] , 9 , 0xc040b340);
_md5_GG(c, d, a, b, x[11], 14, 0x265e5a51);
_md5_GG(b, c, d, a, x[0] , 20, 0xe9b6c7aa);
_md5_GG(a, b, c, d, x[5] , 5 , 0xd62f105d);
_md5_GG(d, a, b, c, x[10], 9 , 0x2441453);
_md5_GG(c, d, a, b, x[15], 14, 0xd8a1e681);
_md5_GG(b, c, d, a, x[4] , 20, 0xe7d3fbc8);
_md5_GG(a, b, c, d, x[9] , 5 , 0x21e1cde6);
_md5_GG(d, a, b, c, x[14], 9 , 0xc33707d6);
_md5_GG(c, d, a, b, x[3] , 14, 0xf4d50d87);
_md5_GG(b, c, d, a, x[8] , 20, 0x455a14ed);
_md5_GG(a, b, c, d, x[13], 5 , 0xa9e3e905);
_md5_GG(d, a, b, c, x[2] , 9 , 0xfcefa3f8);
_md5_GG(c, d, a, b, x[7] , 14, 0x676f02d9);
_md5_GG(b, c, d, a, x[12], 20, 0x8d2a4c8a);

The “_md5_GG()” function is called multiple times and was originally declared inline, meaning the compiler may insert the function body at every call site. This technique pays off as long as the expanded code from the multiple calls to “_md5_GG()” still fits in the CPU’s cache. Therefore, the performance of “transform()” becomes dependent on the machine it is being run on.

In the original “_md5_GG()”, the first parameter is passed by reference and has its value reassigned within the function. In my lambda expression, I moved the assignment out of the function body by returning the new value instead. Here is what it looks like:

a = _md5_GG(a, b, c, d, x[1] , 5 , 0xf61e2562);
d = _md5_GG(d, a, b, c, x[6] , 9 , 0xc040b340);
c = _md5_GG(c, d, a, b, x[11], 14, 0x265e5a51);
b = _md5_GG(b, c, d, a, x[0] , 20, 0xe9b6c7aa);
a = _md5_GG(a, b, c, d, x[5] , 5 , 0xd62f105d);
d = _md5_GG(d, a, b, c, x[10], 9 , 0x2441453);
c = _md5_GG(c, d, a, b, x[15], 14, 0xd8a1e681);
b = _md5_GG(b, c, d, a, x[4] , 20, 0xe7d3fbc8);
a = _md5_GG(a, b, c, d, x[9] , 5 , 0x21e1cde6);
d = _md5_GG(d, a, b, c, x[14], 9 , 0xc33707d6);
c = _md5_GG(c, d, a, b, x[3] , 14, 0xf4d50d87);
b = _md5_GG(b, c, d, a, x[8] , 20, 0x455a14ed);
a = _md5_GG(a, b, c, d, x[13], 5 , 0xa9e3e905);
d = _md5_GG(d, a, b, c, x[2] , 9 , 0xfcefa3f8);
c = _md5_GG(c, d, a, b, x[7] , 14, 0x676f02d9);
b = _md5_GG(b, c, d, a, x[12], 20, 0x8d2a4c8a);

Here “_md5_GG()” is the lambda expression, not the inlined “_md5_GG()” function.

The moral of the story is that both of these approaches do pretty much the same thing. Based on how both the application and MD5 are used, no optimization will be noticeable on the users’ side.
The other approach I attempted was to improve the “_md5_GG()” function, as well as the functions called inside it, using inline assembler. To my great regret, I failed miserably (I could not get it to compile). Writing equivalent MD5 hashing code in assembly proved to be something outside my skill set.

The third and most efficient optimization I did was using the GNU compiler optimization options.

The GNU compiler provides several optimization levels:

-O0 – Reduce compilation time and make debugging produce the expected results. This is the default.
-O1 – Optimizing compilation takes somewhat more time, and a lot more memory for a large function. With -O, the compiler tries to reduce code size and execution time, without performing any optimizations that take a great deal of compilation time.
-O2 – Optimize even more. GCC performs nearly all supported optimizations that do not involve a space-speed tradeoff. As compared to -O, this option increases both compilation time and the performance of the generated code.
-O3 – Optimize yet more. -O3 turns on all optimizations specified by -O2 and also turns on further optimization flags.

2003. “Using the GNU Compiler Collection” by Free Software Foundation. https://gcc.gnu.org/onlinedocs/gcc/Optimize-Options.html

The option that I used is -O2. This is as far as we can go in terms of optimization while keeping the generated code close to what we originally wrote. With -O3, the compiler transforms the code into an almost unrecognizable equivalent, so there is a possibility that the data/expected output might differ.

-O2 turns on all optimization flags specified by -O. It also turns on the following optimization flags:

-fthread-jumps 
-falign-functions  -falign-jumps 
-falign-loops  -falign-labels 
-fcaller-saves 
-fcrossjumping 
-fcse-follow-jumps  -fcse-skip-blocks 
-fdelete-null-pointer-checks 
-fdevirtualize -fdevirtualize-speculatively 
-fexpensive-optimizations 
-fgcse  -fgcse-lm  
-fhoist-adjacent-loads 
-finline-small-functions 
-findirect-inlining 
-fipa-cp 
-fipa-bit-cp 
-fipa-vrp 
-fipa-sra 
-fipa-icf 
-fisolate-erroneous-paths-dereference 
-flra-remat 
-foptimize-sibling-calls 
-foptimize-strlen 
-fpartial-inlining 
-fpeephole2 
-freorder-blocks-algorithm=stc 
-freorder-blocks-and-partition -freorder-functions 
-frerun-cse-after-loop  
-fsched-interblock  -fsched-spec 
-fschedule-insns  -fschedule-insns2 
-fstore-merging 
-fstrict-aliasing 
-ftree-builtin-call-dce 
-ftree-switch-conversion -ftree-tail-merge 
-fcode-hoisting 
-ftree-pre 
-ftree-vrp 
-fipa-ra

2003. “Using the GNU Compiler Collection” by Free Software Foundation. https://gcc.gnu.org/onlinedocs/gcc/Optimize-Options.html

Having said that, let’s do the optimization itself. If you have downloaded the files from part one, all you have to do is copy and paste the following command:

g++ -g -O2 -std=c++11 -pg Source.cpp utils.cpp -o test

Run the program with the time command for a quick comparison.

time ./test

Screen Shot 2018-01-08 at 3.24.59 сл.об..png

Now, this is 30 times faster. I think I should stick with this optimization. However, I need to make sure that the optimization is not altering my data. Let’s hash the file 10 times and see what results we get.

Prove that the optimized code produces equivalent results to the original code

I need to make sure that I am getting identical results every time the optimized program is run. Therefore, I made a very simple bash script to execute the program multiple times. Here is the code:

 #!/bin/bash
  
counter=1
echo "first parameter is $1"
while [ $counter -le $1 ]
do
        ./test
        ((counter++))
done

In the script above, $1 stands for the first parameter we give our script when we run it; basically, the number of times we want to run the “test” program. I changed “main()” to print the first 2 hashes from the vector of hashes. I am hashing the same file, so they should all be identical.

I will run the script with the number 10. Meaning, the program will be run 10 times and it should print 20 identical hashes.

./script.sh 10

Here’s the output:

Screen Shot 2018-01-08 at 4.20.31 сл.об.

Let’s run the script again but this time the “test” program is going to have a different file.

Screen Shot 2018-01-08 at 4.25.25 сл.об.

I have hard coded the data file in my “main()”. If you desire to test the program, you can use the header and source file from part one, along with this “main()”.

//
//  Source.cpp
//  SPO600-Project
//
//  Created by Evgeni Kolev on 19.12.17 ..
//  Copyright © 2017 г. Evgeni Kolev. All rights reserved.
//


#include <iostream>
#include <fstream>
#include <string>
#include <vector>
#include "HashMd5.h"
#define SIZE 2500
using namespace std;

int main() {
	std::string line;
	string data;
	vector<HashMd5> objs;
	std::ifstream myfile("doc.txt");
	if (myfile.is_open())
	{
		while (getline(myfile, line)) {
			data += line;
		}
		myfile.close();
	}
	
	for (int i = 0; i < SIZE; i++) {
		objs.push_back(HashMd5(data.c_str(), (uint32_t)data.length())); // hash the string's characters, not the std::string object itself
		objs.at(i).result();
	} 

	for (int i = 0; i < 2; i++) {
		cout << objs.at(i)<<endl;
	}

	
	
}

Run the program on a non-AArch64 platform

I ran the program on xerxes (x86-64). I simply used the time command to compare.

Screen Shot 2018-01-08 at 5.08.07 сл.об..png

This is more than 0.200 seconds faster than on AArch64. Given that it took 1.399 seconds to execute on AArch64, that is around 18% faster on x86-64. Here are the hashes on xerxes after I ran the script:

Screen Shot 2018-01-08 at 4.53.37 сл.об.

Here, too, the data is not affected by multiple runs of the optimized code. Therefore, the output of the program is not affected by the platform; only its performance differs slightly.

Conclusion

I can’t say that I achieved an optimization that would be of overall benefit to the application. Telegram is an application used in the professional Linux world, mainly for texting. Consequently, I failed in a way: I chose an application to optimize that is beyond my programming skill set. The results I achieved satisfy the project requirements rather than improving the application’s overall performance.

Nonetheless, this project as well as the labs did teach me a number of things.

  1. How to approach the development/improvement of an application from the open source community.
  2. How to blog/report on programming work.
  3. How to profile and measure the performance of applications across different platforms.
  4. A number of different approaches to take when we wish to improve software performance.
  5. How to identify, design, and select the best approach to improve software performance/portability.

Special thanks to our professor, Chris Tyler. It was great working with him and the classmates of SPO600.

Final Project

Introduction

In the final project, we were assigned to perform an optimization. For that purpose, I had to find an open source software package that makes use of a hash algorithm. Basically, hash algorithms are used to produce fixed-size digests of data.

I decided to use a texting application for Linux called Telegram, which uses the CRC32 and MD5 hash algorithms.

Here is a link to the software https://github.com/telegramdesktop/tdesktop
I would like to point out that it will take you forever to find a hash function just by browsing. Their code does not have many comments, and the .cpp and .h files are thousands of lines long.
Tip: use the following command to find files that mention the word “hash”:
grep 'hash' *

There are quite a few files using “hash” as a macro, function, or class name. After extracting the tar file, I went to the folder called “core”, which contains the core classes and their implementations. There, the application uses the CRC32 and MD5 hashing algorithms.

I decided to work on the MD5 hash algorithm.

“The MD5 algorithm is a widely used hash function producing a 128-bit hash value. Although MD5 was initially designed to be used as a cryptographic hash function, it has been found to suffer from extensive vulnerabilities. It can still be used as a checksum to verify data integrity, but only against unintentional corruption.

Like most hash functions, MD5 is neither encryption nor encoding. It can be cracked by brute-force attack and suffers from extensive vulnerabilities as detailed in the security section below.

MD5 was designed by Ronald Rivest in 1991 to replace an earlier hash function MD4. The source code in RFC 1321 contains a “by attribution” RSA license. The abbreviation “MD” stands for “Message Digest.” ”

2017 “MD5 – Wikipedia” Wikipedia. https://en.wikipedia.org/wiki/MD5

Find code to optimise

The program does not use a single function applying this algorithm, but rather a whole class. I had to do quite a bit of editing on both the header and the source file in order to isolate the class and its implementation, because the original code uses libraries that we do not have installed on the system. The code is quite lengthy, so you might want to copy and paste it into a real editor and compile it there.

Here is the edited copy of the header file customized to only work with the HashMd5 class

#pragma once
#include <cstdint>
#include <string>
#include <array>
#include <ostream>

const static uint32_t _md5_block_size = 64;
typedef unsigned char uchar;
class HashMd5 {
public:
	HashMd5(const void *input = 0, uint32_t length = 0);
	void feed(const void *input, uint32_t length);
	int32_t *result();
	std::string hexdigest() const;

private:

	void init();
	void finalize();
	void transform(const uchar *block);

	bool _finalized;
	uchar _buffer[_md5_block_size];
	uint32_t _count[2];
	uint32_t _state[4];
	uchar _digest[16];

};
std::ostream& operator<<(std::ostream& out, HashMd5 md5);

int32_t *hashMd5(const void *data, uint32_t len, void *dest); // dest = ptr to 16 bytes, returns (int32*)dest
inline std::array<char, 16> hashMd5(const void *data, int size) {
	auto result = std::array<char, 16>();
	hashMd5(data, size, result.data());
	return result;
}

char *hashMd5Hex(const int32_t *hashmd5, void *dest); // dest = ptr to 32 bytes, returns (char*)dest
inline char *hashMd5Hex(const void *data, uint32_t len, void *dest) { // dest = ptr to 32 bytes, returns (char*)dest
	return hashMd5Hex(HashMd5(data, len).result(), dest);
}
inline std::array<char, 32> hashMd5Hex(const void *data, int size) {
	auto result = std::array<char, 32>();
	hashMd5Hex(data, size, result.data());
	return result;
}

Here is the implementation file for the above class:

#include "HashMd5.h"
#include <cstring>
#include <cstdio>
#include <string>
using namespace std;
uint64_t _SharedMemoryLocation[4] = { 0x00, 0x01, 0x02, 0x03 };

#ifdef Q_OS_WIN
#elif defined Q_OS_MAC
#include 
#else
#include 
#endif

// Base types compile-time check
static_assert(sizeof(char) == 1, "Basic types size check failed");
static_assert(sizeof(uchar) == 1, "Basic types size check failed");
static_assert(sizeof(int16_t) == 2, "Basic types size check failed");
static_assert(sizeof(uint16_t) == 2, "Basic types size check failed");
static_assert(sizeof(int32_t) == 4, "Basic types size check failed");
static_assert(sizeof(uint32_t) == 4, "Basic types size check failed");
static_assert(sizeof(int64_t) == 8, "Basic types size check failed");
static_assert(sizeof(uint64_t) == 8, "Basic types size check failed");
 
	inline void _md5_decode(uint32_t *output, const uchar *input, uint32_t len) {
		for (uint32_t i = 0, j = 0; j < len; i++, j += 4) {
			output[i] = ((uint32_t)input[j]) | (((uint32_t)input[j + 1]) << 8) |
			            (((uint32_t)input[j + 2]) << 16) | (((uint32_t)input[j + 3]) << 24);
		}
	}

	inline void _md5_encode(uchar *output, const uint32_t *input, uint32_t len) {
		for (uint32_t i = 0, j = 0; j < len; i++, j += 4) {
			output[j + 0] = (input[i]) & 0xFF;
			output[j + 1] = (input[i] >> 8) & 0xFF;
			output[j + 2] = (input[i] >> 16) & 0xFF;
			output[j + 3] = (input[i] >> 24) & 0xFF;
		}
	}

	inline uint32_t _md5_rotate_left(uint32_t x, int n) {
		return (x << n) | (x >> (32 - n));
	}

	inline uint32_t _md5_F(uint32_t x, uint32_t y, uint32_t z) {
		return (x & y) | (~x & z);
	}

	inline uint32_t _md5_G(uint32_t x, uint32_t y, uint32_t z) {
		return (x & z) | (y & ~z);
	}

	inline uint32_t _md5_H(uint32_t x, uint32_t y, uint32_t z) {
		return x ^ y ^ z;
	}

	inline uint32_t _md5_I(uint32_t x, uint32_t y, uint32_t z) {
		return y ^ (x | ~z);
	}

	inline void _md5_FF(uint32_t &a, uint32_t b, uint32_t c, uint32_t d, uint32_t x, uint32_t s, uint32_t ac) {
		a = _md5_rotate_left(a + _md5_F(b, c, d) + x + ac, s) + b;
	}

	inline void _md5_GG(uint32_t &a, uint32_t b, uint32_t c, uint32_t d, uint32_t x, uint32_t s, uint32_t ac) {
		a = _md5_rotate_left(a + _md5_G(b, c, d) + x + ac, s) + b;
	}

	inline void _md5_HH(uint32_t &a, uint32_t b, uint32_t c, uint32_t d, uint32_t x, uint32_t s, uint32_t ac) {
		a = _md5_rotate_left(a + _md5_H(b, c, d) + x + ac, s) + b;
	}

	inline void _md5_II(uint32_t &a, uint32_t b, uint32_t c, uint32_t d, uint32_t x, uint32_t s, uint32_t ac) {
		a = _md5_rotate_left(a + _md5_I(b, c, d) + x + ac, s) + b;
	}

	static uchar _md5_padding[64] = {
		0x80, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
		0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
		0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0
	};


HashMd5::HashMd5(const void *input, uint32_t length) : _finalized(false) {
	init();
	if (input && length > 0) 
		feed(input, length);
}

void HashMd5::feed(const void *input, uint32_t length) {
	uint32_t index = _count[0] / 8 % _md5_block_size;

	const uchar *buf = (const uchar *)input;

	if ((_count[0] += (length << 3)) < (length << 3)) _count[1]++;
	_count[1] += (length >> 29);

	uint32_t firstpart = 64 - index;

	uint32_t i;

	if (length >= firstpart) {
		memcpy(&_buffer[index], buf, firstpart);	
		transform(_buffer);
		for (i = firstpart; i + _md5_block_size <= length; i += _md5_block_size) {
			transform(&buf[i]);
		}

		index = 0;
	}
	else {
		i = 0;
	}

	memcpy(&_buffer[index], &buf[i], length - i);
}

int32_t *HashMd5::result() {
	if (!_finalized) finalize();
	return (int32_t*)_digest;
}

void HashMd5::init() {
	_count[0] = 0;
	_count[1] = 0;

	_state[0] = 0x67452301;
	_state[1] = 0xefcdab89;
	_state[2] = 0x98badcfe;
	_state[3] = 0x10325476;
}

void HashMd5::finalize() {
	if (!_finalized) {
		uchar bits[8];
		_md5_encode(bits, _count, 8);

		uint32_t index = _count[0] / 8 % 64, paddingLen = (index < 56) ? (56 - index) : (120 - index);
		feed(_md5_padding, paddingLen);
		feed(bits, 8);
		_md5_encode(_digest, _state, 16);
		_finalized = true;
	}
}
std::string HashMd5::hexdigest() const
{
	if (!_finalized)
		return "";

	char buf[33];
	for (int i = 0; i<16; i++)
		sprintf(buf + i * 2, "%02x", _digest[i]);
	buf[32] = 0;

	return std::string(buf);
}

void HashMd5::transform(const uchar *block) {
	
	uint32_t a = _state[0], b = _state[1], c = _state[2], d = _state[3], x[16];
	_md5_decode(x, block, _md5_block_size);

	_md5_FF(a, b, c, d, x[0] , 7 , 0xd76aa478);
	_md5_FF(d, a, b, c, x[1] , 12, 0xe8c7b756);
	_md5_FF(c, d, a, b, x[2] , 17, 0x242070db);
	_md5_FF(b, c, d, a, x[3] , 22, 0xc1bdceee);
	_md5_FF(a, b, c, d, x[4] , 7 , 0xf57c0faf);
	_md5_FF(d, a, b, c, x[5] , 12, 0x4787c62a);
	_md5_FF(c, d, a, b, x[6] , 17, 0xa8304613);
	_md5_FF(b, c, d, a, x[7] , 22, 0xfd469501);
	_md5_FF(a, b, c, d, x[8] , 7 , 0x698098d8);
	_md5_FF(d, a, b, c, x[9] , 12, 0x8b44f7af);
	_md5_FF(c, d, a, b, x[10], 17, 0xffff5bb1);
	_md5_FF(b, c, d, a, x[11], 22, 0x895cd7be);
	_md5_FF(a, b, c, d, x[12], 7 , 0x6b901122);
	_md5_FF(d, a, b, c, x[13], 12, 0xfd987193);
	_md5_FF(c, d, a, b, x[14], 17, 0xa679438e);
	_md5_FF(b, c, d, a, x[15], 22, 0x49b40821);

	_md5_GG(a, b, c, d, x[1] , 5 , 0xf61e2562);
	_md5_GG(d, a, b, c, x[6] , 9 , 0xc040b340);
	_md5_GG(c, d, a, b, x[11], 14, 0x265e5a51);
	_md5_GG(b, c, d, a, x[0] , 20, 0xe9b6c7aa);
	_md5_GG(a, b, c, d, x[5] , 5 , 0xd62f105d);
	_md5_GG(d, a, b, c, x[10], 9 , 0x2441453);
	_md5_GG(c, d, a, b, x[15], 14, 0xd8a1e681);
	_md5_GG(b, c, d, a, x[4] , 20, 0xe7d3fbc8);
	_md5_GG(a, b, c, d, x[9] , 5 , 0x21e1cde6);
	_md5_GG(d, a, b, c, x[14], 9 , 0xc33707d6);
	_md5_GG(c, d, a, b, x[3] , 14, 0xf4d50d87);
	_md5_GG(b, c, d, a, x[8] , 20, 0x455a14ed);
	_md5_GG(a, b, c, d, x[13], 5 , 0xa9e3e905);
	_md5_GG(d, a, b, c, x[2] , 9 , 0xfcefa3f8);
	_md5_GG(c, d, a, b, x[7] , 14, 0x676f02d9);
	_md5_GG(b, c, d, a, x[12], 20, 0x8d2a4c8a);

	_md5_HH(a, b, c, d, x[5] , 4 , 0xfffa3942);
	_md5_HH(d, a, b, c, x[8] , 11, 0x8771f681);
	_md5_HH(c, d, a, b, x[11], 16, 0x6d9d6122);
	_md5_HH(b, c, d, a, x[14], 23, 0xfde5380c);
	_md5_HH(a, b, c, d, x[1] , 4 , 0xa4beea44);
	_md5_HH(d, a, b, c, x[4] , 11, 0x4bdecfa9);
	_md5_HH(c, d, a, b, x[7] , 16, 0xf6bb4b60);
	_md5_HH(b, c, d, a, x[10], 23, 0xbebfbc70);
	_md5_HH(a, b, c, d, x[13], 4 , 0x289b7ec6);
	_md5_HH(d, a, b, c, x[0] , 11, 0xeaa127fa);
	_md5_HH(c, d, a, b, x[3] , 16, 0xd4ef3085);
	_md5_HH(b, c, d, a, x[6] , 23, 0x4881d05);
	_md5_HH(a, b, c, d, x[9] , 4 , 0xd9d4d039);
	_md5_HH(d, a, b, c, x[12], 11, 0xe6db99e5);
	_md5_HH(c, d, a, b, x[15], 16, 0x1fa27cf8);
	_md5_HH(b, c, d, a, x[2] , 23, 0xc4ac5665);

	_md5_II(a, b, c, d, x[0] , 6 , 0xf4292244);
	_md5_II(d, a, b, c, x[7] , 10, 0x432aff97);
	_md5_II(c, d, a, b, x[14], 15, 0xab9423a7);
	_md5_II(b, c, d, a, x[5] , 21, 0xfc93a039);
	_md5_II(a, b, c, d, x[12], 6 , 0x655b59c3);
	_md5_II(d, a, b, c, x[3] , 10, 0x8f0ccc92);
	_md5_II(c, d, a, b, x[10], 15, 0xffeff47d);
	_md5_II(b, c, d, a, x[1] , 21, 0x85845dd1);
	_md5_II(a, b, c, d, x[8] , 6 , 0x6fa87e4f);
	_md5_II(d, a, b, c, x[15], 10, 0xfe2ce6e0);
	_md5_II(c, d, a, b, x[6] , 15, 0xa3014314);
	_md5_II(b, c, d, a, x[13], 21, 0x4e0811a1);
	_md5_II(a, b, c, d, x[4] , 6 , 0xf7537e82);
	_md5_II(d, a, b, c, x[11], 10, 0xbd3af235);
	_md5_II(c, d, a, b, x[2] , 15, 0x2ad7d2bb);
	_md5_II(b, c, d, a, x[9] , 21, 0xeb86d391);

	_state[0] += a;
	_state[1] += b;
	_state[2] += c;
	_state[3] += d;
}

int32_t *hashMd5(const void *data, uint32_t len, void *dest) {
	HashMd5 md5(data, len);
	memcpy(dest, md5.result(), 16);

	return (int32_t*)dest;
}

char *hashMd5Hex(const int32_t *hashmd5, void *dest) {
	char *md5To = (char*)dest;
	const uchar *res = (const uchar*)hashmd5;

	for (int i = 0; i < 16; ++i) {
		uchar ch(res[i]), high = (ch >> 4) & 0x0F, low = ch & 0x0F;
		md5To[i * 2 + 0] = high + ((high > 0x09) ? ('a' - 0x0A) : '0');
		md5To[i * 2 + 1] = low + ((low > 0x09) ? ('a' - 0x0A) : '0');
	}

	return md5To;
}
std::ostream& operator<<(std::ostream& out, HashMd5 md5){
	return out << md5.hexdigest();
}

The next thing to do is to test the isolated code. I wrote a small main program that reads a text file and passes its contents as a string to the hash class. The program hashes the string 2,500 times. Here is my main:

#include <iostream>
#include <fstream>
#include <string>
#include <vector>
#include "HashMd5.h"
#define SIZE 2500
using namespace std;

int main() {
	std::string line;
	string data;
	vector<HashMd5> objs;
	std::ifstream myfile("doc.txt");
	if (myfile.is_open())
	{
		while (getline(myfile, line)) {
			data += line;
		}
		myfile.close();
	}
	
	for (int i = 0; i < SIZE; i++) {
		objs.push_back(HashMd5(data.c_str(), (uint32_t)data.length())); // hash the string's characters, not the std::string object itself
		objs.at(i).result();
	} 

	//for (int i = 0; i < SIZE; i++) {
		//cout << objs.at(i)<<endl;
	//}

	
	
}

Profiling

In software engineering, profiling is a form of dynamic analysis used to measure how a program uses resources such as memory and CPU time. It helps us identify where, and what kind of, optimisation we could do.

Now, let’s compile the program and make sure the -pg option is on when we do so.
g++ -std=c++11 -pg Source.cpp utils.cpp -o test

The first step in optimising our hash class is to find out where the hot spots are. For that purpose, I will use the “gprof” command together with “gprof2dot”, which creates an image from the analysis. On its own, “gprof” produces statistics that are hard to read, so I decided to just use a simple graph.

You can get the program from  here

gprof ./test | ./gprof2dot.py | dot -Tpng -o output.png

Let’s have a look at which functions took the most time.
output2

The functions we see in red are our program’s hot spots.

The “feed” function was called 7,500 times, and as a result of it, another 47,500 function calls happened. However, this is not where the hashing itself happens, so we need to look deeper. The “transform(…)” function calls the inlined functions that perform the bit operations, constantly switching the arguments in the calls; it made another 1,567,500 calls to those bit operation functions. My intention is to optimise the “transform(…)” function, as this is clearly the one decreasing performance.

Optimization

In order to optimise the “transform(…)” function, I will take into consideration the following practices:

  1. Inline assembler
  2. Lambda expression

Inline assembly – I will try to replace the code in the inlined bit operation functions with inline assembler. However, it is really hard for me to work with this approach, and because it is also architecture dependent, it might not be the best solution.

Lambda expressions – the other approach I might take is to replace the bit operation functions within the “transform(…)” function with lambda expressions, as they are very short.

Lab7

In Lab 7 we are introduced to writing assembly code inside our C programs, also called inline assembler. The reason we would want to do something that is, in most cases, complicated is performance. A piece of assembly code in the hot spot of a program (the code block where most of the execution time is spent) can significantly increase performance. However, it can also be a burden for a number of reasons: assembler is dependent on the hardware of the machine it runs on, and it is hard to maintain, read, and write.

Let’s first get started with how to switch in Assembler mode within a C source file.

__asm__(...);

Inside the parentheses is where we put our difficult-to-read (and even harder to write) assembly code. Let’s use text to indicate the structure of the code inside the parentheses.

__asm__(assembler template : output operands : input operands : clobbers)

In a real program, it would look like this:

__asm__("mov %1,%0;":"=r"(y):"r"(x): )

Assembler template (mandatory) – here we put the assembly instructions we would like to execute. Operands are referenced by number, for example %0, %1, %2, %3; these numbers map to the output and input operands that follow. Basically, we give the instruction we want to execute and the operands it is going to use.
Input and output operands (optional) – in these two portions we put a constraint in a string, followed by the C expression it applies to. A complete example is shown after the lists below. Here are some commonly used constraints:

  1. “r” : Register operand constraint, look table given above.
  2. “q” : Registers a, b, c or d.
  3. “I” : Constant in range 0 to 31 (for 32-bit shifts).
  4. “J” : Constant in range 0 to 63 (for 64-bit shifts).
  5. “K” : 0xff.
  6. “L” : 0xffff.
  7. “M” : 0, 1, 2, or 3 (shifts for lea instruction).
  8. “N” : Constant in range 0 to 255 (for out instruction).
  9. “f” : Floating point register
  10. “t” : First (top of stack) floating point register
  11. “u” : Second floating point register
  12. “A” : Specifies the `a’ or `d’ registers. This is primarily useful for 64-bit integer values intended to be returned with the `d’ register holding the most significant bits and the `a’ register holding the least significant bits.

2003 “GCC-Inline-Asssembly-HOWTO” by Sandeep S, https://www.ibiblio.org/gferg/ldp/GCC-Inline-Assembly-HOWTO.html

We combine the above constraints with constraint modifiers:

  1. = – output-only register – previous contents are discarded and replaced with output value (this does not preclude use as in input register)
  2. + – input and output register – this register will be used to both pass input data to the asm code, and to receive a value from the asm code
  3. & – earlyclobber register – this value may be overwritten before input is processed, therefore it must not be used for input
  4. % – in addition to one of the symbols above, declares that this operand and the following operand are commutable (interchangeable) for optimization. Only one commutable pair may be specified.

2017 “Inline Assembly Language” by Chris Tyler, https://wiki.cdot.senecacollege.ca/wiki/Inline_Assembly_Language
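To tie the template, operands, and constraints together, here is a small, complete C program of my own (an illustrative sketch for x86-64 with GCC, not part of the lab code) that expands the one-line mov example from above:

#include <stdio.h>

int main(void) {
    int x = 10;
    int y = 0;

    /* Copy x into y with a single mov instruction (AT&T syntax).
       "=r"(y) - output operand: any general-purpose register, write-only
       "r"(x)  - input operand: any general-purpose register
       The clobber list is omitted because nothing else is modified. */
    __asm__("mov %1, %0"
            : "=r"(y)
            : "r"(x));

    printf("y = %d\n", y);
    return 0;
}

Compiling this with gcc and running it should print y = 10, with the compiler choosing which registers stand in for %0 and %1.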

The content provided should be helpful enough to read and follow Lab 7. In the lab, we will do a bit of exploration and see how inline assembly is used in open source software. Let’s dive in!

Part A

First, we will need to download a tgz file, extract and compile the program in it. Here is the link to the file: matrix.senecacollege.ca/~chris.tyler/spo600/spo600_20173_inline_assembler_lab.tgz

Once we are done with downloading, extracting and compiling the program, we are to do a comparison. As the program provided is a reflection on the third algorithm from lab6 (the one with fixed-point integer math), let’s compare it to my version.

My Source code:

//  main.c
//  lab6.3
//  Created by Evgeni Kolev on 2017-10-18.
//  Copyright © 2017 Evgeni Kolev. All rights reserved.

#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>

#define SIZE 250000 // array size
#define SHIFT 8        // used to shift bits
#define FACTORIAL 0.75
#define LOW -32767   // lowest value in the range
#define HIGH 32767   // highest value in the range


int main() {
    int factorial = FACTORIAL * 256;
    int16_t* samples = (int16_t*) malloc(SIZE * sizeof(int16_t));
    // second array
    int16_t* scaled = (int16_t*) malloc(SIZE * sizeof(int16_t));
    // divider for the random number to be in the range
    int div = (HIGH + 1) - LOW;
    int total = 0;
    int x;
   /* Initialize the samples in the array. */
    printf("Generating samples \n");
    for (x = 0; x < SIZE; x++)
        samples[x] = (LOW + rand() % div);
    /* Scale the samples */
    printf("Scaling the samples \n");
    for (x = 0; x < SIZE; x++)
        scaled[x] = (samples[x] * factorial) >> SHIFT;
    for(x = 0; x < SIZE; x++)
        total += scaled[x];
    printf("Total: %d\n", total);
    
    return 0;
}

Professor’s source code:

// vol_simd.c :: volume scaling in C using AArch64 SIMD
// Chris Tyler 2017.11.29

#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>
#include "vol.h"

int main() {

	int16_t*		in;		// input array
	int16_t*		limit;		// end of input array
	int16_t*		out;		// output array

	// these variables will be used in our assembler code, so we're going
	// to hand-allocate which register they are placed in
	// Q: what is an alternate approach?
	register int16_t*	in_cursor 	asm("r20");// input cursor
	register int16_t*	out_cursor	asm("r21");// output cursor
	register int16_t	vol_int		asm("r22");// volume as 

	int			x;		// array interator
	int			ttl;		// array total

	in = (int16_t*)calloc(SAMPLES, sizeof(int16_t));
	out = (int16_t*)calloc(SAMPLES, sizeof(int16_t));

	srand(1);
	printf("Generating sample data.\n");
	for (x = 0; x < SAMPLES; x++) {
		in[x] = (rand() % 65536) - 32768;
	}

	// ----------------------------------------------------------

	in_cursor = in;
	out_cursor = out;
	limit = in + SAMPLES;

	// set vol_int to fixed-point representation of 0.5 
	// Q: should we use 32767 or 32768 in next line? why?
	vol_int = (int16_t)(0.5 * 32767.0);

	printf("Scaling samples.\n");

	// Q: what does it mean to "duplicate" values here?
	__asm__("dup v1.8h,w22"); // duplicate vol_int into v1.8h
	while (in_cursor < limit) {
	    __asm__(
		"ldr q0, [x20],#16		\n\t"
		// load eight samples into q0 (v0.8h)
		// from in_cursor, and post-increment
		// in_cursor by 16 bytes

		"sqdmulh v0.8h, v0.8h, v1.8h	\n\t"
		// multiply each lane in v0 by v1*2
		// saturate results
		// store upper 16 bits of results into v0

		"str q0, [x21],#16		\n\t"
		// store eight samples to out_cursor
		// post-increment out_cursor by 16 bytes

		// Q: what happens if we remove the following
		// lines? Why?
		: "=r"(in_cursor)
		: "r"(limit), "r"(in_cursor), "r"(out_cursor)
		);
	}

	// ---------------------------------------------------------

	printf("Summing samples.\n");
	for (x = 0; x < SAMPLES; x++) {
		ttl = (ttl + out[x]) % 1000;
	}

	// Q: are the results usable? are they correct?
	printf("Result: %d\n", ttl);

	return 0;

}

One thing to notice is that in the professor’s version, the register keyword combined with asm("...") is used to pin the variables “in_cursor”, “out_cursor”, and “vol_int” to specific registers; it only tells the compiler where those variables should be placed. The more interesting part is where inline assembly is used to scale the volume samples.

Now let’s run a simple test with the “time” command and see the results.

Here is how much time it took my program to execute:
Capture 1

Here are the results from the instructor’s version:
Capture 2

Clearly, mine is a bit slower.

Now let’s start answering the questions (Q) found in the comments of the instructor’s code.

Q1 What is an alternate approach?
int a=10, y;
__asm__ ("mov %1,%0;"
: "=r"(y)

: "r"(a)
:
);

We could use extended inline assembly, as above, to move the values of the variables between registers, and let the compiler decide which registers to use for the given input and output operands. However, this is not an explicit way of saying that the variable “in_cursor” must be placed in “r20”.

Q2 Should we use 32767 or 32768 in next line? why?

We should use 32767. Because vol_int is an int16_t, the maximum value this data type can hold is 32767.

Q3 What does it mean to “duplicate” values here?

__asm__("dup v1.8h,w22");

In this scenario, the instruction duplicates the value of the general-purpose register w22 (the volume factor) into each of the eight 16-bit lanes of the vector register v1.

Reflection on Part A:

As seen in the short examples above, the use of inline assembly can improve the performance of our program. However, there are a number of things we need to take into consideration before coding this way, because such code is hard to maintain.

  1. Is the increase in performance worth it?
  2. Is the software going to be run on only one type of an architecture?
  3. What will be the cost of having inline assembler in your code, given that you might have to change it quite often?

There is no doubt that one needs extended knowledge of different computer architectures to program this way. It is not only about being able to read and write such code, but also about understanding and accurately evaluating its impact on your applications.

Part B

In Part B we are to look for assembler in open source audio packages. For that purpose, I have chosen libmad.

Once you are done extracting the files, you will have a directory containing everything needed to build the audio software. We need to find the files that contain inline assembler code; for that, use the following command:

grep -r 'asm'

There are two header files that mainly contain inline assembler: “fixed.h” and “mad.h”. I am not going to paste the source code here, because it is very lengthy and quite complex. Instead, let’s summarize what the code in these files is doing. In “fixed.h”, the code deals with how fixed-point integer math should be done on different architectures: Intel, ARM, MIPS, SPARC, PowerPC, and others.

I tried to compile the program on my Mac, but it produced a number of errors.
Screen Shot 2017-12-12 at 8.00.51 пр.об.

Reflection Part B
I find inline assembler difficult to read, write, and maintain. People writing such code for open source software should consider that the number of programmers with assembler knowledge is quite limited. Even more, it drastically decreases the software’s portability as well as the number of people able to contribute to a particular project.

Lab 5

In the first part of the lab, we had to write and compile a simple C program. The program has 3 arrays of 1000 integer elements each. The elements of the first two are randomly initialized to values in the range -1000 to 1000. Each element of the third array holds the sum of the values at the same index in arrays one and two. Finally, an integer holds the total of all the elements in array three.

Here is my code to achieve the first part:


#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define SIZE 1000 // array size
#define LOW -1000 // lowest value in the range
#define HIGH 1000 // highest value in the range

int main(void) {

    int a[SIZE]; // first array
    int b[SIZE]; // second array
    int c[SIZE]; // array for totals from a and b
    int div = (HIGH + 1 - LOW); // in-range divider
    int total = 0;
    int x;
    srand(time(NULL)); // plant the seed for rand()
    for (x = 0; x < SIZE; x++) {
        a[x] = LOW + rand() % div; // get a random number
        b[x] = LOW + rand() % div; // get a random number
        c[x] = a[x] + b[x];
        total += c[x];
    }
    printf("%d\n", total);
    return 0;
}

In the second part of the lab, we are to compile the program with auto-vectorization enabled, which means the compiler will use SIMD instructions wherever the loop allows it. For comparison, let us first compile without any optimization and look at the generated assembly.
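In general form (the file names below are placeholders of my own, not necessarily what appears in the screenshots), this step amounts to something like:

gcc -O0 -o lab5 lab5.c
objdump -d lab5 | less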

Screen Shot 2017-10-11 at 2.32.12 PM

Now let’s have a look at what the compiler has done for us:

Screen Shot 2017-10-11 at 2.42.33 PM

We pay special attention to what is happening in the for loop. Why? Because this is where most of the computation happens, so it may be the block that needs optimization. In the code above, the compiler is not doing any optimization; it simply provides the equivalent assembly for the C instructions to be executed. Now let’s vectorize the code and see if there is any difference.

We include the “-O3” option for maximum optimization, which also enables auto-vectorization.
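Again in general form (the output file name is a placeholder), the optimized build and inspection look something like:

gcc -O3 -o lab5-O3 lab5.c
objdump -d lab5-O3 | less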

Screen Shot 2017-10-11 at 3.08.36 PM

Here is the for loop again:

Screen Shot 2017-10-11 at 3.08.04 PM

In the vectorized code above we can see that the loop has been vectorized and unrolled: the compiler processes several array elements per iteration using SIMD instructions, which reduces the number of times the loop has to iterate.

Before optimizing, we should evaluate the pros and cons of the optimization itself. Depending on the available resources, as well as what the programmer wants to accomplish with the program, they should decide what level of optimization, if any, is needed.

 

Lab 3

In Lab 3 we are given the task to write (copy) a loop and, in addition, print the loop index. This turned out to be quite a challenge due to the number of instructions that had to be given for such a “simple task”. Programming languages like C and C++ had left me with the impression that strings, numbers, and data in general are something simple for the computer/processor to put out on screen, as long as you tell it what type the data is. However, I was amazed by the number of primitive operations we need to spell out in order to tell the actual hardware how, when, and what to do.

Here is the sample loop we had to use for the lab:

.text
.globl    _start

start = 0                       /* starting value for the loop index; note that this is a symbol (constant), not a variable */
max = 10                        /* loop exits when the index hits this number (loop condition is i<max) */

_start:
    mov     $start,%r15         /* loop index */

loop:
    /* ... body of the loop ... do something useful here ... */

    inc     %r15                /* increment index */
    cmp     $max,%r15           /* see if we're done */
    jne     loop                /* loop if we're not */

    mov     $0,%rdi             /* exit status */
    mov     $60,%rax            /* syscall sys_exit */
    syscall

This is the starting point of Lab 3. First, we had to modify the loop so it actually prints a string.

Here is the code that only prints a string “loop”.

 

.text
.globl    _start

start = 0                       /* starting value for the loop index; note that this is a symbol (constant), not a variable */
max = 10                        /* loop exits when the index hits this number (loop condition is i<max) */
stdout = 1

_start:
    mov     $start,%r15         /* loop index */

loop:
    /* ... body of the loop ... do something useful here ... */

    inc     %r15                /* increment index */
    cmp     $max,%r15           /* see if we're done */
    mov     $len,%rdx           /* message length */
    mov     $msg,%rsi           /* message location */
    mov     $stdout,%rdi        /* file descriptor stdout */
    mov     $1,%rax             /* syscall sys_write */
    syscall
    jne     loop                /* loop if we're not */

    mov     $0,%rdi             /* exit status */
    mov     $60,%rax            /* syscall sys_exit */
    syscall

msg:    .ascii      "Loop\n"
.set len , . - msg

 

Here is the output of the above program:

Screen Shot 2017-10-02 at 9.12.30 AM

The other part of the lab is where the real challenge was. We had to modify the code so that the index of the loop is printed. Further, the index had to be printed as a two-digit number, which required extra logic (a subroutine) and a reference to the ASCII table for converting digits to characters. Here is the solution:

.text
.globl _start

start = 0
max = 30             /* number of times to go through the loop */
stdout = 1

_start:
mov $start,%r15      /* loop index */

loop:

mov $48, %r11        /* 48 is the ASCII code for '0'; starting point for the tens digit */
mov $48, %r13        /* starting point for the ones digit */
mov $0, %rdx         /* clear rdx before the division */
mov $10, %r10        /* move the divisor "10" into r10 */
mov %r15, %rax       /* move the value of the index into rax for the division */
div %r10             /* divide rax by r10 */
add %rdx, %r13       /* add the remainder (ones digit) to r13 */
add %rax, %r11       /* add the quotient (tens digit) to r11 */

cmp $0, %r11         /* check to see if the quotient is greater than 0 */
jg subr

subr:
mov %r11b,first      /* move the first digit into the string */
mov %r13b,second     /* move the second digit into the string */


mov $len,%rdx
mov $msg,%rsi
mov $stdout,%rdi
mov $1,%rax
syscall

inc %r15             /* increment the index */
cmp $max,%r15        /* compare to max */
jne loop             /* continue the loop */


mov $0,%rdi          /* exit status */
mov $60,%rax         /* syscall sys_exit */
syscall

.data

msg: .ascii "loop: \n"
.set len , . - msg
.set first, msg + 6
.set second, msg + 7

 

Screen Shot 2017-10-02 at 12.26.13 PM

Reflection:

  • This lab gave me an insight into how many instructions need to be provided for the processor to execute a “simple” operation such as looping.
  • I was introduced to a starter pack of assembly instructions.

Lab4

Build an open source software package

The free software package I downloaded is “GNU Aspell”.

“GNU Aspell is a spell checker that can be used either as a library or as an independent spell checker. It does a much better job of coming up with possible suggestions than other English language spell checkers. Other technical enhancements over Ispell include shared memory for dictionaries and intelligent handling of personal dictionaries when more than one Aspell process is open.”

The library is downloaded as a tar file containing the source, header, make, and other build files. In order to build the software package, I first had to run the “configure” script. The “./configure” script checks what the software package requires in order to be built, and whether my system has the necessary files, libraries, and compilers available for the build process.

 

Screen Shot 2017-09-26 at 3.30.26 PM

I had the required programs, so I proceeded. After the configure script finished, it was time to build the software by invoking the “make” command, which in turn invoked a number of other commands and compiled a great number of C++ and C files.
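In general form (run from the extracted source directory; the exact directory name depends on the version you downloaded), the whole build boils down to:

./configure
make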

Reflection:

  1. This part of the lab helped me learn how software is actually configured, compiled, and then installed on the system.
  2. Also, the process demonstrated what exactly the commands/programs do in order to set up for the installation of the actual software.

Build and test glibc

Building glibc was a little bit different, as the standard C library needs to have its source in one directory and its built objects in a separate directory; we cannot configure and build in the same directory. Therefore, both the “configure” and “make” commands had to be run from a separate build directory.
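As a rough sketch, with the directory names and install prefix below being placeholders rather than the exact paths I used:

mkdir glibc-build
cd glibc-build
../glibc/configure --prefix=$HOME/glibc-install
make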

Before configuring, I introduced a bug in one of the functions from the <stdio.h> library. In particular, the “printf(const char* format)” function.

/* Copyright (C) 1991-2017 Free Software Foundation, Inc.
   This file is part of the GNU C Library.

   The GNU C Library is free software; you can redistribute it and/or
   modify it under the terms of the GNU Lesser General Public
   License as published by the Free Software Foundation; either
   version 2.1 of the License, or (at your option) any later version.

   The GNU C Library is distributed in the hope that it will be useful,
   but WITHOUT ANY WARRANTY; without even the implied warranty of
   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
   Lesser General Public License for more details.

   You should have received a copy of the GNU Lesser General Public
   License along with the GNU C Library; if not, see
   <http://www.gnu.org/licenses/>.  */

#include <libioP.h>
#include <stdarg.h>
#include <stdio.h>

#undef printf

/* Write formatted output to stdout from the format string FORMAT.  */
/* VARARGS1 */
int
__printf (const char *format, ...)
{
  va_list arg;
  int done = 0;

  va_start (arg, format);

  /* Commented out the call to the vfprintf function. */

  //done = vfprintf (stdout, format, arg);
  va_end (arg);

  return done;
}

#undef _IO_printf
ldbl_strong_alias (__printf, printf);
/* This is for libg++.  */
ldbl_strong_alias (__printf, _IO_printf);

Commenting out the “vfprintf (stdout, format, arg)” call prevents the printf function from producing any output. I saved the file, then configured and compiled glibc.

In order to test, I wrote a “hello world” program.

#include <stdio.h>

int main(void){
  char h[20] = "hello world\n \0";
  /* using const char* to make sure that the exact printf function is called */
  const char* p = h;
  printf("%s", p);
  return 0;
}

After compilation, I tested the program with the glibc already installed on the system, as well as with the one I had just built. To test with my version of the library, I ran the program through the “testrun.sh” loader.
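Assuming the test program was compiled to a binary called hello (the name is my placeholder), the two runs look like this, with the second one executed from the glibc build directory:

./hello               # uses the system glibc
./testrun.sh ./hello  # uses the newly built glibc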

Screen Shot 2017-09-27 at 10.47.36 AM

From the screenshot above we can see two calls to the hello world program. The first one is using the system glibc, no problems there, we got the output. The second one is using the glibc I built and we can see that the printf function is not producing any output to the terminal.

Reflection:

I found this lab quite insightful for the following reasons:

  1. I was introduced to glibc, its structure, and its actual size.
  2. Finding the file that implements a particular function takes some investigation, which helped me get more familiar with the directories in the source tree.