Fundamentals 18 min read

Data Structures from Scratch: From Arrays to Graphs (BigTech Level)

Published on May 12, 2026

Data Structures from Scratch: From Arrays to Graphs (BigTech Level)

Most courses and tutorials teach data structures as purely academic or mathematical concepts to pass an interview or an exam. They make you memorize that a tree search is O(log n) and give you a pat on the back. But in real-world software engineering, choosing the wrong container can turn a sub-millisecond operation into a catastrophic production crash under high concurrency.

When you work in high-level languages like JavaScript, Python, or Java, data structures tend to look like magic boxes. An Array in JavaScript grows dynamically, can contain strings mixed with numbers and objects, and you never worry about where those bytes live. But beneath that comfort lies a real cost in performance and memory.

In this article we’re going to dissect the fundamental data structures from their foundation. To truly understand them, we’ll use modern C++: the language of choice for game engines, databases, and operating systems — one that gives us absolute, transparent control over pointers, the stack, the heap, and memory allocation.

Why C++ to Learn Data Structures?

In C++ there’s no Garbage Collector cleaning up your messes, and no hidden-cost abstractions. If you request memory on the heap using new, you are responsible for freeing it with delete (or using smart pointers like std::unique_ptr).

This forces us to understand two critical concepts that dictate performance on modern hardware:

  1. Memory Management (Stack vs Heap): The stack is fast, contiguous, and automatically cleaned up, but limited in size. The heap is gigantic and dynamic, but requesting memory there is expensive for the operating system and prone to fragmentation.
  2. Cache Locality: The CPU is thousands of times faster than RAM. To avoid waiting, the CPU loads entire contiguous blocks of memory into its ultra-fast caches (L1, L2, L3). If your data is packed together in memory, your code flies. If your data is scattered (jumping through pointers), the CPU stalls waiting for RAM — known as a Cache Miss.

Let’s see how data structures behave under this lens.

1. Static and Dynamic Arrays (arrays and std::vector)

What Are They and What Happens in Memory?

An array is the simplest data structure and the one most mechanically aligned with hardware. It’s a completely contiguous block of memory where each element of the same type is placed exactly next to the previous one.

Being contiguous, accessing any position is instantaneous (O(1)) via simple pointer arithmetic: base_address + (index * element_size).

Why Are They the CPU’s Best Friend?

Because of cache locality. When you access index 0 of an array, the CPU’s prefetcher automatically loads the following elements (1, 2, 3…) into the L1 cache. Iterating over an array sequentially is the fastest operation a computer can execute.

What Are They Used For and Who Uses Them?

  • Game Engines (Unity, Unreal Engine): In ECS (Entity Component System) architecture, components are stored in contiguous arrays to process millions of entities at 60+ FPS without cache misses.
  • High-Frequency Trading (HFT) Systems: Financial algorithms where microsecond latency means losing millions.
  • Network and Image Buffers: Storing the pixels of a frame or raw TCP packets.

When to Use Them and When to Avoid Them?

  • ✅ When to use them: When you need ultra-fast iteration, constant random access by index, and you know the approximate size in advance.
  • ❌ When to avoid them: When you need to insert or delete elements frequently in the middle or at the beginning. Being contiguous memory, inserting at index 0 forces shifting all existing elements one position to the right (O(n)).

Code Example in C++

Let’s implement our own simplified dynamic vector to see how it handles automatic growth by doubling capacity in the heap:

#include <iostream>
#include <stdexcept>

template <typename T>
class CustomVector {
private:
    T* data;              // Pointer to the heap memory block
    size_t vec_size;      // Current number of elements
    size_t vec_capacity;  // Total reserved capacity

    void realloc(size_t new_capacity) {
        T* new_data = new T[new_capacity];
        
        for (size_t i = 0; i < vec_size; i++) {
            new_data[i] = std::move(data[i]);
        }
        
        delete[] data;
        data = new_data;
        vec_capacity = new_capacity;
    }

public:
    CustomVector() : data(nullptr), vec_size(0), vec_capacity(0) {}

    ~CustomVector() {
        delete[] data;
    }

    void push_back(const T& value) {
        if (vec_size == vec_capacity) {
            realloc(vec_capacity == 0 ? 2 : vec_capacity * 2);
        }
        data[vec_size++] = value; // O(1) amortized insert
    }

    T& operator[](size_t index) {
        if (index >= vec_size) throw std::out_of_range("Index out of range");
        return data[index]; // O(1) direct access
    }

    size_t size() const { return vec_size; }
    size_t capacity() const { return vec_capacity; }
};

int main() {
    CustomVector<int> numbers;
    numbers.push_back(10);
    numbers.push_back(20);
    numbers.push_back(30); // Capacity doubling happens here

    std::cout << "Element [1]: " << numbers[1] << std::endl;
    std::cout << "Current capacity: " << numbers.capacity() << std::endl;
    return 0;
}

2. Linked Lists

What Are They and What Happens in Memory?

A linked list abandons the idea of contiguous memory. It consists of scattered nodes anywhere in the heap. Each node stores its data and a pointer to the memory address of the next node (Singly Linked) or to both the next and previous nodes (Doubly Linked).

What Are They Used For and Who Uses Them?

  • LRU Caches (Least Recently Used): Moving elements to the front of the usage queue in O(1) by combining them with a hash map.
  • Operating Systems (Linux Kernel): Managing process lists in the scheduler or free blocks in the filesystem.
  • Action History: Simple Undo/Redo implementations in text editors.

When to Use Them and When to Avoid Them?

  • ✅ When to use them: When you need constant O(1) insertions and deletions anywhere in the collection as long as you already have the pointer to that node, without relocating large memory blocks.
  • ❌ When to avoid them: Virtually any time raw read performance matters. Linked lists are a nightmare for the CPU: every next pointer is a random jump to another region of RAM, causing constant Cache Misses. They also waste memory by storing extra pointers for every element.

Code Example in C++

A clean implementation of a doubly linked list:

#include <iostream>

template <typename T>
class DoublyLinkedList {
private:
    struct Node {
        T data;
        Node* next;
        Node* prev;
        Node(const T& val) : data(val), next(nullptr), prev(nullptr) {}
    };

    Node* head;
    Node* tail;

public:
    DoublyLinkedList() : head(nullptr), tail(nullptr) {}

    ~DoublyLinkedList() {
        Node* current = head;
        while (current != nullptr) {
            Node* next_node = current->next;
            delete current;
            current = next_node;
        }
    }

    // O(1) real front insert
    void push_front(const T& value) {
        Node* new_node = new Node(value);
        if (!head) {
            head = tail = new_node;
        } else {
            new_node->next = head;
            head->prev = new_node;
            head = new_node;
        }
    }

    // O(1) removal assuming we already have the pointer
    void remove_node(Node* node) {
        if (!node) return;
        
        if (node->prev) node->prev->next = node->next;
        else head = node->next;

        if (node->next) node->next->prev = node->prev;
        else tail = node->prev;

        delete node;
    }

    void print() const {
        Node* curr = head;
        while (curr) {
            std::cout << curr->data << " <-> ";
            curr = curr->next;
        }
        std::cout << "NULL\n";
    }
};

int main() {
    DoublyLinkedList<int> list;
    list.push_front(3);
    list.push_front(2);
    list.push_front(1);
    
    list.print(); // Output: 1 <-> 2 <-> 3 <-> NULL
    return 0;
}

3. Stacks & Queues

What Are They?

More than ground-up structures, they are abstract data types (adapters) that impose strict rules on how elements are added and removed from an underlying container (which can be a vector, a list, or a deque).

  • Stack: Follows the LIFO rule (Last In, First Out). The last element in is the first one out.
  • Queue: Follows the FIFO rule (First In, First Out). The first element in is the first one out.

Who Uses Them and What For?

Stacks:

  • The Interpreter/Compiler Call Stack: Managing function calls and their local variables in any programming language.
  • Expression evaluation and parsing: Converting infix to postfix notation, validating balanced parentheses.
  • Depth-First Search (DFS) algorithms: Traversing trees or graphs by exploring one branch to the end before backtracking.

Queues:

  • Message Brokers (RabbitMQ, Kafka): Queuing async events or tasks so workers consume them in order.
  • Event Loop (Node.js / Browsers): Processing I/O operation callbacks, promises, and timers in arrival order.
  • Breadth-First Search (BFS) algorithms: Exploring neighbors level by level in a graph (ideal for shortest path in unweighted maps).

How to Know Which One to Use?

The choice is purely dictated by the business logic of the problem:

  • Need to process things fairly in arrival order? Queue.
  • Need to backtrack, evaluate nested hierarchies, or process the most recent thing first? Stack.

Code Example in C++

Instead of using scattered pointers, let’s implement an ultra-efficient Circular Queue backed by a static array. This guarantees O(1) for enqueue and dequeue without cache misses or memory reallocation:

#include <iostream>
#include <stdexcept>

template <typename T, size_t MaxSize>
class CircularQueue {
private:
    T buffer[MaxSize];
    size_t head;
    size_t tail;
    size_t current_size;

public:
    CircularQueue() : head(0), tail(0), current_size(0) {}

    void enqueue(const T& item) {
        if (current_size == MaxSize) {
            throw std::overflow_error("Queue is full");
        }
        buffer[tail] = item;
        tail = (tail + 1) % MaxSize; // Wrap index using modulo
        current_size++;
    }

    T dequeue() {
        if (current_size == 0) {
            throw std::underflow_error("Queue is empty");
        }
        T item = buffer[head];
        head = (head + 1) % MaxSize;
        current_size--;
        return item;
    }

    bool empty() const { return current_size == 0; }
    size_t size() const { return current_size; }
};

int main() {
    CircularQueue<std::string, 3> tasks;
    tasks.enqueue("Render UI");
    tasks.enqueue("Query DB");
    
    std::cout << "Processing: " << tasks.dequeue() << std::endl; // Render UI
    tasks.enqueue("Send Email");
    std::cout << "Processing: " << tasks.dequeue() << std::endl; // Query DB
    return 0;
}

4. Hash Tables (std::unordered_map)

What Are They and How Do They Work in Memory?

A hash table offers the holy grail of data structures: average O(1) search, insertion, and deletion.

It works by mapping a key (string, number, etc.) to a specific index in an internal array (called a bucket array). To do this, it passes the key through a deterministic hash function that generates an integer, then applies modulo to the array size: index = hash(key) % num_buckets.

The Inevitable Problem: Collisions

Two completely different keys can produce the same index. There are two main techniques to handle this:

  1. Chaining: Each bucket stores a linked list (or small vector). On collision, the new key-value pair is appended to that list.
  2. Open Addressing: If the bucket is occupied, the algorithm probes sequentially for the next free bucket (linear or quadratic probing). Much more cache-friendly.

What Are They Used For and Who Uses Them?

  • In-Memory Databases (Redis, Memcached): Massive, ultra-fast key-to-value mapping for caching systems.
  • Relational Database Engines: Implementing Hash Joins to rapidly cross giant tables in memory.
  • Deduplication and frequency counting: Checking if a user already voted, counting words, or storing active user sessions.

When to Use Them and When to Avoid Them?

  • ✅ When to use them: When key-lookup speed is everything and element order doesn’t matter at all.
  • ❌ When to avoid them:
    • When you need to iterate over keys in alphabetical or numerical order (hash tables randomize everything chaotically).
    • In hard real-time systems where the worst case O(n) (if all keys collide in one bucket or a massive rehashing occurs) is unacceptable.
    • When memory is very limited (hash tables must maintain low load factors and reserve more memory than they use).

Code Example in C++

Simplified hash table implementation using chaining with vectors:

#include <iostream>
#include <vector>
#include <list>
#include <string>

class SimpleHashTable {
private:
    std::vector<std::list<std::pair<std::string, int>>> buckets;
    size_t num_buckets;

    size_t hash_function(const std::string& key) const {
        size_t hash = 0;
        for (char c : key) {
            hash = hash * 31 + c;
        }
        return hash % num_buckets;
    }

public:
    SimpleHashTable(size_t size = 10) : num_buckets(size) {
        buckets.resize(num_buckets);
    }

    void insert(const std::string& key, int value) {
        size_t index = hash_function(key);
        for (auto& pair : buckets[index]) {
            if (pair.first == key) {
                pair.second = value;
                return;
            }
        }
        buckets[index].push_back({key, value});
    }

    bool search(const std::string& key, int& out_value) const {
        size_t index = hash_function(key);
        for (const auto& pair : buckets[index]) {
            if (pair.first == key) {
                out_value = pair.second;
                return true; // Found in O(1) average
            }
        }
        return false;
    }
};

int main() {
    SimpleHashTable age_map(10);
    age_map.insert("Alice", 28);
    age_map.insert("Bob", 34);
    age_map.insert("Charlie", 40);

    int bob_age = 0;
    if (age_map.search("Bob", bob_age)) {
        std::cout << "Bob's age: " << bob_age << std::endl;
    }
    return 0;
}

5. Trees (BST and AVL / Red-Black Trees)

What Are They?

A tree is a hierarchical, non-linear data structure. It consists of nodes connected by pointers that impose a Parent-Child relationship. It starts at a single Root node and branches downward to nodes with no children, called Leaves.

The most common and important type is the Binary Search Tree (BST), which enforces one key rule: for any node, all elements in its left subtree are smaller than it, and all elements in its right subtree are larger.

What Are They Used For and Who Uses Them?

  • Database Engines (PostgreSQL, MySQL): They use heavily optimized disk-oriented variants called B-Trees and B+ Trees to maintain sorted indexes enabling hyper-fast range queries (WHERE age BETWEEN 20 AND 30).
  • Web Browsers (Chrome, Firefox): In-memory representation of the DOM (Document Object Model) and CSSOM.
  • Compilers (GCC, Clang): Building the AST (Abstract Syntax Tree) to analyze and compile source code.
  • 3D Rendering Engines: Using Octrees or BVH trees to efficiently compute physical collisions and spatial visibility.

When to Use Them and Which One?

  • ✅ When to use them: When you need to store elements in sorted order at all times and perform fast O(log n) searches, insertions, and deletions — plus prefix or range queries.
  • ❌ The deadly danger (Why you need balanced trees): If you insert sequentially ordered data (1, 2, 3, 4, 5) into a plain BST, the tree degenerates into a right-leaning linked list, destroying its performance and dropping to O(n). That’s why in real production you always use self-balancing trees like AVL or Red-Black Trees (the internal structure behind std::map and std::set in C++), which automatically restructure their pointers after each insertion to guarantee optimal constant height.

Code Example in C++

A simple Binary Search Tree with insertion and sorted traversal (In-Order):

#include <iostream>

template <typename T>
class BinarySearchTree {
private:
    struct Node {
        T data;
        Node* left;
        Node* right;
        Node(const T& val) : data(val), left(nullptr), right(nullptr) {}
    };

    Node* root;

    Node* insert_impl(Node* node, const T& value) {
        if (node == nullptr) {
            return new Node(value);
        }
        if (value < node->data) {
            node->left = insert_impl(node->left, value);
        } else if (value > node->data) {
            node->right = insert_impl(node->right, value);
        }
        return node;
    }

    // In-Order traversal (Left - Root - Right) produces sorted data
    void inorder_impl(Node* node) const {
        if (node != nullptr) {
            inorder_impl(node->left);
            std::cout << node->data << " ";
            inorder_impl(node->right);
        }
    }

    void destroy_tree(Node* node) {
        if (node != nullptr) {
            destroy_tree(node->left);
            destroy_tree(node->right);
            delete node;
        }
    }

public:
    BinarySearchTree() : root(nullptr) {}

    ~BinarySearchTree() {
        destroy_tree(root);
    }

    void insert(const T& value) {
        root = insert_impl(root, value);
    }

    void print_sorted() const {
        inorder_impl(root);
        std::cout << std::endl;
    }
};

int main() {
    BinarySearchTree<int> bst;
    bst.insert(50);
    bst.insert(30);
    bst.insert(70);
    bst.insert(20);
    bst.insert(40);

    std::cout << "Sorted elements in O(n): ";
    bst.print_sorted(); // Output: 20 30 40 50 70
    return 0;
}

6. Graphs

What Are They?

A graph is the most general and flexible data structure that exists (in fact, a tree is simply a connected graph with no cycles). It consists of a set of Vertices (nodes) connected by Edges.

They can be:

  • Directed vs Undirected: Edges are one-way arrows (Twitter/X: following someone doesn’t mean they follow you back) or bidirectional connections (Facebook: friendship is mutual).
  • Weighted vs Unweighted: Edges have a numeric cost or weight (distance in kilometers, network latency) or they all cost exactly the same.

Memory Representation

Since nodes can connect to any other node, there are two native ways to map them:

  1. Adjacency Matrix: A V×V 2D matrix of booleans or weights. Checking if a connection exists between A and B is O(1). Terrible for memory when the graph is sparse (few connections), consuming O(V²).
  2. Adjacency List: An array or map of vertices where each vertex stores a list of the neighbors it connects to. The standard, efficient way to represent real-world graphs in the industry.

What Are They Used For and Who Uses Them?

  • Social Networks: Mapping the massive social graph to suggest friends or detect communities.
  • Navigation and Routing Systems: Google Maps or internet routers running algorithms like Dijkstra or A* to find optimal routes considering traffic or latency.
  • Package Managers (npm, pnpm, cargo): Dependency tree resolution by building a Directed Acyclic Graph (DAG) and applying topological sorting to install or compile libraries in the correct order.
  • Recommendation Engines (Netflix, Amazon): Connecting users to viewed products to infer shared preferences through bipartite graphs.

Code Example in C++

Let’s implement an undirected graph using an Adjacency List and run a BFS traversal to find hop distances:

#include <iostream>
#include <vector>
#include <unordered_map>
#include <queue>

class Graph {
private:
    std::unordered_map<int, std::vector<int>> adj_list;

public:
    void add_edge(int u, int v) {
        adj_list[u].push_back(v);
        adj_list[v].push_back(u); // Remove this line for a directed graph
    }

    void bfs_traversal(int start_vertex) {
        std::unordered_map<int, bool> visited;
        std::queue<int> queue;

        queue.push(start_vertex);
        visited[start_vertex] = true;

        std::cout << "BFS traversal from node " << start_vertex << ": ";

        while (!queue.empty()) {
            int curr = queue.front();
            queue.pop();
            std::cout << curr << " ";

            for (int neighbor : adj_list[curr]) {
                if (!visited[neighbor]) {
                    visited[neighbor] = true;
                    queue.push(neighbor);
                }
            }
        }
        std::cout << std::endl;
    }
};

int main() {
    Graph social_network;
    // 1 connects to 2 and 3
    social_network.add_edge(1, 2);
    social_network.add_edge(1, 3);
    // 2 connects to 4
    social_network.add_edge(2, 4);
    // 3 connects to 4
    social_network.add_edge(3, 4);

    social_network.bfs_traversal(1); // Output: 1 2 3 4
    return 0;
}

Summary and Practical Comparison (Cheat Sheet)

Here’s the definitive guide to theoretical performance versus real hardware behavior — so you know which structure to pick for your next architecture design:

Data StructureAccess (Index)SearchInsertDeleteCache LocalityPrimary Use Case
ArrayO(1)O(n)O(n)O(n)ExcellentIndex lookup, buffers, fast mass iteration
Linked ListO(n)O(n)O(1)*O(1)*PoorLRU caches, OS queues, constant reordering
Stack / QueueN/AN/AO(1)O(1)DependsCall stacks, message brokers, BFS/DFS
Hash TableN/AO(1)O(1)O(1)Medium/LowInstant key-value mapping, caches, deduplication
BST (Balanced)N/AO(log n)O(log n)O(log n)Medium/LowDB indexes, sorted range searches
Graph (Adj. List)N/AO(V + E)O(1)O(V)VariableNetworks, maps, routes, dependencies

*Note: In a linked list, insertion and deletion are O(1) only if you already hold a direct pointer to the target node.

My Personal Take

Learning data structures deeply isn’t about acing obscure BigTech technical interviews or rewriting your own hash table in C++ every day for your web APIs. It’s about building a robust, deep mental model of how software interacts with the physics of hardware.

When you understand how bytes are laid out in memory, you stop seeing containers as interchangeable tools. You start automatically asking yourself:

  • Is this structure going to jump all over the heap and destroy my CPU cache?
  • Do I really need a heavy hash table here, or would a flat sorted array with binary search be enough?
  • What will happen to my pointers if this container grows and reallocates its memory elsewhere in RAM under high concurrency?

That ability to reason about spatial and temporal trade-offs is what distinguishes a programmer who simply writes code from a true Software Architect and Engineer. Invest the time to understand these mechanical foundations without abstractions getting in the way — the rest of your career in any high-level language will have an incomparable clarity.