Walkthrough

Overview

pcapml standardizes network traffic analysis tasks at the dataset level. Rather than focusing on a standardized methodology, feature set, or library for combining traffic traces and metadata (such as labels for machine learning tasks), pcapml provides a system for directly coupling raw traffic traces and metadata by using the Next Generation PCAP (pcapng) format. pcapng files can still be read by libraries such as libpcap and inspected using tools such as tcpdump or tshark. Whereas a pcap represents a linked list of packets, a pcapng represents a linked list of blocks, which we can use to directly couple metadata and raw packets.
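To make the "linked list of blocks" idea concrete, here is a minimal sketch of walking the blocks in a pcapng byte string. It relies only on the general pcapng block layout (a 32-bit block type, a 32-bit total length, the body, and the total length repeated) and assumes a little-endian section. The helper names walk_blocks and make_block are hypothetical and are not part of pcapml.

```python
import struct

SECTION_HEADER_BLOCK = 0x0A0D0D0A
ENHANCED_PACKET_BLOCK = 0x00000006


def make_block(block_type: int, body: bytes) -> bytes:
    """Build one little-endian pcapng block (illustrative helper only)."""
    padded = body + b"\x00" * (-len(body) % 4)  # bodies are padded to 32 bits
    total_len = 12 + len(padded)                # type + length + body + length
    header = struct.pack("<II", block_type, total_len)
    return header + padded + struct.pack("<I", total_len)


def walk_blocks(data: bytes):
    """Yield (block_type, body) for each block, assuming little-endian data."""
    offset = 0
    while offset + 12 <= len(data):
        block_type, total_len = struct.unpack_from("<II", data, offset)
        if total_len < 12 or total_len % 4 or offset + total_len > len(data):
            raise ValueError(f"malformed block at offset {offset}")
        (trailer,) = struct.unpack_from("<I", data, offset + total_len - 4)
        if trailer != total_len:
            raise ValueError(f"length mismatch at offset {offset}")
        yield block_type, data[offset + 8 : offset + total_len - 4]
        offset += total_len  # the length field is the "link" to the next block
```

Because every block carries its own length, a reader can skip block types it does not understand, which is what lets extra metadata travel alongside the raw packets.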

Sample IDs

pcapml attaches a sampleID to each packet, enabling us to group packets arbitrarily. With arbitrary packet groupings, we can attach metadata to any set of packets, such as a traffic flow, device, application, anomaly, or time window. A sampleID is created by hashing the metadata associated with a given packet.

Usage

Metadata Files

pcapml metadata files are structured CSV files containing three columns. No header is expected:

traffic_filter,metadata,hash_key

In detail:

The traffic_filter column designates a filter that a set of packets will hit. Unless denoted otherwise by the hash_key column, every packet that hits this filter belongs to a single traffic sample.
The metadata column designates the metadata that will be attached to every packet that hits the traffic_filter defined.
The hash_key is an optional column that enables users to build traffic samples out of multiple lines of a metadata file. By default, the sampleID for each line is generated by hashing the entire CSV line. When a hash_key is provided, this behavior is overridden and the sampleID for every packet that hits the traffic_filter is generated by hashing the hash_key value instead.

Lines beginning with # are skipped when processing the metadata file.
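The rules above can be sketched in a few lines of Python. This is an illustration only: the hash function pcapml actually uses is not stated here, so a 64-bit BLAKE2b digest stands in for it, and the function names sample_id and read_metadata are hypothetical. Skipping blank lines is also an assumption.

```python
import csv
import hashlib
import io


def sample_id(text: str) -> int:
    """Stand-in 64-bit hash; pcapml's real hash function may differ."""
    digest = hashlib.blake2b(text.encode("utf-8"), digest_size=8).digest()
    return int.from_bytes(digest, "big")


def read_metadata(stream):
    """Yield (traffic_filter, metadata, sample_id) for each usable line."""
    for raw in stream:
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # comment lines are skipped; blank lines by assumption
        fields = next(csv.reader([line]))
        traffic_filter, metadata = fields[0], fields[1]
        hash_key = fields[2] if len(fields) > 2 and fields[2] else None
        # Default: hash the whole line. With a hash_key: hash only the key,
        # so several lines can share one sampleID.
        key = hash_key if hash_key is not None else line
        yield traffic_filter, metadata, sample_id(key)
```

In this sketch, two lines that carry the same hash_key produce the same sampleID, while a line without one gets an ID derived from its full text.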

PcapML Directory Mode

pcapml can attach a sampleID and metadata to a directory of pcaps. Let’s walk through an example of this mode using the Snowflake fingerprintability dataset. This dataset contains a set of DTLS handshakes from four applications: Facebook Messenger, Discord, Google Hangouts, and Snowflake.

Metadata File

When using a metadata file in directory mode, each traffic_filter must be prefixed with FILE:

# traffic_filter,metadata,hash_key 
FILE:facebook-handshake-1.pcap,facebook
FILE:discord-handshake-1.pcap,discord
FILE:discord-handshake-2.pcap,discord
FILE:snowflake-handshake-1.pcap,snowflake
...
...
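For larger datasets, a metadata file like the one above is usually generated rather than written by hand. A minimal sketch, assuming you already have a mapping from pcap file name to label (the function name directory_metadata_lines and the mapping itself are hypothetical):

```python
def directory_metadata_lines(labels):
    """Build 'FILE:<name>,<metadata>' lines from a {file name: label} dict."""
    lines = ["# traffic_filter,metadata,hash_key"]  # comment line, skipped by pcapml
    for filename, label in sorted(labels.items()):
        lines.append(f"FILE:{filename},{label}")
    return lines
```

Writing the result out is then a matter of joining the lines with newlines and saving them as metadata.csv.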

Directory Usage

We can then encode each handshake in the dataset, along with a unique sampleID and its corresponding application metadata, into a single pcapng using pcapml in one command.

$ pcapml -D dataset/ -L metadata.csv -W snowflake-dataset.pcapng

This results in a pcapng that can be examined with tcpdump.

$ tcpdump -r snowflake-dataset.pcapng -c 10
reading from file dtls-dataset.pcapng, link-type EN10MB (Ethernet)
58:52.562021 IP 74.125.250.71.19305 > 192.168.7.222.55937: UDP, length 161
58:52.562788 IP 192.168.7.222.55937 > 74.125.250.71.19305: UDP, length 618
58:52.585452 IP 74.125.250.71.19305 > 192.168.7.222.55937: UDP, length 1119
58:52.586333 IP 192.168.7.222.55937 > 74.125.250.71.19305: UDP, length 962
07:34.459150 IP 74.125.250.26.19305 > 192.168.7.222.54537: UDP, length 161
07:34.460771 IP 192.168.7.222.54537 > 74.125.250.26.19305: UDP, length 617
07:34.486225 IP 74.125.250.26.19305 > 192.168.7.222.54537: UDP, length 1119
07:34.487034 IP 192.168.7.222.54537 > 74.125.250.26.19305: UDP, length 962
12:42.435787 IP 74.125.250.71.19305 > 192.168.7.222.54510: UDP, length 161
12:42.438214 IP 192.168.7.222.54510 > 74.125.250.71.19305: UDP, length 705

Upon further inspection using tshark, we see the sampleID and metadata directly encoded in the output file. Each handshake receives a unique sampleID, leaving no ambiguity about how the metadata is attached to the traffic.

$ tshark -r snowflake-dataset.pcapng -T fields  -E header=y -e frame.comment -c 10
9003219589747928972,google
9003219589747928972,google
9003219589747928972,google
9003219589747928972,google
18186043603218801379,google
18186043603218801379,google
18186043603218801379,google
18186043603218801379,google
14792257769479651673,google
14792257769479651673,google
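The comment lines printed by tshark are easy to post-process. As a rough sketch, the function below counts packets per (sampleID, metadata) pair, assuming each line has the form sampleID,metadata as in the output above; the name count_packets_per_sample is hypothetical.

```python
from collections import Counter


def count_packets_per_sample(lines):
    """Count packets per (sampleID, metadata) from 'sampleID,metadata' lines."""
    counts = Counter()
    for line in lines:
        line = line.strip()
        sample_id, sep, metadata = line.partition(",")
        if not sep or not sample_id.isdigit():
            continue  # skip blank lines and any header row
        counts[(int(sample_id), metadata)] += 1
    return counts
```

Applied to the ten lines shown above, this would report three distinct samples, all labeled google.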

PcapML PCAP Mode

pcapml can attach a sample ID and metadata to traffic in a given pcap or tag live traffic by leveraging BPF filters, time windows, or any combination of the two.

Metadata File

When using pcapml in PCAP mode, each traffic_filter is prefixed by one of three tags: BPF: for BPF filters, TS_START: for denoting a start time, and TS_END: for denoting an end time. These filters can be combined with the | delimiter. For example, if we wanted to attach a piece of metadata to every packet from a few IP addresses in a given traffic capture, the metadata file may resemble the one below.

# traffic_filter,metadata,hash_key
BPF:src 1.2.3.4,windows_device
BPF:src 5.6.7.8,mac_device
BPF:src 4.3.2.1,linux_device
...

If we wanted to attach metadata only to packets from one of those IP addresses within a given time frame, we could combine a BPF filter with the timestamp options.

# traffic_filter,metadata,hash_key 
BPF:src 1.2.3.4|TS_START:2345678|TS_END:2346789,windows_device
BPF:src 5.6.7.8,mac_device
BPF:src 4.3.2.1,linux_device
...

If we wanted to attach metadata to all traffic in an arbitrary time window, say an anomaly, we can simply omit the BPF filter.

# traffic_filter,metadata,hash_key 
TS_START:345678|TS_END:2346789,anomaly
TS_START:3456789|TS_END:4567890,anomaly
TS_START:234567,benign
...
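The traffic_filter syntax used in these examples can be sketched as a small parser. This is not pcapml's own parser: the function name parse_traffic_filter is hypothetical, timestamps are assumed to be integers as in the examples, and splitting naively on | means a BPF expression that itself contains | would not be handled.

```python
VALID_TAGS = ("BPF", "TS_START", "TS_END")


def parse_traffic_filter(traffic_filter: str) -> dict:
    """Split 'TAG:value|TAG:value' into a {tag: value} dictionary."""
    parsed = {}
    for part in traffic_filter.split("|"):
        # Split on the first colon only, so values may contain colons.
        tag, sep, value = part.partition(":")
        if not sep or tag not in VALID_TAGS:
            raise ValueError(f"unrecognized filter component: {part!r}")
        parsed[tag] = int(value) if tag.startswith("TS_") else value
    return parsed
```

A line with no BPF component simply yields a dictionary without a BPF key, which corresponds to the pure time-window case above.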

Single PCAP Usage

We can then encode the metadata from each line of our metadata file into a pcapng using pcapml in a single command.

$ pcapml -P traffic.pcap -L metadata.csv -W encoded-dataset.pcapng

PcapML Sorting

By default, a file encoded using single PCAP mode keeps the packets in the order in which the traffic was captured. In many cases, it is beneficial to instead group the packets first by their associated sampleID and then by time. pcapml can sort the packets of any pcapml-encoded dataset by sampleID -> timestamp in the same command that performs the metadata encoding.

$ pcapml -P traffic.pcap -L metadata.csv -W encoded-dataset.pcapng -s
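The sampleID -> timestamp ordering amounts to a sort on a two-part key. The sketch below illustrates the ordering only, using a hypothetical in-memory Packet tuple rather than pcapml's actual data structures.

```python
from typing import NamedTuple


class Packet(NamedTuple):
    """Hypothetical in-memory packet record, for illustration only."""
    sample_id: int
    timestamp: float
    payload: bytes


def sort_by_sample_then_time(packets):
    """Group packets by sampleID, then order each group by capture time."""
    return sorted(packets, key=lambda p: (p.sample_id, p.timestamp))
```

After sorting, all packets of a sample are contiguous, so a reader can process one sample at a time in a single pass.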

PcapML Extraction Mode

pcapml can transform any pcapng file encoded using pcapml into a directory of pcap files, one file per sampleID, using a single command.

$ pcapml -M snowflake-dataset.pcapng -O output_dir/

The contents of the resulting output directory are shown below.

12868490791586055289_google.pcap     1567624542436120405_google.pcap       1912376493094597460_facebook.pcap     4676138587463220727_discord.pcap     7254850485921062848_snowflake.pcap  9982255779078537418_facebook.pcap
12869046855586552312_google.pcap      15679484754686191639_snowflake.pcap  1913144610008489287_snowflake.pcap   4681195497095943526_google.pcap      7256930666031261978_facebook.pcap    9985167304055034928_discord.pcap
1286997930834027255_snowflake.pcap   15679812277643271301_facebook.pcap   1916232746973137357_discord.pcap      4681553453692518285_firefox_facebook.pcap   7259083765404759561_google.pcap     9987359559073017848_discord.pcap
12872114975008852282_google.pcap     15680488968942900640_discord.pcap    1923291248933451411_google.pcap      4682231651136175711_firefox_snowflake.pcap  7270091391588454401_facebook.pcap   9987474006825771582_discord.pcap
1287296956060682578_snowflake.pcap   15684992164591678892_discord.pcap     1925192065339906372_discord.pcap      4683353161259521763_chrome_discord.pcap     7275055456267078471_discord.pcap     9988384164514239661_discord.pcap
12873202627492535975_facebook.pcap   15686837623379429946_snowflake.pcap  1926070980564693651_facebook.pcap    4686331860154165481_chrome_google.pcap      7275947196122656266_facebook.pcap   9988671347025180223_google.pcap
metadata.csv

Also note that a metadata.csv file is generated which maps each individual pcap to the metadata associated with the traffic in that file.

$ head metadata.csv

File,Metadata
14944434813179707824_google.pcap,google
14395580548679227705_google.pcap,google
14489979562741152699_google.pcap,google
870078443570293459_google.pcap,google
6809604472343037417_google.pcap,google
9649013506394351716_google.pcap,google
16984261106149530861_google.pcap,google
12493399449979137519_google.pcap,google
7073271527767585992_google.pcap,google
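The generated metadata.csv can be loaded with a standard CSV reader. A minimal sketch, assuming the File,Metadata header shown above and the sampleID_metadata.pcap naming visible in the directory listing; both function names are hypothetical.

```python
import csv
import io


def load_extraction_index(stream):
    """Map each extracted pcap file name to its metadata string."""
    return {row["File"]: row["Metadata"] for row in csv.DictReader(stream)}


def sample_id_from_filename(filename: str) -> int:
    """Recover the sampleID from a '<sampleID>_<metadata>.pcap' file name."""
    return int(filename.split("_", 1)[0])
```

With a real extraction, the stream would be an open handle on output_dir/metadata.csv instead of an in-memory string.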

Analysis

Although pcapml output can be read by tools such as tshark or tcpdump, we realize that the crux of traffic analysis tasks involves extracting identifying information from traffic samples. As such, we have built and released pcapml-FE to easily and directly interact with pcapml-encoded datasets.