Walkthrough
Overview
pcapml standardizes network traffic analysis tasks at the dataset level. Rather than focus on a standardized methodology, feature set, or library for combining traffic traces and metadata (such as labels for machine learning tasks), pcapml provides a system for directly coupling raw traffic traces and metadata by using the Next Generation PCAP (pcapng) format. pcapng files can still be read by libraries such as libpcap, and inspected using tools such as tcpdump or tshark. Whereas a pcap represents a linked list of packets, a pcapng represents a linked list of blocks, which we can use to directly couple metadata and raw packets.
Sample IDs
pcapml attaches a sampleID to each packet, enabling us to group packets arbitrarily. With arbitrary packet groupings, we can attach metadata to any set of packets, such as a traffic flow, device, application, anomaly, or time window. A sampleID is created by hashing the metadata associated with a given packet.
Usage
Metadata Files
pcapml metadata files are structured CSV files containing three columns. No header is expected:
traffic_filter,metadata,hash_key
In detail:
- The traffic_filter column designates a filter that a set of packets will hit. Unless denoted otherwise by the hash_key column, every packet that hits this filter belongs to a single traffic sample.
- The metadata column designates the metadata that will be attached to every packet that hits the traffic_filter defined.
- The hash_key is an optional column that enables users to build traffic samples out of multiple lines of a metadata file. By default, the sampleID for each line is generated by hashing the entire CSV line. When a hash_key is provided, this behavior is overridden and the sampleID for every packet that hits the traffic filter is generated by hashing the hash_key value.
Lines beginning with # are skipped when processing the metadata file.
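The hash_key rule above can be sketched in a few lines of Python. This is only an illustration of the grouping behavior: the hash function used here (BLAKE2b truncated to 64 bits), the function name sample_id, and the lab_a key are assumptions made for the example, not pcapml's actual implementation, so the IDs it produces will not match those written by pcapml.

```python
import hashlib

def sample_id(csv_line: str) -> int:
    """Return a 64-bit ID for one metadata line (illustrative hash only)."""
    line = csv_line.strip()
    # Naive split: assumes no commas inside the filter or metadata fields.
    fields = line.split(",")
    # Use the third column (hash_key) when present and non-empty,
    # otherwise fall back to hashing the entire line.
    key = fields[2] if len(fields) > 2 and fields[2] else line
    digest = hashlib.blake2b(key.encode("utf-8"), digest_size=8).digest()
    return int.from_bytes(digest, "big")

lines = [
    "BPF:src 1.2.3.4,windows_device,lab_a",  # hypothetical hash_key "lab_a"
    "BPF:src 5.6.7.8,windows_device,lab_a",
    "BPF:src 4.3.2.1,linux_device",
]
ids = [sample_id(line) for line in lines]
# The first two lines share a hash_key, so they share a sample ID;
# the third has no hash_key and is hashed as a whole line.
print(ids[0] == ids[1], ids[0] == ids[2])
```

Lines that share a hash_key collapse into one sample, while a line without one forms its own sample.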
PcapML Directory Mode
pcapml can attach a sample ID and metadata to a directory of pcaps. Let’s walk through an example of this usage type using the snowflake fingerprintability dataset. This dataset contains a set of DTLS handshakes from four applications: Facebook Messenger, Discord, Google Hangouts, and Snowflake.
Metadata File
When using a metadata file in directory mode, each traffic_filter must be prefixed with FILE:
# traffic_filter,metadata,hash_key
FILE:facebook-handshake-1.pcap,facebook
FILE:discord-handshake-1.pcap,discord
FILE:discord-handshake-2.pcap,discord
FILE:snowflake-handshake-1.pcap,snowflake
...
...
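For a large dataset, a file like the one above could be generated rather than written by hand. The sketch below assumes the capture names follow an <application>-handshake-<n>.pcap pattern, as in the example lines. The function name directory_metadata is hypothetical and not part of pcapml.

```python
import csv
import io

def directory_metadata(filenames):
    """Build directory-mode metadata text, one FILE: line per capture."""
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    for name in sorted(filenames):
        # Assumed naming scheme: <application>-handshake-<n>.pcap
        label = name.split("-", 1)[0]
        writer.writerow([f"FILE:{name}", label])
    return out.getvalue()

# In practice the names could come from the dataset directory, e.g.
# [p.name for p in pathlib.Path("dataset").glob("*.pcap")].
names = [
    "facebook-handshake-1.pcap",
    "discord-handshake-1.pcap",
    "discord-handshake-2.pcap",
]
print(directory_metadata(names), end="")
```

The returned text can be written to metadata.csv and passed to pcapml with the -L option.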
Directory Usage
We can then encode each handshake in the dataset, along with a unique sampleID and its corresponding application metadata, into a pcapng using pcapml in a single command.
$ pcapml -D dataset/ -L metadata.csv -W snowflake-dataset.pcapng
This results in a pcapng that can be examined with tcpdump.
$ tcpdump -r snowflake-dataset.pcapng -c 10
reading from file dtls-dataset.pcapng, link-type EN10MB (Ethernet)
12:58:52.562021 IP 74.125.250.71.19305 > 192.168.7.222.55937: UDP, length 161
12:58:52.562788 IP 192.168.7.222.55937 > 74.125.250.71.19305: UDP, length 618
12:58:52.585452 IP 74.125.250.71.19305 > 192.168.7.222.55937: UDP, length 1119
12:58:52.586333 IP 192.168.7.222.55937 > 74.125.250.71.19305: UDP, length 962
13:07:34.459150 IP 74.125.250.26.19305 > 192.168.7.222.54537: UDP, length 161
13:07:34.460771 IP 192.168.7.222.54537 > 74.125.250.26.19305: UDP, length 617
13:07:34.486225 IP 74.125.250.26.19305 > 192.168.7.222.54537: UDP, length 1119
13:07:34.487034 IP 192.168.7.222.54537 > 74.125.250.26.19305: UDP, length 962
17:12:42.435787 IP 74.125.250.71.19305 > 192.168.7.222.54510: UDP, length 161
17:12:42.438214 IP 192.168.7.222.54510 > 74.125.250.71.19305: UDP, length 705
Upon further inspection using tshark, we see the sampleID and metadata directly encoded in the output file, where each handshake receives a unique sampleID, leaving no ambiguity about how the metadata is attached to the traffic.
$ tshark -r snowflake-dataset.pcapng -T fields -E header=y -e frame.comment -c 10
9003219589747928972,google
9003219589747928972,google
9003219589747928972,google
9003219589747928972,google
18186043603218801379,google
18186043603218801379,google
18186043603218801379,google
18186043603218801379,google
14792257769479651673,google
14792257769479651673,google
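Because each packet comment has the form sampleID,metadata, the tshark output is easy to post-process. The following sketch counts packets per sample. It assumes the comment lines have already been captured as text, for example by redirecting the tshark command above to a file. The function name packets_per_sample is illustrative only.

```python
from collections import Counter

def packets_per_sample(comment_lines):
    """Count packets per (sampleID, metadata) pair from frame.comment values."""
    counts = Counter()
    for line in comment_lines:
        line = line.strip()
        if not line or "," not in line:
            continue  # skip blanks and any header line such as "frame.comment"
        sample_id, metadata = line.split(",", 1)
        counts[(int(sample_id), metadata)] += 1
    return counts

sample = """9003219589747928972,google
9003219589747928972,google
18186043603218801379,google
""".splitlines()

for (sid, meta), n in packets_per_sample(sample).items():
    print(sid, meta, n)
```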
PcapML PCAP Mode
pcapml can attach a sample ID and metadata to traffic in a given pcap or tag live traffic by leveraging BPF filters, time windows, or any combination of the two.
Metadata File
When using pcapml in PCAP mode, traffic_filters are prefixed with one of three tags: BPF: for BPF filters, TS_START: for a start time, and TS_END: for an end time. These filters can be combined with the | delimiter. For example, if we wanted to attach a piece of metadata to every packet from a few IP addresses in a given traffic capture, the metadata file may resemble the one below.
# traffic_filter,metadata,hash_key
BPF:src 1.2.3.4,windows_device
BPF:src 5.6.7.8,mac_device
BPF:src 4.3.2.1,linux_device
...
If we wanted to attach metadata only to packets from one of those IP addresses within a given time frame, we could combine any BPF filter with the timestamp options.
# traffic_filter,metadata,hash_key
BPF:src 1.2.3.4|TS_START:2345678|TS_END:2346789,windows_device
BPF:src 5.6.7.8,mac_device
BPF:src 4.3.2.1,linux_device
...
If we wanted to attach metadata to all traffic in an arbitrary time window, say an anomaly, we can simply omit the BPF filter.
# traffic_filter,metadata,hash_key
TS_START:345678|TS_END:2346789,anomaly
TS_START:3456789|TS_END:4567890,anomaly
TS_START:234567,benign
...
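The tag and delimiter rules above are simple enough to generate programmatically. The helper below is a sketch with hypothetical function names (traffic_filter, metadata_line). It emits the tags in the same order as the examples, and whether pcapml accepts other orderings is not covered here.

```python
def traffic_filter(bpf=None, ts_start=None, ts_end=None):
    """Join the supplied parts into one PCAP-mode traffic_filter string."""
    parts = []
    if bpf is not None:
        parts.append(f"BPF:{bpf}")
    if ts_start is not None:
        parts.append(f"TS_START:{ts_start}")
    if ts_end is not None:
        parts.append(f"TS_END:{ts_end}")
    if not parts:
        raise ValueError("at least one of bpf, ts_start, ts_end is required")
    return "|".join(parts)

def metadata_line(metadata, **filter_parts):
    """Format one metadata file line: traffic_filter,metadata."""
    return f"{traffic_filter(**filter_parts)},{metadata}"

print(metadata_line("windows_device", bpf="src 1.2.3.4",
                    ts_start=2345678, ts_end=2346789))
print(metadata_line("anomaly", ts_start=345678, ts_end=2346789))
print(metadata_line("benign", ts_start=234567))
```

The three printed lines reproduce entries from the example metadata files in this section.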
Single PCAP Usage
We can then attach the metadata for each line in our metadata file and write the result to a pcapng using pcapml in a single command.
$ pcapml -P traffic.pcap -L metadata.csv -W encoded-dataset.pcapng
PcapML Sorting
By default, any pcapml encoded file that is encoded using single pcap mode is left in the original order the traffic was captured in. In many cases, it is beneficial to instead group the packets first by the sampleID that they are associated with and then in time order. pcapml can sort the packets of any pcapml encoded dataset by sampleID -> timestamp in the same command that metadata encoding is performed.
$ pcapml -P traffic.pcap -L metadata.csv -W encoded-dataset.pcapng -s
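To make the sampleID -> timestamp ordering concrete, here is a small sketch of the same two-level sort on in-memory records. The Packet type and the numeric ordering of sample IDs are assumptions for illustration. It is not how pcapml implements the -s option.

```python
from typing import NamedTuple

class Packet(NamedTuple):
    sample_id: int
    timestamp: float
    payload: bytes = b""

def sort_by_sample_then_time(packets):
    """Group packets by sample ID, keeping each group in time order."""
    return sorted(packets, key=lambda p: (p.sample_id, p.timestamp))

# Capture order interleaves two samples (IDs 7 and 3).
capture = [Packet(7, 1.0), Packet(3, 2.0), Packet(7, 0.5), Packet(3, 1.5)]
for p in sort_by_sample_then_time(capture):
    print(p.sample_id, p.timestamp)
```

After sorting, all packets of a sample are contiguous, which makes it possible to read one sample at a time in a single pass over the file.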
PcapML Extraction Mode
pcapml can transform any pcapng file encoded using pcapml into a directory of pcap files, one file per sampleID, using a single command.
$ pcapml -M snowflake-dataset.pcapng -O output_dir/
The associated output directory is below.
12868490791586055289_google.pcap 1567624542436120405_google.pcap 1912376493094597460_facebook.pcap 4676138587463220727_discord.pcap 7254850485921062848_snowflake.pcap 9982255779078537418_facebook.pcap
12869046855586552312_google.pcap 15679484754686191639_snowflake.pcap 1913144610008489287_snowflake.pcap 4681195497095943526_google.pcap 7256930666031261978_facebook.pcap 9985167304055034928_discord.pcap
1286997930834027255_snowflake.pcap 15679812277643271301_facebook.pcap 1916232746973137357_discord.pcap 4681553453692518285_firefox_facebook.pcap 7259083765404759561_google.pcap 9987359559073017848_discord.pcap
12872114975008852282_google.pcap 15680488968942900640_discord.pcap 1923291248933451411_google.pcap 4682231651136175711_firefox_snowflake.pcap 7270091391588454401_facebook.pcap 9987474006825771582_discord.pcap
1287296956060682578_snowflake.pcap 15684992164591678892_discord.pcap 1925192065339906372_discord.pcap 4683353161259521763_chrome_discord.pcap 7275055456267078471_discord.pcap 9988384164514239661_discord.pcap
12873202627492535975_facebook.pcap 15686837623379429946_snowflake.pcap 1926070980564693651_facebook.pcap 4686331860154165481_chrome_google.pcap 7275947196122656266_facebook.pcap 9988671347025180223_google.pcap
metadata.csv
Also note that a metadata.csv file is generated which maps each individual pcap to the metadata associated with the traffic in that file.
$ head metadata.csv
File,Metadata
14944434813179707824_google.pcap,google
14395580548679227705_google.pcap,google
14489979562741152699_google.pcap,google
870078443570293459_google.pcap,google
6809604472343037417_google.pcap,google
9649013506394351716_google.pcap,google
16984261106149530861_google.pcap,google
12493399449979137519_google.pcap,google
7073271527767585992_google.pcap,google
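The generated metadata.csv has a File,Metadata header, so it can be loaded with Python's csv module. The sketch below groups the extracted file names by metadata value. The function name files_by_metadata is hypothetical, and the commented path assumes metadata.csv sits inside the output directory.

```python
import csv
import io
from collections import defaultdict

def files_by_metadata(csv_file):
    """Group extracted pcap file names by their metadata value."""
    groups = defaultdict(list)
    for row in csv.DictReader(csv_file):
        groups[row["Metadata"]].append(row["File"])
    return dict(groups)

example = io.StringIO(
    "File,Metadata\n"
    "14944434813179707824_google.pcap,google\n"
    "14395580548679227705_google.pcap,google\n"
)
groups = files_by_metadata(example)
print(groups)

# With a real extraction (path assumed):
# with open("output_dir/metadata.csv", newline="") as f:
#     groups = files_by_metadata(f)
```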
Analysis
Although pcapml output can be read by tools such as tshark or tcpdump, we realize that the crux of traffic analysis tasks involves extracting identifying information from traffic samples. As such, we have built and released pcapml-FE to easily and directly interact with pcapml encoded datasets.